Wikipedia is an open, collaboratively edited online encyclopedia; the English Wikipedia hosts more than 6.5 million articles. Because anyone can edit, users are free to make edits, updates, and changes to articles on a wide range of topics, including science, pop culture, and politics. Editors have many motivations, but when an event happens one expects them to flock to the articles relevant to that event and edit them to reflect the news. In this project we study the relationship between Wikipedia edits and important events. Here and throughout, we restrict our attention to a handful of political figures in the United States.
The goal of this project is to investigate time series trends in Wikipedia page edits for key US political figures. In particular, we ask whether Wikipedia edits are indicators of important events. We examine the correlation between the types of Wikipedia edits made (major, minor, and IP) and important events. Moreover, we test an unsupervised model for classifying important events.
Since the importance of an event is a rather subjective notion, we measure the importance of a date with respect to a particular person by the number of times that date appears on their Wikipedia page. Although we could rank importance by this number, throughout we treat importance as a binary class. That is, a date is considered important with respect to an individual if it appears anywhere on their Wikipedia page.
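The binary definition above can be sketched on a toy table. The rows below are made-up values, and the column name `Total Found` is borrowed from the table built later in this notebook; the `Important` column is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Made-up monthly counts of how often a date in that month appears on the page
toy = pd.DataFrame({
    "Month": ["2008-07", "2008-08", "2008-09"],
    "Total Found": [0, 20, 3],
})

# A month is labelled important (1) if at least one of its dates appears on the page
toy["Important"] = (toy["Total Found"] > 0).astype(int)
print(toy)
```

Note that the count itself (20 versus 3) is discarded by this labelling, which is the trade-off of treating importance as a binary class.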
We use a classic unsupervised learning technique, Isolation Forests, to classify dates as important. We demonstrate that these models perform with high precision but low recall, since our definition of importance is ill-posed and overcounts.
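As a rough illustration of this kind of setup, the sketch below fits scikit-learn's `IsolationForest` to synthetic monthly edit counts and scores the flagged months against hypothetical binary labels. The data, the labels, and the `contamination=0.05` setting are all assumptions made for the example; this is not the model configuration or the data used in the project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_months = 200

# Synthetic monthly edit counts (NOT the scraped data): quiet months plus a few large spikes
edits = rng.poisson(lam=20, size=n_months).astype(float)
spike_idx = rng.choice(n_months, size=10, replace=False)
edits[spike_idx] += rng.uniform(200, 1500, size=10)

# Hypothetical labels: every spike month is important, and so are 30 quiet months,
# mimicking a definition of importance that overcounts
important = np.zeros(n_months, dtype=int)
important[spike_idx] = 1
quiet_idx = np.setdiff1d(np.arange(n_months), spike_idx)[:30]
important[quiet_idx] = 1

# contamination=0.05 is an assumed outlier fraction, not a tuned value
model = IsolationForest(contamination=0.05, random_state=0)
pred = model.fit_predict(edits.reshape(-1, 1))  # -1 marks outliers, 1 marks inliers
flagged = (pred == -1).astype(int)

precision = precision_score(important, flagged)
recall = recall_score(important, flagged)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy setting only about 5% of months can be flagged while 20% are labelled important, so recall is capped well below one regardless of how precise the flags are.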
This Jupyter notebook is structured as follows:
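The project uses two data sets, web scraped from two separate sources.
Wikipedia Page Edits
This data set is scraped from https://xtools.wmflabs.org for a handful of Wikipedia pages.
For each page, the scraped data consists of monthly counts of all edits, IP edits, and minor edits, together with the IP and minor-edit percentages.
Dates Found on Wikipedia Page
This data set is scraped from https://en.wikipedia.org for the following individuals: Joe Biden, Donald Trump, Hillary Clinton, Barack Obama, Chuck Schumer, and Mitch McConnell.
For each page, and only for the months present in the Wikipedia Page Edits data set, the scraped data consists of the number of reference entries and the number of body paragraphs that mention a date in that month.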
First, we scrape the HTML from https://xtools.wmflabs.org and https://en.wikipedia.org/wiki/ for each of the Wikipedia pages of interest. We then save the extracted data as .csv files named PAGENAME_monthly_edits.csv.
# Import relevant libraries For Extraction
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
# Define Header for request.get to bypass 403 error code
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
# List of wikipedia page names
page_names = ['Joe_Biden','Donald_Trump','Hillary_Clinton','Barack_Obama', 'Chuck_Schumer','Mitch_McConnell']
# Create a dictionary that maps a numerical month to its name, used when searching the web page for dates
month_names = ['January', 'February', 'March', 'April',
               'May', 'June', 'July', 'August',
               'September', 'October', 'November', 'December']
month_number = list(range(1, 13))
month_dict = dict(zip(month_number, month_names))
# Auxiliary function to extract the year and month name from a date formatted as a str, e.g. '2008-08'
def get_date(date):
    year = date[:4]
    month = month_dict[int(date[5:7])]
    return year, month
##### ONLY RUN IF YOU WANT TO SCRAPE WEB DATA #####
from io import StringIO
for page in page_names:
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' information scrape.')
    print('-'*50)
    url = 'https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/'+page
    # Scrape web
    r = requests.get(url, headers=headers)
    print(f'Status code: {r.status_code} URL: {url}')
    soup = BeautifulSoup(r.content, 'html.parser')
    # Find tables
    df_tables = []
    for t in soup.find_all("table"):
        # Wrap the HTML in StringIO; passing a literal HTML string to read_html is deprecated
        df_t = pd.read_html(StringIO(str(t)))
        df_tables.append(df_t[0])
    # Save Monthly Edits table (index 8 depends on the page layout at scrape time)
    monthly_edits = df_tables[8]
    # Scrape the Wikipedia page's references and body, counting the appearances of dates in each month
    url_page = 'https://en.wikipedia.org/wiki/'+page
    # Scrape web
    r_page = requests.get(url_page, headers=headers)
    print(f'Status code: {r_page.status_code} URL: {url_page}')
    soup_page = BeautifulSoup(r_page.content, 'html.parser')
    monthly_edits['Found References'] = 0
    monthly_edits['Found Body'] = 0
    for i, date in enumerate(monthly_edits['Month']):
        count_ref = 0
        count_body = 0
        year, month = get_date(str(date))
        # Search through references; the '.' matches the comma after the day, e.g. '(August 27, 2008)'
        regex_ref = r"\(\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year) + r"\)"
        c_ref = re.compile(regex_ref, re.I)  # with the ignorecase option
        for references in soup_page.find_all('ol'):
            ref = references.getText()
            split_ref = ref.split('\n')
            for item in split_ref:
                if len(c_ref.findall(item)):
                    count_ref += 1
        # Add the count once, after all reference lists have been searched
        monthly_edits.at[i, "Found References"] += count_ref
        # Search through page body via paragraphs
        regex_body = r"\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year)
        c_body = re.compile(regex_body, re.I)
        for paragraph in soup_page.find_all('p'):
            par = paragraph.getText()
            if len(c_body.findall(par)):
                count_body += 1
        monthly_edits.at[i, "Found Body"] += count_body
    print('Page searched for dates.')
    monthly_edits['Total Found'] = monthly_edits['Found Body'] + monthly_edits['Found References']
    # Save file
    file_name = page+'_monthly_edits.csv'
    monthly_edits.to_csv(file_name, index=False)
--------------------------------------------------
Joe Biden information scrape.
--------------------------------------------------
Status code: 200 URL: https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Joe_Biden
Status code: 200 URL: https://en.wikipedia.org/wiki/Joe_Biden
Page searched for dates.

--------------------------------------------------
Donald Trump information scrape.
--------------------------------------------------
Status code: 200 URL: https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Donald_Trump
Status code: 200 URL: https://en.wikipedia.org/wiki/Donald_Trump
Page searched for dates.

--------------------------------------------------
Hillary Clinton information scrape.
--------------------------------------------------
Status code: 200 URL: https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Hillary_Clinton
Status code: 200 URL: https://en.wikipedia.org/wiki/Hillary_Clinton
Page searched for dates.

--------------------------------------------------
Barack Obama information scrape.
--------------------------------------------------
Status code: 200 URL: https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Barack_Obama
Status code: 200 URL: https://en.wikipedia.org/wiki/Barack_Obama
Page searched for dates.

--------------------------------------------------
Chuck Schumer information scrape.
--------------------------------------------------
Status code: 200 URL: https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Chuck_Schumer
Status code: 200 URL: https://en.wikipedia.org/wiki/Chuck_Schumer
Page searched for dates.

--------------------------------------------------
Mitch McConnell information scrape.
--------------------------------------------------
Status code: 200 URL: https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Mitch_McConnell
Status code: 200 URL: https://en.wikipedia.org/wiki/Mitch_McConnell
Page searched for dates.
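To make the date matching in the scraping loop concrete, here is a small standalone check of the body regex on a few hand-written sentences. The helper name `body_date_pattern` and the sample sentences are introduced here for illustration only; the pattern itself is the one built in the loop above.

```python
import re

def body_date_pattern(year: str, month: str) -> re.Pattern:
    # Same shape as the body regex above: month name, a space, optional day digits,
    # one arbitrary character (the comma in practice), a space, then the year
    return re.compile(r"\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year), re.I)

pattern = body_date_pattern("2008", "August")

samples = [
    "On August 22, 2008, Obama announced that Biden would be his running mate.",
    "Biden was nominated in August 2008.",        # month and year only, no day
    "The convention opened on august 25, 2008.",  # lower-case month name
]
hits = [bool(pattern.search(s)) for s in samples]
print(hits)
```

The second sentence is not matched because the pattern expects a day and a separator between the month and the year, so mentions of a bare month and year are not counted by the scraper.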
Load the saved PAGENAME_monthly_edits.csv files for each page, look at the contents, and check the data types. Since we are dealing with multiple .csv files, one for each page, we store the tables in a dictionary with the page name as the key. This streamlines applying edits to all data sets.
# Init blank dict
dict_of_tables = dict()
for page in page_names:
    file_name = page+'_monthly_edits.csv'
    monthly_edits = pd.read_csv(file_name)
    dict_of_tables[page] = monthly_edits
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' page data')
    print('-'*50)
    display(dict_of_tables[page])
    # Check data types
    print(dict_of_tables[page].dtypes, '\n')
--------------------------------------------------
Joe Biden page data
--------------------------------------------------
| | Month | Edits | IPs | IPs % | Minor edits | Minor edits % | Edits · Minor edits · IPs | Found References | Found Body | Total Found |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2002-11 | 1 | 0 | 0% | 1 | 100% | NaN | 0 | 0 | 0 |
1 | 2002-12 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
2 | 2003-01 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
3 | 2003-02 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
4 | 2003-03 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
237 | 2022-08 | 114 | 0 | 0% | 21 | 18.4% | NaN | 13 | 3 | 16 |
238 | 2022-09 | 106 | 0 | 0% | 5 | 4.7% | NaN | 3 | 2 | 5 |
239 | 2022-10 | 37 | 0 | 0% | 12 | 32.4% | NaN | 3 | 1 | 4 |
240 | 2022-11 | 51 | 0 | 0% | 6 | 11.8% | NaN | 6 | 0 | 6 |
241 | 2022-12 | 62 | 0 | 0% | 17 | 27.4% | NaN | 2 | 1 | 3 |
242 rows × 10 columns
Month                         object
Edits                          int64
IPs                            int64
IPs %                         object
Minor edits                    int64
Minor edits %                 object
Edits · Minor edits · IPs    float64
Found References               int64
Found Body                     int64
Total Found                    int64
dtype: object

--------------------------------------------------
Donald Trump page data
--------------------------------------------------
| | Month | Edits | IPs | IPs % | Minor edits | Minor edits % | Edits · Minor edits · IPs | Found References | Found Body | Total Found |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2004-01 | 6 | 0 | 0% | 4 | 66.7% | NaN | 0 | 0 | 0 |
1 | 2004-02 | 2 | 1 | 50% | 1 | 50% | NaN | 0 | 0 | 0 |
2 | 2004-03 | 1 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
3 | 2004-04 | 22 | 11 | 50% | 4 | 18.2% | NaN | 1 | 0 | 1 |
4 | 2004-05 | 10 | 4 | 40% | 1 | 10% | NaN | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
223 | 2022-08 | 215 | 0 | 0% | 22 | 10.2% | NaN | 11 | 1 | 12 |
224 | 2022-09 | 131 | 0 | 0% | 9 | 6.9% | NaN | 2 | 0 | 2 |
225 | 2022-10 | 190 | 0 | 0% | 7 | 3.7% | NaN | 0 | 0 | 0 |
226 | 2022-11 | 224 | 0 | 0% | 37 | 16.5% | NaN | 7 | 3 | 10 |
227 | 2022-12 | 81 | 0 | 0% | 13 | 16% | NaN | 0 | 0 | 0 |
228 rows × 10 columns
Month                         object
Edits                          int64
IPs                            int64
IPs %                         object
Minor edits                    int64
Minor edits %                 object
Edits · Minor edits · IPs    float64
Found References               int64
Found Body                     int64
Total Found                    int64
dtype: object

--------------------------------------------------
Hillary Clinton page data
--------------------------------------------------
| | Month | Edits | IPs | IPs % | Minor edits | Minor edits % | Edits · Minor edits · IPs | Found References | Found Body | Total Found |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2001-03 | 2 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
1 | 2001-04 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
2 | 2001-05 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
3 | 2001-06 | 2 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
4 | 2001-07 | 4 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
257 | 2022-08 | 4 | 0 | 0% | 1 | 25% | NaN | 0 | 0 | 0 |
258 | 2022-09 | 8 | 0 | 0% | 3 | 37.5% | NaN | 1 | 0 | 1 |
259 | 2022-10 | 8 | 0 | 0% | 3 | 37.5% | NaN | 0 | 0 | 0 |
260 | 2022-11 | 30 | 0 | 0% | 9 | 30% | NaN | 0 | 0 | 0 |
261 | 2022-12 | 1 | 0 | 0% | 1 | 100% | NaN | 0 | 0 | 0 |
262 rows × 10 columns
Month                         object
Edits                          int64
IPs                            int64
IPs %                         object
Minor edits                    int64
Minor edits %                 object
Edits · Minor edits · IPs    float64
Found References               int64
Found Body                     int64
Total Found                    int64
dtype: object

--------------------------------------------------
Barack Obama page data
--------------------------------------------------
| | Month | Edits | IPs | IPs % | Minor edits | Minor edits % | Edits · Minor edits · IPs | Found References | Found Body | Total Found |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2004-03 | 4 | 1 | 25% | 0 | 0% | NaN | 5 | 0 | 5 |
1 | 2004-04 | 2 | 2 | 100% | 0 | 0% | NaN | 0 | 0 | 0 |
2 | 2004-05 | 0 | 0 | 0% | 0 | 0% | NaN | 2 | 0 | 2 |
3 | 2004-06 | 6 | 1 | 16.7% | 0 | 0% | NaN | 1 | 0 | 1 |
4 | 2004-07 | 82 | 18 | 22% | 36 | 43.9% | NaN | 6 | 0 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
221 | 2022-08 | 39 | 0 | 0% | 17 | 43.6% | NaN | 0 | 0 | 0 |
222 | 2022-09 | 46 | 0 | 0% | 11 | 23.9% | NaN | 1 | 0 | 1 |
223 | 2022-10 | 37 | 0 | 0% | 4 | 10.8% | NaN | 0 | 0 | 0 |
224 | 2022-11 | 51 | 0 | 0% | 13 | 25.5% | NaN | 0 | 0 | 0 |
225 | 2022-12 | 18 | 0 | 0% | 13 | 72.2% | NaN | 0 | 0 | 0 |
226 rows × 10 columns
Month                         object
Edits                          int64
IPs                            int64
IPs %                         object
Minor edits                    int64
Minor edits %                 object
Edits · Minor edits · IPs    float64
Found References               int64
Found Body                     int64
Total Found                    int64
dtype: object

--------------------------------------------------
Chuck Schumer page data
--------------------------------------------------
| | Month | Edits | IPs | IPs % | Minor edits | Minor edits % | Edits · Minor edits · IPs | Found References | Found Body | Total Found |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2003-09 | 2 | 1 | 50% | 0 | 0% | NaN | 0 | 0 | 0 |
1 | 2003-10 | 2 | 0 | 0% | 1 | 50% | NaN | 0 | 0 | 0 |
2 | 2003-11 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
3 | 2003-12 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
4 | 2004-01 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
227 | 2022-08 | 4 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
228 | 2022-09 | 1 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
229 | 2022-10 | 2 | 0 | 0% | 1 | 50% | NaN | 0 | 0 | 0 |
230 | 2022-11 | 13 | 0 | 0% | 2 | 15.4% | NaN | 0 | 0 | 0 |
231 | 2022-12 | 2 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
232 rows × 10 columns
Month                         object
Edits                          int64
IPs                            int64
IPs %                         object
Minor edits                    int64
Minor edits %                 object
Edits · Minor edits · IPs    float64
Found References               int64
Found Body                     int64
Total Found                    int64
dtype: object

--------------------------------------------------
Mitch McConnell page data
--------------------------------------------------
| | Month | Edits | IPs | IPs % | Minor edits | Minor edits % | Edits · Minor edits · IPs | Found References | Found Body | Total Found |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2003-10 | 1 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
1 | 2003-11 | 3 | 0 | 0% | 3 | 100% | NaN | 0 | 0 | 0 |
2 | 2003-12 | 3 | 0 | 0% | 2 | 66.7% | NaN | 0 | 0 | 0 |
3 | 2004-01 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
4 | 2004-02 | 0 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
226 | 2022-08 | 6 | 0 | 0% | 1 | 16.7% | NaN | 0 | 0 | 0 |
227 | 2022-09 | 1 | 0 | 0% | 0 | 0% | NaN | 0 | 0 | 0 |
228 | 2022-10 | 8 | 0 | 0% | 1 | 12.5% | NaN | 0 | 0 | 0 |
229 | 2022-11 | 20 | 0 | 0% | 7 | 35% | NaN | 0 | 0 | 0 |
230 | 2022-12 | 3 | 0 | 0% | 1 | 33.3% | NaN | 0 | 0 | 0 |
231 rows × 10 columns
Month                         object
Edits                          int64
IPs                            int64
IPs %                         object
Minor edits                    int64
Minor edits %                 object
Edits · Minor edits · IPs    float64
Found References               int64
Found Body                     int64
Total Found                    int64
dtype: object
The columns IPs %, Minor edits %, and Edits · Minor edits · IPs carry no information for our analysis, so we drop them. We also add a Major column, defined as the edits that are neither minor nor made from an IP address, and convert Month to a datetime format.
relevant_columns = ['Month', 'Edits', 'IPs', 'Minor edits', 'Found References', 'Found Body', 'Total Found']
for page in page_names:
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' page data')
    print('-'*50)
    monthly_edits = dict_of_tables[page]
    # Get relevant columns (copy so the assignments below do not write to a view of the loaded table)
    monthly_edits = monthly_edits.loc[:, relevant_columns].copy()
    # Add Major edits column: edits that are neither IP edits nor minor edits
    monthly_edits['Major'] = monthly_edits['Edits'] - monthly_edits['IPs'] - monthly_edits['Minor edits']
    # Convert Month to datetime
    monthly_edits['Month'] = pd.to_datetime(monthly_edits['Month'], format='%Y-%m')
    # Rearrange columns
    monthly_edits = monthly_edits[['Month', 'Edits', 'Major', 'IPs', 'Minor edits', 'Total Found', 'Found References', 'Found Body']]
    dict_of_tables[page] = monthly_edits
    display(dict_of_tables[page].head())
    print(dict_of_tables[page].dtypes)
--------------------------------------------------
Joe Biden page data
--------------------------------------------------
| | Month | Edits | Major | IPs | Minor edits | Total Found | Found References | Found Body |
---|---|---|---|---|---|---|---|---|
0 | 2002-11-01 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 2002-12-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2003-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2003-02-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2003-03-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Donald Trump page data
--------------------------------------------------
| | Month | Edits | Major | IPs | Minor edits | Total Found | Found References | Found Body |
---|---|---|---|---|---|---|---|---|
0 | 2004-01-01 | 6 | 2 | 0 | 4 | 0 | 0 | 0 |
1 | 2004-02-01 | 2 | 0 | 1 | 1 | 0 | 0 | 0 |
2 | 2004-03-01 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 2004-04-01 | 22 | 7 | 11 | 4 | 1 | 1 | 0 |
4 | 2004-05-01 | 10 | 5 | 4 | 1 | 0 | 0 | 0 |
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Hillary Clinton page data
--------------------------------------------------
| | Month | Edits | Major | IPs | Minor edits | Total Found | Found References | Found Body |
---|---|---|---|---|---|---|---|---|
0 | 2001-03-01 | 2 | 2 | 0 | 0 | 0 | 0 | 0 |
1 | 2001-04-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2001-05-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2001-06-01 | 2 | 2 | 0 | 0 | 0 | 0 | 0 |
4 | 2001-07-01 | 4 | 4 | 0 | 0 | 0 | 0 | 0 |
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Barack Obama page data
--------------------------------------------------
| | Month | Edits | Major | IPs | Minor edits | Total Found | Found References | Found Body |
---|---|---|---|---|---|---|---|---|
0 | 2004-03-01 | 4 | 3 | 1 | 0 | 5 | 5 | 0 |
1 | 2004-04-01 | 2 | 0 | 2 | 0 | 0 | 0 | 0 |
2 | 2004-05-01 | 0 | 0 | 0 | 0 | 2 | 2 | 0 |
3 | 2004-06-01 | 6 | 5 | 1 | 0 | 1 | 1 | 0 |
4 | 2004-07-01 | 82 | 28 | 18 | 36 | 6 | 6 | 0 |
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Chuck Schumer page data
--------------------------------------------------
| | Month | Edits | Major | IPs | Minor edits | Total Found | Found References | Found Body |
---|---|---|---|---|---|---|---|---|
0 | 2003-09-01 | 2 | 1 | 1 | 0 | 0 | 0 | 0 |
1 | 2003-10-01 | 2 | 1 | 0 | 1 | 0 | 0 | 0 |
2 | 2003-11-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2003-12-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2004-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Mitch McConnell page data
--------------------------------------------------
| | Month | Edits | Major | IPs | Minor edits | Total Found | Found References | Found Body |
---|---|---|---|---|---|---|---|---|
0 | 2003-10-01 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 2003-11-01 | 3 | 0 | 0 | 3 | 0 | 0 | 0 |
2 | 2003-12-01 | 3 | 1 | 0 | 2 | 0 | 0 | 0 |
3 | 2004-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2004-02-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object
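As a quick sanity check of the derived column, the sketch below recomputes Major on a two-row toy table whose values are copied from the Donald Trump rows for 2004-04 and 2004-05 shown above. The variable name `toy` is introduced here only for illustration.

```python
import pandas as pd

# Two rows copied from the Donald Trump table (2004-04 and 2004-05)
toy = pd.DataFrame({
    "Month": ["2004-04", "2004-05"],
    "Edits": [22, 10],
    "IPs": [11, 4],
    "Minor edits": [4, 1],
})

# Major edits are the edits that are neither IP edits nor minor edits
toy["Major"] = toy["Edits"] - toy["IPs"] - toy["Minor edits"]

# A 'YYYY-MM' string becomes a timestamp on the first day of that month
toy["Month"] = pd.to_datetime(toy["Month"], format="%Y-%m")
print(toy)
```

The resulting Major values agree with the corresponding rows in the cleaned table, and each month is represented by its first day.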
Now consider the summary statistics for each page.
for page in page_names:
    monthly_edits = dict_of_tables[page][['Edits', 'IPs', 'Minor edits', 'Total Found', 'Found Body', 'Found References']]
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' page data summary')
    print('-'*50)
    # All selected columns are numeric, so describe() needs no extra arguments
    # (datetime_is_numeric was removed in pandas 2.0)
    print(monthly_edits.describe())
--------------------------------------------------
Joe Biden page data summary
--------------------------------------------------
              Edits          IPs  Minor edits  Total Found   Found Body  Found References
count    242.000000   242.000000   242.000000   242.000000   242.000000        242.000000
mean      45.231405     3.140496     9.822314     1.599174     0.107438          1.491736
std       97.623611    11.855846    24.010834     2.915205     0.392918          2.695073
min        0.000000     0.000000     0.000000     0.000000     0.000000          0.000000
25%        7.000000     0.000000     1.000000     0.000000     0.000000          0.000000
50%       17.500000     0.000000     4.000000     0.000000     0.000000          0.000000
75%       39.750000     0.000000    10.000000     2.000000     0.000000          2.000000
max      913.000000   136.000000   286.000000    20.000000     3.000000         19.000000

--------------------------------------------------
Donald Trump page data summary
--------------------------------------------------
              Edits          IPs  Minor edits  Total Found   Found Body  Found References
count    228.000000   228.000000   228.000000   228.000000   228.000000        228.000000
mean     173.925439    11.592105    35.337719     3.236842     0.131579          3.105263
std      232.323428    26.092437    51.270939     5.724671     0.593847          5.380448
min        0.000000     0.000000     0.000000     0.000000     0.000000          0.000000
25%       23.000000     0.000000     6.000000     0.000000     0.000000          0.000000
50%       84.500000     0.000000    16.500000     0.000000     0.000000          0.000000
75%      227.000000    10.250000    43.000000     5.000000     0.000000          5.000000
max     1475.000000   131.000000   373.000000    43.000000     6.000000         37.000000

--------------------------------------------------
Hillary Clinton page data summary
--------------------------------------------------
              Edits          IPs  Minor edits  Total Found   Found Body  Found References
count    262.000000   262.000000   262.000000   262.000000   262.000000        262.000000
mean      65.408397     7.480916    15.816794     0.931298     0.087786          0.843511
std      103.312072    24.911291    31.531484     1.766833     0.355477          1.551892
min        0.000000     0.000000     0.000000     0.000000     0.000000          0.000000
25%       11.000000     0.000000     2.000000     0.000000     0.000000          0.000000
50%       24.000000     0.000000     6.000000     0.000000     0.000000          0.000000
75%       63.250000     0.000000    17.000000     1.000000     0.000000          1.000000
max      910.000000   158.000000   394.000000    13.000000     4.000000          9.000000

--------------------------------------------------
Barack Obama page data summary
--------------------------------------------------
              Edits          IPs  Minor edits  Total Found   Found Body  Found References
count    226.000000   226.000000   226.000000   226.000000   226.000000        226.000000
mean     128.115044     7.393805    33.703540     2.017699     0.300885          1.716814
std      192.301904    25.247218    50.720723     2.716149     0.651630          2.339873
min        0.000000     0.000000     0.000000     0.000000     0.000000          0.000000
25%       38.250000     0.000000    10.000000     0.000000     0.000000          0.000000
50%       57.500000     0.000000    15.000000     1.000000     0.000000          1.000000
75%      113.750000     0.000000    29.000000     2.000000     0.000000          2.000000
max     1434.000000   186.000000   451.000000    15.000000     3.000000         12.000000

--------------------------------------------------
Chuck Schumer page data summary
--------------------------------------------------
              Edits          IPs  Minor edits  Total Found   Found Body  Found References
count    232.000000   232.000000   232.000000   232.000000   232.000000        232.000000
mean      15.159483     4.810345     3.560345     0.745690     0.073276          0.672414
std       15.643360     7.098459     4.235060     1.408071     0.333902          1.281070
min        0.000000     0.000000     0.000000     0.000000     0.000000          0.000000
25%        5.000000     0.000000     1.000000     0.000000     0.000000          0.000000
50%       10.000000     2.000000     2.500000     0.000000     0.000000          0.000000
75%       18.250000     7.000000     4.000000     1.000000     0.000000          1.000000
max      108.000000    40.000000    30.000000     9.000000     3.000000          8.000000

--------------------------------------------------
Mitch McConnell page data summary
--------------------------------------------------
              Edits          IPs  Minor edits  Total Found   Found Body  Found References
count    231.000000   231.000000   231.000000   231.000000   231.000000        231.000000
mean      21.142857     4.865801     4.930736     0.696970     0.086580          0.610390
std       23.736258     7.237835     5.964712     1.493155     0.311158          1.352831
min        0.000000     0.000000     0.000000     0.000000     0.000000          0.000000
25%        6.000000     0.000000     1.000000     0.000000     0.000000          0.000000
50%       14.000000     2.000000     3.000000     0.000000     0.000000          0.000000
75%       27.000000     7.000000     6.000000     1.000000     0.000000          1.000000
max      139.000000    47.000000    47.000000    13.000000     2.000000         12.000000
We now plot the edit trends. Below we plot the time series of page edits: for each page we plot major, minor, and IP edits. In addition, we mark the months whose dates are found on the Wikipedia page. The marker size corresponds to the frequency of the dates referenced, i.e. the larger the marker, the more often dates in that month are cited. Since at present we are not interested in distinguishing between citations in the references and in the page body, we plot the combined count.
import matplotlib.pyplot as plt
import datetime as dt
plt.rcParams['figure.figsize'] = [12, 7]
for page in page_names:
    monthly_edits = dict_of_tables[page]
    fig, ax = plt.subplots()
    ax.plot(monthly_edits['Month'], monthly_edits['Major'], linewidth=2.0, label='Major Edits')
    ax.plot(monthly_edits['Month'], monthly_edits['Minor edits'], linewidth=2.0, label='Minor Edits')
    ax.plot(monthly_edits['Month'], monthly_edits['IPs'], linewidth=2.0, label='IPs Edits')
    # Mark months whose dates appear on the page; marker size scales with the number of appearances
    ax.scatter(monthly_edits['Month'], monthly_edits['Major'],
               s=monthly_edits['Total Found']*10, marker='o', c='purple', label='Dates Found')
    # Use a single legend call; a second ax.legend(...) would replace the first
    ax.legend()
    plt.xlabel('Date')
    plt.title(' '.join(page.split('_'))+' page Edits')
    plt.show()
We see some repeating 'spikes', which may suggest periodicity. The natural guess is that these spikes occur near important political dates. Interestingly, where there is a large amount of edit activity there are also many cited dates, suggesting, at least visually, that the number of edits is related to the presence of citations dated in that month.
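One way to put a number on this visual impression is a rank correlation between monthly edits and the number of dates found. The sketch below computes a Spearman-style coefficient as the Pearson correlation of the ranks, on a made-up six-month table; the values are illustrative only and are not taken from the scraped data.

```python
import pandas as pd

# Made-up monthly values: one large spike in edits that coincides with many cited dates
toy = pd.DataFrame({
    "Edits": [5, 12, 40, 913, 60, 8],
    "Total Found": [0, 0, 2, 20, 3, 0],
})

# Spearman's rank correlation equals the Pearson correlation of the (average) ranks
edit_ranks = toy["Edits"].rank()
found_ranks = toy["Total Found"].rank()
rho = edit_ranks.corr(found_ranks)
print(round(rho, 3))
```

Working with ranks keeps a single extreme month, such as an election month, from dominating the coefficient the way it would in a plain Pearson correlation of the raw counts.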
We now consider which events occurred in a given month for a specific page. The following function takes a page name and a month and year of interest, then prints all the matching citations in both the body of the Wikipedia page and its references.
def print_date_info(date, page):
    url_page = 'https://en.wikipedia.org/wiki/'+page
    # Scrape web
    r_page = requests.get(url_page, headers=headers)
    #print(f'Status code: {r_page.status_code} URL: {url_page}')
    soup_page = BeautifulSoup(r_page.content, 'html.parser')
    year, month = get_date(str(date))
    # Search through references
    regex_ref = r"\(\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year) + r"\)"
    c_ref = re.compile(regex_ref, re.I)  # with the ignorecase option
    found_ref = False
    print('Searching References...\n')
    for references in soup_page.find_all('ol'):
        ref = references.getText()
        split_ref = ref.split('\n')
        for item in split_ref:
            if len(c_ref.findall(item)):
                reference_date = re.findall(r'\(.*\)', item)[0]
                # Fall back to an empty title if the reference has no quoted title
                titles = re.findall(r'["].*["]', item)
                reference_title = titles[0] if titles else ''
                print(reference_date+' '+reference_title)
                found_ref = True
    if not found_ref:
        print('No date in references found.')
    # Search through page body via paragraphs
    regex_body = r"\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year)
    found_body = False
    c_body = re.compile(regex_body, re.I)
    print('\nSearching body...\n')
    for paragraph in soup_page.find_all('p'):
        par = paragraph.getText()
        if len(c_body.findall(par)):
            print(par)
            found_body = True
    if not found_body:
        print('No date in body found.')
First we consider a histogram of the number of edits per month. That is, in the figure below we plot the normalized distribution of monthly edit counts for each page of interest. We then report the information found for the month with the maximum number of edits on each page.
list_of_counts = []
nbins = 20
for page in page_names:
    monthly_edits = dict_of_tables[page]
    found_count = monthly_edits['Edits']
    list_of_counts.append(found_count)
plt.hist(list_of_counts, nbins, density=True, histtype='bar')
plt.legend(page_names)
plt.show()
for page in page_names:
    monthly_edits = dict_of_tables[page]
    # Row for the month with the largest number of edits
    max_edit_entry = monthly_edits.sort_values(by='Edits', ascending=False).iloc[0]
    print('-'*50)
    print(' '.join(page.split('_'))+ ' Max edits')
    print('-'*50)
    print(max_edit_entry['Month'])
    print('Number of edits:', max_edit_entry['Edits'])
    print('Number of citations found in references:', max_edit_entry['Found References'])
    print('Number of citations found in body:', max_edit_entry['Found Body'])
    print_date_info(max_edit_entry['Month'], page)
--------------------------------------------------
Joe Biden Max edits
--------------------------------------------------
2008-08-01 00:00:00
Number of edits: 913
Number of citations found in references: 19
Number of citations found in body: 1
Searching References...

(August 27, 2008) "Biden's Scranton childhood left lasting impression"
(August 24, 2008) "In his home state, Biden is a regular Joe"
(August 24, 2008) "Jill Biden Heads Toward Life in the Spotlight"
(August 25, 2008) "Parishioners not surprised to see Biden at usual Mass"
(August 20, 2008) "Biden's Foreign Policy Background Carries Growing Cachet"
(August 26, 2008) "For Widener Law students, a teacher aims high"
(August 27, 2008) "Widener students proud of Biden"
(August 24, 2008) "In Biden, Obama chooses a foreign policy adherent of diplomacy before force"
(August 23, 2008) "V.P. candidate profile: Sen. Joe Biden"
(August 23, 2008) "Biden and Anita Hill, Revisited"
(August 24, 2008) "Joe Biden respected—if not always popular—for foreign policy record"
(August 25, 2008) "Biden, McCain Have a Friendship—and More—in Common"
(August 23, 2008) "Obama's veep message to supporters"
(August 23, 2008) "Obama Chooses Biden as Running Mate"
(August 25, 2008) "Tramps Like Us: How Joe Biden will reassure working class voters and change the tenor of this week's convention"
(August 27, 2008) "Biden accepts VP nomination"
(August 24, 2008) "Biden Wages 2 Campaigns At Once"
(August 24, 2008) "Demographics part of calculation: Biden adds experience, yes, but he could also help with Catholics, blue-collar whites and women"
(August 23, 2008) "Halperin on Biden: Pros and Cons"

Searching body...

Shortly after Biden withdrew from the presidential race, Obama privately told him he was interested in finding an important place for Biden in his administration.[170] In early August, Obama and Biden met in secret to discuss the possibility,[170] and developed a strong personal rapport.[169] On August 22, 2008, Obama announced that Biden would be his running mate.[171] The New York Times reported that the strategy behind the choice reflected a desire to fill out the ticket with someone with foreign policy and national security experience.[172] Others pointed out Biden's appeal to middle-class and blue-collar voters.[173][174] Biden was officially nominated for vice president on August 27 by voice vote at the 2008 Democratic National Convention in Denver.[175]
--------------------------------------------------
Donald Trump Max edits
--------------------------------------------------
2016-11-01 00:00:00
Number of edits: 1475
Number of citations found in references: 11
Number of citations found in body: 1
Searching References...

(November 18, 2016) "Donald Trump Agrees to Pay $25 Million in Trump University Settlement"
(November 15, 2016) "Clickbait scoops and an engaged alt-right: everything to know about Breitbart News"
(November 23, 2016) "Donald Trump disavows 'alt-right'"
(November 11, 2016) "Donald Trump will be the only US president ever with no political or military experience"
(November 9, 2016) "Trump pulls off biggest upset in U.S. history"
(November 9, 2016) "Why Trump Won: Working-Class Whites"
(November 9, 2016) "Republicans are poised to grasp the holy grail of governance"
(November 10, 2016) "Protests against Donald Trump break out nationwide"
(November 11, 2016) "Trump says protesters have 'passion for our great country' after calling demonstrations 'very unfair'"
(November 9, 2016) "Trump Can Kill Obamacare With Or Without Help From Congress"
(November 15, 2016) "Trump: Same-sex marriage is 'settled', but Roe v Wade can be changed"

Searching body...

On November 8, 2016, Trump received 306 pledged electoral votes versus 232 for Clinton. The official counts were 304 and 227 respectively, after defections on both sides.[208] Trump received nearly 2.9 million fewer popular votes than Clinton, which made him the fifth person to be elected president while losing the popular vote.[209] Trump is the only president who neither served in the military nor held any government office prior to becoming president.[210]
--------------------------------------------------
Hillary Clinton Max edits
--------------------------------------------------
2005-07-01 00:00:00
Number of edits: 910
Number of citations found in references: 2
Number of citations found in body: 0
Searching References...

(July 14, 2005) "Clinton among senators urging larger-sized army"
(July 14, 2005) "Clinton burnishes hawkish image"

Searching body...

No date in body found.
--------------------------------------------------
Barack Obama Max edits
--------------------------------------------------
2008-11-01 00:00:00
Number of edits: 1434
Number of citations found in references: 7
Number of citations found in body: 2
Searching References...

(November 4, 2008) "Obama Elected President as Racial Barrier Falls"
(November 9, 2008) "Obama stood out, even during brief 1985 NYPIRG job"
(November 17, 2008) "Obama's church choice likely to be scrutinized; D.C. churches have started extending invitations to Obama and his family"
(November 16, 2008) "Obama resigns Senate seat, thanks Illinois"
(November 4, 2008) "Barack Obama elected 44th president"
(November 5, 2008) "Change has come, says President-elect Obama"
(November 30, 2008) "Obama: Oratory and originality"

Searching body...

In a 2006 interview, Obama highlighted the diversity of his extended family: "It's like a little mini-United Nations," he said. "I've got relatives who look like Bernie Mac, and I've got relatives who look like Margaret Thatcher."[69] Obama has a half-sister with whom he was raised (Maya Soetoro-Ng) and seven other half-siblings from his Kenyan father's family—six of them living.[70] Obama's mother was survived by her Kansas-born mother, Madelyn Dunham,[71] until her death on November 2, 2008,[72] two days before his election to the presidency. Obama also has roots in Ireland; he met with his Irish cousins in Moneygall in May 2011.[73] In Dreams from My Father, Obama ties his mother's family history to possible Native American ancestors and distant relatives of Jefferson Davis, President of the Confederate States of America during the American Civil War. He also shares distant ancestors in common with George W. Bush and Dick Cheney, among others.[74][75][76]
Obama resigned his Senate seat on November 16, 2008, to focus on his transition period for the presidency.[170]
--------------------------------------------------
Chuck Schumer Max edits
--------------------------------------------------
2021-01-01 00:00:00
Number of edits: 108
Number of citations found in references: 2
Number of citations found in body: 3
Searching References...

(January 20, 2021) "Schumer becomes new Senate majority leader"
(January 6, 2021) "Nancy Pelosi and Chuck Schumer call on Trump to demand protestors to leave the US Capitol 'immediately'"

Searching body...

Charles Ellis Schumer (/ˈʃuːmər/ SHOO-mər; born November 23, 1950) is an American politician serving as Senate Majority Leader since January 20, 2021.[2] A member of the Democratic Party, Schumer is in his fourth Senate term, having held his seat since 1999, and is the senior United States senator from New York. He is the dean of New York's congressional delegation.
The Senate Democratic caucus elected Schumer minority leader in November 2016. Schumer had been widely expected to lead Senate Democrats after Reid announced his retirement in 2015.
He is the first New Yorker, as well as the first Jewish person, to serve as a Senate leader.[36] On January 20, 2021, Democrats gained control of the Senate with the swearing-in of newly elected Georgia senators Jon Ossoff and Raphael Warnock, following the 2020–21 election runoff and special election runoff, making Schumer the majority leader, replacing Republican Mitch McConnell.[37] Schumer was participating in the certification of the 2021 United States Electoral College vote count on January 6, 2021, when Trump supporters stormed the U.S. Capitol. Schumer and other members of Congress were removed from the Senate chambers. He and Mitch McConnell joined Nancy Pelosi and Steny Hoyer in an undisclosed location. As the attack persisted, Schumer and Pelosi released a joint statement calling on Trump to demand the rioters leave the Capitol and its grounds immediately.[125] When the Senate reconvened after the Capitol was secure, Schumer gave remarks, calling it a day "that will live forever in infamy".[126] Later that day, he blamed Trump for the attack, calling on Vice President Mike Pence to invoke the Twenty-fifth Amendment to the United States Constitution to remove Trump from office. He also said he would support impeachment.[127] -------------------------------------------------- Mitch McConnell Max edits -------------------------------------------------- 2021-01-01 00:00:00 Number of edits: 139 Number of citations found in references: 4 Number of citations found in body: 1 Searching References... (January 6, 2021) "Analysis | Mitch McConnell's forceful rejection of Trump's election 'conspiracy theories'" (January 6, 2021) "Resuming electoral counting, McConnell condemns the mob assault on the Capitol as a 'failed insurrection.'" (January 12, 2021) "McConnell is said to be pleased about impeachment, believing it will be easier to purge Trump from the G.O.P." 
(January 13, 2021) "McConnell won't agree to reconvene Senate early for impeachment trial" Searching body... On January 12, 2021, it was reported that McConnell supported impeaching Trump for his role in inciting the 2021 storming of the United States Capitol, believing it would make it easier for Republicans to purge the party of Trump and rebuild the party.[94] On January 13, despite having the authority to call for an emergency meeting of the Senate to hold the Senate trial,[failed verification] McConnell did not reconvene the chamber, claiming unanimous consent was required.[95] McConnell called for delaying the Senate trial until after Joe Biden's inauguration.[96] Once the Senate trial started, McConnell voted to acquit Trump on February 13, 2021, and said it was unconstitutional to convict someone who was no longer in office.[97]
Reading the report, we see that the months with the maximum number of edits all have citations for noteworthy events. Picking a few examples: for Joe Biden, the month of maximum edits was August 2008, when it was announced that he would be Obama's running mate. On Donald Trump's page, the month of maximum edits is November 2016, the month he won the US presidential election. Similarly, on Barack Obama's page the month of maximum edits is November 2008, when Obama won the US presidential election. This is consistent with our hypothesis that Wikipedia edit trends are indicative of important events.
We now consider a histogram of the number of citations per date. That is, in the figure below we plot, for each page of interest, the distribution of citation counts over the dates that have at least one citation. We then report the information found for the date with the maximum number of citations on a few selected pages.
# Collect the non-zero citation counts for each page
list_of_counts = []
nbins = 10
for page in page_names:
    monthly_edits = dict_of_tables[page]
    found_count = monthly_edits[monthly_edits['Total Found'] != 0]['Total Found']
    list_of_counts.append(found_count)
plt.hist(list_of_counts, nbins, density=True, histtype='bar')
plt.legend(page_names)
plt.show()
# Report the month with the most citations for a few selected pages
for page in ['Donald_Trump', 'Joe_Biden', 'Barack_Obama', 'Mitch_McConnell']:
    monthly_edits = dict_of_tables[page]
    max_edit_entry = monthly_edits.sort_values(by='Total Found', ascending=False).iloc[0]
    print('-'*50)
    print(' '.join(page.split('_')) + ' Max Citation')
    print('-'*50)
    print(max_edit_entry['Month'])
    print('Number of edits:', max_edit_entry['Edits'])
    print('Number of citations found in references:', max_edit_entry['Found References'])
    print('Number of citations found in body:', max_edit_entry['Found Body'])
    print_date_info(max_edit_entry['Month'], page)
-------------------------------------------------- Donald Trump Max Citation -------------------------------------------------- 2021-01-01 00:00:00 Number of edits: 1055 Number of citations found in references: 37 Number of citations found in body: 6 Searching References... (January 20, 2021) "Trump's presidency ends where so much of it was spent: A Trump Organization property" (January 12, 2021) "Deutsche Bank won't do any more business with Trump" (January 25, 2021) "Supreme Court dismisses emoluments cases against Trump" (January 8, 2021) "Trump will have the worst jobs record in modern U.S. history. It's not just the pandemic" (January 14, 2021). "Donald Trump Built a National Debt So Big (Even Before the Pandemic) "Donald Trump Built a National Debt So Big (Even Before the Pandemic) That It'll Weigh Down the Economy for Years" (January 30, 2021) "From building the wall to bringing back coal: Some of Trump's more notable broken promises" (January 20, 2021) "The Trump Administration Rolled Back More Than 100 Environmental Rules. Here's the Full List" (January 16, 2021) "In Trump's final days, a rush of federal executions" (January 15, 2021) "Trump administration carries out 13th and final execution" (January 20, 2021) "With Hours Left in Office, Trump Grants Clemency to Bannon and Other Allies" (January 13, 2021) "Fact check: Mexico never paid for it. But what about Trump's other border wall promises?" (January 13, 2021) "How Trump compares with other recent presidents in appointing federal judges" (January 3, 2021) "'I just want to find 11,780 votes': In extraordinary hour-long call, Trump pressures Georgia secretary of state to recalculate the vote in his favor" (January 5, 2021) "Pence Said to Have Told Trump He Lacks Power to Change Election Result" (January 20, 2021) "Trump Departs Vowing, 'We Will Be Back in Some Form'" (January 10, 2021) "Incitement to Riot? 
What Trump Told Supporters Before Mob Stormed Capitol" (January 9, 2021) "How one of America's ugliest days unraveled inside and outside the Capitol" (January 6, 2021) "Facebook, Twitter lock Trump's account following video addressing Washington rioters" (January 6, 2021) "Calls grow for social media platforms to silence Trump as rioters storm US Capitol" (January 6, 2021) "Congress confirms Biden's win after pro-Trump mob's assault on Capitol" (January 11, 2021) "Impeachment Resolution Cites Trump's 'Incitement' of Capitol Insurrection" (January 13, 2021) "Trump Impeached for Inciting Insurrection" (January 13, 2021) "House calls on Pence to invoke 25th Amendment, but he's already dismissed the idea" (January 13, 2021) "Trump's second impeachment is the most bipartisan one in history" (January 28, 2021) "Palm Beach considers options as Trump remains at Mar-a-Lago" (January 27, 2021) "Explainer: Why Trump's post-presidency perks, like a pension and office, are safe for the rest of his life" (January 27, 2021) "Trump opens "Office of the Former President" in Florida" (January 18, 2021) "Last Trump Job Approval 34%; Average Is Record-Low 41%" (January 16, 2021) "Trump finishes with worst first term approval rating ever" (January 16, 2021) "Inside Twitter's Decision to Cut Off Trump" (January 9, 2021) "A farewell to @realDonaldTrump, gone after 57,000 tweets" (January 11, 2021) "All the platforms that have banned or restricted Trump so far" (January 14, 2021) "Twitter ban reveals that tech companies held keys to Trump's power all along" (January 16, 2021) "Misinformation dropped dramatically the week after Twitter banned Trump and some allies" (January 23, 2021) "A term of untruths: The longer Trump was president, the more frequently he made false or misleading claims" (January 15, 2021) "6 conspiracy theories about the 2020 election – debunked" (January 16, 2021) "'Trump said to do so': Accounts of rioters who say the president spurred them to rush the Capitol could 
be pivotal testimony" Searching body... Trump lost the 2020 presidential election to Joe Biden but refused to concede defeat, falsely claiming widespread electoral fraud and attempting to overturn the results by pressuring government officials, mounting scores of unsuccessful legal challenges, and obstructing the presidential transition. On January 6, 2021, Trump urged his supporters to march to the United States Capitol, which many of them then attacked, resulting in multiple deaths and interrupting the electoral vote count. In November 2022, he announced his candidacy for the Republican nomination in the 2024 presidential election. Trump, who had been a member since 1989, resigned from the Screen Actors Guild in February 2021 rather than face a disciplinary committee hearing for inciting the January 6, 2021, mob attack on the U.S. Capitol and for his "reckless campaign of misinformation aimed at discrediting and ultimately threatening the safety of journalists."[146] Two days later, the union permanently barred him from readmission.[147] On January 6, 2021, while congressional certification of the presidential election results was taking place in the United States Capitol, Trump held a rally at the Ellipse, Washington, D.C., where he called for the election result to be overturned and urged his supporters to "take back our country" by marching to the Capitol to "show strength" and "fight like hell".[647][648] Trump's speech started at noon. By 12:30 p.m., rally attendees had gathered outside the Capitol, and at 1 p.m., his supporters pushed past police barriers onto Capitol grounds. Trump's speech ended at 1:10 p.m., and many supporters marched to the Capitol as he had urged, joining the crowd there. Around 2:15 p.m. 
the mob broke into the building, disrupting certification and causing the evacuation of Congress.[649] During the violence, Trump posted mixed messages on Twitter and Facebook, eventually tweeting to the rioters at 6 p.m., "go home with love & in peace", but describing them as "great patriots" and "very special", while still complaining that the election was stolen.[650][651] After the mob was removed from the Capitol, Congress reconvened and confirmed the Biden election win in the early hours of the following morning.[652] There were many injuries, and five people, including a Capitol Police officer, died.[653] On January 11, 2021, an article of impeachment charging Trump with incitement of insurrection against the U.S. government was introduced to the House.[654] The House voted 232–197 to impeach Trump on January 13, making him the first U.S. president to be impeached twice.[655] The impeachment, which was the most rapid in history, followed an unsuccessful bipartisan effort to strip Trump of his powers and duties via Section 4 of the 25th Amendment.[656] Ten Republicans voted for impeachment—the most members of a party ever to vote to impeach a president of their own party.[657] On November 18, 2022, Garland appointed a special counsel, federal prosecutor Jack Smith, to oversee the federal criminal investigations into Trump retaining government property at Mar-a-Lago and examining Trump's role in the events leading up to the January 6, 2021, Capitol attack.[700][701] Research suggests Trump's rhetoric caused an increased incidence of hate crimes.[813][814] During his 2016 campaign, he urged or praised physical attacks against protesters or reporters.[815][816] Numerous defendants investigated or prosecuted for violent acts and hate crimes, including participants of the January 6, 2021, storming of the U.S. 
Capitol, cited Trump's rhetoric in arguing that they were not culpable or should receive a lighter sentence.[817][818] A nationwide review by ABC News in May 2020 identified at least 54 criminal cases from August 2015 to April 2020 in which Trump was invoked in direct connection with violence or threats of violence mostly by white men and primarily against members of minority groups.[819] -------------------------------------------------- Joe Biden Max Citation -------------------------------------------------- 2008-08-01 00:00:00 Number of edits: 913 Number of citations found in references: 19 Number of citations found in body: 1 Searching References... (August 27, 2008) "Biden's Scranton childhood left lasting impression" (August 24, 2008) "In his home state, Biden is a regular Joe" (August 24, 2008) "Jill Biden Heads Toward Life in the Spotlight" (August 25, 2008) "Parishioners not surprised to see Biden at usual Mass" (August 20, 2008) "Biden's Foreign Policy Background Carries Growing Cachet" (August 26, 2008) "For Widener Law students, a teacher aims high" (August 27, 2008) "Widener students proud of Biden" (August 24, 2008) "In Biden, Obama chooses a foreign policy adherent of diplomacy before force" (August 23, 2008) "V.P. candidate profile: Sen. 
Joe Biden" (August 23, 2008) "Biden and Anita Hill, Revisited" (August 24, 2008) "Joe Biden respected—if not always popular—for foreign policy record" (August 25, 2008) "Biden, McCain Have a Friendship—and More—in Common" (August 23, 2008) "Obama's veep message to supporters" (August 23, 2008) "Obama Chooses Biden as Running Mate" (August 25, 2008) "Tramps Like Us: How Joe Biden will reassure working class voters and change the tenor of this week's convention" (August 27, 2008) "Biden accepts VP nomination" (August 24, 2008) "Biden Wages 2 Campaigns At Once" (August 24, 2008) "Demographics part of calculation: Biden adds experience, yes, but he could also help with Catholics, blue-collar whites and women" (August 23, 2008) "Halperin on Biden: Pros and Cons" Searching body... Shortly after Biden withdrew from the presidential race, Obama privately told him he was interested in finding an important place for Biden in his administration.[170] In early August, Obama and Biden met in secret to discuss the possibility,[170] and developed a strong personal rapport.[169] On August 22, 2008, Obama announced that Biden would be his running mate.[171] The New York Times reported that the strategy behind the choice reflected a desire to fill out the ticket with someone with foreign policy and national security experience.[172] Others pointed out Biden's appeal to middle-class and blue-collar voters.[173][174] Biden was officially nominated for vice president on August 27 by voice vote at the 2008 Democratic National Convention in Denver.[175] -------------------------------------------------- Barack Obama Max Citation -------------------------------------------------- 2017-01-01 00:00:00 Number of edits: 319 Number of citations found in references: 12 Number of citations found in body: 3 Searching References... 
(January 9, 2017) "Barack Obama's Shaky Legacy on Human Rights" (January 14, 2017) "Jolted by Deaths, Obama Found His Voice on Race" (January 6, 2017) "US House Passes Motion Repudiating UN Resolution on Israel" (January 6, 2017) "Final tally: Obama created 11.3 million jobs" (January 4, 2017) "LGBT activists view Obama as staunch champion of their cause" (January 9, 2017) "Americans Assess Progress Under Obama" (January 15, 2017) "Why Did the US Drop 26,171 Bombs on the World Last Year?" (January 19, 2017) "Map shows where President Barack Obama dropped his 20,000 bombs" (January 13, 2017) "President Obama, who hoped to sow peace, instead led the nation in war" (January 5, 2017) "Federal prison population fell during Obama's term, reversing recent trend" (January 18, 2017) "Obama leaving office at 60 percent approval rating" (January 18, 2017) "Obama approval hits 60 percent as end of term approaches" Searching body... After winning re-election by defeating Republican opponent Mitt Romney, Obama was sworn in for a second term on January 20, 2013. In his second term, Obama took steps to combat climate change, signing a major international climate agreement and an executive order to limit carbon emissions. Obama also presided over the implementation of the Affordable Care Act and other legislation passed in his first term, and he negotiated a nuclear agreement with Iran and normalized relations with Cuba. The number of American soldiers in Afghanistan fell dramatically during Obama's second term, though U.S. soldiers remained in Afghanistan throughout Obama's presidency. Obama left office on January 20, 2017, and continues to reside in Washington, D.C. His presidential library in Chicago began construction in 2021. 
On December 23, 2016, under the Obama Administration, the United States abstained from United Nations Security Council Resolution 2334, which condemned Israeli settlement building in the occupied Palestinian territories as a violation of international law, effectively allowing it to pass.[388] Netanyahu strongly criticized the Obama administration's actions,[389][390] and the Israeli government withdrew its annual dues from the organization, which totaled $6 million, on January 6, 2017.[391] On January 5, 2017, the United States House of Representatives voted 342–80 to condemn the UN Resolution.[392][393] Obama's presidency ended on January 20, 2017, upon the inauguration of his successor, Donald Trump.[472][473] The family moved to a house they rented in Kalorama, Washington, D.C.[474] On March 2, 2017, the John F. Kennedy Presidential Library and Museum awarded the Profile in Courage Award to Obama "for his enduring commitment to democratic ideals and elevating the standard of political courage."[475] His first public appearance since leaving the office was a seminar at the University of Chicago on April 24, where he appealed for a new generation to participate in politics.[476] -------------------------------------------------- Mitch McConnell Max Citation -------------------------------------------------- 2019-01-01 00:00:00 Number of edits: 123 Number of citations found in references: 12 Number of citations found in body: 1 Searching References... (January 22, 2019) "Mitch McConnell Got Everything He Wanted. But at What Cost?" (January 2, 2019) "McConnell suggests shutdown could last for weeks" (January 4, 2019) "McConnell keeps his head down as government shutdown drags on" (January 10, 2019) "Senate Democrats pushed a vote to reopen the government. Mitch McConnell shot them down" (January 11, 2019) "Mitch McConnell could end the shutdown. 
But he's sitting this one out" (January 23, 2019) "McConnell blocks bill to reopen most of government" (January 25, 2019) "'This is your fault': GOP senators clash over shutdown inside private luncheon" (January 3, 2019) "McConnell Faces Pressure From Republicans to Stop Avoiding Shutdown Fight" (January 25, 2019) "Trump signs bill to end shutdown and temporarily reopen government" (January 9, 2019) "The Government Shutdown Was the Longest Ever. Here's the History" (January 10, 2019) "America's Most and Least Popular Senators: McConnell loses spot as least popular senator" (January 22, 2019) "Mitch McConnell Got Everything He Wanted. But at What Cost?" Searching body... From December 22, 2018, until January 25, 2019, the federal government was shut down when Congress refused to give in to Trump's demand for $5.7 billion in federal funds for a U.S.–Mexico border wall.[142] In December 2018, the Republican-controlled Senate unanimously passed an appropriations bill without wall funding, and the bill appeared likely to be approved by the Republican-controlled House of Representatives and Trump. After Trump faced heavy criticism from some right-wing media outlets and pundits for appearing to back down on his campaign promise to "build the wall", he announced that he would not sign any appropriations bill that did not fund its construction.[143]
Above we consider the months with the maximum number of citations for a few selected pages. On Donald Trump's page this is January 2021, the month of the infamous attack on the US Capitol, an event spurred by the former US President. For Joe Biden, it is again August 2008, when he was announced as Obama's running mate. On Barack Obama's page it is January 2017, the end of his second term. For Mitch McConnell, the most cited month is January 2019. He is credited with prolonging the federal government shutdown.
A natural question to ask is whether the number of edits is correlated with the number of references to a particular date. In particular, we expect the number of Minor edits to be correlated with the number of Major edits. This is observed in our time series plots of the edits for each page. Below we plot a heat map of the Pearson correlation coefficient between the total number of citations of a particular date, the number of minor edits, the number of IP edits, and the number of major edits.
# Combine all the dataframes of data for each page
combined_df = pd.concat(dict_of_tables.values(),ignore_index=True)
import seaborn as sns
sns.heatmap(combined_df[['Major','Minor edits','IPs','Total Found']].corr(), annot=True)
plt.show()
Based on the correlation heat map above, IPs edits are not highly correlated with any other variable. Moreover, the plot confirms our natural guess that Minor edits are highly correlated with Major edits. We also observe that Major edits and Minor edits are correlated with the number of citations found. This correlation suggests, mildly, that edits are indicative of important events.
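For reference, the Pearson coefficient shown in the heat map can also be computed by hand. The sketch below is dependency-free and uses made-up monthly counts, not values from the scraped tables. The helper name `pearson_r` and the three toy lists are assumptions introduced only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up monthly counts, for illustration only
major_edits = [10, 20, 30, 40, 50]
minor_edits = [12, 18, 33, 41, 49]
ip_edits = [5, 1, 4, 2, 3]

r_major_minor = pearson_r(major_edits, minor_edits)
r_major_ip = pearson_r(major_edits, ip_edits)
print(f"Major vs Minor: {r_major_minor:.2f}")
print(f"Major vs IPs:   {r_major_ip:.2f}")
```

In this toy data the two edit series rise together and give a coefficient near 1, while the IP series does not track them, which mirrors the qualitative pattern described above.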
We now implement an unsupervised Anomaly Detection algorithm to detect important dates. In particular, here we make use of an algorithm known as Isolation Forests. More information on Isolation Forests can be found here.
For our model's inputs, we use only a page's Major edits and Minor edits. We do not use IPs edits, as our correlation analysis suggests they carry little predictive power. We measure the performance of our model by comparing the dates classified as important with the presence of citations for those dates. That is, a date is important if it has any citations on the page. (It should be noted that this is a weak measure of importance: it treats a date with one citation as being as important as a date with 50.)
from sklearn.ensemble import IsolationForest
import numpy as np
page = 'Donald_Trump'
df = dict_of_tables[page]
X = np.array(df[['Major','Minor edits']])
# Fit Isolation Forest Model
clf = IsolationForest(random_state=0).fit(X)
# Predict
anomaly_predicted = clf.predict(X)
predicted = df[anomaly_predicted == -1]
# Define important class as those with at least 1 citation.
# Important dates map to -1 and all others to 1, matching IsolationForest's labels.
y_important = [(-1)**int(x) for x in df['Total Found']!=0]
fig, ax = plt.subplots()
#important_date = dt.datetime(2021, 1, 6)
ax.plot(df['Month'], df['Major'],linestyle='-', linewidth=2.0)
ax.scatter(predicted[predicted['Total Found']==0]['Month'], predicted[predicted['Total Found']==0]['Major'],
s=100,marker='x', c ='red' )
ax.scatter(df['Month'], df['Major'],
s = df['Total Found']*10,marker='o', c ='orange',alpha=0.35 )
ax.scatter(predicted['Month'], predicted['Major'],
s = predicted['Total Found']*10,marker='^', c ='green' )
plt.legend(['Major Edits','Anomalies without citation','Dates with citation','Anomalies with citation'])
plt.xlabel('Date')
plt.title(' '.join(page.split('_'))+' page Edits Anomaly Detection')
plt.show()
In the plot above, green triangles mark the dates labeled as anomalies that have citations; the size of a triangle corresponds to how many citations that date has. That is, the larger the triangle, the more citations and the more important the date. The red crosses denote dates that were classified as anomalies but have no citations; we consider those to be false positives, that is, incorrectly labeled important dates. The orange circles mark all dates that have any citation, with size corresponding to the number of citations for that date.
We now plot the confusion matrix below.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_important, anomaly_predicted, labels =[1,-1])
disp = ConfusionMatrixDisplay(confusion_matrix=cm ,display_labels =[1,-1])
disp.plot()
plt.show()
recall = cm[1,1] / np.sum(cm, axis = 1)[1]
precision = cm[1,1] / np.sum(cm, axis = 0)[1]
F1_score = 2*(precision*recall)/(precision+recall)
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {F1_score:.2f}')
Precision: 0.91 Recall: 0.33 F1 Score: 0.48
We observe that for Donald Trump's page, the IsolationForest model has high precision but low recall, which leads to a moderate F1 score. This is because there are many important dates with only 1 citation, which the model fails to capture.
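To make the high-precision, low-recall pattern concrete, here is a minimal dependency-free sketch of how the three metrics follow from the label counts. It uses the same convention as the cells above (-1 for important or anomalous, 1 otherwise). The toy label vectors and the helper name `precision_recall_f1` are assumptions for illustration, not the notebook's data.

```python
def precision_recall_f1(y_true, y_pred, positive=-1):
    """Precision, recall and F1 with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels: six important dates, of which the model flags only two,
# plus one ordinary date flagged by mistake.
y_true = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1]
y_pred = [-1, -1, 1, 1, 1, 1, -1, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

Most flagged dates are truly important, so precision is high, but most important dates are never flagged, so recall is low and pulls the F1 score down.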
pred_average=predicted['Total Found'].mean()
print(f' Average number of citation for predicted dates {pred_average:.2f}')
Average number of citation for predicted dates 8.54
We see that the average number of citations for a date predicted to be important is fairly high. It is no surprise that we are misclassifying dates with 1 citation.
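One way to probe this is to recompute recall while raising the minimum number of citations a date needs to count as important. The sketch below does so on made-up data. The function `recall_at_threshold`, the threshold values, and both toy lists are assumptions for illustration and were not part of the analysis above.

```python
def recall_at_threshold(total_found, flagged, min_citations):
    """Share of dates with >= min_citations citations that were flagged."""
    important = [f for c, f in zip(total_found, flagged) if c >= min_citations]
    if not important:
        return float("nan")  # no date meets the threshold
    return sum(important) / len(important)

# Made-up citation counts per month and whether the model flagged the month
total_found = [0, 1, 1, 1, 2, 1, 8, 12, 0, 1]
flagged = [False, False, False, False, False,
           False, True, True, True, False]

recall_k1 = recall_at_threshold(total_found, flagged, 1)
recall_k2 = recall_at_threshold(total_found, flagged, 2)
recall_k5 = recall_at_threshold(total_found, flagged, 5)
for k, r in [(1, recall_k1), (2, recall_k2), (5, recall_k5)]:
    print(f"min citations {k}: recall {r:.2f}")
```

If real data behaved like this toy example, recall would rise as the threshold grows, which would support the reading that single-citation dates account for most of the misses.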
We now repeat this for all other pages.
sum_precision = 0
sum_recall = 0
sum_F1 = 0
sum_avg_predicted = 0
for page in page_names:
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' Page Model')
    print('-'*50)
    df = dict_of_tables[page]
    X = np.array(df[['Major','Minor edits']])
    # Fit Isolation Forest Model
    clf = IsolationForest(random_state=0).fit(X)
    # Predict
    anomaly_predicted = clf.predict(X)
    predicted = df[anomaly_predicted == -1]
    # Define important class
    y_important = [(-1)**int(x) for x in df['Total Found']!=0]
    # Plot predicted dates
    fig, ax = plt.subplots()
    ax.plot(df['Month'], df['Major'], linestyle='-', linewidth=2.0)
    ax.scatter(predicted[predicted['Total Found']==0]['Month'], predicted[predicted['Total Found']==0]['Major'],
               s=100, marker='x', c='red')
    ax.scatter(df['Month'], df['Major'],
               s=df['Total Found']*10, marker='o', c='orange', alpha=0.35)
    ax.scatter(predicted['Month'], predicted['Major'],
               s=predicted['Total Found']*10, marker='^', c='green')
    plt.legend(['Major Edits','Anomalies without citation','Dates with citation','Anomalies with citation'])
    plt.xlabel('Date')
    plt.title(' '.join(page.split('_'))+' page Edits Anomaly Detection')
    plt.show()
    cm = confusion_matrix(y_important, anomaly_predicted, labels=[1,-1])
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[1,-1])
    disp.plot()
    plt.show()
    recall = cm[1,1] / np.sum(cm, axis=1)[1]
    precision = cm[1,1] / np.sum(cm, axis=0)[1]
    F1_score = 2*(precision*recall)/(precision+recall)
    print(f'Precision: {precision:.2f}')
    print(f'Recall: {recall:.2f}')
    print(f'F1 Score: {F1_score:.2f}')
    sum_precision += precision
    sum_recall += recall
    sum_F1 += F1_score
    pred_average = predicted['Total Found'].mean()
    sum_avg_predicted += pred_average
    print(f'Average number of citation for predicted dates {pred_average:.2f}')
--------------------------------------------------
Joe Biden Page Model
--------------------------------------------------
Precision: 0.88
Recall: 0.21
F1 Score: 0.34
Average number of citation for predicted dates 5.85
--------------------------------------------------
Donald Trump Page Model
--------------------------------------------------
Precision: 0.91
Recall: 0.33
F1 Score: 0.48
Average number of citation for predicted dates 8.54
--------------------------------------------------
Hillary Clinton Page Model
--------------------------------------------------
Precision: 0.64
Recall: 0.28
F1 Score: 0.39
Average number of citation for predicted dates 2.05
--------------------------------------------------
Barack Obama Page Model
--------------------------------------------------
Precision: 0.89
Recall: 0.23
F1 Score: 0.37
Average number of citation for predicted dates 4.74
--------------------------------------------------
Chuck Schumer Page Model
--------------------------------------------------
Precision: 0.52
Recall: 0.18
F1 Score: 0.27
Average number of citation for predicted dates 1.67
--------------------------------------------------
Mitch McConnell Page Model
--------------------------------------------------
Precision: 0.69
Recall: 0.30
F1 Score: 0.42
Average number of citation for predicted dates 2.25
avg_precision = sum_precision/len(page_names)
avg_recall = sum_recall/len(page_names)
avg_F1 = sum_F1/len(page_names)
avg_avg_predicted = sum_avg_predicted/len(page_names)
print(f'Average Precision: {avg_precision:.2f}')
print(f'Average Recall: {avg_recall:.2f}')
print(f'Average F1 Score: {avg_F1:.2f}')
print(f'Average number of citation for predicted dates {avg_avg_predicted:.2f}')
Average Precision: 0.76 Average Recall: 0.26 Average F1 Score: 0.38 Average number of citation for predicted dates 4.18
We see across all pages that the IsolationForest model performs with high precision and low recall. Again, the low recall is likely due to how we define an important event.
Wikipedia page edit trends for political figures do appear to indicate important events. Visually examining each time series, one can observe that during periods where many edits occur, many citations appear for those dates. This suggests that edits are occurring to reflect recent news. Moreover, the two are indeed statistically correlated: the number of page edits is positively correlated with the number of citations for that particular date.
Our Isolation Forest model also performs moderately for each page. Each model performs with high precision and low recall. This is because our notion of an important event is inappropriate. To build a better model, one needs a better measure of importance. For possible future work, one could consider a normalized proportion of edits, or some function of the number of reference citations and body citations. Even more advanced, one could take the raw text associated with those dates, as in the print_date_info() function, and build a sentiment analysis model to measure how important each date is. In either case, there are many directions this can go.
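As a rough illustration of the first two directions, the sketch below computes each month's share of a page's total edits and a weighted citation score. Everything here is hypothetical: the helper names `edit_share` and `citation_score`, the `body_weight` of 2.0, and the toy counts are assumptions, not results or choices made in this project.

```python
def edit_share(edits):
    """Each month's edits as a proportion of the page's total edits."""
    total = sum(edits)
    if total == 0:
        return [0.0 for _ in edits]
    return [e / total for e in edits]

def citation_score(found_references, found_body, body_weight=2.0):
    """Weighted citation count; body mentions count `body_weight` times.

    The weight is an arbitrary illustrative choice, not a tuned value.
    """
    return [r + body_weight * b for r, b in zip(found_references, found_body)]

# Made-up monthly values for a single page
edits = [100, 300, 600]
found_references = [1, 4, 11]
found_body = [0, 2, 1]

shares = edit_share(edits)
scores = citation_score(found_references, found_body)
for share, score in zip(shares, scores):
    print(f"edit share {share:.2f}, citation score {score:.1f}")
```

A graded score like this could replace the binary label, so that a date with many citations weighs more than a date with one. Normalizing edits per page would also make busy and quiet pages more comparable.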