Political Wikipedia Edit Trends: Indicators for Important Events¶

Data Science: CMPS 6160, Fall 2022¶

By Oliver Orejola

The website hosting this analysis can be found here.

Introduction¶

Wikipedia is an openly editable online encyclopedia; the English Wikipedia hosts more than 6.5 million articles. Because anyone can contribute, users are free to edit and update articles on topics ranging from science and pop culture to politics. Editors have many motivations, but when a notable event happens one expects them to flock to the relevant articles and edit them to reflect the news. In this project we study the relationship between Wikipedia edits and important events. Here and throughout, we restrict our attention to a handful of political figures in the United States.

The goal of this project is to investigate time series trends in Wikipedia page edits for key US political figures. In particular, we ask whether Wikipedia edits are indicators of important events. We examine the correlation between the types of edits made (major, minor, and IP) and their relationship with important events. Moreover, we test an unsupervised model for classifying important events.

Since an important event is a rather subjective notion, we measure the importance of a date with respect to a particular person by the number of times that date appears on their Wikipedia page. Although this number could be used to rank importance, throughout we treat importance as a binary class: a date is considered important with respect to an individual if it appears anywhere on their Wikipedia page.
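
As a minimal sketch of this binary definition, the snippet below turns citation counts into importance labels. The dictionary and the helper name is_important are illustrative assumptions; the counts mirror a few Joe Biden rows from the table loaded later in this notebook.

```python
# Monthly citation counts ("Total Found"), keyed by month.
# The values mirror a few Joe Biden rows shown later in this notebook.
total_found = {
    "2003-01": 0,
    "2003-02": 0,
    "2022-08": 16,
    "2022-09": 5,
    "2022-10": 4,
}

def is_important(count):
    """Hypothetical helper: a month is important if it is cited at least once."""
    return count > 0

# Binary importance label for every month
labels = {month: is_important(count) for month, count in total_found.items()}

for month, label in labels.items():
    print(month, total_found[month], label)
```

Note that a month cited 16 times and a month cited 4 times receive the same label, which is the ranking information the binary class discards.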

We use a classic unsupervised learning technique, Isolation Forests, to classify dates as important. We demonstrate that these models perform with high precision but low recall, since our definition of importance is ill-posed and overcounts important dates.
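
To make the precision and recall trade-off concrete, here is a small sketch with made-up labels (not results from this project). It assumes a detector that flags only a few months, all of which are truly cited, while the binary definition marks many more months as important.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = important / flagged)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)        # correctly flagged
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)    # flagged but not important
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)    # important but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels for ten months
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # six months count as important
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # the detector flags only two of them

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Every flagged month is important, so precision is perfect, but most important months go unflagged, so recall is low.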

This Jupyter notebook is structured as follows:

  • Datasets used (web scraped)
  • Extract, Transform, and Load
  • Exploratory Analysis
    • Trends
    • Counts
    • Correlation
  • Anomaly Detection
  • Conclusion

Project Datasets¶

The project uses two datasets, web scraped from two separate sources.

Wikipedia Page Edits

This dataset is scraped from https://xtools.wmflabs.org for a handful of Wikipedia pages.

  • Joe Biden - https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Joe_Biden
  • Donald Trump - https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Donald_Trump
  • Hillary Clinton - https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Hillary_Clinton
  • Barack Obama - https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Barack_Obama
  • Chuck Schumer - https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Chuck_Schumer
  • Mitch McConnell - https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Mitch_McConnell

For each page the scraped data consists of the following information:

  • Month -- YYYY-MM
  • Edits -- Total number of edits
  • IPs -- Edits made by unregistered users, who are identified by their IP address
  • IPs % -- Percentage of edits that are IP edits
  • Minor edits -- Edits marked as minor, i.e. only superficial differences exist between the current and previous version (typographical corrections, formatting, etc.)
  • Minor edits % -- Percentage of edits that are minor edits

Dates Found on Wikipedia Page

This dataset is scraped from https://en.wikipedia.org for the following individuals:

  • Joe Biden - https://en.wikipedia.org/wiki/Joe_Biden
  • Donald Trump - https://en.wikipedia.org/wiki/Donald_Trump
  • Hillary Clinton - https://en.wikipedia.org/wiki/Hillary_Clinton
  • Barack Obama - https://en.wikipedia.org/wiki/Barack_Obama
  • Chuck Schumer - https://en.wikipedia.org/wiki/Chuck_Schumer
  • Mitch McConnell - https://en.wikipedia.org/wiki/Mitch_McConnell

For each page, the scraped data consists of the following information, restricted to the months present in the Wikipedia Page Edits dataset:

  • Month -- YYYY-MM
  • Found References -- Number of references that cite the month and year
  • Found Body -- Number of paragraphs in the body of the page that cite the month and year
  • Total Found -- Total number of citations of the month and year anywhere on the Wikipedia page (Found References plus Found Body)

Extract, Transform, and Load¶

Extract¶

First, we scrape the HTML from https://xtools.wmflabs.org and https://en.wikipedia.org/wiki/ for each of the Wikipedia pages of interest. We then save the extracted data as a .csv file named in the following format.

PAGENAME_monthly_edits.csv

In [1]:
# Import relevant libraries For Extraction
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

# Define a User-Agent header for requests.get to avoid a 403 error code
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# List of wikipedia page names 
page_names = ['Joe_Biden','Donald_Trump','Hillary_Clinton','Barack_Obama', 'Chuck_Schumer','Mitch_McConnell']
In [2]:
# Create a dictionary that maps a numerical month to its name, used when searching a web page for a date
month_names = ['January', 'February', 'March','April',
               'May', 'June', 'July', 'August', 
               'September', 'October','November', 'December']
month_number = list(range(1,13))
month_dict = dict(zip(month_number,month_names))

# Auxiliary function to extract the year and month name from a date formatted as a str ('YYYY-MM...')
def get_date(date):
    year = date[:4]
    month = month_dict[int(date[5:7])]
    return year, month
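
A short usage sketch of get_date, reproduced here so the snippet runs on its own. Because the function only reads the first seven characters, it accepts both the raw 'YYYY-MM' strings from the scraped table and the longer string form of a timestamp such as '2008-08-01 00:00:00', which appears later in this notebook.

```python
month_names = ['January', 'February', 'March', 'April',
               'May', 'June', 'July', 'August',
               'September', 'October', 'November', 'December']
month_dict = dict(zip(range(1, 13), month_names))

def get_date(date):
    year = date[:4]                       # characters 0-3: the year
    month = month_dict[int(date[5:7])]    # characters 5-6: the month number
    return year, month

short_form = get_date('2008-08')
long_form = get_date('2008-08-01 00:00:00')
print(short_form)
print(long_form)
```
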
In [3]:
##### ONLY RUN IF YOU WANT TO SCRAPE WEB DATA #####
for page in page_names:
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' information scrape.')
    print('-'*50)
    url = ' https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/'+page
    # Scrape web
    r = requests.get(url,headers=headers)
    print(f'Status code: {r.status_code} URL: {url}')    
    soup = BeautifulSoup( r.content )
    
    # Find Tables 
    df_tables = []
    for t in soup.findAll("table"):
        df_t = pd.read_html(str(t))
        df_tables.append(df_t[0])
    
    # Save Monthly Edits Table
    monthly_edits = df_tables[8]
    
    # Scrape the Wikipedia page's references and body, and count the number of appearances of each month
    url_page = 'https://en.wikipedia.org/wiki/'+page
    # Scrape web
    r_page = requests.get(url_page,headers=headers)
    print(f'Status code: {r_page.status_code} URL: {url_page}')    
    soup_page = BeautifulSoup( r_page.content )

    monthly_edits['Found References']=0
    monthly_edits['Found Body']=0
    

    for i, date in enumerate(monthly_edits['Month']):
        count_ref = 0
        count_body = 0
        year, month = get_date(str(date))
        
        #Search through references
        regex_ref = r"\(\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year) +r"\)"
        c_ref=re.compile(regex_ref,re.I)  # with the ignorecase option
        for references in soup_page.findAll('ol'):
            ref = references.getText()
            split_ref = ref.split('\n')
            for item in split_ref:
                if len(c_ref.findall(item)):
                    count_ref +=1
        monthly_edits.at[i,"Found References"] += count_ref
        #Search through page body via paragraphs
        regex_body = r"\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year)
        c_body=re.compile(regex_body,re.I) 
        for paragraph in soup_page.findAll('p'):
            par = paragraph.getText()
            if len(c_body.findall(par)):
                count_body +=1
        monthly_edits.at[i,"Found Body"] += count_body
    
    print(f'Page searched for dates.') 
    monthly_edits['Total Found']=monthly_edits['Found Body'] + monthly_edits['Found References']
    
    #Save File
    file_name = page+'_monthly_edits.csv'
    monthly_edits.to_csv(file_name, index=False) 
--------------------------------------------------
Joe Biden information scrape.
--------------------------------------------------
Status code: 200 URL:  https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Joe_Biden
Status code: 200 URL: https://en.wikipedia.org/wiki/Joe_Biden
Page searched for dates.

--------------------------------------------------
Donald Trump information scrape.
--------------------------------------------------
Status code: 200 URL:  https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Donald_Trump
Status code: 200 URL: https://en.wikipedia.org/wiki/Donald_Trump
Page searched for dates.

--------------------------------------------------
Hillary Clinton information scrape.
--------------------------------------------------
Status code: 200 URL:  https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Hillary_Clinton
Status code: 200 URL: https://en.wikipedia.org/wiki/Hillary_Clinton
Page searched for dates.

--------------------------------------------------
Barack Obama information scrape.
--------------------------------------------------
Status code: 200 URL:  https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Barack_Obama
Status code: 200 URL: https://en.wikipedia.org/wiki/Barack_Obama
Page searched for dates.

--------------------------------------------------
Chuck Schumer information scrape.
--------------------------------------------------
Status code: 200 URL:  https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Chuck_Schumer
Status code: 200 URL: https://en.wikipedia.org/wiki/Chuck_Schumer
Page searched for dates.

--------------------------------------------------
Mitch McConnell information scrape.
--------------------------------------------------
Status code: 200 URL:  https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Mitch_McConnell
Status code: 200 URL: https://en.wikipedia.org/wiki/Mitch_McConnell
Page searched for dates.
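
The two date patterns used in the scraping loop can be checked in isolation. The sketch below rebuilds them with a hypothetical helper, build_patterns, and runs them on sample strings; only the first sample is taken from a reference shown in this notebook, and the rest are made up for illustration. Note that the unescaped `.` matches any single character, so the comma is not strictly required, and a month written without a day is not matched.

```python
import re

def build_patterns(month, year):
    """Hypothetical helper: rebuilds the reference and body patterns from the scraping loop."""
    core = r"\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year)
    ref_pattern = re.compile(r"\(" + core + r"\)", re.I)   # date wrapped in parentheses
    body_pattern = re.compile(core, re.I)                   # date anywhere in the text
    return ref_pattern, body_pattern

ref_pattern, body_pattern = build_patterns("August", "2008")

samples = [
    '(August 23, 2008) "Obama Chooses Biden as Running Mate"',
    "The announcement came on August 23, 2008.",
    "the announcement came on august 23, 2008.",
    "The campaign began in August 2008.",
    "Written without a comma: August 3 2008.",
]

results = []
for text in samples:
    in_ref = bool(ref_pattern.search(text))
    in_body = bool(body_pattern.search(text))
    results.append((in_ref, in_body))
    print(f"ref={in_ref!s:5} body={in_body!s:5} {text}")
```

Only the parenthesized date matches the reference pattern, while the body pattern matches any 'Month day, year' occurrence regardless of case.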

Load and Transform¶

We load the saved PAGENAME_monthly_edits.csv file for each page, inspect its contents, and check the data types. Since there is one .csv file per page, we store the loaded tables in a dictionary keyed by page name. This makes it easy to apply the same transformations to all datasets.

In [4]:
# Init blank dict
dict_of_tables = dict()

for page in page_names:
    file_name = page+'_monthly_edits.csv'
    monthly_edits = pd.read_csv(file_name)
    dict_of_tables[page]=monthly_edits
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' page data')
    print('-'*50)
    display(dict_of_tables[page])
    
    # Check data types
    print(dict_of_tables[page].dtypes,'\n')
--------------------------------------------------
Joe Biden page data
--------------------------------------------------
       Month  Edits  IPs  IPs %  Minor edits  Minor edits %  Edits · Minor edits · IPs  Found References  Found Body  Total Found
  0  2002-11      1    0     0%            1           100%                        NaN                 0           0            0
  1  2002-12      0    0     0%            0             0%                        NaN                 0           0            0
  2  2003-01      0    0     0%            0             0%                        NaN                 0           0            0
  3  2003-02      0    0     0%            0             0%                        NaN                 0           0            0
  4  2003-03      0    0     0%            0             0%                        NaN                 0           0            0
...      ...    ...  ...    ...          ...            ...                        ...               ...         ...          ...
237  2022-08    114    0     0%           21          18.4%                        NaN                13           3           16
238  2022-09    106    0     0%            5           4.7%                        NaN                 3           2            5
239  2022-10     37    0     0%           12          32.4%                        NaN                 3           1            4
240  2022-11     51    0     0%            6          11.8%                        NaN                 6           0            6
241  2022-12     62    0     0%           17          27.4%                        NaN                 2           1            3

242 rows × 10 columns

Month                             object
Edits                              int64
IPs                                int64
IPs %                             object
Minor edits                        int64
Minor edits %                     object
Edits  ·  Minor edits  ·  IPs    float64
Found References                   int64
Found Body                         int64
Total Found                        int64
dtype: object 


--------------------------------------------------
Donald Trump page data
--------------------------------------------------
       Month  Edits  IPs  IPs %  Minor edits  Minor edits %  Edits · Minor edits · IPs  Found References  Found Body  Total Found
  0  2004-01      6    0     0%            4          66.7%                        NaN                 0           0            0
  1  2004-02      2    1    50%            1            50%                        NaN                 0           0            0
  2  2004-03      1    0     0%            0             0%                        NaN                 0           0            0
  3  2004-04     22   11    50%            4          18.2%                        NaN                 1           0            1
  4  2004-05     10    4    40%            1            10%                        NaN                 0           0            0
...      ...    ...  ...    ...          ...            ...                        ...               ...         ...          ...
223  2022-08    215    0     0%           22          10.2%                        NaN                11           1           12
224  2022-09    131    0     0%            9           6.9%                        NaN                 2           0            2
225  2022-10    190    0     0%            7           3.7%                        NaN                 0           0            0
226  2022-11    224    0     0%           37          16.5%                        NaN                 7           3           10
227  2022-12     81    0     0%           13            16%                        NaN                 0           0            0

228 rows × 10 columns

Month                             object
Edits                              int64
IPs                                int64
IPs %                             object
Minor edits                        int64
Minor edits %                     object
Edits  ·  Minor edits  ·  IPs    float64
Found References                   int64
Found Body                         int64
Total Found                        int64
dtype: object 


--------------------------------------------------
Hillary Clinton page data
--------------------------------------------------
       Month  Edits  IPs  IPs %  Minor edits  Minor edits %  Edits · Minor edits · IPs  Found References  Found Body  Total Found
  0  2001-03      2    0     0%            0             0%                        NaN                 0           0            0
  1  2001-04      0    0     0%            0             0%                        NaN                 0           0            0
  2  2001-05      0    0     0%            0             0%                        NaN                 0           0            0
  3  2001-06      2    0     0%            0             0%                        NaN                 0           0            0
  4  2001-07      4    0     0%            0             0%                        NaN                 0           0            0
...      ...    ...  ...    ...          ...            ...                        ...               ...         ...          ...
257  2022-08      4    0     0%            1            25%                        NaN                 0           0            0
258  2022-09      8    0     0%            3          37.5%                        NaN                 1           0            1
259  2022-10      8    0     0%            3          37.5%                        NaN                 0           0            0
260  2022-11     30    0     0%            9            30%                        NaN                 0           0            0
261  2022-12      1    0     0%            1           100%                        NaN                 0           0            0

262 rows × 10 columns

Month                             object
Edits                              int64
IPs                                int64
IPs %                             object
Minor edits                        int64
Minor edits %                     object
Edits  ·  Minor edits  ·  IPs    float64
Found References                   int64
Found Body                         int64
Total Found                        int64
dtype: object 


--------------------------------------------------
Barack Obama page data
--------------------------------------------------
       Month  Edits  IPs  IPs %  Minor edits  Minor edits %  Edits · Minor edits · IPs  Found References  Found Body  Total Found
  0  2004-03      4    1    25%            0             0%                        NaN                 5           0            5
  1  2004-04      2    2   100%            0             0%                        NaN                 0           0            0
  2  2004-05      0    0     0%            0             0%                        NaN                 2           0            2
  3  2004-06      6    1  16.7%            0             0%                        NaN                 1           0            1
  4  2004-07     82   18    22%           36          43.9%                        NaN                 6           0            6
...      ...    ...  ...    ...          ...            ...                        ...               ...         ...          ...
221  2022-08     39    0     0%           17          43.6%                        NaN                 0           0            0
222  2022-09     46    0     0%           11          23.9%                        NaN                 1           0            1
223  2022-10     37    0     0%            4          10.8%                        NaN                 0           0            0
224  2022-11     51    0     0%           13          25.5%                        NaN                 0           0            0
225  2022-12     18    0     0%           13          72.2%                        NaN                 0           0            0

226 rows × 10 columns

Month                             object
Edits                              int64
IPs                                int64
IPs %                             object
Minor edits                        int64
Minor edits %                     object
Edits  ·  Minor edits  ·  IPs    float64
Found References                   int64
Found Body                         int64
Total Found                        int64
dtype: object 


--------------------------------------------------
Chuck Schumer page data
--------------------------------------------------
       Month  Edits  IPs  IPs %  Minor edits  Minor edits %  Edits · Minor edits · IPs  Found References  Found Body  Total Found
  0  2003-09      2    1    50%            0             0%                        NaN                 0           0            0
  1  2003-10      2    0     0%            1            50%                        NaN                 0           0            0
  2  2003-11      0    0     0%            0             0%                        NaN                 0           0            0
  3  2003-12      0    0     0%            0             0%                        NaN                 0           0            0
  4  2004-01      0    0     0%            0             0%                        NaN                 0           0            0
...      ...    ...  ...    ...          ...            ...                        ...               ...         ...          ...
227  2022-08      4    0     0%            0             0%                        NaN                 0           0            0
228  2022-09      1    0     0%            0             0%                        NaN                 0           0            0
229  2022-10      2    0     0%            1            50%                        NaN                 0           0            0
230  2022-11     13    0     0%            2          15.4%                        NaN                 0           0            0
231  2022-12      2    0     0%            0             0%                        NaN                 0           0            0

232 rows × 10 columns

Month                             object
Edits                              int64
IPs                                int64
IPs %                             object
Minor edits                        int64
Minor edits %                     object
Edits  ·  Minor edits  ·  IPs    float64
Found References                   int64
Found Body                         int64
Total Found                        int64
dtype: object 


--------------------------------------------------
Mitch McConnell page data
--------------------------------------------------
       Month  Edits  IPs  IPs %  Minor edits  Minor edits %  Edits · Minor edits · IPs  Found References  Found Body  Total Found
  0  2003-10      1    0     0%            0             0%                        NaN                 0           0            0
  1  2003-11      3    0     0%            3           100%                        NaN                 0           0            0
  2  2003-12      3    0     0%            2          66.7%                        NaN                 0           0            0
  3  2004-01      0    0     0%            0             0%                        NaN                 0           0            0
  4  2004-02      0    0     0%            0             0%                        NaN                 0           0            0
...      ...    ...  ...    ...          ...            ...                        ...               ...         ...          ...
226  2022-08      6    0     0%            1          16.7%                        NaN                 0           0            0
227  2022-09      1    0     0%            0             0%                        NaN                 0           0            0
228  2022-10      8    0     0%            1          12.5%                        NaN                 0           0            0
229  2022-11     20    0     0%            7            35%                        NaN                 0           0            0
230  2022-12      3    0     0%            1          33.3%                        NaN                 0           0            0

231 rows × 10 columns

Month                             object
Edits                              int64
IPs                                int64
IPs %                             object
Minor edits                        int64
Minor edits %                     object
Edits  ·  Minor edits  ·  IPs    float64
Found References                   int64
Found Body                         int64
Total Found                        int64
dtype: object 

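The columns IPs % and Minor edits % duplicate information already carried by the counts, and Edits · Minor edits · IPs contains only NaN in the rows shown, so we drop all three. We also add a Major column, computed as Edits minus IPs minus Minor edits, and convert Month to a datetime type.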

In [5]:
relevant_columns = ['Month', 'Edits','IPs', 'Minor edits','Found References','Found Body','Total Found' ]
for page in page_names:
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' page data')
    print('-'*50)
    
    monthly_edits = dict_of_tables[page]
    
    # Keep only the relevant columns; copy so later assignments do not act on a view
    monthly_edits = monthly_edits.loc[:, relevant_columns].copy()
    
    # Add Major edits column: edits that are neither IP edits nor minor edits
    monthly_edits['Major'] = monthly_edits['Edits'] - monthly_edits['IPs'] - monthly_edits['Minor edits']
    
    # Rearrange columns
    monthly_edits = monthly_edits[['Month', 'Edits', 'Major', 'IPs', 'Minor edits', 'Total Found', 'Found References', 'Found Body']].copy()
    # Convert Month to datetime
    monthly_edits['Month'] = pd.to_datetime(monthly_edits['Month'], format='%Y-%m')
    dict_of_tables[page]=monthly_edits
    display(dict_of_tables[page].head())
    print(dict_of_tables[page].dtypes)
--------------------------------------------------
Joe Biden page data
--------------------------------------------------
        Month  Edits  Major  IPs  Minor edits  Total Found  Found References  Found Body
0  2002-11-01      1      0    0            1            0                 0           0
1  2002-12-01      0      0    0            0            0                 0           0
2  2003-01-01      0      0    0            0            0                 0           0
3  2003-02-01      0      0    0            0            0                 0           0
4  2003-03-01      0      0    0            0            0                 0           0
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Donald Trump page data
--------------------------------------------------
        Month  Edits  Major  IPs  Minor edits  Total Found  Found References  Found Body
0  2004-01-01      6      2    0            4            0                 0           0
1  2004-02-01      2      0    1            1            0                 0           0
2  2004-03-01      1      1    0            0            0                 0           0
3  2004-04-01     22      7   11            4            1                 1           0
4  2004-05-01     10      5    4            1            0                 0           0
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Hillary Clinton page data
--------------------------------------------------
        Month  Edits  Major  IPs  Minor edits  Total Found  Found References  Found Body
0  2001-03-01      2      2    0            0            0                 0           0
1  2001-04-01      0      0    0            0            0                 0           0
2  2001-05-01      0      0    0            0            0                 0           0
3  2001-06-01      2      2    0            0            0                 0           0
4  2001-07-01      4      4    0            0            0                 0           0
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Barack Obama page data
--------------------------------------------------
        Month  Edits  Major  IPs  Minor edits  Total Found  Found References  Found Body
0  2004-03-01      4      3    1            0            5                 5           0
1  2004-04-01      2      0    2            0            0                 0           0
2  2004-05-01      0      0    0            0            2                 2           0
3  2004-06-01      6      5    1            0            1                 1           0
4  2004-07-01     82     28   18           36            6                 6           0
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Chuck Schumer page data
--------------------------------------------------
        Month  Edits  Major  IPs  Minor edits  Total Found  Found References  Found Body
0  2003-09-01      2      1    1            0            0                 0           0
1  2003-10-01      2      1    0            1            0                 0           0
2  2003-11-01      0      0    0            0            0                 0           0
3  2003-12-01      0      0    0            0            0                 0           0
4  2004-01-01      0      0    0            0            0                 0           0
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object

--------------------------------------------------
Mitch McConnell page data
--------------------------------------------------
        Month  Edits  Major  IPs  Minor edits  Total Found  Found References  Found Body
0  2003-10-01      1      1    0            0            0                 0           0
1  2003-11-01      3      0    0            3            0                 0           0
2  2003-12-01      3      1    0            2            0                 0           0
3  2004-01-01      0      0    0            0            0                 0           0
4  2004-02-01      0      0    0            0            0                 0           0
Month               datetime64[ns]
Edits                        int64
Major                        int64
IPs                          int64
Minor edits                  int64
Total Found                  int64
Found References             int64
Found Body                   int64
dtype: object
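
The Month values above read 2002-11-01 rather than 2002-11 because parsing a year-month string fills in the first day of the month. The sketch below illustrates the same convention with the standard library's datetime module, which is a stand-in here and not the pandas call used in the cell above.

```python
from datetime import datetime

# Parse a 'YYYY-MM' string; the missing day defaults to the 1st
parsed = datetime.strptime("2002-11", "%Y-%m")
print(parsed)

# Format back to the original monthly label
label = parsed.strftime("%Y-%m")
print(label)
```

Each row therefore still represents a whole month; the day component carries no information.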

We now consider summary statistics for each page.

In [6]:
for page in page_names:
    
    monthly_edits = dict_of_tables[page][['Edits','IPs', 'Minor edits','Total Found', 'Found Body', 'Found References']]
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' page data summary')
    print('-'*50)
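    # No datetime columns are selected, so the default describe() suffices
    # (the datetime_is_numeric argument was removed in pandas 2.0)
    print(monthly_edits.describe())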
--------------------------------------------------
Joe Biden page data summary
--------------------------------------------------
            Edits         IPs  Minor edits  Total Found  Found Body  \
count  242.000000  242.000000   242.000000   242.000000  242.000000   
mean    45.231405    3.140496     9.822314     1.599174    0.107438   
std     97.623611   11.855846    24.010834     2.915205    0.392918   
min      0.000000    0.000000     0.000000     0.000000    0.000000   
25%      7.000000    0.000000     1.000000     0.000000    0.000000   
50%     17.500000    0.000000     4.000000     0.000000    0.000000   
75%     39.750000    0.000000    10.000000     2.000000    0.000000   
max    913.000000  136.000000   286.000000    20.000000    3.000000   

       Found References  
count        242.000000  
mean           1.491736  
std            2.695073  
min            0.000000  
25%            0.000000  
50%            0.000000  
75%            2.000000  
max           19.000000  

--------------------------------------------------
Donald Trump page data summary
--------------------------------------------------
             Edits         IPs  Minor edits  Total Found  Found Body  \
count   228.000000  228.000000   228.000000   228.000000  228.000000   
mean    173.925439   11.592105    35.337719     3.236842    0.131579   
std     232.323428   26.092437    51.270939     5.724671    0.593847   
min       0.000000    0.000000     0.000000     0.000000    0.000000   
25%      23.000000    0.000000     6.000000     0.000000    0.000000   
50%      84.500000    0.000000    16.500000     0.000000    0.000000   
75%     227.000000   10.250000    43.000000     5.000000    0.000000   
max    1475.000000  131.000000   373.000000    43.000000    6.000000   

       Found References  
count        228.000000  
mean           3.105263  
std            5.380448  
min            0.000000  
25%            0.000000  
50%            0.000000  
75%            5.000000  
max           37.000000  

--------------------------------------------------
Hillary Clinton page data summary
--------------------------------------------------
            Edits         IPs  Minor edits  Total Found  Found Body  \
count  262.000000  262.000000   262.000000   262.000000  262.000000   
mean    65.408397    7.480916    15.816794     0.931298    0.087786   
std    103.312072   24.911291    31.531484     1.766833    0.355477   
min      0.000000    0.000000     0.000000     0.000000    0.000000   
25%     11.000000    0.000000     2.000000     0.000000    0.000000   
50%     24.000000    0.000000     6.000000     0.000000    0.000000   
75%     63.250000    0.000000    17.000000     1.000000    0.000000   
max    910.000000  158.000000   394.000000    13.000000    4.000000   

       Found References  
count        262.000000  
mean           0.843511  
std            1.551892  
min            0.000000  
25%            0.000000  
50%            0.000000  
75%            1.000000  
max            9.000000  

--------------------------------------------------
Barack Obama page data summary
--------------------------------------------------
             Edits         IPs  Minor edits  Total Found  Found Body  \
count   226.000000  226.000000   226.000000   226.000000  226.000000   
mean    128.115044    7.393805    33.703540     2.017699    0.300885   
std     192.301904   25.247218    50.720723     2.716149    0.651630   
min       0.000000    0.000000     0.000000     0.000000    0.000000   
25%      38.250000    0.000000    10.000000     0.000000    0.000000   
50%      57.500000    0.000000    15.000000     1.000000    0.000000   
75%     113.750000    0.000000    29.000000     2.000000    0.000000   
max    1434.000000  186.000000   451.000000    15.000000    3.000000   

       Found References  
count        226.000000  
mean           1.716814  
std            2.339873  
min            0.000000  
25%            0.000000  
50%            1.000000  
75%            2.000000  
max           12.000000  

--------------------------------------------------
Chuck Schumer page data summary
--------------------------------------------------
            Edits         IPs  Minor edits  Total Found  Found Body  \
count  232.000000  232.000000   232.000000   232.000000  232.000000   
mean    15.159483    4.810345     3.560345     0.745690    0.073276   
std     15.643360    7.098459     4.235060     1.408071    0.333902   
min      0.000000    0.000000     0.000000     0.000000    0.000000   
25%      5.000000    0.000000     1.000000     0.000000    0.000000   
50%     10.000000    2.000000     2.500000     0.000000    0.000000   
75%     18.250000    7.000000     4.000000     1.000000    0.000000   
max    108.000000   40.000000    30.000000     9.000000    3.000000   

       Found References  
count        232.000000  
mean           0.672414  
std            1.281070  
min            0.000000  
25%            0.000000  
50%            0.000000  
75%            1.000000  
max            8.000000  

--------------------------------------------------
Mitch McConnell page data summary
--------------------------------------------------
            Edits         IPs  Minor edits  Total Found  Found Body  \
count  231.000000  231.000000   231.000000   231.000000  231.000000   
mean    21.142857    4.865801     4.930736     0.696970    0.086580   
std     23.736258    7.237835     5.964712     1.493155    0.311158   
min      0.000000    0.000000     0.000000     0.000000    0.000000   
25%      6.000000    0.000000     1.000000     0.000000    0.000000   
50%     14.000000    2.000000     3.000000     0.000000    0.000000   
75%     27.000000    7.000000     6.000000     1.000000    0.000000   
max    139.000000   47.000000    47.000000    13.000000    2.000000   

       Found References  
count        231.000000  
mean           0.610390  
std            1.352831  
min            0.000000  
25%            0.000000  
50%            0.000000  
75%            1.000000  
max           12.000000  

Exploratory Data Analysis: Edit Trends¶

We now plot the edit trends. Below we plot the time series of page edits: for each page we plot major, minor, and IP edits. In addition, we mark the months that are cited on the Wikipedia page, with marker size proportional to the number of citations, so a larger marker means the month is cited more often. Since we are not yet interested in distinguishing between citations in the references and in the body, the marker uses the combined count.

In [7]:
import matplotlib.pyplot as plt 
import datetime as dt

plt.rcParams['figure.figsize'] = [12, 7]
for page in page_names:
    monthly_edits = dict_of_tables[page]
    
    fig, ax = plt.subplots()
    ax.plot(monthly_edits['Month'], monthly_edits['Major'], linewidth=2.0)
    ax.plot(monthly_edits['Month'], monthly_edits['Minor edits'], linewidth=2.0)
    ax.plot(monthly_edits['Month'], monthly_edits['IPs'], linewidth=2.0)
    ax.scatter(monthly_edits['Month'], monthly_edits['Major'],
           s=monthly_edits['Total Found']*10,marker='o', c ='purple'  )
    
    # Single legend call; a second ax.legend call would replace this one and drop 'Dates Found'
    ax.legend(['Major Edits','Minor Edits','IPs Edits','Dates Found'])
    plt.xlabel('Date')
    plt.title(' '.join(page.split('_'))+' page Edits')
    plt.show()

We see some repeating spikes, which may suggest periodicity. The natural guess is that these spikes occur near important political dates. Interestingly, where edit activity is high there are also many cited dates, which suggests, at least visually, that the number of edits in a month is related to how often that month is cited on the page.
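
One way to move from a visual impression to a number is a Pearson correlation between monthly edits and citation counts. The sketch below is illustrative only: it uses the last five Joe Biden rows from the loaded table (2022-08 to 2022-12), far too few points for any conclusion, and the helper pearson is an assumption of this sketch rather than the method used later in the notebook.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Last five Joe Biden rows: Edits and Total Found for 2022-08 to 2022-12
edits = [114, 106, 37, 51, 62]
total_found = [16, 5, 4, 6, 3]

r = pearson(edits, total_found)
print(round(r, 3))
```

A value near 1 would indicate that busier months tend to be cited more often; applying the same function to the full Edits and Total Found columns would give a less fragile estimate.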

Exploratory Data Analysis: Important Dates¶

We now consider which events occurred on a given date for a specific page. The following function takes a page name together with a month and year of interest, then prints every citation of that month found in both the references and the body of the Wikipedia page.

In [8]:
def print_date_info(date,page):
    
    url_page = 'https://en.wikipedia.org/wiki/'+page
    # Scrape web
    r_page = requests.get(url_page,headers=headers)
    #print(f'Status code: {r_page.status_code} URL: {url_page}')    
    soup_page = BeautifulSoup( r_page.content )

    year, month = get_date(str(date))
        
    #Search through references
    regex_ref = r"\(\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year) +r"\)"
    c_ref=re.compile(regex_ref,re.I)  # with the ignorecase option
    found_ref = False
    print('Searching References...\n')
    for references in soup_page.findAll('ol'):
        ref = references.getText()
        split_ref = ref.split('\n')
        for item in split_ref:
            if len(c_ref.findall(item)):
                reference_Date = re.findall(r'\(.*\)',item)[0]
                reference_title = re.findall(r'["].*["]',item)[0]
                print(reference_Date+' '+ reference_title)
                found_ref = True
    if not found_ref:
        print('No date in references found.')
    #Search through page body via paragraphs
    regex_body = r"\b" + re.escape(month) + r"\b \d*. \b" + re.escape(year)
    
    found_body = False
    c_body=re.compile(regex_body,re.I) 
    print('\nSearching body...\n')
    for paragraph in soup_page.findAll('p'):
        par = paragraph.getText()
        if len(c_body.findall(par)):
            print(par)
            found_body = True
    if not found_body:
        print('No date in body found.')

Proportion of Edits¶

First we consider a histogram of the number of edits per month. In the figure below we plot, for each page of interest, the normalized distribution of monthly edit counts. We then report, for each page, the information found for the month with the maximum number of edits.

In [9]:
list_of_counts = []
nbins=20
for page in page_names:
    monthly_edits =dict_of_tables[page]
    found_count = monthly_edits['Edits']
    list_of_counts.append(found_count)

plt.hist(list_of_counts,nbins,density=True, histtype='bar')
plt.legend(page_names)
plt.show()
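
A note on the normalization used in the histogram above: with `density=True` the bar heights are densities, so it is the area under the bars that equals one, not the sum of the heights. The sketch below uses `np.histogram`, which accepts the same `density` and `weights` arguments as `plt.hist`, on made-up monthly edit counts to contrast this with bars that are true proportions. The toy values and variable names are assumptions for illustration only.

```python
import numpy as np

# Toy monthly edit counts (made-up values, not taken from the scraped tables)
edits = np.array([1, 2, 2, 3, 10, 12, 40, 41])

# density=True: heights are counts / (n * bin_width), so the *area* is 1
density, edges = np.histogram(edits, bins=4, density=True)
area = float(np.sum(density * np.diff(edges)))

# weights of 1/n: heights are the fraction of months in each bin, so they sum to 1
weights = np.ones_like(edits, dtype=float) / len(edits)
proportion, _ = np.histogram(edits, bins=4, weights=weights)

print('Area under density bars:', area)
print('Proportion of months per bin:', proportion)
print('Sum of proportions:', proportion.sum())
```

If proportions are what one wants to read off the vertical axis, passing `weights` instead of `density=True` is one option; with several pages plotted at once, each page would need its own weight array.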
In [10]:
for page in page_names:
    monthly_edits = dict_of_tables[page]
    max_edit_entry =monthly_edits.sort_values(by = 'Edits',ascending=False).iloc[0]
    print('-'*50)
    print(' '.join(page.split('_'))+ ' Max edits')
    print('-'*50)
    print(max_edit_entry['Month'])
    print('Number of edits:', max_edit_entry['Edits'])
    print('Number of citations found in references:', max_edit_entry['Found References'])
    print('Number of citations found in body:', max_edit_entry['Found Body'])
    print_date_info(max_edit_entry['Month'],page)
--------------------------------------------------
Joe Biden Max edits
--------------------------------------------------
2008-08-01 00:00:00
Number of edits: 913
Number of citations found in references: 19
Number of citations found in body: 1
Searching References...

(August 27, 2008) "Biden's Scranton childhood left lasting impression"
(August 24, 2008) "In his home state, Biden is a regular Joe"
(August 24, 2008) "Jill Biden Heads Toward Life in the Spotlight"
(August 25, 2008) "Parishioners not surprised to see Biden at usual Mass"
(August 20, 2008) "Biden's Foreign Policy Background Carries Growing Cachet"
(August 26, 2008) "For Widener Law students, a teacher aims high"
(August 27, 2008) "Widener students proud of Biden"
(August 24, 2008) "In Biden, Obama chooses a foreign policy adherent of diplomacy before force"
(August 23, 2008) "V.P. candidate profile: Sen. Joe Biden"
(August 23, 2008) "Biden and Anita Hill, Revisited"
(August 24, 2008) "Joe Biden respected—if not always popular—for foreign policy record"
(August 25, 2008) "Biden, McCain Have a Friendship—and More—in Common"
(August 23, 2008) "Obama's veep message to supporters"
(August 23, 2008) "Obama Chooses Biden as Running Mate"
(August 25, 2008) "Tramps Like Us: How Joe Biden will reassure working class voters and change the tenor of this week's convention"
(August 27, 2008) "Biden accepts VP nomination"
(August 24, 2008) "Biden Wages 2 Campaigns At Once"
(August 24, 2008) "Demographics part of calculation: Biden adds experience, yes, but he could also help with Catholics, blue-collar whites and women"
(August 23, 2008) "Halperin on Biden: Pros and Cons"

Searching body...

Shortly after Biden withdrew from the presidential race, Obama privately told him he was interested in finding an important place for Biden in his administration.[170] In early August, Obama and Biden met in secret to discuss the possibility,[170] and developed a strong personal rapport.[169] On August 22, 2008, Obama announced that Biden would be his running mate.[171] The New York Times reported that the strategy behind the choice reflected a desire to fill out the ticket with someone with foreign policy and national security experience.[172] Others pointed out Biden's appeal to middle-class and blue-collar voters.[173][174] Biden was officially nominated for vice president on August 27 by voice vote at the 2008 Democratic National Convention in Denver.[175]

--------------------------------------------------
Donald Trump Max edits
--------------------------------------------------
2016-11-01 00:00:00
Number of edits: 1475
Number of citations found in references: 11
Number of citations found in body: 1
Searching References...

(November 18, 2016) "Donald Trump Agrees to Pay $25 Million in Trump University Settlement"
(November 15, 2016) "Clickbait scoops and an engaged alt-right: everything to know about Breitbart News"
(November 23, 2016) "Donald Trump disavows 'alt-right'"
(November 11, 2016) "Donald Trump will be the only US president ever with no political or military experience"
(November 9, 2016) "Trump pulls off biggest upset in U.S. history"
(November 9, 2016) "Why Trump Won: Working-Class Whites"
(November 9, 2016) "Republicans are poised to grasp the holy grail of governance"
(November 10, 2016) "Protests against Donald Trump break out nationwide"
(November 11, 2016) "Trump says protesters have 'passion for our great country' after calling demonstrations 'very unfair'"
(November 9, 2016) "Trump Can Kill Obamacare With Or Without Help From Congress"
(November 15, 2016) "Trump: Same-sex marriage is 'settled', but Roe v Wade can be changed"

Searching body...

On November 8, 2016, Trump received 306 pledged electoral votes versus 232 for Clinton. The official counts were 304 and 227 respectively, after defections on both sides.[208] Trump received nearly 2.9 million fewer popular votes than Clinton, which made him the fifth person to be elected president while losing the popular vote.[209] Trump is the only president who neither served in the military nor held any government office prior to becoming president.[210]

--------------------------------------------------
Hillary Clinton Max edits
--------------------------------------------------
2005-07-01 00:00:00
Number of edits: 910
Number of citations found in references: 2
Number of citations found in body: 0
Searching References...

(July 14, 2005) "Clinton among senators urging larger-sized army"
(July 14, 2005) "Clinton burnishes hawkish image"

Searching body...

No date in body found.
--------------------------------------------------
Barack Obama Max edits
--------------------------------------------------
2008-11-01 00:00:00
Number of edits: 1434
Number of citations found in references: 7
Number of citations found in body: 2
Searching References...

(November 4, 2008) "Obama Elected President as Racial Barrier Falls"
(November 9, 2008) "Obama stood out, even during brief 1985 NYPIRG job"
(November 17, 2008) "Obama's church choice likely to be scrutinized; D.C. churches have started extending invitations to Obama and his family"
(November 16, 2008) "Obama resigns Senate seat, thanks Illinois"
(November 4, 2008) "Barack Obama elected 44th president"
(November 5, 2008) "Change has come, says President-elect Obama"
(November 30, 2008) "Obama: Oratory and originality"

Searching body...

In a 2006 interview, Obama highlighted the diversity of his extended family: "It's like a little mini-United Nations," he said. "I've got relatives who look like Bernie Mac, and I've got relatives who look like Margaret Thatcher."[69] Obama has a half-sister with whom he was raised (Maya Soetoro-Ng) and seven other half-siblings from his Kenyan father's family—six of them living.[70] Obama's mother was survived by her Kansas-born mother, Madelyn Dunham,[71] until her death on November 2, 2008,[72] two days before his election to the presidency. Obama also has roots in Ireland; he met with his Irish cousins in Moneygall in May 2011.[73] In Dreams from My Father, Obama ties his mother's family history to possible Native American ancestors and distant relatives of Jefferson Davis, President of the Confederate States of America during the American Civil War. He also shares distant ancestors in common with George W. Bush and Dick Cheney, among others.[74][75][76]

Obama resigned his Senate seat on November 16, 2008, to focus on his transition period for the presidency.[170]

--------------------------------------------------
Chuck Schumer Max edits
--------------------------------------------------
2021-01-01 00:00:00
Number of edits: 108
Number of citations found in references: 2
Number of citations found in body: 3
Searching References...

(January 20, 2021) "Schumer becomes new Senate majority leader"
(January 6, 2021) "Nancy Pelosi and Chuck Schumer call on Trump to demand protestors to leave the US Capitol 'immediately'"

Searching body...

Charles Ellis Schumer (/ˈʃuːmər/ SHOO-mər; born November 23, 1950) is an American politician serving as Senate Majority Leader since January 20, 2021.[2] A member of the Democratic Party, Schumer is in his fourth Senate term, having held his seat since 1999, and is the senior United States senator from New York. He is the dean of New York's congressional delegation.

The Senate Democratic caucus elected Schumer minority leader in November 2016. Schumer had been widely expected to lead Senate Democrats after Reid announced his retirement in 2015. He is the first New Yorker, as well as the first Jewish person, to serve as a Senate leader.[36] On January 20, 2021, Democrats gained control of the Senate with the swearing-in of newly elected Georgia senators Jon Ossoff and Raphael Warnock, following the 2020–21 election runoff and special election runoff, making Schumer the majority leader, replacing Republican Mitch McConnell.[37]

Schumer was participating in the certification of the 2021 United States Electoral College vote count on January 6, 2021, when Trump supporters stormed the U.S. Capitol. Schumer and other members of Congress were removed from the Senate chambers. He and Mitch McConnell joined Nancy Pelosi and Steny Hoyer in an undisclosed location. As the attack persisted, Schumer and Pelosi released a joint statement calling on Trump to demand the rioters leave the Capitol and its grounds immediately.[125] When the Senate reconvened after the Capitol was secure, Schumer gave remarks, calling it a day "that will live forever in infamy".[126] Later that day, he blamed Trump for the attack, calling on Vice President Mike Pence to invoke the Twenty-fifth Amendment to the United States Constitution to remove Trump from office. He also said he would support impeachment.[127]

--------------------------------------------------
Mitch McConnell Max edits
--------------------------------------------------
2021-01-01 00:00:00
Number of edits: 139
Number of citations found in references: 4
Number of citations found in body: 1
Searching References...

(January 6, 2021) "Analysis | Mitch McConnell's forceful rejection of Trump's election 'conspiracy theories'"
(January 6, 2021) "Resuming electoral counting, McConnell condemns the mob assault on the Capitol as a 'failed insurrection.'"
(January 12, 2021) "McConnell is said to be pleased about impeachment, believing it will be easier to purge Trump from the G.O.P."
(January 13, 2021) "McConnell won't agree to reconvene Senate early for impeachment trial"

Searching body...

On January 12, 2021, it was reported that McConnell supported impeaching Trump for his role in inciting the 2021 storming of the United States Capitol, believing it would make it easier for Republicans to purge the party of Trump and rebuild the party.[94] On January 13, despite having the authority to call for an emergency meeting of the Senate to hold the Senate trial,[failed verification] McConnell did not reconvene the chamber, claiming unanimous consent was required.[95] McConnell called for delaying the Senate trial until after Joe Biden's inauguration.[96] Once the Senate trial started, McConnell voted to acquit Trump on February 13, 2021, and said it was unconstitutional to convict someone who was no longer in office.[97]

Reading the report, we see that the months in which the maximum number of edits occurred all have citations for noteworthy events. Picking a few examples: for Joe Biden, the month of maximum edits was August 2008, when it was announced that he would be Obama's running mate. On Donald Trump's page, the month of maximum edits is November 2016, the month he won the US presidential election. Similarly, on Barack Obama's page the month of maximum edits is November 2008, when Obama won the US presidential election. These observations are consistent with our hypothesis that Wikipedia edit trends are indicative of important events.

Proportion of Counts¶

We consider a histogram of the number of citations per month. That is, in the figure below we plot the normalized distribution of the number of citations each month receives for each page of interest, restricted to months with at least one citation. We then report the information found for the month with the maximum number of citations on a few selected pages.

In [11]:
list_of_counts = []
nbins=10
for page in page_names:
    monthly_edits =dict_of_tables[page]
    found_count = monthly_edits[monthly_edits['Total Found']!=0]['Total Found']
    list_of_counts.append(found_count)

plt.hist(list_of_counts,nbins,density=True, histtype='bar')
plt.legend(page_names)
plt.show()
In [12]:
for page in ['Donald_Trump', 'Joe_Biden', 'Barack_Obama','Mitch_McConnell']:
    monthly_edits = dict_of_tables[page]
    max_edit_entry =monthly_edits.sort_values(by = 'Total Found',ascending=False).iloc[0]
    print('-'*50)
    print(' '.join(page.split('_'))+ ' Max Citation')
    print('-'*50)
    print(max_edit_entry['Month'])
    print('Number of edits:', max_edit_entry['Edits'])
    print('Number of citations found in references:', max_edit_entry['Found References'])
    print('Number of citations found in body:', max_edit_entry['Found Body'])
    print_date_info(max_edit_entry['Month'],page)
--------------------------------------------------
Donald Trump Max Citation
--------------------------------------------------
2021-01-01 00:00:00
Number of edits: 1055
Number of citations found in references: 37
Number of citations found in body: 6
Searching References...

(January 20, 2021) "Trump's presidency ends where so much of it was spent: A Trump Organization property"
(January 12, 2021) "Deutsche Bank won't do any more business with Trump"
(January 25, 2021) "Supreme Court dismisses emoluments cases against Trump"
(January 8, 2021) "Trump will have the worst jobs record in modern U.S. history. It's not just the pandemic"
(January 14, 2021). "Donald Trump Built a National Debt So Big (Even Before the Pandemic) "Donald Trump Built a National Debt So Big (Even Before the Pandemic) That It'll Weigh Down the Economy for Years"
(January 30, 2021) "From building the wall to bringing back coal: Some of Trump's more notable broken promises"
(January 20, 2021) "The Trump Administration Rolled Back More Than 100 Environmental Rules. Here's the Full List"
(January 16, 2021) "In Trump's final days, a rush of federal executions"
(January 15, 2021) "Trump administration carries out 13th and final execution"
(January 20, 2021) "With Hours Left in Office, Trump Grants Clemency to Bannon and Other Allies"
(January 13, 2021) "Fact check: Mexico never paid for it. But what about Trump's other border wall promises?"
(January 13, 2021) "How Trump compares with other recent presidents in appointing federal judges"
(January 3, 2021) "'I just want to find 11,780 votes': In extraordinary hour-long call, Trump pressures Georgia secretary of state to recalculate the vote in his favor"
(January 5, 2021) "Pence Said to Have Told Trump He Lacks Power to Change Election Result"
(January 20, 2021) "Trump Departs Vowing, 'We Will Be Back in Some Form'"
(January 10, 2021) "Incitement to Riot? What Trump Told Supporters Before Mob Stormed Capitol"
(January 9, 2021) "How one of America's ugliest days unraveled inside and outside the Capitol"
(January 6, 2021) "Facebook, Twitter lock Trump's account following video addressing Washington rioters"
(January 6, 2021) "Calls grow for social media platforms to silence Trump as rioters storm US Capitol"
(January 6, 2021) "Congress confirms Biden's win after pro-Trump mob's assault on Capitol"
(January 11, 2021) "Impeachment Resolution Cites Trump's 'Incitement' of Capitol Insurrection"
(January 13, 2021) "Trump Impeached for Inciting Insurrection"
(January 13, 2021) "House calls on Pence to invoke 25th Amendment, but he's already dismissed the idea"
(January 13, 2021) "Trump's second impeachment is the most bipartisan one in history"
(January 28, 2021) "Palm Beach considers options as Trump remains at Mar-a-Lago"
(January 27, 2021) "Explainer: Why Trump's post-presidency perks, like a pension and office, are safe for the rest of his life"
(January 27, 2021) "Trump opens "Office of the Former President" in Florida"
(January 18, 2021) "Last Trump Job Approval 34%; Average Is Record-Low 41%"
(January 16, 2021) "Trump finishes with worst first term approval rating ever"
(January 16, 2021) "Inside Twitter's Decision to Cut Off Trump"
(January 9, 2021) "A farewell to @realDonaldTrump, gone after 57,000 tweets"
(January 11, 2021) "All the platforms that have banned or restricted Trump so far"
(January 14, 2021) "Twitter ban reveals that tech companies held keys to Trump's power all along"
(January 16, 2021) "Misinformation dropped dramatically the week after Twitter banned Trump and some allies"
(January 23, 2021) "A term of untruths: The longer Trump was president, the more frequently he made false or misleading claims"
(January 15, 2021) "6 conspiracy theories about the 2020 election – debunked"
(January 16, 2021) "'Trump said to do so': Accounts of rioters who say the president spurred them to rush the Capitol could be pivotal testimony"

Searching body...

Trump lost the 2020 presidential election to Joe Biden but refused to concede defeat, falsely claiming widespread electoral fraud and attempting to overturn the results by pressuring government officials, mounting scores of unsuccessful legal challenges, and obstructing the presidential transition. On January 6, 2021, Trump urged his supporters to march to the United States Capitol, which many of them then attacked, resulting in multiple deaths and interrupting the electoral vote count. In November 2022, he announced his candidacy for the Republican nomination in the 2024 presidential election.

Trump, who had been a member since 1989, resigned from the Screen Actors Guild in February 2021 rather than face a disciplinary committee hearing for inciting the January 6, 2021, mob attack on the U.S. Capitol and for his "reckless campaign of misinformation aimed at discrediting and ultimately threatening the safety of journalists."[146] Two days later, the union permanently barred him from readmission.[147]

On January 6, 2021, while congressional certification of the presidential election results was taking place in the United States Capitol, Trump held a rally at the Ellipse, Washington, D.C., where he called for the election result to be overturned and urged his supporters to "take back our country" by marching to the Capitol to "show strength" and "fight like hell".[647][648] Trump's speech started at noon. By 12:30 p.m., rally attendees had gathered outside the Capitol, and at 1 p.m., his supporters pushed past police barriers onto Capitol grounds. Trump's speech ended at 1:10 p.m., and many supporters marched to the Capitol as he had urged, joining the crowd there. Around 2:15 p.m. the mob broke into the building, disrupting certification and causing the evacuation of Congress.[649] During the violence, Trump posted mixed messages on Twitter and Facebook, eventually tweeting to the rioters at 6 p.m., "go home with love & in peace", but describing them as "great patriots" and "very special", while still complaining that the election was stolen.[650][651] After the mob was removed from the Capitol, Congress reconvened and confirmed the Biden election win in the early hours of the following morning.[652] There were many injuries, and five people, including a Capitol Police officer, died.[653]

On January 11, 2021, an article of impeachment charging Trump with incitement of insurrection against the U.S. government was introduced to the House.[654] The House voted 232–197 to impeach Trump on January 13, making him the first U.S. president to be impeached twice.[655] The impeachment, which was the most rapid in history, followed an unsuccessful bipartisan effort to strip Trump of his powers and duties via Section 4 of the 25th Amendment.[656] Ten Republicans voted for impeachment—the most members of a party ever to vote to impeach a president of their own party.[657]

On November 18, 2022, Garland appointed a special counsel, federal prosecutor Jack Smith, to oversee the federal criminal investigations into Trump retaining government property at Mar-a-Lago and examining Trump's role in the events leading up to the January 6, 2021, Capitol attack.[700][701]

Research suggests Trump's rhetoric caused an increased incidence of hate crimes.[813][814] During his 2016 campaign, he urged or praised physical attacks against protesters or reporters.[815][816] Numerous defendants investigated or prosecuted for violent acts and hate crimes, including participants of the January 6, 2021, storming of the U.S. Capitol, cited Trump's rhetoric in arguing that they were not culpable or should receive a lighter sentence.[817][818] A nationwide review by ABC News in May 2020 identified at least 54 criminal cases from August 2015 to April 2020 in which Trump was invoked in direct connection with violence or threats of violence mostly by white men and primarily against members of minority groups.[819]

--------------------------------------------------
Joe Biden Max Citation
--------------------------------------------------
2008-08-01 00:00:00
Number of edits: 913
Number of citations found in references: 19
Number of citations found in body: 1
Searching References...

(August 27, 2008) "Biden's Scranton childhood left lasting impression"
(August 24, 2008) "In his home state, Biden is a regular Joe"
(August 24, 2008) "Jill Biden Heads Toward Life in the Spotlight"
(August 25, 2008) "Parishioners not surprised to see Biden at usual Mass"
(August 20, 2008) "Biden's Foreign Policy Background Carries Growing Cachet"
(August 26, 2008) "For Widener Law students, a teacher aims high"
(August 27, 2008) "Widener students proud of Biden"
(August 24, 2008) "In Biden, Obama chooses a foreign policy adherent of diplomacy before force"
(August 23, 2008) "V.P. candidate profile: Sen. Joe Biden"
(August 23, 2008) "Biden and Anita Hill, Revisited"
(August 24, 2008) "Joe Biden respected—if not always popular—for foreign policy record"
(August 25, 2008) "Biden, McCain Have a Friendship—and More—in Common"
(August 23, 2008) "Obama's veep message to supporters"
(August 23, 2008) "Obama Chooses Biden as Running Mate"
(August 25, 2008) "Tramps Like Us: How Joe Biden will reassure working class voters and change the tenor of this week's convention"
(August 27, 2008) "Biden accepts VP nomination"
(August 24, 2008) "Biden Wages 2 Campaigns At Once"
(August 24, 2008) "Demographics part of calculation: Biden adds experience, yes, but he could also help with Catholics, blue-collar whites and women"
(August 23, 2008) "Halperin on Biden: Pros and Cons"

Searching body...

Shortly after Biden withdrew from the presidential race, Obama privately told him he was interested in finding an important place for Biden in his administration.[170] In early August, Obama and Biden met in secret to discuss the possibility,[170] and developed a strong personal rapport.[169] On August 22, 2008, Obama announced that Biden would be his running mate.[171] The New York Times reported that the strategy behind the choice reflected a desire to fill out the ticket with someone with foreign policy and national security experience.[172] Others pointed out Biden's appeal to middle-class and blue-collar voters.[173][174] Biden was officially nominated for vice president on August 27 by voice vote at the 2008 Democratic National Convention in Denver.[175]

--------------------------------------------------
Barack Obama Max Citation
--------------------------------------------------
2017-01-01 00:00:00
Number of edits: 319
Number of citations found in references: 12
Number of citations found in body: 3
Searching References...

(January 9, 2017) "Barack Obama's Shaky Legacy on Human Rights"
(January 14, 2017) "Jolted by Deaths, Obama Found His Voice on Race"
(January 6, 2017) "US House Passes Motion Repudiating UN Resolution on Israel"
(January 6, 2017) "Final tally: Obama created 11.3 million jobs"
(January 4, 2017) "LGBT activists view Obama as staunch champion of their cause"
(January 9, 2017) "Americans Assess Progress Under Obama"
(January 15, 2017) "Why Did the US Drop 26,171 Bombs on the World Last Year?"
(January 19, 2017) "Map shows where President Barack Obama dropped his 20,000 bombs"
(January 13, 2017) "President Obama, who hoped to sow peace, instead led the nation in war"
(January 5, 2017) "Federal prison population fell during Obama's term, reversing recent trend"
(January 18, 2017) "Obama leaving office at 60 percent approval rating"
(January 18, 2017) "Obama approval hits 60 percent as end of term approaches"

Searching body...

After winning re-election by defeating Republican opponent Mitt Romney, Obama was sworn in for a second term on January 20, 2013. In his second term, Obama took steps to combat climate change, signing a major international climate agreement and an executive order to limit carbon emissions. Obama also presided over the implementation of the Affordable Care Act and other legislation passed in his first term, and he negotiated a nuclear agreement with Iran and normalized relations with Cuba. The number of American soldiers in Afghanistan fell dramatically during Obama's second term, though U.S. soldiers remained in Afghanistan throughout Obama's presidency. Obama left office on January 20, 2017, and continues to reside in Washington, D.C. His presidential library in Chicago began construction in 2021.

On December 23, 2016, under the Obama Administration, the United States abstained from United Nations Security Council Resolution 2334, which condemned Israeli settlement building in the occupied Palestinian territories as a violation of international law, effectively allowing it to pass.[388] Netanyahu strongly criticized the Obama administration's actions,[389][390] and the Israeli government withdrew its annual dues from the organization, which totaled $6 million, on January 6, 2017.[391] On January 5, 2017, the United States House of Representatives voted 342–80 to condemn the UN Resolution.[392][393]

Obama's presidency ended on January 20, 2017, upon the inauguration of his successor, Donald Trump.[472][473] The family moved to a house they rented in Kalorama, Washington, D.C.[474] On March 2, 2017, the John F. Kennedy Presidential Library and Museum awarded the Profile in Courage Award to Obama "for his enduring commitment to democratic ideals and elevating the standard of political courage."[475] His first public appearance since leaving the office was a seminar at the University of Chicago on April 24, where he appealed for a new generation to participate in politics.[476]

--------------------------------------------------
Mitch McConnell Max Citation
--------------------------------------------------
2019-01-01 00:00:00
Number of edits: 123
Number of citations found in references: 12
Number of citations found in body: 1
Searching References...

(January 22, 2019) "Mitch McConnell Got Everything He Wanted. But at What Cost?"
(January 2, 2019) "McConnell suggests shutdown could last for weeks"
(January 4, 2019) "McConnell keeps his head down as government shutdown drags on"
(January 10, 2019) "Senate Democrats pushed a vote to reopen the government. Mitch McConnell shot them down"
(January 11, 2019) "Mitch McConnell could end the shutdown. But he's sitting this one out"
(January 23, 2019) "McConnell blocks bill to reopen most of government"
(January 25, 2019) "'This is your fault': GOP senators clash over shutdown inside private luncheon"
(January 3, 2019) "McConnell Faces Pressure From Republicans to Stop Avoiding Shutdown Fight"
(January 25, 2019) "Trump signs bill to end shutdown and temporarily reopen government"
(January 9, 2019) "The Government Shutdown Was the Longest Ever. Here's the History"
(January 10, 2019) "America's Most and Least Popular Senators: McConnell loses spot as least popular senator"
(January 22, 2019) "Mitch McConnell Got Everything He Wanted. But at What Cost?"

Searching body...

From December 22, 2018, until January 25, 2019, the federal government was shut down when Congress refused to give in to Trump's demand for $5.7 billion in federal funds for a U.S.–Mexico border wall.[142] In December 2018, the Republican-controlled Senate unanimously passed an appropriations bill without wall funding, and the bill appeared likely to be approved by the Republican-controlled House of Representatives and Trump. After Trump faced heavy criticism from some right-wing media outlets and pundits for appearing to back down on his campaign promise to "build the wall", he announced that he would not sign any appropriations bill that did not fund its construction.[143]

Above we consider the months with the maximum number of citations for a few selected pages. On Donald Trump's page we have January 2021, the month of the infamous attack on the US Capitol, an event spurred by the former President of the US. For Joe Biden, we again have August 2008, when he was announced as Obama's running mate. On Barack Obama's page we have January 2017, the end of his second term. For Mitch McConnell, the most cited month is January 2019, during the federal government shutdown that he is credited with prolonging.

Exploratory Data Analysis: Correlations¶

A natural question to ask is whether the number of edits is correlated with the number of citations of a particular date. In particular, we expect the number of Minor edits to be correlated with the number of Major edits, as observed in our time series plots of the edits for each page. Below we plot a heat map of the Pearson correlation coefficient between the total number of citations of a particular date, the number of Minor edits, the number of IP edits, and the number of Major edits.

In [13]:
# Combine all the dataframes of data for each page
combined_df = pd.concat(dict_of_tables.values(),ignore_index=True)
In [14]:
import seaborn as sns

sns.heatmap(combined_df[['Major','Minor edits','IPs','Total Found']].corr(), annot=True)
plt.show()

Based on the correlation heat map above, IP edits are not highly correlated with any other variable. Moreover, the plot confirms our natural guess that Minor edits are highly correlated with Major edits. We also observe that Major edits and Minor edits are correlated with the number of citations found. This correlation suggests, mildly, that edits are indicative of important events.
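
For reference, each cell of the heat map is a Pearson coefficient, which is what `DataFrame.corr()` computes by default. The sketch below works one such coefficient out by hand on made-up numbers and cross-checks it against `np.corrcoef`. The function name `pearson` and the toy values are assumptions for illustration and are not taken from the scraped tables.

```python
import numpy as np

# Made-up monthly counts standing in for two columns of the combined table
major = np.array([10, 20, 30, 40, 50], dtype=float)
minor = np.array([1, 3, 2, 5, 4], dtype=float)

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

r_manual = pearson(major, minor)
r_numpy = float(np.corrcoef(major, minor)[0, 1])  # off-diagonal entry of the 2x2 matrix

print(f'Manual Pearson r: {r_manual:.2f}')
print(f'np.corrcoef r:    {r_numpy:.2f}')
```

Because the coefficient only measures linear association, a weak value for IP edits does not by itself rule out a nonlinear relationship.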

Anomaly Detection: Donald Trump¶

We now implement an unsupervised Anomaly Detection algorithm to detect important dates. In particular, here we make use of an algorithm known as Isolation Forests. More information on Isolation Forests can be found here.

For our model's inputs, we use only a page's Major edits and Minor edits. We leave out IP edits, since our correlation analysis suggests they carry little predictive power. We measure the performance of our model by comparing the dates classified as important with the presence of citations for those dates. That is, a date is important if it has any citation on the page. (It should be noted that this is a weak measure of importance: it treats a date with one citation as being as important as a date with 50.)
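
Before turning to the real data, here is a toy sketch of the label convention that scikit-learn's `IsolationForest` uses: `predict` returns -1 for points it treats as anomalies and 1 for inliers, and `score_samples` returns lower values for more anomalous points. The synthetic edit counts and all variable names below are assumptions made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Made-up (Major, Minor) edit counts: 50 ordinary months plus one extreme month
ordinary = rng.normal(loc=[20.0, 10.0], scale=[3.0, 2.0], size=(50, 2))
spike = np.array([[500.0, 300.0]])
X_toy = np.vstack([ordinary, spike])

toy_clf = IsolationForest(random_state=0).fit(X_toy)
toy_pred = toy_clf.predict(X_toy)          # -1 = anomaly, 1 = inlier
toy_scores = toy_clf.score_samples(X_toy)  # lower = more anomalous

print('Label of the extreme month:', toy_pred[-1])
print('Index of the most anomalous month:', int(np.argmin(toy_scores)))
print('Number of months flagged:', int(np.sum(toy_pred == -1)))
```

The extreme month is isolated after very few random splits, which is what gives it the lowest score. Some ordinary months may also be flagged, since the default threshold is fixed and not fitted to the data.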

In [15]:
from sklearn.ensemble import IsolationForest
import numpy as np

page = 'Donald_Trump'
df = dict_of_tables[page]


X = np.array(df[['Major','Minor edits']])
# Fit Isolation Forest Model
clf = IsolationForest(random_state=0).fit(X)
# Predict
anomaly_predicted =  clf.predict(X)

predicted = df[anomaly_predicted == -1]

# Define the important class as dates with at least 1 citation,
# encoded as -1 to match IsolationForest's label for anomalies (1 otherwise)
y_important = np.where(df['Total Found'] != 0, -1, 1)
In [16]:
fig, ax = plt.subplots()
#important_date = dt.datetime(2021, 1, 6)

ax.plot(df['Month'], df['Major'],linestyle='-', linewidth=2.0)

ax.scatter(predicted[predicted['Total Found']==0]['Month'], predicted[predicted['Total Found']==0]['Major'],
           s=100,marker='x', c ='red'  )

ax.scatter(df['Month'], df['Major'],
           s = df['Total Found']*10,marker='o', c ='orange',alpha=0.35  )
ax.scatter(predicted['Month'], predicted['Major'],
           s = predicted['Total Found']*10,marker='^', c ='green'  )

plt.legend(['Major Edits','Anomalies without citation','Dates with citation','Anomalies with citation'])

plt.xlabel('Date')
plt.title(' '.join(page.split('_'))+' page Edits Anomaly Detection')
plt.show()

In the plot above, green triangles mark the dates labeled as anomalies that have citations; the size of each triangle corresponds to the number of citations for that date. That is, the larger the triangle, the more citations and the more important the date. The red crosses denote dates that were classified as anomalies but have no citations; we consider these to be false positives, i.e. dates incorrectly labeled as important. The orange circles mark all dates that have any citation, with size again corresponding to the number of citations for that date.

We now plot the confusion matrix below.

In [17]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_important, anomaly_predicted, labels =[1,-1])
disp = ConfusionMatrixDisplay(confusion_matrix=cm ,display_labels =[1,-1])
disp.plot()
plt.show()
In [18]:
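# With labels=[1,-1], index 1 is the important/anomaly class (-1);
# rows of cm are true labels, columns are predicted labels
recall = cm[1,1] / np.sum(cm, axis = 1)[1]
precision = cm[1,1] / np.sum(cm, axis = 0)[1]
F1_score = 2*(precision*recall)/(precision+recall)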
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {F1_score:.2f}')
Precision: 0.91
Recall: 0.33
F1 Score: 0.48

We observe that for Donald Trump's page, the Isolation Forest model has high precision but low recall, which leads to a moderate F1 score. This is likely because there are many important dates with only 1 citation, which the model fails to capture.
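
For readers less familiar with these metrics, the following is a small self-contained sketch of how precision, recall and F1 are computed when the class of interest is labeled -1. The helper `precision_recall_f1` and the toy label lists are hypothetical and exist only for illustration; they are not the labels from any page.

```python
def precision_recall_f1(y_true, y_pred, positive=-1):
    """Precision, recall and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels: -1 = important (true) / flagged (predicted), 1 = otherwise.
toy_true = [-1, -1, -1, 1, 1, 1, -1, 1]
toy_pred = [-1, 1, 1, 1, 1, -1, -1, 1]

toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred)
print(f'Precision: {toy_precision:.2f}')  # 2 of 3 flagged dates are important
print(f'Recall: {toy_recall:.2f}')        # 2 of 4 important dates are flagged
print(f'F1 Score: {toy_f1:.2f}')
```

The same formula ties the reported figures together: with the rounded values for this page, 2(0.91)(0.33)/(0.91 + 0.33) ≈ 0.48. Since F1 is a harmonic mean, the low recall pulls it well below the precision.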

In [19]:
pred_average=predicted['Total Found'].mean()
print(f' Average number of citation for predicted dates {pred_average:.2f}')
 Average number of citation for predicted dates 8.54

We see that the average number of citations for a date predicted to be important is fairly high. It is no surprise, then, that we are misclassifying dates with only 1 citation.

Anomaly Detection: Other Pages¶

We now repeat this analysis for every page in our list.

In [20]:
sum_precision = 0
sum_recall = 0
sum_F1 = 0
sum_avg_predicted = 0

for page in page_names:
    print('\n'+'-'*50)
    print(' '.join(page.split('_'))+ ' Page Model')
    print('-'*50)
    df = dict_of_tables[page]
    X = np.array(df[['Major','Minor edits']])
    # Fit Isolation Forest Model
    clf = IsolationForest(random_state=0).fit(X)
    # Predict
    anomaly_predicted =  clf.predict(X)
    predicted = df[anomaly_predicted == -1]
    
    # Define important class: -1 = at least 1 citation, 1 = no citations
    y_important = [(-1)**int(x) for x in df['Total Found']!=0]
    
    # Plot predicted dates
    fig, ax = plt.subplots()
    ax.plot(df['Month'], df['Major'],linestyle='-', linewidth=2.0)
    ax.scatter(predicted[predicted['Total Found']==0]['Month'], predicted[predicted['Total Found']==0]['Major'],
           s=100,marker='x', c ='red'  )
    ax.scatter(df['Month'], df['Major'],
           s = df['Total Found']*10,marker='o', c ='orange',alpha=0.35  )
    ax.scatter(predicted['Month'], predicted['Major'],
           s = predicted['Total Found']*10,marker='^', c ='green'  )
    plt.legend(['Major Edits','Anomalies without citation','Dates with citation','Anomalies with citation'])
    plt.xlabel('Date')
    plt.title(' '.join(page.split('_'))+' page Edits Anomaly Detection')
    plt.show()
    
    
    cm = confusion_matrix(y_important, anomaly_predicted, labels =[1,-1])
    disp = ConfusionMatrixDisplay(confusion_matrix=cm ,display_labels =[1,-1])
    disp.plot()
    plt.show()
        
    recall = cm[1,1] / np.sum(cm, axis = 1)[1]
    precision = cm[1,1] / np.sum(cm, axis = 0)[1]
    F1_score = 2*(precision*recall)/(precision+recall)
    print(f'Precision: {precision:.2f}')
    print(f'Recall: {recall:.2f}')
    print(f'F1 Score: {F1_score:.2f}')
    
    sum_precision += precision
    sum_recall += recall
    sum_F1 += F1_score
    pred_average=predicted['Total Found'].mean()
    sum_avg_predicted += pred_average
    print(f'Average number of citation for predicted dates {pred_average:.2f}')
--------------------------------------------------
Joe Biden Page Model
--------------------------------------------------
Precision: 0.88
Recall: 0.21
F1 Score: 0.34
Average number of citation for predicted dates 5.85

--------------------------------------------------
Donald Trump Page Model
--------------------------------------------------
Precision: 0.91
Recall: 0.33
F1 Score: 0.48
Average number of citation for predicted dates 8.54

--------------------------------------------------
Hillary Clinton Page Model
--------------------------------------------------
Precision: 0.64
Recall: 0.28
F1 Score: 0.39
Average number of citation for predicted dates 2.05

--------------------------------------------------
Barack Obama Page Model
--------------------------------------------------
Precision: 0.89
Recall: 0.23
F1 Score: 0.37
Average number of citation for predicted dates 4.74

--------------------------------------------------
Chuck Schumer Page Model
--------------------------------------------------
Precision: 0.52
Recall: 0.18
F1 Score: 0.27
Average number of citation for predicted dates 1.67

--------------------------------------------------
Mitch McConnell Page Model
--------------------------------------------------
Precision: 0.69
Recall: 0.30
F1 Score: 0.42
Average number of citation for predicted dates 2.25
In [21]:
avg_precision = sum_precision/len(page_names)
avg_recall = sum_recall/len(page_names)
avg_F1 = sum_F1/len(page_names)
avg_avg_predicted = sum_avg_predicted/len(page_names)

print(f'Average Precision: {avg_precision:.2f}')
print(f'Average Recall: {avg_recall:.2f}')
print(f'Average F1 Score: {avg_F1:.2f}')

print(f'Average number of citation for predicted dates {avg_avg_predicted:.2f}')
Average Precision: 0.76
Average Recall: 0.26
Average F1 Score: 0.38
Average number of citation for predicted dates 4.18

We see across all pages that the Isolation Forest model performs with relatively high precision (0.76 on average) and low recall (0.26 on average). Again, the low recall is likely due to how we define an important event.

Conclusion and Future Work¶

Wikipedia page edit trends for political figures do appear to indicate important events. Visually examining each time series, one can observe that during periods where many edits occur, many citations appear for those dates. This suggests that edits are occurring to reflect recent news. Moreover, the two are indeed statistically correlated: the number of page edits is positively correlated with the number of citations for that particular date.

Our Isolation Forest model also performs moderately well for each page. We see that each model performs with high precision and low recall. This is likely due to the fact that our notion of an important event is too coarse: any date with at least one citation counts as important. To build a better model, one needs a better measure of importance. For possible future work, one could consider a normalized proportion of edits, or some function of the number of reference citations and body citations. Even more advanced, one could take the raw text associated with those dates, as in the print_date_info() function, and build a sentiment analysis model to measure how important each date is. In either case, there are many directions this can go.
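
As one very simple illustration of a different importance measure, the sketch below requires a minimum number of citations before a date counts as important. This is an assumption made for illustration, not something evaluated in this project: the helper `precision_recall_at`, the threshold values, and the toy lists are all hypothetical.

```python
def precision_recall_at(min_citations, citations, flagged):
    """Precision and recall when 'important' means >= min_citations citations."""
    important = [c >= min_citations for c in citations]
    tp = sum(1 for imp, f in zip(important, flagged) if imp and f)
    n_flagged = sum(flagged)
    n_important = sum(important)
    precision = tp / n_flagged if n_flagged else 0.0
    recall = tp / n_important if n_important else 0.0
    return precision, recall

# Toy data: citation counts per date and whether the model flagged that date.
toy_citations = [0, 1, 1, 12, 0, 1, 7, 0, 1, 9]
toy_flagged = [False, False, False, True, False,
               False, True, True, False, True]

for k in (1, 2, 5):
    p, r = precision_recall_at(k, toy_citations, toy_flagged)
    print(f'min_citations={k}: precision={p:.2f}, recall={r:.2f}')
```

In this toy setting, raising the threshold removes the single-citation dates from the important class, so recall rises while the set of flagged dates stays the same. Whether the real pages behave this way would need to be checked against the actual data.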