Welcome to WikiVice, a version control history for Wikipedia pages.
Wikipedia articles are living, breathing things, but it's often difficult to tell just how alive these articles are. For our final project at Flatiron, our team of four wanted to create a visualized edit history for Wikipedia articles, to show information such as how often the page is changing, who is changing it, and what other pages that author is editing.
A Technical Walkthrough (& next steps)
For this project, we pulled all of our data from Wikipedia's APIs. Compared to other APIs, Wikipedia's is extremely user-friendly. It offers many ways to customize a query to return exactly the data you want (for example, we could filter for a specific category of edits, which is how we were able to find the vandalism edits for a given page), and better yet, there are no hard call limits on read requests.
To control the information that's coming in, we have a Wikipedia wrapper that makes the calls to the API and saves the results to the database. For any given search, it creates a Page object, 500 past revision objects and 500 author objects, and uses these to aggregate revision history data for the page.
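As a rough illustration of the kind of request such a wrapper makes, here is a minimal sketch that only builds the query URL for a page's last 500 revisions. The helper name `revisions_url` and the choice of `rvprop` fields are assumptions for illustration, not our actual wrapper code, and the sketch stops short of making the HTTP call.

```ruby
require "uri"

# English Wikipedia's Action API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Hypothetical helper: builds the URL for a page's revision history.
# The fields listed in rvprop are an assumption about what a wrapper
# like ours would need (revision ids, timestamps, authors, comments).
def revisions_url(title, limit: 500)
  params = {
    action: "query",
    prop: "revisions",
    titles: title,
    rvlimit: limit,
    rvprop: "ids|timestamp|user|comment",
    format: "json"
  }
  uri = URI(API_ENDPOINT)
  uri.query = URI.encode_www_form(params)
  uri
end

url = revisions_url("Pharrell Williams")
puts url
```

From there, the JSON response would be parsed and each revision saved alongside its author.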
Once you're taken to a page, there's a whole bunch of data visualized for you. Look through to see how often a page is changed, how many unique authors have edited that page, and, based on how often it's changed, how volatile this page is. At the moment, this last measure relies only on the time between the page's own revisions. I'd like to get it working based on the time between revisions of all the other pages in our database, simply because it would be interesting to know how often this page is changing compared to other pages on Wikipedia.
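To make the "time between revisions" idea concrete, here is a minimal sketch that averages the gaps between consecutive revision timestamps. The method name and the use of a plain mean are assumptions for illustration; our actual volatility method may weigh things differently.

```ruby
require "time"

# Hypothetical measure: mean number of seconds between consecutive revisions.
# Returns nil when there are fewer than two revisions to compare.
def average_seconds_between(timestamps)
  times = timestamps.map { |t| Time.iso8601(t) }.sort
  return nil if times.size < 2

  gaps = times.each_cons(2).map { |earlier, later| later - earlier }
  gaps.sum / gaps.size
end

# Made-up timestamps in the ISO 8601 format the API returns.
sample = [
  "2014-04-01T12:00:00Z",
  "2014-04-01T13:00:00Z",
  "2014-04-01T15:00:00Z"
]
average = average_seconds_between(sample)
puts average # gaps of 1 hour and 2 hours average out to 5400.0 seconds
```

A smaller average means a more volatile page. Comparing it against the same number computed across every page in the database would give the relative measure described above.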
We also worked with Chart.js and C3 to develop some visualized graphics. The revision history is particularly interesting, showing how many times a page changes per day over its history. We gathered this data by taking all of the revisions we had for the page, sorting them by date, counting them per day, and translating the counts into arrays to use as data for the visualizations. As an extreme next step, I'd really like to pull in the New York Times API to show the top news story for that page on a day when it's experienced a huge jump in edits.
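The per-day counting can be sketched roughly like this. The method name and the `labels`/`data` keys are assumptions, chosen to match the pair of arrays a charting library typically expects, rather than our exact code.

```ruby
require "date"

# Hypothetical helper: counts revisions per calendar day and returns
# two parallel arrays, one of dates and one of counts, sorted by date.
def revisions_per_day(timestamps)
  counts = Hash.new(0)
  timestamps.each { |t| counts[Date.parse(t)] += 1 }

  days = counts.keys.sort
  {
    labels: days.map(&:iso8601),
    data: days.map { |day| counts[day] }
  }
end

# Made-up timestamps for illustration.
sample = [
  "2014-04-03T09:30:00Z",
  "2014-04-01T12:00:00Z",
  "2014-04-01T18:45:00Z"
]
chart = revisions_per_day(sample)
puts chart[:labels].inspect # ["2014-04-01", "2014-04-03"]
puts chart[:data].inspect   # [2, 1]
```

Note that days with no edits are simply absent from the result. Filling them in with zeros would be a separate step if the chart needs a continuous axis.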
The last two days of our project I worked to refactor the vandalism methods out of our models and into a Vandalism helper with some Arel, both to lean out the models and to add some functionality that we may want to implement later. Right now, we have aggregate data for the vandalism of a page, but it's also interesting to see data on overall vandalism. For example, most vandalism is coming from anonymous authors, and every anonymous author has an IP address that can be used to find the country of origin. We can use this data to answer a few interesting questions: which country, overall, is contributing the most to vandalism? Which users specifically are the worst repeat offenders when it comes to vandalizing pages? (Sidenote: another method I'd like to eventually implement would find out which categories of pages are vandalized the most, although my gut instinct says politicians.)
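Both of those questions come down to counting vandalism records by some attribute. Here is a minimal plain-Ruby sketch of that idea. The `Vandalism` struct, the `top_counts` method, and the sample data are all hypothetical stand-ins (the IPs are reserved documentation addresses), and it assumes the country has already been looked up from the IP. In the app itself this would more likely be a grouped database query.

```ruby
# Hypothetical stand-in for a persisted vandalism record.
Vandalism = Struct.new(:author, :country)

# Counts records by the given attribute and returns [value, count] pairs,
# highest count first (ties broken alphabetically).
def top_counts(records, attribute)
  counts = Hash.new(0)
  records.each { |record| counts[record.public_send(attribute)] += 1 }
  counts.sort_by { |value, count| [-count, value] }
end

# Made-up sample data.
records = [
  Vandalism.new("192.0.2.1", "US"),
  Vandalism.new("198.51.100.7", "UK"),
  Vandalism.new("192.0.2.1", "US")
]

by_country = top_counts(records, :country)
by_author = top_counts(records, :author)
puts by_country.inspect # [["US", 2], ["UK", 1]]
puts by_author.inspect  # [["192.0.2.1", 2], ["198.51.100.7", 1]]
```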
We also have a model that automatically tweets out any vandalism as soon as it's persisted to our database. This, predictably, had a few issues, mainly stemming from some of these vandalism edits coming from the deep dark abyss of the internet. In the end, I wrote a method that acts like a bouncer: right BEFORE a tweet is sent out, it searches through the tweet's words for any offending language and converts it to a slightly less awful (read: vowels starred out) version. Unfortunately this method is not perfect, and where there is a will to use profanity, there is a way, so it still needs to account for special characters and a more extensive list of words. For next steps, I'd like to expand the list of terms it searches against, as well as blacklist some terms outright. For example, if a racial slur comes up, the tweet won't get sent at all. There's a fine line between mean-spirited (like racist edits) and funny (someone repeatedly injecting the lyrics to "Happy" on various parts of Pharrell's page).
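Here is a minimal sketch of the bouncer idea, including the blacklist from the next steps. The word lists are harmless placeholders, and the method names and the punctuation handling in `normalize` are assumptions for illustration rather than our actual filter.

```ruby
# Placeholder lists; the real lists are not reproduced here.
OFFENDING_WORDS = %w[darn heck].freeze   # gets its vowels starred out
BLACKLISTED_WORDS = %w[blockedword].freeze # stops the tweet entirely

# Lowercases a word and strips anything that is not a letter,
# so that "Darn," still matches "darn". (Assumed behavior.)
def normalize(word)
  word.downcase.gsub(/[^a-z]/, "")
end

def star_vowels(word)
  word.gsub(/[aeiou]/i, "*")
end

# Returns the cleaned tweet text, or nil if the tweet should not be sent.
def clean_tweet(text)
  words = text.split(" ")
  return nil if words.any? { |w| BLACKLISTED_WORDS.include?(normalize(w)) }

  words.map { |w| OFFENDING_WORDS.include?(normalize(w)) ? star_vowels(w) : w }
       .join(" ")
end

cleaned = clean_tweet("Well darn, that edit was bad")
puts cleaned # Well d*rn, that edit was bad
blocked = clean_tweet("this has a blockedword in it")
puts blocked.inspect # nil
```

This still misses creative spellings (swapped-in digits, inserted spaces), which is exactly the "where there is a will, there is a way" problem.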
This project came together in two weeks, which is a little crazy, and I'm excited to see how we'll be able to build it out and grow it from here. Visit my other [super, super talented] teammates on GitHub here:
Abhijit Pradhan, who was the main force behind switching our program over to mass assignment to our database, notably cutting the load time from 30 seconds down to a much more reasonable 7;
Heather Petrow, who created the parser helper to go through all of our revision content and turn them into something a human can read—no small feat if you've seen what Wikipedia returns to you content-wise; and
Mitchell Hart, who created the anonymous-author-by-country map, and also got our vandalism edits sent out via Twitter, releasing our project to the world.