Ben’s Blog

Portrait of L. Ron Hubbard, who appears in the wiki crawling datasets

A Wiki Crawling Reflection: The Return of L. Ron

Introduction

Hey all! Maybe you’ve seen my previous post on visualizations of jazz-themed Wikipedia articles. I also posted the code for the project here on GitHub. In this post, I want to go behind the scenes a little on how that project works and brainstorm some of the other things you can do with it!

I discuss some of this in the project’s README.md. Although my specific goal for this repository was to create a visualization of famous jazz albums, I also wanted the code to be general enough to reuse.

The easiest way to alter the project is to start the web crawling algorithm on a different initial page than the one I used for the jazz visualization. This is a breadth-first web crawling algorithm: it follows all the links on a given starting page, and when it’s done with those “child” pages, it follows all the links found on the “child” pages, continuing that way for as many “generations” as you like. Just by starting on a list of science-fiction novels instead of a list of famous jazz albums and applying the same TF-IDF and t-SNE data transformations, I got this “map” of science-fiction related articles (hover over a data point to reveal the title of the article it represents):

[Interactive chart: sci-fi]
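The breadth-first, generation-limited crawl described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function name `crawl_breadth_first`, the `get_links` callback, and the toy link graph are all assumptions made for the example, and a real crawler would fetch and parse pages instead of reading a dictionary.

```python
from collections import deque


def crawl_breadth_first(start, get_links, max_generations):
    """Return {page: generation} for every page reached within max_generations.

    get_links is a hypothetical callback that returns the links on a page.
    """
    generation_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        generation = generation_of[page]
        if generation == max_generations:
            continue  # do not follow links past the last generation
        for link in get_links(page):
            if link not in generation_of:  # skip pages already seen
                generation_of[link] = generation + 1
                queue.append(link)
    return generation_of


# Toy link graph standing in for real Wikipedia pages (hypothetical data).
TOY_LINKS = {
    "List of science fiction novels": ["Dune", "Ubik"],
    "Dune": ["Frank Herbert", "Ubik"],
    "Ubik": ["Philip K. Dick"],
    "Frank Herbert": ["Hugo Award"],
    "Philip K. Dick": [],
}

visited = crawl_breadth_first(
    "List of science fiction novels",
    lambda page: TOY_LINKS.get(page, []),
    max_generations=2,
)
```

With `max_generations=2`, the starting list is generation 0, the novels are generation 1, and the authors are generation 2; "Hugo Award" would be generation 3, so it is never visited. Changing the starting page is just a matter of passing a different `start` value.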

Criticism

Some observations on this chart: it is harder to interpret than the charts in my other post, and there are a lot more “genre” or “category” pages among the data points, such as the yellow cluster at the bottom. This is because the crawler was hard-coded to identify pages that represent albums and musical artists based on their webpage layouts and to exclude other page types from the visualization; it has no equivalent rules for books or authors. So if we wanted cleaner data for this visualization, one way to get it might be to add other hard-coded classifications for author and book page types. Having the author and book title available as structured data would also help create better hover labels. I only know that the yellow-orange cluster on the far right of the chart is centered around Philip K. Dick because I know the titles of his books by heart; the same goes for Robert A. Heinlein at the very top. For some of the other clusters I have no clue, so it would be a huge help to see whether lots of the surrounding author names match up.
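One way such hard-coded page-type rules could look is sketched below. Everything here is hypothetical: the post only says the crawler classifies pages "based on webpage layouts", so the idea of matching on infobox field names, the rule table `PAGE_TYPE_RULES`, and the field names themselves are assumptions chosen for illustration.

```python
# Hypothetical rules: a page type matches when all of its required
# infobox field names are present on the page. First match wins.
PAGE_TYPE_RULES = [
    ("album", {"artist", "released", "label"}),
    ("book", {"author", "publisher"}),
    ("author", {"born", "occupation"}),
]


def classify_page(infobox_fields):
    """Guess a page type from the infobox field names found on a page."""
    fields = {name.lower() for name in infobox_fields}
    for page_type, required in PAGE_TYPE_RULES:
        if required <= fields:  # every required field is present
            return page_type
    return "other"  # genre and category pages would land here
```

Under these assumed rules, `classify_page(["Author", "Publisher", "Pages"])` returns `"book"`, and a page with none of the listed fields returns `"other"`, which is the kind of page the visualization could then leave out.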

Also, since it is more common to author a book solo than to record a jazz album solo, another fun way to remake this graph would be to color the points by author (or clusters of authors) rather than by k-means cluster. The mix of colors would then show off relations between authors and their themes.
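Coloring by author instead of by k-means cluster could be sketched like this, assuming each data point already carries an author name (or `None` for pages with no known author). The function name `colors_by_author` and the palette are illustrative assumptions, not part of the project.

```python
from itertools import cycle


def colors_by_author(authors, palette):
    """Return one color per data point, keyed on its author.

    Points whose author is None are drawn in gray. If there are more
    authors than palette entries, colors are reused in order.
    """
    color_of = {}
    palette_cycle = cycle(palette)
    colors = []
    for author in authors:
        if author is None:
            colors.append("gray")
            continue
        if author not in color_of:
            color_of[author] = next(palette_cycle)  # first time we see this author
        colors.append(color_of[author])
    return colors
```

The returned list lines up with the rows of the t-SNE output, so it could be passed as the per-point color argument of whichever plotting library draws the chart.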

I also wish the hover text in the visualization were clickable, so you could get more info on a data point. In the meantime, you can download the full dataset here as a .csv or other flat file, if you would like to.

Conclusion

That said, it is fun to see that the algorithms are still effective at highlighting some of the underlying structure of the data. Octavia Butler is right next to the Hugo Award. Even the small Ursula Le Guin cluster puts the Hainish novels together and separates her fantasy from her science-fiction stories. L. Ron Hubbard finds his way back into another visualization, although this appearance is less surprising than the one in my previous jazz post. Maybe I can avoid him in the future by sticking with “lower-dimensional” data.

Other things I would like to add to the repository include multiprocessing and upgraded (bulk) database operations. Right now, the web requests work nicely in an asynchronous pattern, but processing the response data could also be parallelized, which would improve the scalability of this method and make it easier to create structured, derived data such as “author” and “artist.” Well, until then!
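As a postscript on the multiprocessing idea: parallelizing the response processing might look roughly like the sketch below, which uses only the Python standard library. It assumes the response bodies have already been downloaded as HTML strings; `LinkCollector`, `extract_links`, and `extract_links_parallel` are hypothetical names, and link extraction stands in for whatever processing the real project does.

```python
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """CPU-bound step: parse one response body and return its links."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def extract_links_parallel(bodies, max_workers=2):
    """Parse many response bodies across worker processes, keeping input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_links, bodies))
```

When running this as a script, the call to `extract_links_parallel(bodies)` should sit under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module on some platforms. The asynchronous requests would stay as they are; only the parsing of each response moves into the process pool.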