Two Graphs on Jazz with Free Data Scraped from Wikipedia

Explanation:

For this visualization, I scraped 6064 Wikipedia articles on jazz or jazz-related albums and clustered them by textual similarity. I used TF-IDF to vectorize the articles, t-SNE to compute each album's 2-D position on the graph, and k-means to assign the point colors.
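To make the pipeline concrete, here is a minimal sketch of the three steps with scikit-learn. The sample texts, the function name `embed_and_cluster`, and every parameter value are assumptions for illustration; the settings used for the real chart are not stated here. It also assumes k-means runs on the full TF-IDF vectors rather than on the 2-D t-SNE output.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical stand-ins for scraped article text (not the real data).
articles = [
    "Kind of Blue is a studio album by trumpeter Miles Davis",
    "Milestones is a studio album by trumpeter Miles Davis",
    "A Love Supreme is an album by saxophonist John Coltrane",
    "Giant Steps is an album by saxophonist John Coltrane",
    "Star Wars is a film soundtrack composed by John Williams",
    "The Empire Strikes Back is a film soundtrack by John Williams",
]


def embed_and_cluster(texts, n_clusters=3, perplexity=2.0, seed=0):
    """Return (2-D coordinates, cluster labels) for a list of documents."""
    # Step 1: turn each article into a sparse TF-IDF vector.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

    # Step 2: cluster in the original high-dimensional space (assumption).
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(tfidf)

    # Step 3: project to 2-D for plotting. Perplexity must be smaller than
    # the number of documents; the dense conversion is only sensible for a
    # tiny demo like this one.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random",
                random_state=seed)
    coords = tsne.fit_transform(tfidf.toarray())
    return coords, labels


coords, labels = embed_and_cluster(articles)
```

Each row of `coords` gives one point's x/y position and the matching entry in `labels` picks its color. Because the two steps work in different spaces, points with the same color can still land far apart on the chart.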

Insights & Future Directions:

Although as a genre “jazz” is a large umbrella, the big orange cluster on the left-middle of the chart captures a lot of data you wouldn’t necessarily call “jazz.” You will find things like Taylor Swift (who is ready for her jazz era?), The Three Stooges, Allen Ginsberg reading the poetry of William Blake, and the Star Wars soundtrack. Outside of this cluster, I can’t find much else that doesn’t fit comfortably under the jazz umbrella, so I think the clustering worked well here.
Even though Miles Davis and John Coltrane are on opposite ends of the 2-D representation of the TF-IDF data, k-means classified them as being in the same cluster, so they were presumably closer together in the original high-dimensional space. You can find Davis in the southwest and Coltrane in the northeast of the chart in the purple or “kind of blue” group.
Albums are grouped well together by artist. TF-IDF probably put a lot of importance on artist names, which could explain why albums by the same artist end up so close together and why L. Ron Hubbard shows up in the Freddie Hubbard cluster. John and Alice Coltrane being clustered can also be attributed to their frequent collaborations, but it would be nice to study further how much the TF-IDF distances were influenced by names versus actual collaborations. At least the proximity of groups like Wayne Shorter-Miles Davis-Bill Evans, Sonny Rollins-Thelonious Monk, and Ella Fitzgerald-Duke Ellington gives evidence that collaboration had some influence.
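One way to probe the name-versus-collaboration question is to compare TF-IDF similarity with and without the artist-name tokens. The sketch below is only a toy under stated assumptions: the two texts are invented, `pair_similarity` and `name_tokens` are hypothetical names, and a real test would fit the vectorizer on the whole corpus instead of two documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pair_similarity(doc_a, doc_b, masked_tokens=None):
    """Cosine similarity of two documents' TF-IDF vectors.

    Tokens listed in masked_tokens are dropped before weighting, which is
    a simple way to remove artist names from the comparison.
    """
    vectorizer = TfidfVectorizer(stop_words=masked_tokens)
    matrix = vectorizer.fit_transform([doc_a, doc_b])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])


# Invented snippets standing in for two article texts.
freddie = "Freddie Hubbard trumpet hard bop album on Blue Note"
l_ron = "L. Ron Hubbard narration soundtrack album"

# Lowercase, because the vectorizer lowercases text before filtering.
name_tokens = ["freddie", "ron", "hubbard"]

with_names = pair_similarity(freddie, l_ron)
without_names = pair_similarity(freddie, l_ron, masked_tokens=name_tokens)
print(f"with names: {with_names:.3f}, names masked: {without_names:.3f}")
```

If masking names makes two artists' albums drift apart, the shared surname was doing most of the work. If they stay close, the shared vocabulary of collaborators, labels, and sessions is the more likely cause.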

Explanation:

This doughnut chart shows how many links point to each musician's page across a sample of 56773 articles on jazz-related topics.
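The counting step could look roughly like the sketch below. It uses Python's built-in `html.parser` instead of Beautiful Soup so that it runs without extra installs. The class and function names are hypothetical, and counting each musician at most once per article is an assumption, not necessarily how the chart was built.

```python
from collections import Counter
from html.parser import HTMLParser


class WikiLinkCollector(HTMLParser):
    """Collect the distinct /wiki/ article titles linked from one page."""

    def __init__(self):
        super().__init__()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep article links only; skip namespaces such as File: or Category:.
        if href.startswith("/wiki/") and ":" not in href:
            title = href[len("/wiki/"):].split("#")[0]
            self.targets.add(title)


def count_musician_links(pages, musicians):
    """Count, per musician, how many pages link to that musician's article."""
    counts = Counter()
    for html in pages:
        collector = WikiLinkCollector()
        collector.feed(html)
        counts.update(collector.targets & musicians)
    return counts


# Tiny made-up pages to show the shape of the input and output.
sample_pages = [
    '<p><a href="/wiki/Miles_Davis">Miles</a> and '
    '<a href="/wiki/Miles_Davis#Legacy">his legacy</a> '
    '<a href="/wiki/File:Cover.jpg">cover</a></p>',
    '<p><a href="/wiki/Miles_Davis">Davis</a> with '
    '<a href="/wiki/John_Coltrane">Coltrane</a></p>',
]
link_counts = count_musician_links(sample_pages, {"Miles_Davis", "John_Coltrane"})
```

The resulting `Counter` maps directly onto the slices of a doughnut chart, with `link_counts.most_common(n)` picking the artists large enough to label.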

Insights & Future Directions:

There is a disappointing absence of references to vocalists like Ella Fitzgerald, Billie Holiday, Nat King Cole, Sarah Vaughan, Frank Sinatra, and so on… I also would have been glad to see Sun Ra, Charlie Parker, and Erroll Garner rank highly enough to include in the chart without having incomprehensibly small pie slices.
There could be some sample bias in the web-crawling algorithm or in the initial article I used to seed the scraping process. Associating each artist with an instrument or sub-genre and re-creating side-by-side versions of the chart filtered on those categories could surface that data, but it might be better to test whether the scraping process was biased by seeding with a vocalist’s article and measuring the increase in references to these artists. If the increase is small, we can probably rule out seed bias.
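The seed test described above comes down to comparing the vocalists' share of links between two crawls. Here is a minimal sketch; the function names are hypothetical and the counts are invented placeholders, not results from either crawl.

```python
def vocalist_share(link_counts, vocalists):
    """Fraction of all counted links that point to a vocalist's page."""
    total = sum(link_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for name, n in link_counts.items() if name in vocalists)
    return hits / total


def seed_bias_shift(baseline_counts, reseeded_counts, vocalists):
    """Change in vocalist share after re-seeding the crawl with a vocalist."""
    return (vocalist_share(reseeded_counts, vocalists)
            - vocalist_share(baseline_counts, vocalists))


vocalists = {"Ella_Fitzgerald", "Billie_Holiday"}

# Invented numbers purely to show the calculation.
baseline = {"Miles_Davis": 90, "Ella_Fitzgerald": 10}
reseeded = {"Miles_Davis": 80, "Ella_Fitzgerald": 15, "Billie_Holiday": 5}

shift = seed_bias_shift(baseline, reseeded, vocalists)
print(f"vocalist share changed by {shift:+.1%}")
```

Comparing shares rather than raw counts keeps the two crawls comparable even if they end up visiting different numbers of articles.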

Further info: View, download, or run the code used to collect the data for the visualizations here. Read more about the code in this post. Libraries used for web crawling and visualizations:

Beautiful Soup – for parsing HTML responses
aiohttp & throttle – for making and throttling bulk asynchronous web requests
SQLite, SQLAlchemy & SQLAlchemy Utils – for persistent storage of scraped web data
Pandas & Scikit-Learn – for processing crawled data and computing TF-IDF, k-means, and t-SNE algorithms
Plotly & Chart-Studio – for creating and publishing the data visualizations