Revisiting the Jazz Album Data Visualization

I missed something in my initial critiques of my previous jazz album data visualization. For ease of reference, I’ll re-embed the graph I’m discussing here:

Overall, I think this scatter plot does a good job illustrating the dataset of Wikipedia articles on jazz albums. There are clusters that group similar albums and appealing colors that highlight clusters defined by an algorithm. If you were curious about how to use term frequency-inverse document frequency (TF-IDF), t-SNE, and k-means in a project on data extracted from the web, or about how to embed Python-generated scatter plots on the internet, this graph might inspire you to look at my code or to do your own web crawl. But more than just showing off techniques, I want to engage with the subject, with jazz, and give the audience a place to start with it.

In the above graph, Miles Davis is on the far left, John Coltrane is in the top right, and likewise, Gillespie, Mingus, Monk, Fitzgerald, Sinatra, Sun Ra, Jarrett, and Ellington are all pushed to the “outside” of the visualization rather than emphasized at the center. (Actually, that makes sense for Sun Ra: He’s pretty out there.) In fact, near the center, we have the orange cluster with Taylor Swift, Rod Stewart, and The Beach Boys, among others, who barely have any business being in this dataset. Where are your eyes drawn? Where does someone look if they want to learn more about jazz? I’m guessing not the right place. To address this, I have created a new visualization, voilà:

Encodings

What’s going on here? Like the best jokes and Halloween costumes, the best data visualizations require explanations. Here’s some background on the various encodings used in the graphs.

Coordinates

Recall from my previous posts on web crawling and the death of jazz that each data point on the scatter plot stands for a jazz album – but more specifically, a Wikipedia article on the subject of a famous jazz album. I transform the text of the Wikipedia articles through various algorithms into the (x,y) coordinates of the datapoints, the idea being that similar articles end up close on the graph. (If you want to know more about these specific algorithms, start by reading about TF-IDF here!)
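To make the "similar articles end up close" idea concrete, here is a minimal hand-rolled sketch of the TF-IDF step. It is an illustration under assumptions, not the code from my repository: the three "articles" are made-up stand-ins, the tokenizer is a plain whitespace split, and the helper names `tfidf_vectors` and `cosine_similarity` are hypothetical. A real pipeline would more likely use a library implementation and then hand the resulting vectors to a dimension-reduction step such as t-SNE to get (x,y) values.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a basic TF-IDF scheme."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up stand-ins for article text, not real data from the crawl.
articles = [
    "modal jazz trumpet album recorded in new york",
    "modal jazz saxophone album recorded in new jersey",
    "pop country album recorded in nashville",
]
vectors = tfidf_vectors(articles)
sim_jazz = cosine_similarity(vectors[0], vectors[1])
sim_pop = cosine_similarity(vectors[0], vectors[2])
print(sim_jazz > sim_pop)
```

Words that appear in every article ("album", "recorded", "in") get a weight of zero, so only the distinguishing vocabulary drives the similarity. That is why the two jazz-flavored stand-ins score closer to each other than either does to the pop one.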

Color

Since these algorithms lose a lot of information transforming entire articles into just (x,y) values, I use colors to emphasize another degree of similarity between articles. Notice how, for example, in both visualizations, Miles Davis and John Coltrane have separate and very distinct clusters on the plot, but they ended up with the same k-means cluster color. To me, that signals that color and coordinates alone can encode interesting insights into this dataset, such as the famous collaboration between these two musicians. That said, I am once again disappointed that only one Charlie Parker album appears above despite his own close association with Davis.
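For readers who have not seen k-means before, here is a small sketch of how cluster labels can become colors. It assumes the clustering runs on feature vectors with more dimensions than the plot, which is how two groups can sit apart on the graph yet share a color. The three-dimensional `features`, the `palette`, and the simple "first k points" seeding are all illustrative assumptions, not what my actual code does.

```python
import math


def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm, seeding the centroids with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Invented feature vectors standing in for per-article features.
features = [
    (0.9, 0.1, 0.0),
    (0.1, 0.9, 0.0),
    (0.8, 0.2, 0.1),
    (0.0, 0.8, 0.2),
]
labels = kmeans(features, k=2)

# One color per cluster label; these strings are valid matplotlib color names.
palette = ["tab:blue", "tab:orange"]
colors = [palette[label] for label in labels]
print(labels, colors)
```

The `colors` list lines up index for index with the data points, so it could be passed as the color argument of a scatter plot whose coordinates come from a separate step.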

Size

There’s only one substantive difference between the above two plots: The new graph increases the size of the dots on the scatter plot for more famous or influential albums. So “Kind of Blue” by Miles Davis ends up being a lot larger than “For Musicians Only” by Dizzy Gillespie because it is better known. I measured the “fame” of an album by the number of articles in my dataset that link to it. Search engines often combine this kind of in-link count with other signals to rank web pages. The effect on the graph is that the reader’s eye is drawn to the more famous albums. Hopefully, this increases the chance they can engage with the subject by recognizing some patterns or just some artist or album names!
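The in-link idea can be sketched in a few lines. Everything specific here is an assumption for illustration: the `links` dictionary is invented (it does not reflect which Wikipedia articles really link to which), the helper names are hypothetical, and the linear scaling between `min_size` and `max_size` is just one reasonable choice rather than the formula behind the graph above.

```python
from collections import Counter

# Invented crawl output: each article title maps to the titles it links to.
links = {
    "Kind of Blue": ["Milestones"],
    "Milestones": ["Kind of Blue"],
    "Giant Steps": ["Kind of Blue"],
    "For Musicians Only": ["Kind of Blue", "Giant Steps"],
}


def inlink_counts(links):
    """Count how many other articles in the dataset link to each article."""
    counts = Counter({title: 0 for title in links})
    for source, targets in links.items():
        # set() avoids counting repeated links from the same article twice.
        for target in set(targets):
            if target in links and target != source:
                counts[target] += 1
    return counts


def marker_sizes(counts, min_size=20.0, max_size=400.0):
    """Scale in-link counts linearly into a range of marker sizes."""
    top = max(counts.values()) or 1  # avoid dividing by zero
    return {
        title: min_size + (max_size - min_size) * count / top
        for title, count in counts.items()
    }


counts = inlink_counts(links)
sizes = marker_sizes(counts)
print(sizes)
```

Keeping a nonzero `min_size` means that an album nobody links to still shows up as a small dot instead of vanishing from the plot.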

I was also able to de-clutter the plot by leaving out data points that were less “relevant” to the theme, using the same new columns I computed in Python to set the point sizes.

The Code

I posted the code used to collect this data and create the visualization as open source on my GitHub. Please contact me with questions or if you want me to expand on anything mentioned.
