Hi all! ‘Tis the season for Spotify Wrapped, that is, if you subscribe to Spotify. I used to, but I left for two reasons: The algorithm was not good enough at making recommendations that stuck with me long-term, and the delays in their lossless audio options. However, this means I miss out on the annual Spotify Wrapped. But I came across a meme this year that read something like: “Wrapped is people waiting around all winter for two SQL queries.” To be fair, I think it’s maybe three SQL queries, but it got me thinking. What would it take to prepare my own music listening data for exploratory data analysis. In this post, I’ll show you how to explore your music library with Python.
For this post, I’ll be assuming you use Apple’s Music app to manage your library. If you don’t, you can at least follow along with the pandas Dataframe portion of the tutorial at the end. See this article for generating sample data with pandas. We’ll also add some commentary at the end.
Prerequisites
To follow along, you will need: Python (install pandas with pip if you don’t have it already) and Jupyter (or you can run as a script).
- Create a folder on your computer called something like “music-library-exploration”
- Start a Jupyter server in your folder either with the Jupyter server application or with the VS Code Jupyter extension.
- In the Music app, open the toolbar, and select “File > Library > Export Library” in the menus. Export a .xml file to your folder from Step 1. Name it “library.xml” for example.
Now we’re ready to load your music library data into Python!
Data cleaning
Copy and paste this code into your Jupyter notebook. Then, run it. You may need to update the library_path variable if you named your library export something different. This script extracts the track data from your XML file into a pandas Dataframe. We’re going to see the benefits of this in a second.
import pandas as pd
import xml.etree.ElementTree as ET
# Path to your iTunes Library XML file
library_path = "library.xml"
# Parse the XML file
tree = ET.parse(library_path)
root = tree.getroot()
# Extract relevant data from the XML
data = []
for track in root.findall("dict/dict/dict"):
track_data = {}
should_append = True
for key, value in zip(track[::2], track[1::2]):
if key.text in ["Name", "Artist", "Album", "Genre","Play Count"]:
track_data[key.text] = value.text
if key.text == 'Apple Music':
should_append = False #Set to False if you only want to include your local music,
# True if you want to include streaming data
if(should_append):
data.append(track_data)
# Create a Pandas DataFrame
df = pd.DataFrame(data).dropna()
The Two Queries of Spotify Wrapped
Let’s check out a couple of the things a Dataframe can do. Try out this line in a cell in your Jupyter notebook:
df.sort_values(by="Play Count", ascending=False)
As you can guess, this will sort the rows of the Dataframe’s tabular data by play count from highest to lowest. The output will be a table that’s something like this:
Name | Artist | Album | Genre | Play Count |
Goin’ Home Comin’ On | Carley Arrowood | Goin’ Home Comin’ On | Country & Folk | 95 |
I Didn’t Know What Time It Was | Charlie Parker | Charlie Parker With Strings | Jazz | 9 |
The Union Meetin’ | Rie Lee Kanehira | The Union Meetin’ | Blues | 9 |
So What | Miles Davis | Kind of Blue | Jazz | 9 |
But there’s something curious going on about this data. Why do all my highest play counts start with the digit 9? And why is there such a gap between the first and second highest played songs? I’m going to guess yours will do something similar. Now, run this line of code, then try the sort again:
df['Play Count'] = df['Play Count'].astype(int)
Now, you are probably less surprised by your most played song, right? The problem was we imported the data, but we didn’t clean it properly. Pandas was interpreting our play count column as a string and sorting it alphabetically rather than numerically. To fix this, we had to tell pandas to treat the data as integers.
A lot of any task with data or even any software development involving queries is just cleaning and processing data to allow you to create meaningful queries and insights from it easily.
To emulate the Spotify Wrapped, let’s try a couple of queries.
df[["Artist", "Play Count"]]
.groupby("Artist")
.sum()
.reset_index()
.sort_values(by="Play Count", ascending=False)
.head(10)
Let’s go line by line here. First, we grab the “Artist” and “Play Count” columns from the Dataframe. Second, we group rows together by the artist, and sum the play count for each group. For our purposes today, “reset index” is cosmetic. It just flattens out the groupings of rows to make them look like regular rows again. We’ve seen “sort_values” already. Sending the “ascending=False” argument tells pandas that we want to sort from highest to lowest rather than the default of lowest to highest. The last method, “head(10)” just says we want the first 10 results in the table.
So, putting it all together, this query tells us our top 10 most played artists and how many times we listened to one of their songs in total!
If we change things up just a little:
df[["Genre", "Name"]].groupby("Genre").sum().reset_index().sort_values(
by="Name", ascending=False
).head(10)
This query is the same except we aggregate on genre rather than artist! So, we discover our top genres!
Bonus Visualizations!
Another benefit of pandas is that there’s a lot of great support for making data visualizations. Let’s save one of our queries to a variable:
df1 = df[["Genre", "Name"]].groupby("Genre").sum().reset_index().sort_values(
by="Name", ascending=False
).head(10)
Now, let’s import Plotly, which will help us make our visualization. If you haven’t installed this library, you can do so now with pip in your terminal. Then, run:
import plotly.express as px
px.pie(df1,values='Name',names='Genre',title="Genres in My Music Library",hole=.3)
For me, the out put looks like this:
“Wrapping Up”
Hopefully, this gives you a good starting point or motivation to play with data frames. One thing I do miss about Spotify Wrapped, though, is the seamless way it fits into the Spotify app. It’s a fun surprise because you get so used to opening the app and just going to your main playlists, but then one day in early winter, this beautiful and smooth front-end experience and gives you a personalized recap of your year. This is something Apple Music still hasn’t figured out: The magic of Spotify Wrapped is how integrated it is into the regular listening experience.
As users of software, even as non-programmers, we understand that each software has certain “boundaries” and things we can and can’t “do” with it. We understand Clippy can’t help us write an essay, or that we’ll get stopped by invisible walls if we try to go out of bounds in early 3D video games. Spotify Wrapped is magic because after using the Spotify app all year, we think we understand the limits of the application, only to realize in one concentrated experience that the app is much more capable than the monotonous way we get used to using it.
My other favorite example of software that does this perfectly is the original 2001 Animal Crossing. That game was released on the GameCube, before Nintendo games consoles and games were subject to the curse of digital downloads and wireless updates. That means all the “boundaries” of the game were defined when it shipped. A lot of the game is monotonous: shaking trees, fishing, and shopping. But a few times a year, a special event like a fishing tournament, fireworks, the appearance of a strange traveler or a grumpy mole rewards (or punishes) the player by challenging what they imagined the limits of the software to be.
However, don’t get the wrong idea about my takeaway for software. I am not advocating that developers spend oodles of time on niche, one-a-year features or queries. If you take that lesson away, you end up with Apple Music’s Wrapped rip-off. It’s not about the queries. It’s about giving users new freedom to create their own new experiences with the software and their data.