In this post, let’s explore your music library with Python!
Reflections
In my last tutorial, the focus on complicated XML parsing drained some of the fun from playing with our data. I also missed some important facts about the dataset when trying to draw conclusions from it. I uncovered this oversight while trying to answer a simple question: “Which artist in my iTunes library am I least likely to skip?” The answer surprised me – Arthur Rubinstein. I didn’t even know who that was, let alone how he snuck into my library. But apparently, 99.17% of the time his songs come on, I listen through to the end. And this wasn’t just a low-sample-size outlier: I have over 120 plays on his music.
So, I typed his name in the search bar, and then everything made sense. Arthur Rubinstein recorded a collection of works by Chopin, so his recordings appear in my playlist of works by Chopin. I often have this playlist on in the background while working, reading, or writing – meaning I’m not actively paying attention, and therefore not taking note of the recording artist or skipping tunes.
More complications
This got me thinking. A good percentage of my library is classical music like this, recorded by artists more modern than the original composers. Even my jazz library is full of standards recorded by multiple artists. I have over 5 recordings of “April in Paris” – all by different artists. Same with “A Night in Tunisia.”
Even worse, here are a couple of excerpts of artists in my library, sorted alphabetically: “Charlie Parker, Charlie Parker & Dizzy Gillespie, Charlie Parker & Miles Davis, Charlie Parker Quintet, […], Miles Davis, Miles Davis & John Coltrane, The Miles Davis Quintet, Miles Davis Sextet…” This means simple aggregations of song play counts by artist won’t be a fair measure of how much I really listen to an artist. Imagine I listen to 15 hours of Rihanna music, but also 10 hours each to the Miles Davis Quintet and the Miles Davis Sextet. Who do I listen to more, Rihanna or Miles Davis?
I started to realize the structure of music data doesn’t readily support inquiry into the collaborative or evolutionary aspects of music performance and history. For example, who was in the Miles Davis Quintet versus the Sextet? Which came first? Did someone quit the group? Did they add a new talent? Is there even any overlap in the groups besides Miles Davis?
These aren’t questions I would expect to answer by searching in Apple Music or Spotify. The algorithms on those platforms are designed to keep you listening, streaming, and subscribing, not researching. Personally, I would appreciate a tool that explicitly accounts for these more complex relationships between artists and compositions, and provides Wikipedia-level information on the music and artists involved. So, I’m working on that. But in the meantime, let’s redo some of the queries that give us insight into our music libraries.
Getting the data
So, apparently, Apple Music has a better way to export your library for our purposes than the XML export I used last time. Just select your playlist, or the list of all your songs, and go to “File->Library->Export Playlist” – you can choose to export your playlist data as a plain text file, named something like Music.txt. Remember where you save it, or copy the file into a folder with a fresh Jupyter notebook to follow along!
Cleaning the data
First, let’s import the necessary libraries and load the Music.txt file – we’ll assume that’s your filename from here on out. Your first couple of Jupyter cells should look like this:
```python
import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from sklearn.cluster import AgglomerativeClustering
import plotly.express as px

# The export is tab-separated; "Time" holds each track's length in seconds
df = pd.read_csv("Music.txt", sep='\t')
df['Playtime in hours'] = df['Time'] * df['Plays'] / 3600
```
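If you want to sanity-check that playtime conversion before pointing it at your own export, here’s a toy version on a couple of invented rows (the column names mirror the export; the artists and numbers are made up):

```python
import pandas as pd

# Invented stand-in for a couple of rows from the export;
# "Time" is the track length in seconds, as in Music.txt
toy = pd.DataFrame({
    'Artist': ['Count Basie', 'Dizzy Gillespie'],
    'Time': [232, 300],   # seconds
    'Plays': [31, 12],
})
toy['Playtime in hours'] = toy['Time'] * toy['Plays'] / 3600
# Second row: 300 seconds * 12 plays / 3600 = 1.0 hour
```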
The next function, I actually just generated with AI. I prompted it to “create a function to group a pandas DataFrame by string similarity on a given column.” With the DataFrame’s groupby method, this is easy to do for strings that match exactly. But remember, I want Miles Davis and the Miles Davis Quintet to aggregate together. To accomplish that, the AI reached for string similarity metrics. The function it made is below. If you want to skip reading through it, that’s ok – we’ll also explain it a little after:
```python
def group_by_string_similarity(
    df, column, threshold=0.3, agg_func='sum'
):
    """
    Groups a DataFrame by string similarity in the specified column.

    Parameters:
        df (pd.DataFrame): The input DataFrame.
        column (str): The column containing strings to group by similarity.
        threshold (float): Distance threshold for clustering
            (0 = identical, 1 = very different).
        agg_func (str or dict): Aggregation function passed to
            DataFrame.groupby().agg().

    Returns:
        pd.DataFrame: Aggregated DataFrame with similar strings grouped.
    """
    # Get unique strings from the column
    unique_strings = df[column].unique()
    n = len(unique_strings)

    # Compute pairwise distance matrix (1 - similarity)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                similarity = fuzz.ratio(unique_strings[i], unique_strings[j]) / 100
                distance_matrix[i, j] = 1 - similarity

    # Cluster similar strings
    clustering = AgglomerativeClustering(
        metric='precomputed',
        linkage='complete',
        distance_threshold=threshold,
        n_clusters=None
    )
    labels = clustering.fit_predict(distance_matrix)

    # Map every string in a cluster to a single representative name
    cluster_map = {}
    for label in set(labels):
        cluster_items = unique_strings[labels == label]
        representative = sorted(cluster_items, key=len)[0]  # shortest name as representative
        for item in cluster_items:
            cluster_map[item] = representative

    # Apply mapping to DataFrame
    df = df.copy()
    df[f'{column}_group'] = df[column].map(cluster_map)

    # Group and aggregate
    grouped = df.groupby(f'{column}_group').agg(agg_func)
    return grouped
```
This function essentially compares every string in a column (for us, the artist name column) to every other string in that column. Then, using clustering methods similar to those we’ve used in previous posts, it makes an educated guess about which strings belong “together” and labels them. Finally, it uses those cluster labels along with the good old groupby method to aggregate the DataFrame. Notice we can choose the aggregation function as a parameter to the function. We’ll want “sum” to add up all the hours listened per artist.
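If you don’t want to take fuzz.ratio on faith, the standard library’s difflib computes a conceptually similar similarity score. The exact numbers differ from rapidfuzz, so treat this as a sketch of the idea, not a drop-in replacement:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Score in [0, 1]: 0 = nothing in common, 1 = identical
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Names we want clustered together score much higher than names we don't
print(similarity("Miles Davis", "The Miles Davis Quintet"))  # comfortably above 0.6
print(similarity("Eminem", "Ella Fitzgerald"))               # well below 0.4
```

This is the intuition behind the distance threshold: we cluster names whose distance (1 minus similarity) falls below it.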
Now, let’s run this on our DataFrame:
```python
adf2 = group_by_string_similarity(
    df[["Artist", "Plays", "Playtime in hours"]].dropna(),
    'Artist',
    threshold=.4,
    agg_func={'Artist': 'count', 'Plays': 'sum', 'Playtime in hours': 'sum'}
).reset_index()
adf2["Artist"] = adf2["Artist_group"]
adf2 = adf2.drop(columns=["Artist_group"])
adf2 = adf2.sort_values(by='Playtime in hours', ascending=False).reset_index(drop=True)
adf2.head(8)
```
In the first line, we apply our group_by_string_similarity function described above. We are only interested in the “Artist”, “Plays”, and “Playtime in hours” columns. I played around with the threshold for string similarity clustering, and 0.4 seemed to work at grouping together Miles Davis with The Miles Davis Quintet, Charles Parker with Charlie Parker, Bill Evans with Bill Evans Trio – while not clustering Eminem with Ella Fitzgerald.
A few boilerplate cleanup lines come after, and then we sort by the “Playtime in hours” column, with ascending=False so that the adf2.head(8) method shows us the top 8 artists and not the bottom 8.
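To see what the dict-valued agg_func buys us, here’s the same groupby-and-aggregate pattern on a tiny invented table, echoing the Rihanna-versus-Miles-Davis question from earlier (the rows, group labels, and numbers are all made up for illustration):

```python
import pandas as pd

# Invented rows: two Miles Davis group names already mapped to one
# Artist_group, the way group_by_string_similarity's cluster_map would
toy = pd.DataFrame({
    'Artist': ['Miles Davis Quintet', 'Miles Davis Sextet', 'Rihanna'],
    'Artist_group': ['Miles Davis', 'Miles Davis', 'Rihanna'],
    'Plays': [40, 25, 90],
    'Playtime in hours': [10.0, 10.0, 15.0],
})
grouped = toy.groupby('Artist_group').agg(
    {'Artist': 'count', 'Plays': 'sum', 'Playtime in hours': 'sum'}
)
# Miles Davis now wins on hours (20.0 vs 15.0), even though
# Rihanna out-plays either group name taken on its own
```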
Visualizing the data
Then, just one more cell to visualize your data in Plotly!
```python
plot_data_df = adf2.head(8)
fig = px.bar(
    plot_data_df,
    x='Artist',
    y='Playtime in hours',
    title='Most listened to artists'
)
fig.show()
```
Mine looked like this!
Happy listening and happy coding!
