Introduction
Welcome to the first tutorial post on my blog! Any post I tag “Tutorial” should suit a wider audience, and I hold it to a higher standard of reproducibility. I want to eliminate the familiar problem where developers search the documentation for answers, but nothing quite lines up with how their system is set up or how their code is written. In this post, we’ll be predicting jazz genres with XGBoost classifiers.
I’ll be working with data from previous posts, along with Pandas data frames, classifier models, TF-IDF vectorization, and label encoders. To follow along with the code, you should have Python and Jupyter installed. I posted the code I will be referencing here. We will be training machine learning models to extract the genre of a jazz album from the text of its Wikipedia article.
Noticing that Wikipedia lists the genre of a music album in a prominent place on the page (see example), I was able to retrieve genre labels for thousands of albums. But there was an issue: among those thousands of albums, over 80% had a genre labeled simply “jazz.” Fewer than 800 had more specific labels like “big band” or “vocal jazz.” That is not a large enough sample to tell me what percentage of albums in my overall dataset have vocals.
But there’s good news. For both the pages labeled “jazz” and the pages labeled with a more specific genre like “blues” or “modal jazz,” I have a lot more underlying data: namely, the text of the Wikipedia article for each album. Using machine learning, we can try to extract the data we want from the data we have. I’ll spoil the results of this process now, then explain the visualization and how we got there:
Machine Learning
Let’s use variables to talk about this data. We might think that genre is a function of the text of an article. For example, if the word “sing” occurs frequently in an article, labeling it “vocal jazz” could make sense. So, if x is the article text and y is the genre, we hypothesize a function f where f(x) = y. We don’t know what the function f is, but we do have lots of datapoints where we know (x, y). For the 800-or-so albums with specific genre labels, we can train a computer to guess f, and then use that guess to predict the genre from the article text for the thousands of albums without specific labels.
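For instance, a couple of toy (x, y) pairs might look like this (invented snippets, not rows from the actual dataset):

x = ["The singer's voice floats over the big band arrangement...",
     "A long modal vamp anchors the quartet's improvisations..."]
y = ["vocal jazz", "modal jazz"]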
There are lots of different methods for teaching a computer to guess the function f. Typically, you make an assumption about the function that narrows the scope of what it could be. Then, under that assumption, you iterate guesses to find the best guess possible. For some problems, you can assume linear functions. Often, large language models assume their linguistic output function has the structure of a neural network.
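As a toy illustration of “assume, then iterate guesses” (made-up numbers and a deliberately naive search, not what XGBoost does internally):

import numpy as np

# assume f is linear (y = w*x) and iterate guesses for the slope w, keeping the best one
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])            # made-up datapoints
best_w, best_err = None, float('inf')
for w in np.linspace(0, 5, 101):         # candidate guesses for the slope
    err = np.mean((w * x - y) ** 2)      # how wrong this guess is
    if err < best_err:
        best_w, best_err = w, err
print(best_w)                            # lands near 2, the slope hiding in the data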
Decision Trees
XGBoost, or extreme gradient boosting, is a method for guessing f based on a type of model called a decision tree. Without going into too much detail, decision trees are basically like any decision diagram a human would use to make a choice or follow a procedure. But the computer figures out the branches of the tree – the decisions it needs to make – by studying which questions about the data give it the most information at each step.
For example: I am expecting an important letter. Should I check the mail? The answer depends on the time of day and the day of the week, but which is more important? If it’s after 5pm, it’s probably fair to say go and check the mail. But that would be silly to do on a Sunday. We should probably make sure it’s not Sunday first. But on any other day of the week, we would want to check the time. XGBoosting figures all this out automatically using information theory.
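Here is a toy sketch of the kind of branching such a tree might settle on for the mail example (written by hand here; a real tree learns these splits from data):

# most informative question first: is it Sunday? (no mail delivery)
def should_check_mail(day, hour):
    if day == 'Sunday':
        return False
    return hour >= 17        # otherwise, only worth checking after 5pm

print(should_check_mail('Tuesday', 18))   # True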
For our XGBoost model, we want to input the text of the Wikipedia articles, and output the genres. That brings us to the code.
The Code
I won’t go through everything here; the full code is posted on GitHub anyway. But I want to paint enough of a picture that what the code is doing makes sense. In the code, I have two data frames created from .csv files: genre_df and jazz_df:
import pandas as pd

genre_df = pd.read_csv('subgenre.csv')
jazz_df = pd.read_csv('jazz.csv')
These stand for the albums that have specific genre labels, and those just labeled “jazz,” respectively.
In order to reduce the complexity of the data and translate human words into numeric values for the computer, I used TF-IDF vectorization again:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1500, stop_words='english', min_df=0.05, max_df=0.95)
vectors = vectorizer.fit_transform(genre_df['text'])
Now, “vectors” is a grid of numbers where each column represents one of 1,500 important words in the data, and each row represents the text of one of the nearly 800 labeled articles. The number at position (i, j) in the grid describes the importance of the j-th word to the i-th article.
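If you want to peek at that grid, a quick sanity check might look like this (a sketch; get_feature_names_out assumes a recent version of scikit-learn):

print(vectors.shape)                            # (number of labeled articles, 1500)
print(vectorizer.get_feature_names_out()[:10])  # a few of the words the columns stand for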
After that, we also need to encode the specific genre classifications as numeric values with a label encoder (see the code for necessary imports):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(genre_df['genre'])
Notice that in both of these pieces of code, we called a method named “fit_transform.” Think of “fit” as drawing a map with a path on it, and “transform” as following the path. In the future, we’re going to have to translate other data back and forth with this same map, so we will only want to use the “transform” method. If we refit the encoders, we will lose the ability to translate new text data into numbers we can compare against our model.
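For example, once the label encoder is fit, translating back and forth only uses transform and inverse_transform (assuming “vocal jazz” is one of the labels it saw during fitting):

code = le.transform(['vocal jazz'])      # label -> number, following the existing map
print(le.inverse_transform(code))        # number -> label: ['vocal jazz']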
The next step is to actually train the XGBoost model:
import xgboost as xgb

model = xgb.XGBClassifier(random_state=42)
model.fit(vectors, y_encoded)
But as you will see if you look at the code on GitHub, there’s actually a step I did in between:
# data_dmatrix is an xgb.DMatrix built from the vectors and encoded labels; params holds the boosting parameters (see GitHub)
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5, metrics='merror')
The cv above stands for cross-validation. Cross-validation is a common technique in data science that helps answer an important question: does my model generalize well, or is it biased towards the training data? You don’t want to be in a situation where you think your guesses are 90% accurate, but in practice they are only 4% accurate. Think of a student who finds the answers to an exam ahead of time and memorizes them. Yes, they will get 100% on that test, but what about when they get a question they haven’t seen before? Cross-validation helps us understand when the computer has “cheated” in this way.
We have 24 specific genres of jazz we are trying to label, so 4% accuracy would be about as good as a random-guessing strategy. When I ran the cross-validation, I saw that the XGBoost model was making accurate predictions on held-out data about 80% of the time, with a deviation smaller than the random-guessing baseline. That’s really good evidence this model generalizes to our larger dataset, or in other words, that it can pass a test without having the answers ahead of time!
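For reference, here is a minimal sketch of how that cross-validation step could be wired up. The DMatrix construction and the parameter values below are my assumptions for illustration; the exact settings are in the code on GitHub.

data_dmatrix = xgb.DMatrix(data=vectors, label=y_encoded)   # XGBoost's native data format (assumed setup)
params = {'objective': 'multi:softmax', 'num_class': 24}    # 24 specific genres to choose from (assumed values)
# ...run the xgb.cv call shown above, then read off the held-out error:
print(1 - xgb_cv['test-merror-mean'].iloc[-1])              # roughly the accuracy on unseen folds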
The last thing to do is predict the genres for the jazz data frame:
jazz_vectors = vectorizer.transform(jazz_df['text'])
y_pred = model.predict(jazz_vectors)
jazz_df['predicted_genre'] = le.inverse_transform(y_pred)
Notice again how we use “transform” and “inverse_transform” instead of “fit_transform” this time.
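At this point, a quick tally of the predictions gives the raw numbers behind the charts in the next section (a sketch; the post’s charts also fold in the already-labeled albums from genre_df):

print(jazz_df['predicted_genre'].value_counts(normalize=True).head())   # share of each predicted genre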
Visualizations
So, here are two pie charts of the data, side by side. First, we have only the data from genre_df, meaning only the pages that were clearly labeled as a specific type of jazz on Wikipedia. Following that, we have our predicted totals for all the genres in the dataset, including the data that was labeled clearly and the genres we had to deduce.
The same colors are used for the same genres in each chart, so it is easier to see, side by side, that the distribution changes, but not too dramatically. For example, there was more vocal jazz and less fusion than originally labeled, but these are still both in the top four categories. Also, to validate the data, I checked 50 of the top albums the model labeled “vocal jazz.” About 40 turned out to be Ella Fitzgerald albums, suggesting over 80% accuracy yet again.
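For anyone who wants to reproduce the figures, side-by-side pie charts along these lines can be drawn with matplotlib (a sketch; the styling and color mapping in the post’s charts differ):

import matplotlib.pyplot as plt

labeled = genre_df['genre'].value_counts()                                         # clearly labeled albums only
combined = pd.concat([genre_df['genre'], jazz_df['predicted_genre']]).value_counts()  # labeled + predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.pie(labeled, labels=labeled.index)
ax1.set_title('Labeled albums only')
ax2.pie(combined, labels=combined.index)
ax2.set_title('Labeled + predicted')
plt.show()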
Conclusion
Overall, both 13.1% and 16.2% were higher than I was expecting for vocal jazz in this dataset. And although our predictions come with margins of error and uncertainty, it makes intuitive sense that hard bop comes out more popular than free jazz. Free jazz and avant-garde jazz are not as popular as bop, according to free jazz musicians I talked to recently. But most importantly, we got good practice using ML methods on a small-dataset classification problem.