Sci-Fi Novel Data Visualization

I collected the Wikipedia articles for nearly 900 science fiction novels, vectorized them with TF-IDF, and clustered them with k-means. The dimensions are reduced for visualization via t-SNE, the colors represent the k-means clusters, and the dot size represents the prominence of an article relative to the rest of the data. Read more about the methodology in this post. All of the methods are the same; just substitute “novels” for “albums” and “authors” for “musicians.”
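For readers who want a feel for the first two steps, here is a minimal sketch of TF-IDF weighting followed by k-means, written with only the Python standard library. It is not the code behind this post: the token lists are invented toy data, the function names `tfidf_vectors` and `kmeans` are hypothetical, the smoothed IDF formula is just one common variant, and the t-SNE step is left out entirely.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one L2-normalised TF-IDF vector per tokenised doc."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, n_iter=20):
    """Plain Lloyd's algorithm; centroids start at the documents in init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented toy "articles", already tokenised (not taken from the real dataset).
docs = [
    "android empire electric sheep replicant".split(),
    "android replicant bounty hunter".split(),
    "moon colony revolution lunar".split(),
    "lunar moon colony computer".split(),
]

vocab, vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, init_indices=[0, 2])
print(labels)
```

The first two toy documents share vocabulary with each other and none with the last two, so they land in separate clusters. Real articles overlap far more, which is why the IDF weighting matters: terms that appear in most articles get pushed down.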

This is an improvement on a previous post in the crawling code, the data processing, and the visualization. The new Dask-friendly structure of the data allowed me to reuse the same scripts and notebooks as the jazz visualization to create this post. The code is available here.

Observations

As with the original jazz data visualizations, I like to look for evidence that the algorithms or data processing methods are uncovering more than surface-level structure. It’s a good sign that certain authors, like Philip K. Dick and Robert A. Heinlein, sort out into their own mini-clusters.

But I think the best evidence that there is something nontrivial here is the “Frankenstein” cluster, and by that I don’t mean the cluster that contains Frankenstein by Mary Shelley. I mean the cluster that contains Ursula Le Guin, Octavia Butler, N.K. Jemisin, Ann Leckie, and more of the most famous sci-fi authors and novels of all time. The clustering brought together a diverse group of novels based on their special distinctions rather than just their author or genre.

In fact, when I looked at the top keywords behind the k-means clusters used for the coloring above, these were the five most important “words” for grouping the magenta-colored data points together: award, novel, best, le, and guin.
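One common way to read keywords off a k-means model is to rank the vocabulary by each term's weight in the cluster centroid. Whether the keywords here were extracted exactly this way is an assumption on my part, and the centroid weights below are invented purely for illustration; `top_terms` is a hypothetical helper name.

```python
def top_terms(vocab, centroid, n=5):
    """Return the n terms with the largest weight in a cluster centroid."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Hypothetical vocabulary and centroid weights (made-up numbers, not real results).
vocab = ["award", "best", "guin", "le", "novel", "planet", "robot"]
centroid = [0.41, 0.30, 0.22, 0.25, 0.35, 0.05, 0.02]

print(top_terms(vocab, centroid))
```

Because the tokenizer splits on whitespace and punctuation, a name like “Le Guin” shows up as two separate terms, which is how “le” and “guin” can both make the list.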

So, overall, as we predicted, this dataset appears to rely more than the jazz dataset on similarities in the material rather than on collaboration, but individual authors still have an overpowering influence on the clustering. Perhaps by including more genres in the dataset, we could get a more interesting clustering of subjects.