Yesterday, FiveThirtyEight featured a fantastic article by Trevor Martin, a Ph.D student in Computational Biology at Stanford University. Martin’s piece, Dissecting Trump’s Most Rabid Online Following, looked at the toxic communities surrounding Donald Trump, notably r/The_Donald, by using a machine learning technique called latent semantic analysis. LSA uses words and concepts from two sets of documents and shows how closely they are related. Martin used this process to find the overlap between different subreddits; two different subreddits are more similar if users comment in both. He then goes further to use what he calls “subreddit algebra”. By adding or subtracting the subreddits together, other related subreddits can be revealed. For example, r/nba + r/minnesota = r/timberwolves. If you’re interested in semantic vector math, there’s a fun twitter bot that does this algebra several times per day.
As with all FiveThirtyEight’s data stories, they make their code freely available for readers to try out themselves. I thought it’d be interesting to take a peek at some subreddits that are a little closer to home (and a whole lot less racist and sexist). If you don’t want to run this yourself, feel free to skip to the results below.
If you want to follow along, you’ll need some familiarity with the Google Cloud Platform since that’s where everything will be run. Specifically, you’ll be using their BigQuery service, which is a tool for working with massive datasets. You’ll also want to set up a bucket in Google Storage. Your outputs will be quite large and they don’t allow you to export directly to your local file system. Finally, you’ll need some basic familiarity with the R language and an environment to run R scripts. RStudio is a great tool for this.
First, from your Google Cloud console, create a new project to contain the various tables you’ll be generating. Next, head over to BigQuery and create a new dataset under your project. You could call this something like ‘reddit’. This will hold your results. You’ll be querying against
fh-bigquery:reddit_comments set that is made available to you by default. Click on the Compose Query button and use this code from the fivethirtyeight GitHub repository. Change line 19 to the path of your own dataset you just created.
Take the resulting dataset that this query generates and export it to the storage bucket you created. From there, you can download it as a CSV file.
Now, in RStudio, load the vector analysis script from the repository. You’ll need to change the path to the CSV file on line 20 to your exported CSV. And, of course, change the various subreddits after line 59. Now the fun begins!
The first obvious search is for similar subreddits to r/IowaCity. What kinds of things do Iowa City folks post about? The higher the number, the more related the subreddits are.
Ok, not surprising. Surrounding cities plus beer drinking and smoking meats. Iowa City redditors are a chill bunch. What about the uiowa subreddit?
I’ll admit, there are a surprising amount of car discussion going on. Perhaps not when you see some of the cars downtown.
What happens when we take the uiowa out of Iowa City? IowaCity – uiowa =
Iowa City likes to grill out and drink. And play Poekmon Go. Let’s see what librarians are up to. From r/Libraries:
Congratulations, librarians, on your marriage and children! And your new fiber arts project. What happens when we remove the wedding planning from librarians’ reddit posts?
What other interesting algebra problems could we think up? Send me an email and I’ll try to post a few next week. After all, it’s Friday and I’m off to drink beer, grill some vegetarian food, and read sci-fi after I’m done parenting for the day. This weekend might be a good time to pick up knitting.