{"id":4993,"date":"2017-03-24T16:13:45","date_gmt":"2017-03-24T21:13:45","guid":{"rendered":"http:\/\/blog.lib.uiowa.edu\/studio\/?p=4993"},"modified":"2018-10-31T13:50:23","modified_gmt":"2018-10-31T18:50:23","slug":"subreddit-algebra","status":"publish","type":"post","link":"https:\/\/blog.lib.uiowa.edu\/studio\/2017\/03\/24\/subreddit-algebra\/","title":{"rendered":"Subreddit Algebra"},"content":{"rendered":"<figure id=\"attachment_4995\" aria-describedby=\"caption-attachment-4995\" style=\"width: 300px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/blog.lib.uiowa.edu\/studio\/files\/2017\/03\/112671034_99ae285c24_z.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-4995\" src=\"https:\/\/blog.lib.uiowa.edu\/studio\/files\/2017\/03\/112671034_99ae285c24_z-300x225.jpg\" alt=\"\" width=\"300\" height=\"225\" srcset=\"https:\/\/blog.lib.uiowa.edu\/studio\/files\/2017\/03\/112671034_99ae285c24_z-300x225.jpg 300w, https:\/\/blog.lib.uiowa.edu\/studio\/files\/2017\/03\/112671034_99ae285c24_z.jpg 640w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-4995\" class=\"wp-caption-text\">photo credit: Laura Crossett<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400\">Yesterday, FiveThirtyEight featured a fantastic article by Trevor Martin, a Ph.D student in Computational Biology at Stanford University. Martin\u2019s piece, <\/span><a href=\"https:\/\/fivethirtyeight.com\/features\/dissecting-trumps-most-rabid-online-following\/\"><i><span style=\"font-weight: 400\">Dissecting Trump\u2019s Most Rabid Online Following<\/span><\/i><\/a><span style=\"font-weight: 400\">, looked at the toxic communities surrounding Donald Trump, notably r\/The_Donald, by using a machine learning technique called <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Latent_semantic_analysis\"><span style=\"font-weight: 400\">latent semantic analysis<\/span><\/a><span style=\"font-weight: 400\">. 
LSA uses\u00a0words and concepts from two sets of documents and shows how closely they are related. Martin used this process to find the overlap between different subreddits; two subreddits are more similar if the same users comment in both. He then goes further with what he calls \u201csubreddit algebra\u201d. By adding or subtracting subreddits, other related subreddits can be revealed. For example, r\/nba + r\/minnesota = r\/timberwolves. If you\u2019re interested in semantic vector math, there\u2019s a <\/span><a href=\"https:\/\/twitter.com\/wordofmath\"><span style=\"font-weight: 400\">fun Twitter bot<\/span><\/a><span style=\"font-weight: 400\"> that does this algebra several times per day.<\/span><\/p>\n<p><span style=\"font-weight: 400\">As with all of its data stories, FiveThirtyEight makes the code freely available for readers to try out themselves. I thought it\u2019d be interesting to take a peek at some subreddits that are a little closer to home (and a whole lot less racist and sexist). If you don\u2019t want to run this yourself, feel free to skip to the results below.<\/span><\/p>\n<h2><b>The Setup<\/b><\/h2>\n<p><span style=\"font-weight: 400\">If you want to follow along, you\u2019ll need some familiarity with the <\/span><a href=\"https:\/\/cloud.google.com\/\"><span style=\"font-weight: 400\">Google Cloud Platform<\/span><\/a><span style=\"font-weight: 400\"> since that\u2019s where everything will be run. Specifically, you\u2019ll be using their <a href=\"https:\/\/cloud.google.com\/bigquery\/\">BigQuery service<\/a>, which is a tool for working with massive datasets. You\u2019ll also want to set up a bucket in <a href=\"https:\/\/cloud.google.com\/storage\">Google Storage<\/a>. Your outputs will be quite large, and BigQuery doesn\u2019t allow you to export them directly to your local file system. Finally, you\u2019ll need some basic familiarity with the R language and an environment to run R scripts. 
<\/span><a href=\"https:\/\/www.rstudio.com\/\"><span style=\"font-weight: 400\">RStudio<\/span><\/a><span style=\"font-weight: 400\"> is a great tool for this.<\/span><\/p>\n<p><span style=\"font-weight: 400\">First, from your Google Cloud console, create a new project to contain the various tables you\u2019ll be generating. Next, head over to <\/span><span style=\"font-weight: 400\">BigQuery<\/span><span style=\"font-weight: 400\"> and create a new dataset under your project. You could call this something like \u2018reddit\u2019. This will hold your results. You\u2019ll be querying against <\/span><code><span style=\"font-weight: 400\">fh-bigquery:reddit_comments <\/span><\/code><span style=\"font-weight: 400\">set that is made available to you by default. Click on the Compose Query button and use <\/span><a href=\"https:\/\/github.com\/fivethirtyeight\/data\/blob\/master\/subreddit-algebra\/processData.sql\"><span style=\"font-weight: 400\">this code from the fivethirtyeight GitHub repository<\/span><\/a><span style=\"font-weight: 400\">. Change line 19 to the path of your own dataset you just created.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Take the resulting dataset that this query generates and export it to the storage bucket you created. From there, you can download it as a CSV file.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Now, in RStudio, load the <\/span><a href=\"https:\/\/github.com\/fivethirtyeight\/data\/blob\/master\/subreddit-algebra\/subredditVectorAnalysis.r\"><span style=\"font-weight: 400\">vector analysis script from the repository.<\/span><\/a><span style=\"font-weight: 400\"> You\u2019ll need to change the path to the CSV file on line 20 to your exported CSV. And, of course, change the various subreddits after line 59. 
Now the fun begins!<\/span><\/p>\n<h2><b>The Results<\/b><\/h2>\n<p><span style=\"font-weight: 400\">The first obvious search is for similar subreddits to <a href=\"https:\/\/www.reddit.com\/r\/IowaCity\/\">r\/IowaCity<\/a>. What kinds of things do Iowa City folks post about? The higher the number, the more related the subreddits are.<\/span><\/p>\n<p><strong>Cedarrapids<\/strong> <span style=\"font-weight: 400\">0.4627451<br \/>\n<\/span><strong>Madisonwi<\/strong> <span style=\"font-weight: 400\">0.4278260<br \/>\n<\/span><strong>Uiowa<\/strong> <span style=\"font-weight: 400\">0.4216467<br \/>\n<\/span><strong>Milwaukee<\/strong> <span style=\"font-weight: 400\">0.4069844<br \/>\n<\/span><strong>Homebrewing<\/strong> <span style=\"font-weight: 400\">0.3992629<br \/>\n<\/span><strong>Beer<\/strong> <span style=\"font-weight: 400\">0.3941419<br \/>\n<\/span><strong>Chicago<\/strong> <span style=\"font-weight: 400\">0.3916151<br \/>\n<\/span><strong>Indianapolis<\/strong> <span style=\"font-weight: 400\">0.3868063<br \/>\n<\/span><strong>Iowa<\/strong> <span style=\"font-weight: 400\">0.3850677<br \/>\n<\/span><strong>Smoking<\/strong> <span style=\"font-weight: 400\">0.3823774<\/span><\/p>\n<p><span style=\"font-weight: 400\">Ok, not surprising. Surrounding cities plus beer drinking and smoking meats. Iowa City redditors are a chill bunch. 
What about the uiowa subreddit?<\/span><\/p>\n<p><strong>IowaCity<\/strong> <span style=\"font-weight: 400\">0.4216467<br \/>\n<\/span><strong>Mazdaspeed6<\/strong> <span style=\"font-weight: 400\">0.2913548<br \/>\n<\/span><strong>Swimming<\/strong> <span style=\"font-weight: 400\">0.2766708<br \/>\n<\/span><strong>Projectcar<\/strong> <span style=\"font-weight: 400\">0.2719264<br \/>\n<\/span><strong>Madisonwi<\/strong> 0.2699070<br \/>\n<strong>Cartalk<\/strong> <span style=\"font-weight: 400\">0.2696891<br \/>\n<\/span><strong>College<\/strong> <span style=\"font-weight: 400\">0.2646985<br \/>\n<\/span><strong>Cars<\/strong> <span style=\"font-weight: 400\">0.2642775<br \/>\n<\/span><strong>Civilengineering<\/strong> <span style=\"font-weight: 400\">0.2637309<br \/>\n<\/span><strong>Milwaukee<\/strong> <span style=\"font-weight: 400\">0.2634588<\/span><\/p>\n<p><span style=\"font-weight: 400\">I\u2019ll admit, there is a surprising amount of car discussion going on. Perhaps it\u2019s less surprising when you see some of the cars downtown.<\/span><\/p>\n<p><span style=\"font-weight: 400\">What happens when we take the uiowa out of Iowa City? 
IowaCity &#8211; uiowa =<\/span><\/p>\n<p><strong>PoGoIC<\/strong> <span style=\"font-weight: 400\">0.2447359<br \/>\n<\/span><strong>Smoking<\/strong> <span style=\"font-weight: 400\">0.2135908<br \/>\n<\/span><strong>Homebrewing<\/strong> <span style=\"font-weight: 400\">0.2053004<br \/>\n<\/span><strong>BBQ<\/strong> <span style=\"font-weight: 400\">0.2028280<br \/>\n<\/span><strong>Grilling<\/strong> <span style=\"font-weight: 400\">0.1997918<br \/>\n<\/span><strong>Sousvide<\/strong> <span style=\"font-weight: 400\">0.1983743<br \/>\n<\/span><strong>Wine<\/strong> <span style=\"font-weight: 400\">0.1961068<br \/>\n<\/span><strong>Cedarrapids<\/strong> <span style=\"font-weight: 400\">0.1937385<br \/>\n<\/span><strong>Bourbon<\/strong> <span style=\"font-weight: 400\">0.1917187<br \/>\n<\/span><strong>Spicy<\/strong> <span style=\"font-weight: 400\">0.1895046<\/span><\/p>\n<p><span style=\"font-weight: 400\">Iowa City likes to grill out and drink. And play Pok\u00e9mon Go. Let\u2019s see what librarians are up to. From <a href=\"https:\/\/www.reddit.com\/r\/Libraries\/\">r\/Libraries<\/a>:<\/span><\/p>\n<p><strong>Librarians<\/strong> <span style=\"font-weight: 400\">0.6681721<br \/>\n<\/span><strong>Teachers<\/strong> <span style=\"font-weight: 400\">0.6463503<br \/>\n<\/span><strong>Knitting<\/strong> <span style=\"font-weight: 400\">0.6231567<br \/>\n<\/span><strong>Parenting<\/strong> 0.6165957<br \/>\n<strong>Weddingplanning<\/strong> <span style=\"font-weight: 400\">0.6118699<br \/>\n<\/span><strong>Genealogy<\/strong> 0.6118073<br \/>\n<strong>Wedding<\/strong> 0.6039990<br \/>\n<strong>Femalefashionadvice<\/strong> <span style=\"font-weight: 400\">0.6024974<br \/>\n<\/span><strong>Crochet<\/strong> <span style=\"font-weight: 400\">0.6010991<br \/>\n<\/span><strong>Vegetarian<\/strong> <span style=\"font-weight: 400\">0.5975182<\/span><\/p>\n<p><span style=\"font-weight: 400\">Congratulations, librarians, on your marriage and children! 
And your new fiber arts project. What happens when we remove\u00a0the wedding planning from librarians\u2019 reddit posts?<\/span><\/p>\n<p><strong>Corruption<\/strong> <span style=\"font-weight: 400\">0.3048685<br \/>\n<\/span><strong>HistoryofIdeas<\/strong> 0.2961678<br \/>\n<strong>CornbreadLiberals<\/strong> <span style=\"font-weight: 400\">0.2932469<br \/>\n<\/span><strong>TrueProgressive<\/strong> <span style=\"font-weight: 400\">0.2924358<br \/>\n<\/span><strong>Scifi<\/strong> <span style=\"font-weight: 400\">0.2919506<br \/>\n<\/span><strong>Media<\/strong> <span style=\"font-weight: 400\">0.2833257<br \/>\n<\/span><strong>WarOnComcast<\/strong> 0.2797392<br \/>\n<strong>TechNewsToday<\/strong> <span style=\"font-weight: 400\">0.2789546<br \/>\n<\/span><strong>InCaseYouMissedIt<\/strong> <span style=\"font-weight: 400\">0.2789388<br \/>\n<\/span><strong>Obama<\/strong> <span style=\"font-weight: 400\">0.2774487<\/span><\/p>\n<p><span style=\"font-weight: 400\">What other interesting algebra problems\u00a0could we think up? <a href=\"mailto:matthew-butler@uiowa.edu\">Send me an email<\/a> and I\u2019ll try to post a few next week. After all, it\u2019s Friday and I\u2019m off to drink beer, grill some vegetarian food, and read sci-fi after I\u2019m done parenting for the day. This weekend might be a good time to pick up knitting.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Yesterday, FiveThirtyEight featured a fantastic article by Trevor Martin, a Ph.D student in Computational Biology at Stanford University. Martin\u2019s piece, Dissecting Trump\u2019s Most Rabid Online Following, looked at the toxic communities surrounding Donald Trump, notably r\/The_Donald, by using a machine learning technique called latent semantic analysis. 
LSA uses\u00a0words and concepts from two sets of documents<a class=\"more-link\" href=\"https:\/\/blog.lib.uiowa.edu\/studio\/2017\/03\/24\/subreddit-algebra\/\">Continue reading <span class=\"screen-reader-text\">&#8220;Subreddit Algebra&#8221;<\/span><\/a><\/p>\n","protected":false},"author":153,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"syndication":[],"_links":{"self":[{"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/posts\/4993"}],"collection":[{"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/users\/153"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/comments?post=4993"}],"version-history":[{"count":11,"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/posts\/4993\/revisions"}],"predecessor-version":[{"id":5007,"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/posts\/4993\/revisions\/5007"}],"wp:attachment":[{"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/media?parent=4993"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/categories?post=4993"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/tags?post=4993"},{"taxonomy":"syndication","embeddable":true,"href":"https:\/\/blog.lib.uiowa.edu\/studio\/wp-json\/wp\/v2\/syndication?post=4993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}