Neural Network Poetry

An Iowa writer hard at work

As you may know, April is National Poetry Month, an annual series of events organized by the Academy of American Poets to support the appreciation of American poetry. If you’re looking for great book-length collections of poems, you might be interested in the Iowa Poetry Prize winners; many of the previous years’ winners are available in PDF form at Iowa Research Online. What you may not know is that April is also National Poetry Generation Month, an annual tradition in which programmers and creative coders spend the month writing code that generates poetry.

In honor of this time of year, I thought I’d take a look at the Iowa Poetry Prize winners through code. There are many methods for analyzing and generating natural language, but one system that has received a lot of attention recently is neural networks. A neural network is a large collection of artificial neurons modeled very loosely on a biological brain. These neurons are arranged in layers that perform statistical calculations and affect the state of other connected neurons. The approach differs from other computational models in that no knowledge is hard-coded and controlled by elaborate conditional statements (if this, then that). Rather, neural networks learn to solve tasks by observing data, fitting functions that produce sensible outputs even for new data they’ve never seen before. The uses for such a system include image and speech recognition, classification problems, and many forms of prediction and decision making. For example, a neural net could be trained to detect images of cats by observing tens of thousands of labeled images of cats. Google recently launched a project that uses this technique to match your doodles with professional drawings.
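
To make “learning from data” concrete, here’s a minimal sketch of a single artificial neuron in plain NumPy. The dataset and the hidden rule it learns are invented for illustration; nothing about the rule is written into the code. The weights simply drift toward whatever values fit the examples.

```python
# A toy "neuron" that learns a hidden rule purely from labeled examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # 200 observations, 3 features each
true_w = np.array([1.5, -2.0, 0.5])  # the hidden rule we want it to discover
y = (X @ true_w > 0).astype(float)   # labels generated by that rule

w = np.zeros(3)                      # the neuron's connection weights
for _ in range(500):
    pred = 1 / (1 + np.exp(-(X @ w)))      # sigmoid activation
    w -= 0.1 * X.T @ (pred - y) / len(y)   # nudge weights toward the data

pred = 1 / (1 + np.exp(-(X @ w)))
print("training accuracy:", (pred.round() == y).mean())
```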

What happens when we train an artificial intelligence to write English having read only Iowa Poetry Prize winners? Let’s find out!

To start, I downloaded all of the IPP winners from Iowa Research Online, extracted the poems as plain text, and concatenated them into a single file named poems.txt. This served as the training set. Next, I set up this Torch-based Docker container implementation of a recurrent neural network based on work by Andrej Karpathy and Justin Johnson. It was tempting to spin up a Google Cloud VM with an attached GPU, since these kinds of machine learning tasks are greatly accelerated by a graphics processing unit with CUDA, but that’s also quite expensive at 75 cents per hour. Once I had it working, I started the preprocessing and training, which took about 16 hours to complete.
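
If you want to reproduce this, the torch-rnn workflow looks roughly like the sketch below, driven here from Python. The paths, checkpoint name, and settings are illustrative guesses rather than the exact values I used; check the container’s documentation for the real options.

```python
# A rough sketch of the torch-rnn pipeline, run from inside the container.
import subprocess

# 1. Preprocess: encode poems.txt as integer sequences plus a JSON vocabulary.
subprocess.run(["python", "scripts/preprocess.py",
                "--input_txt", "data/poems.txt",
                "--output_h5", "data/poems.h5",
                "--output_json", "data/poems.json"], check=True)

# 2. Train a character-level RNN; -gpu -1 forces the slow CPU path
#    if you skipped the pricey CUDA-equipped VM.
subprocess.run(["th", "train.lua",
                "-input_h5", "data/poems.h5",
                "-input_json", "data/poems.json",
                "-gpu", "-1"], check=True)

# 3. Sample from a saved checkpoint at temperature 0.5.
subprocess.run(["th", "sample.lua",
                "-checkpoint", "cv/checkpoint_10000.t7",
                "-temperature", "0.5",
                "-length", "2000",
                "-gpu", "-1"], check=True)
```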

After a lot of experimentation to build useful training models and keep the network from overfitting or underfitting the data, I had something acceptable and began sampling output. One sampling parameter that was fun to play with was the “temperature” of the sample. A lower temperature produced output that was much more predictable and less error-prone, while a higher temperature was much more inventive but riddled with mistakes. I decided to split the difference and start at 0.5. Here’s the first poem.

Speritas Of The Stars

Morning comes of the sun
to the thin world is a star of her light.

The sheet and the body of parts
of the flame is a light, the body
sees of the wars beautiful on the street.
The sun, the stars of the sound, and desire,
and a man could love the streets.

The single shiller of light,
and the single stranger falls countal.
Father and she were the sutters of the body
instraining to the complete
window of light, still.

You’ll notice a few words in this poem that don’t actually exist in English. That’s because this RNN operates at the character level, not the word level. It has to learn, from scratch, how to write English: it starts with random strings of letters and slowly, after many iterations, learns about spaces, proper punctuation, and finally readable words. The higher the sampling temperature, the more invented words (there’s a short sketch of the sampling math after the poems below). Let’s look at a “hot” poem.

Pelies, One Yighter

The shadows just plance croved
I am one
its funlet from the wind
staskaccus, gring of detches of hearts face eashog
what wing to the streed in the resert of change, a glince
the life.
She read on his fill bathered, a hand the
marks
with beautiful, casty, stery, kooms, in one father

something the mouth cold leaves.
A night and no one is a woman; you green her

My spere would must not the look teering mower
I see itselfor.
At that sign they thought the remelled the mum,
but like an wait they mite of ammiral
after things of the body
which children would love
now, not
the forest flowers and hark a path.
The shawr rate in a ruched parts in humstily
his poom her as of the trabs conterlity.

Much more Jabberwocky-esque. If we ease up just a little on this, we get:

A Badicar Flower

The watcher blue says
they would have shapes,
the night dreaming,
a painted nother
tricks me, the wind,
the dayed from the boging feeling
of the histance in his everyness.
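
As promised, here’s roughly what temperature does during sampling: the network’s scores for each possible next character are divided by the temperature before being converted into probabilities, so low temperatures sharpen the distribution and high temperatures flatten it. The score table below is random, standing in for the trained RNN’s output.

```python
# Character-level sampling with temperature, using made-up scores.
import numpy as np

rng = np.random.default_rng(1)
vocab = list("abcdefghijklmnopqrstuvwxyz ")  # character-level vocabulary
logits = rng.normal(size=len(vocab))         # stand-in for the RNN's next-char scores

def sample_char(temperature):
    scaled = logits / temperature            # low temp sharpens, high temp flattens
    probs = np.exp(scaled - scaled.max())    # softmax, shifted for numerical stability
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

# Low temperature: predictable, nearly always the most likely characters.
print("".join(sample_char(0.2) for _ in range(30)))
# High temperature: adventurous, with far more "invented" combinations.
print("".join(sample_char(1.5) for _ in range(30)))
```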

What do you think — poetry prize worthy? While writing poetry is fun, there are, of course, practical applications too. I’m currently working with faculty member Mariola Espinosa on a HathiTrust project called Fighting Fever in the Caribbean: Medicine and Empire, 1650-1902. We have 9.3 million pages of medical journals and need to find references to yellow fever in multiple languages. A trained neural network could look through these quickly and find references that a human might miss. I’m also working on another project with Heidi Renee Aijala looking for references to coal smoke in Victorian literature. Perhaps a neural net could be trained to look for non-keyword references.

While I’m probably not going to put a poet out of work any time soon, you can imagine many real-world uses. There is a tremendous potential for neural networks and other types of machine learning to caption images, transcribe handwriting, translate documents, understand the spoken word, and play chess at the international master level. Perhaps someday it might also write a meaningful poem.


Subreddit Algebra

photo credit: Laura Crossett

Yesterday, FiveThirtyEight featured a fantastic article by Trevor Martin, a Ph.D. student in Computational Biology at Stanford University. Martin’s piece, Dissecting Trump’s Most Rabid Online Following, looked at the toxic communities surrounding Donald Trump, notably r/The_Donald, using a machine learning technique called latent semantic analysis. LSA compares the words and concepts in two sets of documents to show how closely the sets are related. Martin used this process to find the overlap between different subreddits; two subreddits are more similar if the same users comment in both. He then goes further with what he calls “subreddit algebra”: by adding or subtracting subreddits, other related subreddits can be revealed. For example, r/nba + r/minnesota = r/timberwolves. If you’re interested in semantic vector math, there’s a fun Twitter bot that does this kind of algebra several times a day.
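
To see the idea in miniature, here’s a toy sketch of the algebra. Each subreddit becomes a vector of co-commenter counts, and similarity is the cosine of the angle between vectors. The numbers below are invented; the real analysis derives them from millions of Reddit comments.

```python
# Toy subreddit algebra: add two subreddit vectors, then rank the rest
# by cosine similarity to the sum. All counts here are invented.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

subs = {
    "nba":          np.array([9.0, 1.0, 0.5]),
    "minnesota":    np.array([0.5, 8.0, 1.0]),
    "timberwolves": np.array([6.0, 6.0, 1.0]),
    "knitting":     np.array([0.2, 1.0, 9.0]),
}

query = subs["nba"] + subs["minnesota"]
for name in sorted(subs, key=lambda n: -cosine(query, subs[n])):
    if name not in ("nba", "minnesota"):
        print(name, round(cosine(query, subs[name]), 3))
```

Subtraction works the same way: subtract one subreddit’s vector from another’s and rank what remains, which is exactly the r/IowaCity – r/uiowa trick coming up below.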

As with all of FiveThirtyEight’s data stories, the code is freely available for readers to try out themselves. I thought it’d be interesting to take a peek at some subreddits that are a little closer to home (and a whole lot less racist and sexist). If you don’t want to run this yourself, feel free to skip ahead to the results below.

The Setup

If you want to follow along, you’ll need some familiarity with the Google Cloud Platform, since that’s where everything will be run. Specifically, you’ll be using their BigQuery service, a tool for working with massive datasets. You’ll also want to set up a bucket in Google Storage: your outputs will be quite large, and Google doesn’t allow you to export them directly to your local file system. Finally, you’ll need some basic familiarity with the R language and an environment to run R scripts. RStudio is a great tool for this.

First, from your Google Cloud console, create a new project to contain the various tables you’ll be generating. Next, head over to BigQuery and create a new dataset under your project; you could call it something like ‘reddit’. This will hold your results. You’ll be querying the fh-bigquery:reddit_comments dataset, which is made available to everyone by default. Click on the Compose Query button and use this code from the fivethirtyeight GitHub repository, changing line 19 to point to the dataset you just created.
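
If you’d rather script the query than use the web console, the BigQuery Python client works too. Here’s a sketch with a simplified stand-in query; substitute the repository’s real SQL and your own project ID.

```python
# Sketch: querying the public Reddit comments data from Python.
# The SQL below is a simplified stand-in for the fivethirtyeight query.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # your GCP project

sql = """
SELECT subreddit, author, COUNT(*) AS n_comments
FROM `fh-bigquery.reddit_comments.2015_05`
GROUP BY subreddit, author
"""

for row in client.query(sql).result(max_results=10):
    print(row.subreddit, row.author, row.n_comments)
```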

Take the resulting dataset that this query generates and export it to the storage bucket you created. From there, you can download it as a CSV file.

Now, in RStudio, load the vector analysis script from the repository. You’ll need to change the path on line 20 to point to your exported CSV and, of course, change the various subreddits after line 59. Now the fun begins!
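
For the non-R folks, here’s the gist of what that script computes, sketched in Python with pandas. The file path and the column names (t1_subreddit, t2_subreddit, NumOverlaps) are assumptions about the export’s schema; adjust them to match your CSV.

```python
# Build subreddit vectors from pairwise commenter overlaps, then rank
# the nearest neighbors of one subreddit by cosine similarity.
import numpy as np
import pandas as pd

overlaps = pd.read_csv("subreddit_overlaps.csv")  # hypothetical path

# One row per subreddit, one column per "context" subreddit,
# values are counts of commenters the two subreddits share.
M = overlaps.pivot_table(index="t1_subreddit", columns="t2_subreddit",
                         values="NumOverlaps", fill_value=0)

# Normalize each row so cosine similarity reduces to a dot product.
V = M.div(np.sqrt((M ** 2).sum(axis=1)), axis=0)

# The ten subreddits most similar to r/IowaCity, as in the list below.
sims = V @ V.loc["IowaCity"]
print(sims.drop("IowaCity").sort_values(ascending=False).head(10))
```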

The Results

The first obvious search is for similar subreddits to r/IowaCity. What kinds of things do Iowa City folks post about? The higher the number, the more related the subreddits are.

Cedarrapids 0.4627451
Madisonwi 0.4278260
Uiowa 0.4216467
Milwaukee 0.4069844
Homebrewing 0.3992629
Beer 0.3941419
Chicago 0.3916151
Indianapolis 0.3868063
Iowa 0.3850677
Smoking 0.3823774

OK, not surprising. Surrounding cities, plus beer drinking and smoking meats. Iowa City redditors are a chill bunch. What about the uiowa subreddit?

IowaCity 0.4216467
Mazdaspeed6 0.2913548
Swimming 0.2766708
Projectcar 0.2719264
Madisonwi 0.2699070
Cartalk 0.2696891
College 0.2646985
Cars 0.2642775
Civilengineering 0.2637309
Milwaukee 0.2634588

I’ll admit, there is a surprising amount of car discussion going on. Though perhaps it’s not so surprising once you’ve seen some of the cars downtown.

What happens when we take the uiowa out of Iowa City? IowaCity – uiowa =

PoGoIC 0.2447359
Smoking 0.2135908
Homebrewing 0.2053004
BBQ 0.2028280
Grilling 0.1997918
Sousvide 0.1983743
Wine 0.1961068
Cedarrapids 0.1937385
Bourbon 0.1917187
Spicy 0.1895046

Iowa City likes to grill out and drink. And play Pokémon Go. Let’s see what librarians are up to. From r/Libraries:

Librarians 0.6681721
Teachers 0.6463503
Knitting 0.6231567
Parenting 0.6165957
Weddingplanning 0.6118699
Genealogy 0.6118073
Wedding 0.6039990
Femalefashionadvice 0.6024974
Crochet 0.6010991
Vegetarian 0.5975182

Congratulations, librarians, on your marriage and children! And your new fiber arts project. What happens when we remove the wedding planning from librarians’ reddit posts?

Corruption 0.3048685
HistoryofIdeas 0.2961678
CornbreadLiberals 0.2932469
TrueProgressive 0.2924358
Scifi 0.2919506
Media 0.2833257
WarOnComcast 0.2797392
TechNewsToday 0.2789546
InCaseYouMissedIt 0.2789388
Obama 0.2774487

What other interesting algebra problems could we think up? Send me an email and I’ll try to post a few next week. After all, it’s Friday and I’m off to drink beer, grill some vegetarian food, and read sci-fi after I’m done parenting for the day. This weekend might be a good time to pick up knitting.