
As You Wish: An Honest Summary of My Summer Work So Far. . .

I have been tasked with writing an engaging and honest blog about my work as one of the Digital Scholarship and Publishing Studio fellows, but I have a problem. I rarely use the adjective “engaging” to describe my writing. I have, I think, managed to find a creative solution to this particular conundrum. Using something I love, the 1987 adventure movie The Princess Bride, I will review the three biggest challenges I have faced during my summer work at the Studio. I hope you enjoy!

**A Note About Links: When I mention a tool or program I use, I have linked a resource from a digital humanities scholar who provides an introduction or overview of the same tool! Happy clicking!**

Method to My Madness: Which Way’s My Way?

What could I do with my data?

My dissertation project examines American Indian civil and legal protest at earthworks and burial sites in the Midwest at the turn of the twentieth century. Pretty cool, right? Though I do not have many digital sources, I have a plethora of textual sources from professional and amateur archaeologists who observed a handful of American Indian people who camped near earthworks and burial sites. Using Google Forms and Excel, I transcribed correspondence from the papers of Ellison Orr, an amateur archaeologist active in Iowa from 1916 to 1951, and Charles Reuben Keyes, the first director of the Iowa Archaeological Survey (1921-1951). I parsed these letters for geographic data (the To and From addresses) and used categories I created to organize the correspondents. But what to do with this data? What questions could I ask of it? Though I am not a digital methods skeptic, I am not an expert on digital methods, and there are moments when I feel genuinely overwhelmed by the possibilities. Social network analysis, textual analysis, and mapping are just a few of the many methods that digital humanities scholars use with data sets like mine! Much like Fezzik, I found myself wondering, “Which way’s my way?”
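In spreadsheet terms, each letter became one row. Here is a rough sketch of a single record in Python; the field names, places, and category label below are illustrative inventions, not my actual spreadsheet schema:

```python
# A hypothetical record for one letter; every value here is invented
# for illustration and is not drawn from my actual data set.
letter = {
    "sender": "Ellison Orr",
    "recipient": "Charles Reuben Keyes",
    "from_place": "Waukon, Iowa",        # geographic data: the From address
    "to_place": "Mount Vernon, Iowa",    # geographic data: the To address
    "category": "amateur archaeologist"  # one of my organizing categories
}

def places(record):
    """Pull out the two geographic fields parsed from each letter."""
    return record["from_place"], record["to_place"]
```

With the letters in this shape, the geographic fields can be pulled out for any mapping tool that accepts tabular input.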

Collaboration saved my project and my sanity. Nikki White and Rob Shepard, my Studio points of contact, reviewed my data and made some recommendations on the kinds of visualizations I could create. Their expertise has been invaluable. Unfortunately, the data I have would not make an analytically valuable social network visualization. But, I did have interesting geographic data and perhaps with some creativity and effort I could create a dynamic and interactive map!  

Data Cleaning: Am I Going Mad?

I asked myself this soooo many times while cleaning my data.

Armed with a mapping method, I moved on to cleaning my data. I used Excel pivot tables to move through my columns, and I opened and closed OpenRefine to clean my data. Even though I did not crowdsource my data, it was still quite messy. Some messiness was human error: capitalization, spelling, spacing, and the like. But I was driven to the brink of insanity when forced to grapple with inconsistencies that had nothing to do with spelling or capitalization. Rather, I had to make and document choices about the data in my spreadsheets, because my questions changed between when I started gathering the data (last year) and now. Column after column and row after row, I made decisions about each cell. My method of cleaning data is not new or revolutionary. Digital humanities scholars have been publishing excellent scholarship on working with and cleaning data for decades. Three excellent pieces from humanities scholars on the challenges of working with data are:

**If you are a digital novice and the term “data” makes you feel queasy, I would highly recommend Daniel Rosenberg’s “Data Before the Fact”**
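For the simplest kind of human-error messiness mentioned above (stray spacing, inconsistent capitalization), a few lines of code can stand in for what OpenRefine’s clustering does. This is a toy sketch, not my actual workflow:

```python
def normalize(value):
    """Fix the human-error messiness described above:
    runs of whitespace and inconsistent capitalization."""
    collapsed = " ".join(value.split())  # collapse stray spacing
    return collapsed.title()             # pick one consistent capitalization

# Three spellings of the same correspondent reduce to one:
variants = ["ellison  orr", " Ellison Orr", "ELLISON ORR"]
cleaned = {normalize(v) for v in variants}
```

The harder inconsistencies, the ones that come from changing research questions rather than typos, still have to be decided cell by cell; no function can make those choices for you.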

In Lieu of a Conclusion: Where I am Going Next


Armed with clean data, I am very excited to work on three visualizations that I hope will help me answer the research question at the heart of my dissertation. For the sake of brevity, I will only describe the first visualization I hope to create. Primary source research has revealed the activism of Emma Big Bear (1869-1968). Emma was a Ho-Chunk woman who resumed residency near the mounds in northeastern Iowa from 1917 to her death in 1968. 

Emma Big Bear returned to northeastern Iowa from the Winnebago Reservation in Nebraska around 1917. She and her husband Henry Holt lived in a wickiup, spoke only Ho-Chunk, and lived near the earthworks in northeastern Iowa for almost five decades. Simultaneously with Emma’s occupation of the site, amateur archaeologists opened earthworks and hunted for artifacts. I hope to visualize Emma’s campsites alongside a map of the sites that artifact collectors frequently targeted. Artifact collectors described the sites they excavated when they wrote to Ellison Orr (mentioned above). Did Emma and Henry ever cross paths with the most active amateur archaeologists in the region? Did she and Henry ever observe the excavations of amateur archaeologists like Ellison Orr, Dale Henning, or Paul Rowe? Using oral histories and amateur archaeologist correspondence, I hope to create an interactive public exhibit using ArcGIS Online based on my data.

Wish me luck!

About Me: My name is Mary Wise and I am a PhD candidate in the history department. My dissertation examines the history of American Indian activism at earthworks and effigy mounds in the Midwest from 1890 to 1950.

Neural Network Poetry


As you may know, April is National Poetry Month, an annual series of events from the Academy of American Poets to support the appreciation of American poetry. If you’re looking for great book-length collections of poems, you might be interested in the Iowa Poetry Prize winners. Many of the previous years’ winners are available in PDF form at Iowa Research Online. What you may not know is that April is also National Poetry Generation Month, an annual tradition in which programmers and creative coders spend the month writing code that generates poetry.

In honor of this time of year, I thought I’d take a look at the Iowa Poetry Prize winners through code. There are many methods for analyzing and generating natural language, but one system that has received a lot of attention recently is neural networks. A neural network is a large collection of artificial neurons based very loosely on a biological brain. These neurons exist in layers that perform statistical calculations and affect the state of other connected neurons. A neural network differs from other computational models in that no knowledge is hard-coded and controlled by elaborate conditional statements (if this, then that). Rather, neural networks learn to solve tasks by observing data and building functions that produce sensible outputs when given new data they have never seen before. The uses for such a system include image and speech recognition, classification problems, and many forms of prediction and decision making. For example, a neural net could be trained to detect images of cats by observing tens of thousands of labeled images of cats. Google recently launched a project that uses this technique to match your doodles with professional drawings.
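To make the idea concrete, here is what a single artificial neuron computes, sketched in Python. The input values, weights, and bias below are arbitrary numbers I picked for illustration; training a network amounts to nudging the weights and biases of many such neurons until the whole stack produces useful outputs:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of its inputs squashed
    through a sigmoid activation into a value between 0 and 1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Arbitrary example values; training would adjust weights and bias.
output = neuron([0.5, -1.0], weights=[0.8, 0.2], bias=0.1)
```

A real network chains thousands of these into layers, with each layer’s outputs feeding the next layer’s inputs.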

What happens when we train an artificial intelligence to write the English language having only read Iowa Poetry Prize winners? Let’s find out!

To start, I downloaded all of the IPP winners from Iowa Research Online, extracted the poems as plain text, and concatenated them into a single text file named poems.txt. This served as the training set. Next, I set up this Torch-based Docker container implementation of a recurrent neural network based on work by Andrej Karpathy and Justin Johnson. It was tempting to spin up a Google Cloud VM with an attached GPU, since these types of machine learning tasks run much faster on a graphics processing unit with CUDA, but it’s also quite expensive at 75 cents per hour. Once I had it working, I started the preprocessing and training, which took about 16 hours to complete.
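The concatenation step is the simplest part of the pipeline. A minimal Python sketch, assuming the extracted poems sit in a folder of .txt files (and with a hypothetical helper name, since the actual extraction involved more cleanup than this), might look like:

```python
from pathlib import Path

def build_training_set(poem_dir, out_file):
    """Concatenate every plain-text poem in poem_dir into one file,
    the single training file the RNN's preprocessing step expects.
    (build_training_set is a hypothetical helper name.)"""
    poems = [p.read_text(encoding="utf-8")
             for p in sorted(Path(poem_dir).glob("*.txt"))]
    Path(out_file).write_text("\n\n".join(poems), encoding="utf-8")
    return out_file
```

The resulting file is what the container’s preprocessing script turns into training tensors.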

After a lot of experimentation to create useful training models and keep the network from overfitting or underfitting the data, I had something acceptable and so began sampling output. One parameter of sampling that was fun to play with was the “temperature” of the sample. A lower temperature produced output that was much more predictable and less error-prone, while a higher temperature was much more inventive but riddled with mistakes. I decided to split the difference and start at 0.5. Here’s the first poem.
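Temperature works by rescaling the network’s raw scores before they are turned into probabilities. A small Python sketch, with made-up scores for three candidate characters, shows why low temperatures play it safe:

```python
import math, random

def sample(logits, temperature):
    """Divide the network's raw scores by the temperature, softmax them
    into probabilities, and draw one character index."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    index = random.choices(range(len(probs)), weights=probs)[0]
    return index, probs

# Made-up scores for three candidate characters, sampled cool and hot:
_, cool = sample([2.0, 1.0, 0.1], temperature=0.5)
_, hot = sample([2.0, 1.0, 0.1], temperature=2.0)
```

At low temperature nearly all of the probability piles onto the top-scoring character; at high temperature the distribution flattens and unlikely characters get real chances, which is exactly where the invented words come from.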

Speritas Of The Stars

Morning comes of the sun
to the thin world is a star of her light.

The sheet and the body of parts
of the flame is a light, the body
sees of the wars beautiful on the street.
The sun, the stars of the sound, and desire,
and a man could love the streets.

The single shiller of light,
and the single stranger falls countal.
Father and she were the sutters of the body
instraining to the complete
window of light, still.

You’ll notice a few words in this poem that don’t actually exist in English. That’s because this RNN operates at the character level, not the word level. It has to learn, from scratch, how to write English. It starts with random strings of letters and slowly, over many iterations, learns about spaces, proper punctuation, and finally readable words. The higher the sampling temperature, the more invented words. Let’s look at a “hot” poem.
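The character-level vocabulary is easy to picture. A quick Python sketch of the mapping the network predicts over:

```python
def char_vocab(text):
    """Map every distinct character in the training text to an integer
    index -- the entire 'vocabulary' a character-level RNN works with."""
    return {c: i for i, c in enumerate(sorted(set(text)))}

vocab = char_vocab("the sun, the stars")
encoded = [vocab[c] for c in "star"]  # the index sequence the model sees
```

Letters, spaces, and punctuation are all just indices to the model, which is why it is free to assemble character sequences no dictionary contains.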

Pelies, One Yighter

The shadows just plance croved
I am one
its funlet from the wind
staskaccus, gring of detches of hearts face eashog
what wing to the streed in the resert of change, a glince
the life.
She read on his fill bathered, a hand the
marks
with beautiful, casty, stery, kooms, in one father

something the mouth cold leaves.
A night and no one is a woman; you green her

My spere would must not the look teering mower
I see itselfor.
At that sign they thought the remelled the mum,
but like an wait they mite of ammiral
after things of the body
which children would love
now, not
the forest flowers and hark a path.
The shawr rate in a ruched parts in humstily
his poom her as of the trabs conterlity.

Much more Jabberwockyesque. If we ease up just a little on this we get

A Badicar Flower

The watcher blue says
they would have shapes,
the night dreaming,
a painted nother
tricks me, the wind,
the dayed from the boging feeling
of the histance in his everyness.

What do you think — poetry prize worthy? While writing poetry is fun, there are, of course, practical applications too. I’m currently working with faculty member Mariola Espinosa on a HathiTrust project called Fighting Fever in the Caribbean: Medicine and Empire, 1650-1902. We have 9.3 million pages of medical journals and need to find references to yellow fever in multiple languages. A trained neural network could look through these quickly and find references that a human might miss. I’m also working on another project with Heidi Renee Aijala looking for references to coal smoke in Victorian literature. Perhaps a neural net could be trained to look for non-keyword references.

While I’m probably not going to put a poet out of work any time soon, you can imagine many real-world uses. There is a tremendous potential for neural networks and other types of machine learning to caption images, transcribe handwriting, translate documents, understand the spoken word, and play chess at the international master level. Perhaps someday it might also write a meaningful poem.

Subreddit Algebra


Yesterday, FiveThirtyEight featured a fantastic article by Trevor Martin, a Ph.D. student in computational biology at Stanford University. Martin’s piece, Dissecting Trump’s Most Rabid Online Following, looked at the toxic communities surrounding Donald Trump, notably r/The_Donald, using a machine learning technique called latent semantic analysis. LSA compares the words and concepts in two sets of documents and measures how closely they are related. Martin used this process to find the overlap between different subreddits; two subreddits are more similar if the same users comment in both. He then goes further with what he calls “subreddit algebra”: by adding or subtracting subreddits, other related subreddits can be revealed. For example, r/nba + r/minnesota = r/timberwolves. If you’re interested in semantic vector math, there’s a fun Twitter bot that does this algebra several times per day.
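Stripped of the Reddit-scale data, the algebra reduces to adding vectors and ranking by cosine similarity. Here is a toy Python sketch with entirely made-up “co-commenter” vectors (the real analysis derives these from millions of comments):

```python
import math

def cosine(u, v):
    """Similarity between two subreddit vectors: 1.0 means identical
    commenter profiles, 0.0 means no overlap at all."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Entirely invented profiles over three imaginary commenter groups:
# [basketball fans, Minnesota locals, Iowa locals]
nba = [1.0, 0.2, 0.1]
minnesota = [0.2, 1.0, 0.3]
timberwolves = [0.9, 0.9, 0.1]
iowacity = [0.1, 0.2, 1.0]

# "Subreddit algebra": r/nba + r/minnesota, then rank by similarity.
combo = [a + b for a, b in zip(nba, minnesota)]
```

With these toy numbers, the combined nba + minnesota vector lands closest to timberwolves, which is the whole trick behind the algebra.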

As with all FiveThirtyEight’s data stories, they make their code freely available for readers to try out themselves. I thought it’d be interesting to take a peek at some subreddits that are a little closer to home (and a whole lot less racist and sexist). If you don’t want to run this yourself, feel free to skip to the results below.

The Setup

If you want to follow along, you’ll need some familiarity with the Google Cloud Platform since that’s where everything will be run. Specifically, you’ll be using their BigQuery service, which is a tool for working with massive datasets. You’ll also want to set up a bucket in Google Storage. Your outputs will be quite large and they don’t allow you to export directly to your local file system. Finally, you’ll need some basic familiarity with the R language and an environment to run R scripts. RStudio is a great tool for this.

First, from your Google Cloud console, create a new project to contain the various tables you’ll be generating. Next, head over to BigQuery and create a new dataset under your project. You could call this something like ‘reddit’. This will hold your results. You’ll be querying against the fh-bigquery:reddit_comments dataset that is made available to you by default. Click on the Compose Query button and use this code from the fivethirtyeight GitHub repository. Change line 19 to the path of your own dataset you just created.

Take the resulting dataset that this query generates and export it to the storage bucket you created. From there, you can download it as a CSV file.

Now, in RStudio, load the vector analysis script from the repository. You’ll need to change the path to the CSV file on line 20 to your exported CSV. And, of course, change the various subreddits after line 59. Now the fun begins!

The Results

The first obvious search is for similar subreddits to r/IowaCity. What kinds of things do Iowa City folks post about? The higher the number, the more related the subreddits are.

Cedarrapids 0.4627451
Madisonwi 0.4278260
Uiowa 0.4216467
Milwaukee 0.4069844
Homebrewing 0.3992629
Beer 0.3941419
Chicago 0.3916151
Indianapolis 0.3868063
Iowa 0.3850677
Smoking 0.3823774

Ok, not surprising. Surrounding cities plus beer drinking and smoking meats. Iowa City redditors are a chill bunch. What about the uiowa subreddit?

IowaCity 0.4216467
Mazdaspeed6 0.2913548
Swimming 0.2766708
Projectcar 0.2719264
Madisonwi 0.2699070
Cartalk 0.2696891
College 0.2646985
Cars 0.2642775
Civilengineering 0.2637309
Milwaukee 0.2634588

I’ll admit, there is a surprising amount of car discussion going on. Perhaps less surprising when you see some of the cars downtown.

What happens when we take the uiowa out of Iowa City? IowaCity – uiowa =

PoGoIC 0.2447359
Smoking 0.2135908
Homebrewing 0.2053004
BBQ 0.2028280
Grilling 0.1997918
Sousvide 0.1983743
Wine 0.1961068
Cedarrapids 0.1937385
Bourbon 0.1917187
Spicy 0.1895046

Iowa City likes to grill out and drink. And play Pokémon Go. Let’s see what librarians are up to. From r/Libraries:

Librarians 0.6681721
Teachers 0.6463503
Knitting 0.6231567
Parenting 0.6165957
Weddingplanning 0.6118699
Genealogy 0.6118073
Wedding 0.6039990
Femalefashionadvice 0.6024974
Crochet 0.6010991
Vegetarian 0.5975182

Congratulations, librarians, on your marriage and children! And your new fiber arts project. What happens when we remove the wedding planning from librarians’ reddit posts?

Corruption 0.3048685
HistoryofIdeas 0.2961678
CornbreadLiberals 0.2932469
TrueProgressive 0.2924358
Scifi 0.2919506
Media 0.2833257
WarOnComcast 0.2797392
TechNewsToday 0.2789546
InCaseYouMissedIt 0.2789388
Obama 0.2774487

What other interesting algebra problems could we think up? Send me an email and I’ll try to post a few next week. After all, it’s Friday and I’m off to drink beer, grill some vegetarian food, and read sci-fi after I’m done parenting for the day. This weekend might be a good time to pick up knitting.