
As You Wish: An Honest Summary of My Summer Work So Far. . .

I have been tasked with writing an engaging and honest blog about my work as one of the Digital Scholarship and Publishing Studio fellows, but I have a problem. I rarely use the adjective “engaging” to describe my writing. I have, I think, managed to find a creative solution to this particular conundrum. Using something I love, the 1987 adventure movie The Princess Bride, I will review the three biggest challenges I have faced during my summer work at the Studio. I hope you enjoy!

**A Note About Links: When I mention a tool or program I use, I have linked a resource from a digital humanities scholar who provides an introduction or overview of the same tool! Happy clicking!**

Method to My Madness: Which Way’s My Way?

What could I do with my data?

My dissertation project examines American Indian civil and legal protest at earthworks and burial sites in the Midwest at the turn of the twentieth century. Pretty cool, right? Though I do not have many digital sources, I have a plethora of textual sources from professional and amateur archaeologists who observed a handful of American Indian people who camped near earthworks and burial sites. Using Google Forms and Excel, I transcribed correspondence from the papers of Ellison Orr, an amateur archaeologist active in Iowa from 1916 to 1951, and Charles Reuben Keyes, the first director of the Iowa Archaeological Survey (1921-1951). I parsed these letters for geographic data (the To and From addresses) and used categories I created to organize the correspondents. But what to do with this data? What questions could I ask of it? Though I am not a digital methods skeptic, I am not an expert on digital methods, and there are moments when I feel genuinely overwhelmed by the possibilities. Social network analysis, textual analysis, and mapping are just a few of the many methods that digital humanities scholars use with data sets like mine! Much like Fezzik, I found myself wondering, “Which way’s my way?”
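In spreadsheet terms, each letter became one row. Here is a rough sketch of a single record in Python; the field names, places, and category label below are illustrative inventions, not my actual spreadsheet schema:

```python
# A hypothetical record for one letter; every value here is invented
# for illustration and is not drawn from my actual data set.
letter = {
    "sender": "Ellison Orr",
    "recipient": "Charles Reuben Keyes",
    "from_place": "Waukon, Iowa",        # geographic data: the From address
    "to_place": "Mount Vernon, Iowa",    # geographic data: the To address
    "category": "amateur archaeologist"  # one of my organizing categories
}

def places(record):
    """Pull out the two geographic fields parsed from each letter."""
    return record["from_place"], record["to_place"]
```

With the letters in this shape, the geographic fields can be pulled out for any mapping tool that accepts tabular input.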

Collaboration saved my project and my sanity. Nikki White and Rob Shepard, my Studio points of contact, reviewed my data and made some recommendations on the kinds of visualizations I could create. Their expertise has been invaluable. Unfortunately, the data I have would not make an analytically valuable social network visualization. But, I did have interesting geographic data and perhaps with some creativity and effort I could create a dynamic and interactive map!  

Data Cleaning: Am I Going Mad?

I asked myself this soooo many times while cleaning my data.

Armed with a mapping method, I moved on to cleaning my data. I used Excel pivot tables to move through my columns, and I opened and closed OpenRefine to clean my data. Even though I did not crowdsource my data, it was still quite messy. Some messiness was human error: capitalization, spelling, spacing, and the like. But I was driven to the brink of insanity when forced to grapple with inconsistencies that had nothing to do with spelling or capitalization. Rather, I had to make and document choices about the data in my spreadsheets, because my questions changed between when I started gathering the data (last year) and now. Column after column and row after row, I made decisions about each cell. My method of cleaning data is not new or revolutionary. Digital humanities scholars have been publishing excellent scholarship on working with and cleaning data for decades. Three excellent pieces from humanities scholars on the challenges of working with data are:

**If you are a digital novice and the term “data” makes you feel queasy, I would highly recommend Daniel Rosenberg’s “Data Before the Fact”**
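For the simplest kind of human-error messiness mentioned above (stray spacing, inconsistent capitalization), a few lines of code can stand in for what OpenRefine’s clustering does. This is a toy sketch, not my actual workflow:

```python
def normalize(value):
    """Fix the human-error messiness described above:
    runs of whitespace and inconsistent capitalization."""
    collapsed = " ".join(value.split())  # collapse stray spacing
    return collapsed.title()             # pick one consistent capitalization

# Three spellings of the same correspondent reduce to one:
variants = ["ellison  orr", " Ellison Orr", "ELLISON ORR"]
cleaned = {normalize(v) for v in variants}
```

The harder inconsistencies, the ones that come from changing research questions rather than typos, still have to be decided cell by cell; no function can make those choices for you.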

In Lieu of a Conclusion: Where I am Going Next


Armed with clean data, I am very excited to work on three visualizations that I hope will help me answer the research question at the heart of my dissertation. For the sake of brevity, I will only describe the first visualization I hope to create. Primary source research has revealed the activism of Emma Big Bear (1869-1968). Emma was a Ho-Chunk woman who resumed residency near the mounds in northeastern Iowa from 1917 to her death in 1968. 

Emma Big Bear returned to northeastern Iowa from the Winnebago Reservation in Nebraska around 1917. She and her husband Henry Holt lived in a wickiup, spoke only Ho-Chunk, and lived near the earthworks in northeastern Iowa for almost five decades. Simultaneously with Emma’s occupation of the site, amateur archaeologists opened earthworks and hunted for artifacts. I hope to visualize Emma’s campsites alongside a map of the sites that artifact collectors frequently targeted. Artifact collectors described the sites they excavated when they wrote to Ellison Orr (mentioned above). Did Emma and Henry ever cross paths with the most active amateur archaeologists in the region? Did she and Henry ever observe the excavations of amateur archaeologists like Ellison Orr, Dale Henning, or Paul Rowe? Using oral histories and amateur archaeologist correspondence, I hope to create an interactive public exhibit using ArcGIS Online based on my data.

Wish me luck!

About Me: My name is Mary Wise and I am a PhD candidate in the history department. My dissertation examines the history of American Indian activism at earthworks and effigy mounds in the Midwest from 1890 to 1950.

Neural Network Poetry


As you may know, April is National Poetry Month, an annual series of events from the Academy of American Poets to support the appreciation of American poetry. If you’re looking for great book-length collections of poems, you might be interested in the Iowa Poetry Prize winners. Many of the previous years’ winners are available in PDF form at Iowa Research Online. What you may not know is that April is also National Poetry Generation Month, an annual tradition in which programmers and creative coders spend the month writing code that generates poetry.

In honor of this time of year, I thought I’d take a look at the Iowa Poetry Prize winners through code. There are many methods for analyzing and generating natural language, but one system that has received a lot of attention recently is neural networks. A neural network is a large collection of artificial neurons based very loosely on a biological brain. These neurons exist in layers that perform statistical calculations and affect the state of other connected neurons. A neural network differs from other computational models in that no knowledge is hard-coded and controlled by elaborate conditional statements (if this, then that). Rather, neural networks learn to solve tasks by observing data and building functions that produce sensible outputs when given new data they have never seen before. The uses for such a system include image and speech recognition, classification problems, and many forms of prediction and decision making. For example, a neural net could be trained to detect images of cats by observing tens of thousands of labeled images of cats. Google recently launched a project that uses this technique to match your doodles with professional drawings.
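To make the idea concrete, here is what a single artificial neuron computes, sketched in Python. The input values, weights, and bias below are arbitrary numbers I picked for illustration; training a network amounts to nudging the weights and biases of many such neurons until the whole stack produces useful outputs:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of its inputs squashed
    through a sigmoid activation into a value between 0 and 1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Arbitrary example values; training would adjust weights and bias.
output = neuron([0.5, -1.0], weights=[0.8, 0.2], bias=0.1)
```

A real network chains thousands of these into layers, with each layer’s outputs feeding the next layer’s inputs.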

What happens when we train an artificial intelligence to write the English language having only read Iowa Poetry Prize winners? Let’s find out!

To start, I downloaded all of the IPP winners from Iowa Research Online, extracted the poems as plain text, and concatenated them into a single text file named poems.txt. This served as the training set. Next, I set up this Torch-based Docker container implementation of a recurrent neural network based on work by Andrej Karpathy and Justin Johnson. It was tempting to spin up a Google Cloud VM with an attached GPU, since these types of machine learning tasks run much faster on a graphics processing unit with CUDA, but it’s also quite expensive at 75 cents per hour. Once I had it working, I started the preprocessing and training, which took about 16 hours to complete.
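The concatenation step is the simplest part of the pipeline. A minimal Python sketch, assuming the extracted poems sit in a folder of .txt files (and with a hypothetical helper name, since the actual extraction involved more cleanup than this), might look like:

```python
from pathlib import Path

def build_training_set(poem_dir, out_file):
    """Concatenate every plain-text poem in poem_dir into one file,
    the single training file the RNN's preprocessing step expects.
    (build_training_set is a hypothetical helper name.)"""
    poems = [p.read_text(encoding="utf-8")
             for p in sorted(Path(poem_dir).glob("*.txt"))]
    Path(out_file).write_text("\n\n".join(poems), encoding="utf-8")
    return out_file
```

The resulting file is what the container’s preprocessing script turns into training tensors.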

After a lot of experimentation to create useful training models and keep the network from overfitting or underfitting the data, I had something acceptable and so began sampling output. One parameter of sampling that was fun to play with was the “temperature” of the sample. A lower temperature produced output that was much more predictable and less error-prone, while a higher temperature was much more inventive but riddled with mistakes. I decided to split the difference and start at 0.5. Here’s the first poem.
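Temperature works by rescaling the network’s raw scores before they are turned into probabilities. A small Python sketch, with made-up scores for three candidate characters, shows why low temperatures play it safe:

```python
import math, random

def sample(logits, temperature):
    """Divide the network's raw scores by the temperature, softmax them
    into probabilities, and draw one character index."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    index = random.choices(range(len(probs)), weights=probs)[0]
    return index, probs

# Made-up scores for three candidate characters, sampled cool and hot:
_, cool = sample([2.0, 1.0, 0.1], temperature=0.5)
_, hot = sample([2.0, 1.0, 0.1], temperature=2.0)
```

At low temperature nearly all of the probability piles onto the top-scoring character; at high temperature the distribution flattens and unlikely characters get real chances, which is exactly where the invented words come from.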

Speritas Of The Stars

Morning comes of the sun
to the thin world is a star of her light.

The sheet and the body of parts
of the flame is a light, the body
sees of the wars beautiful on the street.
The sun, the stars of the sound, and desire,
and a man could love the streets.

The single shiller of light,
and the single stranger falls countal.
Father and she were the sutters of the body
instraining to the complete
window of light, still.

You’ll notice a few words in this poem that don’t actually exist in English. That’s because this RNN operates at the character level, not the word level. It has to learn, from scratch, how to write English. It starts with random strings of letters and slowly, over many iterations, learns about spaces, proper punctuation, and finally readable words. The higher the sampling temperature, the more invented words. Let’s look at a “hot” poem.
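The character-level vocabulary is easy to picture. A quick Python sketch of the mapping the network predicts over:

```python
def char_vocab(text):
    """Map every distinct character in the training text to an integer
    index -- the entire 'vocabulary' a character-level RNN works with."""
    return {c: i for i, c in enumerate(sorted(set(text)))}

vocab = char_vocab("the sun, the stars")
encoded = [vocab[c] for c in "star"]  # the index sequence the model sees
```

Letters, spaces, and punctuation are all just indices to the model, which is why it is free to assemble character sequences no dictionary contains.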

Pelies, One Yighter

The shadows just plance croved
I am one
its funlet from the wind
staskaccus, gring of detches of hearts face eashog
what wing to the streed in the resert of change, a glince
the life.
She read on his fill bathered, a hand the
marks
with beautiful, casty, stery, kooms, in one father

something the mouth cold leaves.
A night and no one is a woman; you green her

My spere would must not the look teering mower
I see itselfor.
At that sign they thought the remelled the mum,
but like an wait they mite of ammiral
after things of the body
which children would love
now, not
the forest flowers and hark a path.
The shawr rate in a ruched parts in humstily
his poom her as of the trabs conterlity.

Much more Jabberwockyesque. If we ease up just a little on this we get

A Badicar Flower

The watcher blue says
they would have shapes,
the night dreaming,
a painted nother
tricks me, the wind,
the dayed from the boging feeling
of the histance in his everyness.

What do you think — poetry prize worthy? While writing poetry is fun, there are, of course, practical applications too. I’m currently working with faculty member Mariola Espinosa on a HathiTrust project called Fighting Fever in the Caribbean: Medicine and Empire, 1650-1902. We have 9.3 million pages of medical journals and need to find references to yellow fever in multiple languages. A trained neural network could look through these quickly and find references that a human might miss. I’m also working on another project with Heidi Renee Aijala looking for references to coal smoke in Victorian literature. Perhaps a neural net could be trained to look for non-keyword references.

While I’m probably not going to put a poet out of work any time soon, you can imagine many real-world uses. There is a tremendous potential for neural networks and other types of machine learning to caption images, transcribe handwriting, translate documents, understand the spoken word, and play chess at the international master level. Perhaps someday it might also write a meaningful poem.

Subreddit Algebra


Yesterday, FiveThirtyEight featured a fantastic article by Trevor Martin, a Ph.D. student in computational biology at Stanford University. Martin’s piece, Dissecting Trump’s Most Rabid Online Following, looked at the toxic communities surrounding Donald Trump, notably r/The_Donald, using a machine learning technique called latent semantic analysis. LSA compares the words and concepts in two sets of documents and measures how closely they are related. Martin used this process to find the overlap between different subreddits; two subreddits are more similar if the same users comment in both. He then goes further with what he calls “subreddit algebra”: by adding or subtracting subreddits, other related subreddits can be revealed. For example, r/nba + r/minnesota = r/timberwolves. If you’re interested in semantic vector math, there’s a fun Twitter bot that does this algebra several times per day.
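Stripped of the Reddit-scale data, the algebra reduces to adding vectors and ranking by cosine similarity. Here is a toy Python sketch with entirely made-up “co-commenter” vectors (the real analysis derives these from millions of comments):

```python
import math

def cosine(u, v):
    """Similarity between two subreddit vectors: 1.0 means identical
    commenter profiles, 0.0 means no overlap at all."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Entirely invented profiles over three imaginary commenter groups:
# [basketball fans, Minnesota locals, Iowa locals]
nba = [1.0, 0.2, 0.1]
minnesota = [0.2, 1.0, 0.3]
timberwolves = [0.9, 0.9, 0.1]
iowacity = [0.1, 0.2, 1.0]

# "Subreddit algebra": r/nba + r/minnesota, then rank by similarity.
combo = [a + b for a, b in zip(nba, minnesota)]
```

With these toy numbers, the combined nba + minnesota vector lands closest to timberwolves, which is the whole trick behind the algebra.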

As with all FiveThirtyEight’s data stories, they make their code freely available for readers to try out themselves. I thought it’d be interesting to take a peek at some subreddits that are a little closer to home (and a whole lot less racist and sexist). If you don’t want to run this yourself, feel free to skip to the results below.

The Setup

If you want to follow along, you’ll need some familiarity with the Google Cloud Platform since that’s where everything will be run. Specifically, you’ll be using their BigQuery service, which is a tool for working with massive datasets. You’ll also want to set up a bucket in Google Storage. Your outputs will be quite large and they don’t allow you to export directly to your local file system. Finally, you’ll need some basic familiarity with the R language and an environment to run R scripts. RStudio is a great tool for this.

First, from your Google Cloud console, create a new project to contain the various tables you’ll be generating. Next, head over to BigQuery and create a new dataset under your project. You could call this something like ‘reddit’. This will hold your results. You’ll be querying against the fh-bigquery:reddit_comments dataset that is made available to you by default. Click on the Compose Query button and use this code from the fivethirtyeight GitHub repository. Change line 19 to the path of your own dataset you just created.

Take the resulting dataset that this query generates and export it to the storage bucket you created. From there, you can download it as a CSV file.

Now, in RStudio, load the vector analysis script from the repository. You’ll need to change the path to the CSV file on line 20 to your exported CSV. And, of course, change the various subreddits after line 59. Now the fun begins!

The Results

The first obvious search is for similar subreddits to r/IowaCity. What kinds of things do Iowa City folks post about? The higher the number, the more related the subreddits are.

Cedarrapids 0.4627451
Madisonwi 0.4278260
Uiowa 0.4216467
Milwaukee 0.4069844
Homebrewing 0.3992629
Beer 0.3941419
Chicago 0.3916151
Indianapolis 0.3868063
Iowa 0.3850677
Smoking 0.3823774

Ok, not surprising. Surrounding cities plus beer drinking and smoking meats. Iowa City redditors are a chill bunch. What about the uiowa subreddit?

IowaCity 0.4216467
Mazdaspeed6 0.2913548
Swimming 0.2766708
Projectcar 0.2719264
Madisonwi 0.2699070
Cartalk 0.2696891
College 0.2646985
Cars 0.2642775
Civilengineering 0.2637309
Milwaukee 0.2634588

I’ll admit, there is a surprising amount of car discussion going on. Perhaps less surprising when you see some of the cars downtown.

What happens when we take the uiowa out of Iowa City? IowaCity – uiowa =

PoGoIC 0.2447359
Smoking 0.2135908
Homebrewing 0.2053004
BBQ 0.2028280
Grilling 0.1997918
Sousvide 0.1983743
Wine 0.1961068
Cedarrapids 0.1937385
Bourbon 0.1917187
Spicy 0.1895046

Iowa City likes to grill out and drink. And play Pokémon Go. Let’s see what librarians are up to. From r/Libraries:

Librarians 0.6681721
Teachers 0.6463503
Knitting 0.6231567
Parenting 0.6165957
Weddingplanning 0.6118699
Genealogy 0.6118073
Wedding 0.6039990
Femalefashionadvice 0.6024974
Crochet 0.6010991
Vegetarian 0.5975182

Congratulations, librarians, on your marriage and children! And your new fiber arts project. What happens when we remove the wedding planning from librarians’ reddit posts?

Corruption 0.3048685
HistoryofIdeas 0.2961678
CornbreadLiberals 0.2932469
TrueProgressive 0.2924358
Scifi 0.2919506
Media 0.2833257
WarOnComcast 0.2797392
TechNewsToday 0.2789546
InCaseYouMissedIt 0.2789388
Obama 0.2774487

What other interesting algebra problems could we think up? Send me an email and I’ll try to post a few next week. After all, it’s Friday and I’m off to drink beer, grill some vegetarian food, and read sci-fi after I’m done parenting for the day. This weekend might be a good time to pick up knitting.