The DIY History transcription API

Since the launch of the Civil War Diaries & Transcription Project, the goal of DIY History has been to promote the University of Iowa Libraries' digital collections. Part of this mission includes making the trove of transcriptions from handwritten diaries, manuscripts, and letters widely available to researchers for use in their work.

At the time of this writing, there are 61,987 transcribed pages spanning nine public collections on such wide-ranging historical topics as pioneer diaries, war letters, culinary manuscripts and recipes, railroadiana, and specimen cards. Each page has been transcribed and checked by one or more volunteers from around the world. This kind of crowdsourcing effort goes far beyond what our staff alone could catalog, display, and transcribe among the library's handwritten items. A project such as DIY History invites library users to do more than visit and browse: it asks them to actively participate in and transform the archive by typing what they read.

Participatory Archives

While DIY History items have always been indexed by search engines, there hasn’t been a robust method of making the entire body of transcription text available to researchers. Previously, a scholar interested in mining the entire collection of cookbooks from the Szathmary collection, for example, would need to perform hundreds of tedious queries using simple keywords. A better method of making our transcription data available was needed.

Early in the Fall 2015 semester, we debuted the DIY History Application Programming Interface. This API provides researchers with the ability to access much more metadata on each file than is displayed on the DIY History website. The API makes DIY History a platform on which to build applications, research projects, and other potentially innovative tools using the transcription data provided by our volunteers.

Read more about APIs

This opens up possibilities for text analysis that were not practical before, even at the web scale of crowdsourcing. Programmatic access to the item-level and file-level metadata means researchers can use machine learning techniques to extract and analyze named entities. Just as it isn't feasible for a small staff to transcribe 100,000 scanned pages, it isn't feasible for a small crowd to tag an arbitrary number of entities each time a new research question arises. For a properly trained machine, however, this is a realistic task.

To demonstrate this idea, here is a simple entity extractor using DIY History transcription text, an NER module provided by MonkeyLearn, and Google’s Geocoding service to return latitude and longitude values for extracted place names.
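The demo's pipeline can be sketched in a few lines of Python. Everything below is illustrative rather than a copy of the demo's actual code: the entity tag names and response shapes are assumptions, and you should consult the MonkeyLearn and Google Geocoding documentation for the real endpoints, parameters, and payload formats.

```python
import json
import urllib.parse
import urllib.request

# The Google Geocoding web service endpoint; requires an API key.
GOOGLE_GEOCODE = "https://maps.googleapis.com/maps/api/geocode/json"

def places_from_entities(entities):
    """Keep only the entities the NER model tagged as places.
    The 'entity'/'tag' field names here are assumptions about the
    NER service's response shape."""
    return [e["entity"] for e in entities if e.get("tag") in ("LOCATION", "PLACE")]

def latlng_from_geocode(response_json):
    """Pull (lat, lng) out of a Geocoding API response, or None if no match."""
    results = response_json.get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return (loc["lat"], loc["lng"])

def geocode(place, api_key):
    """Look up a place name via the Google Geocoding service."""
    query = urllib.parse.urlencode({"address": place, "key": api_key})
    with urllib.request.urlopen(f"{GOOGLE_GEOCODE}?{query}") as resp:
        return latlng_from_geocode(json.load(resp))
```

A full version would first POST the transcription text to the NER service, pass the returned entities through `places_from_entities`, and then call `geocode` on each place name to get coordinates suitable for mapping.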

Historical Manuscript Entity Extraction


To test this demo app, enter any DIY History file ID (the last parameter of a transcription URL) and it will automatically return any person, place, or organization it detects. This is a probabilistic method, so results may not be accurate if there is ambiguity. For this demonstration, each ID is entered manually, but a production application could iterate through thousands of records, storing results in a database.

For example, in the following transcription URL, the ID is 73120.
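Because the file ID is simply the last path segment of a transcription URL, a one-line helper can recover it programmatically; the URL in the usage comment below is a hypothetical shape for illustration, not a real DIY History address.

```python
from urllib.parse import urlparse

def file_id_from_url(url):
    """Return the last path segment of a transcription URL, i.e. the file ID."""
    return urlparse(url).path.rstrip("/").split("/")[-1]

# Hypothetical URL shape for illustration:
# file_id_from_url("https://example.org/transcribe/843/73120") -> "73120"
```

A batch-processing script could build its list of IDs this way from a crawl of collection pages, then feed each ID to the entity extractor.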

Here are a few IDs to get started with:

73120 – Eno family letters, November 1813-September 1827 1824-09 Page 1

2392 – Nile Kinnick correspondence, December 1942-March 1943 1942-12-13: Page 01

33315 – Wise-Clark family papers, December 1864-February 1865 1864-12-18-Page 04


If you have questions about the DIY History API, please contact us.


Science fiction fanzines planned for DIY History

Selected fanzines from the Hevelin Collection, featuring hectographed and hand-colored covers and writing from early science fiction fans. Images courtesy of UI Libraries and Special Collections.

The University of Iowa Libraries has announced a major digitization initiative, in partnership with the UI Office of the Vice President for Research and Economic Development. 10,000 science fiction fanzines will be digitized from the James L. “Rusty” Hevelin Collection, representing the entire history of science fiction as a popular genre and providing the content for a database that documents the development of science fiction fandom…

Science fiction fanzines are amateur publications made by individuals or groups that discuss books, films, politics, and many other public and personal matters. They were initially written for a limited audience and distributed via personal connections and gatherings, beginning in the 1930s in the United States and Europe. Within the pages of science fiction fanzines lies previously inaccessible and unstudied primary documentation of the social history and popular culture of the 20th century.

Science fiction fanzine writers were intimately involved with many aspects of science fiction literature during the golden years of its development. The list of names is impressive: Ray Bradbury, Robert Heinlein, Arthur C. Clarke, Robert Bloch, Leigh Brackett, Frederik Pohl, Harlan Ellison, Joe Haldeman, Michael Moorcock, Roger Zelazny, Marion Zimmer Bradley, Robert Silverberg, Roger Ebert, George R.R. Martin, Forrest Ackerman, and many others were actively involved in fanzine culture…

Once digitized, the fanzines will be incorporated into the UI Libraries’ DIY History crowdsourcing site, where a select number of interested fans (up to 30) will be provided with secure access to transcribe, annotate, and index the contents of the fanzines. This group will be modeled on an Amateur Press Association (APA) structure, a fanzine distribution system developed in the early days of the medium that required contributions of content from members in order to qualify for, and maintain, membership in the organization. The transcription will enable the UI Libraries to construct a full-text searchable fanzine resource, with links to authors, editors, and topics, while protecting privacy and copyright by limiting access to the full set of page images.

Read full press release