Since the launch of the Civil War Diaries & Transcription Project, the goal of DIY History has been to promote the University of Iowa Libraries' digital collections. Part of this mission is making the trove of transcriptions from handwritten diaries, manuscripts, and letters widely available to researchers for use in their work.
At the time of this writing there are 61,987 transcribed pages spanning nine public collections on such wide-ranging historical topics as pioneer diaries, war letters, culinary manuscripts and recipes, railroadiana, and specimen cards. Each page has been transcribed and checked by one or more volunteers from around the world. This type of crowdsourcing effort transcends our ability as a staff to catalog, display, and transcribe every handwritten item in the library. A project such as DIY History invites library users to do more than visit and browse: it invites them to actively participate in and transform the archive by typing what they read.
While DIY History items have always been indexed by search engines, there hasn’t been a robust method of making the entire body of transcription text available to researchers. Previously, a scholar interested in mining the entire collection of cookbooks from the Szathmary collection, for example, would need to perform hundreds of tedious queries using simple keywords. A better method of making our transcription data available was needed.
Early in the Fall 2015 semester, we debuted the DIY History Application Programming Interface. This API provides researchers with the ability to access much more metadata on each file than is displayed on the DIY History website. The API makes DIY History a platform on which to build applications, research projects, and other potentially innovative tools using the transcription data provided by our volunteers.
This opens the possibility of text analysis that was not practical before, even at the web scale of crowdsourcing. Having programmatic access to the item-level and file-level metadata means researchers can use machine learning techniques to extract and analyze named entities. Just as it’s not feasible for a small staff to transcribe 100,000 scanned pages, it’s not feasible for a small crowd to tag an arbitrary number of entities each time a new research question arises. This is a realistic task, however, for a properly trained machine.
To demonstrate this idea, here is a simple entity extractor using DIY History transcription text, an NER module provided by MonkeyLearn, and Google’s Geocoding service to return latitude and longitude values for extracted place names.
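The flow of such an extractor can be sketched in Python roughly as follows. This is a minimal illustration, not the demo's actual code: `fetch_text`, `extract_entities`, and `geocode` are hypothetical callables standing in for the DIY History API, the MonkeyLearn NER module, and Google's Geocoding service, whose real endpoints and response formats are not described here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Entity:
    text: str                     # the name as it appears in the transcription
    kind: str                     # assumed labels: "person", "place", "organization"
    lat: Optional[float] = None   # filled in only for places that geocode
    lng: Optional[float] = None


def extract_and_geocode(
    file_id: int,
    fetch_text: Callable[[int], str],
    extract_entities: Callable[[str], Iterable[Tuple[str, str]]],
    geocode: Callable[[str], Optional[Tuple[float, float]]],
) -> List[Entity]:
    """Fetch one transcription, tag its entities, and geocode the places.

    All three callables are hypothetical wrappers supplied by the caller.
    """
    text = fetch_text(file_id)
    results = []
    for name, kind in extract_entities(text):
        entity = Entity(text=name, kind=kind)
        if kind == "place":
            coords = geocode(name)
            if coords is not None:   # ambiguous or unknown places stay ungeocoded
                entity.lat, entity.lng = coords
        results.append(entity)
    return results
```

Passing the three services in as functions keeps the sketch independent of any particular client library and makes it easy to swap in stubs when testing.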
Historical Manuscript Entity Extraction
To test this demo app, enter any DIY History file ID (the last parameter of a transcription URL) and the app will automatically return any person, place, or organization it detects. This is a probabilistic method, so results may not be accurate if there is ambiguity. For this demonstration, each ID is entered manually, but a production application could iterate through thousands of records, storing results in a database.
For example, in the following transcription URL, the ID is 73120
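A script could pull the ID out of a URL instead of copying it by hand. The sketch below assumes the ID is the final numeric segment of the URL path; the `example.org` address is a made-up placeholder, not the real DIY History URL format.

```python
from urllib.parse import urlparse


def file_id_from_url(url: str) -> int:
    """Return the trailing numeric path segment of a transcription URL."""
    path = urlparse(url).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    if not (last_segment.isascii() and last_segment.isdigit()):
        raise ValueError(f"no numeric file ID at the end of {url!r}")
    return int(last_segment)


# Hypothetical URL shape, for illustration only.
print(file_id_from_url("https://example.org/transcribe/1234/73120"))  # prints 73120
```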
Here are a few IDs to get started with:
73120 – Eno family letters, November 1813-September 1827 1824-09 Page 1
2392 – Nile Kinnick correspondence, December 1942-March 1943 1942-12-13: Page 01
33315 – Wise-Clark family papers, December 1864-February 1865 1864-12-18-Page 04
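The production scenario mentioned above, iterating over many records and storing the results, might look something like this sketch using Python's built-in `sqlite3` module. The `extract` callable is a hypothetical wrapper around the API and NER calls that returns (name, kind) pairs for a file ID; the table layout is likewise only an assumption.

```python
import sqlite3
from typing import Callable, Iterable, Tuple


def store_entities(
    conn: sqlite3.Connection,
    file_ids: Iterable[int],
    extract: Callable[[int], Iterable[Tuple[str, str]]],
) -> int:
    """Run `extract` on each file ID and save the (name, kind) pairs it returns.

    Returns the number of file IDs processed without error.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        "file_id INTEGER NOT NULL, name TEXT NOT NULL, kind TEXT NOT NULL, "
        "UNIQUE (file_id, name, kind))"
    )
    processed = 0
    for file_id in file_ids:
        try:
            pairs = list(extract(file_id))
        except Exception as exc:  # one bad record should not stop the batch
            print(f"skipping {file_id}: {exc}")
            continue
        # INSERT OR IGNORE makes reruns safe: duplicate rows are skipped.
        conn.executemany(
            "INSERT OR IGNORE INTO entities (file_id, name, kind) VALUES (?, ?, ?)",
            [(file_id, name, kind) for name, kind in pairs],
        )
        conn.commit()
        processed += 1
    return processed
```

Called with `sqlite3.connect("entities.db")` and the IDs `[73120, 2392, 33315]`, this would process the three sample pages listed above; committing after each file means an interrupted run keeps whatever it has already saved.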
If you have questions about the DIY History API, contact firstname.lastname@example.org.