Just as I was about to compose two articles this morning on metadata problems in Google Book Search and in library catalogs … lo and behold … I came across science-publishing-library blogger Eric Hellman’s article White Dielectric Substance in Library Metadata, on much the same theme. It has some good narrative far down in the article that I suspect will get overlooked, so I’m excerpting the last several paragraphs, which contain the discussion of metadata. Hellman is discussing Geoff Nunberg’s talk at last Friday’s symposium at UC Berkeley on Google Book Search (boldface added):
Reading Nunberg’s blog post corresponding to the talk is very entertaining in a juvenile sort of way. The poor guy has been trying to use Google Books as a linguistic research corpus, and has discovered to his professed horror that there are all sorts of errors, many of them humorous, in its metadata.
I must now let you in on a closely held secret among library metadata technologists which, due to the Google Books metadata fiasco, must now be revealed to the general public. There is some crappy data in library catalogs. How much is an interesting question, and my ability to comment on how much is limited by confidentiality obligations. However, I am free to observe that studies have been published on the error rate in OpenURL linking. OpenURL linking usually depends on matching of metadata between a source metadata file and a target metadata file; errors in either file can cause a linking error. Reported error rates are in excess of 1%. In his response to Nunberg’s blog post, Jon Orwant points out that a one-in-a-million error occurs a million times if you have a trillion metadata items; my guess is that an error rate of one part per million may be overly optimistic by four orders of magnitude when applied to library metadata.
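The arithmetic in that aside can be checked with a quick back-of-envelope sketch. The trillion-item figure and the one-in-a-million rate are Orwant’s; the “four orders of magnitude worse” rate (roughly 1%, in line with the reported OpenURL error rates) is Hellman’s guess, not a measured value:

```python
items = 10**12                 # Orwant's hypothetical: a trillion metadata items

# Orwant's figure: one error per million items still yields a million errors.
errors_ppm = items // 10**6    # 1,000,000 errors

# Hellman's guess: library metadata is ~4 orders of magnitude worse,
# i.e. roughly one error per hundred items (matching the >1% OpenURL rates).
errors_1pct = items // 10**2   # 10,000,000,000 errors

print(f"{errors_ppm:,} errors at 1 ppm; {errors_1pct:,} errors at 1%")
```

At Hellman’s guessed rate, the same trillion-item collection would carry ten billion erroneous items rather than a million.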
In my post on “collecting too much data”, I wrote that a huge challenge of maintaining a large metadata database is battling entropy as the collection grows. I’ve observed that most people trying to collect metadata go through an early period of thinking it’s easy, and then gradually gain understanding of the real challenges. Google has certainly been no exception to this pattern. When they first started dealing with book metadata, they were oblivious to the difficulties of maintaining a large metadata database. As Orwant’s response to Nunberg shows, they are currently in the phase of understanding the true difficulties of what they need to do. They have most certainly become attuned to the importance of keeping track of the source (provenance) of their metadata, if for no other reason than to have someone to blame for the inevitable metadata stupidities. Much of the “Linked Data” crowd has yet to digest this lesson fully.
Nunberg’s thesis is that Google Books will be the “Last Library” and that it would be a disaster for society if Google does a bad job of it. He does not consider the converse possibility. What if Google manages to do a better job of it than libraries have done? If that happens, all of the library world could be turned upside down. Existing metadata maintenance cooperatives would vanish overnight and libraries around the world would become dependent on Google’s metadata prowess. Google would acquire a de facto metadata monopoly through technical merit rather than through class-action maneuvering. What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so? Is this even conceivable?
Context: Recent articles by Geoff Nunberg:
Google Books: A Metadata Train Wreck, Language Log blog (Hellman’s comments above refer to this article)
Google’s Book Search: A Disaster for Scholars, Chronicle of Higher Education
Eric Rumsey is on Twitter @ericrumseytemp