Seeing the Picture

Thoughts on digitization & libraries while working on Hardin MD

Category Archives: Train Wreck

Secret’s Out: Library Catalogs have some Crappy Metadata

Posted on September 3, 2009 by Eric Rumsey

Just as I was about to compose two articles this morning on metadata problems in Google Book Search and in library catalogs … lo and behold … I came across science-publishing-library blogger Eric Hellman’s article White Dielectric Substance in Library Metadata on much the same theme. It has some good narrative far down in the article that I suspect will get overlooked, so I’m excerpting the last several paragraphs, which contain his remarks on metadata. Hellman is discussing Geoff Nunberg’s talk at last Friday’s symposium at UC Berkeley on Google Book Search (boldface added):

Reading Nunberg’s blog post corresponding to the talk is very entertaining in a juvenile sort of way. The poor guy has been trying to use Google Books as a linguistic research corpus, and has discovered to his professed horror that there are all sorts of errors, many of them humorous, in its metadata.

I must now let you in on a closely held secret among library metadata technologists which due to the Google Books metadata fiasco must now be revealed to the general public. There is some crappy data in library catalogs. How much is an interesting question, and my ability to comment on how much is limited by confidentiality obligations. However, I am free to observe that studies have been published on the error rate in OpenURL linking. OpenURL linking usually depends on matching of metadata between a source metadata file and a target metadata file; errors in either file can cause a linking error. Reported error rates are in excess of 1%. In his response to Nunberg’s blog post, Jon Orwant points out that a one in a million error occurs a million times if you have a trillion metadata items; my guess is that an error rate of one part per million may be overly optimistic by four orders of magnitude when applied to library metadata.

In my post on “collecting too much data”, I wrote that a huge challenge of maintaining a large metadata database is battling entropy as the collection grows. I’ve observed that most people trying to collect metadata go through an early period of thinking it’s easy, and then gradually gain understanding of the real challenges. Google has certainly been no exception to this pattern. When they first started dealing with book metadata, they were oblivious to the difficulties of maintaining a large metadata database. As Orwant’s response to Nunberg shows, they are currently in the phase of understanding the true difficulties of what they need to do. They have most certainly become attuned to the importance of keeping track of the source (provenance) of their metadata, if for no other reason than to have someone to blame for the inevitable metadata stupidities. Much of the “Linked Data” crowd has yet to digest this lesson fully.

Nunberg’s thesis is that Google Books will be the “Last Library” and that it would be a disaster for society if Google does a bad job of it. He does not consider the converse possibility. What if Google manages to do a better job of it than libraries have done? If that happens, all of the library world could be turned upside down. Existing metadata maintenance cooperatives would vanish overnight and libraries around the world would become dependent on Google’s metadata prowess. Google would acquire a legal metadata monopoly through technical merit rather than through class action maneuvering. What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so? Is this even conceivable?
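
The error-rate arithmetic in Hellman’s second paragraph is easy to check. Here is a small Python sketch, offered only as a back-of-the-envelope illustration: the variable names are mine, and applying Hellman’s guessed rate to Orwant’s trillion items is an assumption made for comparison, not something either of them states.

```python
from fractions import Fraction

# Figures taken from the excerpt above.
items = 10**12                      # "a trillion metadata items"
orwant_rate = Fraction(1, 10**6)    # "a one in a million error"

# Expected number of bad records at Orwant's rate.
errors_at_orwant_rate = items * orwant_rate

# Hellman's guess: one part per million is too optimistic by four
# orders of magnitude, i.e. the real rate is 10**4 times higher.
hellman_rate = orwant_rate * 10**4

# Assumption for illustration only: apply that rate to the same trillion items.
errors_at_hellman_rate = items * hellman_rate

print(f"Orwant:  rate {float(orwant_rate):.4%}, errors {int(errors_at_orwant_rate):,}")
print(f"Hellman: rate {float(hellman_rate):.0%}, errors {int(errors_at_hellman_rate):,}")
```

Four orders of magnitude above one part per million is one part per hundred, which lines up with the “in excess of 1%” figure Hellman cites for OpenURL linking.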

Context: Recent articles by Geoff Nunberg:
Google Books: A Metadata Train Wreck, Language Log blog (Hellman’s comments above refer to this article)
Google’s Book Search: A Disaster for Scholars, Chronicle of Higher Education

Eric Rumsey is on Twitter @ericrumseytemp

Posted in Google Book Search, Libraries, Library Catalog, Metadata, PicsNo, Train Wreck, Uncategorized.

Metadata About Metadata: Library Catalog Fail

Posted on September 3, 2009 by Eric Rumsey

David Weinberger’s book Everything is Miscellaneous: The Power of the New Digital Disorder is fascinating, and I’m especially enjoying his many original comments on metadata. So, trying out Weinberger’s ideas, I search in local library catalogs for david weinberger metadata and get: NO ENTRIES FOUND … Hmmm … How does Google Book Search compare? I do the same search in GBS, and bingo, there it is, at the top of the list …

Google, of course, puts the book at the top of the list because its deep metadata indicates that metadata is an important topic, and PageRank likely indicates that other people also value Weinberger’s discussion of the topic.

So, why don’t library catalogs find the book? The problem is the subject headings assigned by the Library of Congress and used in almost all library catalogs:

Knowledge management.
Information technology — Management.
Information technology — Social aspects.
Personal information management.
Information resources management.
Order.

Even though the book discusses metadata at length and on many pages, the topic isn’t deemed important enough to be a heading. The problem is that the traditional catalog is what Weinberger calls a “second order” resource, limited to the small number of subject headings that will fit on a card in the (bygone) catalog. Given resources to assign a larger number of subject headings, no doubt metadata would be included.

So … Librarians can’t afford to be smug about metadata. Google has problems (as discussed in the Geoff Nunberg articles linked below), but libraries have their own problems. In many ways the traditional library catalog lacks metadata features that have become common in Google, Amazon, and other sites.

Hope for Libraries — WorldCat does find the book with the david weinberger metadata search (#2 in results), because it has additional tags listed in its “Abstract” (scroll down) which include metadata — Sooner or later, maybe libraries will add the WorldCat Abstract to their catalogs to “enrich their metadata.”
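
To make the difference concrete, here is a toy Python sketch of why the same three-word search fails against a record that holds only author, title, and subject headings, but succeeds once a tag from an abstract is added. It is a simplified illustration, not how any real catalog or WorldCat works: the record layout, the `matches` helper, and the `abstract_tags` field are all hypothetical, and I am assuming a plain all-words-must-match keyword search.

```python
import re

def matches(query, record):
    """Return True if every word of the query appears in some field of the record."""
    text = " ".join(record.values()).lower()
    words_in_record = set(re.findall(r"[a-z0-9]+", text))
    return all(word in words_in_record for word in query.lower().split())

# Traditional catalog record: author, title, and the subject headings listed above.
catalog_record = {
    "author": "Weinberger, David",
    "title": "Everything is Miscellaneous: The Power of the New Digital Disorder",
    "subjects": (
        "Knowledge management. "
        "Information technology -- Management. "
        "Information technology -- Social aspects. "
        "Personal information management. "
        "Information resources management. "
        "Order."
    ),
}

# Hypothetical enriched record: the same fields plus a tag drawn from an abstract.
enriched_record = dict(catalog_record, abstract_tags="metadata")

query = "david weinberger metadata"
print("catalog record matches: ", matches(query, catalog_record))
print("enriched record matches:", matches(query, enriched_record))
```

The first record has no field containing the word metadata, so the search comes up empty, while the single extra tag is enough for the second record to match.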

Context: Recent articles by Geoff Nunberg:
Google Books: A Metadata Train Wreck, Language Log blog
Google’s Book Search: A Disaster for Scholars, Chronicle of Higher Education

Posted in Google Book Search, Library Catalog, Metadata, PicsYes, Train Wreck, Uncategorized.

“Metadata Train Wreck”: Librarians Should Tread Lightly

Posted on September 3, 2009 by Eric Rumsey

There’s been much buzz among librarians, and others, on recent articles by Geoff Nunberg (UC Berkeley School of Information) on the “Train Wreck” state of metadata in Google Book Search (see article references below). Nunberg certainly makes some good points. But we librarians are far from perfection in the metadata realm. Look at the good ol’ Card Catalog: problems abound, as described in an amusing YouTube piece by librarian Brian Mathews (@brianmathews). He uses Georgia Tech as an example, but the same sorts of problems exist in many catalogs. It’s especially appropriate, in light of Nunberg’s emphasis on date problems in GBS, that one of the examples highlighted by Mathews (below) is author dates … Born by the metadata, die by the metadata …

Context: Recent articles by Geoff Nunberg:
Google Books: A Metadata Train Wreck, Language Log blog
Google’s Book Search: A Disaster for Scholars, Chronicle of Higher Education

Posted in Google Book Search, Library Catalog, Metadata, PicsYes, Train Wreck, Uncategorized.

Eric Rumsey

Eric Rumsey is a librarian and web developer at the Hardin Library for the Health Sciences, University of Iowa. He is the founder and manager of the Hardin MD site.

© 2023 Seeing the Picture, all rights reserved.