Seeing the Picture

Thoughts on digitization & libraries while working on Hardin MD

Category Archives: Metadata

Google Plus has a Metadata Problem – No Titles

Posted on August 4, 2011 by Eric Rumsey

One reason I haven’t been an early adopter of Google Plus is that the searchability seems to be so limited. Oddly, Search Giant Google seems to have overlooked a simple thing like titles for articles. I discovered this when I searched for a Google Plus article in Google Web Search. Here’s the G+ article, by Tim O’Reilly …

I picked out a phrase from the article to search – “I think he’s right that it’s too complex” and searched for it in Google Web Search – Google does find the O’Reilly article and several other articles that cite it (below), but the entries are just the names of the people who have entered them, with no title words at all — How strange!

Google Plus articles having no titles is especially surprising because the importance of a strong title has long been the first principle of building web pages, at least since the early days of Google. This is well-stated in an article on an SEO blog — Understanding the Magic of Metadata:

The title is without a doubt the most important piece of metadata there is.  One could further argue that it is the most important part of your web page, period.

Like a Library Full of Books with No Titles

As a librarian, I sometimes see parallels between the early evolution of the Internet and the early days of books. One of the important milestones for books was the establishment of the Title Page as the place where the title of a book was clearly stated (imagine books without titles). So it’s been interesting to see, over the last 10-15 years of the Internet, that the Web page title has also come to be an indispensable handle for Web pages. Is Google Plus trying to take us back to a world of Web pages without titles?

Eric Rumsey is at: eric-rumseytemp AttSign uiowa dott edu and on Twitter @ericrumseytemp

Posted in Google Plus, Metadata, PicsYes, Uncategorized.

Metadata & Librarians – “Sexy … Cool”

Posted on February 25, 2011 by Eric Rumsey

I stumbled yesterday upon a couple of passing mentions in publishing circles of a new-found appreciation for the value of metadata, and the librarians who work with it — Certainly not a new theme, but coming across these two bits on the same day struck me. The first is in Kassia Krozser’s write-up of the recent Tools of Change in Publishing conference:

[Boldface & color added] … Which leads to my final theme: metadata. Metadata is the sexy of publishing conferences. This would embarrass metadata, metadata being the type who prefers to remain in the background. It also reveals too much about publishing conferences. Metadata is useful, efficient, precise. Metadata doesn’t grace the cover of Vogue. It’s the girl next door. The really smart girl next door. The really smart, really successful girl next door.

Metadata is data that describes data. That’s meta, I know. It is the information that feeds search. Enables discovery. The better your metadata, the better your chances of discovery. Consider your book’s metadata: title, ISBN, author, editor, year of publication, format, index, table of contents, keywords, tags, reviews, so much more. The more you can describe your (collective your) book, the greater the chances of discovery.

And, oh my, ask anyone who uses your metadata, and they’ll say it’s bad. Ask me, and I’ll be a bit more eloquent. My solution? Hire a librarian for your digital (and print — metadata matters there) team. Use this librarian’s knowledge. Speaking of which, these awesome experts were out in force at TOC.

The second mention is Hannah Johnson’s article in the Publishing Perspectives blog last month, with this concluding paragraph:

So even though metadata has a less-than-cool reputation (think solitary librarians checking ISBN numbers in their card catalogues), digitization is making it very cool.

Although a couple of comments on this article by library people express displeasure with the “solitary librarians” theme, I certainly see this as a positive appreciation by publishers of the new role of metadata and librarians, especially in light of Kassia Krozser’s laudatory sentiments above.

The kind words here from publishers about metadata and librarians continue a thread that’s been developing for a while. I’ve blogged before about how digital publishing, especially, is bringing librarians and publishers together; see my post on this, in which I discuss an article (with a cute cover picture) in Library Journal. Also see my article on the growing importance of metadata for publishers, as discussed eloquently by Dominique Raccah.

Posted in Libraries, Metadata, PicsNo, Publishing, Uncategorized.

Discoverability: “Metadata Really Matters for eBooks”

Posted on November 16, 2010 by Eric Rumsey

Publisher Dominique Raccah’s eBooks and the digital transformation has good words on metadata, deep down in the article, that deserve a shout-out (Thanks to Nancy Picchi for flagging this in her tweet). The quote is striking for the clear, simple description of the importance of metadata for publishers (which transfers easily to libraries, since metadata quality is certainly one of the criteria libraries will want to consider in collecting eBooks). Excerpts from Raccah, boldface added:

Metadata may have become my favorite word in recent years. Most publishing companies of any reasonable size these days have a person or persons who are responsible for nothing but “metadata.” So what is metadata? Metadata is all the information/content related to a specific book, from the title, author name, and ISBN all the way through the description, marketing copy, author bio, and images you see for a book on an e-tailer’s site. eBook buyers run into metadata problems all the time – it seems like it’s the book you want, but there’s no description, no cover image, and hmmm, I think that might not even be the right author.

Metadata has long been a dominant theme in libraries, going back to subject headings in the card catalog. The purpose of headings, of course, was to help users discover books on library shelves. Now, with eBooks, the process of discovery is more difficult, since browsing of physical books on the shelf is not possible. So people in libraries are talking more intentionally about discoverability in the digital age and seeing metadata and discoverability as important tools in marketing the library. As with libraries, digital discovery and metadata have also assumed new importance for people in publishing, as they work to market their eBooks. Raccah’s comments are a clear reflection of this:

If you remember only one thing from reading this article, let it be this: metadata really matters for eBooks. On the web, reading with your e-reader, on your phone or however/wherever you access ebooks, discovery is everything. Unlike in a physical bookstore where you can browse shelves and find that next perfect book that you want to read, how you find a book online (whether a physical book or a digital book) is all about metadata. So making sure all those descriptive pieces are correct and where they’re supposed to be really matters.

Raccah’s comments, I think, are an example of the converging interests of publishers and librarians, as both realize the importance of metadata in providing access to digital resources. I especially appreciated this intermingling at the recent Books in Browsers conference (where Raccah was one of the presenters) — A highlight was Brian O’Leary’s keynote talk on the theme of context, as it relates to discoverability, tagging, and metadata.

Posted in BiB10, eBooks, Libraries, Metadata, PicsNo, Publishing, Uncategorized.

Advancing the Google Search Algorithm for Books

Posted on November 3, 2010 by Eric Rumsey

Alexis Madrigal’s recent article (Inside the Google Books Algorithm) has some valuable insights about the difficulties in adapting Google’s PageRank algorithm, which it developed for searching Web pages, to books. The article was apparently written on the occasion of the launching of the Rich Results feature in GBS, and Madrigal mixes this with his discussion of Google’s on-going work with the algorithm. Reading the article closely, and checking out GBS pages myself, I see that Rich Results — which adds options for viewing search results — is a relatively superficial innovation when compared to the deeper and more subtle changes that Madrigal discusses.

Madrigal’s article is especially valuable because it includes his interviews with Google engineers on the challenges of the book algorithm. So in this article I’m excerpting these remarks and other insights by Madrigal.

After an introductory paragraph on the great success of the Google algorithm for searching Web pages, Madrigal quickly moves to the challenges of adapting this to books. Thinking of the Books in Browsers conference I recently attended, and also of Hugh McGuire’s idea that books and the web will soon merge, I’m particularly struck by Madrigal’s characterization of books as being “outside the web” (This is no doubt said from the viewpoint of the Web search algorithm, since GBS books lack the links essential for PageRank. But in comparison to other forms of digitized books, GBS books are relatively more part of the Web, since they can be linked and used in a browser):

(Madrigal’s words, boldface throughout added) …
But what about when the company has to reach outside the web? The printed volumes represented on Google Books form a completely different kind of problem. Google’s famous algorithm can’t be deployed to search through books because they don’t link to each other in the way that webpages do. There is no perfect BookRank corollary for PageRank.

All of which made me wonder: How does Google Books work? What makes it tick? It turns out that it’s actually a great place for the company’s engineers to learn how to function in a linkless, physical world.

“There is a meaningful effort to say, how do we tune for books? We’ve got a lot of people doing very focused on the web. How do we take the lessons from what we learned on the web and invent new things that are unique to books?” Matthew Gray, lead software engineer of Google Books, told me.

After a brief digression on the new Rich Results feature, Madrigal moves to the heart of the article, on the more fundamental improvements in the GBS algorithm. I’m especially struck here that library holdings information is included:

Rich Results is the latest in a series of smaller front-end tweaks that have been matched by backend improvements. Now, the book search algorithm takes into account more than 100 “signals,” individual data categories that Google statistically integrates to rank your results. When you search for a book, Google Books doesn’t just look at word frequency or how closely your query matches the title of a book. They now take into account web search frequency, recent book sales, the number of libraries that hold the title, and how often an older book has been reprinted.
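To make the idea of "statistically integrating" signals a little more concrete, here is a minimal Python sketch of one simple way to combine them: a weighted sum. The four signal names echo the ones Madrigal mentions, but the weights, the 0-to-1 normalization, and the `Book`, `score`, and `rank_books` names are all my own assumptions for illustration. Nothing in the article says this is how Google actually does it.

```python
from dataclasses import dataclass

# Hypothetical weights: the article names these signals but not how they are combined.
WEIGHTS = {
    "web_search_frequency": 0.4,
    "recent_sales": 0.3,
    "library_holdings": 0.2,
    "reprint_count": 0.1,
}

@dataclass
class Book:
    title: str
    signals: dict  # signal name -> value assumed to be normalized to the range 0..1

def score(book):
    """Weighted sum of the book's signals; a missing signal counts as 0."""
    return sum(w * book.signals.get(name, 0.0) for name, w in WEIGHTS.items())

def rank_books(books):
    """Return the books sorted from highest to lowest combined score."""
    return sorted(books, key=score, reverse=True)

# Made-up example records, not real data.
books = [
    Book("A", {"web_search_frequency": 0.2, "library_holdings": 0.9}),
    Book("B", {"web_search_frequency": 0.8, "recent_sales": 0.5}),
]
ranked = [b.title for b in rank_books(books)]
```

In this toy example a title held by many libraries but rarely searched for (A) still ranks below one that is searched for and selling (B), which shows how much the outcome depends on the chosen weights.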

More comments by Google engineers on the differences between Web pages and books:

“One of the fundamental things we’ve learned is that the whole is greater than the sum of the parts,” Gray said. This is deeply Google thinking but without the dominant algorithm. It’s a Google subspecies that evolved by feeding on a different corpus. There is less data about books than web pages, but there is more structure to it, and there’s less spam to contend with.

The most difficult part of making Google Books work, said James Crawford, the team’s engineering director, was determining the intent of the service’s heterogeneous user base. Scholars who search Google Books have very different wants and expectations from casual users looking to find a trade fiction title.

Concluding remarks — Especially interesting here is the idea that the real advances will happen after digitization is completed — Makes me think of Mike Cane’s remarks about Google’s use of metadata and how they use it to “make information do things”:

All the Google Books tweaks I’ve noticed are small. But you add them all up and apply them to the 15 million books Google has scanned and the truly unprecedented nature of Google Books starts to emerge. “We’re in the middle of doing something radical. No one has ever pulled together this whole collection, scanning books from 40 different libraries,” Crawford said. “I would say our general approach here has been to just get the books scanned because until they are digitized and OCR is done, you aren’t even in the game. As we get more and more content on line, the work that Matthew’s team gets to be more and more important and more and more doable.”

Posted in Google Book Search, Metadata, PicsNo, Uncategorized.

Giving Context to Digital Books: Google Book Search

Posted on October 26, 2010 by Eric Rumsey

If I had any doubts that the Books in Browsers conference in San Francisco last week was going to be an unforgettable pow-wow of book people, they were quickly erased at the very outset of the first presentation, Allen Noren’s keynote on Thursday morning, in which he gave an introduction to the themes of the conference.

Noren (@allennoren), the Director of Online Marketing at O’Reilly Media, talked about Web experiences he’s had recently that hint at the changing role of traditional books as they relate to the Web. His first example was the About this Book page in Google Book Search. Using the example of Moby Dick (at left), he noted that this contains a wealth of information about book titles, and commented that he was surprised that hardly anyone seems to notice it and talk about its value — The first time that day that what I heard GRABBED MY ATTENTION, since I wrote in much the same vein two years ago around the subject of the unheeded goodies on the GBS About this Book page (although I was emphasizing its value for seeing thumbnails of pictures).

So, yes, Noren has FOUND THE GOLD! And, why, indeed, has the GBS About this Book gotten so little attention? It’s true, as I discussed with Noren after his presentation, that Google itself doesn’t feature the About this Book page, with Google searches generally linking to the Front Cover view, which is probably seen as being more appealing to the general public. But librarians and other meta-ish people should certainly appreciate its value, a sort of Web-enhanced “card-catalog” view of a book, as I observed in my earlier article.

So, I tuck Noren’s words away for further processing later … on with the day’s engrossing talks … The last keynote of the day — Context First, by Brian O’Leary (@brianoleary) — certainly grabbed everyone’s attention (by my count, at the end of the first day, it was tweeted about five times more than any other talk). In eloquent words accompanied by superb graphics, O’Leary contrasted the “container view” of the traditional print book and the “context view,” that’s made possible by digital books — So, another great talk, much to consider, but on to the evening’s equally riveting activities. …

It wasn’t until later, on my red-eye flight home, in mulling things over — I see — YES — Noren and O’Leary are talking about the same thing! — The GBS About this Book page is a prime example of Giving Context to Digital Books — Putting [meta] data from the text of the book together with data from the Web to give insights about how the book fits into context of the Web and the world beyond the “container of the book” — Giving life to the Salman Rushdie-ish Streams of Story, as O’Leary suggests in his conclusion, and as I’ve blogged about.

Posted in BiB10, eBooks, Google Book Search, Metadata, PicsYes, Publishing, Uncategorized.

Google Book Search: Multiple Editions Give Quirky Results

Posted on October 12, 2010 by Eric Rumsey

In an earlier article, I reported on searching in Google Web Search for ten full-text books, which were found in Google Books and Internet Archive. Most of the titles included in that study had only one edition, which makes them relatively easy to search for. In this article I report on a more complex search, for one book title with multiple editions, searched in Google Web Search, and also searching directly in Google Book Search. I also report on results for searching a specific phrase from the book in Web Search and Book Search. In a separate article, I examine the results for the same title in Internet Archive.

Searching for the book

The book I used as a case-study example is Diagnostic and therapeutic technic, by Albert S. Morrow, which has four editions from 1911 to 1921.

Google Web Search: “Diagnostic and therapeutic technic” morrow – In the first 10 records retrieved, the only full-text books are two GBS records, from Harvard 1911 (#1) and Harvard 1917 (#4) [The 1911 edition was scanned twice by Harvard, as shown in the detailed GBS results below, so the boldface is used to distinguish the different versions]. There are no records from Internet Archive.

Google Book Search (searching the same phrase, limiting to Full-view): “Diagnostic and therapeutic technic” morrow – The only edition in the top 10 items is the same Harvard 1911 edition, ranked at the top of the list — It seems odd that only one edition is found in Book Search, when there were two editions in Web Search … BUT, there’s a trick in Book Search! – Clicking “More editions” brings up a great wealth of additional editions — Seven different records for four editions, from Harvard and Stanford, listed below in chronologic publication order, with search retrieval rank in parentheses (As mentioned above, the 1911 edition was entered twice by Harvard, distinguished by boldface and ID number):

Harvard 1911/uL8R (#4)
Harvard 1911/PqsR (#5)
Stanford 1921 (#1) [Results list incorrect, says 1911]
Harvard 1915 (#3)
Stanford 1911 (#2) [Results list incorrect, says 1915]
Harvard 1917 (#7)
Harvard 1921 (#6)

Taken together, these results for Web Search and Book Search raise questions —

With GBS records for several editions, is there any predictable reason why particular records appear in each of the searches? Why are only two records retrieved in Web Search when seven are found in Book Search?

The favored record seems to be the Harvard 1911/uL8R one, which appears first in both Web Search and Book Search — Is this chosen because it’s the first edition? Why not choose the latest edition from 1921? It’s also notable that this seemingly favored record links to a rather confusing scan from the front cover of the book instead of linking to the clearly-displayed title page or table of contents, as most of the other records do.

“More editions” search – How is the order of these determined? Is it significant that the two top-ranked records noted on the list above are the only ones from Stanford? Another little glitch – Each of the entries linked from the initial “More editions” has its own “More editions” link, which goes to the same list as the initial list — An obvious oversight.

Both of the Stanford editions in the list have incorrect dates given in the search list. This fits my experience — I’ve found that Stanford GBS records are often quirky, especially compared to Harvard.
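For anyone who wants to track this kind of quirk across more titles, here is a small Python sketch that flags records whose listed date differs from the actual edition date. The rows are the seven records from the "More editions" list above. I am assuming the Harvard records are listed with their correct dates, since I only noted errors for Stanford. The `date_mismatches` helper is just an illustrative name.

```python
# (library, actual edition year, year shown in the "More editions" results list)
# Harvard "shown" years are assumed correct; only the Stanford errors were noted.
records = [
    ("Harvard", 1911, 1911),
    ("Harvard", 1911, 1911),
    ("Stanford", 1921, 1911),
    ("Harvard", 1915, 1915),
    ("Stanford", 1911, 1915),
    ("Harvard", 1917, 1917),
    ("Harvard", 1921, 1921),
]

def date_mismatches(rows):
    """Return the rows whose listed year differs from the actual edition year."""
    return [(lib, actual, shown) for lib, actual, shown in rows if actual != shown]

mismatches = date_mismatches(records)
for lib, actual, shown in mismatches:
    print(f"{lib}: edition of {actual} is listed as {shown}")
```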

Searching for a *phrase* in the book

Google Web Search –  The ability to search in Google Web Search for a phrase that occurs in a GBS book is invaluable — I searched Google Web for this phrase that I had seen in the Morrow book: “The tube devised by Crile is of German silver.” As expected, this gets only Morrow, which goes to page 117, with the phrase nicely highlighted, in the same Harvard 1911 edition that’s featured in the searches above for the whole book. Clicking on “Repeat the search with the omitted results included” retrieves six of the seven editions listed above, all with the phrase highlighted. Interestingly it also retrieves, ranked #2, the Internet Archive DjVu-formatted record for the book, which goes to the top of the book record instead of the highlighted occurrence on a page. Finding the Internet Archive record in a phrase search is surprising since there were no Internet Archive records retrieved in searching for the book title above.

Google Book Search (searching the same phrase, limiting to Full-view): “The tube devised by Crile is of German silver” – Searching for the same phrase in Google Books goes to the highlighted phrase in the Stanford 1921 edition. This is the only record retrieved (Interesting that this edition was ranked #1 in the GBS search for the book title). Clicking on “More editions” retrieves the same seven records listed above, which do not link to the highlighted phrase, but only to the beginning of the book.

Thoughts on Google Books and Metadata

There’s been much talk about metadata in Google Book Search, with strong complaints about its poor quality by Geoff Nunberg and others, especially as it relates to different editions, and the examples presented in this article fit into that category. I guess I take a less negative view of this than Nunberg. The upside that I see is that it’s better to have many editions, even with perplexing search patterns, than to have a more limited number of editions. The lesson that this little case study gives, I think, is to be aware of the quirks, and to be persistent. In particular, if you come across an interesting reference to a book in Google Books in a Google Web Search, be sure to “repeat the search with omitted results” and to do the search in Google Book Search.

Keeping multiple editions straight takes the large-scale organizational skills of a librarian, and I suspect librarians have more skill at it than Google; it’s going to take a while for Google to figure out how to do it well (hopefully with the help of librarians). On the other hand, where Google excels, as shown in this little study, is in hitting the small target. I apply this especially to Google’s ability to find specific phrases, and to highlight the phrase searched for, which I think will be very helpful for scholars.

Related articles:

  • Google Search for Full-text Books: GBS & Internet Archive
  • Internet Archive: Multiple Editions

Posted in GBS Case Study, Google Book Search, Metadata, PicsNo, Uncategorized.

Google Books Blur the Line between Book & Internet

Posted on September 20, 2010 by Eric Rumsey

With the lack of progress in figuring out the Settlement, Google Book Search has been out of the news recently. So I was glad to see that Hugh McGuire, in his recent article about the line between the book and the Internet disappearing, in the second half of the article, gives credit to GBS as helping to lay the foundation for this (see comments below). In the first part of his paper, McGuire has a lucid description of the future world of books on the Internet that’s being foreshadowed by GBS:

A book properly hooked into the Internet is a far more valuable collection of information than a book not properly hooked into the Internet. … The false battles between ebooks and print books continue to ignore the real, though as-yet unknown, value that comes with books being truly digital; not the phony, unconnected digital of our current understanding of “ebooks.” … We still consider that books live outside of the Internet. There is massive and untapped (and unknown) value to be discovered once books are connected.

McGuire says our idea of what an eBook should be still carries a lot of baggage from our idea of what a book is — We still think of eBooks as being separate from the Web — In particular, eBooks, like print books, are not searchable or linkable, and text can’t be copy-pasted. But full view Google Books are searchable, linkable and deep–link–able, and text can be copy-pasted.

Being able to find books in a Web search is a key to removing the barrier between books and websites, as McGuire says, and Google has been bringing that capability to reality, as they steadily work with libraries to scan their books. Searching for books within Book Search has been part of GBS since the beginning. In the last three years or so, books have also been increasingly integrated into Web Search results. This is a big step, I think, in connecting books to the Web. Searching Google is how we find out about things these days, so to be able to retrieve books and web pages in the same search is extremely valuable. It’s a great advance that’s gotten little attention.

Not surprisingly, the people who have especially learned to value bringing books onto the Web are historical researchers — With pre-1923 books being out of copyright, and freely available in GBS full-view searches, historians have quickly taken advantage — In a recent forum of historians on GBS, it’s described with words like indispensable … enlivening.

While historians have expressed their valuing of the connected books in Google Books, there are certainly many non-academics with little voice who are also learning to use it. Google engineer Dan Clancy says that half of the out-of-copyright books in GBS have at least 10 pages viewed per month — A large portion of this use must surely be from the general non-academic public. One commenter to McGuire’s article (who apparently doesn’t know about GBS) expresses the voice of this little-heard group: “If books/entire texts were *searchable* on the internet, people searching the web could more easily find exact phrases/specific information in books. To me, this is HUGE!” Indeed — and what a fine, pithy statement of McGuire’s thesis!

To conclude, a plug for metadata as the key to connecting books to the Web, and why it’s people like the geeks at Google who are doing it before the traditional publishing industry — In the words of Mike Cane:

All of this hidden information — exploded out, made explicit — turns an ebook from a dumb object into a smart object. … Google is staffed by geeks who juggle information with an expertise that print publishers lack. … Google makes information do things. … Print publishing freezes information into a static object — An object that stands alone, disconnected, unable to do anything. … The information geeks are new publishers for a new age … Metadata has value. As new connections are formed, and new data is added, its value increases exponentially.

Bringing the world’s books to digital life on the Web, whether it’s done by Google or someone else, is a story that’s just beginning. For now though, Google Book Search gives us a start in exploring the new world of the book and the Internet.

Posted in eBooks, Google Book Search, History, Metadata, PicsNo, Uncategorized.

More Metadata Problems in Google Books?: Word Clouds

Posted on September 30, 2009 by Eric Rumsey

A month ago Geoff Nunberg wrote two articles that got much attention on Google Book Search’s “metadata trainwreck,” relating to incorrect dating of books. I discovered another metadata-ish sort of problem, as I read Lorcan Dempsey’s recent article on GBS word clouds, and the value of their “glancability” for getting a quick overview of the contents of a book.

I was actually thinking of taking Dempsey’s thought a step further, and proposing the idea of including Google’s word clouds in library catalogs. But when I started looking more closely at GBS word clouds I found problems — The first thing I noticed in the cloud for Origin of Species (below and here at GBS [scroll down to Common terms and phrases]) is that it has the plant-related words “seeds,” “pistil,” and “pollen,” but does not have the word “plant(s).” Hmm, that’s odd — So I searched for “plants” and found that there are in fact 100 occurrences of it in the book. Then I clicked some of the terms in the cloud shown below, and found that the number of results often does not correlate well with the font size of the word (which is what’s supposed to happen in a word cloud) …

Note that the words “admit,” “cause,” and “male,” which are in the smallest font, have more occurrences than other terms with larger fonts — “Asa Gray” and “pistil” in particular.

I tried several books, and found similar results in all of them — The font size of terms in the word cloud does not show much correlation with the number of occurrences of words in the books. In Snippet view books (as at least one of the books in Dempsey’s article is) the problem is not apparent because the number of search results is limited to three links in the book, making it impossible to determine how many occurrences of the term there are.
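For comparison, here is a minimal Python sketch of what I would expect a conventional word cloud to do: count occurrences and scale font size in proportion to the count. How GBS actually derives its clouds is undocumented, so this is not a description of their method. The sample text, the point-size range, and the `word_counts` and `font_sizes` names are all made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Count occurrences of each lowercase word in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def font_sizes(counts, min_pt=10, max_pt=30):
    """Scale each word's count linearly into a font size between min_pt and max_pt."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when every count is equal
    return {w: min_pt + (c - lo) * (max_pt - min_pt) / span for w, c in counts.items()}

sample = "plants seeds plants pollen plants seeds"  # made-up text, not from any book
counts = word_counts(sample)
sizes = font_sizes(counts)
for word, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {counts[word]} occurrences, {size:.0f}pt")
```

Under a scheme like this, a word with more occurrences can never be displayed smaller than a word with fewer, which is exactly the relationship that does not seem to hold in the GBS clouds I looked at.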

I suspect that the GBS word cloud problem has not been noticed more because the word clouds are rather “buried” — Not on the default Read (Front cover) page, but inconspicuously down in the middle of the Overview page, probably not seen by the vast majority of users.

We need more documentation about word clouds in GBS — How are they derived? What exactly are they intended to mean? Google has said about other metadata problems that they are working on them, and that they’ll slowly get fixed. Hopefully, that will apply to word clouds also. Maybe Google thinks of word clouds as still being “in beta” — they were, after all, only launched in July — and that’s why they’re giving them a low profile.

Posted in eBooks, Google Book Search, Metadata, PicsYes, Uncategorized.

Tagging in Hardin MD

Posted on September 25, 2009 by Eric Rumsey

[This article accompanies previous article: Tagging in Hardin MD – History]

All Hardin MD (HMD) pages have tags at the bottom, to make them more visible for search engines i.e. Google. We have been doing tagging in HMD since 2000, and it works very well. As shown in the example to the left, the tags are for variant spellings (measels), variant terms (rubeola), and words and word combinations relating to pictures (we have found that the word “pictures” is especially favored by Google).

One of the things that has made HMD fun has been applying longstanding practices of librarianship to a web-based system. Having been a cataloger for a brief time early in my library career, it seemed natural to put tags at the bottom of the web page, just like subject headings are at the bottom of cards in the card catalog. Including mis-spellings in the tags to help users find the page seemed natural, too — As a cataloger, I had been taught to put x-ref cards in the catalog for variant ways that patrons might look for a book, and following the same principle on web pages, it became possible to apply it on a much larger scale.

It continues to surprise me that this simple idea — Putting tags on web pages — has not been more widely applied. I have seen very few cases of it at other sites. I suspect part of the reason for this is that people have tended to think the hidden meta keyword field was the place to put tags, rather than “cluttering up” their pages by putting them on the page. Google’s announcement a few days ago that they ignore meta keywords finally puts an end to that idea. But many SEO people have thought meta keywords were ineffective for a long time, and it was certainly our experience — Around the time we began putting tags on pages in 2000, we compared meta field tagging and on-page tagging, and found that meta field tagging seemed to be ignored by Google.

Another factor that may have discouraged people from putting tags inconspicuously at the bottom of the page is that SEO people generally say that words need to be in a prominent place on the page, preferably near the top, to be found by Google. That’s no doubt true for common words that have a lot of competition, but for relatively uncommon words, like variant spellings of disease names, placement at the bottom of the page works well. (One proviso: Our pages in HMD are relatively small, usually no more than two screens. Putting tags at the bottom of larger pages may not work as well.)

I suspect one reason people don’t experiment more with tagging and Google visibility is that it is a lengthy process. Google’s not going to see new words on your page right away. It may take several weeks or even months. So it requires careful record-keeping, noting when words are added, and a regular schedule of checking Google to see whether your pages are starting to appear in search results.
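
The record-keeping described above could be as simple as a dated log of each tag. Here is a minimal sketch in Python, not the system actually used for HMD; the names TagRecord, due_for_check and days_to_appear, the page name, and the 14-day checking interval are all assumptions made for illustration. The Google check itself is still done by hand; the code only tracks when tags were added and which ones are due for another look.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class TagRecord:
    """One tag added to one page (hypothetical structure, for illustration)."""
    page: str
    tag: str
    added: date                        # when the tag was put on the page
    first_seen: Optional[date] = None  # when a manual Google check first found the page


def due_for_check(records: List[TagRecord], today: date,
                  interval_days: int = 14) -> List[TagRecord]:
    """Return tags not yet seen in results that were added at least interval_days ago."""
    return [r for r in records
            if r.first_seen is None and (today - r.added).days >= interval_days]


def days_to_appear(record: TagRecord) -> Optional[int]:
    """Days between adding a tag and first seeing the page in results, if known."""
    if record.first_seen is None:
        return None
    return (record.first_seen - record.added).days


# Example log: two tags added to the same page on different dates.
log = [
    TagRecord("measles.html", "measels", added=date(2009, 9, 1)),
    TagRecord("measles.html", "rubeola", added=date(2009, 9, 20)),
]
for record in due_for_check(log, today=date(2009, 9, 25)):
    print(record.page, record.tag)
```

Filling in first_seen after each successful check makes it possible to see, over time, how long new words actually take to show up.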

Posted in Google, Hardin MD, Library Catalog, Metadata, PicsYes, SEO, Uncategorized.

Secret’s Out: Library Catalogs have some Crappy Metadata

Posted on September 3, 2009 by Eric Rumsey

Just as I was about to compose two articles this morning on metadata problems in Google Book Search and in library catalogs … lo and behold … I came across science-publishing-library blogger Eric Hellman’s article White Dielectric Substance in Library Metadata on much the same theme. It has some good narrative far down in the article that I suspect will get overlooked, so I’m excerpting the last several paragraphs, which are the ones about metadata. Hellman is discussing Geoff Nunberg’s talk at last Friday’s symposium at UC Berkeley on Google Book Search (boldface added):

Reading Nunberg’s blog post corresponding to the talk is very entertaining in a juvenile sort of way. The poor guy has been trying to use Google Books as a linguistic research corpus, and has discovered to his professed horror that there are all sorts of errors, many of them humorous, in its metadata.

I must now let you in on a closely held secret among library metadata technologists which due to the Google Books metadata fiasco must now be revealed to the general public. There is some crappy data in library catalogs. How much is an interesting question, and my ability to comment on how much is limited by confidentiality obligations. However, I am free to observe that studies have been published on the error rate in OpenURL linking. OpenURL linking usually depends on matching of metadata between a source metadata file and a target metadata file; errors in either file can cause a linking error. Reported error rates are in excess of 1%. In his response to Nunberg’s blog post, Jon Orwant points out that a one in a million error occurs a million times if you have a trillion metadata items; my guess is that an error rate of one part per million may be overly optimistic by four orders of magnitude when applied to library metadata.

In my post on “collecting too much data”, I wrote that a huge challenge of maintaining a large metadata database is battling entropy as the collection grows. I’ve observed that most people trying to collect metadata go through an early period of thinking it’s easy, and then gradually gain understanding of the real challenges. Google has certainly been no exception to this pattern. When they first started dealing with book metadata, they were oblivious to the difficulties of maintaining a large metadata database. As Orwant’s response to Nunberg shows, they are currently in the phase of understanding the true difficulties of what they need to do. They have most certainly become attuned to the importance of keeping track of the source (provenance) of their metadata, if for no other reason than to have someone to blame for the inevitable metadata stupidities. Much of the “Linked Data” crowd has yet to digest this lesson fully.

Nunberg’s thesis is that Google Books will be the “Last Library” and that it would be a disaster for society if Google does a bad job of it. He does not consider the converse possibility. What if Google manages to do a better job of it than libraries have done? If that happens, all of the library world could be turned upside down. Existing metadata maintenance cooperatives would vanish overnight and libraries around the world would become dependent on Google’s metadata prowess. Google would acquire a legal metadata monopoly through technical merit rather than through class action maneuvering. What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so? Is this even conceivable?

Context: Recent articles by Geoff Nunberg:
Google Books: A Metadata Train Wreck, Language Log blog (Hellman’s comments above refer to this article)
Google’s Book Search: A Disaster for Scholars, Chronicle of Higher Education

Eric Rumsey is on Twitter @ericrumseytemp

Posted in Google Book Search, Libraries, Library Catalog, Metadata, PicsNo, Train Wreck, Uncategorized.

Eric Rumsey

Eric Rumsey is a librarian and web developer at the Hardin Library for the Health Sciences, University of Iowa. He is the founder and manager of the Hardin MD site.

© 2023 Seeing the Picture, all rights reserved.