A month ago Geoff Nunberg wrote two widely read articles on Google Book Search’s “metadata trainwreck,” relating to incorrect dating of books. I discovered another metadata-ish sort of problem as I read Lorcan Dempsey’s recent article on GBS word clouds and the value of their “glancability” for getting a quick overview of the contents of a book.
I was actually thinking of taking Dempsey’s thought a step further and proposing that Google’s word clouds be included in library catalogs. But when I started looking more closely at GBS word clouds, I found problems. The first thing I noticed in the cloud for Origin of Species (below and here at GBS [scroll down to Common terms and phrases]) is that it has the plant-related words “seeds,” “pistil,” and “pollen,” but does not have the word “plant(s).” Hmm, that’s odd. So I searched for “plants” and found that there are in fact 100 occurrences of it in the book. Then I clicked some of the terms in the cloud shown below, and found that the number of results often does not correlate well with the font size of the word (which is what’s supposed to happen in a word cloud) …
Note that the words “admit,” “cause,” and “male,” which are in the smallest font, have more occurrences than other terms with larger fonts, “Asa Gray” and “pistil” in particular.
I tried several books and found similar results in all of them: the font size of terms in the word cloud does not show much correlation with the number of occurrences of the words in the book. In Snippet view books (as at least one of the books in Dempsey’s article is), the problem is not apparent because search results are limited to three links into the book, making it impossible to determine how many occurrences of a term there are.
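One way to put a number on this kind of impression is a rank correlation between each term’s font size and its occurrence count. What follows is only a minimal sketch, not a description of how Google builds its clouds. The function names are hypothetical, and the sample figures are made-up placeholders, not measurements taken from GBS. A value near 1 would mean bigger fonts reliably go with more occurrences; a value near 0 would mean the font size says little about frequency.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Placeholder data (NOT real GBS figures): term -> (font size, occurrences).
hypothetical_cloud = {
    "term_a": (24, 40),
    "term_b": (18, 95),
    "term_c": (12, 130),
    "term_d": (12, 60),
}

font_sizes = [size for size, _ in hypothetical_cloud.values()]
counts = [count for _, count in hypothetical_cloud.values()]
print(spearman(font_sizes, counts))
```

To try this on a real book, the font sizes would have to be read off the cloud and the counts collected by searching for each term by hand, which only works in books where the full number of search results is shown.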
I suspect that the GBS word cloud problem has not been noticed more because the word clouds are rather “buried”: they are not on the default Read (Front cover) page, but sit inconspicuously in the middle of the Overview page, where the vast majority of users probably never see them.
We need more documentation about word clouds in GBS. How are they derived? What exactly are they intended to mean? Google has said about other metadata problems that they are working on them, and that they’ll slowly get fixed. Hopefully, that will apply to word clouds also. Maybe Google thinks of word clouds as still being “in beta” (they were, after all, only launched in July), and that’s why they’re giving them a low profile.
Eric Rumsey is at: eric-rumseytemp AttSign uiowa dott edu and on Twitter @ericrumseytemp