A month ago Geoff Nunberg wrote two widely noticed articles on Google Book Search’s “metadata trainwreck,” relating to the incorrect dating of books. I came across another metadata-ish sort of problem while reading Lorcan Dempsey’s recent article on GBS word clouds and the value of their “glancability” for getting a quick overview of the contents of a book.
I was actually thinking of taking Dempsey’s thought a step further and proposing that Google’s word clouds be included in library catalogs. But when I started looking more closely at GBS word clouds I found problems. The first thing I noticed in the cloud for Origin of Species (below and here at GBS [scroll down to Common terms and phrases]) is that it has the plant-related words “seeds,” “pistil,” and “pollen,” but does not have the word “plant(s).” Hmm, that’s odd. So I searched for “plants” and found that there are in fact 100 occurrences of it in the book. Then I clicked some of the terms in the cloud shown below, and found that the number of results often does not correlate well with the font size of the word (which is what’s supposed to happen in a word cloud) …
Note that the words “admit,” “cause,” and “male,” which are in the smallest font, have more occurrences than other terms with larger fonts — “Asa Gray” and “pistil” in particular.
I tried several books and found similar results in all of them: the font size of terms in the word cloud shows little correlation with the number of occurrences of those words in the book. In Snippet view books (as at least one of the books in Dempsey’s article is) the problem is not apparent, because search results are limited to three links per book, making it impossible to determine how many times a term occurs.
I suspect that the GBS word cloud problem has not been noticed more because the word clouds are rather “buried”: they are not on the default Read (Front cover) page, but sit inconspicuously in the middle of the Overview page, where the vast majority of users probably never see them.
We need more documentation about word clouds in GBS. How are they derived? What exactly are they intended to mean? Google has said about other metadata problems that they are working on them, and that they’ll slowly get fixed. Hopefully that will apply to word clouds also. Maybe Google thinks of word clouds as still being “in beta” (they were, after all, only launched in July), and that’s why they’re giving them a low profile.
Eric Rumsey is at: eric-rumseytemp AttSign uiowa dott edu and on Twitter @ericrumseytemp
“Note that the words “admit,” “cause,” and “male,” which are in the smallest font, have more occurrences than other terms with larger fonts — “Asa Gray” and “pistil” in particular.”
Yes, that’s what should happen, because “admit”, “cause” and “male” are all more common words than “Asa Gray” or “pistil”.
Word clouds for books need to be based on relative, not absolute, frequency. Otherwise, they’d be dominated by “the”, “I”, and other common words. So a lower-frequency word might appear more prominently than a higher-frequency word if the lower-frequency word is less generally common. You’re trying to determine the words that are most distinctive in this book, not the ones that appear the most often.
Now, they might or might not be calculating relative frequency appropriately, but you need more than the number of results for given words in a book to determine that.
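To make the relative-versus-absolute distinction concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not a description of how Google computes its clouds. The function name `relative_frequency_ratio` and the background counts are hypothetical; the figures for “plants” (100) and “pistil” (15) echo counts mentioned in this thread, and every other number is made up.

```python
from collections import Counter

def relative_frequency_ratio(book_counts, background_counts):
    """Ratio of each word's share of the book to its share of a background corpus."""
    book_total = sum(book_counts.values())
    background_total = sum(background_counts.values())
    ratios = {}
    for word, count in book_counts.items():
        book_share = count / book_total
        # Add-one smoothing so a word missing from the background cannot divide by zero.
        background_share = (background_counts.get(word, 0) + 1) / (background_total + 1)
        ratios[word] = book_share / background_share
    return ratios

# Illustrative counts only: the background corpus figures are invented.
book = Counter({"the": 500, "plants": 100, "pistil": 15})
background = Counter({"the": 60000, "plants": 900, "pistil": 3})

ratios = relative_frequency_ratio(book, background)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)
```

With these invented numbers the rarest word in the book ranks first, because it is far rarer still in the background corpus. That is the effect being described: ranking by distinctiveness can invert the ranking by raw count.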
I assume common words like “I” and “the” are eliminated as stop-words. In the word cloud for Origin of Species above “nest” is a specialized biological word that certainly shouldn’t be a smaller font than “Asa Gray” which has less than half as many occurrences. And “Asa Gray” shouldn’t be the same size as “natural selection.”
What about the complete absence of “plants”?
What do you think of GBS word clouds more generally? Are they “ready for prime time”?
Even if you eliminate stopwords, you’ll still tend to get pedestrian terms if you just go by absolute word count. (They’ll just be slightly less common words than the stopwords.) I think it’s much more useful to see words that appear more often than usual in the book I’m looking at; it gives me a good idea of what it’s about.
If you look closely “Asa Gray” is actually in a slightly smaller font than “natural selection”. Considering that “natural selection” probably appears much more frequently in the collection as a whole than “Asa Gray”, and “Asa Gray” appears 8 times in this book, the sizes chosen here seem sensible to me.
I’m not sure exactly how often “plants” appears in the text, but it’s a fairly common word, and if it only appears in the main text about 100 times (which looks like it might be the case), it might get pushed out. That’s not a big deal to me; I can see from some of the other words in the cloud (like flowers, pistil, pollen, and seeds) that this book probably deals a fair bit with plants, flowering plants in particular.
So in general I think that these clouds give a useful overview of what the book is talking about. It’s certainly better than a cloud based on an absolute word count (even with stopwords removed). Absolute counts work fine for tag clouds, but not for book word clouds, because of the very different nature of the text.
There may well be room for improvement, though. If you can think of a better way to pick out relevant words, I encourage you or others reading this to try it out. There are plenty of free book texts available to compare and contrast different approaches; the Project Gutenberg corpus, for instance, could be a useful test set.
As for pedestrian words, I don’t see much use in including these in the cloud: admit, appear, become, common, increase, living
Asa Gray is the only person in the word cloud. The index of the book shows that other people are mentioned at least as many times, people who are more important to the story than Gray, e.g. Joseph Hooker and Alfred Russel Wallace (who is critical to the story!). Listing Gray as the only person in the cloud gives a quite false view of the content of the book.
Do you know of other examples of word clouds in books? Or discussions of how to do them?
Thanks a lot for your thoughtful comments!
My name is Diego Puppin, and I am one of the engineers at Google Books that developed this feature.
The word cloud is designed to show terms and phrases (e.g. “natural selection”) which are common in the specific book, but relatively rare in general. While “pistil” appears only 15 times in this book, this term is very rare in our book collection, and gives you a very strong signal about the issues covered by Charles Darwin in his masterpiece. This is the reason some frequent words are small or completely missing, while less frequent words may appear in the cloud.
Also, we give emphasis to proper nouns, people and places, as they could lead you to interesting related material. For example, Asa Gray was a great botanist whose work deeply influenced Darwin. If you are interested in “The Origin of Species”, you may benefit from exploring Gray’s publications as well.
You are right, “plants” would also be a good choice. We picked the more specific phrase “animals and plants” instead, but unfortunately it did not fit in the final list we display. Thank you very much for taking the time to express your criticism: your feedback is very valuable, and we are always looking at ways to improve our algorithms.
Diego, thanks for the useful comments. It’s especially helpful to know that terms in word clouds are partly determined by their relative occurrence in other books in the GBS collection.
When you say “We picked the more specific phrase …” does that mean that there are humans involved in the generation of word clouds, rather than their being generated automatically?
Is there any description on the GBS site of the process by which word clouds are created? This would be helpful.
Just to clarify, the generation of the word cloud is a completely automatic system. We don’t have enough people to generate a million of those by hand! The algorithm is mainly based on standard TF/IDF, augmented with a few more signals and rules such as the added emphasis on people, places, etc.
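For readers unfamiliar with TF/IDF, the following Python sketch shows the textbook form of the calculation on a toy collection. It is not Google’s implementation: the extra signals and rules mentioned above (such as the emphasis on people and places) are not modeled, the three “documents” are invented, and the function name `tf_idf` is just a label for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, scoring tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Three tiny invented "books", already split into words.
docs = [
    "plants pistil pollen plants seeds".split(),
    "plants garden soil plants water".split(),
    "plants trade history ships".split(),
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

In this toy collection “plants” is the most frequent word in the first document yet scores zero, because it appears in every document, while “pistil” appears once and scores well. The same mechanism could plausibly explain a frequent word dropping out of a cloud.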
One shouldn’t expect much of an automatically-generated word cloud, but, unfortunately, people will. The clouds are no more reliable than automatically-generated indexes, which have long since been abandoned even in geeky computer manuals. Software engineers simply cannot devise software that matches an informed human brain. The US Copyright Office recognized the value and uniqueness of human-generated indexes when it granted them copyright status. I expect it will grant human-generated word clouds the same protection for the same reason.
I am also wondering if there is any rough estimate of the number of books that do have this feature.
Arash, I just did a little random survey in Google Book Search, and I find that word clouds are included for all books in GBS, except the ones with “No preview available.”
Thanks for your response, I came to the same conclusion.
I am starting a datamining project in which I am planning to use these word clouds, and I am hoping there might be an easier way than HTML parsing to access them. Are they accessible through the GBS API? I could not find any documentation on that. Currently I am using HTTP queries like this: http://www.google.com/books/feeds/volumes?max-results=20&q=“query” to search books and get the results in XML, but those results do not contain the word clouds.
Also, Diego confirmed my suspicion that they are using TF/IDF, but I am now wondering how they convert the TF/IDF value to different font sizes and whether there is any way to retrieve the original TF/IDF value for each term.
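How GBS maps scores to font sizes is not documented anywhere in this thread, so the following Python sketch only shows one common approach used for tag clouds in general: scale the logarithm of each score linearly onto a range of font sizes. The function name `font_sizes`, the point sizes, and the example scores are all assumptions made for illustration.

```python
import math

def font_sizes(scores, min_size=10, max_size=32):
    """Map positive scores to whole-number font sizes on a log scale."""
    logs = {term: math.log(score) for term, score in scores.items()}
    low, high = min(logs.values()), max(logs.values())
    span = high - low
    sizes = {}
    for term, value in logs.items():
        # If every score is identical, place all terms at the midpoint.
        position = (value - low) / span if span else 0.5
        sizes[term] = round(min_size + position * (max_size - min_size))
    return sizes

# Invented scores, not values taken from GBS.
example = {"natural selection": 8.0, "pistil": 2.0, "admit": 0.5}
sizes = font_sizes(example)
print(sizes)
```

If a mapping of this kind is in use, the rounding and rescaling discard information, so the original score could not be recovered exactly from the rendered font size alone.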