Alexis Madrigal’s recent article, Inside the Google Books Algorithm, has some valuable insights about the difficulties in adapting Google’s PageRank algorithm, which the company developed for searching Web pages, to books. The article was apparently written on the occasion of the launch of the Rich Results feature in Google Book Search (GBS), and Madrigal mixes this with his discussion of Google’s ongoing work on the algorithm. Reading the article closely, and checking out GBS pages myself, I see that Rich Results — which adds options for viewing search results — is a relatively superficial innovation when compared to the deeper and more subtle changes that Madrigal discusses.
Madrigal’s article is especially valuable because it includes his interviews with Google engineers on the challenges of the book algorithm. So in this post I’m excerpting those remarks and other insights from Madrigal.
After an introductory paragraph on the great success of the Google algorithm for searching Web pages, Madrigal quickly moves to the challenges of adapting this to books. Thinking of the Books in Browsers conference I recently attended, and also of Hugh McGuire’s idea that books and the web will soon merge, I’m particularly struck by Madrigal’s characterization of books as being “outside the web” (this is no doubt said from the viewpoint of the Web search algorithm, since GBS books lack the links essential for PageRank; but in comparison to other forms of digitized books, GBS books are relatively more a part of the Web, since they can be linked and used in a browser):
(Madrigal’s words, boldface throughout added) …
But what about when the company has to reach outside the web? The printed volumes represented on Google Books form a completely different kind of problem. Google’s famous algorithm can’t be deployed to search through books because they don’t link to each other in the way that webpages do. There is no perfect BookRank corollary for PageRank.
All of which made me wonder: How does Google Books work? What makes it tick? It turns out that it’s actually a great place for the company’s engineers to learn how to function in a linkless, physical world.
“There is a meaningful effort to say, how do we tune for books? We’ve got a lot of people doing very focused work on the web. How do we take the lessons from what we learned on the web and invent new things that are unique to books?” Matthew Gray, lead software engineer of Google Books, told me.
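To make the contrast concrete, here is a minimal sketch of the PageRank idea — my own illustration, not Google’s code: a page’s score is fed by the pages that link to it, which is exactly the structure scanned books lack.

```python
# Toy illustration of the PageRank idea (illustrative sketch only, not Google's code):
# a page's score is fed by the pages that link to it. Scanned books have no such
# link graph, which is why this approach cannot be carried over directly.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:                             # pass a share of rank along each outgoing link
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Three web pages linking to one another; a shelf of scanned books has no analogue of this graph.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```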
After a brief digression on the new Rich Results feature, Madrigal moves to the heart of the article, on the more fundamental improvements in the GBS algorithm. I’m especially struck that library holdings information is included:
Rich Results is the latest in a series of smaller front-end tweaks that have been matched by backend improvements. Now, the book search algorithm takes into account more than 100 “signals,” individual data categories that Google statistically integrates to rank your results. When you search for a book, Google Books doesn’t just look at word frequency or how closely your query matches the title of a book. They now take into account web search frequency, recent book sales, the number of libraries that hold the title, and how often an older book has been reprinted.
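As a rough way to picture what “statistically integrating” many signals might mean, here is a toy weighted-scoring sketch. The signal names and weights are my own hypothetical examples — Madrigal reports only that more than 100 signals are combined, not how:

```python
# Hypothetical sketch of blending several ranking signals into one score.
# The signal names and weights below are invented for illustration; the article
# says only that Google statistically integrates 100+ signals.

SIGNAL_WEIGHTS = {
    "text_match":       0.40,  # how well the query matches the book's text/title
    "web_search_freq":  0.20,  # how often the title is searched for on the web
    "recent_sales":     0.15,  # recent book sales
    "library_holdings": 0.15,  # number of libraries that hold the title
    "reprint_count":    0.10,  # how often an older book has been reprinted
}

def score_book(signals):
    """signals: dict of normalized (0-1) signal values for one book."""
    return sum(weight * signals.get(name, 0.0)
               for name, weight in SIGNAL_WEIGHTS.items())

def rank_books(candidates):
    """candidates: dict of book title -> signals dict; returns titles, best first."""
    return sorted(candidates,
                  key=lambda title: score_book(candidates[title]),
                  reverse=True)

books = {
    "Popular Novel":       {"text_match": 0.6, "recent_sales": 0.9, "web_search_freq": 0.8},
    "Scholarly Monograph": {"text_match": 0.9, "library_holdings": 0.7, "reprint_count": 0.3},
}
print(rank_books(books))
```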
More comments by Google engineers on the differences between Web pages and books:
“One of the fundamental things we’ve learned is that the whole is greater than the sum of the parts,” Gray said. This is deeply Google thinking but without the dominant algorithm. It’s a Google subspecies that evolved by feeding on a different corpus. There is less data about books than web pages, but there is more structure to it, and there’s less spam to contend with.
The most difficult part of making Google Books work, said James Crawford, the team’s engineering director, was determining the intent of the service’s heterogeneous user base. Scholars who search Google Books have very different wants and expectations from casual users looking to find a trade fiction title.
Concluding remarks — Especially interesting here is the idea that the real advances will happen after digitization is completed — Makes me think of Mike Cane’s remarks about Google’s use of metadata and how they use it to “make information do things”:
All the Google Books tweaks I’ve noticed are small. But you add them all up and apply them to the 15 million books Google has scanned and the truly unprecedented nature of Google Books starts to emerge. “We’re in the middle of doing something radical. No one has ever pulled together this whole collection, scanning books from 40 different libraries,” Crawford said. “I would say our general approach here has been to just get the books scanned because until they are digitized and OCR is done, you aren’t even in the game. As we get more and more content online, the work that Matthew’s team does gets to be more and more important and more and more doable.”
Eric Rumsey is at: eric-rumsey AttSign uiowa dott edu and on Twitter @ericrumsey