There’s been a lot of buzz about the announcement last week of mobile access to Google Book Search public-domain books. I’ve been looking hard for nitty-gritty details of how it works, though, and haven’t found much. The best is in comments by bowerbird on an announcement article on toc.oreilly.com. It’s easy for comments to get lost, so I’m excerpting most of bowerbird’s words here:

this offering is very good. extremely good. the interface is quite nice…

it was great to see google is serving digital text, rather than scans, since text is a lot more nimble. however, a tap on a paragraph brings up the scan of that paragraph, which is nice. and another tap restores the text. so if you want to verify the o.c.r., it’s simple to do. as i said above, this is nicely done.

curiously, in the one book i checked (roughing it), the text was extremely accurate as well, which is a pleasant discovery. i found only one o.c.r. error — “firty” for “fifty”, due to a blotch on the page …

this quality text is _not_ typical of google’s raw o.c.r., so they’ve evidently run some clean-up routines on it. i’m curious to see if they share this cleaned-up text with their library partners, or keep it to themselves… (no, the libraries weren’t smart enough to ask for it, as far as i know, let alone write it into the contracts.)

I’ve bolded what I take to be the most interesting point here, that Google has done an extra-special job of OCR’ing text for GBS mobile. As bowerbird notes, hopefully Google will share more about this process, sooner or later.

Clancy cites high usage of out-of-copyright books

This article is generally unremarkable, just another New York Times article on Google Books, although it does have some good quotes from prominent players. But it has two notable features. The first is the quote from Google’s Dan Clancy, in the second paragraph, stating a remarkably high volume of usage of out-of-copyright books. The second, which is why I’m excerpting the article at some length, is that it was given surprisingly little attention in the blogosphere/twittersphere when it was published a month ago.

Google hopes to open a trove of little-seen books [IHT version]
by Motoko Rich, New York Times, Jan 5, 2009

Ever since Google began scanning printed books four years ago, scholars and others … have been able to tap a trove of information that had been locked away on the dusty shelves of libraries and in antiquarian bookstores.

[boldface added] According to Dan Clancy, the engineering director for Google book search, every month users view at least 10 pages of more than half of the one million out-of-copyright books that Google has scanned into its servers.

The agreement, pending approval by a judge this year, also paved the way for both sides to make profits from digital versions of books. Just what kind of commercial opportunity the settlement represents is unknown, but few expect it to generate significant profits for any individual author. Even Google does not necessarily expect the book program to contribute significantly to its bottom line. … “We did not think necessarily we could make money,” said Sergey Brin, a Google founder and its president of technology, in a brief interview at the company’s headquarters. “We just feel this is part of our core mission. There is fantastic information in books. Often when I do a search, what is in a book is miles ahead of what I find on a Web site.”

Users are already taking advantage of out-of-print books that have been scanned and are available for free download. Mr. Clancy was monitoring search queries recently when one for “concrete fountain molds” caught his attention. The search turned up a digital version of an obscure 1910 book, and the user had spent four hours perusing 350 pages of it.

“More students in small towns around America are going to have a lot more stuff at their fingertips,” said Michael A. Keller, the university librarian at Stanford. “That is really important.”

Some librarians privately expressed fears that Google might charge high prices for subscriptions to the book database as it grows. … David Drummond, Google’s chief legal officer, said the company wanted to push the book database to as many libraries as possible. “If the price gets too high,” he said, “we are simply not going to have libraries that can afford to purchase it.”

Authors view the possibility of readers finding their out-of-print books as a cultural victory more than a financial one. … “Our culture is not just Stephen King’s latest novel or the new Harry Potter book,” said James Gleick, a member of the board of the Authors Guild. “It is also 1,000 completely obscure books that appeal not to the one million people who bought the Harry Potter book but to 100 people at a time.”

Some scholars worry that Google users are more likely to search for narrow information than to read at length. “I have to say that I think pedagogically and in terms of the advancement of scholarship, I have a concern that people will be encouraged to use books in this very fragmentary way,” said Alice Prochaska, university librarian at Yale.

“There is no short way to appreciate Jane Austen …,” said Paul Courant, university librarian at the University of Michigan. “But a lot of reading is going to happen on screens. One of the important things about this settlement is that it brings the literature of the 20th century back into a form that the students of the 21st century will be able to find it.”

Adam Hodgkin, in Google Pictures and Google Books, wonders why Google has chosen to put Prado paintings in Google Earth rather than in Google Images. In December I asked a similar question about Google’s putting Life Magazine pictures in Google Images, but putting other picture-laden magazines in Google Books. And, in another recent launch they’ve put newspapers, which also have many pictures, in Google News.

Once again I come back to the theme of this blog — Pictures are just different — They don’t fit neatly into our categories. Pictures are an important part of several different media — books, magazines, newspapers, and (of course) art — So what slot do we put them in?

Even before the recent questions arose with Life Magazine pictures, Google Magazines, Google Newspapers, and Prado paintings, there was the ongoing but little-noted question of pictures in the growing collection of public domain books in Google Books. In my experience, these are completely absent from Google Image Search. When will Google make this connection?

Figuring out what category to put them into, of course, is a relatively minor problem compared to the BIG PROBLEM with pictures, which is making them searchable! If there were one category to put them into that was searchable, then of course that would be the place for Google to put them!

Excerpts from Peter Brantley’s eloquent words on the Google Book settlement, in A fire on the plain (bold added).

With recent back and forth over the proposed Google Book Search settlement (e.g., Robert Darnton’s essay in The New York Review of Books; Tim O’Reilly’s response; and James Grimmelmann’s litany of proposed corrections predating both at The Laboratorium), I’ve been cast again into thinking about aspects of the agreement.

It is difficult to credit that frustrating access is ever able to delay or stem fundamental social trends – for example, the increasing importance of visual and interactive media. … Or the possibility that searching and reading networked books for anyone under the age of 40 might be an inherently social activity that generally increases enthusiasm for all forms of reading.

Let us consider a far more basic, more fundamental concern: the proposed Google Book Search settlement is embedded in a set of conceptions about books, reading, and information access which is as profoundly obsolescent as the printed Encyclopedia.

This is a world where young children carry around in the palm of their hands gaming consoles that have more networked computing capacity than a moderately powerful Sun workstation of five years back. Where increasingly I think about printed books with as much fondness as large cinder blocks, …  And yet authors and publishers worry that a fair level of access to digitized books … might reduce their profits. Truly, this should not be their worry. Their eyes remain cast on a horizon which has fallen from the earth, while a new sun is rising.

The settlement describes a world of time past, not a world of possibilities. Can we not imagine a redrafting of the settlement’s terms with libraries? … let us envision an alternative world where children routinely carry Alexandria in their hands. Where they experience works of literature as games, pushing at the borders of their knowledge and experience by engaging the library with others as a festschrift.

The people served by our libraries – let them show us how to re-make literature in a world where it fits in the circle of many hands, caressed by fingers, shared between minds. Libraries are laboratories for the future of reading, and with this, we have the key to it. … We stride into a world where books are narratives in long winding rivers; drops of thought misting from the sundering thrust of great waterfalls; and seas from which all rivers and rain coalesce, and which carry our sails to continents not yet imagined.

[concluding paragraph] Digital books are sparkles of magic untapped. The settlement proposes a bold path from darkness. But it is a trail that circles back to an old forest, abandoned. Our people have left, ventured onto a flat savannah, strewn with rocks, thorny shrubs, windblown trees, beasts. We can see it all now. And we are starting fires, with wood from fallen trees. Burning down the forest.

Eric Rumsey is at @ericrumsey

In a brief response letter, author and publisher Marc Aronson writes about the copyright status of pictures that are in publisher partner books in Google Books. Aronson suggests that the rights for pictures are separate from the rights for text. I’ve corresponded with Aronson to expand on this idea, and he says that in his experience as an author and editor, he has been told that he needs to obtain rights to pictures and text separately. I’ve searched for other commentary on this issue, and have found very little. It’s a subject that needs exploration. Anyone have ideas?

All books in the publisher partner program, of course, are under copyright, and are available only in Limited Preview, with the publisher giving Google the rights to display a specific number of pages. In some cases of books containing pictures, however, the pages are available, but without the pictures. Is this because the publisher has gotten the rights for limited preview of the text, but not the pictures, as Aronson suggests? The three examples below show a variety of Limited Preview options. The first two are especially pertinent, because they are for books from the same publisher (Macmillan), in the same series, that have a different picture preview status, possibly indicating that the illustrator has given permission to display pictures in the first case, but not in the second.

In this example, the first 39 pages* are available for preview, with all pictures displaying. There are about 30 thumbnail images for pages with pictures on the About this Book page.
Birds of North America (Golden Field Guides)
By Chandler S. Robbins et al, Illustrated by Arthur Singer, Published by Macmillan, 2001

In this book, from the same publisher, the first 37 pages* are available for preview, but almost none of the pictures display; they are replaced with the message “Copyrighted image.” There are no thumbnail images on the About page.
Wildflowers of North America (Golden Field Guides)
By Frank D. Venning, Illustrated by Manabu C. Saito, Published by Macmillan, 2001

This book follows the most common, fairly liberal, pattern of publishers in Limited Preview books, with the first 50 pages* available, including all pictures. A full complement of 30 thumbnails is on the About page.
Central Rocky Mountain Wildflowers
By H. Wayne Phillips, Illustrated, Published by Globe Pequot, 1999

* The number of pages available for preview varies from session to session — The number given here is the maximum I experienced.

Mike Cane hits the target on color eBooks

Truly, the first device that can do color eBooks will change things forever … There are three recent signs — as well as a total wild card — that point to possible dramatic changes in the eBook-reading hardware landscape. …

… The first is Samsung hitting the pedal hard on OLED screen manufacturing.

… The second development has been Hewlett-Packard demonstrating color eInk screens.

… The third piece of this puzzle: Amtek Rumored to Show Slate Netbook at CES 2009.

… The Wild Card in all this? … Pixel Qi, which brags it has revolutionary screens that will basically run on electrons by osmosis instead of the greedy sip-sip-sip of current technology.

Sad to say, this is one of Mike’s last blog postings — His incisive comments on eBooks will be missed.

Andrew Smith, at the Dallas News, writes on the same subject in the article Why e-books will rule:

… Nearly all non-fiction books cry out for far more illustration than they contain, but the costs of adding pictures and charts (especially color pictures and charts) are prohibitive. That’s why you see so many non-fiction books with all the photos bunched up into a couple of glossy-page sections in the center. It’s the only cheap(ish) way to get the job done. Color E-Ink will change that forever.

Scientific and medical books, which make heavy use of color illustrations, stand to benefit especially from the advent of color eBooks. Color eBooks might also lower prices, which for print textbooks can break a student’s budget.

The list presented here has FULL VIEW (public domain, pre-1923) journals in Google Books. This is certainly NOT intended to be a complete list! There’s no easy way that I have found to limit a search in Google Books to journals, so I have found these titles by searching for appropriate words such as medical, dermatology, journal, archive, transactions. I have not included titles that have fewer than 5 volumes in Google Books. Unfortunately, there’s no way that I have found to sort the title searches chronologically, so to find a particular volume, it’s necessary to go through the results list. Each entry in the list below has links to the first and last volumes that I have found for each title; these dates are not necessarily inclusive. For “contributing libraries,” examples are given if there is more than one contributing library.

This list grew a lot longer than I thought it would — I was surprised to find so many journals in Google Books! It was a tedious job compiling this, and I probably won’t try to keep it current, with new volumes being added all the time. If I get feedback :-) I’m more likely to put in more work on it, so please add a comment, or mail me at: eric[hyphen]rumsey AT uiowa[dot]edu

A New York Times article on Google Flu Trends reports that Google’s methodology “has been validated by an unrelated study” based on Yahoo! search data whose lead author is Philip Polgreen, an infectious disease doctor at the University of Iowa.

I was glad to learn about the Polgreen study, first, of course, because Polgreen and colleagues are right here at the University of Iowa! — But beyond that, it was good to find in the full article by the Polgreen team that they give more details about the flu-related search terms they used than the Google Flu Trends team does, making it easier to break down the complicating factors in flu searching. Specifically, they report that they excluded the following terms:

bird, avian, pandemic
vaccine, vaccination, shot

As discussed in accompanying articles (see below), flu is a particularly complicated disease for correlating disease occurrence and web search behavior, because of the existence of bird flu, and because there is a vaccine for flu — exactly the factors that have been excluded by Polgreen et al. It seems likely that the Google Flu Trends team is using a similar method.
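The exclusion approach described above can be illustrated with a minimal Python sketch. The six excluded terms are the ones listed in this post. Everything else is an assumption made for illustration: the inclusion words ("flu", "influenza"), the whole-word matching, and the function names are hypothetical and are not taken from the Polgreen article or from Google Flu Trends.

```python
import re

# Assumption: a query is flu-related if it mentions one of these words.
FLU_TERMS = {"flu", "influenza"}

# Exclusion terms as listed in this post for the Polgreen study.
EXCLUDED_TERMS = {"bird", "avian", "pandemic", "vaccine", "vaccination", "shot"}


def tokenize(query):
    """Lowercase the query and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", query.lower()))


def is_flu_query(query):
    """True if the query mentions flu and none of the excluded terms."""
    words = tokenize(query)
    return bool(words & FLU_TERMS) and not (words & EXCLUDED_TERMS)


def count_flu_queries(queries):
    """Count the queries that pass the flu filter."""
    return sum(1 for query in queries if is_flu_query(query))
```

Because matching is on whole words, a variant such as "shots" would slip past this filter; a real study would need stemming or a longer term list. The sketch is only meant to show why searches about bird flu and flu vaccines drop out of the count.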

Incidentally, more on the Iowa connection — Philip Polgreen has been involved for several years with the Iowa Health Prediction Market, a spin-off of the Iowa Electronic Markets, a real-money prediction market/futures market that’s used to make predictions in political elections.

** This is one of a group of three articles on Google Flu Trends:

Together, these articles suggest that, although it’s difficult to know with assurance because Google has not revealed the search terms that they use for GFT, it seems likely that they’ve done a good job in working around the complications of flu-related search patterns.

Peter Suber, at Open Access News, has a good article on Google’s recent announcement that they are now OCR’ing scanned PDF documents so that they become searchable text documents in Google Web Search.

Scroll down especially to Suber’s comments, in which he describes the background to this Google advance, which is already in Google Book Search — As he says, it’s had an OCR’d text layer version of full-view books from the start, which is how they can be searched. (Google Catalogs also has a searchable text layer).

For more on searchable and non-searchable text see: Identifying Google scanned PDF’s

Google recently announced that scanned PDF documents are now available in Google Web Search. PDF documents have been in Google before, but most PDF documents that have been scanned from paper documents have not, so this will greatly improve access to PDF’s. As described below, it’s important to be able to distinguish scanned PDF’s from others, of the sort that have been in Google before.

Scanned PDF documents are originally created by making an image scan of a paper document, and since the text is an image, it’s not selectable or searchable as text. The other kind of PDF document, usually called native PDF, that’s been in Google before, is originally created from an existing electronic formatted document, like a Word document, and its text is selectable and searchable as text.

From Google search results it’s not possible to determine whether a PDF document is a scanned document or a native document. Both simply say “File Format: PDF/Adobe Acrobat.” To see if it’s scanned or native PDF, go to the document and click on a word to see if it can be selected. If it can, it’s native PDF; if not, it’s scanned PDF. It’s important to know this because in a scanned PDF, the text is not searchable within the PDF reader in the browser. This is not readily apparent, because the search command seems to work, but comes up with zero results. To search the text of a scanned document, go to search results, and click “View as HTML,” which has the text of the document.

Examples from Google:
Google search : Scanned PDF – Text cannot be selected (Notice that the text in this document is scratchy, poor quality, another indication of scanned text).
Google search : Native PDF – Text can be selected
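The click-to-select test can be roughly approximated in code. The Python sketch below is a heuristic, not a definitive method, and it rests on one assumption: a PDF with selectable text normally declares font resources (a "/Font" entry), while an image-only scan does not. The function names are hypothetical. It looks at the raw file bytes and at any streams it can inflate with zlib, and nothing more.

```python
import re
import zlib

# Capture everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_pdf_chunks(pdf_bytes):
    """Yield the raw file bytes, then each stream body that zlib can inflate."""
    yield pdf_bytes
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline after the data.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate-compressed (for example a JPEG page image); skip it.
            continue


def has_text_layer(pdf_bytes):
    """Heuristic: a font resource suggests selectable text."""
    return any(b"/Font" in chunk for chunk in iter_pdf_chunks(pdf_bytes))


def classify_pdf(pdf_bytes):
    """Return 'native' if a text layer seems present, else 'scanned'."""
    return "native" if has_text_layer(pdf_bytes) else "scanned"
```

To try it on a file, read the bytes with `open(path, "rb").read()` and pass them to `classify_pdf`. A scanned PDF that already carries an OCR text layer will be reported as "native" here, which matches the click test: its text can be selected. Encrypted files or unusual stream filters may defeat the heuristic.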

See also: Google Books and Scanned PDF’s

For more: