Peter Suber, at Open Access News, has a good article on Google’s recent announcement that they are now OCR’ing scanned PDF documents so that they become searchable text documents in Google Web Search.

Scroll down especially to Suber’s comments, in which he describes the background to this Google advance: the same approach is already used in Google Book Search. As he says, Google Book Search has had an OCR’d text layer for full-view books from the start, which is how they can be searched. (Google Catalogs also has a searchable text layer.)

For more on searchable and non-searchable text see: Identifying Google scanned PDF’s

Google recently announced that scanned PDF documents are now available in Google Web Search. PDF documents have been in Google before, but most PDFs scanned from paper documents have not, so this will greatly improve access to PDFs. As described below, it’s important to be able to distinguish scanned PDFs from the other kind, the sort that has been in Google before.

Scanned PDF documents are created by making an image scan of a paper document. Because the text is an image, it’s not selectable or searchable as text. The other kind of PDF document, usually called native PDF, is the kind that’s been in Google before. It is created from an existing electronic document, such as a Word document, and its text is selectable and searchable.

From Google search results it’s not possible to determine whether a PDF document is scanned or native; both simply say “File Format: PDF/Adobe Acrobat.” To see whether it’s scanned or native PDF, go to the document and click on a word to see if it can be selected. If it can, it’s native PDF; if not, it’s scanned PDF. It’s important to know this because in a scanned PDF the text is not searchable within the browser’s PDF reader. This is not readily apparent, because the search command seems to work but comes up with zero results. To search the text of a scanned document, go back to the search results and click “View as HTML,” which shows the text of the document.
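The click test above can be roughly approximated in code. The Python sketch below is only a crude byte-level heuristic, not how Google or any PDF reader tells the two kinds apart: it guesses that a file is scanned if it contains embedded images but no font resources. The function name looks_scanned and the sample byte strings are made up for illustration. It will misjudge files whose internal objects are compressed, and a scanned PDF that already carries an OCR’d text layer will be reported as native (which matches the click test, since its text can be selected).

```python
import re


def looks_scanned(pdf_bytes: bytes) -> bool:
    """Rough guess: True if the PDF has embedded images but no font resources.

    Assumption: an image-only scan declares no /Font objects, while a native
    PDF does. Compressed object streams can hide both markers.
    """
    has_font = re.search(rb"/Font\b", pdf_bytes) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", pdf_bytes) is not None
    return has_image and not has_font


# Tiny hand-made fragments standing in for real files (hypothetical data).
scanned_sample = b"%PDF-1.4 1 0 obj << /Type /XObject /Subtype /Image >> endobj"
native_sample = b"%PDF-1.4 1 0 obj << /Type /Font /Subtype /Type1 >> endobj"

print(looks_scanned(scanned_sample))  # True
print(looks_scanned(native_sample))   # False
```

For a real file, the bytes would come from something like open(path, "rb").read(); a proper PDF library would give a more reliable answer than this pattern match.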

Examples from Google:
Google search: Scanned PDF – Text cannot be selected (notice that the text in this document is scratchy and of poor quality, another indication of scanned text).
Google search: Native PDF – Text can be selected

See also: Google Books and Scanned PDF’s

For more:

Kalev Leetaru (Univ Illinois) recently published a lengthy and interesting article comparing Google Books and the Open Content Alliance. It’s especially interesting because it brings together a good description of many nitty-gritty details of Google Books that are not easy to track down. I’m excerpting a few passages on the use of color and PDF format in Google Books.

Color in Google Books – I have the impression, as Leetaru says, that when Google first started scanning books they didn’t scan in color. They do now, though, at least in some cases.

[I’ve added the bold-face in quotes below. The order of quotes is not necessarily the same as in Leetaru’s article.]

Since the majority of out–of–copyright books do not have color photographs or other substantial color information, Google decided early on that it would be acceptable to trade color information for spatial resolution.

Google’s use of bitonal imagery and its interactive online viewing client significantly decrease the computing resources required to view its material. … Google Book’s bitonal page images, on the other hand, render nearly instantly, permitting realtime interactive exploration of works.

Use of PDF in Google Books – It’s interesting that Leetaru says the Google Books view “mimics the PDF Acrobat viewer.” Until recently, I avoided using the “Download PDF” button link in Google Books, thinking that it was mainly for downloading to print, and that the PDF view would take a long time to load. But I’m finding that it loads quickly, and provides a fairly usable interface that is in fact reminiscent of the Google Books view, as Leetaru suggests.

Google realized it was necessary to use different compression algorithms for text and image regions and package them in some sort of container file format that would allow them to be combined and layered appropriately. It quickly settled on the PDF format for its flexibility, near ubiquitous support, and its adherence to accepted compression standards (JBIG2, JPEG2000).

While many digital library systems either do not permit online viewing of digitized works, or force the user to view the book a single page at a time (called flipbook viewing), Google has developed an innovative online viewing application. Designed to work entirely within the Web browser, the Google viewing interface mimics the experience of viewing an Adobe Acrobat PDF file.

While most services take advantage of the linearized PDF format, Google made a conscious decision to avoid it. Linearized PDFs use a special data layout to allow the first page of the file to be loaded immediately for viewing … Google found several shortcomings with this format [noting that] the majority of PDF downloads are from users wanting to view the entire work offline or print it [and that] for these users, linearized PDFs provide no benefit.

See Leetaru’s extensively-referenced article for many other useful details.

Color pictures are generally uncommon in full-view books in Google Books. This is not surprising, since color pictures were uncommon in books published before the copyright cutoff date (1923). Searches in Google Books for likely subjects (museum, sculpture, french painting, history) do find many books with pictures, but they are almost all black and white.

An exception to the general lack of colored illustrations in older books is in the areas of botany and dermatology, two subjects in which I have a particular interest. In these subjects there were many books published in the 19th century, especially in Europe, with excellent color illustrations. A few examples from the Selected pages section of the Google Books About this book page are shown here.

For more see Color Pictures in Google Books: More examples

A little-discussed but valuable part of Google Books is the About this book page. This is sort of like an enhanced card catalog view of the book. In addition to standard bibliographic data, it has a variety of other useful information. For books with pictures, an especially valuable part of this page is the Selected pages section, which has thumbnails for a selection of pictures in the book.

The About page is especially useful for full-view books, for which it has a wide variety of information, including popular passages, references from web pages and scholarly works, other editions, related books, and places mentioned (linked to Google Maps).

The About page is in strong contrast to the Frontcover screen in full-view books, which is where the main title entry on the search results page links. For full-view books, Frontcover usually goes to the title page of the book, which in most cases is much less useful than the About page.

In Limited-preview and Snippet-view books, the About page usually has much less information than in full-view books (this is apparently controlled by the publisher of the book). In these books, the Frontcover view really does go to the front cover of the book, which, of course, is usually a colorful, exciting picture.

Interestingly, the basic URL for entries in Google Books, which has just the generic base plus the ID number for the specific book, goes to the About this book screen …

The URL for the frontcover screen then builds on this basic address …

… Which seems to imply that on some level the designers of Google Books see the About screen as the more basic, elemental unit for the book.
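The pattern described above, a base address plus a book ID, with the frontcover address adding to it, can be sketched in a few lines of Python. The actual URLs are not reproduced in this post, so everything specific here is an assumption for illustration: the base address, the id and printsec parameter names, and the placeholder ID EXAMPLE_ID are not taken from the text above.

```python
from urllib.parse import urlencode

# Assumed base address and parameter names; the post does not spell them out.
BASE = "https://books.google.com/books"


def about_url(book_id: str) -> str:
    """Basic entry URL: generic base plus the book's ID."""
    return f"{BASE}?{urlencode({'id': book_id})}"


def frontcover_url(book_id: str) -> str:
    """Frontcover URL: the basic entry URL with one more parameter appended."""
    return f"{BASE}?{urlencode({'id': book_id, 'printsec': 'frontcover'})}"


# EXAMPLE_ID is a placeholder, not a real book ID.
print(about_url("EXAMPLE_ID"))
print(frontcover_url("EXAMPLE_ID"))
```

In this sketch the frontcover address is literally the basic address with something added, which is the relationship that suggests the About screen is the more elemental unit.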

I just came across an old but still current article by Roy Tennant, “Digital Libraries- The Other E-Books,” written in 2001. Tennant says:

When People refer to e-books, they typically mean device-dependent e-books (ER: bold added, see below) such as those marketed by Gemstar (ER: a long-gone pre-Kindle reader) … The term infrequently seems to encompass efforts by libraries, universities, or others to publish e-books on the net for free.

Writing three years before Google Books launched, Tennant seems prophetic. I thought I was being rather original 😉 in my posting of last week, “Are Google Books eBooks?”, but Tennant made an inroad into the discussion long ago! Of course, what Tennant refers to as “libraries, universities, or others” has now turned into Google Books.

Meta-story – How I stumbled on this article – In thinking about how to refer to different types of ebooks, I wanted to try out the term “device dependent ebook,” so I Googled it, and voilà! I saw that Roy’s 2001 article was #1 >> Google search: device dependent ebook

Eric Rumsey is at: eric-rumseytemp AttSign uiowa dott edu and on Twitter @ericrumseytemp

What are eBooks? Wikipedia says an eBook is “the digital media equivalent of a conventional printed book … usually read on a PC or … an e-book reader.”

This sounds like what Peter Brantley calls “the older model” of eBooks that are “downloadable and packaged … into dedicated readers.” The newer idea that Brantley suggests, in contrast, is of eBooks that will be “consumed over the network.”

In common usage, Google Books are not usually considered to be in the category of eBooks (the Wikipedia eBooks article linked above doesn’t mention Google). In Brantley’s terms, though, with eBooks being consumed over the network, I would suggest that full-view books in Google Books are in fact eBooks.

The eBook meme does seem to be changing. There’s indication that the term eBook is coming to be applied to Google Books: a recent Christian Science Monitor article on Google Books refers to them in the title as “e-books.” The article quotes prominent blogger Siva Vaidhyanathan to the effect that Google is coming to dominate “the digital book world,” with the clear implication (given “e-Book” in the article title) that Google Books are in the category of eBooks.

Browser-worthy small-scale devices like the iPhone and the G1/Google phone/Android make it increasingly likely that Google Books will be put in the same category as dedicated e-book services like the Amazon Kindle. Especially for books that have pictures, reading Google Books in a browser is clearly preferable to using the Kindle, which does not even have color.

In the last year, we’ve begun to include the copyright status of pictures on Hardin MD pages. We have especially done this to show which pictures are not under copyright and are therefore free to copy.

Recently Peter Brantley suggested that libraries should make it easy for users to find public domain content on their sites. So, with thanks to Brantley for this idea, we’ve made a page that shows public domain galleries for specific diseases (see snip below).

Brantley also suggests that libraries with public domain content make the content available from a specific directory called /public. This seems like a good idea, and we will be considering it.

Maps and newspapers, because they’re rich in graphic information, benefit greatly from a zooming and panning interface. Text-only books, because they’re more linear and because text is easily searchable, don’t benefit from this sort of interface as much, but books with pictures certainly do. zKimmer has recently implemented Google Maps technology for viewing non-map text and picture resources, such as magazines and newspapers, which are converted from PDF format. This is an exciting development, especially because it holds promise that the same sort of technology could also be used for books.

With Google’s great success using a zooming-panning interface in Google Maps, and with the recent launch of Google Newspapers, which also uses it, the question naturally arises: will Google developers sooner or later also use it for Google Books?

The zKimmer screen-shots above are from a magazine (though they could easily be from a book) and those below are from a newspaper. They both show how this interface facilitates navigating a resource that includes extensive pictures as well as text.

zKimmer lacks a good search capability (it has a search box, but it doesn’t seem to work), so it’s not ready for heavy-duty enterprise use. It’s exciting, though, because it shows the potential value of a zooming-panning interface for books. Google Books already uses panning and zooming in a limited way, for navigating between pages, but a multi-page pan and zoom, as in zKimmer, would greatly simplify picture and text navigation.

Other implementations of the Google Maps API for non-map graphic resources are a desktop collection of elegant books by the reclusive German techno-artist Markus Dressen, and a card set from World of Warcraft.

Looking at Google Newspapers has got me thinking that the same sort of zooming-panning interface used there, and in Google Maps, could also be used for viewing books. An example of this is shown in the screenshots from videos on Seadragon linked below.

Seadragon is a zooming-panning technology, owned by Microsoft, and used as a component in other tools, such as PhotoZoom, Silverlight, Photosynth, and various Microsoft mapping applications. When Microsoft acquired it in 2006, it got attention as a powerful component of other Microsoft applications, but I haven’t seen it featured as a potential interface design tool for ebooks. Book viewing is a relatively small part of the videos below, but the screenshots give a feel for it. These are from two different videos, both showing how the system can be used to zoom in on pages from a book.

The sequence above, which is made up of 800 images from a map collection at the Library of Congress, shows how easy it is to zoom in to find pages that have text and pictures together. This video (2:13) was made by the company from which Microsoft bought Seadragon.

The second sequence is from a video (7:42, the first 2:50 on Seadragon) of a talk by Blaise Aguera, the creator of Seadragon. As indicated, it shows zooming in on a large text source.

Both of these videos emphasize the obvious usefulness of Seadragon technology for mapping applications. But they also show that it has potential for viewing online e-books, so it’s too bad Microsoft dropped out of the Internet Archive digitization project in May 2008!