Kalev Leetaru (Univ Illinois) recently published a lengthy and interesting article comparing Google Books and the Open Content Alliance. It’s especially interesting because it brings together a good description of many nitty-gritty details of Google Books that are not easy to track down. I’m excerpting a few passages on the use of color and PDF format in Google Books.

Color in Google Books – I have the impression, as Leetaru says, that when Google first started scanning books they didn’t scan in color. They do now, though, at least in some cases.

[I’ve added the bold-face in quotes below. The order of quotes is not necessarily the same as in Leetaru’s article.]

Since the majority of out–of–copyright books do not have color photographs or other substantial color information, Google decided early on that it would be acceptable to trade color information for spatial resolution.

Google’s use of bitonal imagery and its interactive online viewing client significantly decrease the computing resources required to view its material. … Google Book’s bitonal page images, on the other hand, render nearly instantly, permitting realtime interactive exploration of works.

Use of PDF in Google Books – It’s interesting that Leetaru says the Google Books view “mimics the PDF Acrobat viewer.” Until recently, I avoided using the “Download PDF” button link in Google Books, thinking that it was mainly for downloading to print, and that the PDF view would take a long time to load. But I’m finding that it loads quickly, and provides a fairly usable interface that is in fact reminiscent of the Google Books view, as Leetaru suggests.

Google realized it was necessary to use different compression algorithms for text and image regions and package them in some sort of container file format that would allow them to be combined and layered appropriately. It quickly settled on the PDF format for its flexibility, near ubiquitous support, and its adherence to accepted compression standards (JBIG2, JPEG2000).

While many digital library systems either do not permit online viewing of digitized works, or force the user to view the book a single page at a time (called flipbook viewing), Google has developed an innovative online viewing application. Designed to work entirely within the Web browser, the Google viewing interface mimics the experience of viewing an Adobe Acrobat PDF file.

While most services take advantage of the linearized PDF format, Google made a conscious decision to avoid it. Linearized PDFs use a special data layout to allow the first page of the file to be loaded immediately for viewing … Google found several shortcomings with this format [noting that] the majority of PDF downloads are from users wanting to view the entire work offline or print it [and that] for these users, linearized PDFs provide no benefit.

See Leetaru’s extensively-referenced article for many other useful details.

I just came across an old but still current article by Roy Tennant, Digital Libraries - The Other E-Books, written in 2001. Tennant says:

When people refer to e-books, they typically mean device-dependent e-books (ER: bold added, see below) such as those marketed by Gemstar (ER: a long-gone pre-Kindle reader) … The term infrequently seems to encompass efforts by libraries, universities, or others to publish e-books on the net for free.

Writing three years before Google Books launched, Tennant seems prophetic — I thought I was being rather original 😉 in my posting of last week — Are Google Books eBooks? — but Tennant made an inroad into the discussion long ago! Of course, what Tennant refers to as “libraries, universities, or others” has now turned into Google Books.

Meta-story – How I stumbled on this article – In thinking about how to refer to different types of ebooks, I wanted to try out the term “device dependent ebook,” so I Googled it, and voilà! I saw that Roy’s 2001 article was the #1 result (Google search: device dependent ebook).

Eric Rumsey is at: eric-rumseytemp AttSign uiowa dott edu and on Twitter @ericrumseytemp

What are eBooks? Wikipedia says an eBook is “the digital media equivalent of a conventional printed book … usually read on a PC or … an e-book reader.”

This sounds like what Peter Brantley calls “the older model” of eBooks that are “downloadable and packaged … into dedicated readers.” The newer idea that Brantley suggests, in contrast, is of eBooks that will be “consumed over the network.”

In common usage, Google Books are not usually considered to be in the category of eBooks (The Wikipedia eBooks article linked above doesn’t mention Google). In Brantley’s language, though — eBooks being used on the network — I would suggest that full-view books in Google Books are in fact eBooks.

The eBook meme does seem to be changing. There are indications that the term eBook is coming to be applied to Google Books: a recent Christian Science Monitor article on Google Books refers to them in the title as “e-books.” The article quotes prominent blogger Siva Vaidhyanathan to the effect that Google is coming to dominate “the digital book world,” with the clear implication (with “e-Book” being in the article title) that Google Books are in the category of eBooks.

Browser-worthy small-scale devices like the iPhone and the G1/Google phone/Android make it increasingly likely that Google Books will be put in the same category as dedicated e-book services like the Amazon Kindle. Especially for reading books that have pictures, reading Google books in a browser is clearly preferable to using the Kindle, which does not even have color.

As computers have become more powerful, many of the aspects of handling text that were formerly done by humans have been taken over by computers. Pictures, however, are much more difficult to automate — Recognizing patterns remains a task that humans do much better than computers. A human infant can easily tell the difference between a cat and a dog, but it’s difficult to train a computer to do this.

In pre-Google days, the task of finding good lists of web links needed the input of smart humans (and Hardin MD was on the cutting edge in doing this). Now, though, Google Web Search gives us all the lists we need.

Pictures are another story — on many levels, pictures require much more human input than text.

The basic, intractable problem with finding pictures is that they have no innate “handle” allowing them to be found. Text serves as its own handle, so it’s easy for Google Web Search to find it. But Google Image Search has a much more difficult task. It still has to rely on some sort of text handle that’s associated with a picture to find it, and is at a loss to find pictures not associated with text.
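To make the “text handle” idea concrete, here is a minimal sketch in Python. It is only an illustration of the general principle, not a description of how Google Image Search actually works; the file names, captions, and the build_index and search helpers are all hypothetical. Each picture can be found only through the words attached to it, so a picture with no associated text never shows up.

```python
import re
from collections import defaultdict

# Hypothetical sample data: image file names paired with whatever text
# (caption, alt text) happens to accompany them on a page.
IMAGES = {
    "img001.jpg": "Chest x-ray showing pneumonia",
    "img002.jpg": "Photo of a skin rash on the arm",
    "img003.jpg": "",  # no associated text at all
}

def build_index(images):
    """Map each lowercase word in the associated text to the images it describes."""
    index = defaultdict(set)
    for name, text in images.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the images whose associated text contains every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(IMAGES)
print(sorted(search(index, "skin rash")))
print(sorted(search(index, "pneumonia")))
```

In this toy setup, img003.jpg has no text attached, so no query can ever return it, which is the problem in miniature.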

The explosive growth of Hardin MD since 2001 (page views in 2008 are over 50 times higher than in 2001) has been strongly correlated with the addition of pictures. This period has also coincided with the growing presence of Google and its PageRank technology, which has made old-style list-keeping, as had been featured in Hardin MD, less important.

Though Google has accomplished much in the retrieval of text-based pages, it’s made little progress in making pictures more accessible. Google Image Search is the second most-used Google service, but its basic approach has changed little over the years.

The basic problem for image search is that pictures don’t have a natural handle to search for. Because of this it takes much more computer power for the Google spider to find new pictures, and consequently it takes much longer for them to be spidered, compared to text pages (measured in months instead of days).

Beyond the problem of identifying pictures there are other difficult-to-automate problems for image search:
• How to display search results most efficiently to help the user find what they want: do you rank results according to picture size, number of related pictures at a site, or some other, more subjective measure of quality?
• What’s the best way to display thumbnail images in search results?
• How much weight should be given to pictures that have associated text that helps interpret the picture?

So — Good news for picture people! — I would suggest that pictures are a growth sector of the information industry, and a human-intensive one. I would predict that text-based librarians will continue to be replaced, as computers become more prominent. But there will continue to be a need for human intelligence working in all areas relating to pictures, from indexing/tagging to designing systems to make them more accessible.