The slides and data from Jon Orwant’s presentation on Google Book Search at TOC, which were not available when I wrote previously, have now been put up on the O’Reilly site. [these have been removed, see comment below] The presentation is made up of 59 PDF slides, covering a range of recent developments with Google Books, including the recent release of GBS mobile, and a discussion of the Oct 2008 Publisher settlement. The part I’m most interested in is the data on GBS usage that Orwant had mentioned in various venues before, but with few details. The details in the TOC presentation are mostly in three “Case studies” of publishers that participate in the GBS Partner Plan: McGraw-Hill, Oxford University Press, and Springer. I’ve chosen one slide for each of these publishers that shows various long-tail effects for usage of their books that are in GBS, and one slide that has data for a more extensive grouping from GBS.

McGraw-Hill case study is presented in slides 21-23. Below is slide 24. Note that this is a small sample of only the top 30 titles.

Oxford University Press – Slides 26-31. Below is slide 27. Note the long tail of visits for pre-1990 books.

Springer – Slides 32-36. Below is slide 35, showing clicks for Buy this Book. Note again the very long tail of clicks for pre-1995 books.

Slide 37 below shows “Share of books with more than 10 pages viewed”, apparently for all books in GBS. The coloring of the data lines looks ambiguous to me. The lowest line is undoubtedly for Snippet View books. The top line looks like it is for Limited Preview books, and the middle line is apparently for Full View books. Limited Preview books presumably rank higher than Full View books because they are more current.

Please comment here or on Twitter @ericrumseytemp

When a link is clicked to a specific page in GBS Mobile, the page that always opens is the entry page for the book. There doesn’t seem to be a way to link successfully to specific pages. I’ve tried this in several examples, and have had the same experience in all of them. An example below illustrates.

In this example, I’m trying to link to a group of pages starting with page 31. But when the link below is clicked, it goes to the entry page, which is page 21, with the same URL as below except that the page number is 21 instead of 31.
[This link and the link in the image below are the same]

After this link is clicked, and it goes to page 21, then it does work to change the number from 21 to 31, and it goes to page 31. The right > next to “Pages 21-30” also works.

When the link is clicked to go to page 31, and ends up on page 21, clicking the Back button goes to page 31. And the address bar initially reads 31, but then changes to 21. So when the link is initially clicked, it does “pass through” page 31, but apparently there’s some signal on page 31 that tells it to redirect to page 21.

Does anyone see what’s happening here? Any help would be much appreciated! Please post suggestions in comments, or in Twitter.






In a recent NY Times article that I blogged on, Dan Clancy, the engineering director for Google book search, is cited as saying “every month users view at least 10 pages of more than half of the one million out-of-copyright books that Google has scanned into its servers.” Remarkably, this classic long tail description of Google Books seems not to have been noticed by anyone — I’ve searched in Google (web and blogs) for various word combinations in the quote combined with “Dan Clancy,” and have found nothing at all except the original NYT article.

The long tail idea, which was first described by Chris Anderson in 2004, is that when a very large number of users are given a very large number of items to choose from, especially in an online environment with virtually unlimited “shelf space” and easy access, a very wide variety of items will be chosen. Anderson proposed the idea especially to describe commercial sites such as Amazon and Netflix, but it has also been seen as a good fit for libraries, and especially online library/book sources, such as Google Books.

So — Yes — There has been discussion of Google Books and the long tail. For the most part, though, this has been on a conceptual, non-numeric level. The statement by Clancy is valuable because it’s the first time there have been actual numbers provided by Google sources to back up the conceptual ideas. And, indeed, striking numbers they are — every month, half of the out-of-copyright books — i.e. old books — in Google Books are getting significant use. The long tail will certainly be even longer when newer books are made available after the October 2008 settlement goes into effect.

The best numeric data that I’ve found on Google Books and the long tail is given in an article by Tim O’Reilly in 2006, which compares sales of O’Reilly Media book titles, as reported by Nielsen Bookscan, with page views from Google Books. As the graph (at left) from that article shows, the Google Books page views (in red) have a very long, almost flat, tail, in contrast with the relatively short tail for actual sales of book titles (in blue). Incidentally, the graph shown here has a bad link in the O’Reilly article, so all that displays is the file name; I did some digging on the O’Reilly site to find it here. (Feb 11: The bad links for this image and others in the O’Reilly article have been fixed, after I noted them in a comment.)

The closest thing I have found to other long tail numeric data relating to online books is reported in a 2006 article by Jason Epstein:

According to Mark Sandler of the University of Michigan Library, in an essay in Libraries and Google, an experiment by the library involving the digitization of 10,000 “low use” monographs offered on the Web produced “between 500,000 and one million hits per month.”

I suspect the realization of the “power of the long tail” shown in this experiment contributed to the University of Michigan opting to be one of the original library partners in the Google Books project.

Venn diagrams have long been used in teaching online searching, to help users visualize how Boolean searching works. A new application of Venn diagrams, Twitter Venn, gives on-the-fly Venn diagrams of Twitter postings. Because Twitter does such a good job of taking the pulse of the Web, Twitter Venn is an excellent way to visualize connections of breaking news topics. The first Twitter Venn example below shows that there are 5047 postings per day with the word heart, 1314 postings with the word risk and 24 postings that contain both words, represented by the gray “intersection” in the middle containing two purple dots. The fun part — To the lower left other words are listed that occur in the postings on heart and risk — The top word, in large print, is decaf, indicating that there’s current buzz that relates decaf to heart disease risk. Sure enough, a Google search for heart decaf shows that there have indeed been recent reports that decaf coffee may increase the risk of heart disease, at least slightly.
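For readers who want to see the counting behind a diagram like this, here is a minimal Python sketch. It is only an illustration, not how Twitter Venn itself is implemented (I don't know its internals). The function names and the four sample postings are made up for the example. It counts postings containing each word, the postings containing both (the intersection), and the other words that turn up in the intersection.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a posting and strip simple punctuation from each word."""
    return [w.strip(".,!?;:\"'()").lower() for w in text.split()]


def venn_counts(posts, term_a, term_b):
    """Return (count_a, count_b, count_both, co-occurring words in the intersection)."""
    word_sets = [set(tokenize(p)) for p in posts]
    has_a = {i for i, words in enumerate(word_sets) if term_a in words}
    has_b = {i for i, words in enumerate(word_sets) if term_b in words}
    both = has_a & has_b  # Boolean AND: the gray middle of the Venn diagram

    common = Counter()
    for i in both:
        # Count every other word once per posting in the intersection
        common.update(word_sets[i] - {term_a, term_b, ""})
    return len(has_a), len(has_b), len(both), common


# Hypothetical sample postings, invented for illustration only
posts = [
    "Decaf coffee may raise heart risk",
    "My heart is full today",
    "Risk of rain tomorrow",
    "New study: decaf and heart disease risk",
]

n_heart, n_risk, n_both, common = venn_counts(posts, "heart", "risk")
print(n_heart, n_risk, n_both)
print(common.most_common(1))
```

With a real feed the counts would of course be much larger, and a stop-word list would be needed to keep words like "and" out of the top of the listing.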

[Click images below for live results in Twitter Venn. The numbers will vary slightly, since they’re generated live. On the Twitter Venn results, click the middle intersection area to show common terms in lower left.]

The second example, below, shows clearly that the main source of alarm about salmonella poisoning is peanut butter, since this is the predominant word that occurs with the search words, as shown in the listing at the lower left.

It’s occurred to me for a long time that Venn diagrams are a good way to visualize the relationships among subjects in online searching. But I suspect the sort of on-the-fly, realtime generation of Venn diagrams done by Twitter Venn would be too slow for databases with more text per record. So it falls to Twitter, with its tiny 140-character records, to show how useful Venn diagrams can be for visualization.


A month ago, Google announced that it has begun putting magazines in Google Books. In one way, this is a new direction for Google. But looked at broadly, it’s really not so new: Google has been putting old journals in Google Books for a long time. The basic difference between the newly announced “Google magazines” and Google’s “old journals,” of course, is the date of publication. The titles that are being treated as “magazines” have generally been published in the last 50 years or so. But some of these also include much older issues, in some cases, such as Popular Science, going back to the 1800s. A bit of digging (searching for words in an article) finds a nice case of a title that’s in Google Books both ways, as a magazine and as an old journal. Snippets from the “About this book” and “About this magazine” pages below show the differences.

Old journals – The journal / book format

Old journals are given the same treatment as books, with each volume of the journal being considered a book. The record here is for volume 26 of Popular Science Monthly (the old name of Popular Science).

Old journals are scanned into Google Books by libraries, in the case shown here, Harvard University. As with other books scanned by libraries, the About page has a selection of thumbnail images, giving an idea of what sort of graphics are in the book. Also note the button to Download the entire volume in PDF format.

The Magazine format

In contrast to journal/book format, in which the volume (made up of several issues) is treated as the basic record unit, in magazines, the basic record unit is the issue. This record is for the Feb 1885 issue of Popular Science.

Comparing this with the journal/book format, this lacks thumbnail preview images and it also does not support downloading a PDF of the issue. It does, however, have the great advantage over the journal/book format, that all issues are connected in the Browse all issues menu.

Google Books is full of surprises! In surveying medical journals in Google Books, I discovered that volumes of British Medical Journal circa 1880 scanned at Harvard have extensive sections devoted to advertisements. Most libraries, when they bind issues of journals and magazines into bound volumes, very reasonably remove pages that have only advertisements, to save space on the shelf. So it’s good to have a Harvard, that can afford to save the rare gems of 19th century ads, so that they can be put online for the world to enjoy!

As fanciful as the ad shown here is (“Ask for Cadbury’s Pure Cocoa, makers to the Queen”), there is a wealth of more prosaic ads in the same volume, awaiting future medical historians, on subjects such as malted infant food, lactopeptine for indigestion, bronchitis & croup kettles, and state-of-the-art wheelchairs.

I found several other journals in Google Books from the same late-19th-century era, that also have extensive ads. But British Medical Journal is the only one I found that has entire, separate volumes of advertising. Apparently there must have been separate supplements that were only ads (this was in the dawn of the age of mass advertising, and people, even including physicians, were actually GLAD to read ads!)

So, how searchable are the ads in Google Books? I tried a few examples and had mixed results — Searching for this phrase that’s in the Cadbury’s ad — “why does my doctor recommend Cadbury’s Cocoa” — was successful. But searching for a phrase in the ad that follows the Cadbury’s ad, for Anodyne Amyl Colloid — “in cases of neuralgia, sciatica, lumbago” — found the phrase in other ads for the same product, but not the one occurring in this instance.

Here are volumes of British Medical Journal that I found that are exclusively advertising (All of these were scanned at Harvard):

Until now, books with pictures, especially color pictures, have been a relatively small part of Google Books. But the addition of highly visual, popular magazines changes this — The titles added so far are filled with pictures!

On one level, more pictures in Google Books is gratifying, and a theme of this blog! But the navigation/search capabilities for finding these pictures are limited. The best way seems to be to use Advanced Search and limit the search to Magazines. But the results listing for this is text-only. It would be much easier to search for pictures with the sort of thumbnail search results interface that’s used in Google Image Search.

In light of the launching of picture-laden magazines as part of Google Books, it’s interesting to note that only last month, Google launched Life magazine pictures, as part of Google Image Search. Google is facing the same choice that librarians have been considering for the last while — Should books (or magazines) that have many pictures be considered mainly as books that happen to have pictures, or as pictures that happen to be in books?

The pictures & links below are from magazines that are in Google Books. I’ve chosen them because I know from work on Hardin MD that they are on highly-searched subjects, which would likely appear in Google Image Search if they were crawlable.


Google Books - Magazines

When I started this list in Dec, 2008, Google did not provide a list of their own — Thankfully, they provided one in Nov, 2009 (their announcement is Here, their list is Here). Assuming they keep up their list, I will probably not add to the list provided here. Comparing their list with mine now (11/12/09), they have everything on my list except one title (Log home living). Good start, Google, Hope you keep it up 🙂

Please note: the dates given for titles are not necessarily inclusive! Some are quite spotty.

Eric Rumsey is at: eric-rumseytemp AttSign uiowa dott edu and on Twitter @ericrumseytemp

The CDC data above shows that the occurrence of flu generally peaks in February; the data below from Google Insights for flu symptoms, not surprisingly, has a similar peak in February.

Google Insights, which uses the same data as Google Flu Trends, shows quite a different pattern for flu shot (below), which peaks in October or November (Flu vaccine peaks similarly).

How about searching for just the word flu (below)? Interestingly, this seems to combine the peaks in the two graphics above, for flu symptoms and flu shot. The exaggerated peaks in 2004 and 2005 are likely caused by people’s concerns about vaccine shortage (more on this in accompanying posting, Google Flu Trends: Kudos & Complications).

Looking at the evidence of these graphics from Google Insights, it seems likely that the Google Flu Trends team is excluding search terms relating to flu vaccine, and concentrating on terms that relate to symptoms. See confirmation of this in accompanying posting, Google Flu Trends: The Iowa Connection.

The data shown here seems to indicate that for a seasonal disease for which there is a vaccine, the search patterns for “disease: symptoms,” “disease: vaccine/shot,” and the disease term itself differ, correlated with the time in the year when the disease occurs and when the vaccine is given. This idea is confirmed by Google Insights data for pneumonia, another respiratory disease that has a vaccine. The patterns are similar to flu, with high peaks for pneumonia shot in October, and somewhat lower peaks for pneumonia and pneumonia symptoms in February.
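The comparison of peak months described above could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the function name `peak_month` is made up, and the numbers are invented placeholders, not figures from Google Insights. The idea is simply to average the search volume for each calendar month across years and report the month with the highest average.

```python
from collections import defaultdict


def peak_month(observations):
    """Given (month_number, volume) pairs, return the month with the highest mean volume."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, volume in observations:
        totals[month] += volume
        counts[month] += 1
    return max(totals, key=lambda m: totals[m] / counts[m])


# Invented placeholder values (NOT real Google Insights data), as (month, volume) pairs
symptoms_searches = [(10, 20), (11, 30), (12, 55), (1, 70), (2, 90), (3, 60), (10, 25), (2, 95)]
shot_searches = [(9, 40), (10, 95), (11, 80), (12, 35), (1, 20), (2, 15), (10, 85)]

print("symptoms peak month:", peak_month(symptoms_searches))
print("shot peak month:", peak_month(shot_searches))
```

If the peak months for the "symptoms" term and the "shot" term come out several months apart, as they do in the graphics for flu and pneumonia, that is a sign the two kinds of searches should not be lumped together when tracking the disease.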

Bronchitis — A disease with no vaccine

Bronchitis is a respiratory condition that does not have a vaccine. As the graphics from Google Insights below show, the pattern is different from flu and pneumonia: the peaks for the disease itself (bronchitis, below) and for the disease with symptoms are much the same, making it less complicated to track search patterns. Apparently the people who search for the disease are in fact people who have the disease.

Bronchitis symptoms, from Google Insights.

** This is one of a group of three articles on Google Flu Trends:

Together, these articles suggest that, although it’s difficult to know with assurance because Google has not revealed the search terms that they use for GFT, it seems likely that they’ve done a good job in working around the complications of flu-related search patterns.

Google Flu Trends is an elegant application of search data to medicine. Working on Hardin MD, I’ve long noticed seasonal variations in certain diseases: colds, flu, & respiratory illnesses peak in winter, and insect bites & sun exposure conditions peak in summer. I pay a lot of attention to the search terms that people use to get to Hardin MD pages, so Google’s mining of this data to serve the health of the community is especially interesting.

The idea of Google Flu Trends is shown nicely in the snapshot above from the animation at the Google Flu Trends site — Google finds that there is an excellent correlation between flu-related terms that people search and the occurrence of flu, as measured by CDC data. And, as shown in the animation, the Google search data in near real-time precedes CDC data, which takes 1-2 weeks to be reported and compiled.
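To make the idea of "search data precedes reported data" concrete, here is a small Python sketch. It is not Google's method, which they have not published in full detail. It is just a plain Pearson correlation tried at several lags, with function names and sample numbers that I have made up. The sample "reported" series is simply the "search" series delayed by two weeks.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def best_lag(search, reported, max_lag):
    """Return (lag, correlation) for the lag at which `reported` best tracks earlier `search`."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair search activity in week t with reported cases in week t + lag
        earlier_search = search[: len(search) - lag]
        later_reported = reported[lag:]
        scores[lag] = pearson(earlier_search, later_reported)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented weekly values for illustration only (NOT real Google or CDC data)
search_volume = [1, 2, 4, 8, 5, 3, 2, 1, 1, 1]
reported_cases = [1, 1, 1, 2, 4, 8, 5, 3, 2, 1]  # same shape, two weeks later

lag, score = best_lag(search_volume, reported_cases, max_lag=4)
print("best lag in weeks:", lag, "correlation:", round(score, 3))
```

A lag of one or two weeks with a high correlation is the kind of result the animation suggests, though real data would be noisier than this toy example.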


The idea of using search data to track the progression of disease outbreaks certainly is elegant, and Google deserves congratulations for it. In choosing flu as the first example, however, Google has chosen a disease with complicating factors.

The nature of these potentially complicating factors is suggested in the graphic above from the Google Flu Trends site — A big question here is — What caused the spike in flu occurrence and flu search activity in Dec 2003 – Jan 2004?

Because Google has chosen not to reveal the exact search terms that they are using to determine the volume of flu-related searching (see supplementary material accompanying Google’s paper in Nature), it’s difficult to know the cause of the 03-04 spike with certainty. But looking back at the chronology of that time period sheds light. There was a major shortage of the flu vaccine in late 2003, which is certainly related to the spike shown in the graphic. The CDC spike (yellow) shows that many people had flu, presumably because they were unable to get the vaccine. The Google spike (blue) is even higher, which may indicate that a significant number of people were searching for flu information not because they were infected, but because they were looking for information on how to get the vaccine. The accompanying article (Flu Symptoms vs Flu Shot) shows that there is in fact a clear indication of heightened search activity for flu vaccine-related terms during the autumn pre-flu season.

The other complicating factor in looking at flu-related search activity is bird flu, and this seems to have been addressed well by Google — The large bird flu outbreak in Asia, and corresponding bird flu scare throughout the world, occurred in late 2004 and early 2005. Since there is no major spike shown in the graphs for this time, Google apparently has excluded bird flu/avian flu search terms from the aggregate group of terms it’s using.
