In a recent NY Times article that I blogged on, Dan Clancy, the engineering director for Google book search, is cited as saying “every month users view at least 10 pages of more than half of the one million out-of-copyright books that Google has scanned into its servers.” Remarkably, this classic long tail description of Google Books seems not to have been noticed by anyone — I’ve searched in Google (web and blogs) for various word combinations in the quote combined with “Dan Clancy,” and have found nothing at all except the original NYT article.
The long tail idea, first described by Chris Anderson in 2004, is that when a very large number of users are given a very large number of items to choose from, especially in an online environment with virtually unlimited "shelf space" and easy access, a surprisingly wide variety of items will be chosen. Anderson proposed the idea mainly to describe commercial sites such as Amazon and Netflix, but it has also been seen as a good fit for libraries, and especially for online library/book sources such as Google Books.
So, yes, there has been discussion of Google Books and the long tail. For the most part, though, it has been on a conceptual, non-numeric level. The statement by Clancy is valuable because it's the first time actual numbers from Google sources have been provided to back up the conceptual ideas. And striking numbers they are: every month, half of the out-of-copyright books (i.e., old books) in Google Books get significant use. The long tail will presumably grow even longer when newer books are made available after the October 2008 settlement goes into effect.
The best numeric data that I've found on Google Books and the long tail is in a 2006 article by Tim O'Reilly, which compares sales of O'Reilly Media book titles, as reported by Nielsen BookScan, with page views from Google Books. As the graph (at left) from that article shows, the Google Books page views (in red) have a very long, almost flat tail, in contrast with the relatively short tail for actual sales of book titles (in blue). Incidentally, the graph shown here has a bad link in the O'Reilly article, so all that displays is the file name; I did some digging on the O'Reilly site to find it here. (Feb 11: The bad links for this image and others in the O'Reilly article are fixed, after I noted them in a comment.)
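The head-versus-tail contrast in O'Reilly's graph can be sketched numerically. A minimal illustration, assuming Zipf-style (power-law) popularity curves with invented exponents standing in for the real sales and page-view data: a steeper exponent concentrates activity in the top-ranked titles, while a flatter one pushes most activity out into the tail.

```python
# Hypothetical illustration of the long tail: compare a steep
# "sales-like" popularity curve with a flatter "page-view-like" one.
# The exponents are invented for illustration, not fit to real data.

def zipf_counts(n_titles, exponent, scale=1_000_000):
    # Rank-frequency counts following a Zipf-style power law:
    # count(rank) = scale / rank**exponent.
    return [scale / (rank ** exponent) for rank in range(1, n_titles + 1)]

def tail_share(counts, head=100):
    # Fraction of total activity contributed by titles ranked
    # below the top `head` bestsellers.
    total = sum(counts)
    return sum(counts[head:]) / total

sales = zipf_counts(10_000, exponent=1.5)  # steep head, short tail
views = zipf_counts(10_000, exponent=0.8)  # flatter, much longer tail

print(f"tail share (sales-like curve):     {tail_share(sales):.2f}")
print(f"tail share (page-view-like curve): {tail_share(views):.2f}")
```

With these made-up exponents, only a small fraction of sales-like activity falls outside the top 100 titles, while well over half of page-view-like activity does, which is the shape of the effect O'Reilly's graph (and Clancy's figure) describes.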
The closest thing I have found to other long tail numeric data relating to online books is reported in a 2006 article by Jason Epstein:
According to Mark Sandler of the University of Michigan Library, in an essay in Libraries and Google, an experiment by the library involving the digitization of 10,000 “low use” monographs offered on the Web produced “between 500,000 and one million hits per month.”
I suspect that the realization of the "power of the long tail" shown in this experiment (on the order of 50 to 100 hits per "low use" title per month) contributed to the University of Michigan's decision to become one of the original library partners in the Google Books project.