When I first read the passage below from Salman Rushdie’s Haroun and the Sea of Stories three years ago, it struck me as a remarkable word picture of my experience of the Web. So of course I went right to Google to see if anyone else had made this connection. Surprisingly, my searching since then has turned up little, so I’ve thought about writing it up, but it never got done. In the last week, I’ve gotten nudges (discussed below) that tell me this is the time. Here’s Rushdie:

Haroun looked into the water and saw that it was made up of a thousand thousand thousand and one different currents, each one a different color, weaving in and out of one another like a liquid tapestry of breathtaking complexity; and [the Water Genie] explained that these were the Streams of Story, that each colored strand represented and contained a single tale. Different parts of the Ocean contained different sorts of stories, and as all the stories that had ever been told and many that were still in the process of being invented could be found here, the Ocean of the Streams of Story was in fact the biggest library in the universe. And because the stories were held here in fluid form, they retained the ability to change, to become new versions of themselves, to join up with other stories and so become yet other stories; so that unlike a library of books, the Ocean of the Streams of Story was much more than a storeroom of yarns. It was not dead but alive.

Pow! Isn’t this a strikingly clear metaphorical description of the Web Stream that we all swim in every day? My first idea for a title for this article was “Did Salman Rushdie predict the Web?” I decided that was a bit too presumptuous, but not by much: the passage does indeed verge on prediction. It was written in 1990, interestingly the same year that Tim Berners-Lee “invented” the World Wide Web. It’s tempting to imagine the left-brained engineer (Berners-Lee) and the right-brained artist-seer (Rushdie) both envisioning the future Web in their own ways: Berners-Lee in outlining his Web ideas at CERN, and Rushdie in writing Haroun.

How have this passage and its Webishness gone unnoticed for so many years? Haroun is a story on many levels. Rushdie wrote it for his young son, and it’s often put in the category of “children’s literature.” I suspect this is the main reason it hasn’t been read often enough by grown-up Web users for someone to have seen Rushdie’s Stream of the Web metaphor. (Note that Haroun is on a prominent list of the 100 Greatest Novels of All Time.)

How about the library connections in the passage? As a librarian, it certainly occurs to me that it could be viewed as being especially about libraries, maybe even seen as a threat to the traditional print library (“storeroom of yarns”). But I think, to the contrary, that Rushdie’s passage does the library world a great service, ushering us into the “liquid tapestry” of the digital Ocean, in which the Stream of “the library” will be able to “weave in and out” with the “thousand thousand thousand and one different currents” outside of the traditional library world. Recent discussions of Google Book Search and orphan books show that the world is eagerly anticipating the stories in libraries being put into “fluid form.” And, in fact, library leader Peter Brantley, in commentary on GBS written in January 2009 about the coming age of digital books, uses language reminiscent of Rushdie: “We stride into a world where books are narratives in long winding rivers … and seas from which all rivers and rain coalesce.”

Speaking of the library world: as mentioned, there has been a notable lack of anyone else seeing a connection between the Rushdie passage and the Web. The closest I’ve seen is a paper co-authored by an engineer (JH Lienhard) and two librarians (JE Myers, TC Wilson), written in 1992: Surfing the Sea of Stories: Riding the Information Revolution (Mechanical Engineering 1992 Oct; 114(10): 60-65). It does an excellent job of connecting the Rushdie passage to the coming digital revolution as it was seen in 1992, and it contains the perceptively done graphic in this article (above). But of course the full-blown Web was not born until 1995, so this view is limited. (The paper is summarized in the transcript of a radio program about it.)

Nudges for writing about this in the last week: First, in his blog article Is The Stream What Comes After the Web?, Nova Spivack suggests that the metaphor of The Stream may soon replace The Web. The article doesn’t mention Rushdie, but it has elicited much discussion on Twitter, and someone would surely have made the connection soon. Spivack does mention Twitter, saying that it and other microblogging systems are “the best example of the Stream,” which is related to the other nudge I’ve gotten, a blog article by Joff Redfern, Twitter is becoming the Ocean of the Stream of Stories. This is short, consisting mainly of the Rushdie quote above, but with its title it would likely be connected to Spivack’s Stream and Rushdie’s Streams of Stories sooner or later. Taken together, I think the articles by Spivack and Redfern indicate that Twitter is bringing to people’s minds the “stream-like” nature of the Web: the way big streams (e.g. swine flu two weeks ago) weave in and out with the day-to-day small streams of people’s lives on the Twitter ocean, with the stories constantly rewriting themselves.

Eric Rumsey is at @ericrumsey

Last week the National Academy of Sciences announced that “more than 9,000 Academies reports” are now available through Google Book Search, upon completion of “the first phase of a partnership with Google to digitize the library’s collection of reports from 1863 to 1997.” This sounds like good news, but it’s hard to evaluate the exact nature of the NAS documents that have become available, since neither the NAS press release nor Google gives any indication of how to search the newly available documents in Google Book Search.

The NAS press release uses the word “reports” to describe the newly available documents. In its long history, the NAS has had several named series (below), and one of those is in fact “Report of the NAS.” But the example documents in the NAS press release are not part of that (or any other) series, so apparently the use of the word “report” in the press release is meant more as a generic description of the documents.

As far as I can tell, the only way to find NAS documents in Google Book Search is to search for “national academy of sciences.” This retrieves a mix of monographic titles and series titles. Some have apparently been digitized at NAS, and others have been digitized at participating libraries. Below, I’m listing the main NAS series I find that are in full-view, freely available mode.

In a recent posting at O’Reilly Radar, Linda Stone discusses recent comments by Brewster Kahle and Robert Darnton on the Google Book Search Settlement. This is especially valuable for its discussion of the orphan books problem, raised by Kahle (as Stone reports) and in comments by Thomas Lord and Tim O’Reilly. I’m excerpting that exchange here. About Kahle’s posting, Stone says that he “focused on the plight of ‘orphan works’ – that vast number of books that are still under copyright but whose authors can no longer be found.”

Thomas Lord’s first comment — He says he’s thought much about the settlement:

My conclusion [around the time of the settlement] was that the big libraries, like Harvard, had made a bad deal — they didn’t understand the tech well enough and Google basically not only steamrollered them but implicated them in the potentially massive infringement case.

Basically, Google should have, indeed, paid for scanning and building the databases – but the ownership of those databases should have remained entirely with the libraries … The Writer’s Guild caved pretty easy and pretty early but legal pressure can still be brought to bear on Google. They can give up their private databases back to the libraries that properly should own them in the first place.

Tim O’Reilly’s comment on the article, and especially on Lord’s comment:

I agree with Tom’s analysis. (See my old post: Book search should work like web search [2006]). And I do agree with Brewster’s concern that this settlement will derail the kind of reform that would have solved this problem far more effectively. That’s still my preferred solution.

That being said, the tone of both Brewster’s comments and Darnton’s, implies that Google was up to some kind of skulduggery here. That’s unfair. Should they have stood up on principle to the Author’s Guild and the AAP? Absolutely, yes. But it’s the AG and the AAP who should be singled out for censure. … From conversations with people at Google, I believe that they do in fact continue to believe in real solutions to the orphaned works problem, and that demonizing them doesn’t do any of us any good.

The fact is, that Google made a massive investment to digitize these books in the first place. No one else was making the effort … In short, we’re comparing a flawed real world outcome with an “if wishes were horses” outcome that wasn’t in the cards. … Barring change to copyright law (and yes, we need that), Google has at least created digital copies of millions of books that were not otherwise available at all. Make those useful enough and valuable enough, and I guarantee there will be pressure to change the law so that others can profit too. …

Google Book Search was an important step forward in building an ebook ecosystem. I wish this settlement hadn’t happened, and that Google had held out for the win on the idea that search is fair use. And I wish that Google had taken the road that Tom outlined. … But they put hundreds of millions of dollars into a project that no one else wanted to touch. And frankly, I think we’re better off, even with this flawed settlement, than if Google had never done this in the first place.

Finally, I’ll point out that there is more competition in ebooks today than at any time in the past. Any claim that we’re on the verge of a huge Google monopoly, such as Darnton claims, is so far from the truth as to be laughable. Google is one of many contenders in an exploding marketplace.

Thomas Lord’s reply to O’Reilly:

… In the spirit of understanding things: you praise Google, I don’t. We’re better off those books having been scanned (I strongly agree) – I don’t like the way they bull-in-china-shop worked this. I think there’s a deep and lasting threat here that they need to fix if they want to “not be evil.”

The slides and data from Jon Orwant’s presentation on Google Book Search at TOC, which were not available when I wrote previously, have now been put up on the O’Reilly site. [these have been removed, see comment below] The presentation is made up of 59 PDF slides, covering a range of recent developments with Google Books, including the recent release of GBS mobile and a discussion of the Oct 2008 Publisher settlement. The part I’m most interested in is the data on GBS usage that had been mentioned by Orwant in various venues before, but with few details. The details in the TOC presentation are mostly in three “case studies” of publishers that participate in the GBS Partner Plan: McGraw-Hill, Oxford University Press, and Springer. I’ve chosen one slide for each of these publishers that shows various long-tail effects for usage of their books in GBS, and one slide that has data for a more extensive grouping from GBS.

The McGraw-Hill case study is presented in slides 21-23. Below is slide 24. Note that this is a small sample of only the top 30 titles.

Oxford University Press – Slides 26-31. Below is slide 27. Note the long tail of visits for pre-1990 books.

Springer – Slides 32-36. Below is slide 35, showing clicks for Buy this Book. Note again the very long tail of clicks for pre-1995 books.

Slide 37 below shows “Share of books with more than 10 pages viewed,” apparently for all books in GBS. The coloring of the data lines looks ambiguous to me. The lowest line is undoubtedly for Snippet View books. It looks like the top line is for Limited Preview books, which are presumably higher than Full View books (apparently the middle line) because Limited Preview books are more current.


Please comment here or Twitter @ericrumsey

Jon Orwant, from Google Book Search, made a presentation at the O’Reilly Tools of Change (TOC) for Publishing Conference in New York last week (which I did not attend). Apparently Orwant presented some numeric data about the use of Google Books, but the data has yet to spread to the world (see my comment on Peter Brantley’s blog about this). I’ve been searching in the week since TOC to see what discussion there is of Orwant’s talk, and have found little. So I’m excerpting the three pieces that I have found. Only the first has any numeric data at all.

First, a piece by Jackie Fry, on the BookNet Canada publishers’ Blog. This is notable, and I’m putting it first, because it’s the only report I’ve found that has any numeric data at all from Orwant’s talk:

Conversion rates from Google Book Search results have been great for their partner publishers, mostly in the Textbook, Reference and STM channels, particularly in the shallow backlist (2003-2005 pubdates) with the highest Buy the Book clickthrus on 2004 titles. For some publishers, conversion to buy is as high as 89% for the titles they have made available.

30% of viewers looked at 10 or more pages when viewing the content of the book to make a buy decision.

The future is analytics! Google is thinking about what data they can pull out of their logs and provide anonymous aggregate data around consumer behaviour like what books were purchased that were like this one, search terms used most often for a category, most effective discounts, most effective referral sites etc.

More research [is needed] – Saw some good presentations with quantifiable research included – Brian O’Leary from Magellan, Joe Orwent (sic) from Google, and Neelan Choksi from Lexcycle were some of the few presenters who were able to quantify in any way what is going on in the marketplace. We need more  …

James Long’s report, on thedigitalist.net (Pan Macmillan Publishing):

Jon Orwant, from Google Book Search, stated at TOC that ‘the ultimate goal of Google Book Search is to convert images to “original intent” XML’. He explained the post-processing Google runs to continuously improve the quality of the scanned books, and to convert images to structured content. Retro-injecting structure accurately is no mean feat but when it’s done, Google will be able to transform the books into a variety of formats. The content becomes mutable and transportable, in a sense it isn’t yet, even though it is scanned, online and searchable. Orwant also presented three case studies – McGraw Hill, OUP, Springer – that demonstrated the benefits publishers can gain from having their books in GBS.

Highlighting the theme of discovery (to my mind), Tim O’Reilly interjected, at the end of these case studies, and made the point that O’Reilly used to own the top links to their own books in Google search results, but have now lost those links to GBS. Orwant, somewhat simplistically, responded that O’Reilly needed to improve their website to regain the top ranked link per title, as this spot was determined by Google’s search algorithms. This was not a convincing response, and dodged the issue, which I understood to be that the scale and in-house-ness of GBS could seriously inhibit the ability of the publisher to represent their own products online at the most common point of entry by the consumer, Google search results. There are many compelling reasons for publishers to own the top search result link, the most obvious being: offer unique additional content around the title, start a conversation with the reader, control the brand.

Edward Champion’s comments on his blog:

Thanks to a concept called blending, Google Book Search options remain in the top search results. An effort to direct traffic GBS’s way. …

There are 1.5 million free books, all public domain titles, available on Google. But if you want to access them, well, you’ll have to go to Google. Or you’ll have to have Google generate results at your site. Because the Google team are specialists in latency. They can do things with milliseconds that you couldn’t work out in your dreams.

As an experiment, Google recently unleashed Google Books Mobile, which means that you can now search Google Book Search from your smartphone … Orwant was careful to point out that Google is not in the handset manufacturing or carrier business. But he anticipated, just as many of the seer-like speakers at Tools of Change did based on sketchy inside information, a “rapid evolution.”

Tim O’Reilly tried to badger Orwant too. You see, O’Reilly used to have good webpage placement for many of his titles. But those days are gone, replaced by Google Book Search results above the O’Reilly pages. And that hardly seems fair …

There’s some comfort in knowing that 99% of the books at GBS have been viewed at least once. Even the sleep-inducing textbooks. Which is really quite something. Which brings us to the future, which is based on the past …

That snippet view you see with some titles? Orwant’s official position, pressed by Cory Doctorow, is that it’s fair use. But once the October 2008 settlement in Authors Guild v. Google is approved by the court, you’re going to see that snippet view jump to 20% of the book.


When a link is clicked to a specific page in GBS Mobile, the page that always opens is the entry page for the book. There doesn’t seem to be a way to link successfully to specific pages. I’ve tried this in several examples, and have had the same experience in all of them. An example below illustrates.

In this example, I’m trying to link to a group of pages starting with page 31. But when the link below is clicked, it goes to the entry page, which is page 21, with the same URL as below except that the page number is 21 instead of 31.

http://books.google.com/googlebooks/mobile/#Read?id=yb8UAAAAYAAJ&page_num=31
[This link and the link in the image below are the same]

After this link is clicked and it goes to page 21, it does then work to change the number from 21 to 31 in the address bar, and it goes to page 31. The right arrow (>) next to “Pages 21-30” also works.

When the link is clicked to go to page 31, and it ends up on page 21, clicking the Back button goes to page 31. And the address bar initially reads 31, but then changes to 21. So when the link is initially clicked, it does “pass through” page 31, but apparently there’s some signal on page 31 that tells it to redirect to page 21.
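A technical aside that may bear on this: everything after the “#” in the URL above is a URL fragment, and browsers never send fragments to the server, so the page_num value is presumably being handled by client-side JavaScript in the GBS Mobile app rather than by an ordinary HTTP redirect. As a minimal sketch of how such a link can be manipulated, here is how the fragment’s page_num might be rewritten programmatically (set_page_num is my own hypothetical helper, not anything from Google):

```python
def set_page_num(url: str, page: int) -> str:
    """Rewrite the page_num value in a GBS Mobile URL fragment.

    Note: the fragment (everything after '#') is interpreted purely by
    the browser/JavaScript; it is never sent to books.google.com.
    """
    base, _, fragment = url.partition("#")          # split off the fragment
    view, _, query = fragment.partition("?")        # e.g. "Read" + params
    params = [f"page_num={page}" if p.startswith("page_num=") else p
              for p in query.split("&")]
    return f"{base}#{view}?{'&'.join(params)}"

url = ("http://books.google.com/googlebooks/mobile/"
       "#Read?id=yb8UAAAAYAAJ&page_num=31")
print(set_page_num(url, 21))
```

Since the fragment never reaches the server, whatever logic sends the reader from page 31 back to page 21 must live in the page’s own scripts.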

Does anyone see what’s happening here? Any help would be much appreciated! Please post suggestions in comments, or in Twitter.


There’s been a lot of buzz about the announcement last week of mobile access to Google Book Search public-domain books. I’ve been looking hard for nitty-gritty details of how it works, though, and haven’t found much. The best is in comments by bowerbird on an announcement article on toc.oreilly.com. It’s easy for comments to get lost, so I’m excerpting most of bowerbird’s words here:

this offering is very good. extremely good. the interface is quite nice…

it was great to see google is serving digital text, rather than scans, since text is a lot more nimble. however, a tap on a paragraph brings up the scan of that paragraph, which is nice. and another tap restores the text. so if you want to verify the o.c.r., it’s simple to do. as i said above, this is nicely done.

curiously, in the one book i checked (roughing it), the text was extremely accurate as well, which is a pleasant discovery. i found only one o.c.r. error — “firty” for “fifty”, due to a blotch on the page …

this quality text is _not_ typical of google’s raw o.c.r., so they’ve evidently run some clean-up routines on it. i’m curious to see if they share this cleaned-up text with their library partners, or keep it to themselves… (no, the libraries weren’t smart enough to ask for it, as far as i know, let alone write it into the contracts.)

I’ve bolded what I take to be the most interesting point here, that Google has done an extra-special job of OCR’ing text for GBS mobile. As bowerbird notes, hopefully Google will share more about this process, sooner or later.

In a recent NY Times article that I blogged on, Dan Clancy, the engineering director for Google book search, is cited as saying “every month users view at least 10 pages of more than half of the one million out-of-copyright books that Google has scanned into its servers.” Remarkably, this classic long tail description of Google Books seems not to have been noticed by anyone — I’ve searched in Google (web and blogs) for various word combinations in the quote combined with “Dan Clancy,” and have found nothing at all except the original NYT article.

The long tail idea, which was first described by Chris Anderson in 2004, is that when a very large number of users are given a very large number of items to choose from, especially in an online environment with virtually unlimited “shelf space” and easy access, a very wide variety of items will be chosen. Anderson proposed the idea especially to describe commercial sites such as Amazon and Netflix, but it has also been seen as a good fit for libraries, and especially online library/book sources, such as Google Books.
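To illustrate the idea with made-up numbers (a Zipf-like popularity curve, not Google’s actual data): even when each tail item gets only a trickle of use, the tail in aggregate can rival the head.

```python
# Illustrative only: a Zipf-like popularity curve over a large catalog.
N = 1_000_000                                        # catalog size
weights = [1.0 / rank for rank in range(1, N + 1)]   # Zipf, exponent s = 1
total = sum(weights)

head_share = sum(weights[:1000]) / total  # the 1,000 most popular items
tail_share = 1.0 - head_share             # the other 999,000 items

print(f"top 1,000 items:   {head_share:.0%} of all use")
print(f"remaining 999,000: {tail_share:.0%} of all use")
```

Under these illustrative assumptions, the thousand most popular items account for only about half of all use; the other 999,000 items together account for the rest, which is the shape suggested by Clancy’s “half of the one million books every month” figure.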

So — Yes — There has been discussion of Google Books and the long tail. For the most part, though, this has been on a conceptual, non-numeric level. The statement by Clancy is valuable because it’s the first time there have been actual numbers provided by Google sources to back up the conceptual ideas. And, indeed, striking numbers they are — every month, half of the out-of-copyright books — i.e. old books — in Google Books are getting significant use. The long tail will certainly be even longer when newer books are made available after the October 2008 settlement goes into effect.

The best numeric data that I’ve found on Google Books and the long tail is given in an article by Tim O’Reilly in 2006, which compares sales of O’Reilly Media book titles, as reported by Nielsen BookScan, with page views from Google Books. As the graph (at left) from that article shows, the Google Books page views (in red) have a very long, almost flat tail, in contrast with the relatively short tail for actual sales of book titles (in blue). Incidentally, the graph shown here had a bad link in the O’Reilly article, so all that displayed was the file name; I did some digging on the O’Reilly site to find it here. (Feb 11: The bad links for this image and others in the O’Reilly article have been fixed, after I noted them in a comment.)

The closest thing I have found to other long tail numeric data relating to online books is reported in a 2006 article by Jason Epstein:

According to Mark Sandler of the University of Michigan Library, in an essay in Libraries and Google, an experiment by the library involving the digitization of 10,000 “low use” monographs offered on the Web produced “between 500,000 and one million hits per month.”

I suspect the realization of the “power of the long tail” shown in this experiment contributed to the University of Michigan opting to be one of the original library partners in the Google Books project.

Clancy cites high usage of out-of-copyright books

This article is generally unremarkable, although it does have some good quotes from prominent players; otherwise it’s just another NY Times article on Google Books. But it has two notable features. The first is the quote from Google’s Dan Clancy, in the second paragraph, citing a remarkably high volume of usage of out-of-copyright books. The second, which is why I’m excerpting the article at some length, is that it was given surprisingly little attention in the blogosphere/twittersphere when it was published a month ago.

Google hopes to open a trove of little-seen books [IHT version]
by Motoko Rich, New York Times, Jan 5, 2009

Ever since Google began scanning printed books four years ago, scholars and others … have been able to tap a trove of information that had been locked away on the dusty shelves of libraries and in antiquarian bookstores.

[boldface added] According to Dan Clancy, the engineering director for Google book search, every month users view at least 10 pages of more than half of the one million out-of-copyright books that Google has scanned into its servers.

The agreement, pending approval by a judge this year, also paved the way for both sides to make profits from digital versions of books. Just what kind of commercial opportunity the settlement represents is unknown, but few expect it to generate significant profits for any individual author. Even Google does not necessarily expect the book program to contribute significantly to its bottom line. … “We did not think necessarily we could make money,” said Sergey Brin, a Google founder and its president of technology, in a brief interview at the company’s headquarters. “We just feel this is part of our core mission. There is fantastic information in books. Often when I do a search, what is in a book is miles ahead of what I find on a Web site.”

Users are already taking advantage of out-of-print books that have been scanned and are available for free download. Mr. Clancy was monitoring search queries recently when one for “concrete fountain molds” caught his attention. The search turned up a digital version of an obscure 1910 book, and the user had spent four hours perusing 350 pages of it.

“More students in small towns around America are going to have a lot more stuff at their fingertips,” said Michael A. Keller, the university librarian at Stanford. “That is really important.”

Some librarians privately expressed fears that Google might charge high prices for subscriptions to the book database as it grows. … David Drummond, Google’s chief legal officer, said the company wanted to push the book database to as many libraries as possible. “If the price gets too high,” he said, “we are simply not going to have libraries that can afford to purchase it.”

Authors view the possibility of readers finding their out-of-print books as a cultural victory more than a financial one. … “Our culture is not just Stephen King’s latest novel or the new Harry Potter book,” said James Gleick, a member of the board of the Authors Guild. “It is also 1,000 completely obscure books that appeal not to the one million people who bought the Harry Potter book but to 100 people at a time.”

Some scholars worry that Google users are more likely to search for narrow information than to read at length. “I have to say that I think pedagogically and in terms of the advancement of scholarship, I have a concern that people will be encouraged to use books in this very fragmentary way,” said Alice Prochaska, university librarian at Yale.

“There is no short way to appreciate Jane Austen …,” said Paul Courant, university librarian at the University of Michigan. “But a lot of reading is going to happen on screens. One of the important things about this settlement is that it brings the literature of the 20th century back into a form that the students of the 21st century will be able to find it.”

Adam Hodgkin, in Google Pictures and Google Books, wonders why Google has chosen to put Prado paintings in Google Earth rather than in Google Images. In December I asked a similar question about Google’s putting Life Magazine pictures in Google Images, but putting other picture-laden magazines in Google Books. And, in another recent launch they’ve put newspapers, which also have many pictures, in Google News.

Once again I come back to the theme of this blog — Pictures are just different — They don’t fit neatly into our categories. Pictures are an important part of several different media — books, magazines, newspapers, and (of course) art — So what slot do we put them in?

Even before the recent questions arose with Life Magazine pictures, Google Magazines, Google Newspapers, and Prado paintings, there’s the ongoing but little-noted question of pictures in the growing collection of public domain books in Google Books. In my experience, these are completely absent from Google Image Search. When will Google make this connection?

Figuring out what category to put them into, of course, is a relatively minor problem compared to the BIG PROBLEM with pictures, which is making them searchable! If there were one category that was searchable, then of course that would be the place for Google to put them!