There’s been a lot of buzz about last week’s announcement of mobile access to Google Book Search public-domain books. I’ve been looking hard for nitty-gritty details of how it works, though, and haven’t found much. The best I’ve found is in comments by bowerbird on an announcement article on toc.oreilly.com. It’s easy for comments to get lost, so I’m excerpting most of bowerbird’s words here:

this offering is very good. extremely good. the interface is quite nice…

it was great to see google is serving digital text, rather than scans, since text is a lot more nimble. however, a tap on a paragraph brings up the scan of that paragraph, which is nice. and another tap restores the text. so if you want to verify the o.c.r., it’s simple to do. as i said above, this is nicely done.

curiously, in the one book i checked (roughing it), the text was extremely accurate as well, which is a pleasant discovery. i found only one o.c.r. error — “firty” for “fifty”, due to a blotch on the page …

this quality text is _not_ typical of google’s raw o.c.r., so they’ve evidently run some clean-up routines on it. i’m curious to see if they share this cleaned-up text with their library partners, or keep it to themselves… (no, the libraries weren’t smart enough to ask for it, as far as i know, let alone write it into the contracts.)

I’ve bolded what I take to be the most interesting point here: that Google has done an extra-special job of OCR’ing text for GBS mobile. As bowerbird suggests, it remains to be seen whether Google will share the results; I hope it will also share more about this process, sooner or later.
