Google recently announced that scanned PDF documents are now available in Google Web Search. PDF documents have been in Google before, but most PDF documents that have been scanned from paper documents have not, so this will greatly improve access to PDF’s. As described below, it’s important to be able to distinguish scanned PDF’s from others, of the sort that have been in Google before.
Scanned PDF documents are originally created by making an image scan of a paper document, and since the text is an image, it’s not selectable or searchable as text. The other kind of PDF document, usually called native PDF, that’s been in Google before, is originally created from an existing electronic formatted document, like a Word document, and its text is selectable and searchable as text.
From Google search results it’s not possible to determine whether a PDF document is a scanned document or a native document — Both simply say “File Format: PDF/Adobe Acrobat.” To see if it’s scanned or native PDF, go to the document and click on a word to see if it can be selected. If it can, it’s native PDF; if not it’s scanned PDF. It’s important to know this because in a scanned PDF, the text is not searchable within the PDF-browser reader. This is not readily apparent, because the search command seems to work, but comes up with zero results. To search the text of a scanned document, go to search results, and click “View as HTML,” which has the text of the document.
Examples from Google:
Google search : Scanned PDF – Text cannot be selected (Notice that the text in this document is scratchy, poor quality, another indication of scanned text).
Google search : Native PDF – Text can be selected
See also: Google Books and Scanned PDF’s