A New York Times article on Google Flu Trends reports that Google’s methodology “has been validated by an unrelated study” based on Yahoo! search data. The lead author of that study is Philip Polgreen, an infectious disease doctor at the University of Iowa.

I was glad to learn about the Polgreen study, first, of course, because Polgreen and colleagues are right here at the University of Iowa! But beyond that, it was good to find that the full article by the Polgreen team gives more details about the flu-related search terms they used than the Google Flu Trends team does, making it easier to break down the complicating factors in flu searching. Specifically, they report that they excluded the following terms:

bird, avian, pandemic
vaccine, vaccination, shot

As discussed in accompanying articles (see below), flu is a particularly complicated disease for correlating disease occurrence and web search behavior, because of the existence of bird flu, and because there is a vaccine for flu — exactly the factors that have been excluded by Polgreen et al. It seems likely that the Google Flu Trends team is using a similar method.
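
The exclusion rule described above can be sketched as a simple query filter. This is only an illustration, not the Polgreen team's actual code: the function name, the whole-word matching, and the requirement that a query mention "flu" or "influenza" are my assumptions. Only the six excluded terms come from their article.

```python
import re

# Terms the Polgreen team reports excluding (from the list above).
EXCLUDED_TERMS = {"bird", "avian", "pandemic", "vaccine", "vaccination", "shot"}

# Assumption: a query counts as flu-related if it mentions one of these words.
FLU_TERMS = {"flu", "influenza"}


def is_flu_illness_query(query: str) -> bool:
    """Return True for flu queries that contain none of the excluded terms."""
    # Whole-word matching only, so "shots" or "vaccines" would slip through.
    words = set(re.findall(r"[a-z]+", query.lower()))
    return bool(words & FLU_TERMS) and not (words & EXCLUDED_TERMS)


# Hypothetical sample queries, for illustration.
queries = ["flu symptoms", "flu shot", "bird flu", "influenza treatment"]
kept = [q for q in queries if is_flu_illness_query(q)]
print(kept)
```

A real filter would probably also need to handle plurals and misspellings, which this sketch ignores.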

Incidentally, more on the Iowa connection — Philip Polgreen has been involved for several years with the Iowa Health Prediction Market, a spin-off of the Iowa Electronic Markets, a real-money prediction market/futures market that’s used to make predictions in political elections.

** This is one of a group of three articles on Google Flu Trends:

Together, these articles suggest that, although it’s difficult to know with assurance because Google has not revealed the search terms that they use for GFT, it seems likely that they’ve done a good job in working around the complications of flu-related search patterns.

The CDC data above shows that the occurrence of flu generally peaks in February; the Google Insights data below for flu symptoms, not surprisingly, has a similar peak in February.

Google Insights, which uses the same data as Google Flu Trends, shows quite a different pattern for flu shot (below), which peaks in October or November (flu vaccine peaks similarly).

How about searching for just the word flu (below)? Interestingly, this seems to combine the peaks in the two graphics above, for flu symptoms and flu shot. The exaggerated peaks in 2004 and 2005 are likely caused by people’s concerns about vaccine shortage (more on this in the accompanying posting, Google Flu Trends: Kudos & Complications).

Looking at the evidence of these graphics from Google Insights, it seems likely that the Google Flu Trends team is excluding search terms relating to flu vaccine, and concentrating on terms that relate to symptoms. See confirmation of this in accompanying posting, Google Flu Trends: The Iowa Connection.

The data shown here seems to indicate that for a seasonal disease for which there is a vaccine, the search patterns for “disease: symptoms,” “disease: vaccine/shot,” and the disease term itself differ, following the times of year when the disease occurs and when the vaccine is given. This idea is confirmed by Google Insights data for pneumonia, another respiratory disease that has a vaccine. The patterns are similar to those for flu, with high peaks for pneumonia shot in October, and somewhat lower peaks for pneumonia and pneumonia symptoms in February.
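
For readers who want to check this kind of pattern against their own exported numbers, here is a minimal sketch that finds the calendar month in which a search term peaks on average. The function names are mine, and the monthly values are invented for illustration; they are not real Google Insights data.

```python
from collections import defaultdict


def peak_month(series):
    """Return the calendar month (1-12) with the highest average value.

    `series` maps (year, month) tuples to a search-interest number.
    """
    by_month = defaultdict(list)
    for (_year, month), value in series.items():
        by_month[month].append(value)
    return max(by_month, key=lambda m: sum(by_month[m]) / len(by_month[m]))


def to_series(rows):
    """Turn {year: [12 monthly values]} into {(year, month): value}."""
    return {
        (year, month): value
        for year, values in rows.items()
        for month, value in enumerate(values, start=1)
    }


# Invented numbers for illustration only; not real search data.
symptoms = to_series({
    2006: [70, 100, 60, 35, 20, 15, 15, 18, 30, 45, 50, 60],
    2007: [75, 95, 65, 30, 22, 14, 16, 20, 28, 40, 55, 62],
})
shot = to_series({
    2006: [20, 15, 10, 8, 5, 5, 6, 12, 40, 100, 80, 35],
    2007: [18, 14, 9, 7, 6, 5, 7, 15, 45, 95, 85, 30],
})

print(peak_month(symptoms), peak_month(shot))
```

With these made-up values the symptoms series peaks in month 2 (February) and the shot series in month 10 (October), which mirrors the shape of the graphics described above.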

Bronchitis — A disease with no vaccine

Bronchitis is a respiratory condition that does not have a vaccine. As the graphics from Google Insights below show, the pattern is different from flu and pneumonia: the peaks for the disease itself (bronchitis, below) and for the disease with symptoms are much the same, making it less complicated to track search patterns. Apparently the people who search for the disease are in fact people who have the disease.

Bronchitis symptoms, from Google Insights.

Google Flu Trends is an elegant application of search data to medicine. Working on Hardin MD, I’ve long noticed seasonal variations in certain diseases: colds, flu, & respiratory illnesses peak in winter, and insect bites & sun exposure conditions peak in summer. I pay a lot of attention to the search terms that people use to get to Hardin MD pages, so Google’s mining of this data to serve the health of the community is especially interesting.

The idea of Google Flu Trends is shown nicely in the snapshot above from the animation at the Google Flu Trends site: Google finds that there is an excellent correlation between the flu-related terms that people search for and the occurrence of flu, as measured by CDC data. And, as shown in the animation, the Google search data is available in near real time and so precedes the CDC data, which takes 1-2 weeks to be reported and compiled.
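
One way to put a number on "correlates with, and precedes" is a lagged correlation: compare each week's search volume with the reported flu figure some weeks later, and see which delay fits best. The sketch below is my own illustration, not Google's method, and the weekly values are invented so that the reported series is simply the search series delayed by two weeks.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlations(search, reported, max_lag=4):
    """Correlate search[t] with reported[t + lag] for lag = 0..max_lag."""
    return {
        lag: pearson(search[: len(search) - lag], reported[lag:])
        for lag in range(max_lag + 1)
    }


# Invented weekly values: `reported` is `search` delayed by two weeks.
search = [1, 2, 4, 8, 12, 9, 5, 3, 2, 1, 1, 1]
reported = [1, 1, 1, 2, 4, 8, 12, 9, 5, 3, 2, 1]

scores = lagged_correlations(search, reported)
best_lag = max(scores, key=scores.get)
print(best_lag)
```

With real data the best lag would rarely be this clean, and a high correlation alone would not show that the searches come from people who are actually sick, which is the complication discussed below.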

Complications

The idea of using search data to track the progression of disease outbreaks certainly is elegant, and Google deserves congratulations for it. In choosing flu as the first example, however, Google has chosen a disease with complicating factors.

The nature of these potentially complicating factors is suggested in the graphic above from the Google Flu Trends site. A big question here is: what caused the spike in flu occurrence and flu search activity in Dec 2003 – Jan 2004?

Because Google has chosen not to reveal the exact search terms that they are using to determine the volume of flu-related searching (see supplementary material accompanying Google’s paper in Nature), it’s difficult to know the cause of the 03-04 spike with certainty. But looking back at the chronology of that time period sheds light. There was a major shortage of the flu vaccine in late 2004, which is certainly related to the spike shown in the graphic. The CDC spike (yellow) shows that many people had flu, presumably because they were unable to get the vaccine. The Google spike (blue) is even higher, which may indicate that a significant number of people were searching for flu information not because they were infected, but because they were looking for information on how to get the vaccine. The accompanying article (Flu Symptoms vs Flu Shot) shows that there is in fact a clear indication of heightened search activity for flu vaccine-related terms during the autumn pre-flu season.

The other complicating factor in looking at flu-related search activity is bird flu, and this seems to have been addressed well by Google — The large bird flu outbreak in Asia, and corresponding bird flu scare throughout the world, occurred in late 2004 and early 2005. Since there is no major spike shown in the graphs for this time, Google apparently has excluded bird flu/avian flu search terms from the aggregate group of terms it’s using.

Peter Suber, at Open Access News, has a good article on Google’s recent announcement that they are now OCR’ing scanned PDF documents so that they become searchable text documents in Google Web Search.

Scroll down especially to Suber’s comments, in which he describes the background to this Google advance, which is already in Google Book Search. As he says, Google Book Search has had an OCR’d text layer version of full-view books from the start, which is how they can be searched. (Google Catalogs also has a searchable text layer.)

For more on searchable and non-searchable text see: Identifying Google scanned PDF’s

Google recently announced that scanned PDF documents are now available in Google Web Search. PDF documents have been in Google before, but most PDF documents that have been scanned from paper documents have not, so this will greatly improve access to PDF’s. As described below, it’s important to be able to distinguish scanned PDF’s from others, of the sort that have been in Google before.

Scanned PDF documents are originally created by making an image scan of a paper document, and since the text is an image, it’s not selectable or searchable as text. The other kind of PDF document, usually called native PDF, that’s been in Google before, is originally created from an existing electronic formatted document, like a Word document, and its text is selectable and searchable as text.

From Google search results it’s not possible to determine whether a PDF document is a scanned document or a native document. Both simply say “File Format: PDF/Adobe Acrobat.” To see if it’s scanned or native PDF, go to the document and click on a word to see if it can be selected. If it can, it’s native PDF; if not, it’s scanned PDF. It’s important to know this because in a scanned PDF, the text is not searchable within the browser’s PDF reader. This is not readily apparent, because the search command seems to work, but comes up with zero results. To search the text of a scanned document, go back to the search results and click “View as HTML,” which has the text of the document.
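
The click-on-a-word test can be roughly approximated in code once a PDF has been downloaded. The sketch below is a crude heuristic of my own, not anything Google or Adobe provides: it assumes that a PDF with no font declared anywhere has no text layer and is therefore an image-only scan. It will misjudge a scanned PDF that already carries an OCR text layer (it declares fonts, so it is reported as native), and it does not handle encrypted files. The function names are hypothetical.

```python
import re
import zlib

# Matches the body of each PDF stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def declares_fonts(pdf_bytes: bytes) -> bool:
    """Crude check: does the file mention a /Font anywhere?

    Looks in the raw bytes first, then inside Flate-compressed streams,
    where newer PDFs may keep their object dictionaries.
    """
    if b"/Font" in pdf_bytes:
        return True
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes after the compressed data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (e.g. a JPEG image); skip it
        if b"/Font" in inflated:
            return True
    return False


def guess_pdf_kind(pdf_bytes: bytes) -> str:
    """Return "native" if any font is declared, otherwise "scanned"."""
    return "native" if declares_fonts(pdf_bytes) else "scanned"
```

In practice the bytes would come from something like `open(path, "rb").read()`. Clicking on a word in the document, as described above, remains the more reliable test.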

Examples from Google:
Google search : Scanned PDF – Text cannot be selected (Notice that the text in this document is scratchy, poor quality, another indication of scanned text).
Google search : Native PDF – Text can be selected

See also: Google Books and Scanned PDF’s

Kalev Leetaru (Univ Illinois) recently published a lengthy and interesting article comparing Google Books and the Open Content Alliance. It’s especially interesting because it brings together a good description of many nitty-gritty details of Google Books that are not easy to track down. I’m excerpting a few passages on the use of color and PDF format in Google Books.

Color in Google Books – I have the impression, as Leetaru says, that when Google first started scanning books they didn’t scan in color — They do now though, at least in some cases.

[I've added the bold-face in quotes below. The order of quotes is not necessarily the same as in Leetaru's article.]

Since the majority of out–of–copyright books do not have color photographs or other substantial color information, Google decided early on that it would be acceptable to trade color information for spatial resolution.

Google’s use of bitonal imagery and its interactive online viewing client significantly decrease the computing resources required to view its material. … Google Book’s bitonal page images, on the other hand, render nearly instantly, permitting realtime interactive exploration of works.
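
To get a feel for the trade of color for resolution that Leetaru describes, here is a back-of-envelope calculation of uncompressed page sizes. The page size and the two resolutions are my own assumptions, chosen only to illustrate the arithmetic; they are not figures from Leetaru’s article or from Google, and real files are compressed, so actual sizes will differ.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed figures: a letter-size page, 24-bit color at 300 dpi
# versus 1-bit (bitonal) at 600 dpi.
color = raw_page_bytes(8.5, 11, dpi=300, bits_per_pixel=24)
bitonal = raw_page_bytes(8.5, 11, dpi=600, bits_per_pixel=1)
print(color, bitonal, color / bitonal)
```

Under these assumptions the bitonal page has four times as many pixels as the color page but needs only one sixth of the raw storage, which suggests why bitonal images can be lighter to store and display.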

Use of PDF in Google Books – It’s interesting that Leetaru says the Google Books view “mimics the PDF Acrobat viewer.” Until recently, I avoided using the “Download PDF” button link in Google Books, thinking that it was mainly for downloading to print, and that the PDF view would take a long time to load. But I’m finding that it loads quickly, and provides a fairly usable interface that is in fact reminiscent of the Google Books view, as Leetaru suggests.

Google realized it was necessary to use different compression algorithms for text and image regions and package them in some sort of container file format that would allow them to be combined and layered appropriately. It quickly settled on the PDF format for its flexibility, near ubiquitous support, and its adherence to accepted compression standards (JBIG2, JPEG2000).

While many digital library systems either do not permit online viewing of digitized works, or force the user to view the book a single page at a time (called flipbook viewing), Google has developed an innovative online viewing application. Designed to work entirely within the Web browser, the Google viewing interface mimics the experience of viewing an Adobe Acrobat PDF file.

While most services take advantage of the linearized PDF format, Google made a conscious decision to avoid it. Linearized PDFs use a special data layout to allow the first page of the file to be loaded immediately for viewing … Google found several shortcomings with this format [noting that] the majority of PDF downloads are from users wanting to view the entire work offline or print it [and that] for these users, linearized PDFs provide no benefit.

See Leetaru’s extensively-referenced article for many other useful details.

Color pictures are generally not common in full-view books in Google Books. This is not surprising, since color pictures were uncommon in books published before the copyright cutoff date (1923). Searches in Google Books for likely subjects (museum, sculpture, french painting, history) do find many books with pictures, but they are almost all black and white.

An exception to the general lack of colored illustrations in older books is in the areas of botany and dermatology, two subjects in which I have a particular interest. In these subjects there were many books published in the 19th century, especially in Europe, with excellent color illustrations. A few examples from Google Books About This Book: Selected Pages are shown here.

For more see Color Pictures in Google Books: More examples