Yale Image Finder (YIF) is a search engine that searches for images in medical articles in PubMed Central. YIF is notable because it searches for text that is contained in images, many of which are charts and graphs with embedded “text” describing the data being presented. The “text” in these images, as in the example from YIF below, is converted to searchable OCR text.
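To illustrate the general idea (this is not YIF’s actual pipeline, which I haven’t seen described at the code level), here is a minimal sketch of extracting searchable text from a figure image, assuming the Python pytesseract wrapper for the Tesseract OCR engine; the file name and the indexing step are hypothetical.

```python
# Minimal sketch of image-to-text conversion for figure search.
# Assumptions: pytesseract/Tesseract are installed; "figure1.png" is a
# hypothetical chart image. This is an illustration, not YIF's code.
from PIL import Image
import pytesseract

def extract_figure_text(image_path: str) -> str:
    """Run OCR on a chart or graph image and return its embedded text."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image)

if __name__ == "__main__":
    # The extracted text could then be indexed alongside the article's
    # caption and body text, making the figure findable by keyword search.
    print(extract_figure_text("figure1.png"))
```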

What especially strikes me about this project is how similar it is to several initiatives from Google — For several years, Google has been working on image-to-text conversion in various forms, starting with Google Catalogs (now defunct) and Google Book Search. More recently, in 2008, several patents were published which extend the potential use of this sort of technology to a variety of possibilities, some of which include use in Google Maps street view, labels in museums and stores, and YouTube videos. Also showing Google’s continuing interest in this area is the announcement in October 2008 that scanned PDF documents in Google Web Search are being converted to OCR text format.

Yale Image Finder was first announced in August 2008, so it’s surprising that I have not been able to find anywhere (including in the developers’ own scholarly description) that it has been connected to the seemingly similar initiatives by Google. The same sorts of expressions of awe and amazement that have been voiced about the Google initiatives apply equally well to the Yale project, so I’m excerpting several of these commentaries below, all written in January 2008, when the latest patents from Google inventors Luc Vincent and Adrian Ulges were published …

Bill Slawski, who has written several articles on Google image-to-text patents – Google on Reading Text in Images from Street Views, Store Shelves, and Museum Interiors:

One of the standard rules of search engine optimization that’s been around for a long time is that “search engines cannot read text that is placed within images.” What if that changed?

Here’s more from Slawski – Googlebot In Aisle Three: How Google Plans To Index The World?

It’s been an old sawhorse for years that Google couldn’t recognize text that was displayed in images while indexing pages on the Web. These patent filings hint that Google may be able to do much more with images than we can imagine.

Duncan Riley – Google Lodges Patent For Reading Text In Images And Video:

I may be stating the blatantly obvious when I say that if Google has found a way to index text in static images and video this is a great leap forward in the progression of search technology. This will make every book in the Google Books database really searchable, with the next step being YouTube, Flickr (or Picasa Web) and more. The search capabilities of the future just became seriously advanced.

Of course — sorry to keep harping on it! — as much as recognizing text in pictures would be a great advance, the REAL advance, of recognizing the actual objects in pictures, the philosopher’s stone of image search, still seems far from happening.


Adam Hodgkin, in Google Pictures and Google Books, wonders why Google has chosen to put Prado paintings in Google Earth rather than in Google Images. In December I asked a similar question about Google’s putting Life Magazine pictures in Google Images, but putting other picture-laden magazines in Google Books. And, in another recent launch they’ve put newspapers, which also have many pictures, in Google News.

Once again I come back to the theme of this blog — Pictures are just different — They don’t fit neatly into our categories. Pictures are an important part of several different media — books, magazines, newspapers, and (of course) art — So what slot do we put them in?

Even before the recent questions arose with Life Magazine pictures, Google Magazines, Google Newspapers, and Prado paintings, there’s the ongoing, but little-noted question of pictures in the growing collection of public domain books in Google Books. In my experience, these are completely absent from Google Image Search — When will Google make this connection?

Figuring out what category to put them into, of course, is a relatively minor problem compared to the BIG PROBLEM with pictures, which is making them searchable! If there were one category to put them into that was searchable, then of course that would be the place for Google to put them!

In a brief response letter, author and publisher Marc Aronson writes about the copyright status of pictures that are in publisher partner books in Google Books. Aronson suggests that the rights for pictures are separate from the rights for text. I’ve corresponded with Aronson to expand on this idea, and he says that in his experience as an author and editor, he has been told that he needs to obtain rights to pictures and text separately. I’ve searched for other commentary on this issue, and have found very little. It’s a subject that needs exploration. Anyone have ideas?

All books in the publisher partner program, of course, are under copyright, and are available only in Limited Preview, with the publisher giving Google the rights to display a specific number of pages. In some cases of books containing pictures, however, the pages are available, but without the pictures. Is this because the publisher has gotten the rights for limited preview of the text, but not the pictures, as Aronson suggests? The three examples below show a variety of Limited Preview options. The first two are especially pertinent, because they are for books from the same publisher (Macmillan), in the same series, that have a different picture preview status, possibly indicating that the illustrator has given permission to display pictures in the first case, but not in the second.

In this example, the first 39 pages* are available for preview, with all pictures displaying. There are about 30 thumbnail images for pages with pictures on the About this Book page.
Birds of North America (Golden Field Guides)
By Chandler S. Robbins et al., Illustrated by Arthur Singer, Published by Macmillan, 2001

In this book, from the same publisher, the first 37 pages* are available for preview, but almost all of the pictures do not display; they are replaced with the message “Copyrighted image.” There are no thumbnail images on the About page.
Wildflowers of North America (Golden Field Guides)
By Frank D. Venning, Illustrated by Manabu C. Saito, Published by Macmillan, 2001

This book follows the most common, fairly liberal, pattern of publishers in Limited Preview books, with the first 50 pages* available, including all pictures. A full complement of 30 thumbnails is on the About page.
Central Rocky Mountain Wildflowers
By H. Wayne Phillips, Illustrated, Published by Globe Pequot, 1999

* The number of pages available for preview varies from session to session — The number given here is the maximum I experienced.

Mike Cane hits the target on color eBooks

Truly, the first device that can do color eBooks will change things forever … There are three recent signs — as well as a total wild card — that point to possible dramatic changes in the eBook-reading hardware landscape. …

… The first is Samsung hitting the pedal hard on OLED screen manufacturing.

… The second development has been Hewlett-Packard demonstrating color eInk screens.

… The third piece of this puzzle: Amtek Rumored to Show Slate Netbook at CES 2009.

… The Wild Card in all this? … Pixel Qi, which brags it has revolutionary screens that will basically run on electrons by osmosis instead of the greedy sip-sip-sip of current technology.

Sad to say, this is one of Mike’s last blog postings — His incisive comments on eBooks will be missed.

Andrew Smith, at the Dallas News, writes on the same subject in the article — Why e-books will rule

… Nearly all non-fiction books cry out for far more illustration than they contain, but the costs of adding pictures and charts (especially color pictures and charts) are prohibitive. That’s why you see so many non-fiction books with all the photos bunched up into a couple of glossy-page sections in the center. It’s the only cheap(ish) way to get the job done. Color E-Ink will change that forever.

Scientific and medical books, which make heavy use of color illustrations, especially stand to benefit from the advent of color eBooks, perhaps lowering the prices that, for print textbooks, can break a student’s budget.

Until now, books with pictures, especially color pictures, have been a relatively small part of Google Books. But the addition of highly visual, popular magazines changes this — The titles added so far are filled with pictures!

On one level, more pictures in Google Books is gratifying — a theme of this blog! But the navigation/search capabilities for finding these pictures are limited. The best way seems to be to use Advanced Search and limit the search to Magazines. But the results listing for this is text-only. It would be much easier to search for pictures with the sort of thumbnail search results interface that’s used in Google Image Search.

In light of the launching of picture-laden magazines as part of Google Books, it’s interesting to note that only last month Google launched Life magazine pictures as part of Google Image Search. Google is facing the same choice that librarians have been considering for some time — Should books (or magazines) that have many pictures be considered mainly as books that happen to have pictures, or as pictures that happen to be in books?

The pictures & links below are from magazines that are in Google Books. I’ve chosen them because I know from work on Hardin MD that they are on highly-searched subjects, which would likely appear in Google Image Search if they were crawlable.


Color pictures in full-view books in Google Books are not common. This is not surprising, since color pictures were uncommon in books published before 1923, the public domain cutoff date. Searches in Google Books for likely subjects — museum, sculpture, French painting, history — do find many books with pictures, but they are almost all black and white.

An exception to the general lack of color illustrations in older books is in the areas of botany and dermatology, two subjects in which I have a particular interest. In these subjects there were many books published in the 19th century, especially in Europe, with excellent color illustrations. A few examples from the Google Books “About This Book: Selected Pages” are shown here.

For more see Color Pictures in Google Books: More examples

“Flickr takes the sun out of the sunset” — The picture to the left, from Flickr, shows the full picture and its square thumbnail in the inset. Thumbnails like these are generated automatically by Flickr and other photo management systems, which take a portion from the center of the picture to make the thumbnail. This works well if the most important subject is at the center of the picture. But if the picture is relatively wide or tall and its main subject is not in the center, as in the example at left, with the sun off to one side, the thumbnail misses it. Looking at this example (Long Beach Sunset) in Flickr, note that the first thumbnail on the Flickr page (top left) is the one for the larger picture (shown on our page with the thumbnail in the yellow-outlined inset).

In large mass-production systems like Flickr, automatic thumbnails are unavoidable, and my point is not that they should never be used. Instead, my point is that, on many levels, pictures require more human input than text to make them optimally usable. Pattern recognition — the simple observation that the thumbnail of a picture of a sunset SHOULD CONTAIN THE SUN — is something that the human brain does easily, but that does not come naturally to a computer.
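As a rough sketch of the center-crop behavior described above — my approximation, using Pillow, of what systems like Flickr do automatically, not their actual code:

```python
# Rough sketch of automatic square "center-crop" thumbnailing.
# This approximates what photo systems like Flickr do; it is not their code.
from PIL import Image

def center_crop_thumbnail(image_path: str, size: int = 150) -> Image.Image:
    img = Image.open(image_path)
    w, h = img.size
    side = min(w, h)                    # largest square that fits the image
    left = (w - side) // 2              # centered horizontally
    top = (h - side) // 2               # centered vertically
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size))  # shrink the square to thumbnail size

# In the sunset example, a sun near one edge of a wide photo falls outside
# this centered square, so the automatic thumbnail "takes the sun out."
```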


Another sort of problem in automatic production of thumbnails is making a thumbnail by simply reducing the size of the large picture. If the main subject of the picture is relatively small, it is not visible in a small thumbnail.

The picture to the left is from the Hardin Library ContentDM collection. The inset in the upper right shows the thumbnail that’s generated automatically by the system, which does a poor job of showing details of the picture. The lower inset shows a thumbnail made manually, which gives a much clearer view of the central image in the picture.

Cropping a picture to produce a thumbnail, as done here, takes more subtle human judgment than the Flickr picture in the first example, where the weakness of automatic production is obvious. With cropping, there’s inevitably a trade-off between showing the whole picture in the thumbnail and showing its most important subject. In cases such as this one from ContentDM, where almost all of the detail in the picture will be lost in a small thumbnail, it seems better to focus on a central image that will show up in the thumbnail.
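To make that trade-off concrete, here is a hedged sketch of the two approaches, again using Pillow: simply shrinking the whole picture versus cropping first to a human-chosen region of interest. The crop box coordinates are placeholders, not anything taken from ContentDM.

```python
# Two ways to make a thumbnail, sketched with Pillow. The crop box is a
# hypothetical, human-chosen region of interest, not ContentDM's behavior.
from PIL import Image

THUMB_SIZE = (120, 120)

def resize_only(image_path: str) -> Image.Image:
    """Automatic approach: shrink the whole picture; fine detail vanishes."""
    img = Image.open(image_path)
    img.thumbnail(THUMB_SIZE)   # in-place resize that preserves aspect ratio
    return img

def crop_then_resize(image_path: str, box: tuple) -> Image.Image:
    """Manual approach: crop to the main subject first, then shrink."""
    img = Image.open(image_path).crop(box)  # box = (left, top, right, bottom)
    img.thumbnail(THUMB_SIZE)
    return img

# Example: a person decides the central figure sits roughly in this region.
# thumb = crop_then_resize("plate.jpg", (400, 200, 900, 700))
```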

Finally, a few examples from Hardin MD, below, show how we have done cropping to improve the detail in our thumbnails. The thumbnails on the left in each of the three pairs are made by simply reducing the size of the full picture. On the right in each pair are the thumbnails we use, which we made by cropping the full picture before making the thumbnail.

The biomedical, scientific pictures that we work with in Hardin MD are fairly easy to make thumbnails for, because they generally have a well-defined focus that’s usually captured well by automatically-generated thumbnails. More artistic, humanities-oriented pictures, such as the ones discussed here from Flickr and ContentDM, however, often have more subtle subjects that benefit from human intelligence in the production of thumbnails.

[When Hardin MD was launched in 1996, its main purpose was to provide links to health science resources on the Web. In recent years, the emphasis has been on providing access to medical pictures.]

We first started tracking how well Google was finding Hardin MD pages in about 2001, when search engine optimization was in its infancy, and most people, like us, had not heard the term “SEO.” But in today’s lingo, that’s pretty much what we were doing — Learning to use language that would help people searching in Google to find our pages — So here’s a little example of using search engine optimization techniques before they became famous as SEO. …

Users of Hardin MD will notice that the word “pictures” is used frequently on our pages and the word “images” is rarely used. Why is this? Basically, the answer is simple — We use “pictures” because that’s the word people use in searching.

The screen-shots below, for the Hardin MD : Impetigo Pictures page, show this clearly. The Extreme Tracker shot for this page shows the large proportion of search engine traffic from the word “pictures” (36%) compared to the small amount of traffic from the word “images” (0.6%).

[Screen-shots: Hardin MD : Impetigo Pictures page | Keywords (Extreme Tracker)]

The Google screen-shots show that the Impetigo Pictures page gets an equally high ranking for the two words, so it’s apparent that “pictures” is being searched much more frequently.

[Screen-shots: Google search “impetigo pictures” | Google search “impetigo images”]

(Note that these screen-shots have been photo-edited to fit the space — Ads and other text not relevant to the article have been removed. All screen-shots captured in July 2008.)

Here’s the background …

In about 2001, we started noticing how people were finding Hardin MD pages in search engines, and designing our pages to make them more likely to be found. An important part of this was using words that people were more likely to search (e.g. “heart diseases” instead of “cardiology”). Tools such as WordTracker that show how many people are searching for particular words are especially useful for this.

About this same time, we were starting to make links to other sites that have pictures on medical/disease subjects. Using WordTracker and Extreme Tracker (to see which words people were searching to find our pages), we found that the word “pictures” was strikingly effective. At the time, we assumed that the appropriate word to use was “images,” since that’s the word used on most medical/disease pages at other sites. We could see clearly, however, that using the word “pictures” on our pages brought much more traffic than the word “images.” So we’ve gone on from there, and now have high rankings in Google for many medical/disease subjects combined with “pictures,” as with Impetigo.

Extreme Tracker | WordTracker

As computers have become more powerful, many of the aspects of handling text that were formerly done by humans have been taken over by computers. Pictures, however, are much more difficult to automate — Recognizing patterns remains a task that humans do much better than computers. A human infant can easily tell the difference between a cat and a dog, but it’s difficult to train a computer to do this.

In pre-Google days, the task of finding good lists of web links needed the input of smart humans (and Hardin MD was on the cutting edge in doing this). Now, though, Google Web Search gives us all the lists we need.

Pictures are another story — on many levels, pictures require much more human input than text.

The basic, intractable problem with finding pictures is that they have no innate “handle” allowing them to be found. Text serves as its own handle, so it’s easy for Google Web Search to find it. But Google Image Search has a much more difficult task: it still has to rely on some sort of text handle associated with a picture to find it, and it is at a loss to find pictures not associated with text.

The explosive growth of Hardin MD since 2001 (page views in 2008 are over 50 times larger) has been strongly correlated with the addition of pictures. This period has also seen the growing presence of Google, with its PageRank technology, which has made old-style list-keeping, as featured in Hardin MD, less important.

Though Google has accomplished much in the retrieval of text-based pages, it’s made little progress in making pictures more accessible. Google Image Search is the second most-used Google service, but its basic approach has changed little over the years.

The basic problem for image search is that pictures don’t have a natural handle to search for. Because of this, it takes much more computer power for the Google spider to find new pictures, and consequently it takes much longer for them to be spidered than for text pages (months instead of days).

Beyond the problem of identifying pictures there are other difficult-to-automate problems for image search:
• How to display search results most efficiently to help the user find what they want — Do you rank results according to picture size, number of related pictures at a site, or some other, more subjective measure of quality?
• What’s the best way to display thumbnail images in search results?
• How much weight should be given to pictures that have associated text that helps interpret the picture?

So — Good news for picture people! — I would suggest that pictures are a growth sector of the information industry, and a human-intensive one. I would predict that text-based librarians will continue to be replaced, as computers become more prominent. But there will continue to be a need for human intelligence working in all areas relating to pictures, from indexing/tagging to designing systems to make them more accessible.