I love serendipity — I happened to see these two pieces on the same day recently, and couldn’t help putting them together. Is there a meaning somewhere here? ….

Information on the Internet That Should Go Away, Roy Tennant

This is the kind of information I wish would disappear: old, outdated, in many cases downright misleading or incorrect. Now to only find the algorithm for determining these characteristics and nuking this dreck off the net! (boldface added here and below)

A case of great minds thinking alike? …

Google Announces Plan To Destroy All Information It Can’t Index, The Onion

MOUNTAIN VIEW, CA—Executives at Google, the rapidly growing online-search company that promises to “organize the world’s information,” announced Monday the latest step in their expansion effort: a far-reaching plan to destroy all the information it is unable to index. … “Our users want the world to be as simple, clean, and accessible as the Google home page itself,” said Google CEO Eric Schmidt at a press conference held in their corporate offices. “Soon, it will be.”

Fun Kicker — My first idea for a title for this article was “… Trimming the Internet.” Then I thought differently, and googled for “weeding the Internet” to see what might turn up – Sure enough, one of a handful of retrievals with that phrase is a library handout from libraries.uc.edu on The Library vs The Internet, sounding just like Roy: “No one’s weeding the Internet, and sites with seriously outdated information are still available.”

Eric Rumsey is at: eric-rumsey AttSign uiowa dott edu and on Twitter @ericrumsey

The recent controversy about the Google Book Search Settlement seems to have taken up people’s Google-watching attention so much that advances in the way GBS actually works have been getting overlooked. Several notable improvements were made during the summer, for example, that got very little recognition. Another change that seems to have gotten little recognition is that Google web searches have begun to include links to books in GBS in the last 1-2 years (as in the example at left). Particularly in searching for historical topics, I’ve been seeing searches recently in which the majority of the first 10 hits are from GBS, a great advance, I think, for historical research. Up to now, my experience has been that history has been a fairly weak subject on the Web: locked away in books, not on Web pages.

I had occasion to take advantage of the newly accessible books from GBS recently, when I was least expecting it, while having a discussion with my son David, who’s a long-distance runner, about track runners of the past at the University of Iowa. I remembered that one particular runner on the team, Ted Wheeler, ran on the US Olympic team in the 1950s, and that he later went on to become the coach for the UI track team (I especially knew about him because while he was the coach he married Sheila Creth, the University Librarian at the University of Iowa Libraries, where I work). David knew that Wheeler had been in the Olympics, and thought that he had been an assistant coach at Iowa, rather than the head coach. So … of course I turned to Google to settle the “discussion.” It turned out to be a surprisingly difficult search. I assumed that it would be fairly easy to find records of recent track coaches at a large Big Ten program like Iowa. But it wasn’t. I tried several search terms without success before (Bingo!) I finally hit upon the combination that turned up the page shown here, establishing that Wheeler was, indeed, the UI track coach from 1978 to 1996, with the added benefit of a great picture!

The point of this little story: I think integrating GBS links into Google web search is a great advance, and deserves more attention. As I said above, there’s been so much negative press for Google in recent discussions of the Settlement that everything they do is interpreted negatively — I saw a link in the last couple of weeks, that I unfortunately didn’t keep track of, decrying Google’s putting GBS links in Web search results because someone thought Google was trying to unfairly boost their own content. Really?? I think there’s such a treasure in old books that the world will benefit from Google’s making them more accessible. There are questions, certainly, about the algorithm used by Google to determine which books are included in Web search results, and I hope Google will say more about that. But it’s not only Google that’s saying little on the subject — I haven’t seen much discussion at all by anybody on the integration of GBS books in Google web search results –  If anyone can find it, please add a comment or contact me by Twitter or Email.

Many possible takes on this picture. What comes to my mind first is the idea of the Attention Economy: in the days of the traditional library, before the Internet, information was a limited resource. Libraries could afford to work under the assumption that “we’ve got the good stuff, and our users have to come to us to get it.” There was little motivation to improve overly-complicated search interfaces like the picture on the right above, because users had no choice. In the new environment of the Internet, however, the limiting factor is not information, but attention. The problem for users now is not finding information, but being flooded by too much of it. In this environment, users naturally gravitate to the easiest information to find, which, of course, Apple, Google et al are glad to provide.

Another take on this is the high cost of Simplicity: the simple interfaces of Apple and Google are just the tip of the iceberg, built upon the costly labor of armies of engineers. Libraries just can’t afford to compete with this sort of juggernaut. Personally, I consider myself lucky, as a librarian, to be working in a medical library. Medical libraries have a long history of generous federal support, in the interests of the country’s health, which has enabled the creation of tools to streamline access to medical information, from Index Medicus to PubMed. For libraries generally, however, it’s still hard to compete with the resources of dotcom information providers. To end on a hopeful note: it’s encouraging to see that libraries are increasingly realizing the importance of providing Google-like interfaces for their catalogs, to gain back the attention from users that they’ve lost in the last few years.

The picture above, and the title of this post, are adapted from an article by Scott Monty — Thanks!

Google Health OneBox is a boost for NLM’s MedlinePlus — As discussed previously, though, a few tweaks could make it an even bigger boost. A problem not discussed in the previous article is the “MedlinePlus” name — It has little user recognition, and therefore gets considerably less traffic than it might with a better name. In the NLM Update at the recent Midwest Chapter/Medical Library Association meeting, NLM staffer Paula Kitendaugh said some people at NLM are aware of this, and that a different name would likely do better in Google OneBox, but that so far bureaucratic inertia has prevented a name-change.

Realizing how slowly the wheels turn in a large organization like NLM, then, a better name for MedlinePlus is unlikely to happen soon. But how about a quick fix for the name of the link in Google OneBox, to take advantage of the fire-hose of potential traffic from Google? My idea for a simple change that I think would draw more traffic, as shown in the enhanced screen shot here, is to change the link name from “Medline Plus” to “Natl Lib Med.” I think this simple abbreviation would be recognized and respected by users, and boost clicks to NLM.

As far as a new name for MedlinePlus, I don’t have any ideas so far. If anyone else does, please make a comment, or send to me via email or Twitter.

Accompanying article: MedlinePlus & Google Health OneBox

In August, Google launched Google Health OneBox (left). This puts the National Library of Medicine’s MedlinePlus (MLP) right at the top of the search results, and is potentially a valuable new source of traffic for NLM.

There are factors, however, that work against MLP — The three prominent links on the left, which are likely to get the bulk of OneBox clicks (Asthma, Google Health, & thumbnail) go to the Google Health Topics page (below). This has the same text and pictures as the MLP Encyclopedia/ADAM page that’s linked from the OneBox Medline Plus link. But there’s an important difference — The Google Health version of ADAM has Symptoms as the first section after Overview. The MLP version of ADAM, on the other hand (see further down on this page) has Causes as the first section. …

This may seem to be a minor difference. But I’ve learned — through long experience with Hardin MD and brief experience with the short-lived Medical Library Association-Google Health Coop project — that symptoms are a very popular, heavily searched topic for users (which Google certainly knows!). So I suspect that users who try out the Google Health and Medline Plus OneBox links will quickly learn to prefer Google Health because it features the symptoms information they’re looking for. It IS a positive for NLM that the Google Health page has a prominent link to MLP. But it’s rather surprising that there’s no clear credit given to ADAM as the original provider of the information — ADAM is credited only at the bottom of the page, where few users will see it (and I suspect many will consider it copyright-free, since they’ll presume that it’s from a government site.)

MedlinePlus & Google Health OneBox — How NLM can boost traffic

Change the order of sections on ADAM Encyclopedia pages, to put Symptoms at the top, as Google does. This would make the pages more interesting to most users.

Surprisingly, MLP Encyclopedia pages, which are what Google OneBox links to, have no links to the equivalent MLP Health Topic pages (for example, there is no link between the Asthma pages in the Encyclopedia and in Health Topics). After all, it’s the MLP Health Topic pages that NLM staff create and maintain, so how about linking to them from Encyclopedia pages, so the surging clickers from Google OneBox can find them!

See follow-up article: MedlinePlus Needs a New Name

[This article accompanies previous article: Tagging in Hardin MD]

Soon after the launch of Hardin MD, in 1996, we began adding keywords in the hidden META keyword field. (The first pages for HMD in Internet Archive [Dec, 1998] show them on all pages checked.) In about 2000 we began checking whether HMD pages were appearing in search engine results, and found that meta keywords didn’t seem to have much effect.

So, in late 2000, we began experimenting with putting keywords (aka tags*) at the bottom of the page, where most users wouldn’t notice them. At first we didn’t see much effect in search engine results, when using the tags mostly for variant spellings or terminology (e.g. on the Hematology page: blood diseases, haematology).

In 2001, as Google rose to prominence, and Search improved, we began using tools that gave the ability to see the popularity of specific words (HitBox, ExtremeTracking, WordTracker). We learned that using misspelled word variants as tags worked very well in drawing SE traffic. It was also during this time that links to pictures were being added to HMD, and we discovered the power of the word “pictures” in drawing SE traffic.

Time-line of tagging in Hardin MD

Based on invaluable help from Internet Archive — Starting from here: Internet Archive for Hardin MD, 1999+

The first HMD pages in Internet Archive in Dec, 1998 have meta keywords, but not tags on the page. Example of meta keywords (Hardin MD: Cardiology): health, medicine, medical, nursing, nurses, nurse, disease, diseases, best, list, lists, consumer, cardiology, cardiac, heart, stroke, cardiovascular, cardiothoracic, pacemaker, defibrillator, attack, arrest

Tagging for misspellings – Ophthalmology, I’m sure, would have been one of the first pages on which misspellings would have been used. Internet Archive pages show clearly that the first implementation was in early November, 2000. …

Ophthalmology, Nov 7, 2000 – No misspellings in meta keywords. There are no tags on page.
Ophthalmology, Nov 15, 2000 – Has misspellings in meta keywords and on page: [ophthamology]

This fits my memory of events — I was especially motivated to look for ways to draw Web traffic, because Google was just becoming prominent, rationalizing the search process, and making it easier to predict the effects of changes on page traffic.

Other examples of pages with tags on the page, with variant spellings, from about the same time: Orthopedics Nov 16, 2000 [orthopaedics] and Hematology Nov 29, 2000 [blood diseases, haematology]
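
For anyone who wants to repeat this kind of before-and-after check on archived pages, here is a minimal Python sketch. It assumes you already have the HTML of two snapshots as strings; the two snippets below are made-up stand-ins, not the real Hardin MD source, and the names MetaKeywordParser and meta_keywords are purely illustrative. It pulls the terms out of the META keywords field and lists the ones that appear only in the later snapshot.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated terms from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None (e.g. <meta name>), so fall back to "".
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords.extend(
                term.strip().lower() for term in content.split(",") if term.strip()
            )


def meta_keywords(html):
    """Return the META keywords of an HTML string, lowercased, in page order."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical stand-ins for two archived snapshots (assumed content).
before = (
    '<html><head><meta name="keywords" '
    'content="ophthalmology, eye, eye diseases"></head><body></body></html>'
)
after = (
    '<html><head><meta name="keywords" '
    'content="ophthalmology, ophthamology, eye"></head>'
    "<body><p>[ophthamology]</p></body></html>"
)

# Terms present in the later snapshot's META keywords but not the earlier one.
added = sorted(set(meta_keywords(after)) - set(meta_keywords(before)))
print(added)
```

This only looks at the hidden META field; checking for tags at the bottom of the visible page would need a separate search of the body text.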

Use of the word “pictures,” in tagging and in page titles

First use: Genital Warts Jun 10, 2002

First widespread use – Several pages linked on Hardin MD Index page Sept 30, 2002

These are excerpts from part 2 of Michael Nielsen’s long and seminal article, Is scientific publishing about to be disrupted? Part 1 of Nielsen’s article is a general consideration of how industries fail, with particular discussion of the newspaper industry and blogs. Part 2 is the heart of Nielsen’s case (and has the same title as the article), so I’m excerpting it here to bring it more attention …

Today, scientific publishers are production companies, specializing in services like editorial, copyediting, and, in some cases, sales and marketing. My claim is that in ten to twenty years, scientific publishers will be technology companies [3]. By this, I don’t just mean that they’ll be heavy users of technology, or employ a large IT staff. I mean they’ll be technology-driven companies in a similar way to, say, Google or Apple. That is, their foundation will be technological innovation, and most key decision-makers will be people with deep technological expertise. Those publishers that don’t become technology driven will die off.

What I will do … is draw your attention to a striking difference between today’s scientific publishing landscape, and the landscape of ten years ago. What’s new today is the flourishing of an ecosystem of startups that are experimenting with new ways of communicating research, some radically different to conventional journals. Consider Chemspider, the excellent online database of more than 20 million molecules, …. Consider Mendeley, a platform for managing, filtering and searching scientific papers, …. Or consider startups like SciVee (YouTube for scientists), the Public Library of Science, the Journal of Visualized Experiments, vibrant community sites like OpenWetWare and the Alzheimer Research Forum, and dozens more. And then there are companies like WordPress, Friendfeed, and Wikimedia, that weren’t started with science in mind, but which are increasingly helping scientists communicate their research. This flourishing ecosystem is not too dissimilar from the sudden flourishing of online news services we saw over the period 2000 to 2005.

Let’s look up close at one element of this flourishing ecosystem: the gradual rise of science blogs as a serious medium for research. It’s easy to miss the impact of blogs on research, because most science blogs focus on outreach. But more and more blogs contain high quality research content. Look at Terry Tao’s wonderful series of posts explaining one of the biggest breakthroughs in recent mathematical history, the proof of the Poincare conjecture. Or Tim Gowers’ recent experiment in “massively collaborative mathematics”, using open source principles to successfully attack a significant mathematical problem. Or Richard Lipton’s excellent series of posts exploring his ideas for solving a major problem in computer science, namely, finding a fast algorithm for factoring large numbers. Scientific publishers should be terrified that some of the world’s best scientists, people at or near their research peak, people whose time is at a premium, are spending hundreds of hours each year creating original research content for their blogs, content that in many cases would be difficult or impossible to publish in a conventional journal. What we’re seeing here is a spectacular expansion in the range of the blog medium. By comparison, the journals are standing still.

This flourishing ecosystem of startups is just one sign that scientific publishing is moving from being a production industry to a technology industry. A second sign of this move is that the nature of information is changing. Until the late 20th century, information was a static entity. The natural way for publishers in all media to add value was through production and distribution, and so they employed people skilled in those tasks, and in supporting tasks like sales and marketing. But the cost of distributing information has now dropped almost to zero, and production and content costs have also dropped radically [4]. At the same time, the world’s information is now rapidly being put into a single, active network, where it can wake up and come alive. The result is that the people who add the most value to information are no longer the people who do production and distribution. Instead, it’s the technology people, the programmers.

If you doubt this, look at where the profits are migrating in other media industries. In music, they’re migrating to organizations like Apple. In books, they’re migrating to organizations like Amazon, with the Kindle. In many other areas of media, they’re migrating to Google: Google is becoming the world’s largest media company. … How many scientific publishers are as knowledgeable about technology as Steve Jobs, Sergey Brin, or Larry Page?

… Being wrong is a feature, not a bug, if it helps you evolve a model that works: you start out with an idea that’s just plain wrong, but that contains the seed of a better idea. You improve it, and you’re only somewhat wrong. You improve it again, and you end up the only game in town. Unfortunately, few scientific publishers are attempting to become technology-driven in this way. The only major examples I know of are Nature Publishing Group (with Nature.com) and the Public Library of Science. …

Opportunities

So far this essay has focused on the existing scientific publishers, and it’s been rather pessimistic. But of course that pessimism is just a tiny part of an exciting story about the opportunities we have to develop new ways of structuring and communicating scientific information. These opportunities can still be grasped by scientific publishers who are willing to let go and become technology-driven, even when that threatens to extinguish their old way of doing things. … Here’s a list of services I expect to see developed over the next few years. A few of these ideas are already under development, mostly by startups, but have yet to reach the quality level needed to become ubiquitous. The list could easily be continued ad nauseam – these are just a few of the more obvious things to do.

Personalized paper recommendations: Amazon.com has had this for books since the late 1990s. You go to the site and rate your favourite books. The system identifies people with similar taste, and automatically constructs a list of recommendations for you. This is not difficult to do: Amazon has published an early variant of its algorithm, and there’s an entire ecosystem of work, much of it public, stimulated by the Netflix Prize for movie recommendations. If you look in the original Google PageRank paper, you’ll discover that the paper describes a personalized version of PageRank, which can be used to build a personalized search and recommendation system. …

A great search engine for science: ISI’s Web of Knowledge, Elsevier’s Scopus and Google Scholar are remarkable tools, but there’s still huge scope to extend and improve scientific search engines [5]. With a few exceptions, they don’t do even basic things like automatic spelling correction, good relevancy ranking of papers (preferably personalized), automated translation, or decent alerting services. They certainly don’t do more advanced things, like providing social features, or strong automated tools for data mining. Why not have a public API [6] so people can build their own applications to extract value out of the scientific literature? Imagine using techniques from machine learning to automatically identify underappreciated papers, or to identify emerging areas of study.

High-quality tools for real-time collaboration by scientists: Look at services like the collaborative editor Etherpad, which lets multiple people edit a document, in real time, through the browser. They’re even developing a feature allowing you to play back the editing process. Or the similar service from Google, Google Docs, which also offers shared spreadsheets and presentations. Look at social version control systems like Git and Github. Or visualization tools which let you track different people’s contributions. …

Scientific blogging and wiki platforms: With the exception of Nature Publishing Group, why aren’t the scientific publishers developing high-quality scientific blogging and wiki platforms? … On a related note, publishers could also help preserve some of the important work now being done on scientific blogs and wikis…. The US Library of Congress has taken the initiative in preserving law blogs. Someone needs to step up and do the same for science blogs.

The data web: Where are the services making it as simple and easy for scientists to publish data as it is to publish a journal paper or start a blog? A few scientific publishers are taking steps in this direction. But it’s not enough to just dump data on the web. It needs to be organized and searchable, so people can find and use it. …

Just as Google Wave was announced yesterday, I was thinking of writing about the usefulness of the pictures that accompany results in Twitter Search, giving a good immediate overview of search results. I find this especially valuable in searching for Twitter users, to see how connected they are — It’s easy to see at a glance if most of the tweets listed are by the person being searched. So now Google Wave takes the idea a step further, with pictures of the people in an email thread. Below: Left: Twitter Search.  Right: Google Wave (from yesterday’s Google demo)

Facebook, of course, has similar pictures in its status updates. It’s interesting to follow how the use of pictures has progressed — In Facebook the status update pictures are relatively small. In Twitter, they grow larger, and now, in Google Wave, there are multiple pictures. This increasing reliance on pictures is smart. With the brain’s highly-developed facial recognition skills, we’re able to take in a large amount of information very quickly.

Working on Swine Flu this week has been especially interesting because it makes me reflect on how much things have changed in the information landscape since I worked on SARS in 2003 and Bird Flu in 2004-05. In those outbreaks, the main source of information was lists of links found in Google. How much that has changed now, with Twitter! People use Twitter in different ways — For me the most valuable part of it is the links in tweets. In former outbreaks, when Google was the “king of links,” it was especially hard to keep up with current news stories. Now links to breaking news stories appear within minutes in Twitter.

Evgeny Morozov, in his article, Swine flu: Twitter’s power to misinform, complains about the chaotic nature of Swine Flu information in Twitter:

There are quite a few reasons to be concerned about Twitter’s role in facilitating an unnecessary global panic about swine flu. … [Twitter users] armed with a platform to broadcast their fears are likely to produce only more fear, misinformation and panic. … Twitter seems to have introduced too much noise into the process … The “swine flu” Twitter-scare has … proved the importance of context — The problem with Twitter is that there is very little context you can fit into 140 characters.

Anyone who’s used Twitter knows that there’s much truth here. Especially for a new user, it’s hard to separate the Twitter wheat from the Twitter chaff. But it can be done. To show the shallow, mindless nature of Twitterers, Morozov quotes text from tweets about Swine Flu. And he’s right, they’re pretty valueless. But, clicking to look at the writers’ profile pages shows that most of them are fairly inexperienced, with relatively few updates and followers, so it’s not surprising that their tweets are bad. Which goes to show, just as with online sources in general, in Twitter it’s important to check the source! Find out who’s behind the information.

So, while I agree with Morozov that Twitter has some negatives, I think we need to appreciate the positive value it has added to our ability to exchange information rapidly, which will certainly make us better able to deal with a real pandemic if it occurs. In composing this article, I came across a good conversation in Twitter that speaks to my ideas:

@PhilHarrison: Twitter is relatively new & we’re all learning about its power to inform & misinform as well. (bold added)
@charlesyeo: During SARS, some people in Asia blamed media for not exposing cases earlier so the sick can get help!

Note that the second tweet, by charlesyeo, comes back to the point I made in the first paragraph, that lack of information was a serious problem in the SARS epidemic. Twitter has clearly improved that.

Another valuable aspect of Twitter in the Swine Flu epidemic has been the vibrancy of its international participation. Before Swine Flu, I had learned to value the prolific and multi-lingual tweeting of Jose Afonso Furtado (@jafurtado), a librarian in Portugal who tweets mostly on library/publishing subjects. When the Swine Flu epidemic broke out seriously in Mexico, he tweeted on that, and through his tweets I was able to connect on Twitter with people in Europe and Latin America who were following the situation in Mexico.
