The list below gives 50 consecutive random Wikipedia articles, obtained with the Random Article link that appears on every article. As suggested in a recent study by Kittur, Chi & Suh (discussed below), I’ve divided these random articles into the top-level Wikipedia categories. More interesting than these categories are other broad subjects (as picked out by me) in the articles below: Sports (7 articles), Pop Music (6), Europe (5), Politics (4), India (3). These subjects, I think, give a good flavor of the sorts of articles in Wikipedia.

Beyond the categories and sub-categories, though, the most striking thing about this random sample of Wikipedia articles is the narrow, limited nature of the articles. Almost all of them are about things that no one has heard of! It’s a great example of the Long Tail effect, only in this case it seems to be almost all Tail and very little Head. Obviously, there are thousands of Wikipedia articles on well-known subjects, which we read every day. But in terms of numbers, the articles on minor, unheard-of subjects vastly outnumber the popular ones.

[There’s more commentary following the list]

Culture

People

  • Brigadier General Anthony Stack
    Currently a Brigadier General in service of the Canadian Forces, 1 screen
  • Roy Orchard Woodruff
    Politician, soldier, printer and dentist from Michigan (1876 – 1953), 1 screen
  • Ed Bryant
    Former Republican member of the US House of Representatives from Tennessee, 1948- , 3 screens
  • Missy Higgins
    Australian singer-songwriter, 1983 – , 7 screens
  • Răzvan Sabău
    Romanian tennis player, 1977- , 1 screen
  • Mirza Rizvanović
    Bosnian football defender, 1 screen, Stub
  • Steve Byrne
    American stand-up comedian, 1974- , 1 screen, Stub
  • Joan Hambidge
    Afrikaans poet, literary theorist and academic, 1956- , 2 screens
  • Agim Kaba
    American-Albanian actor, writer, director, sound editor, dancer, and film producer, 1980- , 1 screen
  • Verda Welcome
    African-American teacher, civil rights leader, and Maryland state senator, 1907 – 1990, 2 screens
  • Harolyn Blackwell
    African-American lyric coloratura soprano, 1955- , 7 screens
  • Ron Sobieszczyk
    Retired American professional basketball player, 1 screen
  • John Cumberland
    Former Major League Baseball player and coach, 1947- , 1 screen, Stub

Geography

  • Cwmcarn Forest Drive
    Tourist attraction and scenic route in Cwmcarn, Crosskeys, Wales, 1 screen, Stub
  • Vila Chã
    Portuguese parish with 2,957 inhabitants and a total area of 5.49 km², 1 screen, Stub
  • Chojnowo
    Village in Krosno Odrzańskie County, Lubusz Voivodeship, western Poland, 1 screen, Stub
  • Interstate 17
    5 screens
  • Withee (Town) Wisconsin
    Town in Clark County in the US state of Wisconsin, with population of 885 at the 2000 census, 1 screen
  • Edson, Wisconsin
    Town in Chippewa County in the US state of Wisconsin, with population of 966 at the 2000 census, 1 screen

Society

History

Science

  • Swannia
    Genus of moth in the family Geometridae, 1 screen, Stub
  • Proflazepam
    Drug which is a benzodiazepine derivative, 1 screen, Stub

Technology

Math

Disambiguation


I’m assuming that the Random Article link used to derive these links is truly random, that it does give a good sample of all Wikipedia articles. Surprisingly, I have not been able to find a Wikipedia article on “Wikipedia Random Article,” or any other commentary on it that might give an idea about this. I also have found no indication that anyone else has attempted to make a list of random Wikipedia articles, as presented here. Please let me know in a Comment if I’m missing something!

The purpose of the Kittur, Chi & Suh paper (PDF) mentioned above was to map each Wikipedia article to one of the top-level Wikipedia categories. The articles in the list above fit their results fairly well for most of the categories. Here are their results and mine (in parentheses):

Culture: 30% (28%)
People: 15% (26%)
Geography: 14% (12%)
Society: 12% (8%)
History: 11% (10%)
Science: 9% (4%)
Technology: 4% (6%)
Religion: 2% (0%)
Health: 2% (0%)
Math: 1% (2%)
Philosophy: 1% (0%)
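
For anyone who wants to redo this comparison, here is a minimal Python sketch. The KCS percentages are the ones quoted above; the counts for my sample are back-computed from the percentages in parentheses (each article in a sample of 50 is worth 2%). The names `to_percent` and `compare` are just illustrative helpers, not anything from the KCS paper.

```python
SAMPLE_SIZE = 50

# Percentages reported by Kittur, Chi & Suh, as quoted in the list above.
KCS_PERCENT = {
    "Culture": 30, "People": 15, "Geography": 14, "Society": 12,
    "History": 11, "Science": 9, "Technology": 4, "Religion": 2,
    "Health": 2, "Math": 1, "Philosophy": 1,
}

# Article counts in the 50-article sample, back-computed from the
# percentages in parentheses above (count = percent / 2).
MY_COUNTS = {
    "Culture": 14, "People": 13, "Geography": 6, "Society": 4,
    "History": 5, "Science": 2, "Technology": 3, "Religion": 0,
    "Health": 0, "Math": 1, "Philosophy": 0,
}


def to_percent(count, total=SAMPLE_SIZE):
    """Convert an article count into a percentage of the sample."""
    return 100 * count / total


def compare(counts, reference):
    """Return (category, KCS %, sample %, difference) rows, largest gap first."""
    rows = []
    for category, ref in reference.items():
        mine = to_percent(counts[category])
        rows.append((category, ref, mine, mine - ref))
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)


if __name__ == "__main__":
    for category, ref, mine, diff in compare(MY_COUNTS, KCS_PERCENT):
        print(f"{category:<12} KCS {ref:>3}%  sample {mine:>4.0f}%  diff {diff:+.0f}")
```

The largest gap is in People (26% in my sample against 15% for KCS). The eleven counts add up to 48, so two of the 50 articles fall outside these categories, presumably under the Disambiguation heading in the list.
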

My assignment of categories is imprecise at best, so it’s not surprising that there’s not complete agreement between my findings and those of KCS. It’s also possible that the real division of categories has changed since KCS collected data for their study, in January 2008. Finally, one more bit on KCS: their paper has the same base title as this article (What’s in Wikipedia?). I actually thought of this title before I found their paper. In fact, I found their paper because I searched for the title after I thought of it for this article! So I don’t feel like I’m stealing their title 😉

Acknowledgement to my son David: The writing of this article is a long tale in itself! It arose from a (rather hare-brained, I now see) question I pondered: whether there’s a way to generate a “random Web page” from anywhere (the answer, I think, is no, but that’s a separate discussion). As I discussed this idea with David, he mentioned the Random Article link on Wikipedia articles. I had actually never noticed this before, and found it quite interesting, which led to this article. Also, David confirmed from his younger perspective that the links in the sample above are indeed obscure!

Note: These random links were generated over three separate days in the last week.

Eric Rumsey is on Twitter @ericrumseytemp

The slides and data from Jon Orwant’s presentation on Google Book Search at TOC, which were not available when I wrote previously, have now been put up on the O’Reilly site. [these have been removed, see comment below] The presentation is made up of 59 PDF slides, covering a range of recent developments with Google Books, including the recent release of GBS mobile, and a discussion of the Oct 2008 Publisher settlement. The part I’m most interested in is the data on GBS usage that had been mentioned by Orwant in various venues before, but with few details. The details in the TOC presentation are mostly in three “Case studies” of publishers that participate in the GBS Partner Plan: McGraw-Hill, Oxford University Press, and Springer. I’ve chosen one slide for each of these publishers that shows various long-tail effects for usage of their books that are in GBS, and one slide that has data for a more extensive grouping from GBS.

The McGraw-Hill case study is presented in slides 21-23. Below is slide 24. Note that this is a small sample of only the top 30 titles.

Oxford University Press – Slides 26-31. Below is slide 27. Note the long tail of visits for pre-1990 books.

Springer – Slides 32-36. Below is slide 35, showing clicks for Buy this Book. Note again the very long tail of clicks for pre-1995 books.

Slide 37 below shows “Share of books with more than 10 pages viewed”, apparently for all books in GBS. The coloring of the data lines looks ambiguous to me. The lowest line is undoubtedly for Snippet View books. It looks like the top line is for Limited Preview books, which presumably get more use than Full View books (apparently the middle line) because Limited Preview books are more current.


Please comment here or on Twitter @ericrumseytemp

In a recent NY Times article that I blogged on, Dan Clancy, the engineering director for Google Book Search, is cited as saying “every month users view at least 10 pages of more than half of the one million out-of-copyright books that Google has scanned into its servers.” Remarkably, this classic long tail description of Google Books seems not to have been noticed by anyone. I’ve searched in Google (web and blogs) for various word combinations in the quote combined with “Dan Clancy,” and have found nothing at all except the original NYT article.

The long tail idea, which was first described by Chris Anderson in 2004, is that when a very large number of users are given a very large number of items to choose from, especially in an online environment with virtually unlimited “shelf space” and easy access, a very wide variety of items will be chosen. Anderson proposed the idea especially to describe commercial sites such as Amazon and Netflix, but it has also been seen as a good fit for libraries, and especially online library/book sources, such as Google Books.

So yes, there has been discussion of Google Books and the long tail. For the most part, though, this has been on a conceptual, non-numeric level. The statement by Clancy is valuable because it’s the first time there have been actual numbers provided by Google sources to back up the conceptual ideas. And striking numbers they are: every month, more than half of the out-of-copyright books (i.e. old books) in Google Books are getting significant use. The long tail will certainly be even longer when newer books are made available after the October 2008 settlement goes into effect.
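
To make the kind of number Clancy cites concrete, here is a minimal Python sketch of two simple long tail measures: the share of books with at least 10 pages viewed in a month, and the share of all page views that go to books outside the most popular few. The function names and the sample figures are my own illustrations; they are not Google data and not how Google computes its statistics.

```python
def share_with_at_least(page_views, threshold=10):
    """Fraction of books whose monthly page views reach the threshold."""
    if not page_views:
        return 0.0
    return sum(1 for views in page_views if views >= threshold) / len(page_views)


def tail_share(page_views, head_size):
    """Fraction of all page views that go to books outside the top head_size."""
    total = sum(page_views)
    if total == 0:
        return 0.0
    ranked = sorted(page_views, reverse=True)
    return sum(ranked[head_size:]) / total


if __name__ == "__main__":
    # Made-up monthly page views for eight books; not Google data.
    views = [400, 120, 40, 15, 12, 10, 6, 0]
    print(share_with_at_least(views))      # books with 10 or more pages viewed
    print(tail_share(views, head_size=2))  # views going to all but the top two
```

A collection with a long tail in this sense would show a high value for the first measure even though most individual books get only modest use.
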

The best numeric data that I’ve found on Google Books and the long tail is given in an article by Tim O’Reilly in 2006, which compares sales of O’Reilly Media book titles, as reported by Nielsen Bookscan, with page views from Google Books. As the graph (at left) from that article shows, the Google Books page views (in red) have a very long, almost flat, tail, in contrast with the relatively short tail for actual sales of book titles (in blue). Incidentally, the graph shown here has a bad link in the O’Reilly article, so all that displays is the file name; I did some digging on the O’Reilly site to find it here. (Feb 11: The bad links for this image and others in the O’Reilly article have been fixed, after I noted them in a comment.)

The closest thing I have found to other long tail numeric data relating to online books is reported in a 2006 article by Jason Epstein:

According to Mark Sandler of the University of Michigan Library, in an essay in Libraries and Google, an experiment by the library involving the digitization of 10,000 “low use” monographs offered on the Web produced “between 500,000 and one million hits per month.”

I suspect the realization of the “power of the long tail” shown in this experiment contributed to the University of Michigan opting to be one of the original library partners in the Google Books project.