In corpus analysis, scholars have tended to focus on stylistic or linguistic patterns in an author’s work. Punctuation is often excluded from these conversations, and it is not clear why. Periods, commas, hyphens, and other marks are meaningful units of expression, and they can serve as a kind of signature by which to identify the finer points of an author’s style. They can also tell us important things about the social, cultural, and historical conditions under and through which a text was produced.
All too often, though, these elements of language are the first things to go when digital software is used to analyze big data. The tutorials that I have found often treat punctuation as entirely expendable. The same is true of stopwords, those elements of language which most programs seem to categorize as “nonessential.” Here, I am thinking of the ways in which word clouds are generated according to a “weighted” vocabulary, in which the program deems some words more important than others. The same appears true of most sentiment analyses, which pull data from a lexicon of “significant” words with clear positive or negative connotations.
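To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is not taken from any particular tutorial: the function names `strip_punctuation` and `keep_punctuation` and the sample sentence are my own illustrations, and the sample is an invented sentence, not a line of Whitman's.

```python
import re
import string

# Invented sample sentence (an assumption for illustration, not a Whitman quotation).
SAMPLE = "Stay awhile, reader; the road is long, and the day is short."


def strip_punctuation(text):
    """Tutorial-style cleanup: lowercase the text and delete ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def keep_punctuation(text):
    """Alternative: keep each punctuation mark as a token of its own."""
    # \w+(?:'\w+)* matches a word, including internal apostrophes;
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


print(strip_punctuation(SAMPLE))
print(keep_punctuation(SAMPLE))
```

The first function returns only words, so the commas, semicolon, and period vanish from the data before any analysis begins. The second keeps them as tokens that can be counted like any other unit. Note that `string.punctuation` covers ASCII marks only, so a cleanup of this kind would not even register an em dash.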
For these last few weeks of the summer fellowship, I have been exploring those marginal aspects of language which are often either forgotten or intentionally excluded from most datasets. My sense is that a computational analysis of Whitman’s punctuation offers important insights into the ways in which different media formats influenced his work. The comma, for example, serves many functions: it organizes the parts of a list, joins different ideas, and can even act as a surrogate for other words.
As revealed in the graph, the comma is among Whitman’s most-used punctuation marks. By itself this is perhaps not a revelatory statement; I’d bet that the comma is the most-used punctuation mark in the English language. How exactly Whitman employs it in his poetry, prose, and correspondence, however, is worth investigating further. The same goes for other punctuation marks.
In Whitman’s correspondence, an increase in em dash usage is particularly noteworthy. As a research assistant for the Walt Whitman Archive, I spend much of my time transcribing and encoding these messages. What I have found is that postal cards in the nineteenth century, even more so than traditional “letters,” contain a tremendous number of em dashes. A number of reasons could explain this, but the most compelling to me is the smaller physical size of these messages. The materiality of the message is just as important as the content these writers are attempting to communicate. In Whitman’s prose, too, we begin to see an increase in em dash usage. It would be interesting to see whether the emergence of the postal card in the mid-to-late nineteenth century had any significant impact on Whitman’s postbellum writing. As I continue this project, I will try to incorporate a temporal dimension that could help track such developments.
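A comparison of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the graph: the helper names `punctuation_counts` and `rate_per_thousand_words` are hypothetical, and the two snippets are invented placeholders rather than transcriptions from the Walt Whitman Archive.

```python
from collections import Counter
import unicodedata

EM_DASH = "\u2014"


def punctuation_counts(text):
    """Count every character whose Unicode category is a punctuation class (P*)."""
    return Counter(ch for ch in text if unicodedata.category(ch).startswith("P"))


def rate_per_thousand_words(text, mark):
    """Occurrences of `mark` per 1,000 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return 1000 * text.count(mark) / len(words)


# Invented placeholder snippets (assumptions), keyed by document type.
samples = {
    "postal card": "Arrived safe\u2014weather fine\u2014will write again soon.",
    "letter": "I arrived safely, and the weather has been fine; I will write again soon.",
}

for kind, text in samples.items():
    counts = punctuation_counts(text)
    rate = rate_per_thousand_words(text, EM_DASH)
    print(kind, dict(counts), round(rate, 1))
```

Normalizing by word count matters here because postal cards are much shorter than letters, so raw totals would understate how densely they use the dash. Two caveats: apostrophes fall under a punctuation category and will be counted too, and a transcription that encodes dashes as `--` or as an en dash would need to be normalized first. Grouping the same counts by year of composition would be one way to add the temporal dimension.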