Hawthorne Article Published at DHQ

Cross-posted from my “Hawthorne’s Celestial Railroad: A Publication History” development blog at http://ryan.cordells.us/crr/hawthorne-article-published-at-dhq/

While I apologize for the slow updates here of late, I am pleased to report that my article on the reprinting history of “The Celestial Railroad” has been published in the latest issue of Digital Humanities Quarterly. This is a special issue on “The Literary,” and is well worth perusing in full.

Mr. Penumbra, Distant Reading, and Cheating at Scholarship

My Technologies of Text course is capping this semester by reading Robin Sloan’s novel, Mr. Penumbra’s 24-Hour Bookstore, which Matt Kirschenbaum deemed “the first novel of the digital humanities” last year. Mr. Penumbra is a fine capstone because it thinks through so many of our course themes: the (a)materiality of reading, the book (and database) as physical objects, the relationship between computers and previous generations of information technology, &c. &c. &c. I will try not to spoil much of the book here, but I will of necessity give away some details from the end of the first chapter. So if you’ve not yet read it: go thou and do so.

Rereading the book for class, I was struck by one exchange between the titular Mr. Penumbra—bookstore owner and leader of a group of very close readers—and the narrator, Clay Jannon—a new bookstore employee curious about the odd books the store’s odd club members check out. In an attempt to understand what the club members are up to, Clay scans one of the store’s logbooks, which records the comings and goings of club members, the titles of the books they checked out, and when they borrowed each one. When he visualizes these exchanges over time within a 3D model of the bookstore itself, visual patterns of borrowing emerge, which seem, when compiled, to reveal an image of a man’s face. When Clay shows this visualization to Mr. Penumbra, they have an interesting exchange that ultimately hinges on methodology:

Half-smiling, he holds his glasses at an angle and peers down at the screen. His face goes slack, and then he says, quietly: “The Founder.” He turns to me. “You solved it.” He claps a hand to his forehead and his face splits into a giddy smile. “You solved it already! Look at him! Right there on the screen! [...] “How did you do it?” he continues. He’s so proud, like I’m his grandson and I just hit a home run, or cured cancer. “I must see your notes! Did you use Euler’s method? Or the Brito inversion? There is no shame in that, it clears away much of the confusion early on…”

“Mr. Penumbra,” I say, triumph in my voice, “I scanned an old logbook [...] because Google has this machine, it’s superfast, and Hadoop, it just goes—I mean, a thousand computers, like that!” I snap for emphasis. I don’t think he has any idea what I’m talking about. “Anyway, the point is, we just pulled out the data. Automatically.”

At first Mr. Penumbra is quiet, but then he responds to Clay’s news:

“Oh, yes, I know,” he says sharply, and his eyes flash at me. “I see it now. You cheated—would that be fair to say? And as a result, you have no idea what you have accomplished.”

I look down at the desk. That would be fair to say.

When I look back up at Penumbra, his gaze has softened. “And yet…you did it all the same.” He turns and wanders into the Waybacklist. “How curious.”

“Who is it?” I ask suddenly. “Whose face?”

“It is the Founder,” Penumbra says, running a long hand up along one of the shelves. “The one who waits, hiding. He vexes novices for years. Years! And yet you revealed him in—what? A single month?”

Not quite: “Just one day.”

Penumbra takes a sharp breath. His eyes flash again. They are pulled wide and, reflecting the light from the windows, they crackle electric blue in a way I’ve never seen. He gasps, “Incredible.”

As I read this conversation, I was immediately reminded of so many exchanges I’ve seen at conferences about projects that use computational methods—whether text mining, network graphs, or geospatial visualizations—to expose patterns in literary-historical texts. When I talk about our Viral Texts project, for instance, I typically begin by describing the archival challenge: in brief, there are just so many nineteenth-century periodicals that no scholar can read them all. I then discuss how we’ve leveraged the pattern-finding powers of the computer to begin addressing this problem, automatically—there’s that word from Mr. Penumbra—uncovering more than 40,000 reprinted texts in one large-scale digital newspaper archive and using that data to visualize the spread of texts around the country or the strength of connections among publications.

At the risk of sounding uncharitable here—a risk I hope to address in the following paragraphs, so please stick with me—often the response to this work from scholars in my discipline can sound not unlike Mr. Penumbra’s initial response to Clay’s visualization: “I see it now. You cheated…[a]nd as a result, you have no idea what you have accomplished.” Often such responses come, as one might expect, from scholars who spent significant time researching nineteenth-century reprinting in the archive, reading newspapers one by one and taking careful notes. That two junior scholars—one a few measly years out of graduate school—are claiming to have identified 40,000 reprinted texts in a single year’s work, and without setting foot into an actual newspaper archive, seems a lot like cheating.

If someone actually articulated such a caricatured version of our work—and I am deliberately overstating things here to cast a more subtle problem into sharper relief—I could quibble with the details of that caricature. I brought to the project a disciplinary understanding of reprinting that shaped the early development of the duplicate-detection algorithm. We used known sets of widely-reprinted texts—typically drawn from the incredible work of book historians, bibliographers, and literary critics—to ensure we were capturing reprints we would expect to find, as well as new reprintings. We continually tweak the algorithm based on what it fails to find. We’re still not great, for instance, at identifying reprinted lyric poems, because such texts simply don’t include enough 5-grams (sequences of 5 words) to be identified using our current algorithm. Working through such problems and perfecting our computational methods requires that we draw on literary-historical knowledge and literary-historical methods. Finally, I have spent a good deal of time in physical archives, actually reading newspapers.
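To make the 5-gram point concrete, here is a minimal sketch of shingle overlap as a signal of reprinting. It illustrates the general technique only, not the project’s actual algorithm, which is considerably more sophisticated; the texts and any threshold are invented for illustration.

```python
# A minimal sketch of 5-gram ("shingle") overlap for flagging possible
# reprints. Illustrative only: not the Viral Texts algorithm itself.
import re

def five_grams(text):
    """Return the set of consecutive 5-word sequences in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + 5]) for i in range(len(words) - 4)}

def shared_five_grams(a, b):
    """Count the 5-grams two texts share; pairs clearing some threshold
    become candidate reprints for human review."""
    return len(five_grams(a) & five_grams(b))

# A short lyric poem yields very few 5-grams in the first place, so even
# a genuine reprint may never clear the threshold: the problem above.
```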

But Mr. Penumbra’s comments do get at a central methodological difference that I think is worth attending to more closely. Because Mr. Penumbra is right: perhaps not that Clay cheated, despite Clay’s own concession to this charge, but that Clay’s methodology for finding the Founder did not help him understand what he has accomplished. The pattern Clay uncovers in his visualization is “actually” embedded in codes, which are contained in the books the club members check out. The club members read the books—or perhaps more accurately, they study the books, which are not written to be read as narrative—decipher one part of the code, and then move on to the next book. Illuminating the entire pattern takes years of study, but along the way the club members are also initiated into the Unbroken Spine, which is the name of this monkish order of bibliophiles and code breakers. To become full members of the Unbroken Spine, these readers must understand the codes, which is to understand the books, which is to understand the Unbroken Spine’s history and purpose, and so forth. By contrast, Clay does not read the books or crack the code within them. Instead he works with the Unbroken Spine’s metadata, “not reading” the books but tracking the readers of those books. He comes to the correct conclusion, a fact Mr. Penumbra acknowledges with his “you did it all the same,” by piggybacking on the Unbroken Spine members’ years of difficult labor. And even after he has found the answer in his visualization, Clay does not understand the pieces that constitute that answer. He has looked into a few of the books, and knows they contain a code, but he couldn’t crack the code in even one of them if asked.

Of course, I read the Unbroken Spine throughout Mr. Penumbra as an unsubtle metaphor for humanities scholars, reading closely over many years in search of greater understanding: of a historical period, of a genre, of a social movement, &c. And this leads me to two ideas about computational work in literary history that this exchange in Mr. Penumbra suggests. First, I often comment that one of my favorite things about digital humanities projects is the way they “make good” on decades—or even better, centuries—of fastidious record-keeping, particularly in libraries. I get excited when a scholar figures out a way to visualize corpora using an obscure metadata field recorded by generations of librarians and largely ignored until: wow! I’m thinking here of my colleague Benjamin Schmidt’s work visualizing American shipping from a data set compiled and translated through a century of new storage technologies and largely used in environmental research. These eureka moments excite me, but I can understand a more cynical reading, as the work of centuries and generations is distilled into a one-minute video.

Perhaps more to Mr. Penumbra’s point, however, computational methods can reverse the usual order of scholarly investigation in the humanities. Had I gone into the physical archive to study reprinting, I would have read these reprinted texts as I identified them, along with a host of texts which were not reprinted. The act of discovery would have been simultaneously an act of understanding. I would spend years in archives reading, and would emerge ready to build an argument about both the form and content of nineteenth-century reprinting practices.

Computational approaches are often more exploratory, or perhaps screwmeneutical, at least at the beginning of projects. We begin with a big question—can we identify duplicated text within this huge corpus of unstructured text data?—and we try one approach, then another, in an attempt to answer that question. We tweak this parameter and that to see what emerges. When something interesting happens we follow that line for a while. And new questions suggest themselves as we work.

But in our case, all that exploratory work preceded the bulk of the reading the project has required and will require. Of course we were checking our results along the way, reading this or that cluster of reprinted text to see if the method was working, but it wasn’t until we’d isolated a substantial corpus of reprinted texts that the reading began in earnest. Now that we have identified 40,000+ reprints from before the Civil War, I’m spending significant time with those reprints, thinking about the genres that seem to be most widely reprinted, the ways these texts reflect (or don’t) our ideas about literary production and popular culture in the antebellum period, and studying the ways individual texts changed as they were reprinted across the country. The project’s research assistants are now annotating the text clusters: giving them titles; identifying authors; and assigning tags based on the topics, genres, and ideas reflected in each piece.

In many ways, then, our methodology decoupled the act of discovery from the act of understanding. We quite quickly uncovered patterns of reprinting in our corpus, and now that the algorithm works well we can even more quickly apply it to new corpora, as we are hoping to do in the near future. And we have been able to create some exciting and suggestive visualizations from those findings, visualizing reprints as signals of influence between publications, for instance, in a network graph. But really making sense of these findings will be the work of years, not days.
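To give a concrete sense of the network-graph idea, here is a small sketch using networkx. The newspaper names and shared-reprint counts are invented stand-ins, not the project’s data; it simply shows reprints treated as weighted edges between publications.

```python
# A sketch of reprints as signals of influence between publications:
# newspapers are nodes, and an edge's weight counts the texts both
# papers printed. All names and counts here are invented.
import networkx as nx

shared_reprints = {
    ("Boston Courier", "New-York Tribune"): 42,
    ("New-York Tribune", "Cincinnati Gazette"): 17,
    ("Boston Courier", "Cincinnati Gazette"): 9,
}

G = nx.Graph()
for (paper_a, paper_b), count in shared_reprints.items():
    G.add_edge(paper_a, paper_b, weight=count)

# Weighted degree as a crude proxy for a publication's centrality.
for paper, score in sorted(G.degree(weight="weight"),
                           key=lambda kv: -kv[1]):
    print(f"{paper}: {score}")
```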

Ultimately, I think Mr. Penumbra’s comments get at a central challenge for computational work in the humanities: both for those evaluating computational work from the outside and for those doing computational work. It seems clear to me how “distant reading” methods could seem like “cheating,” using a machine to bypass some of the work and time typically required to analyze large swaths of historical or literary material: “I mean, a thousand computers, like that!” But of course, if the question at the heart of the analysis is good, and the methods uncover real and substantive results, they shouldn’t be dismissed on what are essentially moral grounds: that the researchers didn’t work hard enough. At the same time, those undertaking such projects should recognize when their methods do lead to gaps in understanding because they invert the typical order of humanities scholarship. In Clay’s case, it is only after he creates his visualization of the Unbroken Spine’s Founder—in other words, only after he solves the bigger puzzle—that he begins to understand the details of the group and its mission, and eventually to contribute to that mission. Perhaps this is a model for some DH projects, which tell a truth, but tell it slant. In my case, I am striving to be more transparent about both what we have learned and what we are still learning in the Viral Texts project. And even if the computational work stopped tomorrow, we would have far more to learn than we have yet learned. Understanding is always a little bit out of reach, whether or not you work with a computer.

Omeka/Neatline Workshop Agenda and Links

We’ll be working with the NULab’s Omeka Test Site for this workshop. You should have received login instructions before the workshop. If not, let us know so we can add you.

Workshop Agenda

9:00-9:15 Coffee, breakfast, introductions
9:15-9:45 Omeka project considerations
9:45-10:30 The basics of adding items, collections, and exhibits
10:30-10:45 Break!
10:45-11:15 Group practice adding items, collections, and exhibits
11:15-12:00 Questions, concerns
12:00-1:30 LUNCH!
1:30-2:15 Georectifying historical maps with WorldMap Warp
2:15-3:00 The basics of Neatline
3:00-3:15 Break!
3:15-3:45 Group practice creating Neatline exhibits
3:45-4:00 Final questions, concerns
4:00-5:00 Unstructured work time

Sample Item Resources

Historical Map Resources

Omeka Tutorial

Neatline Tutorials

Model Neatline Exhibits

7 Reasons 19th-Century Newspapers Were Actually the Original Buzzfeed

In March 2013 I had the opportunity to talk about the Viral Texts project for the “Breakfasts at Buzzfeed” speaker series. I gave my talk a gimmicky title worthy of the venue, which I was assured they appreciated rather than resented. It was a lively crowd of employees from around the company, and they asked some insightful questions during the Q&A. Here’s the video. I only wander off frame a few times!

Representing the “Known Unknowns” in Humanities Visualizations

Note: If this topic interests you, you should read Lauren Klein’s recent article in American Literature, “The Image of Absence: Archival Silence, Data Visualization, and James Hemings,” which does far more justice to the topic than I do in my scant paragraphs here.

Pretty much every time I present the Viral Texts Project, the following exchange plays out. During my talk I will have said something like, “Using these methods we have uncovered more than 40,000 reprinted texts from the Library of Congress’ Chronicling America collection, many hundreds of which were widely reprinted—and most of which have not been discussed by scholars.” During the Q&A following the talk, a scholar will inevitably ask, “You realize you’re missing lots of newspapers (and/or lots of the texts that were reprinted), right?”

To which my first instinct is exasperation. Of course we’re missing lots of newspapers. The majority of C19 newspapers aren’t preserved anywhere, and the majority of archived newspapers aren’t digitized. But the ability to identify patterns across large sets of newspapers is, frankly, transformative. The newspapers that have been digitized under the Chronicling America banner are actually the product of many state-level digitization efforts, which means we’re able to study patterns across collections housed in many separate physical archives, providing a level of textual address that would be not impossible, but very difficult, to achieve in the physical archive. So my flip answer—which I never quite give—is “yes, we’re missing a lot. But 40,000 new texts is pretty great.”

But those questions do nag at me. In particular I’ve been thinking about how we might represent the “known unknowns” of our work,[1] particularly in visualizations. I really started picking at this problem after discussing the Viral Texts work with a group of librarians. I was showing them a map that transposes a network graph of our data onto a base map merging census data from 1840 with the Newberry Library’s Atlas of Historical County Boundaries. One of the librarians was from New Hampshire, and she told me she was initially dismayed that there were no influential newspapers from New Hampshire, until she realized that our data doesn’t include any newspapers from New Hampshire, because that state has not yet contributed to Chronicling America. She suggested our maps would be vastly improved if we somehow indicated such gaps visually, rather than simply talking about them.

In the weeks since then, I’ve been experimenting with how to visualize those absences without overwhelming a map with symbology. The simplest solution, as is almost always the case, appears to be the best.

In this map I’ve visualized the 50 reprintings we have identified of one text, a religious reflection by Louisville editor George D. Prentice, often titled “Eloquent Extract,” between 1836 and 1860. The county boundaries are historical, drawn from the Newberry Atlas, but I’ve overlaid modern state boundaries with shading to indicate whether we have significant, scant, or no open-access historical newspaper data from those states. This is still a blunt instrument. Entire states are shaded, even when our coverage is geographically concentrated. For New York, for instance, we have data from a few NYC newspapers and magazines, but nothing yet from the north or west of the state.
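For anyone curious how such a layered map might be drawn, here is a hedged sketch using geopandas. Every file name, state abbreviation, and coverage label below is a hypothetical stand-in, not the project’s actual data; the point is only the layering: historical counties underneath, coverage shading by modern state, reprint locations on top.

```python
# A sketch of the layering described above. All file names and the
# coverage table are hypothetical stand-ins, not Viral Texts data.
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

counties = gpd.read_file("newberry_counties.shp")  # historical boundaries
states = gpd.read_file("modern_states.shp")        # modern state outlines
reprints = pd.read_csv("eloquent_extract.csv")     # one lon/lat row per reprint

# Blunt, state-level coverage categories, as on the map described above.
coverage = {"VA": "significant", "NY": "scant", "NH": "none"}  # illustrative
states["coverage"] = states["abbr"].map(coverage).fillna("none")

ax = counties.plot(color="white", edgecolor="lightgray", figsize=(10, 7))
states.plot(ax=ax, column="coverage", categorical=True, legend=True,
            cmap="Blues", alpha=0.4, edgecolor="black")

points = gpd.GeoDataFrame(
    reprints, geometry=gpd.points_from_xy(reprints.lon, reprints.lat))
points.plot(ax=ax, color="crimson", markersize=25)

ax.set_axis_off()
plt.savefig("coverage_map.png", dpi=200)
```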

Nevertheless, I’m happy with these maps as a way to begin thinking through how I can represent the absences of the digital archives from which our project draws. And indeed, I’ve begun thinking about how such maps might help us agitate—in admittedly small ways—for increased digitization and data-level access for humanities projects.

This map, for instance, visualizes the 130 reprints of that same “Eloquent Extract” which we were able to identify by searching across Chronicling America and a range of commercial periodicals archives (and huge thanks to project RA Peter Roby for keyword searching many archives in search of such examples). For me this map is both exciting and dispiriting, pointing to what could be possible for large-scale text mining projects while simultaneously emphasizing just how much we are missing when forced to work only with openly-available data. If we had access to a larger digitized cultural record we could do so much more. A part of me hopes that if scholars, librarians, and others see such maps they will advocate for increased access to historical materials in open collections. As I said in my talk at the recent C19 conference:

While the dream of archival completeness will always and forever elude us—and please do not mistake the digital for “the complete,” which it never has been and never will be—this map is to my mind nonetheless sad. Whether you consider yourself a “digital humanist” or not, and whether you ever plan to leverage the computational potential of historical databases, I would argue that the contours and content of our online archive should be important to you. Scholars self-consciously working in “digital humanities” and also those working in literature, history, and related fields should make themselves heard in conversations about what will become our digital, scholarly commons. The worst possible thing today would be for us to believe this problem is solved or beyond our influence.

In the meantime, though, we’re starting conversations with commercial archive providers to see if they would be willing to let us use their raw text data. I hope maps like this can help us demonstrate the value of such access, but we shall see how those conversations unfold.

I will continue thinking about how to better represent absence as the geospatial aspects of our project develop in the coming months. Indeed, the same questions arise in our network visualizations. Working with historical data means that we have far more missing nodes than do many network scientists working, for instance, with modern social media data. Finding a way to represent missingness—the “known unknowns” of our work—seems like an essential humanities contribution to geospatial and network methodologies.
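As one toy illustration of what representing missingness might look like in a network graph: the sketch below draws papers we know existed but lack data for as hollow nodes, rather than silently omitting them. The newspapers and weights are invented.

```python
# A sketch of "known unknowns" in a network visualization: a newspaper we
# know existed but have no digitized data for appears as a hollow node
# instead of disappearing entirely. All data here is invented.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edge("Boston Courier", "New-York Tribune", weight=42)
G.add_node("New Hampshire Patriot", missing=True)  # known paper, no data

pos = nx.spring_layout(G, seed=1)
present = [n for n, d in G.nodes(data=True) if not d.get("missing")]
missing = [n for n, d in G.nodes(data=True) if d.get("missing")]

nx.draw_networkx_nodes(G, pos, nodelist=present, node_color="steelblue")
nx.draw_networkx_nodes(G, pos, nodelist=missing, node_color="white",
                       edgecolors="gray", linewidths=1.5)
nx.draw_networkx_edges(G, pos)
nx.draw_networkx_labels(G, pos, font_size=8)
plt.axis("off")
plt.show()
```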

[1] Yes, I’m borrowing a term from Donald Rumsfeld here; it seems a useful term for thinking about archival gaps, while perhaps not such a useful one for thinking about starting a war. We can blame this on me watching an interview with Errol Morris about The Unknown Known on The Daily Show last night.