On Ignoring Encoding

Lately we’ve seen a spate of articles castigating the digital humanities—perhaps most prominently, Adam Kirsch’s piece in New Republic, “Technology Is Taking Over English Departments: The False Promise of the Digital Humanities.” I don’t plan in this post to take on the genre or refute the criticisms of these pieces one by one; Ted Underwood and Glen Worthey have already made better global points than I could muster. My biggest complaint about the Kirsch piece—and the larger genre it exemplifies—would echo what many others have said: these pieces purport to critique a wide field in which their authors seem to have done very little reading. Also, as Roopika Risam notes, many of these pieces conflate “digital humanities” with the DH that happens in literary studies, leaving digital history, archeology, classics, art history, religious studies, and the many other fields that contribute to DH out of the narrative. In this way these critiques echo conversations happening within the DH community about its diverse genealogies, such as Tom Scheinfeldt’s The Dividends of Difference, Adeline Koh’s Niceness, Building, and Opening the Genealogy of the Digital Humanities, or Fiona M. Barnett’s “The Brave Side of Digital Humanities.”

Even taken as critiques of only digital literary studies, however, pieces such as Kirsch’s problematically conflate “big data” or “distant reading” with “the digital humanities,” seeing large-scale or corpus-level analysis as the primary activity of the field rather than one activity of the field, and explicitly excluding DH’s traditions of encoding, archive building, and digital publication. I have worked and continue to work in both these DH traditions, and have been struck by how reliably one is recognized—to be denounced—while the other is ignored or disregarded. The formula for denouncing DH seems at this point well established, though the precise order of its elements sometimes shifts from piece to piece:

  1. Juxtapose Aiden and Michel’s “culturomics” claims with the stark limitations of the Ngram Viewer.
  2. Cite Stephen Ramsay’s “Who’s in and Who’s Out,” specifically the line “Do you have to know how to code? I’m a tenured professor of digital humanities and I say ‘yes.’” Bemoan the implications of this statement.
  3. Discuss Franco Moretti on “distant reading.” Admit that Moretti is the most compelling of the DH writers, but remain dissatisfied with the prospects for distant reading.

These critiques are worth airing, though they’re not particularly surprising—if only because the DH community has been debating these ideas in books, blog posts, and journal articles for a long while now. Matt Jockers’s Macroanalysis alone could serve as a useful introduction to the contours of this debate within the field.

More problematically, however, by focusing on Ramsay and Moretti, these pieces ignore the field-constitutive work of scholars such as Julia Flanders, Bethany Nowviskie, and Susan Schreibman. This vision of DH is all Graphs, Maps, Trees and no Women Writers Project. All coding and no encoding.

When Kirsch gestures towards encoding, the gesture simply dismisses its importance or pertinence to the larger discussion of digital humanities. He claims, for instance,

Within this range of approaches, we can distinguish a minimalist and a maximalist understanding of digital humanities. On the one hand, it can be simply the application of computer technology to traditional scholarly functions, such as the editing of texts. An exemplary project of this kind is the Rossetti Archive created by Jerome McGann, an online repository of texts and images related to the career of Dante Gabriel Rossetti: this is essentially an open-ended, universally accessible scholarly edition. (my italics)

For Kirsch, digital humanities equals big data, so digital humanities work that’s not about big data isn’t digital humanities, but “simply” textual scholarship masking itself as digital humanities. In a few lines, Kirsch invokes and trivializes—through words such as “simply” and “essentially”—what is arguably the longest-standing and most influential thread of digital humanities’ history in literary studies: the preservation, annotation, and representation of historical-literary works for the new medium of our time. Under the banner of “encoding,” I mean to write not only of TEI markup, but of a wider range of practices that have focused on digital preservation and publication. Alongside the TEI, then, we might think of Neatline, which Bethany Nowviskie argues “was carefully designed by humanities scholars and DH practitioners to emphasize what we found most humanistic about interpretive scholarship, and most compelling about small data in a big data world.” Even more recently, we might think of Andrew Stauffer’s Book Traces project, which aims to crowdsource the identification of unique physical books “in danger of being discarded as libraries go digital,” a project that would seem at odds with a purely techno-solutionist version of DH. And while I speak primarily from the archiving and encoding tradition in digital literary studies, I suspect archive building has occupied a similar primary space in the genealogy of digital history. I don’t have the numbers for this, but I strongly suspect that far more hours of labor and even, yes, far more financial support have gone into encoding and archival projects than into data analysis over the past decades of DH history. Certainly many, many, many DH “origin stories” begin, “I got a job as a graduate student doing encoding for project X, Y, or Z.”

Perhaps more importantly, however, this evolving, amorphous, decades-long work has significantly reshaped the horizons of literary-historical research for countless colleagues and students, both within and without DH. As Amy Earhart and others have shown, this collective project has not always opened the canon in the ways we might hope, and those engaged in this work must do more to make digital publication a space for the recovery of lost or underrepresented voices. We remain far from Jerome McGann’s oft-rearticulated vision “that the whole of our cultural inheritance has to be recurated and reedited in digital forms.” And as Lisa Spiro and Jane Segal show, the true impact of this digital archival work is often elided when our colleagues use digital archives for teaching and research—and use them they do—but cite those materials as if they visited the physical archive. Nevertheless, the very idea of archival access means something different today than it did a few short decades ago, and the work that produced this new reality is a primary foundation of the digital humanities. Moreover, decades of conversations and collaborations around archives and encoding led to the development of standards that resonate far beyond research universities. This work has not been “simply” the application of computer technology to traditional scholarly functions; something like the TEI is one of the best examples of humanistic scholarship applied to computer technology. If you believe that encoding is simply the mechanical application of tags to documents, I encourage you to attend a WWP seminar or workshop, where you will be swiftly disabused of that notion. A project like the Rossetti Archive is not simply or essentially “an open-ended, universally accessible scholarly edition”; it is “an open-ended, universally accessible scholarly edition!!!!”, which is a thing that did not exist before humanities computing/digital humanities.
Now so many exist that we often, to our discredit, treat them as passé.

At an event here at Northeastern University last spring, Matthew Jockers and Julia Flanders were kind enough to stage a “debate” about scale and analysis in the digital humanities. The organizers of this symposium asked Julia and Matt to stage this exchange as a debate in large part to highlight what we saw as a false dichotomy between big and small data in DH work. Julia and Matt’s conversation—which I still chastise myself for failing to record—was one of the best articulations I’ve seen or read of the two poles of inquiry between which much DH work proceeds. I simply cannot read this exchange and see a field unreflective about its methods or unaware of both its potential and its limitations. To sample only one exchange:

Exciting, indeed! And you can expect a call from me next Monday. . . But you know, it occurs to me that you and I have been drinking out of the same kool aid firehose for a good number of years. It might be worthwhile to pause here and acknowledge a few of the real challenges associated with this kind of work. I worry a lot, for example, about how even our big data corpora are still really small, at least when it comes to making claims about “Literature” with a capital “L.”
One thing we wrestle with at the WWP is the problem of what our collection really represents. Back when the project was first envisioned, we thought that we could actually capture all of the extant women’s writing in English before 1830, so representativeness wasn’t so much of a problem. But (I guess we should be glad) that turned out to be wildly wrong—there were orders of magnitude more eligible texts than we had imagined, far more than we’ll ever likely capture before the heat death of the universe at the rate we’re going.
So now, when we offer tools for text analysis that operate on the whole collection, we have the question of what this collection can actually tell us: about genre, about authorship, about periodization, about anything. It’s a mid-size collection, about 350 texts from a wide range of genres, topics, periods, etc., and clearly there’s some very useful information to be gained from studying it, but precisely what kinds of conclusions can one draw? I like very much Steve Ramsay’s idea that the point of such tools is to permit exploration, to pique our interest and prompt further discovery, but if we were to provide tools for statistical analysis, I think they could easily be misleading given the nature of the sample.
That said, I think representativeness is a very vexed question for any collection—even if one is acutely aware of the problem, as the corpus linguists are, it seems that the best one can do is be very, very transparent about one’s collection development strategy, and hope that the user reads the documentation. But both of these conditions seem fragile… and as text analysis tools become more novice-friendly, I think they’re more likely to be used in a novice way. So how do you handle this?
At some point during my work on the 19th century novel, I had to make a decision to quit collecting texts and start analyzing them. How I got to that point is another matter, but when I began the project I had 950 books and when I made that decision to quit collecting I had 4,700 books. I mined that data and I wrote the last two chapters of my book. About the time I was getting ready to submit the final manuscript, I discovered that there were not 4,700 books. There were actually 3,346. It turned out that the materials my colleagues and I had collected included many multi-volume novels that had not been stitched together and also a good number of duplicates that we had acquired from different sources. When I sorted this all out, I had 3,346 books, and I ended up having to completely rewrite those last two chapters.

This is the DH that doesn’t get quoted in pieces like Kirsch’s: the scholar who analyzes thousands of books computationally and the scholar who encodes the minute details of individual texts, engaged in sincere and generative dialogue about the affordances and limitations of their respective approaches. Far easier to cite the field’s most grandiose claims and be done with it. But this dialogue, too, is DH—and not a minor or marginal part of the field.

Textual encoding has never been as sexy as text analysis, at least for those looking at DH work from outside the field. In many ways, encoding inherited the stigma of scholarly editing, which has in English Departments long been treated as a lesser activity than critique—though critique depends on careful scholarly editing, as text analysis depends on digitization and encoding. You may find encoding or archival metadata development boring or pedantic—certainly some do—but you cannot pretend that encoding is less a part of the digital humanities than coding. Indeed, for me and many others, one of the earliest appeals of DH was that the field attempts to make more transparent the relationships among preservation, presentation, access, and interpretation. In short, any vision of digital humanities that excludes or dismisses the close and careful work of digital preservation, editing, and publication is simply false.

Mr. Penumbra, Distant Reading, and Cheating at Scholarship

My Technologies of Text course is capping this semester reading Robin Sloan’s novel, Mr. Penumbra’s 24-Hour Bookstore, which Matt Kirschenbaum deemed “the first novel of the digital humanities” last year. Mr. Penumbra is a fine capstone because it thinks through so many of our course themes: the (a)materiality of reading, the book (and database) as physical objects, the relationship between computers and previous generations of information technology, &c. &c. &c. I will try not to spoil much of the book here, but I will of necessity give away some details from the end of the first chapter. So if you’ve not yet read it: go thou and do so.

Rereading the book for class, I was struck by one exchange between the titular Mr. Penumbra—bookstore owner and leader of a group of very close readers—and the narrator, Clay Jannon—a new bookstore employee curious about the odd books the store’s odd club members check out. In an attempt to understand what the club members are up to, Clay scans one of the store’s logbooks, which records the comings and goings of club members, the titles of the books they checked out, and when they borrowed each one. When he visualizes these exchanges over time within a 3D model of the bookstore itself, visual patterns of borrowing emerge, which seem, when compiled, to reveal an image of a man’s face. When Clay shows this visualization to Mr. Penumbra, they have an interesting exchange that ultimately hinges on methodology:

Half-smiling, he holds his glasses at an angle and peers down at the screen. His face goes slack, and then he says, quietly: “The Founder.” He turns to me. “You solved it.” He claps a hand to his forehead and his face splits into a giddy smile. “You solved it already! Look at him! Right there on the screen! [...] “How did you do it?” he continues. He’s so proud, like I’m his grandson and I just hit a home run, or cured cancer. “I must see your notes! Did you use Euler’s method? Or the Brito inversion? There is no shame in that, it clears away much of the confusion early on…”

“Mr. Penumbra,” I say, triumph in my voice, “I scanned an old logbook [...] because Google has this machine, it’s superfast, and Hadoop, it just goes—I mean, a thousand computers, like that!” I snap for emphasis. I don’t think he has any idea what I’m talking about. “Anyway, the point is, we just pulled out the data. Automatically.”

At first Mr. Penumbra is quiet, but then he responds to Clay’s news:

“Oh, yes, I know,” he says sharply, and his eyes flash at me. “I see it now. You cheated—would that be fair to say? And as a result, you have no idea what you have accomplished.”

I look down at the desk. That would be fair to say.

When I look back up at Penumbra, his gaze has softened. “And yet…you did it all the same.” He turns and wanders into the Waybacklist. “How curious.”

“Who is it?” I ask suddenly. “Whose face?”

“It is the Founder,” Penumbra says, running a long hand up along one of the shelves. “The one who waits, hiding. He vexes novices for years. Years! And yet you revealed him in—what? A single month?”

Not quite: “Just one day.”

Penumbra takes a sharp breath. His eyes flash again. They are pulled wide and, reflecting the light from the windows, they crackle electric blue in a way I’ve never seen. He gasps, “Incredible.”

As I read this conversation, I was immediately reminded of so many exchanges I’ve seen at conferences about projects that use computational methods—whether text mining, network graphs, or geospatial visualizations—to expose patterns in literary-historical texts. When I talk about our Viral Texts project, for instance, I typically begin by describing the archival challenge: in brief, there are just so many nineteenth-century periodicals that no scholar can read them all. I then discuss how we’ve leveraged the pattern-finding powers of the computer to begin addressing this problem, automatically—there’s that word from Mr. Penumbra—uncovering more than 40,000 reprinted texts in one large-scale digital newspaper archive and using that data to visualize the spread of texts around the country or the strength of connections among publications.

At the risk of sounding uncharitable here—a risk I hope to address in the following paragraphs, so please stick with me—often the response to this work from scholars in my discipline can sound not unlike Mr. Penumbra’s initial response to Clay’s visualization: “I see it now. You cheated…[a]nd as a result, you have no idea what you have accomplished.” Often such responses come, as one might expect, from scholars who spent significant time researching nineteenth-century reprinting in the archive, reading newspapers one by one and taking careful notes. That two junior scholars—one a few measly years out of graduate school—are claiming to have identified 40,000 reprinted texts in a single year’s work, and without setting foot in an actual newspaper archive, seems a lot like cheating.

If someone actually articulated such a caricatured version of our work—and I am deliberately overstating things here to cast a more subtle problem into sharper relief—I could quibble with details of that caricaturization. I brought to the project a disciplinary understanding of reprinting that shaped the early development of the duplicate-detection algorithm. We used known sets of widely-reprinted texts—typically drawn from the incredible work of book historians, bibliographers, and literary critics—to ensure we were capturing reprints we would expect to find, as well as new reprintings. We continually tweak the algorithm based on what it fails to find. We’re still not great, for instance, at identifying reprinted lyric poems, because such texts simply don’t include enough 5-grams (sequences of 5 words) to be identified using our current algorithm. Working through such problems and perfecting our computational methods requires that we draw on literary-historical knowledge and literary-historical methods. Finally, I have spent a good deal of time in physical archives, actually reading newspapers.
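The 5-gram matching described above can be sketched in a few lines. This is a minimal illustration of the general technique only, not the Viral Texts project's actual algorithm, and the sample texts are invented:

```python
# A minimal sketch of 5-gram matching for reprint detection. This illustrates
# the general technique described above, not the project's actual code.

def ngrams(text, n=5):
    """Return the set of n-word sequences ("shingles") in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(a, b, n=5):
    """Count the n-grams two texts share; a high count suggests reprinting."""
    return len(ngrams(a, n) & ngrams(b, n))

# A short lyric poem yields very few 5-grams, which is why such texts
# can slip past this kind of matching:
poem = "so much depends upon a red wheel barrow"
print(len(ngrams(poem)))  # 4
```

An eight-word poem contains only four 5-grams, so even modest OCR noise leaves almost nothing for the matcher to find, which is the problem with lyric poetry noted above.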

But Mr. Penumbra’s comments do get at a central methodological difference that I think is worth attending to more closely. Because Mr. Penumbra is right: perhaps not that Clay cheated, despite Clay’s own concession to this charge, but that Clay’s methodology for finding the Founder did not help him understand what he has accomplished. The pattern Clay uncovers in his visualization is “actually” embedded in codes, which are contained in the books the club members check out. The club members read the books—or perhaps more accurately, they study the books, which are not written to be read as narrative—decipher one part of the code, and then move on to the next book. Illuminating the entire pattern takes years of study, but along the way the club members are also initiated into the Unbroken Spine, which is the name of this monkish order of bibliophiles and code breakers. To become full members of the Unbroken Spine, these readers must understand the codes, which is to understand the books, which is to understand the Unbroken Spine’s history and purpose, and so forth. By contrast, Clay does not read the books or crack the code within them. Instead he works with the Unbroken Spine’s metadata, “not reading” the books but tracking the readers of those books. He comes to the correct conclusion, a fact Mr. Penumbra acknowledges with his “you did it all the same,” by piggybacking on the Unbroken Spine members’ years of difficult labor. And even after he has found the answer in his visualization, Clay does not understand the pieces that constitute that answer. He has looked into a few of the books, and knows they are a code, but he couldn’t crack the code in even one of them if asked.

Of course, I read the Unbroken Spine throughout Mr. Penumbra as an unsubtle metaphor for humanities scholars, reading closely over many years in search of greater understanding: of a historical period, of a genre, of a social movement, &c. And this leads me to two ideas about computational work in literary history that this exchange in Mr. Penumbra suggests. First, I often comment that one of my favorite things about digital humanities projects is the way they “make good” on decades—or even better, centuries—of fastidious record-keeping, particularly in libraries. I get excited when a scholar figures out a way to visualize corpora using an obscure metadata field recorded by generations of librarians and largely ignored until: wow! I’m thinking here of my colleague Benjamin Schmidt’s work visualizing American shipping from a data set compiled and translated through a century of new storage technologies and largely used in environmental research. These eureka moments excite me, but I can understand a more cynical reading, as the work of centuries and generations is distilled into a one-minute video.

Perhaps more to Mr. Penumbra’s point, however, computational methods can reverse the usual order of scholarly investigation in the humanities. Had I gone into the physical archive to study reprinting, I would have read these reprinted texts as I identified them, along with a host of texts which were not reprinted. The act of discovery would have been simultaneously an act of understanding. I would spend years in archives reading, and would emerge ready to build an argument about both the form and content of nineteenth-century reprinting practices.

Computational approaches are often more exploratory, or perhaps screwmeneutical, at least at the beginning of projects. We begin with a big question—can we identify duplicated text within this huge corpus of unstructured text data?—and we try one approach, then another, in an attempt to answer that question. We tweak this parameter and that to see what emerges. When something interesting happens we follow that line for a while. And new questions suggest themselves as we work.

But in our case, all that exploratory work preceded the bulk of the reading the project has required and will require. Of course we were checking our results along the way, reading this or that cluster of reprinted text to see if the method was working, but it wasn’t until we’d isolated a substantial corpus of reprinted texts that the reading began in earnest. Now that we have identified 40,000+ reprints from before the Civil War, I’m spending significant time with those reprints, thinking about the genres that seem to be most widely reprinted, the ways these texts reflect (or don’t) our ideas about literary production and popular culture in the antebellum period, and studying the ways individual texts changed as they were reprinted across the country. The project’s research assistants are now annotating the text clusters: giving them titles, identifying authors, and assigning tags based on the topics, genres, and ideas reflected in each piece.

In many ways, then, our methodology disambiguated the act of discovery from the act of understanding. We quite quickly uncovered patterns of reprinting in our corpus, and now that the algorithm works well we can even more quickly apply it to new corpora, as we are hoping to do in the near future. And we have been able to create some exciting and suggestive visualizations from those findings, visualizing reprints as signals of influence between publications, for instance, in a network graph. But really making sense of these findings will be the work of years, not days.
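The network of "reprints as signals of influence between publications" mentioned above can be approximated from the reprint clusters alone. Here is a sketch using only the standard library; the newspaper names and clusters are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Hypothetical reprint clusters: each lists the papers that printed one text.
clusters = [
    ["Boston Courier", "New-York Tribune", "Ohio Statesman"],
    ["New-York Tribune", "Ohio Statesman"],
    ["Boston Courier", "New-York Tribune"],
]

# Weight each edge by the number of texts a pair of papers both printed --
# a rough signal of connection strength between publications.
edges = Counter()
for papers in clusters:
    for pair in combinations(sorted(papers), 2):
        edges[pair] += 1

print(edges[("Boston Courier", "New-York Tribune")])  # 2
```

The resulting weighted edge list can be handed directly to a graph library for layout and visualization; the weights, not the mere presence of edges, are what make the "strength of connections" readable.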

Ultimately, I think Mr. Penumbra’s comments get at a central challenge for computational work in the humanities: both for those evaluating computational work from the outside and for those doing computational work. It seems clear to me how “distant reading” methods could seem like “cheating,” bypassing some of the work and time typically required to analyze large swaths of historical or literary material using a machine: “I mean, a thousand computers, like that!” But of course, if the question at the heart of the analysis is good, and the methods uncover real and substantive results, they shouldn’t be dismissed on essentially moral grounds, because the researchers didn’t work hard enough. At the same time, those undertaking such projects should recognize when their methods do lead to gaps in understanding because they invert the typical order of humanities scholarship. In Clay’s case, it is only after he creates his visualization of the Unbroken Spine’s Founder—in other words, only after he solves the bigger puzzle—that he begins to understand the details of the group and its mission, and eventually to contribute to that mission. Perhaps this is a model for some DH projects, which tell a truth, but tell it slant. In my case, I am striving to be more transparent about both what we have learned and what we are still learning in the Viral Texts project. And even if the computational work stopped tomorrow, we would have far more to learn than we have yet learned. Understanding is always a little bit out of reach, whether or not you work with a computer.

Omeka/Neatline Workshop Agenda and Links

We’ll be working with the NULab’s Omeka Test Site for this workshop. You should have received login instructions before the workshop. If not, let us know so we can add you.

Workshop Agenda

9:00-9:15 Coffee, breakfast, introductions
9:15-9:45 Omeka project considerations
9:45-10:30 The basics of adding items, collections, and exhibits
10:30-10:45 Break!
10:45-11:15 Group practice adding items, collections, and exhibits
11:15-12:00 Questions, concerns
12:00-1:30 LUNCH!
1:30-2:15 Georectifying historical maps with WorldMap Warp
2:15-3:00 The basics of Neatline
3:00-3:15 Break!
3:15-3:45 Group practice creating Neatline exhibits
3:45-4:00 Final questions, concerns
4:00-5:00 Unstructured work time

Sample Item Resources

Historical Map Resources

Omeka Tutorial

Neatline Tutorials

Model Neatline Exhibits

7 Reasons 19th-Century Newspapers Were Actually the Original Buzzfeed

In March 2013 I had the opportunity to talk about the Viral Texts project for the “Breakfasts at Buzzfeed” speaker series. I gave my talk a gimmicky title worthy of the venue, which I was assured they appreciated rather than resented. It was a lively crowd of employees from around the company, and they asked some insightful questions during the Q&A. Here’s the video. I only wander off frame a few times!

Representing the “Known Unknowns” in Humanities Visualizations

Note: If this topic interests you, you should read Lauren Klein’s recent article in American Literature, “The Image of Absence: Archival Silence, Data Visualization, and James Hemings,” which does far more justice to the topic than I do in my scant paragraphs here.

Pretty much every time I present the Viral Texts Project, the following exchange plays out. During my talk I will have said something like, “Using these methods we have uncovered more than 40,000 reprinted texts from the Library of Congress’ Chronicling America collection, many hundreds of which were widely reprinted—and most of which have not been discussed by scholars.” During the Q&A following the talk, a scholar will inevitably ask, “you realize you’re missing lots of newspapers (and/or lots of the texts that were reprinted), right?”

To which my first instinct is exasperation. Of course we’re missing lots of newspapers. The majority of C19 newspapers aren’t preserved anywhere, and the majority of archived newspapers aren’t digitized. But the ability to identify patterns across large sets of newspapers is, frankly, transformative. The newspapers that have been digitized under the Chronicling America banner are actually the product of many state-level digitization efforts, which means we’re able to study patterns across collections that were housed in many separate physical archives, providing a level of textual access that would be not impossible, but very difficult, to achieve in the physical archive. So my flip answer—which I never quite give—is “yes, we’re missing a lot. But 40,000 new texts is pretty great.”

But those questions do nag at me. In particular I’ve been thinking about how we might represent the “known unknowns” of our work,1 particularly in visualizations. I really started picking at this problem after discussing the Viral Texts work with a group of librarians. I was showing them this map,

which transposes a network graph of our data onto a map that merges census data from 1840 with the Newberry Library’s Atlas of Historical County Boundaries. One of the librarians was from New Hampshire, and she told me she was initially dismayed that there were no influential newspapers from New Hampshire, until she realized that our data doesn’t include any newspapers from New Hampshire, because that state has not yet contributed to Chronicling America. She suggested our maps would be vastly improved if we somehow indicated such gaps visually, rather than simply talking about them.

In the weeks since then, I’ve been experimenting with how to visualize those absences without overwhelming a map with symbology. The simplest solution, as almost always, appears to be the best.

In this map I’ve visualized the 50 reprintings we have identified of one text, a religious reflection by Nashville editor George D. Prentice, often titled “Eloquent Extract,” between 1836 and 1860. The county boundaries are historical, drawn from the Newberry Atlas, but I’ve overlaid modern state boundaries with shading to indicate whether we have significant, scant, or no open-access historical newspaper data from those states. This is still a blunt instrument. Entire states are shaded, even when our coverage is geographically concentrated. For New York, for instance, we have data from a few NYC newspapers and magazines, but nothing yet from the north or west of the state.
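The three-way shading logic itself is simple to sketch: bucket each state by its available open-access newspaper data. The title counts and the threshold below are invented for illustration, not drawn from the project's data:

```python
# A sketch of the three-way shading described above: classify each state's
# open-access newspaper coverage as "significant", "scant", or "none".
# The counts and the threshold of 10 titles are invented for illustration.

coverage_counts = {"Tennessee": 14, "New York": 3, "New Hampshire": 0}

def coverage_level(n_titles, threshold=10):
    if n_titles == 0:
        return "none"
    return "significant" if n_titles >= threshold else "scant"

shading = {state: coverage_level(n) for state, n in coverage_counts.items()}
print(shading["New Hampshire"])  # none
```

The bluntness noted above lives in this classification: New York lands in one bucket statewide even though its few digitized titles are concentrated in one city.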

Nevertheless, I’m happy that these maps are helping me begin to think through how I can represent the absences of the digital archives from which our project draws. And indeed, I’ve begun thinking about how such maps might help us agitate—in admittedly small ways—for increased digitization and data-level access for humanities projects.

This map, for instance, visualizes the 130 reprints of that same “Eloquent Extract” which we were able to identify searching across Chronicling America and a range of commercial periodicals archives (and huge thanks to project RA Peter Roby for keyword searching many archives in search of such examples). For me this map is both exciting and dispiriting, pointing to what could be possible for large-scale text mining projects while simultaneously emphasizing just how much we are missing when forced to work only with openly-available data. If we had access to a larger digitized cultural record we could do so much more. A part of me hopes that if scholars, librarians, and others see such maps they will advocate for increased access to historical materials in open collections. As I said in my talk at the recent C19 conference:

While the dream of archival completeness will always and forever elude us—and please do not mistake the digital for “the complete,” which it never has been and never will be—this map is to my mind nonetheless sad. Whether you consider yourself a “digital humanist” or not, and whether you ever plan to leverage the computational potential of historical databases, I would argue that the contours and content of our online archive should be important to you. Scholars self-consciously working in “digital humanities” and also those working in literature, history, and related fields should make themselves heard in conversations about what will become our digital, scholarly commons. The worst possible thing today would be for us to believe this problem is solved or beyond our influence.

In the meantime, though, we’re starting conversations with commercial archive providers to see if they would be willing to let us use their raw text data. I hope maps like this can help us demonstrate the value of such access, but we shall see how those conversations unfold.

I will continue thinking about how to better represent absence as the geospatial aspects of our project develop in the coming months. Indeed, the same questions arise in our network visualizations. Working with historical data means that we have far more missing nodes than do many network scientists working with, for instance, modern social media data. Finding a way to represent missingness—the “known unknowns” of our work—seems like an essential humanities contribution to geospatial and network methodologies.

1. Yes, I’m borrowing a term from Donald Rumsfeld here, which seems like a useful term for thinking about archival gaps, while perhaps not such a useful term for thinking about starting a war. We can blame this on me watching an interview with Errol Morris about The Unknown Known on The Daily Show last night.