Data curation for the humanities

[23 May 2009]

Most of this week I was in Illinois, attending a Summer Institute on Humanities Data Curation (SIHDC) sponsored by the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana/Champaign (UIUC).

The week began on Monday with useful and proficient introductions to the general idea of data curation from Melissa Cragin, Carole Palmer, John MacMullen, and Allen Renear; Cragin, Palmer, and MacMullen talked a lot about scientific data, for which the term data curation was first formulated. (Although social scientists have been addressing these problems systematically for twenty or thirty years and have a well developed network of social science data archives and data libraries, natural scientists and the librarians working with them don’t seem to have paid much attention to their experience.) They were also working hard to achieve generally applicable insights, which had the unfortunate side effect of raising the number of abstract noun phrases in their slides. Toward the end of the day, I began finding the room a little airless; eventually I concluded that this was partly oxygen starvation from the high density of warm bodies in a room whose air conditioning was not working, and partly concrete-example starvation.

On Tuesday, Syd Bauman and Julia Flanders of the Brown University Women Writers Project (now housed in the Brown University Library) gave a one-day introduction to XML, TEI, and the utility of TEI for data curation. The encoding exercises they gave us had the advantage of concreteness, but also (alas) the disadvantage: as they were describing yet another way that TEI could be used to annotate textual material, one librarian burst out “But we can’t possibly be expected to do all this!”

If in giving an introduction to TEI you don’t go into some detail about the things it can do, no one will understand why people might prefer to archive data in TEI form instead of HTML or straight ASCII. If you do, at least some in the audience will conclude that you are asking them to do all the work, instead of (as here) making them aware of some of the salient properties of the data they may be responsible in future for curating (and, a fortiori, understanding).

Wednesday morning, I was on the program under the title Markup semantics and the preservation of intellectual content, but I had spent Monday and Tuesday concluding that I had more questions than answers, so I threw away most of my plans for the presentation and turned as much of the morning as possible into group discussion. (Perversely, this had the effect of making me think I had things I wanted to say, after all.) I took the opportunity to introduce briefly the notion of skeleton sentences as a way of representing the meaning of markup in symbolic logic or English, and to explain why I think that skeleton sentences (or other similar mechanisms) provide a way to verify the correctness and completeness of the result, when data are migrated from one representation to another. This certainly works in theory, and almost certainly it will work in practice, although the tools still need to be built to test the idea in practice. When I showed the screen with ten lines or so of first-order predicate calculus showing the meaning of the oai:OAI-PMH root element of an OAI metadata harvesting message, some participants (not unreasonably) looked a bit like deer caught in headlights. But others seemed to follow without effort, or without more puzzlement than might be occasioned by the eccentricities of my translation.
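
For readers who would like something concrete, here is a minimal sketch in Python of how the skeleton-sentence check might go. Everything in it is made up for illustration: the two tiny vocabularies, the English sentence templates, and the helper names are assumptions of mine, not the formalism shown at the institute and not any existing tool. Each element type is paired with a sentence skeleton containing a blank; filling the blanks from a document yields a set of sentences, and a migration is taken to have preserved the content if the source and the target yield the same set.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton sentences, one per element type in each vocabulary.
# The blank {text} is filled in from each element instance.
SOURCE_SKELETONS = {
    "persName": "{text} is the name of a person.",
    "title": "{text} is the title of a work.",
}
TARGET_SKELETONS = {
    "span.person": "{text} is the name of a person.",
    "cite": "{text} is the title of a work.",
}


def source_key(el):
    """Look up source elements by tag name alone."""
    return el.tag


def target_key(el):
    """Look up target elements by tag name plus class attribute, if any."""
    cls = el.get("class")
    return f"{el.tag}.{cls}" if cls else el.tag


def sentences(xml_text, skeletons, key):
    """Return the set of sentences the markup licenses under the skeletons."""
    root = ET.fromstring(xml_text)
    result = set()
    for el in root.iter():
        template = skeletons.get(key(el))
        if template is not None:
            text = "".join(el.itertext()).strip()
            result.add(template.format(text=text))
    return result


# An invented source document and two invented migrations of it.
tei_like = "<p><persName>Sappho</persName> wrote <title>Ode to Aphrodite</title>.</p>"
html_like = '<p><span class="person">Sappho</span> wrote <cite>Ode to Aphrodite</cite>.</p>'
lossy = "<p>Sappho wrote <cite>Ode to Aphrodite</cite>.</p>"

before = sentences(tei_like, SOURCE_SKELETONS, source_key)
after = sentences(html_like, TARGET_SKELETONS, target_key)
after_lossy = sentences(lossy, TARGET_SKELETONS, target_key)

# Sentences licensed by the source but missing from the lossy migration.
lost = before - after_lossy

print("faithful migration preserves all sentences:", before == after)
print("sentences lost in the lossy migration:", lost)
```

In this toy case the first migration yields the same sentences as the source, while the second silently drops the claim that Sappho is the name of a person. A real check would need skeletons for a whole vocabulary and sentences in a logic rather than in English strings, which is exactly the tooling that still needs to be built.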

Wednesday afternoon, John Unsworth introduced the tools for text analysis on bodies of similarly encoded TEI documents produced by the MONK project. (The name is alleged to be an acronym for Metadata offers new knowledge, but I had to think hard to see where the tools actually exploited metadata very heavily. If you regard annotations like part-of-speech tagging as metadata, then the role of metadata is more obvious.)

And at the end of Thursday, Lorcan Dempsey, the vice president of OCLC, gave a thoughtful and humorous closing keynote.

For me, no longer able to spend as much time in libraries and with librarians as in some former lives, the most informative presentation was surely Dorothea Salo’s survey of issues facing institutional repositories and other organizations that wish to preserve digital objects of interest to humanists and to make them accessible. She started from two disarmingly simple questions, which seem more blindingly apposite and obvious every time I think back to them. (The world clicked into a different configuration during her talk, so it’s now hard for me to recreate that sense of non-obviousness, but I do remember that no one else had boiled things down so effectively before this talk.)

For all the material you are trying to preserve, she suggested, you should ask

  • “Why do you want to preserve this?”
  • “What threats are you trying to preserve it against?”

The first question led to a consideration of collection development policy and its (often unrecognized) relevance to data curation; the second led to an extremely helpful analysis of the threats to data against which data curators must fight.

I won’t try to summarize her talk further; her slides and her blog post about them will do it better justice than I can.

Persistence and dereferenceability

[31 March 2009]

My esteemed former colleague Thomas Roessler has posted a musing on the fragility of the electronic historical record and the difficulties of achieving persistence, when companies go out of existence and coincidentally stop maintaining their Web sites or paying their domain registration fees.

After reading Thomas’s post, my evil twin Enrique came close to throwing a temper tantrum. (Actually, that’s quite unfair. For Enrique, he was remarkably well behaved.)

“The semantic web partisans,” he shouted, “have spent the last ten years or more telling us that URLs are the perfect naming mechanism: a single, integrated space of names with distributed naming authority. Haven’t they?”

“Well,” I said, “strictly speaking, I think they have mostly been talking about URIs, for the last few years at least.” He ignored this.

“They have been telling us we should use URLs for naming absolutely everything. Including everything we care about. Including Aeschylus and Euripides! Homer! Sappho! Including Shelley, and Keats, and Pope!”

I couldn’t help starting to hum ‘Brush up your Shakespeare’ at this, but he ignored me. This in itself was unusual; he is usually a sucker for Cole Porter. I guess he really was kind of worked up.

“And when anyone expressed concern about (a) the fact that the power to mint URLs is tied up with the regular payment of fees, so it’s really not equally accessible to everyone, or (b) the possibility that URLs don’t have the kind of longevity needed for real persistence, they just told us again, louder, that we should be using URLs for everything.”

“Now, don’t bring up URNs!” I told him, in a warning tone. “We don’t want to open those old wounds again, do we?”

“And why the hell not?” he roared. “What do the SemWeb people think they are playing at?!”

“Well,” I said.

“Either they are surprised at this problem, in which case you have to ask: ‘How can they be surprised? What kind of idiots must they be not to have seen this coming?’”

“Well,” I said.

“Or else they aren’t surprised, in which case you have to ask what they are smoking! Is their attention span so short that it has never occurred to them that names sometimes need to last for longer than Netscape, Inc., happens to be in business?”

“Well,” I said. I realized I didn’t really have a good answer.

“And you?!” he snarled, turning on me and grabbing my lapels. “You were there for years — you couldn’t take a moment to point out to them that a naming convention can be used for everything we care about only if it can be used for the monuments of human culture? You couldn’t be bothered to point out that URLs can be suitable for naming parts of our cultural heritage only if they can last for a few hundred, preferably a few tens of thousands, of years? What use are you?!”

“Well,” I said.

“What use are URLs and their much hyped dereferenceability, if they can break this fast?”

“Well,” I said.

Long pause.

I am not sure Enrique’s complaints are entirely fair, but I also didn’t know how to answer them. I fear he is still waiting for an answer.