Thinking about schema mappings

[3 August 2008]

At the Digital Humanities conference in Finland in June, two papers made me think about a problem that has worried me off and on for a long time, ever since Mark Olsen at the ARTFL Project at the University of Chicago asked how he was supposed to provide searches across a large collection of documents, if all the documents were marked up differently.

Mark’s solution was simple, Procrustean, and effective: if I understood things correctly and remember aright, he translated everything into a single common vocabulary, which in the nature of things was a sort of lowest common denominator of text structure.

Stephen Ramsay and Brian Pytlik Zillig spoke about “Text analytics: a TEI format for cross-collection text analysis”, in which they described an approach similar to Mark’s in spirit, but crucially different in its details. That is, like Mark, they propose to translate everything into a single common system of markup, so that the collection being searched uses consistent ways of signaling textual features. Along the way, they will throw away information they believe to be of no interest for the kind of text analysis their tool is to support. The next day, Fotis Jannidis and Thorsten Vitt gave a paper on “Markup in Textgrid”, which also touched on the problem of providing a homogeneous interface to a heterogeneous collection of documents; if I understood them correctly, they didn’t want to throw away information, but planned simply to store both the original and a modified (homogenized) form of the data. In the discussion period, we briefly weighed the relative merits of translating the heterogeneous material into a common format and of leaving it in its original formats.

The translation into a common format frequently involves loss of some information. For example, if not every document in the collection has been encoded in such a way as to mark all line-end hyphens according to the recommendations of the MLA’s Committee on Scholarly Editions, then it may be better to strip that information out rather than expose it and risk allowing the user to conclude that the other documents were printed originally without any line-end hyphens at all (after all, the query shows no line-end hyphens in those documents!). But that, in turn, means that you’d better be careful if you expect the work performed through the common interface to produce results which may lead to someone wanting to enrich the markup in the documents. If you’ve stripped out information from the original encoding, and now you enrich your stripped copy, later users are unlikely to thank you when they find themselves trying to re-unify the information you’ve added and the information you stripped out.

It would be nice to have a way to present heterogeneous collections through an interface that allows them to look homogeneous, without actually having to lose the details of the original markup.

It has become clear to me that this problem is closely related to problems of interest in relational databases and in RDF queries. (And probably in other areas where people worry about query languages, too, but if Topic Maps people have talked about this in my hearing, they did so without my understanding that they were also addressing this same problem.)

“Ah,” said Enrique. “They used the muffliato spell on you, did they?” “Hush,” I said.

Database people are interested in this problem in a variety of contexts. Perhaps they are performing a federated search and the common schema in terms of which the query is formulated doesn’t match the actual schemas in which the data are stored and exposed by the database management systems. Perhaps it’s not a federated query but there are other reasons we (a) want to query the data in terms of a schema that doesn’t match the ‘native’ schema, and (b) don’t want to transform the storage from the native schema into the query schema. My colleague Eric Prud’hommeaux has been working on a similar problem in the context of RDF. And of course as I say it’s been on the minds of markup people for a while; I’ve just found a paper that Nancy Ide and I wrote for the ASIS 97 conference in which we tried to stagger towards a better understanding of the problem. I have the sense that I understand the problem better now than I did then, but I could be wrong.

Two basic techniques seem to be possible, if you have a body of data in one vocabulary (let’s call it the “source vocabulary”) and would like to be able to query it using terms from a different vocabulary (the “target vocabulary”). Both assume that it’s possible to map information from the source vocabulary to the target vocabulary.

The first technique is Mark Olsen’s: you have or develop a mapping to go from the source vocabulary to the target vocabulary; you apply that mapping. You now have data in the target vocabulary, and you can query it in the usual way. Done. I believe this is what database people call “materializing the view”.

The second technique took me a while to get my head around. Again, we start from a mapping from the source vocabulary to the target vocabulary, and a query using the target vocabulary. The technique has several steps.

  1. Invert the mapping, so it maps from the target vocabulary to the source vocabulary. (Call the result “the inverse mapping”.)
  2. Apply the inverse mapping to the query, to produce a semantically equivalent query expressed in terms of the source vocabulary. (Since the query is not itself a relational database, or an RDF graph, or an XML document, there’s a certain sleight-of-hand going on here: even if you have successfully inverted the mapping, it will take some legerdemain to apply it to a query instead of to data. But just how hard or easy that is will depend a lot on the nature of the query and the nature of the mapping rules. One of the reasons for this klog post is that I want to be able to set up this context, so I can usefully think aloud about the implications for query languages and mapping rules.)
  3. Apply the source-vocabulary query to the source-vocabulary data. Simple, right? Well, no, not simple, but at least it’s a well known problem.
  4. Take the results of your query, and apply the original source-to-target mapping to them, to produce results expressed in / marked up in the target vocabulary.

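To make the two techniques concrete, here is a toy sketch in Python, with purely hypothetical names and a deliberately trivial setup: documents are flat lists of element-name/text pairs, the mapping is a one-to-one renaming of element names, and a “query” is just a single element name in the target vocabulary. Real mappings and real queries are of course far messier; that is exactly where the interesting problems start.

# Toy illustration: a "document" is a list of (element-name, text) pairs,
# a "query" is an element name in the target vocabulary, and the mapping is
# a one-to-one renaming of element names, which makes inversion trivial.
# All the names here are invented for illustration.

source_doc = [("teiHead", "Chapter 1"), ("lg", "Jabberwocky"), ("l", "Beware the Jabberwock")]

# source-vocabulary name -> target-vocabulary name
mapping = {"teiHead": "heading", "lg": "stanza", "l": "line"}

def materialize(doc, mapping):
    """Technique 1: translate the data itself into the target vocabulary."""
    return [(mapping.get(name, name), text) for name, text in doc]

def rewrite_query(query, mapping):
    """Technique 2, steps 1-2: invert the mapping and apply it to the query."""
    inverse = {target: source for source, target in mapping.items()}
    return inverse.get(query, query)

def run_query(doc, element_name):
    """Step 3: evaluate the (rewritten) query against the untouched source data."""
    return [(name, text) for name, text in doc if name == element_name]

def map_results(results, mapping):
    """Step 4: express the results in the target vocabulary."""
    return [(mapping.get(name, name), text) for name, text in results]

target_query = "stanza"

# Technique 1: materialize the view, then query it in the usual way.
print(run_query(materialize(source_doc, mapping), target_query))

# Technique 2: leave the data alone; rewrite the query, run it, map the results back.
print(map_results(run_query(source_doc, rewrite_query(target_query, mapping)), mapping))

Both techniques print the same answer here; the whole difficulty, as the parenthesis in step 2 suggests, lies in inverting mappings that are not simple renamings and in applying the inverse to queries rather than to data.
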
Eric Prud’hommeaux may have been surprised, when he brought this topic up the other day, at the speed with which I told him that the key rule which any application of the second technique must obey is a principle I first learned in a course on language pedagogy, years ago in graduate school. (If so, he hid it well.)

The unit of translation is the utterance, not the word.

Everything else follows from this, so let me say it again. The unit of translation is the utterance, not the word. And almost every account of ‘semantic mapping’ systems I have heard in the last fifteen years goes wrong because it assumes the contrary. So let me say it a third time. The specific implications of this may vary from system to system, and need some unpacking I’m not prepared to do this afternoon, but the basic principle remains what I learned from Gertrude Mahrholz thirty years ago:

The unit of translation is the utterance, not the word.

More on this later. In the meantime, think about that.

Allowing case-insensitive language tags in XSD

[30 July 2008]

I found a note to myself today, while going through some old papers, reminding me to write up an idea the i18n Working Group and I had when we were discussing the problem of case (in)sensitivity and language tags, some time ago. Here it is, for the record.

The discussion of the language datatype in XSD 1.1 includes a note reading:

Note: [BCP 47] specifies that language codes “are to be treated as case insensitive; there exist conventions for capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.” Since the language datatype is derived from string, it inherits from string a one-to-one mapping from lexical representations to values. The literals ‘MN’ and ‘mn’ (for Mongolian) therefore correspond to distinct values and have distinct canonical forms. Users of this specification should be aware of this fact, the consequence of which is that the case-insensitive treatment of language values prescribed by [BCP 47] does not follow from the definition of this datatype given here; applications which require case-insensitivity should make appropriate adjustments.

The same is true of XSD 1.0, even if it doesn’t point out the problem as clearly.

As can be imagined, there have been requests that XSD 1.1 define a new variant of xsd:string which is case-insensitive, to allow language tags to be specified properly as a subtype of case-insensitive string and not as a subtype of the existing string type. Or perhaps language tags need to be a primitive type (as John Cowan has argued) instead of a subtype of string. The former opens a large can of worms, the same one that led XML 1.0 to be case-sensitive in the first place. (If you haven’t tried to work out a really good locale-insensitive internationalized rule for case folding, try it sometime when your life is too placid and simple and all your problems are too tractable; if the difference between metropolitan French and Quebecois doesn’t make you give up, remember to explain how you’re going to handle Turkish and English in the same rule.) The latter doesn’t open that can of worms (for the restricted character set allowed in language tags, case folding is well behaved and well defined), but it does open others. I’ve talked about language codes as a subtype of string before, so I won’t repeat it here.
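
For a concrete taste of the Turkish problem just mentioned, here is a two-line illustration in Python, whose built-in lowercasing, like most, applies one locale-blind rule everywhere; the comments state facts about Turkish orthography, not about any particular library.

# A locale-blind lowercasing rule, applied everywhere (Python's str.lower()):
print("LIMIT".lower())   # 'limit' -- right for English
# In Turkish, uppercase I pairs with dotless ı (and dotted i with İ), so the
# Turkish lowercase of "LIMIT" would be "lımıt"; one rule cannot serve both
# languages at once.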

In some cases the case-sensitivity of xsd:language is not a serious problem: we can write our schema to enforce the usual case conventions, by which language tags should be lowercase and country codes uppercase, and we can specify that a particular element should have a language tag of “en”, “fr”, “ja”, or “de” by using an enumeration in the usual way:

<xsd:simpleType name="langtag">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
        type lists the language codes that Amalgamated Interkluge
        accepts:  English (en), French (fr), Japanese (ja), or
        German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:enumeration value="en"/>
    <xsd:enumeration value="fr"/>
    <xsd:enumeration value="ja"/>
    <xsd:enumeration value="de"/>
  </xsd:restriction>
</xsd:simpleType>

This datatype will accept any of the four language codes indicated, but only if they are written in lower case.

But what if we want a more liberal schema, which allows case-insensitive language tags? We want to accept not just “en” but also “EN”, “En”, and (because we are determined to do the thing properly) even “eN”.

We could add those to the enumeration: for each language, specify all four possible forms. No one seems to like this idea: it makes the declaration four times as big but much less clear. When I suggested it to the i18n WG, they just groaned and François Yergeau looked at me as if I had emitted an indelicate noise he didn’t want to call attention to.

We were all happier when a different idea occurred to us. First note that the datatype definition given above can easily be reformulated using a pattern facet instead of an enumeration:

<xsd:simpleType name="langtag2">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
        type lists the language codes that Amalgamated Interkluge
        accepts:  English (en), French (fr), Japanese (ja), or
        German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="en|fr|ja|de"/>
  </xsd:restriction>
</xsd:simpleType>

This definition can be adjusted to make it case-insensitive in a relatively straightforward way:

<xsd:simpleType name="langtag3">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
        type lists the language codes that Amalgamated Interkluge
        accepts:  English (en), French (fr), Japanese (ja), or
        German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="[eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE]"/>
  </xsd:restriction>
</xsd:simpleType>

Voilà, case-insensitive language tags. The pattern is not quite four times larger than the old pattern, but the declaration is still smaller than the first one using enumerations.

A side benefit of using the pattern instead of the enumeration is that it’s easier to allow for subtags (so we can accept “en-US” and “en-UK”, etc., as well as just “en”) by expanding on the pattern:

<xsd:simpleType name="langtag4">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
        type lists the language codes that Amalgamated Interkluge
        accepts:  English (en), French (fr), Japanese (ja), or
        German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern
      value="([eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE])(-[a-zA-Z0-9]{1,8})*"/>
  </xsd:restriction>
</xsd:simpleType>

In a perfect system, there would be some way to signal that the four upper-, lower-, and mixed-case forms of “en” all mean the same thing and map to the same value. This technique does not provide that. But then, I don’t know any good way to provide it. (I do know ways to provide it, just not any ones I think are good.) In an imperfect world, if I want case-insensitive language tags, I suppose I should be happy that I can find a way to define them without much inconvenience. And that, this technique provides.
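
The note quoted above from the XSD spec says that “applications which require case-insensitivity should make appropriate adjustments”; for what it’s worth, here is a minimal sketch in Python, with a hypothetical function name, of the kind of adjustment an application might make after validation, so that the case variants accepted by langtag3 or langtag4 compare and store as one value. It ignores most of BCP 47’s refinements (script subtags and so on), and it is not a way of getting the schema itself to treat the variants as one value, which is the part I don’t know how to do well.

# Hypothetical post-validation adjustment: normalize accepted language tags to
# the conventional BCP 47 capitalization (primary subtag lowercase, two-letter
# region subtags uppercase) before comparing or storing them.  This ignores
# script subtags and other BCP 47 refinements.

def normalize_langtag(tag):
    subtags = tag.split("-")
    out = [subtags[0].lower()]
    for sub in subtags[1:]:
        if len(sub) == 2 and sub.isalpha():   # region subtag, e.g. "us" -> "US"
            out.append(sub.upper())
        else:
            out.append(sub.lower())
    return "-".join(out)

assert normalize_langtag("EN-us") == normalize_langtag("en-US") == "en-US"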

Namespace documents (kudos to XHTML)

[28 July 2008]

Lately I’ve had occasion to spend some time dereferencing namespaces and looking at what you get when you do so. If, for example, you have encountered some qualified name and want to know what it might mean, the “follow-your-nose” principle says you should be able to find out by dereferencing the namespace name. (The follow-your-nose principle was introduced to me under that name by Dan Connolly, but I think he’d prefer to think of it as a general principle of Web architecture rather than as an invention of his own. And indeed the Architecture of the World Wide Web, as documented by the W3C’s Technical Architecture Group, explicitly recommends that namespace documents be provided for all namespaces.)

The upshot of my recent examinations is that for some namespaces, even otherwise exemplary applications and demos fail to provide namespace documents. For others, the only namespace document is a machine-readable document (e.g. an OWL ontology) without any human-comprehensible documentation of what the terms in the namespace are intended to mean; for still others, there is useful human-readable description (sometimes only in a comment, but it’s there) if you can find it. And for a few, there is something approaching a document intended to be accessible to a human reader.

So far, however, the best namespace document I’ve seen recently is the one produced by the XHTML Working Group for the namespace http://www.w3.org/1999/xhtml/vocab — human-readable, and reasonably clear. Not perfect (no document date? no description of whether the vocabulary is subject to change?) but far, far better than average.

Kudos to the XHTML Working Group!

Descriptive markup and data integration

In his enlightening essay Si tacuisses, Enrique …, my colleague Thomas Roessler outlines some specific ways in which RDF’s provision of a strictly monotonic semantics makes some things possible for applications of RDF, and makes other things impossible. He concludes by saying

RDF semantics, therefore, is exposed to criticism from two angles: On the small scale, it imposes restrictions on those who model data … that can indeed bite badly. On the large scale, real life isn’t monotonic …, and RDF’s modeling can’t deal with that….

XML is “dumb” enough to not be subject to either of these criticisms. It is, however, not even trying to address the issues that large-scale data integration and aggregation will bring.

I think TR may both underestimate the degree to which XML (like SGML before it) contributes to making large-scale data integration possible, and overestimate the contribution that can be made to this task by monotonic semantics. To make large-scale data integration and aggregation possible, what must be done? I think that in a lot of situations, the first task is not “ensure that the application semantics are monotonic” but “try to record the data in an application-independent, reusable form”. If you cannot say what the data mean without reference to a single authoritative application, then you cannot reuse the data. If you have not defined an application-independent semantics for the data, then you will experience huge difficulties with any reuse of the data. Bear in mind that data integration and aggregation (whether large-scale or small-) are intrinsically, necessarily, kinds of data reuse. No data reuse, no data integration.

For that reason, I think TR’s final sentence shows an underdeveloped appreciation for the relevant technologies. Like the development of centralized databases designed to control redundancy and store common information in application-independent ways, the development of descriptive markup in SGML helped lay an essential foundation for any form of secondary data integration. Or is there a way to integrate data usefully without knowing anything at all about what it means? Having achieved the hard-won ability to own and control our own information, instead of having it be owned by software vendors, we can now turn to ways in which we can organize its semantics to minimize downstream complications. But there is no need to begin the effort by saying “well, the effort to wrest control of information from proprietary formats is all well and good, but it really isn’t trying to solve the problems of large-scale data integration that we are interested in.”

(Enrique whistled when he read that sentence. “You really want to dive down that rathole? Look, some people worked hard to achieve something; some other people didn’t think highly enough of the work the first people did, or didn’t talk about it with enough superlatives. Do you want to spend this post addressing your deep-seated feelings of inadequacy and your sense of being under-appreciated? Or do you want to talk about data integration? Sheesh. Dry up, wouldja?”)

Conversely, I think TR may overestimate the importance of the contribution RDF, or any similar technology, can make to useful data integration. Any data store that can be thought of as a conjunction of sentences can be merged through the simple process of set union; RDF’s restriction to atomic triples contributes nothing (as far as I can currently see) to that mergeability. (Are there ways in which RDF triple stores are mergeable that Topic Map graphs are not mergeable? Or relational data stores?)
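
To spell out what I mean by “merged through the simple process of set union”, here is a toy sketch with invented terms: if a store is just a set of atomic sentences (triples, in RDF’s case), merging is literally set union. (The sketch ignores blank nodes, which are the one place where merging real RDF graphs takes slightly more care than plain union.)

# Two toy "triple stores", each a set of (subject, predicate, object) sentences.
# The terms are invented; merging is just set union, and duplicates collapse.

store_a = {
    ("ex:Twain", "ex:wrote", "ex:HuckFinn"),
    ("ex:HuckFinn", "ex:genre", "ex:novel"),
}
store_b = {
    ("ex:Twain", "ex:realName", "Samuel Clemens"),
    ("ex:Twain", "ex:wrote", "ex:HuckFinn"),
}

merged = store_a | store_b
assert len(merged) == 3

That the union is well defined is precisely the point at issue; whether the merged set is good for anything is where the human understanding discussed in the next paragraph comes in.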

And it’s not clear to me that simple mechanical mergeability in itself contributes all that much to our ability to integrate data from different sources. Data integration, as I understand the term, involves putting together information from different sources to achieve some purpose or accomplish some task. But using information to achieve a purpose always involves understanding the information and seeing how it can be brought to bear on the problem. In my experience, finding or making a human brain with the required understanding is the hard part; once that’s available, the kinds of simple automatic mergers made possible by RDF or Topic Maps have seemed (in my experience, which may be woefully inadequate in this regard) a useful convenience, but not always an essential one. It might well be that the data from source A cannot be merged mechanically with that from source B, but an integrator who understands how to use the data from A and B to solve a problem will often experience no particular difficulty working around that impossibility.

I don’t mean to underestimate the utility of simple mechanical processing steps. They can reduce costs and increase reliability. (That’s why I’m interested in validation.) But by themselves they will never actually solve any very interesting problems, and the contribution of mechanical tools seems to me smaller than the contribution of the human understanding needed to deploy them usefully.

And finally, I think Thomas’s post raises an important and delicate question about the boundaries RDF sets to application semantics. An important prerequisite for useful data integration is, it would seem, that there be some useful data worth retaining and integrating. How thoroughly can we convince ourselves that in requiring monotonic semantics RDF has not excluded from its purview important classes of information most conveniently represented in other ways?

RDF, Topic Maps, predicate calculus, and the Queen of Romania

[22 July 2008; minor revisions 23 July]

Some colleagues and I spent time not long ago discussing the proposition that RDF has intrinsic semantics in a way that XML does not. My view, influenced by some long-ago thoughts about RDF, was that there is no serious difference between RDF and XML here: from interesting semantics we learn things about the real world, and neither the RDF spec nor the XML spec provides any particular set of semantic primitives for talking about the world. The maker of the vocabulary can (I oversimplify slightly, complexification below) make terms mean pretty much anything they want: this is critical both to XML and to RDF. The only way, looking at an RDF graph or the markup in an XML document, to know whether it is talking about the gross national product or the correct way to make adobe, is to look at the documentation. This analysis, of course, is based on interpreting the proposition we were discussing in a particular way, as claiming that in some way you know more about what an RDF graph is saying than you know about what an SGML or XML document is saying, without the need for human intervention. Such a claim does not seem plausible to me, but it is certainly what I have understood some of my RDF-enthusiast friends to have been saying over the years.

(I should point out that if one understands the vocabulary used to define classes and subclasses in the RDF graph, of course, the chances of hitting upon useful documentation are somewhat increased. If you don’t know what vug means, but know that it is a subclass of cavity, which in turn is (let’s say) a subclass of the class of geological formations, then even if vug is otherwise inadequately documented you may have a chance of understanding, sort of, kind of, what’s going on in the part of the RDF graph that mentions vugs. I was about to say that this means one’s chances of finding useful documentation may be better with RDF than with naked XML, but my evil twin Enrique points out that the same point applies if you understand the notation used to define superclass/subclass relations [or, as they are more usually called, supertype/subtype relations] in XSD [the XML Schema Definition Language]. He’s right, so the ability to find documentation for sub- and superclasses doesn’t seem to distinguish RDF from XML.)

This particular group of colleagues, however, had (for the most part) a different reason for saying that RDF has more semantics than XML.

Thomas Roessler has recently posted a concise but still rather complex statement of the contract that producers of RDF enter into with the consumers of RDF, and the way in which it can be said to justify the proposition that RDF has more semantics built-in than XML.

My bumper-sticker summary, though, is simpler. When looking at an XML document, you know that the meaning of the document is given by an interaction of (1) the rules for interpreting the document shaped by the designer of the vocabulary and by the usage of the document creator with (2) the actual content of the document. The rules given by the vocabulary designer and document author, in turn, are limited only by human ingenuity. If someone wants to specify a vocabulary in which the correct interpretation of an element requires that you perform gematriya on the element’s generic identifier (element type name, as the XML spec calls it) and then feed the resulting number into a specific random number generator as a seed, then we can say that that’s probably not good design, but we can’t stop them. (Actually, I’m not sure that RDF can stop that particular case, either. Hmm. I keep trying to identify differences and finding similarities instead.)

(Enrique interrupted me here. “Gematriya?” “A hermeneutic tool beloved of some Jewish mystics. Each letter of the alphabet has a numeric value, and the numerical value for a concept may be derived from the numbers of the letters which spell the word for the concept. Arithmetic relations among the gematriya for different words signal conceptual relations among the ideas they denote.” “Where do you get this stuff? Reading Chaim Potok or something?” “Well, yeah, and Knuth for the random-number generator, but there are analogous numerological practices in other traditions, too. Should I add a note saying that the output of the random number generator is used to perform the sortes Vergilianae?” “No,” he said, “just shut up, would you?”)

In RDF, on the other hand, you do know some things.

  1. You know the “meaning” of the RDF graph can be paraphrased as the conjunction of a set of declarative sentences.
  2. You know that each of those declarative sentences is atomic and semantically independent of all others. (That is, RDF allows no compound structures other than conjunction; it differs in this way from programming languages and from predicate logic — indeed, from virtually all formally defined notations which require context-free grammars — which allow recursive structures whose meaning must be determined top-down, and whose meaning is not the same as the conjunction of their parts. The sentences P and Q are both part of the sentence “if P then Q”, but the meaning of that sentence is not the same as the conjunction of the parts P and Q.)

When my colleagues succeeded in making me understand that on the basis of these two facts one could plausibly claim that RDF has, intrinsically, more semantics than XML, I was at first incredulous. It seems a very thin claim. Knowing that the graph in front of me can be paraphrased as a set of short declarative sentences doesn’t seem to tell me what it means, any more than suspecting that the radio traffic between spies and spymasters consists of reports going one direction and instructions going the other tells us how to crack the code being used. But as Thomas points out, these two facts are fairly important as principles that allow RDF graphs to be merged without violence to their meaning, which is an important task in data integration. Similar principles (or perhaps at this level of abstraction they are the same principles) are important in allowing topic maps to be merged safely.

Of course, there is a flip side. If a notation restricts itself to a monotonic semantics of this kind (in which no well formed formula ever appears in an expression without licensing us to throw away the rest of the expression and assume that the formula we found in it has been asserted), then some important conveniences seem to be lost. I am told that for a given statement P, it’s not impossible to express the proposition “not P” in RDF, but I gather that it does not involve any construct that resembles the expression for P itself. And similarly, constructions familiar from sentential logic like “P or Q”, “P only if Q”, and “P if and only if Q” must all be translated into constructions which do not contain, as subexpressions, the expressions for P or Q themselves.

At the very least, this seems likely to be inconvenient and opaque.

Several questions come thronging to the fore whenever I get this far in my ruminations on this topic.

  • Do Topic Maps have a similarly restrictive monotonic semantics?
  • Could we get a less baroque representation of complex conditionals with something like Lars-Marius Garshol’s quads, in which the minimal atomic form of utterance has subject, verb, object, and who-said-so components, so that having a quad in your store does not commit you to belief in the proposition captured in its triple the way that having a triple in your triple-store does? Or do quads just lead to other problems? (A toy sketch of the quad idea follows this list.)
  • If we accept as true my claim that XML can in theory express imperative, interrogative, exclamatory, or other non-declarative semantics (fans of Roman Jakobson’s 1960 essay on Linguistics and Poetics may now chant, in unison, “expressive, conative, meta-lingual, phatic, poetic”, thank you very much, no, don’t add “referential”, that’s the point, the ability to do referential semantics is not a distinguishing feature here), does that fact do anyone any good? The fundamental idea of descriptive markup has sometimes been analysed as consisting of (a) declarative (not imperative!) semantics and (b) logical rather than appearance-oriented markup of the document; if that analysis is sound (and I had always thought so), then presumably the use of XML for non-declarative semantics should be regarded as eccentric and probably not good practice, but unavoidable. In order to achieve declarative semantics, it was necessary to invent SGML (or something like it), but neither SGML nor XML enforces, or attempts to enforce, a declarative semantics. So is the ability to define XML vocabularies with non-declarative semantics anything other than an artifact of the system design? (I’m tempted to say “a spandrel”, but let’s not go into evolutionary biology.)
  • Is there a short, clear story about the relation between the kinds of things you can and cannot express in RDF, or Topic Maps, and the kinds of things expressible and inexpressible in other notations like first-order predicate calculus, sentential calculus, the relational model, and natural language? (Or even a long opaque story?) What I have in mind here is chapter 10 in Clocksin and Mellish’s Programming in Prolog, “The Relation of Prolog to Logic”, in which they clarify the relative expressive powers of first-order predicate calculus and Prolog by showing how to translate sentences from the first to the second, observing along the way exactly when and how expressive power or nuance gets lost. Can I translate arbitrary first-order predicate calculus expressions into RDF? How? Into Topic Maps? How? What gets lost on the way?
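
On the quads question flagged above, here is a toy sketch, with invented names, of what a quad store might look like and of the sense in which holding a quad need not commit you to the triple embedded in it. It illustrates only the who-said-so half of the question; whether quads actually buy a less baroque treatment of conditionals is exactly what I don’t know.

# Toy quad store (invented names; not Garshol's actual proposal): each utterance
# records subject, verb, object, and who said so.  Holding a quad does not by
# itself commit you to the embedded triple; deciding which sources to believe
# is a separate, explicit step.

quads = {
    ("ex:Wilde", "ex:wrote", "ex:Salome", "source:catalogueA"),
    ("ex:Wilde", "ex:wrote", "ex:Salome", "source:catalogueB"),
    ("ex:Earth", "ex:shape", "ex:flat", "source:crankPamphlet"),
}

trusted = {"source:catalogueA", "source:catalogueB"}

# Only at this point do we assert anything ourselves.
asserted_triples = {(s, v, o) for (s, v, o, who) in quads if who in trusted}

assert ("ex:Earth", "ex:shape", "ex:flat") not in asserted_triples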

It will not surprise me to learn that these are old well understood questions, and that all I really need to do is RTFM. (Actually, that would be good news: it would indicate that it’s a well understood and well solved problem. In another sense, of course, it would be less good news to be told to RTFM. I’ve tried several times to read Resource Description Framework (RDF) Model and Syntax Specification but never managed to get my head around it. But knowing that there is an FM to read would be comforting in its way, even if I never managed to read it. RDF isn’t really my day job, after all.)

How comfortable can we be in our formalization of the world, when for the sake of tractability our formalizations are weaker than predicate calculus, given that even predicate calculus is so poor at capturing even simple natural-language discourse? Don’t tell me we are expending all this effort to build a Semantic Web in which we won’t even be able to utter counterfactual conditionals?! What good is a formal notation for information which does not allow us to capture a sentence like the one with which Lou Burnard once dismissed a claim I had made:

“If that is the case, then I am the Queen of Romania.”