Why …

Posted on 24 July 2008 by cmsmcq

Why do so many proponents of new technologies spend so much time misrepresenting existing technologies and spreading misinformation about them?

Do they misrepresent the existing technologies in order to make the new technology they are selling look better?

Or did they get involved in the new technology only because they failed to understand the existing technology and how to use it well?

Daniel Boone meets the consistent Web

Posted on 22 July 2008 by cmsmcq

[22 July 2008]

My colleague Thomas Roessler writes:

[The monotonic semantics of RDF] guarantee that you won’t run into a world of inconsistency when you discover additional information, and they also guarantee that you can learn things about the world piece by piece.

My evil twin Enrique responds: So let us start with the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is identical to the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which I assume I can express using some predicate like the OWL sameAs.

And now let us discover additional information in another triple store, which contains the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is distinct from the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which it expresses using some predicate like the OWL differentFrom.

I’m having trouble understanding (concludes Enrique) how we can do this without either running into a world of inconsistency (a small world, perhaps, bounded in a nutshell, but still a world big enough for joe and Josephus to be both the same and different), or else running into a world in which we find that “inconsistency” has been defined to have a highly technical meaning under which the two triples just described are not actually inconsistent in the technical sense (why do I expect someone to start lecturing me about Herbrand models any moment now?), even though any application relying on the usual notions of identity and difference may find itself at a loss as to what to make of seeing them both in the same graph.
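Enrique's scenario can be made concrete with a small sketch. It assumes, purely for illustration, that a triple store is nothing more than a Python set of (subject, predicate, object) tuples of URI strings. The two OWL property URIs are the standard ones; the function names merge and identity_clashes are invented here. The merge is plain set union and goes through without complaint; the clash only shows up when something goes looking for it.

```python
# Standard OWL namespace; the two properties below are defined by OWL.
OWL = "http://www.w3.org/2002/07/owl#"
SAME_AS = OWL + "sameAs"
DIFFERENT_FROM = OWL + "differentFrom"

JOE = "http://www.w3.org/People/cmsmcq/2008/ns1#joe"
JOSEPHUS = "http://www.w3.org/People/cmsmcq/2008/ns2#Josephus"

# Assumption: each "triple store" is just a set of (s, p, o) tuples.
store_a = {(JOE, SAME_AS, JOSEPHUS)}
store_b = {(JOE, DIFFERENT_FROM, JOSEPHUS)}


def merge(*stores):
    """Merge stores by set union; nothing is ever retracted."""
    merged = set()
    for store in stores:
        merged |= store
    return merged


def identity_clashes(graph):
    """Return the pairs asserted to be both the same and different.

    Both properties are symmetric, so each pair is held as a frozenset.
    Transitivity of sameAs is deliberately ignored in this sketch.
    """
    same = {frozenset((s, o)) for s, p, o in graph if p == SAME_AS}
    different = {frozenset((s, o)) for s, p, o in graph if p == DIFFERENT_FROM}
    return same & different


merged = merge(store_a, store_b)
clashes = identity_clashes(merged)
for pair in clashes:
    print("asserted both same and different:", sorted(pair))
```

This check is naive: it ignores anything a real OWL reasoner would infer. It is only meant to show that the union is mechanically trivial, while the question of what to make of the result is left to whoever reads it.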

I reminded Enrique of the American pioneer Daniel Boone, who proudly claimed that he had never been lost in his life. Never? Never. [Pause.] “But I was a mite bewildered once for three days.” [Rimshot.]

Descriptive markup and data integration

Posted on 22 July 2008 by cmsmcq

In his enlightening essay Si tacuisses, Enrique …, my colleague Thomas Roessler outlines some specific ways in which RDF’s provision of a strictly monotonic semantics makes some things possible for applications of RDF, and makes other things impossible. He concludes by saying

RDF semantics, therefore, is exposed to criticism from two angles: On the small scale, it imposes restrictions on those who model data … that can indeed bite badly. On the large scale, real life isn’t monotonic …, and RDF’s modeling can’t deal with that….

XML is “dumb” enough to not be subject to either of these criticisms. It is, however, not even trying to address the issues that large-scale data integration and aggregation will bring.

I think TR may both underestimate the degree to which XML (like SGML before it) contributes to making large-scale data integration possible, and overestimate the contribution that can be made to this task by monotonic semantics. To make large-scale data integration and aggregation possible, what must be done? I think that in a lot of situations, the first task is not “ensure that the application semantics are monotonic” but “try to record the data in an application-independent, reusable form”. If you cannot say what the data mean without reference to a single authoritative application, then you cannot reuse the data. If you have not defined an application-independent semantics for the data, then you will experience huge difficulties with any reuse of the data. Bear in mind that data integration and aggregation (whether large-scale or small-) are intrinsically, necessarily, kinds of data reuse. No data reuse, no data integration.

For that reason, I think TR’s final sentence shows an underdeveloped appreciation for the relevant technologies. Like the development of centralized databases designed to control redundancy and store common information in application-independent ways, the development of descriptive markup in SGML helped lay an essential foundation for any form of secondary data integration. Or is there a way to integrate data usefully without knowing anything at all about what it means? Having achieved the hard-won ability to own and control our own information, instead of having it be owned by software vendors, we can now turn to ways in which we can organize its semantics to minimize downstream complications. But there is no need to begin the effort by saying “well, the effort to wrest control of information from proprietary formats is all well and good, but it really isn’t trying to solve the problems of large-scale data integration that we are interested in.”

(Enrique whistled when he read that sentence. “You really want to dive down that rathole? Look, some people worked hard to achieve something; some other people didn’t think highly enough of the work the first people did, or didn’t talk about it with enough superlatives. Do you want to spend this post addressing your deep-seated feelings of inadequacy and your sense of being under-appreciated? Or do you want to talk about data integration? Sheesh. Dry up, wouldja?“)

Conversely, I think TR may overestimate the importance of the contribution RDF, or any similar technology, can make to useful data integration. Any data store that can be thought of as a conjunction of sentences can be merged through the simple process of set union; RDF’s restriction to atomic triples contributes nothing (as far as I can currently see) to that mergeability. (Are there ways in which RDF triple stores are mergeable that Topic Map graphs are not mergeable? Or relational data stores?)
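The point about set union can be sketched in a few lines. The sketch assumes only that a data store is a set of atomic sentences, each represented as a tuple; the sample triples and rows, and the function name merge, are invented for illustration. The same function merges triples and relational rows alike, which is why the restriction to triples seems to add nothing to mergeability as such.

```python
def merge(*stores):
    """Merge any number of stores, each a set of atomic sentences (tuples)."""
    return frozenset().union(*stores)


# Hypothetical triple stores: (subject, predicate, object).
triples_a = {("vug", "subClassOf", "cavity")}
triples_b = {
    ("vug", "subClassOf", "cavity"),  # duplicate sentence, collapses on merge
    ("cavity", "subClassOf", "geological_formation"),
}

# Hypothetical relational rows: (relation name, key, value).
rows_a = {("employee", 1, "Ada")}
rows_b = {("employee", 2, "Grace")}

merged_triples = merge(triples_a, triples_b)
merged_rows = merge(rows_a, rows_b)

# Monotonicity: every sentence in an input store survives the merge.
print(triples_a <= merged_triples, rows_b <= merged_rows)
```

Nothing here knows or cares what the sentences mean, which is the limit of what mechanical merging can offer.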

And it’s not clear to me that simple mechanical mergeability in itself contributes all that much to our ability to integrate data from different sources. Data integration, as I understand the term, involves putting together information from different sources to achieve some purpose or accomplish some task. But using information to achieve a purpose always involves understanding the information and seeing how it can be brought to bear on the problem. In my experience, finding or making a human brain with the required understanding is the hard part; once that’s available, the kinds of simple automatic mergers made possible by RDF or Topic Maps have seemed (in my experience, which may be woefully inadequate in this regard) a useful convenience, but not always an essential one. It might well be that the data from source A cannot be merged mechanically with that from source B, but an integrator who understands how to use the data from A and B to solve a problem will often experience no particular difficulty working around that impossibility.

I don’t mean to underestimate the utility of simple mechanical processing steps. They can reduce costs and increase reliability. (That’s why I’m interested in validation.) But by themselves they will never actually solve any very interesting problems, and the contribution of mechanical tools seems to me smaller than the contribution of the human understanding needed to deploy them usefully.

And finally, I think Thomas’s post raises an important and delicate question about the boundaries RDF sets to application semantics. An important prerequisite for useful data integration is, it would seem, that there be some useful data worth retaining and integrating. How thoroughly can we convince ourselves that in requiring monotonic semantics RDF has not excluded from its purview important classes of information most conveniently represented in other ways?

RDF, Topic Maps, predicate calculus, and the Queen of Romania

Posted on 22 July 2008 by cmsmcq

[22 July 2008; minor revisions 23 July]

Some colleagues and I spent time not long ago discussing the proposition that RDF has intrinsic semantics in a way that XML does not. My view, influenced by some long-ago thoughts about RDF, was that there is no serious difference between RDF and XML here: from interesting semantics we learn things about the real world, and neither the RDF spec nor the XML spec provides any particular set of semantic primitives for talking about the world. The maker of the vocabulary can (I oversimplify slightly, complexification below) make terms mean pretty much anything they want: this is critical both to XML and to RDF. The only way, looking at an RDF graph or the markup in an XML document, to know whether it is talking about the gross national product or the correct way to make adobe, is to look at the documentation. This analysis, of course, is based on interpreting the proposition we were discussing in a particular way, as claiming that in some way you know more about what an RDF graph is saying than you know about what an SGML or XML document is saying, without the need for human intervention. Such a claim does not seem plausible to me, but it is certainly what I have understood some of my RDF-enthusiast friends to have been saying over the years.

(I should point out that if one understands the vocabulary used to define classes and subclasses in the RDF graph, of course, the chances of hitting upon useful documentation are somewhat increased. If you don’t know what vug means, but know that it is a subclass of cavity, which in turn is (let’s say) a subclass of the class of geological formations, then even if vug is otherwise inadequately documented you may have a chance of understanding, sort of, kind of, what’s going on in the part of the RDF graph that mentions vugs. I was about to say that this means one’s chances of finding useful documentation may be better with RDF than with naked XML, but my evil twin Enrique points out that the same point applies if you understand the notation used to define superclass/subclass relations [or, as they are more usually called, supertype/subtype relations] in XSD [the XML Schema Definition Language]. He’s right, so the ability to find documentation for sub- and superclasses doesn’t seem to distinguish RDF from XML.)

This particular group of colleagues, however, had (for the most part) a different reason for saying that RDF has more semantics than XML.

Thomas Roessler has recently posted a concise but still rather complex statement of the contract that producers of RDF enter into with the consumers of RDF, and the way in which it can be said to justify the proposition that RDF has more semantics built-in than XML.

My bumper-sticker summary, though, is simpler. When looking at an XML document, you know that the meaning of the document is given by an interaction of (1) the rules for interpreting the document shaped by the designer of the vocabulary and by the usage of the document creator with (2) the actual content of the document. The rules given by the vocabulary designer and document author, in turn, are limited only by human ingenuity. If someone wants to specify a vocabulary in which the correct interpretation of an element requires that you perform gematriya on the element’s generic identifier (element type name, as the XML spec calls it) and then feed the resulting number into a specific random number generator as a seed, then we can say that that’s probably not good design, but we can’t stop them. (Actually, I’m not sure that RDF can stop that particular case, either. Hmm. I keep trying to identify differences and finding similarities instead.)

(Enrique interrupted me here. “Gematriya?” “A hermeneutic tool beloved of some Jewish mystics. Each letter of the alphabet has a numeric value, and the numerical value for a concept may be derived from the numbers of the letters which spell the word for the concept. Arithmetic relations among the gematriya for different words signal conceptual relations among the ideas they denote.” “Where do you get this stuff? Reading Chaim Potok or something?” “Well, yeah, and Knuth for the random-number generator, but there are analogous numerological practices in other traditions, too. Should I add a note saying that the output of the random number generator is used to perform the sortes Vergilianae?” “No,” he said, “just shut up, would you?”)

In RDF, on the other hand, you do know some things.

You know the “meaning” of the RDF graph can be paraphrased as the conjunction of a set of declarative sentences.
You know that each of those declarative sentences is atomic and semantically independent of all others. (That is, RDF allows no compound structures other than conjunction; it differs in this way from programming languages and from predicate logic — indeed, from virtually all formally defined notations which require context-free grammars — which allow recursive structures whose meaning must be determined top-down, and whose meaning is not the same as the conjunction of their parts. The sentences P and Q are both part of the sentence “if P then Q”, but the meaning of that sentence is not the same as the conjunction of the parts P and Q.)
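The difference between a conjunction and a compound like “if P then Q” can be checked by brute force over truth values. The sketch below assumes the classical material conditional as the reading of “if P then Q”, which is a simplification of what the phrase means in natural language; the function names are invented for illustration. It asks, for each connective, whether asserting the whole licenses us to pull out each part and assert it separately.

```python
from itertools import product


def conjunction(p, q):
    return p and q


def conditional(p, q):
    # Assumption: "if P then Q" read as the classical material conditional.
    return (not p) or q


def entails_each_part(connective):
    """True if, whenever connective(p, q) holds, both p and q hold as well."""
    return all(
        p and q
        for p, q in product([True, False], repeat=2)
        if connective(p, q)
    )


conjunction_is_separable = entails_each_part(conjunction)
conditional_is_separable = entails_each_part(conditional)

# Full truth table: P, Q, "P and Q", "if P then Q".
for p, q in product([True, False], repeat=2):
    print(p, q, conjunction(p, q), conditional(p, q))
```

Only the conjunction passes the test, which is the property a store of independent atomic sentences relies on.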

When my colleagues succeeded in making me understand that on the basis of these two facts one could plausibly claim that RDF has, intrinsically, more semantics than XML, I was at first incredulous. It seems a very thin claim. Knowing that the graph in front of me can be paraphrased as a set of short declarative sentences doesn’t seem to tell me what it means, any more than suspecting that the radio traffic between spies and spymasters consists of reports going one direction and instructions going the other tells us how to crack the code being used. But as Thomas points out, these two facts are fairly important as principles that allow RDF graphs to be merged without violence to their meaning, which is an important task in data integration. Similar principles (or perhaps at this level of abstraction they are the same principles) are important in allowing topic maps to be merged safely.

Of course, there is a flip side. If a notation restricts itself to a monotonic semantics of this kind (in which no well-formed formula ever appears in an expression without licensing us to throw away the rest of the expression and assume that the formula we found in it has been asserted), then some important conveniences seem to be lost. I am told that for a given statement P, it’s not impossible to express the proposition “not P” in RDF, but I gather that it does not involve any construct that resembles the expression for P itself. And similarly, constructions familiar from sentential logic like “P or Q”, “P only if Q”, and “P if and only if Q” must all be translated into constructions which do not contain, as subexpressions, the expressions for P or Q themselves.

At the very least, this seems likely to be inconvenient and opaque.
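One way to see the opacity is to sketch what talking about P without asserting P might look like. The sketch below uses the RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object), which does exist; everything in the example.org namespace, including the truthValue predicate and the sample statement, is hypothetical and carries no meaning in RDF itself. I do not claim this is how negation is properly done in RDF, only that the result no longer contains the triple for P.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/ns#"  # hypothetical namespace, for illustration only


def reify(node, triple):
    """Describe a triple with the RDF reification vocabulary.

    The returned triples talk about the statement; they do not include
    (and so do not assert) the statement itself.
    """
    s, p, o = triple
    return {
        (node, RDF + "type", RDF + "Statement"),
        (node, RDF + "subject", s),
        (node, RDF + "predicate", p),
        (node, RDF + "object", o),
    }


# P: a made-up statement.
P = (EX + "joe", EX + "livesIn", EX + "Kentucky")

# "not P", under the assumption that consumers agree on what the
# hypothetical ex:truthValue predicate means.
not_P = reify("_:s1", P) | {("_:s1", EX + "truthValue", "false")}

print(P in not_P)   # the expression for P is not a member of the graph
print(len(not_P))   # one sentence has become five
```

Five triples stand in for one negated sentence, and none of them looks like P.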

Several questions come thronging to the fore whenever I get this far in my ruminations on this topic.

Do Topic Maps have a similarly restrictive monotonic semantics?
Could we get a less baroque representation of complex conditionals with something like Lars-Marius Garshol’s quads, in which the minimal atomic form of utterance has subject, verb, object, and who-said-so components, so that having a quad in your store does not commit you to belief in the proposition captured in its triple the way that having a triple in your triple-store does? Or do quads just lead to other problems?
If we accept as true my claim that XML can in theory express imperative, interrogative, exclamatory, or other non-declarative semantics (fans of Roman Jakobson’s 1960 essay on Linguistics and Poetics may now chant, in unison, “expressive, conative, meta-lingual, phatic, poetic”, thank you very much, no, don’t add “referential”, that’s the point, the ability to do referential semantics is not a distinguishing feature here), does that fact do anyone any good? The fundamental idea of descriptive markup has sometimes been analysed as consisting of (a) declarative (not imperative!) semantics and (b) logical rather than appearance-oriented markup of the document; if that analysis is sound (and I had always thought so), then presumably the use of XML for non-declarative semantics should be regarded as eccentric and probably not good practice, but unavoidable. In order to achieve declarative semantics, it was necessary to invent SGML (or something like it), but neither SGML nor XML enforces, or attempts to enforce, a declarative semantics. So is the ability to define XML vocabularies with non-declarative semantics anything other than an artifact of the system design? (I’m tempted to say “a spandrel”, but let’s not go into evolutionary biology.)
Is there a short, clear story about the relation between the kinds of things you can and cannot express in RDF, or Topic Maps, and the kinds of things expressible and inexpressible in other notations like first-order predicate calculus, sentential calculus, the relational model, and natural language? (Or even a long opaque story?) What I have in mind here is chapter 10 in Clocksin and Mellish’s Programming in Prolog, “The Relation of Prolog to Logic”, in which they clarify the relative expressive powers of first-order predicate calculus and Prolog by showing how to translate sentences from the first to the second, observing along the way exactly when and how expressive power or nuance gets lost. Can I translate arbitrary first-order predicate calculus expressions into RDF? How? Into Topic Maps? How? What gets lost on the way?

It will not surprise me to learn that these are old well understood questions, and that all I really need to do is RTFM. (Actually, that would be good news: it would indicate that it’s a well understood and well solved problem. In another sense, of course, it would be less good news to be told to RTFM. I’ve tried several times to read Resource Description Framework (RDF) Model and Syntax Specification but never managed to get my head around it. But knowing that there is an FM to read would be comforting in its way, even if I never managed to read it. RDF isn’t really my day job, after all.)

How comfortable can we be in our formalization of the world, when for the sake of tractability our formalizations are weaker than predicate calculus, given that even predicate calculus is so poor at capturing even simple natural-language discourse? Don’t tell me we are expending all this effort to build a Semantic Web in which we won’t even be able to utter counterfactual conditionals?! What good is a formal notation for information which does not allow us to capture a sentence like the one with which Lou Burnard once dismissed a claim I had made:

“If that is the case, then I am the Queen of Romania.”

The OOXML debates (non-combatant’s perspective)

Posted on 22 July 2008 by cmsmcq

[21-22 July 2008]

So far, I have managed to avoid participating in the debates over standardizing OOXML, and I don’t plan for that to change. But my evil twin Enrique and I spent some wickedly enjoyable time this afternoon reading a lot of postings in that debate, from a variety of sources, when I should have been working on other things. (“Log it as ‘Professional – continuing education’,” suggested Enrique. I may do that.)

It’s interesting to be able to observe a hard-fought technical battle in which (other people’s) feelings run high but in which one does not have a large personal stake. So many rhetorical maneuvers are familiar; the deterioration of the quality of the argument brings back so many memories of other technical arguments in which (distracted by caring about the outcome) the observer may not have been able to appreciate the rhetorical ingenuity of some of the contributions.

What strikes both Enrique and me is how distinct the styles of argumentation on the various sides of the debate are. We counted three, not two, in this battle, but we could be undercounting.

On one side, there is a class of contributions carefully kept as thoroughly emotionless as possible, focusing exclusively on technical (or at least substantive) issues — even when the contribution was intended to persuade others of a course of action. This seems, at first, an unusual rhetorical choice: I think most advertisers tend to prefer enthusiasm to a studied lack of emotion in trying to sell things. Still, this class includes some of the people whose judgement I have the most reason to respect, and in an over-heated environment a strict objectivity can be immensely attractive.

There is a second class of contributions, which adopt a more emotional, excitable, even passionate style of argumentation, one that is nevertheless almost always tethered to concrete, verifiable (or falsifiable) propositions about technical properties of OOXML (and ODF), about process issues, and so on. The contributions of this class are by no means always well reasoned or insightful, but they are all recognizably arguments which can be refuted.

And there is a third class, which contains some of the most inventive ad hominem attacks, imaginative name-calling, and insidious smears I have ever seen outside of recent U.S. national electoral politics.

What is striking and puzzling to me is how cleanly the three different rhetorical styles seem to me to map to different positions (let me call them left and right, without mapping left/right into pro/con) on OOXML. If you see a statement that could in principle be verified or falsified by an impartial third party, there is a much better than even chance that it’s from a contribution arguing, let us call it, the left-hand position. And if you see an infuriatingly smug piece which avoids addressing actual technical issues and confines itself to name-calling, slander, and innuendo, there is a very strong chance that it’s taking a right-hand position. (I’m speaking here mostly of bloggers and essayists, not of those who have commented on various blog posts — the blog comments are uniformly smug and infuriating regardless of handedness.)

I have tried not to say explicitly which position each of these styles is associated with, because if Enrique and I are right then all you have to do is (re)read some of the rhetorical barrages of the last year or two to see which is which. (Those of my readers who care about the outcome, or about the health and reputation of the institutions involved, may find this too painful to contemplate. I’m sorry; you don’t have to if you don’t want to.) And if we’re wrong (and we may be — we only had stomach for an afternoon’s worth of the stuff, not more), then there’s no fairness in pointing the finger of blame at just one side for the incivility that can be seen in the discussion of OOXML.

And in any case, as Enrique points out with a certain malicious glee, “Most people who don’t look into it will assume that the merits of the technical arguments must be with the first or second groups, because they don’t descend (or more correctly they descend less often) to slander and name-calling. But there is no rule that says that just because those on one side of an argument argue unfairly or irrelevantly, or act with infuriating disregard of basic rules of courteous technical discussion, then it’s safe to conclude that they have the wrong end of the technical stick, any more than it’s safe to conclude that an invalid argument has reached a false conclusion. Unfairness and low behavior don’t mean people aren’t right in the end.”

Enrique may be right. But watching the OOXML debates serves as a salutary reminder that when some in a technical discussion descend to name-calling and slander (and what better to spice up a blog with?), the animosities created during the process will hover over the result of the decision for a long time.

Memo to self: in future, try to be calmer and more fair in discussions.

(“Yeah,” I hear Enrique mutter. “Leave the dirty work to me.”)

Messages in a Bottle

CMSMcQ's klog
