Treating our information with the care it deserves

[21-22 July 2008]

I don’t make a habit of recording here all of the interesting, useful, or amusing things I read. But I am quite taken with Steve Pepper’s account of the situation in which many large organizations find themselves. In a blog post devoted to a different topic (the history of Norway’s vote on OOXML), he describes (his understanding of) one organization’s point of view and motivations:

They are a big MS Office user, they participated in TC45 (the Ecma committee responsible for OOXML) and they clearly feel that OOXML is important to them.

I can understand why. An enormous amount of their intellectual capital is tied up in proprietary formats, in particular Excel, that have been owned and controlled by a vendor for the last 20 or so years. StatoilHydro has literally had no way of getting at its own information, short of paying license fees to Microsoft. Recently the company has started to realize the enormity of the mistake it has made in not treating its information with the care and respect it deserves.

As he points out, they are of course not alone in having made this mistake, particularly if one includes proprietary formats beyond those of Office, and vendors other than Microsoft.

Several points occur to me:

  • It’s easy for me to feel superior and to lack interest in the problems of converting legacy data: I stopped using proprietary formats about twenty years ago, not very long after I had acquired a personal computer and gained the opportunity to start using them in the first place, and so, with very few exceptions, pretty much every piece of information I have created over my career is still readable. (A prospective collaboration did collapse once when, at the end of a full-day meeting, as we were deciding who would draft which piece of the grant proposal, I asked what DTD we would be using, and my soon-to-be-former prospective collaborators said they had planned to be working in a proprietary word processor.) But feeling superior is not really a useful analysis of the situation.

Enrique: Are you saying that there are no proprietary data formats on mainframes? Whom are you trying to kid?

Me: No, but all my mainframe usage was on university mainframes; we don’t seem to have been able to afford any seriously proprietary software, at least any that was interesting to me. I was mostly doing document preparation, and later on database work. And for a while I maintained the terminal translation tables.

Enrique: The what?

Me: Never mind. There used to be things called terminals, and … Sorry I brought it up.

Enrique: And your databases didn’t use proprietary formats?

Me: Internally, sure. But they could all dump the data in a reusable text file format. I think I translated the Spires dump of my bibliographic data to XML once. Or maybe that was just something that went on the Someday pile.

Enrique: You’re right. Feelings of superiority are not really an adequate analysis of a complex situation. Even if the feelings were justified, which in this case, Bucko, does not seem to be the case.

  • The right solution for these organizations is, perhaps, to move away from such closed systems once and for all, and to use semantically richer markup. Certainly that’s where my immediate sympathies lie. It’s not impossible: lots of organizations use surprisingly rich markup for data they care about.
  • But how are they to get there, starting from where they are now? Even if the long-term benefits are substantial (which is close to self-evident for me, but is likely to sound very unproven to any serious organizational IT person), you have to get through the short term in order to reach the long term. So the ideal migration path starts paying off very quickly, even before you’ve gone very far. (Paoli’s Law: if people put five cents of effort in, they want to see a nickel in return, and quickly.) Can there be such a migration path? Or is going cold turkey the only way to go?
  • The desire to get as much benefit for as little work as possible seems to make everyone with a legacy-data problem easy prey for snake-oil salesmen. I don’t see any prospect of this changing, though, ever.

Enrique: Nah. Snake oil, now there’s a growth stock.


XSD 1.1 is in Last Call

Yesterday the World Wide Web Consortium published new drafts of its XML Schema Definition Language (XSD) 1.1, as ‘last-call’ drafts.

The idiom has an obscure history, but is clearly related to the last call for orders in pubs which must close by a certain hour. The working group responsible for a specification labels it ‘last call’, as in ‘last call for comments’, to indicate that the working group believes the spec is finished and ready to move forward. If other working groups or external readers have been waiting to review the document, thinking “there’s no point reviewing it now because they are still changing things”, the last call is a signal that the responsible working group has stopped changing things, so if you want to review it, it’s now or never.

The effect, of course, can be to evoke a lot of comments that require significant rework of the spec, so that in fact it would be foolish for a working group to believe they are essentially done when they reach last call. (Not that it matters what the WG thinks: a working group that believes last call is the end of the real work will soon be taught better.)

In the case of XSD 1.1, this is the second last-call publication both for the Datatypes spec and for the Structures spec (published previously as last-call working drafts in February 2006 and August 2007, respectively). Each elicited scores of comments: by my count there are 126 Bugzilla issues on Datatypes opened since 17 February 2006, and 96 issues opened against Structures since 31 August 2007. We have closed all of the substantive comments: most by fixing the problem, and a few (sigh) by discovering either that we could not reach consensus on what to do about the problem (or, in some cases, on whether there was really a problem before us), or that we could not make the requested change without more delay than seemed warrantable. There are still a number of ‘editorial’ issues open; these are expected not to affect the conformance requirements of the spec or to change the results of anyone’s review, and we therefore hope to be able to close them after going to last call.

XSD 1.1 is, I think, somewhat improved over XSD 1.0 in a number of ways, ranging from the very small but symbolically very significant to much larger changes. On the small but significant side: the spec has a name now (XSD) that is distinct from the generic noun phrase used to describe the subject matter of the spec (XML schemas), which should make it easier for people to talk about XML schema languages other than XSD without confusing some listeners. On the larger side:

  • XSD 1.1 supports XPath 2.0 assertions on complex and simple types (a sketch follows this list). The subset of XPath 2.0 defined for assertions in earlier drafts of XSD 1.1 has been dropped; processors are expected to support all of XPath 2.0 for assertions. (There is, however, a subset defined for conditional type assignment, although here too schema authors are allowed to use, and processors are allowed to support, full XPath.)
  • ‘Negative’ wildcards are allowed, that is, wildcards which match all names except those in some specified set. The excluded names can be listed explicitly, or can be “all the elements defined in the schema” or “all the elements present in the content model”.
  • The xs:redefine element has been deprecated, and a new xs:override element has been defined which has clearer semantics and is easier to use.
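
By way of illustration, here is a minimal sketch of what the first two items might look like in a schema document. The type names and the particular tests are my inventions for the example, not anything taken from the spec:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- An assertion on a complex type: min must not exceed max -->
  <xs:complexType name="range">
    <xs:attribute name="min" type="xs:integer" use="required"/>
    <xs:attribute name="max" type="xs:integer" use="required"/>
    <xs:assert test="@min le @max"/>
  </xs:complexType>

  <!-- A 'negative' wildcard: accept any element not declared
       in this schema, validating it laxly -->
  <xs:complexType name="extensible">
    <xs:sequence>
      <xs:any notQName="##defined" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>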

Some changes vis-à-vis 1.0 were already visible in earlier drafts of 1.1:

  • The rules requiring deterministic content models have been relaxed to allow wildcards to compete with elements (although the determinism rule has not been eliminated completely, as some would prefer).
  • XSD 1.1 supports both XML 1.0 and XML 1.1.
  • A conditional inclusion mechanism is defined for schema documents, which allows schema authors to write schema documents that will work with multiple versions of XSD (a sketch follows this list). (This conditional inclusion mechanism is not part of XSD 1.0, and cannot be added to it by an erratum, but there is no reason a conforming XSD 1.0 processor cannot support it, and I encourage makers of 1.0 processors to add support for it.)
  • Schema authors can specify various kinds of ‘open content’ for content models; this can make it easier to produce new versions of a vocabulary with the property that any document valid against the new vocabulary will also be valid against the old.
  • The Datatypes spec includes a precisionDecimal datatype intended to support the IEEE 754R floating-point decimal specification recently approved by IEEE.
  • Processors are allowed to support primitive datatypes, and datatype facets, additional to those defined in the specification.
  • We have revised many, many passages in the spec to try to make them clearer. It has not been easy to rewrite for clarity while retaining the kind of close correspondence to 1.0 that allows the working group and implementors to be confident that the rewrite has not inadvertently changed the conformance criteria. Some readers will doubtless wish that the working group had done more in this regard. But I venture to hope that many readers will be glad for the improvements in wording. The spec is still complex and some parts of it still make for hard going, but I think the changes are trending in the right direction.
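
Since the conditional inclusion mechanism is the one I would most like to see 1.0 processors pick up, here is a sketch of how it might be used; the type name and the test are mine, invented for the example. A schema document can carry both a 1.1 declaration and a 1.0 fallback, and the versioning attributes determine which one a given processor retains:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">

  <!-- Retained only by processors supporting XSD 1.1 or later -->
  <xs:simpleType name="celsius" vc:minVersion="1.1">
    <xs:restriction base="xs:decimal">
      <xs:assertion test="$value ge -273.15"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Fallback retained by processors supporting only versions
       before 1.1, i.e. XSD 1.0 processors -->
  <xs:simpleType name="celsius" vc:maxVersion="1.1">
    <xs:restriction base="xs:decimal"/>
  </xs:simpleType>

</xs:schema>

A 1.0 processor which does not support the mechanism will, of course, reject this schema document (two definitions of celsius, and an unknown xs:assertion facet), which is exactly why I encourage makers of 1.0 processors to add support for it.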

If you have any interest in XSD, or in XML schema languages in general, I hope you will take the time to read and comment on XSD 1.1. The comment period runs through 12 September 2008. The specs may be found on the W3C Technical Reports index page.

Caspar

[8 April 2008]

I had another bad dream last night. Enrique came back.

“I don’t think you liked Cheney very much. Or Eric van der Vlist, either, judging from his comment. So I’ve written another conforming XSD 1.0 processor, in a different vein.” He handed me a piece of paper which read:

#!/bin/sh
# Decline all input, sight unseen.
echo "Input not accepted; unknown format."

“This is Caspar,” he said.

“Caspar as in Weinberger?”

“Hauser. As you can see, Caspar doesn’t have the security features of Cheney, but it’s also conforming.”

I should point out for readers who have not encountered the figure of Kaspar Hauser that he was a mysterious young man found in Nuremberg in the 1820s, raised allegedly without language or human contact, who never quite found a way to fit into society, and is thus a convenient focal point for meditations on language and civilization by artists as diverse as Werner Herzog, Paul Verlaine, and Elizabeth Swados.

“Conforming? I don’t think I want to know why you think so.”

“Oh, sure you do. Don’t be such a wet blanket.”

“I gather you’re going to tell me anyway. OK, if there’s no way out of it, let’s get this over with. Why do you think Caspar is a conforming XSD processor?”

“I’m not required to accept XML, right? Because that, in the logic of the XSD 1.0 spec, would impede my freedom to accept input from a DOM or from a sequence of SAX events. And I’m not required to accept any other specific representation of the infoset. And I’m also not required to document the formats I do accept.”

“There don’t seem to be any conditionals here: it looks at first glance as if Caspar doesn’t support any input formats.” (Enrique, characteristically sloppy, seems to have thought that Hauser was completely alingual and thus could not understand anything at all. That’s not so. But I didn’t think it was worth my time, even during a bad dream, to argue with Enrique over the name of his program.)

“Right. Caspar understands no input formats at all. I love XSD 1.0; I could write conforming XSD 1.0 processors all day; I don’t understand why some people find it difficult!”

Skippy returns (XML documents and infosets)

[5 March 2008]

My evil twin Skippy continues his reaction to the issue Henry Thompson has raised against the SML specification. He has lost the first point, in which he claimed that the XML spec doesn’t actually say that XML documents are character sequences.

Well, OK. But even if only character sequences can be XML documents stricto sensu, how does referring to information sets instead of XML documents actually help? Any system which handles XML documents in their serial form can also, if it wishes to, handle other representations of the same information without any loss of conformance.

At this point, I interrupt to remind Skippy that some people, at least, believe that if a spec requires that its implementations work upon XML documents, then by that same rule it forbids them from accepting DOM objects, or SAX event streams, and working on them. This is roughly how the XSD spec came to describe an XML syntax for schema documents, while carefully denying that its input needed to be in XML.

Skippy responds:

Suppose I claim to be a conforming implementation of the XYZ spec, which defines document as SML does (as well-formed XML), rather than as XSD does (as “an information set as defined in [XML-Infoset]”).

Actually, I interject, this isn’t offered as a definition of document, just as a characterization of the input required for schema-validity assessment. Skippy points out, in return, that “as defined in [XML-Infoset]” is more hand-waving, since in fact the Infoset spec does not offer any definition of the term information set.

Eventually, we stop squabbling and I let him continue his argument.

I allow people to define new documents through a DOM or SAX interface, and I work with those documents just as I work with documents I acquire through a sequence of characters conforming to the grammar of the XML spec.

Now, suppose someone who believes that angle brackets are the true test of XMLness decides to challenge my claim to conformance. They will claim, I believe, that I am working with things that are not really XML documents, and thus violating the XYZ spec, which says that they MUST be XML documents. My response will be to ask why they think that. “Show us the data!” they will cry. And I will provide a sequence of characters which is indisputably an XML document.

The claim that I am not conforming is necessarily based on a claim about what I am doing ‘under the covers’, rather than at my interfaces to the outside world. But precisely because it’s a question of what I do “inside”, there is no way for an outside observer to know whether I am working directly from the DOM data structure or SAX event stream, or am making a sequence of characters internally and working from that. One reason the distinction between serialized XML documents and other representations causes so little trouble in practice, even in arguments about conformance, is that if XML is accepted and produced at the interfaces when requested, no argument can forbid other forms from being accepted at those interfaces as well.

I have always thought of myself as a reasonably good language lawyer, but I am not sure how to refute Skippy’s claim. Can any readers suggest ways to prove that it matters? Or is Skippy right that SML will do better to refer to the XML spec, to describe its inputs, than to the Infoset spec?

Are XML documents character sequences?

[5 March 2008]

Henry Thompson has raised an issue against the SML spec, asking why SML defines the term document as

A well-formed XML document, as defined in [XML].

Henry suggests:

‘document’ as defined in the XML spec. is a character sequence — this seems unnecessarily restrictive — surely a reference to the Infoset spec. would be more appropriate here?

At this point, a voice which I identify as that of my evil twin Skippy sounds in my head, to wonder whether it really makes a difference or not.

I disagree with the premise that the XML spec defines “an XML document” solely as a character sequence. What I remember from the XML spec about the fundamental nature of XML documents is mostly energetic hand-waving.

Hmm. What is the truth of the matter?

I think Skippy has a point here, at least about the hand-waving. Both the first edition of the XML spec, with which I am probably most familiar, and the current (fourth) edition say “A textual object is a well-formed XML document if” certain conditions are met. Surely the phrase “textual object” begs the question.

Both versions of the spec also hyperlink the words “XML document” to what seems to be a definition of sorts:

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

The phrase “data object” may be ever so slightly less suspect than “textual object”, but I’m feeling quite a breeze here.

Certainly this is consistent with my recollection that I actively resisted Dan Connolly’s repeated request to nail down, once and for all, what counted as an “XML document”. Most SGML users, I think, were inclined to regard the terms document and SGML document as denoting not just sequences of characters, but also any data structures, with recognizable elements, attributes, parents, siblings, children, and so forth, in which one might choose to represent the same information. And I certainly planned to do the same for the term XML document. If anyone had told me that a DOM or SAX interface does not expose an XML document but something which must be religiously distinguished from XML documents, I would have said “Bullshit. The document production defines the serialization of a document, not the document.”

Some people might say, given the level of abstraction implicit in the XML WG’s usage of the term document (or at least in mine), that we regarded the term document as denoting any instance of the data model. (Of course, others will immediately chime in to say that SGML and XML don’t have data models. But as I’ve said elsewhere, at this point in the discussion I don’t have any idea what the second group means by the words “have a data model”, so I can’t say one way or another.) Speaking for myself, I think I would have been horrified by the suggestion that once one had read a document into memory and begun working with it, one was no longer working with an XML document.

But nevertheless, the XML spec may have nailed things down in just the way I thought I had successfully resisted doing. The conditions imposed on a data object, in order for it to be an XML document, include that it be well-formed, which in turn entails that

Taken as a whole, it matches the production labeled document.

It’s embarrassing for those who like to think of XML documents in a slightly more abstract way, but it’s not clear to me how anything but a character sequence can match the ‘document’ production.
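
For reference, the production in question, production [1] of the XML spec, reads:

document ::= prolog element Misc*

and prolog, element, and Misc all bottom out, eventually, in sequences of characters.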

Sorry, Skippy, but while I think Henry’s note over-simplified things, he is probably more nearly right than you.

This probably won’t change by one whit the way I use the term XML document. But it’s nice to learn some new things now and then, even if what you learn is that you failed at something you attempted, ten years ago, and that as a consequence you’ve been wrong about a related question ever since.