Cheney

[22 March 2008]

My evil twin Enrique [formerly known as Skippy] dropped by again the other day.

“I’ve written the world’s smallest conforming XSD 1.0 implementation,” he announced.

“Really?”

“And it’s open source. Wanna see it? I call it Cheney. (Think of it as a kind of hommage.)”

I began to have a bad feeling about this. “Cheney?”

“Yeah. He da man, when it comes to fast schema validation.”

“Tell me about it.”

“Well, you know how proud the XML Schema working group is that the spec doesn’t require that XSD processors accept schema documents or input that’s actually in XML?”

“Er, some of the working group, yes.”

“And you know how carefully 1.0 was drafted so as to ensure that processors can be written for very specialized environments? For example, highly security-conscious ones. So there’s no required output format, no required API — nothing is required.”

“Ah, yes, right.”

“And you will recall, of course, how strongly some in the working group have resisted any suggestion in the spec that implementations be required to document anything.”

“I remember.”

“Because that would be telling people how to write their documentation.” He was smiling.

I refused to take that bait.

“And the working group is so proud of achieving perfect interoperability in XSD 1.0.”

As it happens, I prefer not to talk about that, especially not with Enrique when he is smiling like a cat working with a mouse. So I contented myself with asking “So tell me about Cheney. You mentioned security a moment ago. Is that the connection to the Vice President?”

“Yes, Cheney is intended for deployment in high-security situations. It’s got rock-solid security, and it’s fully conforming to XSD 1.0. Oh, and it’s highly optimized, so it’s blazingly fast.”

“How does it work?”

He handed me a piece of paper on which was written:

#!/bin/sh
echo "You are not authorized for that information."

“That’s it?” I asked.

“Yes. Sweet, isn’t it? And fast. I figure I can sell this for a bundle, to people who need a support contract.”

“Uh, you think this is conforming?”

“Sure is,” he said.

“But it doesn’t seem to do anything.”

“Well, I’m not required to expose all of the PSVI, or any particular part of it. I’m not required to expose any of the PSVI. So I don’t.”

“But —”

“And if it can be proven that some particular part of the PSVI (a) isn’t exposed by a given implementation, and (b) doesn’t affect the parts of the PSVI that are exposed by that implementation, then I can optimize by not actually performing the calculation that would otherwise be necessary. Since I don’t expose anything, I get some pretty good optimization.”

“So I see. But isn’t — well, but having a schema validator that doesn’t actually do anything — isn’t that kind of useless?”

“So?”

Long pause.

“My customers show up with a checklist saying they need a conforming processor, and I give it to them. They have been trained to think that conformance or non-conformance is a useful way to describe processors. Most of them think it’s the most important way. And you can understand why. Lots of specs go to great lengths to try to ensure that conforming processors are useful. For those specs, asking first of all about conformance is a very useful way to find useful software. Is it my fault that the XSD 1.0 spec doesn’t bother to make sure that conformance is a useful idea? Eventually, customers catch on; I don’t get much repeat business. But they don’t usually catch on until after their checks have cleared.”

“Isn’t that cheating?”

“Cry me a river, liberal. If you had wanted conforming processors to be useful, and to give the same results as other conforming processors, then you would have done a better job defining conformance, wouldn’t you?”

He paused and then declaimed in the tones of a television advertisement: “Cheney. What you don’t know, we won’t tell you. Interoperate with that.”

Skippy returns (XML documents and infosets)

[5 March 2008]

My evil twin Skippy continues his reaction to the issue Henry Thompson has raised against the SML specification. He has lost the first point, in which he claimed that the XML spec doesn’t actually say that XML documents are character sequences.

Well, OK. But even if only character sequences can be XML documents stricto sensu, how does referring to information sets instead of XML documents actually help? Any system which handles XML documents in their serial form can also, if it wishes to, handle other representations of the same information without any loss of conformance.

At this point, I interrupt to remind Skippy that some people, at least, believe that if a spec requires that its implementations work upon XML documents, then by that same rule it forbids them from accepting DOM objects, or SAX event streams, and working on them. This is roughly how the XSD spec came to describe an XML syntax for schema documents, while carefully denying that its input needed to be in XML.

Skippy responds:

Suppose I claim to be a conforming implementation of the XYZ spec, which defines document the way SML does, as well-formed XML, instead of as XSD does, as “an information set as defined in [XML-Infoset]”.

Actually, I interject, this isn’t offered as a definition of document, just as a characterization of the input required for schema-validity assessment. Skippy points out, in return, that “as defined in [XML-Infoset]” is more hand-waving, since in fact the Infoset spec does not offer any definition of the term information set.

Eventually, we stop squabbling and I let him continue his argument.

I allow people to define new documents through a DOM or SAX interface, and I work with those documents just as I work with documents I acquire through a sequence of characters conforming to the grammar of the XML spec.

Now, suppose someone who believes that angle brackets are the true test of XMLness decides to challenge my claim to conformance. They will claim, I believe, that I am working with things that are not really XML documents, and thus violating the XYZ spec, which says that they MUST be XML documents. My response will be to ask why they think that. “Show us the data!” they will cry. And I will provide a sequence of characters which is indisputably an XML document.

The claim that I am not conforming is necessarily based on a claim about what I am doing ‘under the covers’, rather than at my interfaces to the outside world. But precisely because it’s a question of what I do “inside”, there is no way for an outside observer to know whether I am working directly from the DOM data structure or SAX event stream, or am making a sequence of characters internally, and working from that. One reason the distinction between serialized XML documents and other representations causes so little trouble in practice, even in arguments about conformance, is that if XML is accepted and produced at the interfaces when requested, there is no way to make any argument that forbids other forms from being accepted at those interfaces as well.
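Skippy’s point can be made concrete with a minimal sketch (in Python, using only the standard library; the function name and the trivial stand-in for validity assessment are invented purely for illustration, not drawn from any real processor):

import xml.etree.ElementTree as ET

def assess(doc):
    # Accept either a character sequence or an already-built element tree.
    # Internally we could serialize the tree and work on characters, or
    # parse the characters and work on a tree; the caller cannot tell which.
    if isinstance(doc, str):
        root = ET.fromstring(doc)   # characters in, tree inside
    else:
        root = doc                  # tree in, tree inside
    # A trivial stand-in for real schema-validity assessment:
    return (root.tag, dict(root.attrib), root.text)

# The same information, offered at the interface in two different forms:
as_characters = '<p id="a1">Hello</p>'
as_tree = ET.Element('p', {'id': 'a1'})
as_tree.text = 'Hello'

print(assess(as_characters))   # ('p', {'id': 'a1'}, 'Hello')
print(assess(as_tree))         # ('p', {'id': 'a1'}, 'Hello')

From outside, the two calls are indistinguishable; whatever the processor does between its interfaces stays between its interfaces.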

I have always thought of myself as a reasonably good language lawyer, but I am not sure how to refute Skippy’s claim. Can any readers suggest ways to prove that it matters? Or is Skippy right that SML will do better to refer to the XML spec, to describe its inputs, than to the Infoset spec?

Are XML documents character sequences?

[5 March 2008]

Henry Thompson has raised an issue against the SML spec, asking why SML defines the term document as

A well-formed XML document, as defined in [XML].

Henry suggests:

‘document’ as defined in the XML spec. is a character sequence — this seems unnecessarily restrictive — surely a reference to the Infoset spec. would be more appropriate here?

At this point, a voice which I identify as that of my evil twin Skippy sounds in my head, to wonder whether it really makes a difference or not.

I disagree with the premise that the XML spec defines “an XML document” solely as a character sequence. What I remember from the XML spec about the fundamental nature of XML documents is mostly energetic hand-waving.

Hmm. What is the truth of the matter?

I think Skippy has a point here, at least about the hand-waving. Both the first edition of the XML spec, with which I am probably most familiar, and the current (fourth) edition, say “A textual object is a well-formed XML document if” certain conditions are met. Surely the phrase “textual object” begs the question.

Both versions of the spec also hyperlink the words “XML document” to what seems to be a definition of sorts:

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

The phrase “data object” may be ever so slightly less suspect than “textual object”, but I’m feeling quite a breeze here.

Certainly this is consistent with my recollection that I actively resisted Dan Connolly’s repeated request to nail down, once and for all, what counted as an “XML document”. Most SGML users, I think, were inclined to regard the terms document and SGML document as denoting not just sequences of characters, but also any data structure with recognizable elements, attributes, parents, siblings, children, and so forth, in which one might choose to represent the same information. And I certainly planned to do the same for the term XML document. If anyone had told me that a DOM or SAX interface does not expose an XML document but something which must be religiously distinguished from XML documents, I would have said “Bullshit. The document production defines the serialization of a document, not the document.”

Some people might say, given the level of abstraction implicit in the XML WG’s usage of the term document (or at least in mine), that we regarded the term document as denoting any instance of the data model. (Of course, others will immediately chime in to say that SGML and XML don’t have data models. But as I’ve said elsewhere, at this point in the discussion I don’t have any idea what the second group means by the words “have a data model”, so I can’t say one way or another.) Speaking for myself, I think I would have been horrified by the suggestion that once one had read a document into memory and begun working with it, one was no longer working with an XML document.

But nevertheless, the XML spec may have nailed things down in just the way I thought I had successfully resisted doing. The conditions imposed on a data object, in order for it to be an XML document, include that it be well-formed, which in turn entails that

Taken as a whole, it matches the production labeled document.

It’s embarrassing for those who like to think of XML documents in a slightly more abstract way, but it’s not clear to me how anything but a character sequence can match the ‘document’ production.
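For the record, the production in question (production [1] of the XML spec) reads:

document ::= prolog element Misc*

and prolog, element, and Misc all bottom out, sooner or later, in productions over characters.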

Sorry, Skippy, but while I think Henry’s note over-simplified things, he is probably more nearly right than you.

It probably won’t change by one whit the way I use the term XML document. But it’s nice to learn some new things now and then, even if what you learn is that you failed at something you attempted, ten years ago, and that as a consequence, you’ve been wrong about a related question ever since.

Scenes from a Recommendation 4: name characters

The recent publication of XML 1.0 Fifth Edition, and the ensuing discussion of how best to define the set of name characters, has made me think about a proposal that came up during the original development of XML 1.0.

We had taken as our task the definition of a grammar production to select all and only the Unicode characters suitable for use in identifiers. For our SGML-influenced minds, suitability essentially meant being a letter or something letter-like, by analogy with the characters A through Z and a through z allowed in SGML’s reference concrete syntax, or a numeric symbol, by analogy with 0 through 9. Now, it is an empirical fact that any such grammar production will be complicated, especially if it excludes not just unsuitable Unicode characters but unassigned code points.
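To make the size question concrete before the anecdote continues, here is a back-of-the-envelope sketch in Python. The ranges are the NameStartChar ranges of the Fifth Edition rather than the first edition’s productions [84] through [89], and the one-bit-per-code-point table is only one naive way a lexer might do it, so treat the numbers as illustrative, not authoritative.

# Rough sketch of the scanner-table question. The ranges below are the
# NameStartChar ranges of XML 1.0 Fifth Edition (illustrative only).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Naive one-bit-per-code-point table covering everything the ranges reach:
top = max(hi for _, hi in NAME_START_RANGES) + 1
full_table_bytes = top // 8

# The ASCII-only alternative needs a table covering only 128 code points:
ascii_table_bytes = 128 // 8

allowed = sum(hi - lo + 1 for lo, hi in NAME_START_RANGES)
print(f"code points allowed as name-start characters: {allowed}")
print(f"naive bit-table size, full repertoire: {full_table_bytes} bytes")
print(f"naive bit-table size, ASCII only:      {ascii_table_bytes} bytes")

On this naive layout the full table runs to roughly 120 kilobytes against 16 bytes for ASCII, which is the shape of the argument that was about to be made.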

As we were contemplating the unappetizing lists of ranges and properties that became productions [84] through [89] of the XML spec, someone — my memory says it was Tim Bray, but I have not tried to verify this memory against the historical record, and I don’t remember whether it was in email or on a phone call — said “Well, I realize I risk being labeled a xenophobe for suggesting this, but you know, the lexer has to have character tables for recognizing names and other tokens, and this rule is going to make those character tables just huge. We all know that in the ideal, normal case, the user is not going to be eyeballing XML source directly, you’re going to want to have some sort of presentation interface. So the element names are just codes, not utterances in natural language. They might as well be E001, E002, E003, and so on, as far as the parser is concerned. So why don’t we just do the same thing as SGML’s reference concrete syntax and restrict names to ASCII characters? That would make the tables you need for the lexical scanner a lot smaller. Any UCS character can go in the data, and the data is all the human reader needs to care about anyway. It’s only in the element and attribute names that we make the restriction, and the proposal is not really Anglo-centric since element and attribute names don’t have to be natural-language words; they shouldn’t be part of the user interface in the first place.”

I remember having a sinking heart at this point. The analysis is so plausible, and yet wrong in so many ways. One of the design goals of XML is that it should be legible to humans, and having identifiers that take the form of words in a natural language is an important tool in making markup legible. E001, E002, and so on just don’t cut it. (A librarian did suggest once in all seriousness that SGML vocabularies should have numeric identifiers, like the numeric field identifiers in MARC records. 245 means main title, everyone knows that, but unlike “main title” it doesn’t have an Anglophone bias. But as a design principle for actual vocabularies, this seemed to me too much like ensuring that all languages are treated equally by ensuring that the vocabulary is hard to understand for everyone.)

Also, while I have seen some very nice interfaces to marked up text, I’ve spent most of my writing life over the last twenty years working with SGML and XML source, mostly in emacs, and no interface that hides the markup has ever made me want to change. (For a while I did use the hide-tags mode in XMetal for rereading and copy editing, but I could never draft in it, and when I upgraded my operating system and XMetal ceased to run, I didn’t actually spend a lot of time looking for a new copy, or a new GUI.) I don’t want to hide the tags. The tags are what make things work. There is a story that a worker came into Bertolt Brecht’s apartment in Berlin once to install some curtains, and after doing so he started to put up a valance to hide the curtain rod and the hooks and strings. Brecht came into the room and made him stop and take it back down, with (as I remember it — I have not spent any time looking for the source of this story to make sure I’m getting the details right) the words “Never hide the mechanism!” When it comes to markup, at least, I’m with Brecht.

So for me, and anyone like me, the claim “no human really ever looks at this stuff” is false (and also conveys indirectly the suggestion that we counter-examples aren’t really human and thus don’t need to be accounted for). Even if everyone in the world used tag-hiding editors, the programmers who create those editors and the stylesheet authors who specify the mapping from tags to display form will spend time looking at the raw markup. And last I looked, they were all humans, too.

But none of those arguments were going to go anywhere against that brutally simple argument about the size and complexity of the scanner tables, and the seductive suggestion that it really would be inhumane to require users to use tag-visible interfaces. So for a while I was really worried that the proposal for ASCII-only identifiers was going to carry the day.

I’m not sure whether my memory of what happened next is what happened in reality, or whether I’ve just persuaded myself to accept as real a fantasy of what could have happened, if only someone had had the wit to think of it, and the nerve to say it, at the time.

The way I remember it is this. Someone (I don’t remember who) responds to the suggestion carefully, thoughtfully, and without screaming. They agree that it’s important to keep the lexical scanner simple, and that identifiers really don’t need to be used to represent arbitrary natural-language words or phrases. And they conclude by saying “So I think I’m willing to support this proposal, if we can improve it a little bit. Right now, the natural way to represent the scanner table for ASCII-only identifiers is to use a table 128 octets wide. But that’s a lot bigger than we need. So I propose that we don’t allow all the characters A through Z in both upper and lower case. If we restrict ourselves to A through O, uppercase only, then we can get by with a much smaller table. So I’ll support the proposal if we restrict identifiers not to any ASCII letter, but to A through O uppercase only, i.e. U+0041 through U+004F. And since no one wants to use natural-language words for identifiers, that’s not a problem, right?”

And then, silence. You can just about hear the proponents of ASCII-only identifiers starting to say “but then you couldn’t use ‘q’ or ‘quote’ or even ‘html’ or ‘body’ as element names,” before biting their tongues as they realize that they can’t say that without giving away the store.

Whether it happened the way I remember it or in some other less dramatic and memorable way, the end is the same: after people thought about it for a while, the proposal for ASCII-only identifiers evaporated, like dew in bright sunlight.

A little formalism (variable names)

Many readers associate the use of variables with mathematics and feel threatened by paragraphs that begin “Let E be … and F be …. Then …” And similarly with technical terms: when a text defines and uses a lot of technical terms, it can be very daunting to the first-time reader (and many others).

So it’s understandable that sometimes, in trying to keep a text accessible to the reader, one works hard to avoid having to introduce variables to refer to things, and to avoid relying on technical terms with special meanings.

But sometimes such efforts backfire. In the XSD (XML Schema Definition Language) 1.0 spec, you end up with rules that read like this:

Validation Rule: Element Locally Valid (Type)

For an element information item to be locally ·valid· with respect to a type definition all of the following must be true:

1 The type definition must not be ·absent·;
2 It must not have {abstract} with value true.
3 The appropriate case among the following must be true:

3.1 If the type definition is a simple type definition, then all of the following must be true:

3.1.1 The element information item’s [attributes] must be empty, excepting those whose [namespace name] is identical to http://www.w3.org/2001/XMLSchema-instance and whose [local name] is one of type, nil, schemaLocation or noNamespaceSchemaLocation.
3.1.2 The element information item must have no element information item [children].
3.1.3 If clause 3.2 of Element Locally Valid (Element) (§3.3.4) did not apply, then the ·normalized value· must be ·valid· with respect to the type definition as defined by String Valid (§3.14.4).
3.2 If the type definition is a complex type definition, then the element information item must be ·valid· with respect to the type definition as per Element Locally Valid (Complex Type) (§3.4.4);

I would say “Maybe it’s just me, but I find that kind of hard to read,” but that would be disingenuous. There is ample evidence from the last eight or nine years that I am not the only reader of the XSD 1.0 spec who finds parts of it hard to read. This is a relatively mild example, as the XSD spec goes. But if we can overcome our fear of formality, the text can become a bit simpler. Two changes in particular seem useful here.

  • Introduce the names E for the element and T for the type, and use them.
  • Follow the example of most specs that define and use namespaces: specify and use a conventional prefix to represent a given namespace, and say once and for all, when that prefix is identified, that in practice the user can use any prefix they wish (or none). Then just use the QNames, rather than writing out the namespace in full each time you have to talk about names in that namespace.

Applying these rules to the fragment just given, we get something a bit easier to read.

Validation Rule: Element Locally Valid (Type)

For an element information item E to be locally ·valid· with respect to a type definition T all of the following must be true:

1 T is not ·absent·;
2 T does not have {abstract} with value true.
3 The appropriate case among the following is true:

3.1 If T is a simple type definition, then all of the following are true:

3.1.1 E’s [attributes] are empty, excepting those named xsi:type, xsi:nil, xsi:schemaLocation, or xsi:noNamespaceSchemaLocation.
3.1.2 E has no element information item [children].
3.1.3 If clause 3.2 of Element Locally Valid (Element) (§3.3.4.3) did not apply, then the ·normalized value· is ·valid· with respect to T as defined by String Valid (§3.16.4).
3.2 If T is a complex type definition, then E is ·valid· with respect to T as per Element Locally Valid (Complex Type) (§3.4.4.2);
4 If E has an xsi:type [attribute] and does not have a ·governing element declaration·, then the ·actual value· of xsi:type ·resolves· to T.

I won’t claim that the text has become easy to read and follow, but I think there is one salient difference: in the first text above, my first difficulty as a reader is understanding what the text is trying to say, and once I have figured that out, I may or may not have energy left to try to understand why it’s saying that. In the second text, it’s easier (I think) to understand what the individual clauses are saying. The reader still has the task of understanding why, but at least the difficulties of comprehension are now those related to the intrinsic difficulty of the topic, without the additional barrier of complex syntax.

Another tactic adopted by some in trying to make difficult material easier to read is to avoid defining technical terms. The XSD 1.0 spec raises this to a fine art; often, the easiest way to understand how a given rule came to be formulated as it is, is to imagine that it was first written in a simple, straightforward clause using technical terms, and then the technical terms were eliminated and their definitions inserted inline. And then the process was repeated once, or twice, or more. The result is mostly devoid of difficult or obscure technical usages, but it’s often also a sentence only an eighth-grade English teacher teaching the unit on sentence diagramming could love.

If we re-introduce appropriate technical terms, this process can be reversed. Sometimes the introduction of even a single technical term can do a surprising amount of good.

Take the following example from the XSD spec:

2.3.1 The element declaration is local (i.e. its {scope} must not be global), its {abstract} is false, the element information item’s [namespace name] is identical to the element declaration’s {target namespace} (where an ·absent· {target namespace} is taken to be identical to a [namespace name] with no value) and the element information item’s [local name] matches the element declaration’s {name}.

In this case the element declaration is the ·context-determined declaration· for the element information item with respect to Schema-Validity Assessment (Element) (§3.3.4) and Assessment Outcome (Element) (§3.3.5).

This is followed by another clause with almost identical wording, covering global elements.

If we make use of the term expanded name, defined by the Namespaces in XML recommendation, and refer to the expanded names of the declaration and element instead of inlining the definition of expanded name by referring to namespace name + local name pairs — this entails defining the term expanded name as it applies to schema components — and supply the obvious variable names for element and declaration, then it’s easier to see that this rule for local element declarations can be merged with the following rule for global element declarations, since the two do exactly the same thing. So we can replace both the rule above and the rule that follows it in the spec with a single merged rule.

If I’m smiling this evening, it’s because this morning the XML Schema working group agreed to these changes, and scores of other similar changes, to the text of the XSD 1.1 spec. The design of the language, I admit, is still very complex. The exposition, I concede, still has a sub-optimal structure. But the third source of difficulty, namely the complexity of individual sentences in the validation rules and constraints on schema components, is somewhat diminished by this change.

Variable names as a shorthand for complex noun phrases; technical terms to capture frequently needed concepts; conventions to allow things to be said simply instead of in convoluted clauses: it’s almost enough to make you think that mathematical writing is the way it is in order to make things easier to read, not harder. Food for thought.