Scenes from a Recommendation 4: name characters

The recent publication of XML 1.0 Fifth Edition, and the ensuing discussion of how best to define the set of name characters, has made me think about a proposal that came up during the original development of XML 1.0.

We had taken as our task the definition of a grammar production to select all and only the Unicode characters suitable for use in identifiers. For our SGML-influenced minds, suitability essentially meant being a letter or something letter-like, by analogy with the characters A through Z and a through z allowed in SGML’s reference concrete syntax, or a numeric symbol, by analogy with 0 through 9. Now, it is an empirical fact that any such grammar production will be complicated, especially if it excludes not just unsuitable Unicode characters but unassigned code points.

As we were contemplating the unappetizing lists of ranges and properties that became productions [84] through [89] of the XML spec, someone — my memory says it was Tim Bray, but I have not tried to verify this memory against the historical record, and I don’t remember whether it was in email or on a phone call — said “Well, I realize I risk being labeled a xenophobe for suggesting this, but you know, the lexer has to have character tables for recognizing names and other tokens, and this rule is going to make those character tables just huge. We all know that in the ideal, normal case, the user is not going to be eyeballing XML source directly, you’re going to want to have some sort of presentation interface. So the element names are just codes, not utterances in natural language. They might as well be E001, E002, E003, and so on, as far as the parser is concerned. So why don’t we just do the same thing as SGML’s reference concrete syntax and restrict names to ASCII characters? That would make the tables you need for the lexical scanner a lot smaller. Any UCS character can go in the data, and the data is all the human reader needs to care about anyway. It’s only in the element and attribute names that we make the restriction, and the proposal is not really Anglo-centric since element and attribute names don’t have to be natural-language words; they shouldn’t be part of the user interface in the first place.”

I remember having a sinking heart at this point. The analysis is so plausible, and yet wrong in so many ways. One of the design goals of XML is that it should be legible to humans, and having identifiers that take the form of words in a natural language is an important tool in making markup legible. E001, E002, and so on just don’t cut it. (A librarian did suggest once in all seriousness that SGML vocabularies should have numeric identifiers, like the numeric field identifiers in MARC records. 245 means main title, everyone knows that, but unlike “main title” it doesn’t have an Anglophone bias. But as a design principle for actual vocabularies, this seemed to me too much like ensuring that all languages are treated equally by ensuring that the vocabulary is hard to understand for everyone.)

Also, while I have seen some very nice interfaces to marked up text, I’ve spent most of my writing life over the last twenty years working with SGML and XML source, mostly in emacs, and no interface that hides the markup has ever made me want to change. (For a while I did use the hide-tags mode in XMetal for rereading and copy editing, but I could never draft in it, and when I upgraded my operating system and XMetal ceased to run, I didn’t actually spend a lot of time looking for a new copy, or a new GUI.) I don’t want to hide the tags. The tags are what make things work. There is a story that a worker came into Bertolt Brecht’s apartment in Berlin once to install some curtains, and after doing so he started to put up a valance to hide the curtain rod and the hooks and strings. Brecht came into the room and made him stop and take it back down, with (as I remember it — I have not spent any time looking for the source of this story to make sure I’m getting the details right) the words “Never hide the mechanism!” When it comes to markup, at least, I’m with Brecht.

So for me, and anyone like me, the claim “no human really ever looks at this stuff” is false (and also conveys indirectly the suggestion that we counter-examples aren’t really human and thus don’t need to be accounted for). Even if everyone in the world used tag-hiding editors, the programmers who create those editors and the stylesheet authors who specify the mapping from tags to display form will spend time looking at the raw markup. And last I looked, they were all humans, too.

But none of those arguments were going to go anywhere against that brutally simple argument about the size and complexity of the scanner tables, and the seductive suggestion that it really would be inhumane to require users to use tag-visible interfaces. So for a while I was really worried that the proposal for ASCII-only identifiers was going to carry the day.

I’m not sure whether my memory of what happened next is what happened in reality, or whether I’ve just persuaded myself to accept as real a fantasy of what could have happened, if only someone had had the wit to think of it, and the nerve to say it, at the time.

The way I remember it is this. Someone (I don’t remember who) responds to the suggestion carefully, thoughtfully, and without screaming. They agree that it’s important to keep the lexical scanner simple, and that identifiers really don’t need to be used to represent arbitrary natural-language words or phrases. And they conclude by saying “So I think I’m willing to support this proposal, if we can improve it a little bit. Right now, the natural way to represent the scanner table for ASCII-only identifiers is to use a table 128 octets wide. But that’s a lot bigger than we need. So I propose that we don’t allow all the characters A through Z in both upper and lower case. If we restrict ourselves to A through O, uppercase only, then we can get by with a much smaller table. So I’ll support the proposal if we restrict identifiers not to any ASCII letter, but to A through O uppercase only, i.e. U+0041 through U+004F. And since no one wants to use natural-language words for identifiers, that’s not a problem, right?”

And then, silence. You can just about hear the proponents of ASCII-only identifiers starting to say “but then you couldn’t use ‘q’ or ‘quote’ or even ‘html’ or ‘body’ as element names,” before biting their tongues as they realize that they can’t say that without giving away the store.

Whether it happened the way I remember it or in some other less dramatic and memorable way, the end is the same: after people thought about it for a while, the proposal for ASCII-only identifiers evaporated, like dew in bright sunlight.

A little formalism (variable names)

Many readers associate the use of variables with mathematics and feel threatened by paragraphs that begin “Let E be … and F be …. Then …” And similarly with technical terms: when a text defines and uses a lot of technical terms, it can be very daunting to the first-time reader (and many others).

So it’s understandable that sometimes, in trying to keep a text accessible to the reader, one works hard to avoid having to introduce variables to refer to things, and to avoid relying on technical terms with special meanings.

But sometimes such efforts backfire. In the XSD (XML Schema Definition Language) 1.0 spec, you end up with rules that read like this:

Validation Rule: Element Locally Valid (Type)

For an element information item to be locally ·valid· with respect to a type definition all of the following must be true:

1 The type definition must not be ·absent·;
2 It must not have {abstract} with value true.
3 The appropriate case among the following must be true:

3.1 If the type definition is a simple type definition, then all of the following must be true:

3.1.1 The element information item’s [attributes] must be empty, excepting those whose [namespace name] is identical to http://www.w3.org/2001/XMLSchema-instance and whose [local name] is one of type, nil, schemaLocation or noNamespaceSchemaLocation.
3.1.2 The element information item must have no element information item [children].
3.1.3 If clause 3.2 of Element Locally Valid (Element) (§3.3.4) did not apply, then the ·normalized value· must be ·valid· with respect to the type definition as defined by String Valid (§3.14.4).
3.2 If the type definition is a complex type definition, then the element information item must be ·valid· with respect to the type definition as per Element Locally Valid (Complex Type) (§3.4.4);

I would say “Maybe it’s just me, but I find that kind of hard to read,” but that would be disingenuous. There is ample evidence from the last eight or nine years that I am not the only reader of the XSD 1.0 spec who finds parts of it hard to read. This is a relatively mild example, as the XSD spec goes. But if we can overcome our fear of formality, the text can become a bit simpler. Two changes in particular seem useful here.

  • Introduce the names E for the element and T for the type, and use them.
  • Follow the example of most specs that define and use namespaces: specify and use a conventional prefix to represent a given namespace, and say once and for all, when that prefix is identified, that in practice the user can use any prefix they wish (or none). Then just use the QNames, rather than writing out the namespace in full each time you have to talk about names in that namespace.
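The second point deserves a tiny illustration (the element name item here is invented for the example). Once we have said that the prefix xsi stands for http://www.w3.org/2001/XMLSchema-instance, a name like xsi:nil identifies an attribute by its namespace name and local name; an instance document remains free to bind any prefix it likes to the same namespace:

<item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:nil="true"/>

<item xmlns:i="http://www.w3.org/2001/XMLSchema-instance"
      i:nil="true"/>

Both elements carry the very same attribute; the prefix is only a local abbreviation for the namespace name.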

Applying these rules to the fragment just given, we get something a bit easier to read.

Validation Rule: Element Locally Valid (Type)

For an element information item E to be locally ·valid· with respect to a type definition T all of the following must be true:

1 T is not ·absent·;
2 T does not have {abstract} with value true.
3 The appropriate case among the following is true:

3.1 If T is a simple type definition, then all of the following are true:

3.1.1 E’s [attributes] are empty, excepting those named xsi:type, xsi:nil, xsi:schemaLocation, or xsi:noNamespaceSchemaLocation.
3.1.2 E has no element information item [children].
3.1.3 If clause 3.2 of Element Locally Valid (Element) (§3.3.4.3) did not apply, then the ·normalized value· is ·valid· with respect to T as defined by String Valid (§3.16.4).
3.2 If T is a complex type definition, then E is ·valid· with respect to T as per Element Locally Valid (Complex Type) (§3.4.4.2);
4 If E has an xsi:type [attribute] and does not have a ·governing element declaration·, then the ·actual value· of xsi:type ·resolves· to T.

I won’t claim that the text has become easy to read and follow, but I think there is one salient difference: in the first text above, my first difficulty as a reader is understanding what the text is trying to say, and once I have figured that out, I may or may not have energy left to try to understand why it’s saying that. In the second text, it’s easier (I think) to understand what the individual clauses are saying. The reader still has the task of understanding why, but at least the difficulties of comprehension are now those related to the intrinsic difficulty of the topic, without the additional barrier of complex syntax.

Another tactic adopted by some in trying to make difficult material easier to read is to avoid defining technical terms. The XSD 1.0 spec raises this to a fine art; often, the easiest way to understand how a given rule came to be formulated as it is, is to imagine that it was first written in a simple, straightforward clause using technical terms, and then the technical terms were eliminated and their definitions inserted inline. And then the process was repeated once, or twice, or more. The result is mostly devoid of difficult or obscure technical usages, but it’s often also a sentence only an eighth-grade English teacher teaching the unit on sentence diagramming could love.

If we re-introduce appropriate technical terms, this process can be reversed. Sometimes the introduction of even a single technical term can do a surprising amount of good.

Take the following example from the XSD spec:

2.3.1 The element declaration is local (i.e. its {scope} must not be global), its {abstract} is false, the element information item’s [namespace name] is identical to the element declaration’s {target namespace} (where an ·absent· {target namespace} is taken to be identical to a [namespace name] with no value) and the element information item’s [local name] matches the element declaration’s {name}.

In this case the element declaration is the ·context-determined declaration· for the element information item with respect to Schema-Validity Assessment (Element) (§3.3.4) and Assessment Outcome (Element) (§3.3.5).

This is followed by another clause with almost identical wording, covering global elements.

If we make use of the term expanded name, defined by the Namespaces in XML recommendation, and refer to the expanded names of the declaration and the element, instead of inlining the definition of expanded name by talking of namespace name + local name pairs — this entails defining the term expanded name as it applies to schema components — and if we supply the obvious variable names for the element and the declaration, then it is easier to see that this rule for local element declarations can be merged with the following rule for global element declarations, since the two do exactly the same thing. Both the rule above and the rule that follows it in the spec can then be replaced by a single merged rule.

If I’m smiling this evening, it’s because this morning the XML Schema working group agreed to these changes, and scores of other similar changes, to the text of the XSD 1.1 spec. The design of the language, I admit, is still very complex. The exposition, I concede, still has a sub-optimal structure. But the third source of difficulty, namely the complexity of individual sentences in the validation rules and constraints on schema components, is somewhat diminished by these changes.

Variable names as a shorthand for complex noun phrases; technical terms to capture frequently needed concepts; conventions to allow things to be said simply instead of in convoluted clauses: it’s almost enough to make you think that mathematical writing is the way it is in order to make things easier to read, not harder. Food for thought.

Choosing schema-validation roots

[18 February 2008, Prague]

A colleague asks:

naive XML schema question — How does a validating parser know which xs:element is supposed to be the root/document element? I don’t see anything in the schema that tells it.

I’m not getting any love from google or the schema Recs. (I’ve looked at every use of the word “root” in the Recs, with no clues.)

I hate it when smart people who are willing to put in some work to understand things can’t find the answer to their questions in the schema spec. So first of all, I’m sorry. I apologize on behalf of the spec on which I’ve now spent a large proportion of my working life. (I wish I thought I could do something about it, but the XML Schema WG has been appallingly reluctant to fix the incomprehensibility problems of the spec. I think the 1.1 spec is marginally better than 1.0 in some ways, but only marginally and only in some ways. If you hated the 1.0 spec, you may find you hate 1.1 ever so slightly less, but it’s unlikely to charm you into liking it.)

But this question does come up a lot. And if the WG won’t explain it clearly in the spec, then at least I can try to explain it clearly here.

The choice of validation root is not specified by XSD. Formally it’s regarded as out of scope; in practice, the expectation is that a processor will provide a useful method of choosing where to start validation (with the user specifying the validation root at invocation time), or a useful default choice (e.g. the document root), or in some cases a fixed choice (again, e.g., the document root). In the last case the user can be said to have chosen to start validation at that fixed point by choosing to use that particular validator. That may sound Orwellian, but the principle, at least, is simple: if you don’t like the level of control a given tool gives you, why are you using that tool? File a bug report, or an enhancement request, or get another tool. Or both.

The closest the XSD spec comes to talking about this is in section 5.2 (“Assessing Schema-validity”). Personally, I find the discussion in XSD 1.1 marginally clearer than the discussion in 1.0, but I may be exhibiting my bias in that.

My colleague continues:

Preliminary experiments suggest that at least in a normal schema, you can, in fact, just give a fragment of a document and have the document be considered schema valid. So “<br/>” is a schema-valid HTML document? Very odd.

Well, no and yes. “<br/>” is schema-valid against the XHTML schema, if schema-validity assessment starts with that element and with any of (a) the corresponding element declaration, (b) the relevant type definition, or (c) the instruction to start in lax or strict wildcard mode and look for an applicable declaration. And if that element happens to be the document root, then yes, it’s a document valid against the XHTML schema.

Since the default setting for many XSD validators is to start at the document root in lax-wildcard mode, they accept your sample document as valid.
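To make that concrete, here is one way to run the experiment in Python with lxml (which wraps libxml2). The schema file name is a placeholder for a local copy of the XHTML schema, and the example assumes that schema declares br as a top-level element:

from lxml import etree

# Load a local copy of the XHTML schema (the file name is a placeholder).
schema = etree.XMLSchema(etree.parse("xhtml1-strict.xsd"))

# Parse the bare fragment; as far as the parser is concerned, this is a
# complete document with br as its root element.
doc = etree.fromstring('<br xmlns="http://www.w3.org/1999/xhtml"/>')

# Assessment starts at the element handed to the validator, so the
# fragment is reported valid if br matches a top-level declaration.
print(schema.validate(doc))   # True, under the assumptions above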

An analogous result could be achieved using a DTD, by writing

<!DOCTYPE br PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<br/>

I think that those who run an XML validator over that document will find that it is valid against the DTD.
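For instance, with xmllint (part of libxml2), and assuming the XHTML DTD can be fetched or is available through a local XML catalog, saving the three lines above as br.xml and running

xmllint --valid --noout br.xml

prints nothing if the document is valid against its declared DTD, and reports the violations otherwise.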

The document type definition at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd has no formal specification that any particular element must be the root element; the constraint on the generic identifier of the root element is specified as part of the document type declaration, the “<!DOCTYPE” part. Analogously, the XSD schema for XHTML doesn’t have any formal specification of any required root element, or required starting declaration; both get specified at validation time. Both when using DTDs and when using XSD, this allows you to validate one part of a document at a time. If you’re editing a large document and storing different parts of it in different files, it’s convenient to be able to validate each part independently.

Another analogy is with the formal definition of a grammar: the set of productions that most of us think of as a grammar does not specify the start symbol. The start symbol is specified in a different part of the tuple that is, for formal-language purposes, the grammar. To describe schemas in these terms: the schema, or the collection of element and other declarations in a DTD file, does not define a full document grammar, but a set of productions for a document grammar. The start symbol is specified separately, in a doctype declaration for DTDs, and at validator invocation time for XSD schemas.
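To spell the analogy out in the usual formal-language notation, a grammar is a four-tuple

G = (N, Σ, P, S)

where N is the set of nonterminal symbols, Σ the set of terminal symbols, P the set of productions, and S ∈ N the start symbol. A schema or a DTD file supplies only the first three components; S arrives separately.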

The rules for the HTML vocabulary specify that a conforming HTML document should start with an ‘html’ element, so if you want to check conformance to the HTML spec (as opposed to schema-validity against the XHTML schema, which is not quite the same thing) you don’t get so much choice of how to invoke the validator: you should start with the declaration for the ‘html’ element and with the document’s root element.

If the validator you’re using doesn’t allow you to specify (a) where to start, and (b) what to start with, then you really should file a bug report or a request for enhancement. And whether you do that or not, you really should understand that some of the consequences of the implementation’s default choices are properties of how you are performing validity assessment, not properties of XSD validation in itself.

Some people dislike having to say explicitly that use of a particular vocabulary must start with a particular element, so they take pains to make only that one element top-level; all other elements are defined locally to complex types. This is an effective way of preventing abuse, but it also pretty effectively prevents re-use, and it makes the schema harder to maintain, work with, or reason about. I can’t see such a schema without thinking someone has just cut off their nose in order to spite their face.
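Here is a minimal sketch of that style, with an invented vocabulary: only report is declared at the top level, so validation can effectively start nowhere else, and by the same token none of the inner elements can be validated, or reused, on their own.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- The only top-level element declaration: the sole possible
       starting point for validation. -->
  <xs:element name="report">
    <xs:complexType>
      <xs:sequence>
        <!-- Local declarations: not visible as validation roots, and
             not reusable from other schemas. -->
        <xs:element name="title" type="xs:string"/>
        <xs:element name="para" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>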

Scenes from a Recommendation 3: Boston, Prudential Tower

Another memory from the development of XML.

It’s November 1996, at the GCA SGML ’96 conference, at the Sheraton in Boston. The SGML on the Web Working Group and ERB have just been through an exhausting and exhilarating few weeks, when from a standing start we prepared the first public working draft of XML. At this conference, we have been given a slot for late-breaking news and will give the first public presentation of our work.

Lou Burnard, of Oxford University Computing Services, the founder of the Oxford Text Archive, is there to give an opening plenary talk about the British National Corpus, a 100-million-word representative corpus of British English, tagged in SGML. Lou and I are old friends; since 1988 we have worked together as editors of the Guidelines of the Text Encoding Initiative. Working together to shepherd a couple dozen working groups and task forces full of recalcitrant academics and other-worldly text theorists (“but why should a stanza have to contain lines of verse? I can perfectly well imagine a stanza containing no lines at all”) from requirements to draft proposals, to turn their wildly inconsistent and incomplete results into something resembling a coherent set of rules for encoding textual material likely to be useful to scholarship, and to produce in the end 1500 pages of mostly coherent prose documentation for the TEI encoding scheme, Lou and I have been effectively joined at the hip for years. We have consumed large quantities of good Scotch whisky together, and some quantities of beer and not so good whisky. We have told each other our life stories, in a state of sufficient inebriation that neither of us remembers any details beyond our shared admiration for Frank Zappa. We have sympathized with each other in our struggles with our respective managements; we have supported each other in working group and steering committee meetings; we have pissed each other off repeatedly, and learned, with a little help from our friends (thank you, Elaine), to patch things up and keep going. No one but my wife knew me better than Lou; no one but my wife could push my buttons and enrage me more effectively. (And she didn’t push those buttons nearly as often as Lou did.)

Tim Bray is also there, naturally. He and I have not worked together nearly as long as Lou and I have, but the compressed schedule and the intensity of the XML work have made it a similarly close relationship. We spend time on the phone arguing about the best way to define this feature or that, or counting noses to see which way a forthcoming decision is likely to come out (we liked to try to draft wording in advance of the decisions, when possible). We commiserate when Charles Goldfarb calls and spends a couple hours trying to wear us down on the technical issue of the day. (Fortunately, Charles called Tim and Jon Bosak more often than me. Either he decided he couldn’t wear me down, or he concluded I was a lightweight not worth worrying about. I’m not complaining.) Like Lou, Tim often reads a passage I have drafted and says “This is way too complicated, let’s just leave this, and this, and this, and that, out. See? Now it’s a lot simpler.”

At one point I believed it was generally a good idea for an editorial team to have a minimalist and a maximalist yoked together: the maximalist gets you the functionality you need, and the minimalist keeps it from being much more than you need. Maybe it is a good idea in general. Or maybe it was just that it worked well both in the TEI and in XML. At the very least, it’s suggestive that in the work on the XML Schema spec, I was the resident minimalist; if in any working group I am the minimalist, it’s a good bet that the product of that WG will be regarded as baroque by most readers.

It’s the evening before the conference proper, and there is a reception for attendees in a lounge at the top of the Prudential Tower. I am standing chatting with Tim Bray and Lauren Wood, and suddenly Lou comes striding urgently across the room towards us. He reaches us. He looks at me; he looks at Tim; and he says, in pitch-perfect tones of the injured spouse, “So this is the other editor you’ve been seeing behind my back!”

Scenes from a Recommendation 2: subtle and devious

Tim Bray’s prose sketch of Jon Bosak is good, and vivid, but it doesn’t mention what I think is one of Jon’s outstanding traits. In a quiet, utterly unassuming way, Jon is one of the most persuasive and politically astute people I have ever met. He will not thank me for pointing this out: I think he thinks that if people know, they’ll be on their guard. He doesn’t do a hard sell (at least not to me); he takes the trouble to understand where his interlocutor is coming from and to find common ground with them. And he has patience; he is not dissuaded from a goal by the idea that it might take a while, or that it must be approached indirectly. And he is very reticent about taking credit.

We wrote Jon’s name into the XML spec, in the passage

XML was developed by an XML Working Group … formed under the auspices of the World Wide Web Consortium (W3C) in 1996. It was chaired by Jon Bosak of Sun Microsystems with the active participation of an XML Special Interest Group … also organized by the W3C.

because we wanted to get Jon’s contribution on record and force him to accept credit. Without Tim Bray, or without me, or any of the other members of the editorial review board and working group, the spec would have been different. Without Jon it would not have come to pass.

In memory of a particularly difficult political task undertaken and successfully negotiated, some of Jon’s friends once gave him a gift that I have always thought apposite: a dark gray bomber jacket, embroidered in dark gray (so the embroidery was virtually invisible) with the words “Subtle and devious”.

Hail, xml:Father!