XSD 1.1 is a Candidate Recommendation

[4 May 2009; some typos corrected and phrases tweaked 5, 6, and 7 May]

The World Wide Web Consortium has published XSD 1.1 Part 1: Structures and Part 2: Datatypes as Candidate Recommendations, and issued a call for implementations.

As the version number is intended to suggest, XSD 1.1 is mostly very similar to XSD 1.0 and restricts itself to relatively modest changes to the spec.


[At this point, Enrique snorted loudly enough to break my concentration. “If it’s just modest changes, why did it take so long? Let’s see, when did you start? XSD 1.0 was 2001, so …”

“Well, we didn’t start on 1.1 right away,” I hurriedly interjected. “But, well, I guess you’re right. It did take a lot longer than you would have expected.”

“Why? What could possibly take that long?”

“Well, different members of the working group turned out to entertain rather different views of what counts as a modest change. So we spent a lot of the last several years arguing about the relative importance of compatibility, of fixing problems in the spec, and of making the spec more useful for users. And then, on the next issue, arguing about them again. And again. And again.”

“And again?”

“And again. You know, some people say you can be a success in committee work in several different ways: by being smarter than everyone else —”

“You mean, the James Clark approach?”

“Yeah — only that doesn’t always work for people who aren’t James Clark. Or by working harder than everyone else,”

“Paul Cotton always used to talk about how much leverage you have to influence a group if you are the one who always does the minutes. I always thought he was just trying to find a sucker.”

“Well, maybe. But I think he also meant it; it really can be an important role.”

“Then why are members of the W3C Team so strongly encouraged not to do it?”

“Long story; another time, perhaps. Or, third alternative, you can just have more endurance than everyone else.”

“The ‘Iron Butt Rule’?”

“Exactly. The XML Schema working group had several members who seemed determined to try their hand at that technique.”

“Well, there’s you, of course. That would be your only option, really, wouldn’t it? I mean, the other methods …. But you mean, others tried to play the Iron Butt card, too?”

“Hush. I was going to talk about what 1.1 has that 1.0 doesn’t have.”

“So who’s stopping you?”]


XSD 1.1 is mostly similar to 1.0, I was saying before being interrupted. But it does have a number of improvements that can make a difference.

  • XSD 1.1 supports XML 1.1 and XML 1.0 Fifth Edition. (That last does not distinguish it, in my view, from XSD 1.0. But some people believe that 1.0 requires old versions of its normative dependencies, because the working group did not instruct the editors to say explicitly that of course newer editions can be used. Some things should go without saying, you know?)

    This constitutes a significant improvement from the point of view of internationalization.

  • There’s a conditional inclusion mechanism (the vc:* attributes) that allows a schema document to provide multiple versions of a declaration and select the right one at schema construction time, based on which version of XSD the processor supports, which spec- and implementation-defined datatypes are automatically available, and so on. (A small sketch of the mechanism appears below, after this list.)

    This mechanism should make it much easier to produce new versions of XSD without being tied in knots over questions of what back-level processors will make of schema documents which use new constructs. (If XSD 1.0 had had such a mechanism, we could probably have done a better 1.1 in half the time. But we did not learn enough, when doing 1.0, from the example of, say, XSLT 1.0.)

  • Elements can now be declared with a form of conditional type assignment, which makes the type assigned to an element in an instance depend on the values of its attributes; this allows a variety of co-occurrence constraints on attributes and content to be expressed. (A sketch of this, and of assertions, appears below, after this list.)
  • Assertions can be associated with complex and simple types. This also makes it easier (or in some cases possible for the first time) to express certain co-occurrence constraints on attributes and content.

    The assertions of XSD 1.1 are less powerful than the assertions of Schematron, in that they cannot refer to anything outside the element being validated. They will in some cases be less convenient to express. (Ask about the HTML input rule, for example.) But they preserve the context-independence of type validity and an aggressive optimizer should be able to check them in a streaming context, which is not true in general of Schematron assertions.

  • Attributes can now be marked as inheritable; inherited values are written into the XDM data model instance before assertions and conditional type assignment evaluate any XPath expressions, which means that inherited attributes like xml:lang can be consulted in conditional type assignment and assertions.

    I’m proud of this not only because it helps handle internationalization better, but because it aligns the principle of context-free validation better with some of the core idioms of XML.

  • A precisionDecimal datatype has been added, which is intended to mirror the new IEEE 754-2008 specification of floating-point decimal.

    This one is controversial: some members of the XSL and XML Query working groups are vocal in saying it’s a bad idea, it will complicate their type hierarchy and type coercion rules yet again, and we shouldn’t support it.

    [“Of course, some of the same members of QT also predicted that the IEEE spec would never be finished at all, and that the sky would fall, hell would freeze over, and Intel would fall into the Pacific Ocean before supporting it, didn’t they?” said Enrique. “But the spec was published, and Intel is supporting it. So …” “Hush,” I said. “They’ll hear you.” But it doesn’t matter: they don’t much care what Enrique thinks.]

  • The xsd:redefine construct has been deprecated.

    This is a disappointment to some people, who believe that it had great promise. And they are right: it did have great promise. But the 1.0 spec is vague (to put it charitably) on some points; interoperability problems in 1.0 implementations have been reported and the working group has been unable to agree on the correct interpretation of the 1.0 spec.

  • A simpler mechanism for reusing an existing schema document while changing it selectively is now provided under the name xsd:override. For the situations where redefine turns out to be under- (or over-) specified, override provides relatively clear, straightforward answers. (A sketch appears below, after this list.)
  • The rules for restriction have been made much simpler and more correct. It is no longer possible to use xsi:type with the name of a member type in order to evade facet restrictions on a union.
  • The determinism rule (the so-called “unique particle attribution” constraint) has been relaxed. It’s now legal for wildcards to compete with element declarations; elements win.
  • It’s easier to specify ‘open content’ and effectively insert wildcards everywhere, without cluttering up your content models.
  • Wildcards can now say, in effect, “any of these, except for those.” Some people call these “negative wildcards”.
  • All-groups can now contain wildcards, the elements and wildcards in all-groups can now have maxOccurs greater than one, and all-groups can be extended.
  • To align better with XPath 2.0 and related specs, the simple type hierarchy now includes an xsd:anyAtomicType. Also, the two totally ordered subtypes of duration defined for XPath 2.0 and related specs have (with the cooperation of the XML Query and XSL working groups) been integrated into the XML Schema namespace.
  • A new facet has been added for requiring the timezone to be present (or absent) in datatypes derived by restriction from any of the date/time types; a dateTimeStamp datatype which requires a timezone has been added, at the suggestion of the OWL working group.
  • Lists and unions constructed from ID and IDREF retain the ID- and IDREFness of the ID and IDREF values. Also, you can have more than one ID on an element, which means it’s now a lot easier to support xml:id without having to whack the rest of your vocabulary.
  • Much of the spec has been rewritten, sentence by sentence and phrase by phrase. It was not possible to reorganize the exposition from the ground up (although I agree with those who believe the spec could use it), but while retaining the same organization we were able to make individual paragraphs and sentences easier to follow and understand. More liberal use of technical terms, variable notation, and section headings may seem like trivial changes, but empirically they appear to have a perceptible effect on the readability of the spec.

    Most users, of course, don’t read the spec, even power users. But implementors do, members of the working group do, members of other working groups who need to layer their stuff on top of XSD do. And some users do. I wish we could do more to make the spec more welcoming and legible for them. But while there is a lot of room for further improvement, I think (if I say so myself) that 1.1 is somewhat easier to read than 1.0. It benefits, of course, from being the second go at formulating these things.
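
Here, for concreteness, is a rough sketch of the conditional-inclusion mechanism (the element name and types are invented for the occasion; this is an illustration of the syntax, not a polished production schema). A schema document carries two declarations for the same element, one for processors that support 1.1 and one for those that do not; a processor that implements the versioning namespace prunes whichever declaration it cannot use before assembling the schema, so the two never clash:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">

      <!-- Retained only by processors supporting XSD 1.1 or later -->
      <xs:element name="temperature" type="xs:precisionDecimal"
                  vc:minVersion="1.1"/>

      <!-- Retained only by processors supporting versions before 1.1
           (vc:maxVersion is exclusive) -->
      <xs:element name="temperature" type="xs:decimal"
                  vc:maxVersion="1.1"/>

    </xs:schema>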
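
A similar rough sketch, again with invented names, of conditional type assignment, assertions, and an inheritable attribute: the type assigned to a payment element depends on its kind attribute, assertions tie attributes and content together, and the inheritable currency attribute is (as described above) visible to assertions and type alternatives on descendant elements:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- Conditional type assignment: the type assigned to a payment
           element in an instance depends on its kind attribute. -->
      <xs:element name="payment" type="paymentType">
        <xs:alternative test="@kind = 'card'" type="cardPaymentType"/>
        <xs:alternative type="paymentType"/>
      </xs:element>

      <xs:complexType name="paymentType">
        <xs:sequence>
          <xs:element name="amount" type="xs:decimal"/>
        </xs:sequence>
        <xs:attribute name="kind" type="xs:string"/>
        <!-- Inheritable: the value is visible to conditional type
             assignment and assertions on descendant elements. -->
        <xs:attribute name="currency" type="xs:string" inheritable="true"/>
        <!-- An assertion expressing a co-occurrence constraint
             on attributes and content. -->
        <xs:assert test="@kind = 'card' or xs:decimal(amount) lt 1000"/>
      </xs:complexType>

      <xs:complexType name="cardPaymentType">
        <xs:complexContent>
          <xs:extension base="paymentType">
            <xs:sequence>
              <xs:element name="cardNumber" type="xs:string"/>
            </xs:sequence>
            <xs:assert test="string-length(cardNumber) ge 12"/>
          </xs:extension>
        </xs:complexContent>
      </xs:complexType>

    </xs:schema>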
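
And one more invented sketch, showing xsd:override, open content, a negative wildcard, and the new timezone facet; report-v1.xsd here is a hypothetical existing schema document whose sectionType we want to reuse with changes:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- xs:override: reuse the declarations of report-v1.xsd,
           replacing only the ones redeclared here. -->
      <xs:override schemaLocation="report-v1.xsd">

        <xs:complexType name="sectionType">
          <!-- Open content: elements from other namespaces may appear
               anywhere in the content without being written into the
               content model itself. -->
          <xs:openContent mode="interleave">
            <xs:any namespace="##other" processContents="lax"/>
          </xs:openContent>
          <xs:sequence>
            <xs:element name="title" type="xs:string"/>
            <xs:element name="published" type="stampType" minOccurs="0"/>
          </xs:sequence>
        </xs:complexType>

      </xs:override>

      <!-- A 'negative' wildcard: any element at all, except those for
           which the schema has a top-level declaration. -->
      <xs:complexType name="extensionBagType">
        <xs:sequence>
          <xs:any notQName="##defined" processContents="skip"
                  minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>

      <!-- The new facet requiring a timezone; the built-in
           xs:dateTimeStamp datatype is essentially this. -->
      <xs:simpleType name="stampType">
        <xs:restriction base="xs:dateTime">
          <xs:explicitTimezone value="required"/>
        </xs:restriction>
      </xs:simpleType>

    </xs:schema>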

It has been a long, hard slog — I lied to Enrique: we actually did start on it in 2001, though we were also doing a lot of other things at the same time — and I think we would not have made it without the perseverance of the chair, David Ezell of Verifone, representing the National Association of Convenience Stores (to both of whom thanks for seconding David to the group and supporting the time he spends on XSD), and the hard work of Sandy Gao of IBM on the Structures spec and of Dave Peterson of SGMLWorks! (who serves as an invited expert) on the Datatypes spec. XSD 1.1 is not a perfect spec, by any means. But it’s an improvement on 1.0, and it’s worth pushing forward for that reason. And without David, and Sandy, and Dave, it would not be happening. Anyone interested in the validation of XML owes these three a debt of gratitude.

The long hard slog is by no means over. Publication as a Candidate Recommendation means the W3C has now called for implementations. If you are a programmer looking for a challenge, I challenge you to implement XSD 1.1! If you are a user, not a provider, of XSD software, urge the supplier of your software to implement XSD 1.1, and test their implementation! The more you push on the implementations now, the stronger they will be when the time comes to demonstrate implementation experience and progress the spec to Proposed Recommendation. And the more experience we will have gained towards the goal of having a broadly supported validation language which supports the full spectrum of XML usage.

[“Wow!” said Enrique. “Did you know that perseverance is a theological term? ‘Continuance in a state of grace leading to a state of glory’!” “In other words,” I said, “you looked it up because you didn’t think I knew how to spell it correctly, didn’t you?” “Oh, hush,” he said.]

Base 64 binary, hex binary, and bit sequences

[26 April 2009]

I had an interesting exchange with a friend recently. He wrote

Presumably a binary octet is a sequence of bits. A sequence of sequences of bits is not a sequence of bits, since a bit is not a sequence of bits. (Please, let’s not violate the axiom of regularity!) Therefore, a finite-length sequence of zero or more binary octets is not a sequence of bits.

The upshot, he said, was that the description of the value space of the base64Binary and hexBinary datatypes was wrong in both XSD 1.0 and XSD 1.1. The 1.0 spec says (in section 3.2.16):

The value space of base64Binary is the set of finite-length sequences of binary octets.

XSD 1.1 says, very similarly (but with pedantic attention to details on which people have professed uncertainty):

The value space of base64Binary is the set of finite-length sequences of zero or more binary octets. The length of a value is the number of octets.

But if my friend is right, and the binary datatypes are intended to have value spaces which are sequences of bits, then those descriptions are wrong. So, my friend continues, the description of the value space really ought to be:

… the set of finite-length concatenations of sequences of zero or more binary octets.

Shouldn’t it?

This sounds plausible at first glance, but the answer is no, it shouldn’t. The two binary datatypes can certainly be used, in fairly obvious ways, to encode sequences of bits, but their value space is not the set of bit sequences, not even the set of bit sequences whose length is an integer multiple of eight (and whose length for purposes of the XSD length facet would be their bit length divided by eight).
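
A small schema example (the type names are invented) may make the point concrete: the length facet on the binary types counts octets, not bits and not characters of the lexical form. A sixteen-octet value such as an MD5 hash has length 16, whether it is written as 32 hexadecimal digits or as 24 base64 characters.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- Sixteen octets (128 bits): the lexical form has 32 hex digits -->
      <xs:simpleType name="md5Hex">
        <xs:restriction base="xs:hexBinary">
          <xs:length value="16"/>
        </xs:restriction>
      </xs:simpleType>

      <!-- The same sixteen octets as base64Binary: the length facet is
           still 16, though the lexical form has 24 base64 characters -->
      <xs:simpleType name="md5Base64">
        <xs:restriction base="xs:base64Binary">
          <xs:length value="16"/>
        </xs:restriction>
      </xs:simpleType>

    </xs:schema>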

My friend’s argument suffers, I think, from two faults. Most important, he seems to assume

  1. That an octet is a sequence of bits.
  2. That the purpose of base64Binary (and of the Base 64 Encoding on which it is based) is to encode sequences of bits.

I don’t think either of these is true in detail.

Are octets sequences of bits?

Is an octet a sequence of bits? Certainly it’s often thought of that way (e.g. in the Wikipedia article on ‘octet’). But strictly speaking I think the term ‘octet’ is best taken as denoting a group of bits, without assuming a geometry, in which (for purposes of most network transmission protocols) each bit is associated with a power of two.

But if we number the bits with their powers as 0 .. 7, is the octet the sequence b0 b1 b2 b3 b4 b5 b6 b7? Or b7 b6 b5 b4 b3 b2 b1 b0? Or some other sequence? On architectures where the byte is the smallest addressable unit, there is no requirement that the bits be thought of as being in any particular order, although the design of big- and little-endian machines makes better intuitive sense if we assume least-significant-bit-first order for little-endian machines, and most-significant-first for big-endian ones. I believe that some serial-port protocols specify least-significant-first order and others most-significant-first (with least-first the more common). But I suspect that most networking protocols today (and for a long time now) assume parallel transmission of bits, in which case asking about the sequence of bits within an octet is nothing but a category error.

But IANAEE. Am I wrong?

Does base 64 encoding encode bits?

RFC 3548, which defines Base 64 encoding, says

The Base 64 encoding is designed to represent arbitrary sequences of octets in a form that requires case sensitivity but need not be humanly readable.

It uses similar wording for the base 32 and base 16 encodings it also defines.

Note the choice of words: octets, not bits.

I wasn’t there when it was developed, but I’m guessing that base 64 notation is carefully formulated to be agnostic on the sequence of bits within an octet, and to be equally implementable on big-endian, little-endian, and other machines. That would be one reason for it to talk about encoding sequences of octets, not sequences of bits.

One consequence of such agnosticism is that if one wanted to specify a sequence of bits using base64Binary, or hexBinary, one would need to specify what sequence to assign to the eight bits of each octet. And indeed RFC 2045 specifies that “When encoding a bit stream via the base64 encoding, the bit stream must be presumed to be ordered with the most-significant-bit first.” I believe that that stipulation is necessary precisely because it doesn’t follow from the nature of octets and thus doesn’t go without saying. The RFCs also don’t say, when you are encoding a sequence of six bits, for example, whether it should be left-aligned or right-aligned in the octet.

Bottom line: I think the XSD spec is right to say the value spaces of the binary types are sequences of octets, not of bits.

[In preparing this post, I notice (a) that RFC 3548 appears to have been obsoleted by RFC 4648, and (b) that XSD 1.1 still cites RFC 2045. I wonder if anyone cares.]

For collectors of headline ambiguities …

[22 April 2009; unintended injection of draft notes for other postings removed 18 November 2017]

Linguists (and others) like to collect cases in which the compressed, telegraphic style of newspaper headlines leads to unexpected syntactic ambiguities.

The Albuquerque Journal and the Santa Fe New Mexican both carried stories yesterday about the results of a new University of New Mexico study on the incidence of spinal injuries in the U.S.

I showed them to Enrique, who glanced at the headline in the Journal:

Paralysis More Widespread Than Thought

and asked “When did the Albuquerque Journal start covering W3C and ISO Working Groups?”

Grant and contract-supported software development

[7 April 2009]

Bob Sutor asks, in a blog post this morning, some questions about government funding and open source software. Since some of them, at least, are questions I have thought about for a while as a reviewer for the National Endowment for the Humanities and other funding agencies, I think I’ll take a shot at answering them. To increase the opportunity for independent thought to occur, I’ll answer them before I read Bob Sutor’s take on them; if we turn out to agree or disagree in ways that require comment, that can be separate.

He asks:

  • When a government provides funding to a research project, should any software created in the project be released under an open source license?

It depends.

In practice, I think it almost always should, but in theory I recognize the possibility of cases in which it needn’t.

When I review a funding proposal, I ask (among other things): what is the quid pro quo? The people of the country fund this proposal; what are they buying with that money? A reliable survey of the work of Ramon Llull and its relevance to today? Sounds good (assuming I think the applicant is actually in a position to produce a reliable survey, and the cost is not exorbitant). A better tool for finding and studying emblem books? Insight into methods of performing some important task? (Digitizing cultural artefacts, archiving digital research results for posterity, creating reliable language corpora, handling standoff annotation, … there are a whole lot of tasks it would be good to know better how to do.) How interesting is what we would be learning about? How much are we likely to learn?

My emphasis on what we get for the money sometimes leads other reviewers or panelists to regard me as cold and mean-hearted, insufficiently concerned with encouraging a movement here, nurturing a career there. But I have noticed that the smartest and most attractive members of panels I’ve been on are almost always even tougher graders than I am. When funds are as tight as they typically are, you really do need to put them where they will do the most good.

If the value proposition of the funding proposal is “we’ll develop this cool software”, then as a reviewer I want the public to own that software. Otherwise, what did we buy with that money?

If the value proposition is “we’ll develop these cool algorithms and techniques, and write them up, so the community now has better knowledge of how to do XYZ — oh, and along the way we will write this software, it’s necessary to the plan but it’s not what this grant is buying”, then I don’t think I want to insist on open-sourcing the software. But it does make it harder for the applicant to make the case that the results will be worth the money.

Stipulating that software produced in a project will be open-source does usually help persuade me that its benefit will be greater and more permanent. If the primary deliverable I care about is insight, or an algorithm, open-sourcing the software may not be essential. But it helps guarantee that there will be a mercilessly complete account of the algorithm with all details. (It does have the potential danger, though, that it may allow other reviewers or the applicants to believe that the source code provides an adequate account of the algorithm and there is no need for a well written paper or series of papers on the technical problem. I am told that some programmers write source code so clear and beautiful that it might suffice as a description of the algorithm. I say, if writing documentation as well as source code is good enough for Donald Knuth, it’s good enough for the rest of us.)

On the other hand, I don’t think deciding not to open-source the software is necessarily an insuperable barrier. The question is: what value is the nation or the world going to get from this funding? Sometimes the value clearly lies with the software people are proposing to develop, sometimes it clearly lies elsewhere and the software plays a purely subordinate, if essential, role. (But although I admit this in principle, I am not sure that in practice I have ever liked a proposal that proposed to spend a lot of effort on software but not to make it generally available. So maybe my generosity toward non-open-source projects is a purely theoretical quantity, not observable in practice.)

If software is involved, you also have to ask yourself as a reviewer how well it is likely to be engineered and whether the release of the software will serve the greater good, or whether it will act like a laboratory escape, not providing good value but inhibiting the devotion of resources to creating better software.

The chances and consequences of suboptimal engineering vary, of course, with whether the research in question is focused specifically on computer science and software engineering, or on an application domain, in which case there is a long and often proud history of good science being performed with software that would make any self-respecting software engineer gag. (A long time ago, I worked at a well known university where the music department burned more CPU cycles on the university mainframe than any other department. Partly this was because Physics had its own machines, of course, and partly it was because the music people were doing some really really cool and interesting stuff. But was it also partly because they were lousy programmers who ran the worst optimized code east of the Mississippi? I never found out.)

  • Does this change if commercial companies are involved? How?

If the work is being done by a commercial company, that company is historically perhaps less likely to want to make the software it develops open source. That’s one way the process is affected.

But also, if a government agency is contracting with a commercial organization to develop some software, there may be a higher chance that the agency wants some software for particular parties to use, and the main benefit to be gained is the availability to those parties of the software involved. In some cases, the benefit may be the existence of commercially viable organizations willing and able to support software of a particular class and develop it further.

There are plenty of examples of commercial codebases developed in close consultation with an initial client or with a small group of initial clients. The developer gets money with which to do the development; the initial clients get to help shape the product and ensure that at least one commercial product on the market meets their needs. In the cases I have heard of, the clients don’t typically turn around and demand that the code base be open-source.

It’s not clear to me that government funding agencies should be barred from acting as clients in scenarios like this. This kind of arrangement isn’t precisely what I tend to think of as “research”, but whether it’s appropriate or not in a given research program really depends on the terms of reference of that program, and not on what counts as research in the institutions that trained me.

I have been told on what I think is good authority that if it had not been for contracts let by various defence agencies, the original crop of SGML software might never have been commercially viable. (And since it was that crop of software that demonstrated the utility of descriptive markup and made XML possible, I wouldn’t like to try to outlaw whatever practices led to that software being developed.)

  • Does this change if academic institutions are involved? How?

I don’t think so.

  • How should the open source license be chosen? Who gets to decide?

Yes.

Two umbrellas and a prime number.

I think I mean “Huh? Is this a trick question?”

To the extent that we think of funded research as the purchase (on spec) of certain research products we hope the funding will produce, then the funding agency can certainly say “We want the … license”. And then the Golden Rule of Arts and Sciences applies. Or the people writing the proposal can say “We want to use the … license; take it or leave it.” And the funding agency, guided by its reviewers and panelists and staff and the native wit of those responsible for the final decision, will either leave it or take it.

The only thing that would make me more suspicious and worried than this chaotic back and forth would be an attempt to make an iron-clad rule to cover all cases, for all projects, for all governmental funding agencies.

Balisage: The markup conference 2009

[3 April 2009]

Three weeks to go until the 24 April deadline for papers for the 2009 edition of Balisage: The markup conference.

We want your paper. So give it to us, already!

This is a peer-reviewed conference that seeks to be of interest both to the theorist and to the practitioner of markup. That makes it a lot of fun (at least for people like me who are interested both in theory and in practice, and who like to see them informing each other). And the peer reviews are unusual (at least in my experience of conference paper submissions) in the detail and passion of their comments.

If you have markup-related work to report on, you will not get better feedback from any conference on the planet. (Disclaimer: I am one of the organizers, and have been known to have a small soft spot in my heart for this conference. But don’t take my word for it: ask anyone who has spoken at Balisage how the peer review and the questions at the conference compared with other conferences.)

Details of the submission procedure are, of course, in the call for participation.

I look forward to reading your papers.