XSD 1.1 is in Last Call

Yesterday the World Wide Web Consortium published new ‘last-call’ drafts of its XML Schema Definition Language (XSD) 1.1.

The idiom has an obscure history, but is clearly related to the last call for orders in pubs which must close by a certain hour. The working group responsible for a specification labels it ‘last call’, as in ‘last call for comments’, to indicate that the working group believes the spec is finished and ready to move forward. If other working groups or external readers have been waiting to review the document, thinking “there’s no point reviewing it now because they are still changing things”, the last call is a signal that the responsible working group has stopped changing things, so if you want to review it, it’s now or never.

The effect, of course, can be to elicit a lot of comments that require significant rework of the spec, so that in fact it would be foolish for a working group to believe it is essentially done when it reaches last call. (Not that it matters what the WG thinks: a working group that believes last call is the end of the real work will soon be taught better.)

In the case of XSD 1.1, this is the second last call publication both for the Datatypes spec and for the Structures spec (published previously as last-call working drafts in February 2006 and in August 2007, respectively). Each elicited scores of comments: by my count there are 126 Bugzilla issues on Datatypes opened since 17 February 2006, and 96 issues opened against Structures since 31 August 2007. We have closed all of the substantive comments, most by fixing the problem and a few (sigh) by discovering either that we could not reach consensus on what to do about the problem (or in some cases could not reach consensus about whether there was really a problem before us) or that we could not make the requested change without more delay than seemed warrantable. There are still a number of ‘editorial’ issues open, which are expected not to affect the conformance requirements for the spec or to change the results of anyone’s review of the spec, and which we therefore hope to be able to close after going to last call.

XSD 1.1 is, I think, somewhat improved over XSD 1.0 in a number of ways, ranging from the very small but symbolically very significant to much larger changes. On the small but significant side: the spec has a name now (XSD) that is distinct from the generic noun phrase used to describe the subject matter of the spec (XML schemas), which should make it easier for people to talk about XML schema languages other than XSD without confusing some listeners. On the larger side:

  • XSD 1.1 supports XPath 2.0 assertions on complex and simple types. The subset of XPath 2.0 defined for assertions in earlier drafts of XSD 1.1 has been dropped; processors are expected to support all of XPath 2.0 for assertions. (There is, however, a subset defined for conditional type assignment, although here too schema authors are allowed to use, and processors are allowed to support, full XPath.)
  • ‘Negative’ wildcards are allowed, that is, wildcards which match all names except some specified set. The excluded names can be listed explicitly, or can be “all the elements defined in the schema” or “all the elements present in the content model”.
  • The xs:redefine element has been deprecated, and a new xs:override element has been defined which has clearer semantics and is easier to use.

Some changes vis-a-vis 1.0 were already visible in earlier drafts of 1.1:

  • The rules requiring deterministic content models have been relaxed to allow wildcards to compete with elements (although the determinism rule has not been eliminated completely, as some would prefer).
  • XSD 1.1 supports both XML 1.0 and XML 1.1.
  • A conditional inclusion mechanism is defined for schema documents, which allows schema authors to write schema documents that will work with multiple versions of XSD. (This conditional inclusion mechanism is not part of XSD 1.0, and cannot be added to it by an erratum, but there is no reason a conforming XSD 1.0 processor cannot support it, and I encourage makers of 1.0 processors to add support for it.)
  • Schema authors can specify various kinds of ‘open content’ for content models; this can make it easier to produce new versions of a vocabulary with the property that any document valid against the new vocabulary will also be valid against the old.
  • The Datatypes spec includes a precisionDecimal datatype intended to support the IEEE 754R floating-point decimal specification recently approved by IEEE.
  • Processors are allowed to support primitive datatypes, and datatype facets, additional to those defined in the specification.
  • We have revised many, many passages in the spec to try to make them clearer. It has not been easy to rewrite for clarity while retaining the kind of close correspondence to 1.0 that allows the working group and implementors to be confident that the rewrite has not inadvertently changed the conformance criteria. Some readers will doubtless wish that the working group had done more in this regard. But I venture to hope that many readers will be glad for the improvements in wording. The spec is still complex and some parts of it still make for hard going, but I think the changes are trending in the right direction.

If you have any interest in XSD, or in XML schema languages in general, I hope you will take the time to read and comment on XSD 1.1. The comment period runs through 12 September 2008. The specs may be found on the W3C Technical Reports index page.

Thinking about test cases for grammars

[19 June 2008]

I’ve been thinking about testing and test cases a lot recently.

I don’t have time to write it all up, and it wouldn’t fit comfortably in a single post anyway. But several questions have turned out to provide a lot of food for thought.

The topic first offered itself in connection with several bug reports against the grammar for regular expressions in XSD Part 2: Datatypes, and with the prospect of revising the grammar to resolve the issues. When revising a grammar, it would be really useful to be confident that the changes one is making change the parts of the language one wants to change, and leave the rest of the language untouched. In the extreme case, perhaps we don’t want to change the language at all, just to reformulate the grammar to be easier to follow or to provide semantics for. How can we be confident that the old grammar and the new one describe the same language?

For regular languages, I believe the problem is fairly straightforward. You can calculate the minimal FSA for each language and check whether the two minimal FSAs are isomorphic. Or you can calculate both set differences (L1 – L2 and L2 – L1) and check that both of them are the empty set. And there are tools like Grail that can help you perform the check, although the notation Grail uses is just different enough from the notation XSD uses to make confusion and translation errors possible (but similar enough that you think it would be over-engineering to try to automate the translation).
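To make the second check concrete, here is a rough sketch in Python; the toy automata are invented for illustration and have nothing to do with the XSD regular-expression grammar. The idea is simply to walk the product of the two machines and look for a reachable pair of states on which they disagree about acceptance; any such pair is a witness that one of the two set differences is non-empty.

```python
from collections import deque

def equivalent(dfa1, dfa2, alphabet):
    """Check whether two complete DFAs accept the same language.

    Each DFA is a triple (start, accepting, delta), where delta maps
    (state, symbol) -> state.  Checking that both set differences
    L1 - L2 and L2 - L1 are empty amounts to checking that no reachable
    state pair of the product automaton is accepting in one machine but
    not in the other.
    """
    start1, accept1, delta1 = dfa1
    start2, accept2, delta2 = dfa2
    seen = {(start1, start2)}
    queue = deque([(start1, start2)])
    while queue:
        q1, q2 = queue.popleft()
        if (q1 in accept1) != (q2 in accept2):
            return False   # a witness string lies in one of the differences
        for a in alphabet:
            pair = (delta1[q1, a], delta2[q2, a])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True

# Toy example: two differently drawn DFAs over {'a', 'b'}, both accepting
# exactly the strings that contain at least one 'a'.
A = ('p0', {'p1'}, {('p0', 'a'): 'p1', ('p0', 'b'): 'p0',
                    ('p1', 'a'): 'p1', ('p1', 'b'): 'p1'})
B = ('q0', {'q1', 'q2'}, {('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
                          ('q1', 'a'): 'q2', ('q1', 'b'): 'q1',
                          ('q2', 'a'): 'q2', ('q2', 'b'): 'q2'})
print(equivalent(A, B, 'ab'))   # True: the languages are the same
```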

But for context-free languages, the situation is not so good. Equivalence of context-free grammars is undecidable in general; it is decidable for some restricted classes (deterministic context-free languages, for instance), but I would have to spend time rereading Hopcroft and Ullman, or Grune and Jacobs, to work out whether any of those special cases applies here. And I don’t know of any good grammar-analysis tools. (When I ask people, they say the closest thing they know of to a grammar-analysis tool is the error messages from yacc and its ilk.) So even if one did buckle down and try to prove the original form of the grammar and the new form equivalent, the possibility of errors in the proof is quite real, and it would be nice to have a convenient way of generating a thorough set of test cases.

I can think of two ways to generate test cases:

  • Generation of random or pseudo-random strings; let’s call this Monte Carlo testing.
  • Careful systematic creation of test cases. I.e., hard work, either in the manual construction of tests or in setting things up for automatic test generation.

Naturally my first thought was how to avoid hard work by generating useful test cases with minimal labor.
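By way of illustration, here is about the smallest Monte Carlo generator I can imagine, written in Python over a toy grammar; the grammar, the rule names, and the depth cutoff are all invented for the example and are not the XSD regular-expression grammar.

```python
import random

# A toy grammar, vaguely reminiscent of a (drastically simplified)
# regular-expression grammar; the rules are invented for illustration.
# Each nonterminal maps to a list of alternatives; each alternative is
# a list of nonterminals and terminal strings.
GRAMMAR = {
    'regExp':     [['branch'], ['branch', '|', 'regExp']],
    'branch':     [['piece'], ['piece', 'branch']],
    'piece':      [['atom'], ['atom', 'quantifier']],
    'quantifier': [['*'], ['+'], ['?']],
    'atom':       [['char'], ['(', 'regExp', ')']],
    'char':       [['a'], ['b'], ['c']],
}

def generate(symbol, depth=0, max_depth=10):
    """Expand one symbol into a random string of terminals.

    Past max_depth, always take the first alternative; in this grammar
    every first alternative leads to terminals, so expansion bottoms out.
    """
    if symbol not in GRAMMAR:
        return symbol                     # terminal symbol: emit as-is
    alternatives = GRAMMAR[symbol]
    alt = alternatives[0] if depth > max_depth else random.choice(alternatives)
    return ''.join(generate(s, depth + 1, max_depth) for s in alt)

for _ in range(10):
    print(generate('regExp'))
```

The generator needs no understanding of the language at all, which is exactly the attraction.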

The bad news is that this only led to other questions, like “what do you mean by useful test cases?”

The obvious answer is that in the grammar comparison case, one wants to generate test cases which will expose differences in the languages defined by the two grammars, just as in the case of software one wants test cases which will expose errors in the program. The parallel suggests that one might learn something useful by attempting to apply general testing principles to grammars and to specifications based on grammars.

So I’ve been thinking about some questions which arise from that idea. In much of this I am guided by Glenford J. Myers, The art of software testing (New York: Wiley, 1979). If I had no other reasons for being grateful to Paul Cotton, his recommending Myers to me would still put me in his debt.

  • For the various measures of test ‘coverage’ defined by Myers (statement coverage, decision coverage, condition coverage, decision/condition coverage, multiple condition coverage), what are the corresponding measures for grammars?
  • If one generates random strings to use as test cases, how long does the string need to be in order to be useful? (For some meaning of “useful” — for example, in order to ensure that all parts of the grammar can be exercised.)
  • How long can the strings get before they are clearly not testing anything that shorter strings haven’t already tested adequately (for some meaning of “adequate”)?
  • From earlier experience generating random strings as test cases, I know that for pretty much any definition of “interesting test case”, the large majority of random test cases are not “interesting”. Is there a way to increase the likelihood of a test case being interesting? A way that doesn’t involve hard work, I mean.
  • How good a job can we do at generating interesting test cases with only minimal understanding of the language, or with only minimal analysis of its grammar or FSA?
  • What kinds of analysis offer the best bang for the buck in terms of improving our ability to generate test cases automatically?

Balisage offers hope for the deadline-challenged

In my never-ending quest to help those who, like myself, never get around to things until the deadline is breathing down their necks, I have until now avoided mentioning that Balisage, the conference on markup theory and practice, has issued a call for late-breaking news.

The deadline for late-breaking submissions is 13 June 2008. It is now officially breathing down your neck.

There, will that do the job?

The 13th of June is this Friday. Just enough time to write up that great piece of work you just did, but not long enough to make a huge big thing of it and get yourself all tied up in knots.

Balisage is an annual conference on markup theory and practice, held in early August each year in Montréal. Well, I say annual, but strictly speaking this is Balisage’s first year. The organizers have in the past been involved in other conferences in Montreal in August (most recently Extreme Markup Languages), and we regard Balisage as the natural continuation. So if you have always wanted to go to the Extreme Markup Languages conference, and are disappointed to see no announcements this year for that conference, come to Balisage. I think you’ll find what you’re looking for.

The full call for late-breaking news, and details of how to make a submission, are at http://www.balisage.net/latebreaking-call.html.

XML catalogs vs local caching proxy

[21 May 2008]

I have a senior colleague who has maintained, for several years, that SGML and XML catalogs are a deplorable special-case hack for a problem that should be solved by the more general means of HTTP caches. (Most recently, he was arguing against a proposal that the W3C distribute convenient packages of our most frequently used DTDs and schemas, with a catalog to make them easy to use. How someone so smart can be so deeply wrong-headed, I’m not sure.)

So when I had a network outage the other day that made it hard to get any work done, I thought about setting up a local caching proxy. Why did the outage make it hard to get anything done? Because I do use some software that doesn’t support catalogs, and which reacts to network outages by imposing a thirty-second delay for each DTD fetch (while its network request times out) and then proceeds anyway, without the DTD. Since it does proceed eventually, I can in fact build a new HTML version of the XSD spec (for example); it’s only that the process becomes painfully slow (or rather, even slower and more painful than usual).

But, I thought, the systems guys assure me that it’s not really hard for a user (not the system administrator, just a user) to set up a local caching proxy. So I’ll give it a try.

The upshot so far is: yes, it’s possible, though I wouldn’t call it easy. And managing catalogs still seems an order of magnitude easier and more straightforward. Here’s what I’ve done so far:

1. Apache ships with Mac OS X, it’s already running on my system (I use a local CGI script to log where my time goes), and mod_proxy enables it to serve as a local caching proxy. So I decided to try that, instead of installing squid or something similar. I found instructions for configuring Apache as a local caching proxy on a Mac OS X site; they worked (although their author suggests commenting out the line “Deny all”, in the mistaken belief that otherwise nothing works). I followed his advice and blocked a couple of random sites I can live without, in order to be able to request them and tell, from the resulting failure message, that the proxy service was working.

2. In System Preferences / Network / AirPort / Proxies, I told the system to use http://localhost:80 as a proxy for HTTP requests.

I had illusions that this was it. At the system level (I fantasized), outgoing HTTP requests would be re-routed to the local Apache.

Ha.

This does suffice for Safari, and possibly for other Apple software (I don’t know, haven’t looked, don’t much care right now). But Firefox must be told separately about the proxy server. And Opera.

And the command-line tools that were the main reason I wanted a caching proxy in the first place? RXP, libxml, Saxon, and so on? Nope, not using the proxy.

3. After some disappointing experiences with the documentation for the tools I’m using (none of the documentation I found says anything at all about how to tell the software to use a proxy server), I learned from oblique references somewhere that setting the environment variable http_proxy works for some Unix tools.

So I tried export http_proxy=http://localhost:80 and curl, at least, started using the proxy server. libxml (and thus xmllint and xsltproc) also started using it, I think, or trying to, but the main symptom of this success was that they started emitting error messages informing me helpfully that

error : Operation in progress

When I stopped Apache, that message went away. When I unset the http_proxy environment variable, it also went away (whether Apache was running or not).

4. Along about this time I decided just to make libxml use my local catalogs. This turned out to be harder than I thought: setting XML_CATALOG_FILES=/Library/SGML/Public/catalog.xml elicited only the laconic message from xsltproc, xmllint, xmlcatalog, and anything else that uses libxml: /Library/SGML/Public/Misc/catalog.xml:0: Catalog error : File /Library/SGML/Public/Misc/catalog.xml is not an XML Catalog.

But of course it is an XML catalog. I can see that.

I validated it, just to make sure, using both xmllint and rxp. No problems.

5. Eventually, it became clear that libxml wanted an explicit namespace declaration in the root element. (I had been relying on the default value given in the DTD.) So <catalog> had to become <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"> in all my XML catalogs. (DV, are you listening? The Namespaces Rec is quite explicit that namespace declarations may be defaulted by the DTD. Otherwise I never would have voted for it. RXP gets it right; thank you, Richard!)

6. Eventually, the sick minds of Liam Quin and John Snelson suggested that perhaps I should try a different value for http_proxy: instead of http://localhost:80 I should try export http_proxy=http://127.0.0.1:80/. This eliminated the “error : Operation in progress” messages.

So I now have a local caching proxy working, and some of my tools, at least, are using it when they don’t find what they need using catalogs. I’ll assume that this is a Good Thing. But nothing I’ve seen so far tells me how to configure Apache (or squid, or any other proxy) the way I want to. I want a convenient list of the resources in the local cache, and I want to be able to mark some of them (e.g. the DTDs and the W3C stylesheets I use most often) as “Never ever delete this; ALWAYS have a copy handy; check every few months to see if it needs updating.” From the documentation of Apache and of Squid, I am inclined to believe this is not actually possible. At the very least, it’s not obvious. By default, Apache’s mod_proxy appears to plan to delete everything after 24 hours regardless of its expiration date. And the default size of the cache appears (can this possibly be?!) to be 5 KB.

So, so far, the caching proxy does not give me the guarantee I want: that the resources I care about will always be available, network or no network.

For catalogs, on the other hand, it would be nice to have some software that would augment the catalog with information about when a particular copy of the resource was fetched, when it was last modified, what its expiration date is (if the server provides one; surprising how few Web servers actually provide useful expiration information), and would check the Web periodically (say, once a month or so) to see whether any of my local copies of Web resources should be updated.
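Something along these lines would do as a first cut, I think; a rough sketch in Python, assuming an OASIS catalog whose uri entries map remote URIs to absolute local paths (the HEAD request and the Last-Modified comparison are just one plausible way to do the check, and handling of Expires is left out):

```python
import email.utils
import os
import urllib.request
import xml.etree.ElementTree as ET

# The catalog mentioned above; the entries are assumed to have the form
# <uri name="http://..." uri="/local/path"/> with absolute local paths.
CATALOG = '/Library/SGML/Public/catalog.xml'
NS = '{urn:oasis:names:tc:entity:xmlns:xml:catalog}'

def remote_and_local_times(remote_uri, local_path):
    """HEAD the remote resource and return (remote mtime, local mtime),
    with None for the remote time if the server sends no Last-Modified."""
    req = urllib.request.Request(remote_uri, method='HEAD')
    with urllib.request.urlopen(req, timeout=30) as resp:
        last_modified = resp.headers.get('Last-Modified')
    remote = (email.utils.parsedate_to_datetime(last_modified).timestamp()
              if last_modified else None)
    return remote, os.path.getmtime(local_path)

for entry in ET.parse(CATALOG).getroot().iter(NS + 'uri'):
    remote_uri, local_path = entry.get('name'), entry.get('uri')
    try:
        remote_time, local_time = remote_and_local_times(remote_uri, local_path)
    except OSError as e:
        print(f'{remote_uri}: could not check ({e})')
        continue
    if remote_time is None:
        print(f'{remote_uri}: no Last-Modified from server; check by hand')
    elif remote_time > local_time:
        print(f'{remote_uri}: remote copy is newer than {local_path}')
```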

My interim conclusion is: both catalogs and HTTP caches could use improvement. As a way to ensure that the work I want to do can proceed without the network, however, catalogs are a lot more convenient, straightforward, and functional.

Balisage preliminary program


The Balisage 2008 program committee met this week, and we now have a preliminary program. (As I write, the schedule-at-a-glance isn’t up yet, but it really should be up any moment now.)

We’ve got a good range of topics, with talks on resource-oriented architectures, a framework for managing constraints, how markup relates to the philosophical problem of universals, and how to handle overlap in a format that makes good use of the non-overlapping properties of XML. And a parallel algorithm for XML parsing and validation that exploits multi-core machines, an XML Exchange for messaging that uses XQuery to route messages, structural metadata as a socio-technological problem, and using the Atom Publishing Protocol to build dynamic applications.

The day before Balisage begins, there will be a one-day symposium on versioning issues, which has also shaped up very nicely. (More on that in another post.)

Great topics, great people (I’ve heard a lot of these speakers, and I know they’re good), and Montreal in August. What more could you want?

So think hard about being in Montreal the week of 11-15 August, for the versioning symposium and for Balisage 2008. I look forward to seeing people there!