Allowing ‘extension primitives’ in XML Schema?

In issue 3251 against XSDL 1.1 (aka ‘XML Schema 1.1’ to those who haven’t internalized the new acronym), Michael Kay suggests that XSDL, like other languages, allow implementations to define primitive types additional to those described in the XSDL spec.

I’ve been thinking about this a fair bit recently.

The result is too much information to put comfortably into a single post, so I’ll limit this post to a discussion of the rationale for the idea.

‘User-defined’ primitives?

Michael is not the first to suggest allowing the set of primitives to be extended without revving the XSDL spec. I’ve heard others, too, express a desire for this or for something similar (see below). In one memorable exchange at Extreme Markup Languages a couple years ago, Ann Wrightson noted that in some projects she has worked on, the need for additional primitives is keenly felt. In the heat of the moment, she spoke feelingly of the “arrogance” of languages that don’t allow users to define their own primitives. I remember it partly because that word stung; I doubt that my reply was as calmly reasoned and equable as I would have liked.

Strictly speaking, of course, the notion of ‘user-defined primitives’ is a contradiction in terms. If a user can define something meaningfully (as opposed to just declaring a sort of black box), then it seems inevitable that that definition will appeal to some other concepts, in terms of which the new thing is to be understood and processed. Those concepts, in turn, either have formal definitions in terms of yet other concepts, or they have no formal definition but must just be known to the processor or understood by human readers by other means. The term primitive denotes this last class of things, that aren’t defined within the system. Whatever a user can define, in the nature of things it can’t very well be a primitive in this sense of the word.

Defining types without lying to the processor

But if one can keep one’s pedantry under control long enough, it’s worth while trying to understand what is desired, before dismissing the expression of the desire as self-contradictory. It’s suggestive that some people point to DTLL (the Datatype Library Language being designed by Jeni Tennison) as an instance of the kind of thing needed. I have seen descriptions of DTLL that claimed that it had no primitives, or that it allowed user-defined primitives (thus falling into the logical trap just mentioned), but I believe that in public discussions Jeni has been more careful.

In practice, DTLL does have primitives, in the sense of basic concepts not defined in terms of other concepts still more basic. In the version Jeni presented at Extreme Markup Languages 2006, the primitive notions in terms of which all types are defined are (a) the idea of tuples with named parts and (b) the four datatypes of XPath 1.0. (Note: I specify the version not because I know that DTLL has since changed, but only because I don’t know that it has not changed; DTLL, alas, is still on the ‘urgent / important / read this soon’ pile, which means it’s continually being shoved aside by things labeled ‘super-urgent / life-threatening / read this yesterday’. It also suffers from my expectation that I’ll have to concentrate and think hard; surfing the Web and reading people’s blogs seems less strenuous.)

But DTLL does not have primitives, in the sense of a set of semantically described types from which all other types are to be constructed. All types (if I remember correctly) are in the same pool, and there is no formal distinction between primitive types and derived types.

Of course, the distinction between primitives and derived types has no particular formal significance in XSDL, either. What you can do with one, you can do with the other. The special types, on the other hand, are, well, special: you can derive new types from the primitives, but you can’t derive new types from the specials, like xs:anySimpleType or (in XSDL 1.1) xs:anyAtomicType. Such derivations require magic (which is one way of saying there isn’t any formal method of defining the nature of a type whose base type is xs:anyAtomicType: it requires [normative] prose).

But XSDL 1.0 and the current draft of XSDL 1.1 do require that all user-defined types be defined in terms of other known types. And this has a couple of irritating corollaries.

If you want to define an XSDL datatype for dates in the form defined by RFC 2822 (e.g. “18 Jan 2008”), or for lengths in any of the forms defined by (X)HTML or CSS (which accept “50%”, and “50” [pixels], and “50*”, and “50em”, and so on), you can do so if you have enough patience to work out the required regular expressions. But you can’t derive the new rfc2822:date type from xs:date (as you might wish to do, to signal that they share the same value space). You must lie to the processor, and say that really, the set of RFC 2822 dates is a subtype of xs:string.
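
For concreteness, here is roughly what that lie looks like in schema form. This is a sketch only: the pattern is deliberately simplified (no day-of-week, no obsolete forms), and the type name is illustrative.

    <!-- Sketch: RFC 2822-style dates forced to derive from xs:string.
         Nothing here tells the processor these strings denote dates. -->
    <xs:simpleType name="rfc2822-date">
      <xs:restriction base="xs:string">
        <xs:pattern
          value="\d{1,2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}"/>
      </xs:restriction>
    </xs:simpleType>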

Exercise for those gifted at casuistry: write a short essay explaining that what is really being defined here is the set of RFC 2822 date expressions, which really are strings, so this is actually all just fine and nothing to complain about.

Exercise for those who care about having notations humans can actually use: read the essay produced by your casuist colleague and refrain from punching the author in the eye.

Lying to the processor is always dangerous, and usually a bad idea; designing a system that requires lying to the processor in order to do something useful is similarly a bad idea (and can lead to heated remarks about the arrogance of the system). When the schema author is forced to pretend that the value space of email dates is the value space of strings (i.e. sequences of UCS characters), it not only does violence to the soul of the schema author and makes the processor miss the opportunity to use a specially optimized storage format for the values, but it also makes it impossible to derive further types by imposing lower and upper bounds on the value space (e.g. a type for plausible email dates: email dated before 1950? probably not). And you can’t expect the XSDL validator to understand the relation among pixels and points and picas and ems and inches, so just forget about the idea of restricting the length type with an upper bound of “2 pc” (two picas) and having the software know that “30pt” (thirty points) exceeds that upper bound.
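
To make the loss concrete, here is the kind of further restriction one would like to write, and cannot: bounding facets apply only to ordered types, and a restriction of xs:string is not ordered. (The type names continue the sketch above and are illustrative.)

    <!-- Not allowed: minInclusive cannot be applied to a type whose
         primitive ancestor is xs:string, so a conforming processor
         must reject this derivation. -->
    <xs:simpleType name="plausible-email-date">
      <xs:restriction base="my:rfc2822-date">
        <xs:minInclusive value="1 Jan 1950"/>
      </xs:restriction>
    </xs:simpleType>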

What about NOTATION?

As suggested in the examples just above, there are a lot of specialized notations that could usefully be defined as extension primitives in an XSDL context. Different date formats are a rich vein in themselves, but any specialized form of expression used to capture specialized information concisely is a candidate. Lengths, in a document formatting context. Rational numbers. Colors (again in a display context). Read the sections on data types in the HTML and CSS specs. Read the section on abstract data types in the programming textbook of your choice. Lots of possibilities.

One might argue that the correct way to handle these is to declare them for what they are: specialized notations, which may be processed and validated by a specialized processor (called, perhaps, as a plugin by the validator) but which a general-purpose markup validator needn’t be expected to know about.
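
The declaration machinery has been there since SGML days; in XML DTD syntax it would look roughly like this (identifiers illustrative):

    <!-- Declare the notation, then declare an attribute as being in
         that notation; checking the content itself is left to some
         notation-aware processor, not to the markup validator. -->
    <!NOTATION rfc2822-date SYSTEM "https://www.ietf.org/rfc/rfc2822.txt">
    <!ELEMENT sent (#PCDATA)>
    <!ATTLIST sent fmt NOTATION (rfc2822-date) #FIXED "rfc2822-date">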

In principle, this could work, I guess. And it may be (related to) what the designers of ISO 8879 (the SGML spec) had in mind when they defined NOTATION. But I don’t think NOTATION will fly as a solution for this kind of problem today, for several reasons:

  • There is no history or established practice of SGML or XML validators calling subvalidators to validate the data declared as being in a particular notation. So the ostensible reason for using NOTATION (“It’s already there! you don’t need to do anything!”) doesn’t really hold up: using declared notations to guide validation would be plowing new ground.
  • Almost no one ever really wants to use notations. In part this is because few people ever feel they really understand what notations are good for; those who do rarely stay sure of it for long, and they don’t always agree with one another. So software developers never really know what to do with notations either, and end up doing nothing much with them.
  • If RFC 2822 dates are a ‘notation’ rather than a ‘datatype’, then surely the same applies to ISO 8601 dates. Why treat some date expressions as lexical representations of dates, and others as magic cookies for a black-box process? If notations were intended to keep software that works with SGML and XML markup from having to understand things like integers and dates, then the final returns are now in, and we can certify that that attempt didn’t succeed. (Some document-oriented friends of mine like to tell me that all this datatype stuff was foisted on XSDL by data-heads and does nothing at all for documents. I keep having to remind them that they spent pretty much the entire period from 1986 to 1998 warning their students that neither content models nor attribute declarations allowed you to specify, for example, that the content of a quantity element, or the value of a quantity attribute, ought to be [for example] a quantity — i.e. an integer, preferably restricted to a plausible range. XSDL may have given you things you never asked for, and it may give you things you asked for in a form you didn’t expect and don’t much like. But don’t claim you never asked for datatypes. I was there, and while I don’t have videotapes, I do remember.)

Who defines the new primitives?

Some people are nervous at the idea of trying to allow users to define new kinds of dates, or new kinds of values, in part because the attempt to allow the definition of arbitrary value spaces, in a form that can actually be used to check the validity of data, seems certain to end up putting a Turing-complete language into the spec, or pointing to some existing programming language and requiring that people use that language to define their new types (and requiring all schema processors to include an interpreter for that language). And the spec ends up defining not just a schema language, but a set of APIs.

However you cut it, it seems a quick route to a language war; include me out.

Michael Kay has pointed out that there is a much simpler way. Don’t try to provide for user-defined types derived by restriction from xs:anyAtomicType in some interoperable way. That would require a lot of machinery, and would be difficult to reach consensus on.

Michael proposes: just specify that implementations may provide additional implementation-defined primitive types. In the nature of things, an implementation can do this however it wants. Some implementors will code up email dates and CSS lengths the same way they code the other primitives. Fine. Some implementors will expose the API that their existing primitive types use, so that they can choose, at the appropriate moment, whether or not to link in a set of extension types. Some will allow users to provide implementations of extension types, using that API, and link them at run time. Some may provide extension syntax to allow users to describe new types in some usable way (DTLL, anyone?) without having to write code in Java or C or [name of language here].
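
From the schema author’s point of view, the payoff might look something like this. Everything here is hypothetical: the ext: namespace and the type behind it stand in for whatever a particular implementation chooses to provide.

    <!-- Hypothetical: ext:rfc2822-date is an implementation-defined
         primitive whose value space really is dates, so ordering
         facets work and no lying to the processor is needed. -->
    <xs:simpleType name="plausible-email-date">
      <xs:restriction base="ext:rfc2822-date">
        <xs:minInclusive value="1 Jan 1950"/>
      </xs:restriction>
    </xs:simpleType>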

That way, all the burden of designing the way to allow user-specified types to interface with the rest of the implementation falls on the implementors, if they wish to carry it, and not on the XSDL spec. (Hurrah, cries the editor.)

If enough implementations do that, and enough experience is gained, the relevant community might eventually come to some consensus on a good way to do it interoperably. And at that point, it could go into the spec for some schema language.

This has worked tolerably well for the analogous situation with implementation-defined / user-defined XPath functions in XSLT 1.0. XSLT 1.0 doesn’t provide syntax for declaring user-defined functions; it merely allows implementations to support additional functions in XPath expressions. Some implementations did so, supporting either their own functions or functions defined by projects like EXSLT. And some implementations did in fact provide syntax for allowing users to define functions without writing Java or C code and running the linker. (And lo! the experience thus gained has made it possible to get consensus in XSLT 2.0 for standardized syntax for user-defined functions.)
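
By way of illustration, here is the kind of thing the standardized XSLT 2.0 syntax allows. The function itself is a made-up example; the xsl:function element is standard 2.0 syntax.

    <!-- XSLT 2.0: a user-defined function declared in the stylesheet
         itself; no Java, no C, no linker. -->
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:f="http://example.org/f">

      <xsl:function name="f:initials" as="xs:string">
        <xsl:param name="name" as="xs:string"/>
        <xsl:sequence select="string-join(
            for $w in tokenize($name, '\s+')
            return substring($w, 1, 1), '')"/>
      </xsl:function>

      <xsl:template match="/">
        <!-- yields 'EML' -->
        <xsl:value-of select="f:initials('Extreme Markup Languages')"/>
      </xsl:template>

    </xsl:stylesheet>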

But with that thought, I have reached a topic perhaps better separated out into another post. Other languages, specs, and technologies have faced similar decisions (should we allow implementations to add new blorts to our language, or should we restrict them to using the standard blorts?); what has the experience been like? What can we learn from the experience?

Safari’s love/hate relation with XML

[17 January 2008]

So what is it about Safari and XSLT?

I write a lot of documents. I write them in XML. I really like it when publishing them on the Web means: just checking them into the W3C’s CVS repository (from which they propagate to the W3C’s Web servers automatically, courtesy of some extremely nifty software put together by our Systems Team with chewing gum, scotch tape, and baling wire, er, great ingenuity). No muss, no fuss. No running make or ant. Just. Check. It. In. And presto! it’s on the Web.

And mostly that works.

Actually, for the browsers I usually use, it always works. But I have friends who tell me I really should be using Safari, as it’s faster and simpler and better in ways that momentarily defeated their ability to explain — if I tried it for a while, I would see, I think was the idea.

But Safari puts me in a bind.

I can’t use Safari as my daily browser if I can’t reliably display XML in it.

And I can’t write it off entirely as a waste of my time, since much of the time it does display XML just fine.

That is: sometimes it works. And sometimes it doesn’t. And so far I have not been able to get much light shed on when.

To take a simple example: consider this working paper I’m writing for the W3C SML Working Group. There are two copies of this document: one on the W3C server at the URI just linked to, and one on my local file system, in the directory subtree that holds stuff I’ve checked out from the W3C CVS server. All the references (to DTD, to stylesheet, to images, …) are relative, so that they work on the local copy even when I’m off the network, and so I don’t have to change them when I check revisions in.
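
The stylesheet linkage involved is nothing exotic: just the standard xml-stylesheet processing instruction at the top of the document, with a relative reference (the file name here is illustrative).

    <?xml version="1.0" encoding="utf-8"?>
    <?xml-stylesheet type="text/xsl" href="../common/display.xsl"?>
    <!-- the relative href resolves against the document's own
         location, so one source works both locally and on the server -->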

The local document displays fine in Firefox. So does the copy on the server. Opera displays both the local and the Web copy just fine. Internet Explorer displays the Web copy just fine. I don’t have a copy of IE that can check the local copy, but I used IE for display of local XML for a long time; I’m confident it would work fine.

Safari displays the local copy just fine.

And on the Web copy? Safari gives me a blank screen.

Safari has, it seems, a love/hate relation with XML.

And that means I have a love/hate relation with Safari.


Tony Coates has written a URI resolver for .NET

Tony Coates reports that he has written a URI resolver for .NET that supports XML catalogs.

XML catalogs allow local caching of long-lived material like schemas, DTDs, and the like. They thus make it feasible to work with XML software that uses URIs to locate such material even when you have no network access. They also make it easier to work around discrepancies in software. Some programs I use support normal URIs as system identifiers; others (mostly older stuff, admittedly, but programs I still want to use) really only support direct access to the file system. But when programs support catalogs, I can generally make both kinds work fine, and avoid having to munge the system identifiers in a document each time I open it in a different program or my network status changes. Richard Tobin’s rxp became much more useful (and went from being something I used every few months to something I use essentially all the time) when he added catalog support; other software that I would otherwise use routinely doesn’t support catalogs, because the developer doesn’t see the point (I won’t name names, but you know who you are), with the result that I don’t use it at all when I can avoid it.
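
For readers who have never used one: an OASIS XML catalog is itself a small XML document that maps identifiers to local copies. A minimal example (local paths illustrative):

    <!-- Map a well-known public/system identifier to a local copy;
         a catalog-aware tool then resolves it without the network. -->
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
              uri="dtds/xhtml1-strict.dtd"/>
      <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
              uri="dtds/xhtml1-strict.dtd"/>
    </catalog>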

Tim Berners-Lee once told me that really, catalogs seemed just a special case of local caching, and that one should use the local cache instead of a special-purpose mechanism. In some ways, he’s right, and I’m willing to stop using catalogs when I learn about a local cache mechanism that is as well documented, as simple to use, and as application-independent as catalogs have proven to be. (Who is willing to hold their breath until that happens?)

Good work, Tony!

If only all XML software supported XML catalogs.


Does XML have a future on the Web?

Earlier this month, the opening session of the XML 2007 conference was devoted to a panel session on the topic “Does XML have a future on the Web?” Doug Crockford (of JSON fame) and Michael Day (of YesLogic) and I talked for a bit, and then the audience pitched in.

(By the way — surely there is something wrong when the top search result for “XML 2007 Boston” is the page on the 2006 conference site that mentions the plans for 2007, instead of a page from the 2007 conference site. Maybe people are beginning to take the winter XML conference for granted, and not linking to it anymore?)

Michael Day began by pointing out that in the earliest plans, the topic for the session had included the word “still”, which had made him wonder: “Did XML ever have a future on the Web?” He rather thought not: XML, he said, was yet another technology originally intended for the browser that ended up on the server, instead. No one serves XML on the Web, he said, and when they try something as simple as XHTML, it’s not well-formed. (This, of course, is a simplistic caricature of his remarks, which were a lot more nuanced and thoughtful than this. But of course, while he was speaking I was trying to remember what I was supposed to say; these are the bits of his opening that penetrated my skull and stuck.)

Doug Crockford surprised me a bit; from what I had read about JSON and his sometimes fraught relations with XML, I had expected him to answer “No” to the question in the session title. But he began by saying quite firmly that yes, he thought XML had a very long future on the Web. He paused while we chewed on that a moment, before putting in the knife. We know this, he said, because once any technology is deployed, it can take forever to get rid of it again. (You can still buy Cobol compilers, he pointed out.) If I understood him correctly, his view is that XML (or XHTML, or the two together with all their associated technologies) has been a huge distraction for the Web community, and nothing to speak of has been done on HTML or critical Web technologies for several years as a result. We need, he thought, to rebuild the Web from its foundations to improve reliability and security.

It gives me some regret now that I did not interrupt at this moment to point out that XHTML and XForms are precisely an effort (all in all, a pretty good one) to improve the foundations of the Web, but I wasn’t quick enough to think of that then. (I also didn’t think to say that being compared to Grace Murray Hopper, however indirectly and with whatever intention, is surely one of the highest compliments anyone has ever paid me. Thank you, Doug!) And besides, it’s bad form to interrupt other panelists, especially when it’s your turn to speak next.

Since I have cut so short what Michael Day and Doug Crockford said, I ought in perfect fairness to truncate my own remarks just as savagely, so the reader can evaluate what we said on some sort of equal footing. But this is my blog, so to heck with that.

Revised slightly for clarity, my notes for the panel read something like the following (I have added some material in italics, either to reflect extempore additions during the session or to reflect later pentimenti). I’d like to have given some account of the ensuing discussion, as well, but this post is already a bit long; perhaps in a different post.

I agree with Doug Crockford in answering “Yes” to the question, but we have different reasons. I don’t think XML has a future just because we can’t manage to get rid of it; I think it ought to have a future, because it has some properties that are hard to find elsewhere.

1 What do we mean by “the Web”?

A lot depends on what we mean by “the Web”. If we mean Web 2.0 Ajax applications, we may get one answer. If we mean the universe of data publicly accessible through HTTP, the answer might be different. But neither of these, in reality, is “the Web”.

If there is a single central idea of the Web, it’s that of a single connected information space that contains all the information we might want to link to — that means, in practice, all the information we care about (or might come to care about in future): not just publicly available resources, but also resources behind my enterprise firewall, or on my personal hard disk. If there is a single technical idea at the center of the Web, it’s not HTTP (important though it is) but the idea of the Uniform Resource Identifier, a single identifier space with distributed responsibility and authority, in which anyone can name things they care about, and use their own names or names provided by others, without fear of name collisions.

Looked at in this way, “the Web” becomes a rough synonym for ‘data we care about’, or ‘the data we process, store, or manage using information technology’. And the question “Does XML have a future on the Web?” becomes another way of asking “Does XML have a future?”

Not all parts of the Web resemble each other closely. In some neighborhoods, rapid development is central, and fashion rules all things. In others, there are large enterprises for whom fashion moves more slowly, if at all. Data quality, fault tolerance, fault detection, reliability, and permanence are crucial in a lot of enterprises.

The Web is for everyone. So a data format for the Web has to have good support for internationalization and accessibility.

Any data format for “the Web” must satisfy a lot of demands beyond loading configuration data or objects in a client-side Javascript program. As Murata Makoto has often said, one reason to be interested in XML is that it offers us the possibility of managing in a single notation data that for a long time we held separately, in databases and in documents, managed by separate tool sets. General-purpose tools are sometimes cumbersome for particular specialized forms of data, but the provision of a common model and notation is a huge win; before I decide to use another specialized notation, I want to think hard about the costs of yet another notation.

I think XML has a future on the Web because it is the only format around that can plausibly be used for such a broad range of different kinds of data.

2 Loose coupling, tight coupling

One of the important technical properties of the Web is that it encourages a relatively loose coupling between parts of the larger system. Because the server and the client communicate through a relatively narrow channel, and because the HTTP server is stateless, client and server can develop independently of each other.

In a typical configuration there are lots of layers, so there are lots of points of flexibility, lots of places where we can intervene to process requests or data in a different way. By and large, the abstractions are not very leaky, so we can change things at one layer without disturbing (very much) things in the adjoining layers.

In information systems, as in physical systems [or so I think — but I am not a mechanical engineer], loose couplings incur a certain efficiency cost, and systems with tighter couplings are often more efficient. But loose coupling turns out to be extremely useful for allowing diverse communities to satisfy diverse needs on the Web. It turns out to be extremely useful in allowing the interchange of information between unlike devices: if the Web had tighter coupling, it would be impossible to provide Web access to new kinds of devices. And, of course, loose coupling turns out to be a good way of allowing a system to evolve and grow.

One of the secrets of loose coupling is not to expose, between the two partners in an information exchange, more information than you want to.

And in this context, some of the notations sometimes offered as alternatives to XML (at least in some contexts) — or for that matter, as uses of XML — have always made me nervous. We’re building a distributed system; we want to exchange information between client and server, while limiting their mutual dependencies, so that we can refactor either side whenever we need to. And you want me to expose my object structures?! Are you out of your mind? In polite company there is such a thing as too much information. And exhibiting my object structures for the world to see is definitely a case of too much information. I don’t want to see yours, and I don’t want you to see mine. Sorry. Let’s stick to the business at hand, and leave my implementation details out of it.

So, second, I think XML has a future on the Web because (for reasons I think are social as much as technical) the discipline of developing XML vocabularies has a pretty good track record as a way of defining interfaces with loose coupling and controlled exposure of information.

3 Publication without lossy down-translation

There were something like two hundred people actively involved in the original design of XML, and among us I wouldn’t be surprised to learn that we had a few hundred, or a few thousand, different goals for XML.

One goal I had, among those many, was to be able to write documents and technical papers and essays in a descriptive vocabulary I found comfortable, and to publish them on the Web without requiring a lossy down-translation into HTML. I made an interesting discovery a while ago, about that goal: we succeeded.

XML documents can now be read, and styled using XSLT, by the large majority of major browsers (IE, Mozilla and friends, Opera, Safari). It’s been months since I had to generate an HTML form of a paper I had written, in order to put it on the Web.

I know XML has a future on the Web because XML makes it easier for publishers to publish rich information and for readers to get richer information. No one who cares about rich information will ever be willing to go back. XML will go away only after you rip it out of my cold, dead hands.

[After the session, Norm Walsh remarked “and once they’re done with your cold dead hands, they’ll also have to pry it out of mine!”]


One reason to think that XML has found broad uptake is the sheer variety of people complaining about XML and the contradictory nature of the problems they see and would like to fix. For some, XML is too complicated and they seek something simpler; for others, XML is too simple, and they want something that supports more complex structures than trees. Some would like less draconian error handling; others would like more restrictive schema languages.

Any language that can accumulate so many different enemies, with such widely different complaints, must be doing something right. Long life to descriptive markup! Long life to XML!