I’ve been wandering early …

[21 October 2008]

I’ve been traveling a good deal lately.

This is the first in a series of posts recording some of my impressions.

In late September the XSL Working Group held a one-week meeting in Prague to work on revisions to XSLT intended to make it easier to support streaming transformations. By streaming, the WG means:

  • The transformation can run in an amount of memory independent of document size (sometimes constant memory, sometimes memory proportional to document depth, sometimes memory proportional to the size of discrete windows of data in the document); a small sketch of this sense appears just after the list.
  • The transformation can begin delivering results before all of the input is available (e.g. can work on so-called ‘infinite’ XML documents like streams of stock quotations).
  • The transformation can be performed in a single pass over the document.
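
To make the first of these senses concrete, here is a small illustration (my own, in Python rather than XSLT, and nothing to do with the WG’s drafts): ElementTree’s iterparse hands you each element as its end tag is seen, and if you clear each element once you have dealt with it, the bulk of the document never needs to be in memory at once (a skeleton of emptied elements does accumulate at the top, but the point should be clear).

    import xml.etree.ElementTree as ET

    def count_quotes(path):
        """Count 'quote' elements (a hypothetical element name) in a large
        file without ever building the whole tree in memory."""
        count = 0
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "quote":
                count += 1
            elem.clear()   # discard this element's content once we are done with it
        return count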

It turns out that for different use cases it can be necessary or useful to:

  • declare a particular input document as streamable / to be streamed if possible
  • declare a particular set of templates as streamable
  • declare that particular parts of the document need to be available in full (buffered for random access) for part of the transform, but can then be discarded (e.g. for windowing use cases)

Some members of the WG may have been apprehensive at the thought of five straight days of WG discussions. Would we have enough to do, or would we run out of ideas on Tuesday and spend Wednesday through Friday staring at the floor in embarrassment while the chair urged us to get some work done? (If no one else harbored these fears, I certainly did.) But in fact we had a lively discussion straight through to the end of the week, and made what I think was good progress toward concrete proposals for the spec.

Among the technical ideas with the most legs is (I think) the idea that sometimes what you want to do with a particular node in the input tree can actually be done with a partial copy of the input tree, and that different kinds of partial copy may be appropriate in different situations.

If you perform a deep copy of an element node in an XDM data model instance, for example, you have access to the entire subtree rooted at that node, but not to any of its siblings or ancestors, nor to anything else in the tree from which it came. For cases where you wish to (or must) allow access to the subtree rooted in a node, but to nothing else, such a deep copy is ideal: it preserves the information you want to have available, and it makes all other information inaccessible. (This is essentially the way that XSD 1.1 restricts assertions to the subtree rooted in a given node: logically speaking the assertions are evaluated against a copy of the node, not against the node itself.)
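
A quick way to see the point in code (my own illustration, using lxml in Python, not anything from the XSD spec): a deep copy of an element keeps its descendants, but it has no parent or siblings left to navigate to.

    import copy
    from lxml import etree

    doc = etree.fromstring("<doc><a/><b><c/></b></doc>")
    b = doc.find("b")
    b_copy = copy.deepcopy(b)

    print(b.getparent().tag)   # 'doc': the original element still knows its context
    print(b_copy.getparent())  # None:  the copy knows only its own subtree
    print(len(b_copy))         # 1:     the child 'c' is still there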

Several kinds of copy can be distinguished. In the terminology of the XSL Working Group (using terms introduced mostly by Michael Kay and Mohamed Zergaoui), they are as follows; a rough sketch in code appears after the list:

  • Y-copy: contains the subtree rooted in the node being copied, and also all of its ancestor nodes and their attributes, but none of their siblings. It is thus shaped like an upside-down uppercase Y.
  • Nabla-copy: contains just the subtree rooted in the node being copied. It is thus shaped like an upside-down nabla. (Yes, also like a right-side-up delta, but since the Y-copy requires the Y to be inverted, we say nabla copy not delta copy. Besides, a delta copy would sound more like something used in change management.)
  • Dot-copy: contains just the node being copied itself, plus its attributes if any.
  • Yen-copy: like a Y-copy, but includes the siblings of each ancestor together with their attributes (although not their children).
  • Spanish exclamation-point copy: contains just the node being copied and its ancestors, together with their attributes. It is thus shaped like an exclamation point (a dot with something above it), or equivalently like an upside-down Spanish exclamation point.
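
Here is the rough sketch promised above, again in Python with lxml; it is my own paraphrase of the shapes, not anything the WG has written down, and it ignores text nodes, namespace prefixes, and the yen-copy (a fiddlier variation on the Y-copy).

    import copy
    from lxml import etree

    # Illustrative only: the shallow copies keep attributes but drop text
    # content and namespace prefix declarations.

    def dot_copy(el):
        """Just the node being copied, with its attributes."""
        return etree.Element(el.tag, dict(el.attrib))

    def nabla_copy(el):
        """The whole subtree rooted in the node (an ordinary deep copy)."""
        return copy.deepcopy(el)

    def exclamation_copy(el):
        """The node plus its ancestors and their attributes, with no siblings
        and no descendants; returns the copy of the original root."""
        node = dot_copy(el)
        for anc in el.iterancestors():
            wrapper = dot_copy(anc)
            wrapper.append(node)
            node = wrapper
        return node

    def y_copy(el):
        """Like the exclamation-point copy, but keeping the full subtree
        rooted in the node being copied."""
        node = nabla_copy(el)
        for anc in el.iterancestors():
            wrapper = dot_copy(anc)
            wrapper.append(node)
            node = wrapper
        return node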

I’ve been quite taken recently by one possible application of these ideas outside of the streaming XSLT work. In the current draft, assertions in XSD 1.1 are restricted to / are evaluated against a nabla-copy of the element or attribute being validated, and the conditions used for conditional type assignment are evaluated against a dot-copy of the element. These restrictions are painful, especially the latter, since it makes it impossible to select the type of an element depending on its xml:lang value (which is inherited from an ancestor if not specified locally). But XSD 1.1 could readily support conditions on the nearest value of xml:lang if conditions were evaluated on a Spanish-exclamation-point copy of the element in question, instead of on a dot-copy. I don’t know whether the XML Schema WG will buy this approach, but the possibility does seem to suggest that there is value in thinking about things in terms of the invariants preserved by different kinds of node-copying operations.
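
To make the xml:lang example concrete (again a sketch of my own, not proposed spec text): the effective language of an element is determined entirely by the element and its ancestors, together with their attributes, which is exactly the information a Spanish-exclamation-point copy preserves.

    from lxml import etree

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

    def nearest_xml_lang(el):
        """Return the value of xml:lang on the element or its nearest ancestor."""
        for node in (el, *el.iterancestors()):
            if XML_LANG in node.attrib:
                return node.attrib[XML_LANG]
        return None

    doc = etree.fromstring('<doc xml:lang="en"><sec><p>hi</p></sec></doc>')
    print(nearest_xml_lang(doc.find(".//p")))   # 'en', inherited from the document element

A dot-copy of the p element would leave a check like this with nothing to walk.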

Thoughts to come back to …

[27 August 2008]

Observation: both RDF and Topic Maps seem to aspire to make it easy to find all the ways in which a given thing (topic or resource) may appear in a given body of data.

In both, the basic goal appears to be that if you look for Essex or Washington, you should be able to specify whether you mean the human being or the geographic entity (and probably be able to distinguish the state of Washington from the various cities named Washington), and find it no matter where it appears in the system. In RDF terms, this would mean no matter which triples it appears in, and no matter whether it appears as subject or object of the triples; in topic map terms, it would mean no matter which roles it plays in which associations.
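
In code, the RDF half of this looks something like the following minimal illustration with rdflib (the file name and the URI are made up):

    from rdflib import Graph, URIRef

    g = Graph()
    g.parse("data.ttl", format="turtle")   # hypothetical data set

    washington_state = URIRef("http://example.org/id/washington-state")

    # Every triple in which the resource appears, whether as subject or as object.
    appearances = list(g.triples((washington_state, None, None))) + \
                  list(g.triples((None, None, washington_state)))

    for s, p, o in appearances:
        print(s, p, o)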

Observation: Codd insists firmly that, to be acceptable in his eyes, relational database management systems must provide built-in system-level support for domains (which may be regarded as a form of extended data type with slightly more semantics than the basic types in a typical programming language, so that you can distinguish people from places even if you use VARCHAR(60) columns to represent both). But that is not all: they must also include ways of finding all the values actively in use for a given domain, regardless of relation or column, and of finding all the occurrences of a particular value of a particular domain, without getting it mixed up with values from different domains which happen to use the same underlying basic datatype in their representation. (For those with The relational model for database management, version 2 (Reading: Addison-Wesley, 1990) on the shelf, I’m looking at sections 3.4.1 The FAO_AV Command and 3.4.2 The FAO_LIST Command.)
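
And the relational half, reduced to a toy (my own sketch, not Codd’s FAO syntax): given a catalog recording which columns draw on which domains, the two commands amount to ‘all values in use for a domain’ and ‘all occurrences of one value of a domain’, across relations and columns.

    # Hypothetical catalog: which (relation, column) pairs draw on which domain.
    CATALOG = {
        ("employees", "home_city"): "city",
        ("offices", "city"): "city",
        ("employees", "name"): "person_name",
    }

    # db below is assumed to be a dict mapping relation names to lists of row dicts.

    def values_in_use(db, domain):
        """All distinct values currently stored anywhere in the given domain."""
        found = set()
        for (rel, col), dom in CATALOG.items():
            if dom == domain:
                found.update(row[col] for row in db[rel])
        return found

    def occurrences(db, domain, value):
        """Every (relation, column, row) in which the value occurs as that domain."""
        return [(rel, col, row)
                for (rel, col), dom in CATALOG.items() if dom == domain
                for row in db[rel] if row[col] == value]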

Question: Are the RDF and TM communities and Codd after essentially the same thing here? Is there some sense in which they fulfil (or are trying to fulfil) this part of Codd’s ideal for data management better than SQL systems do?

What exactly is the relation between this aspect of both RDF and TM on the one hand, and Codd’s notion of domains or extended data types on the other?

I’ve wanted to think about this for years, but have not managed to find anyone to discuss it with who had (a) sufficient knowledge of both semweb or Topic-Map technology and relational theory (or rather Codd’s particular doctrines) and (b) the time and inclination to go into it, at (c) a time when I myself had the time and inclination. But someday, perhaps …

Information and the laws of thermodynamics

[27 August 2008]

Liam Quin has been spending most of his blogging effort lately on his site fromoldbooks.org, but he recently posted a thoughtful piece on his ‘other blog’, In search of XMLence, on the relations among standard off-the-shelf vocabularies, customized vocabularies, what widely deployed systems can understand, and what people want to be able to say. (“The Super Information Archipelago”, 26 August 2008.)

One net result: if you use a vocabulary customized to your application, you get a better fit, at some cost in effort (or money) and immediate interchangeability (or interoperability) with other vocabularies. If you use standard one-size-fits-all vocabularies, you get free interoperability with others, at the cost of a poorer fit for the work you want to do in the first place. Liam imagines the situation of users of custom vocabularies as a kind of archipelago of islands of rich information in a sea of, ah, less rich information, sometimes connected by good bridges.

Something in the tradeoff (perhaps it’s the image of the islands) reminds me of the observation someone made long ago when they were teaching me about the three laws of thermodynamics. This was the kind of conversation where there was a little less emphasis on the laws as formulated in physics books than on the popular paraphrase:

  1. You can’t win.
  2. You can’t break even.
  3. You can’t get out of the game.

If the laws of thermodynamics say that entropy always increases, someone asked, then how is it possible that in some situations entropy seems to decrease, and order to increase? Many of us spend a lot of our time trying to build systems to organize things or information; if the universe is stacked against us, how is it that we ever have even the illusion of having succeeded? (And is this a good excuse for not even trying to organize my desk?)

Entropy does increase, was the answer, but only in the universe as a whole — not necessarily at every point in the universe uniformly. You can increase the order, and decrease the entropy, in a restricted portion of the universe — just not in the universe as a whole.

It sheds light from a different angle on the issues of data islands and walled gardens.

Il faut cultiver notre jardin. (We must cultivate our garden.)

Further notes on optimistic concurrency and XML parsing

[22 August 2008]

I just posted some notes on a paper given at Balisage 2008 by Yu Wu et al. of Intel.

While I was writing up those notes, a few thoughts occurred to me that might merit separate consideration.

How effective could pessimization be?

A key part of the optimistic concurrency algorithm presented by Yu Wu et al. is that the process of chunking the document needs to be quick. So they make some guesses, when chunking, that could later be proven wrong; in that case, the chunk needs to be re-parsed.
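
To fix ideas, here is a much simplified sketch of that scheme (entirely my own toy, not the Intel parser; the only context it tracks is whether we are inside a comment): parse every chunk in parallel on the optimistic guess that it starts outside a comment, then walk the chunks in order and re-parse any chunk whose guess turns out to have been wrong.

    from concurrent.futures import ThreadPoolExecutor

    # Toy model only: real chunking and pre-scanning are far more involved.

    def parse_chunk(chunk, in_comment):
        """Toy 'parse': strip XML comments, tracking only whether we are inside
        one. Returns (kept_text, context_at_end_of_chunk)."""
        i, kept = 0, []
        while i < len(chunk):
            if not in_comment and chunk.startswith("<!--", i):
                in_comment, i = True, i + 4
            elif in_comment and chunk.startswith("-->", i):
                in_comment, i = False, i + 3
            else:
                if not in_comment:
                    kept.append(chunk[i])
                i += 1
        return "".join(kept), in_comment

    def speculative_parse(chunks, threads=8):
        guesses = [False] * len(chunks)   # optimistic: no chunk starts inside a comment
        with ThreadPoolExecutor(max_workers=threads) as pool:
            results = list(pool.map(parse_chunk, chunks, guesses))
        context, fixed = False, []
        for chunk, guess, (text, end) in zip(chunks, guesses, results):
            if guess != context:          # mis-speculation: re-parse with the true context
                text, end = parse_chunk(chunk, context)
            fixed.append(text)
            context = end
        return "".join(fixed)

On an input in which a comment opens in chunk 1 and does not close until the last chunk, every chunk after the first gets parsed twice; that is the flavor of worst case I have in mind below.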

I suppose the worst-case scenario here is that a sufficiently lucky and malignant adversary could construct a document in which the context at the end of chunk 1 means that chunk 2 needs to be reparsed, and the reparsing of chunk 2 reveals for the first time that chunk 3 now needs to be reparsed, and so on, so that in the end you use n time slices to parse n chunks, instead of n divided by the number of threads.

So there’s an interesting question: how long can we keep this up?

It’s pretty clear that if we know exactly where the pre-scanner will break the chunks, then we can devise an XML document that forces chunk 2 to be reparsed. Can we construct a document in which only the second, correct parse of chunk 2 reveals that chunk 3 now needs to be reparsed (i.e. in which the first parse of chunk 2 makes chunk 3 look OK, and the second one shows that it’s not OK)?

Can we make a document in which every time we reparse a chunk with the correct context, we discover that the next chunk also needs to be reparsed? How much reworking can an omniscient and malevolent XML author cause this algorithm to do? Remember that comments and CDATA sections do not nest; the worst I can figure out off hand is that a comment or CDATA section begins in chunk 1 and doesn’t end until the last chunk.

How many chunks do you want?

The paper says fewer chunks are better than many chunks (to reduce post-processing costs), and that you want at least as many chunks as there are threads (to ensure that all cores can be busy). To simplify the examples I’ve been thinking about, I’ve been imagining that if I have eight threads, I’ll make eight chunks.

But if I’ve read the performance data and charts right, the biggest single reason the Horatian parser is not getting an eight-fold speedup when using eight threads is the need to reparse some chunks, owing to bad guesses about parse context made during the first parse. If we have eight threads and eight chunks, everything is fine for the first pass over the chunks. But if we need to reparse two of the chunks, then it rather looks as if six threads might be sitting idle waiting for the re-parsing to finish.

I wonder: would you get better results if you had shorter chunks, and more of them, to keep more threads busy longer? What you want is enough chunks to ensure that while you are reparsing some chunks, you still have other chunks for the other threads to parse.

As a first approximation, imagine that we have eight threads. Instead of eight chunks, we make fourteen chunks, and give the first eight of them to the eight threads. Let’s say two of them need to be reparsed; the reparsing goes on at the same time that the remaining six threads parse the remaining six chunks. The minimal path through the speculative parsing step remains the time it takes to parse two chunks, but the chunks are somewhat smaller now. The only question is how much additional time the post-processing step will now take, given that it has fourteen and not eight chunks to knit together.

And of course you need to bear in mind that if one chunk in four turns out to need re-parsing, then three or four out of the fourteen chunks are going to need reparsing, not just two. By the time you factor that in, and try to ensure that your last round of parsing doesn’t generate any new re-parse requests, things have gotten more complicated than I can conveniently deal with here (or elsewhere).

Maybe that’s why the Intel paper was so non-committal about how to choose the number of chunks to make in the first place: it can get pretty complicated pretty fast.
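
Still, even a crude model shows which way the tradeoff points. The sketch below is mine, with invented numbers; it charges nothing for the dependency chains between re-parses or for the post-processing step, so it is at best an optimistic bound.

    import math

    def wall_time(doc_size, chunks, threads, reparse_rate):
        """Very rough estimate of elapsed parse time, ignoring dependencies:
        ceil(total chunk-parses / threads) rounds, each costing one chunk."""
        chunk_size = doc_size / chunks
        total_parses = chunks + math.ceil(chunks * reparse_rate)
        rounds = math.ceil(total_parses / threads)
        return rounds * chunk_size

    print(wall_time(1.0, 8, 8, 0.25))    # eight chunks on eight threads:    0.25
    print(wall_time(1.0, 14, 8, 0.25))   # fourteen chunks on eight threads: ~0.21

On these made-up numbers the fourteen-chunk split comes out slightly ahead, but the saving is exactly the sort of thing the uncounted post-processing cost could eat up.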

Optimization and context independence in schema languages

One of the things that intrigues me about these results is that so much of what people have said needs to be done to schema languages to ensure that validation can be fast has nothing much to do with the speed gains shown by optimistic concurrency.

I thought for a while that this work did benefit from the fact that elements can be validated against XSD types without knowledge of their context (no reference to ancestors or siblings in any assertions, for example), but on reflection I’m not sure this is true: in order to find the right element declaration and type definition to bind an instance element, you need to know (a) the expanded name of the element (which means knowing the in-scope namespaces, which in practice means having looked at all of the ancestors of the element), and (b) the type assigned to the element’s parent (unless this element is itself the validation root). Once you have a type, it’s true that validation is independent of context. But the assignment of a type to an element or attribute does depend, in the normal case, on the context. It’s not clear to me that allowing upward-pointing XPath expressions in assertions or conditional type assignment would make much difference.

To really exploit parallelism in validation, it would seem you want to eliminate the variable binding of expanded names to element declarations and to types.

Back to DTDs plus datatypes, anyone?