Two styles of Monte Carlo testing for grammars

[8 August 2008]

In an earlier post I talked about generating test cases for grammars, and asked a number of questions. I’ve been meaning to get back to the topic; this afternoon I revisited some work I did with the XSD 1.1 grammar for regular expressions, and so the topic is again on my mind.

Pseudo-random strings

In the past, I’ve used Monte Carlo test generation to test grammars, in particular to find strings that distinguish between grammars. For example, given two grammars for the lexical space of a particular datatype, can we find strings that are accepted by one grammar and rejected by the other?
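
A minimal sketch of that kind of differential test, in Python. Here accepts_g1 and accepts_g2 are hypothetical stand-ins for whatever parsers or regular-expression engines implement the two grammars being compared:

import random

def random_string(length=20):
    # Printable ASCII runs from 32 (space) to 126 ('~').
    return ''.join(chr(random.randint(32, 126)) for _ in range(length))

def distinguishing_cases(accepts_g1, accepts_g2, trials=10000):
    # Collect the strings on which the two grammars disagree.
    return [s for s in (random_string() for _ in range(trials))
            if accepts_g1(s) != accepts_g2(s)]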

But random strings can also be used just to understand a given grammar better.

At one point in our work on XSD 1.1, for example, I proposed that in order to have a looser coupling between XSD and the RFCs that define URIs, we should just accept any string at all as a type-valid URI, much the same way that HTML browsers ‘accept’ any string as the value of an href attribute. They may not understand it, but they don’t raise an error message. The rationale for my proposal was that in any case, the generic grammars defined by RFC 3986 and its predecessors are so vague, so general, and so forgiving, that it wasn’t really very clear that they outlawed very much at all. I’ve heard people who know URIs well say that really the only strings that grammar won’t accept are strings with two hash signs. Others, equally well informed, say that those are legal, too.

The scheme-specific grammars (e.g. the grammar for http: URLs) are much more useful in catching errors, I argued. But we don’t require that they be checked, and don’t really want to make them part of the definition of type validity for the anyURI type. (I think a good validator should check anyURI values against the scheme-specific grammars and issue warnings if there are problems, but I don’t know any that do that.) So, I argued, the XSD spec was giving the illusion of useful value checking, but not the reality. Others in the working group maintained that the generic grammar does perform some useful checks, and wanted to retain it.

Generating ten thousand or so strings of twenty or forty characters each, with each character a randomly chosen character in the printable range of the ASCII character set, persuaded me that I was wrong. Even if the only useful thing the generic URI grammar does is eliminate strings containing forbidden characters, it rejects well over half the random strings I generated.
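
For anyone who wants to repeat the experiment in rough form, here is a small Python sketch along the same lines. It checks only the character repertoire RFC 3986 allows (unreserved characters, gen-delims, sub-delims, and the percent sign used in percent-encoding), not the full generic syntax, so it understates what the grammar rejects:

import random

# Characters RFC 3986 allows in URIs: unreserved, gen-delims, sub-delims, '%'.
ALLOWED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~"            # unreserved punctuation
    ":/?#[]@"         # gen-delims
    "!$&'()*+,;="     # sub-delims
    "%"               # introduces percent-encoded octets
)

def random_ascii_string(length=20):
    # Printable ASCII runs from 32 (space) to 126 ('~').
    return ''.join(chr(random.randint(32, 126)) for _ in range(length))

rejected = sum(
    1
    for _ in range(10000)
    if any(c not in ALLOWED for c in random_ascii_string())
)
print(rejected, "of 10000 strings contain characters the URI grammar forbids")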

So I abandoned my proposal, and in XSD 1.1, type validity for anyURI does involve checking the literal against the grammar given in the RFC.

More interesting pseudo-random strings

One problem, for a human looking at the data, is that if you generate random strings as I did, by generating twenty random numbers between 32 and 127 and concatenating the corresponding characters, you produce an awful lot of uninteresting test cases, and you can go a long, long time before you actually exercise a specific part of your grammar.

So when I faced the task of generating test cases for XSD’s regular-expression grammar, I wanted some method that would generate more interesting test cases, test cases more likely to exercise all the parts of the grammar. In particular, I wanted to make sure we got good coverage of the more obscure corners of the character-class constructs. These have relatively elaborate substructures and the likelihood of generating a complex character-class expression using randomly chosen characters was way too low.

To make a long story short, I think I found a way. This is what I did.

  1. I built a set of strings in the following way.
    1. For each non-terminal in the grammar, I generated one or more strings that matched that non-terminal, and added them to the set.
    2. For each literal string mentioned in the grammar, I included that literal string in the set. For example, the first production rule in the regex grammar is
      regExp ::= branch ( '|' branch )* 

      So I added “|” to the set of strings.

    3. For each class of terminal symbol, I included one or two examples in the set of strings.

    For the regex grammar, I ended up with a set of a hundred or so strings, from “a” to “(” to “[\p{IsBasicLatin}-[aeiou]]”.

  2. I put the set of strings into an array, in no particular order.
  3. Then, I generated sequences of five or ten random numbers between 0 and the size of my set of strings. For example, “92, 21, 48, 6, 41, 69, 47”.
  4. Then I used each number to look up the corresponding string in the array, and concatenated the results. For the random numbers just given, this produced the strings “}”, “a”, “2”, “)”, “|xyz”, “|”, “[\p{IsBasicLatin}-[aeiou]]”, and “(”.
    Together, these yield the test string “}a2)|xyz|[\p{IsBasicLatin}-[aeiou]](”. (A small sketch of this procedure follows the list.)
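
A small Python sketch of this procedure, with a handful of fragments drawn from the examples above standing in for the full set of a hundred or so:

import random

# Hand-collected fragments: strings matching non-terminals, literal strings
# from the productions, and examples of the terminal classes.
FRAGMENTS = [
    "a", "(", ")", "|", "}", "2", "|xyz",
    "[\\p{IsBasicLatin}-[aeiou]]",
]

def random_test_case(min_parts=5, max_parts=10):
    # Pick five to ten fragments at random and concatenate them.
    n = random.randint(min_parts, max_parts)
    return ''.join(random.choice(FRAGMENTS) for _ in range(n))

for _ in range(5):
    print(random_test_case())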

The test cases generated by this modified method (random selection of substrings from a set of strings known to be grammatically interesting) exercised the grammar far better than test cases generated by the random-character method.

If I were a better mathematician, I might be able to characterize the difference formally. But informally, one can think of the production rules of any grammar as a set of regular expressions over an alphabet consisting of both the terminal and the non-terminal symbols of the grammar, not just of characters. In any normal grammar, for most non-terminals some or most random character sequences will fail to match the non-terminal. Test cases generated by selecting characters at random thus have the same effect as selecting items randomly from the terminal + non-terminal vocabulary, but with a bias (typically a very strong bias) against selecting any non-terminal symbol. By building a set of substrings as described, the second Monte Carlo method evens the odds a bit, and you get sequences that are much more likely to match the regular expressions over the terminal + non-terminal vocabulary, or to come very close to matching them. And that combination (matching all the productions, and coming close to matching each of the productions, but missing) is as close as I can come to characterizing the set of ‘interesting’ test cases.

Certainly, it was not necessary to generate tens of thousands of test cases in order to find examples that were valid against one form of the grammar, and invalid against a different form of the grammar.

If anyone wants to look at this quick-and-dirty test generator, the code is at http://www.w3.org/XML/2008/03/xsdl-regex/testgenerator1.pl; the parser for the grammars I was testing is in the other files of the same directory.

Thinking about schema mappings

[3 August 2008]

At the Digital Humanities conference in Finland in June, two papers made me think about a problem that has worried me off and on for a long time, ever since Mark Olsen at the ARTFL Project at the University of Chicago asked how he was supposed to provide searches across a large collection of documents, if all the documents were marked up differently.

Mark’s solution was simple, Procrustean, and effective: if I understood things correctly and remember aright, he translated everything into a single common vocabulary, which in the nature of things was a sort of lowest common denominator of text structure.

Stephen Ramsay and Brian Pytlik Zillig spoke about “Text analytics: a TEI format for cross-collection text analysis”, in which they described an approach similar to Mark’s in spirit, but crucially different in details. That is, like Mark, they translate into a single common system of markup, so that the collection they are searching uses consistent ways of signaling textual features. Along the way, they will throw away information they believe to be of no interest for the kind of text analysis their tool is to support. The next day, Fotis Jannidis and Thorsten Vitt gave a paper on “Markup in Textgrid”, which also touched on the problem of providing a homogeneous interface to a heterogeneous collection of documents; if I understood them correctly, they didn’t want to throw away information, but were planning simply to store both the original and a modified (homogenized) form of the data. In the discussion period, we talked briefly about the relative merits of translating the heterogeneous material into a common format and of leaving it in its original formats.

The translation into a common format frequently involves loss of some information. For example, if not every document in the collection has been encoded in such a way as to mark all line-end hyphens according to the recommendations of the MLA’s Committee on Scholarly Editions, then it may be better to strip that information out rather than expose it and risk allowing the user to conclude that the other documents were printed originally without any line-end hyphens at all (after all, the query shows no line-end hyphens in those documents!). But that, in turn, means that you’d better be careful if you expect the work performed through the common interface to produce results which may lead to someone wanting to enrich the markup in the documents. If you’ve stripped out information from the original encoding, and now you enrich your stripped copy, later users are unlikely to thank you when they find themselves trying to re-unify the information you’ve added and the information you stripped out.

It would be nice to have a way to present heterogeneous collections through an interface that allows them to look homogeneous, without actually having to lose the details of the original markup.

It has become clear to me that this problem is closely related to problems of interest in relational databases and in RDF queries. (And probably in other areas where people worry about query languages, too, but if Topic Maps people have talked about this in my hearing, they did so without my understanding that they were also addressing this same problem.)

“Ah,” said Enrique. “They used the muffliato spell on you, did they?” “Hush,” I said.

Database people are interested in this problem in a variety of contexts. Perhaps they are performing a federated search and the common schema in terms of which the query is formulated doesn’t match the actual schemas in which the data are stored and exposed by the database management systems. Perhaps it’s not a federated query, but there are other reasons we (a) want to query the data in terms of a schema that doesn’t match the ‘native’ schema, and (b) don’t want to transform the storage from the native schema into the query schema. My colleague Eric Prud’hommeaux has been working on a similar problem in the context of RDF. And of course, as I say, it’s been on the minds of markup people for a while; I’ve just found a paper that Nancy Ide and I wrote for the ASIS 97 conference in which we tried to stagger towards a better understanding of the problem. I have the sense that I understand the problem better now than I did then, but I could be wrong.

Two basic techniques seem to be possible, if you have a body of data in one vocabulary (let’s call it the “source vocabulary”) and would like to be able to query it using terms from a different vocabulary (the “target vocabulary”). Both assume that it’s possible to map information from the source vocabulary to the target vocabulary.

The first technique is Mark Olsen’s: you have or develop a mapping to go from the source vocabulary to the target vocabulary; you apply that mapping. You now have data in the target vocabulary, and you can query it in the usual way. Done. I believe this is what database people call “materializing the view”.

The second technique took me a while to get my head around. Again, we start from a mapping from the source vocabulary to the target vocabulary, and a query using the target vocabulary. The technique has several steps.

  1. Invert the mapping, so it maps from the target vocabulary to the source vocabulary. (Call the result “the inverse mapping”.)
  2. Apply the inverse mapping to the query, to produce a semantically equivalent query expressed in terms of the source vocabulary. (Since the query is not itself a relational database, or an RDF graph, or an XML document, there’s a certain sleight-of-hand going on here: even if you have successfully inverted the mapping, it will take some legerdemain to apply it to a query instead of to data. But just how hard or easy that is will depend a lot on the nature of the query and the nature of the mapping rules. One of the reasons for this klog post is that I want to be able to set up this context, so I can usefully think aloud about the implications for query languages and mapping rules.)
  3. Apply the source-vocabulary query to the source-vocabulary data. Simple, right? Well, no, not simple, but at least it’s a well known problem.
  4. Take the results of your query, and apply the original source-to-target mapping to them, to produce results expressed in / marked up in the target vocabulary. (A toy sketch of these four steps follows this list.)
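
To make the four steps concrete, here is a deliberately trivial Python sketch. The “documents” are lists of (element-name, content) pairs, the mapping is a mere renaming of element types, a “query” is just an element name in the target vocabulary, and all the names and data are invented for illustration; real mappings and real query languages are where the hard problems live.

# Toy source-vocabulary data: (element-name, content) pairs.
source_data = [("head", "On Liberty"), ("p", "The subject of this Essay...")]

# Mapping from the source vocabulary to the target vocabulary.
source_to_target = {"head": "title", "p": "para"}

# Step 1: invert the mapping (trivial here, because the renaming is one-to-one).
target_to_source = {t: s for s, t in source_to_target.items()}

def answer(target_query):
    # Step 2: rewrite the target-vocabulary query in source-vocabulary terms.
    source_query = target_to_source[target_query]
    # Step 3: run the rewritten query against the source-vocabulary data.
    hits = [(name, text) for name, text in source_data if name == source_query]
    # Step 4: map the results back into the target vocabulary.
    return [(source_to_target[name], text) for name, text in hits]

print(answer("title"))   # [('title', 'On Liberty')]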

Eric Prud’hommeaux may have been surprised, when he brought this topic up the other day, at the speed with which I told him that the key rule which any application of the second technique must obey is a principle I first learned in a course on language pedagogy, years ago in graduate school. (If so, he hid it well.)

The unit of translation is the utterance, not the word.

Everything else follows from this, so let me say it again. The unit of translation is the utterance, not the word. And almost every account of ‘semantic mapping’ systems I have heard in the last fifteen years goes wrong because it assumes the contrary. So let me say it a third time. The specific implications of this may vary from system to system, and need some unpacking I’m not prepared to do this afternoon, but the basic principle remains what I learned from Gertrude Mahrholz thirty years ago:

The unit of translation is the utterance, not the word.

More on this later. In the meantime, think about that.

Allowing case-insensitive language tags in XSD

[30 July 2008]

I found a note to myself today, while going through some old papers, reminding me to write up an idea the i18n Working Group and I had when we were discussing the problem of case (in)sensitivity and language tags, some time ago. Here it is, for the record.

The discussion of the language datatype in XSD 1.1 includes a note reading:

Note: [BCP 47] specifies that language codes “are to be treated as case insensitive; there exist conventions for capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.” Since the language datatype is derived from string, it inherits from string a one-to-one mapping from lexical representations to values. The literals ‘MN’ and ‘mn’ (for Mongolian) therefore correspond to distinct values and have distinct canonical forms. Users of this specification should be aware of this fact, the consequence of which is that the case-insensitive treatment of language values prescribed by [BCP 47] does not follow from the definition of this datatype given here; applications which require case-insensitivity should make appropriate adjustments.

The same is true of XSD 1.0, even if it doesn’t point out the problem as clearly.

As can be imagined, there have been requests that XSD 1.1 define a new variant of xsd:string which is case-insensitive, to allow language tags to be specified properly as a subtype of case-insensitive string and not as a subtype of the existing string type. Or perhaps language tags need to be a primitive type (as John Cowan has argued) instead of a subtype of string. The former opens a large can of worms, the same one that led XML 1.0 to be case-sensitive in the first place. (If you haven’t tried to work out a really good locale-insensitive internationalized rule for case folding, try it sometime when your life is too placid and simple and all your problems are too tractable; if the difference between metropolitan French and Quebecois doesn’t make you give up, remember to explain how you’re going to handle Turkish and English in the same rule.) The latter doesn’t open that can of worms (for the restricted character set allowed in language tags, case folding is well behaved and well defined), but it does open others. I’ve talked about language codes as a subtype of string before, so I won’t repeat it here.

In some cases the case-sensitivity of xsd:language is not a serious problem: we can write our schema to enforce the usual case conventions, by which language tags should be lower-case and country codes should be uppercase, and we can specify that a particular element should have a language tag of “en”, “fr”, “ja”, or “de” by using an enumeration in the usual way:

<xsd:simpleType name="langtag">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:enumeration value="en"/>
    <xsd:enumeration value="fr"/>
    <xsd:enumeration value="ja"/>
    <xsd:enumeration value="de"/>
  </xsd:restriction>
</xsd:simpleType>

This datatype will accept any of the four language codes indicated, but only if they are written in lower case.

But what if we want a more liberal schema, which allows case-insensitive language tags? We want to accept not just “en” but also “EN”, “En”, and (because we are determined to do the thing properly) even “eN”.

We could add those to the enumeration: for each language, specify all four possible forms. No one seems to like this idea: it makes the declaration four times as big but much less clear. When I suggested it to the i18n WG, they just groaned and François Yergeau looked at me as if I had emitted an indelicate noise he didn’t want to call attention to.

We were all happier when a different idea occurred to us. First note that the datatype definition given above can easily be reformulated using a pattern facet instead of an enumeration:

<xsd:simpleType name="langtag2">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="en|fr|ja|de"/>
  </xsd:restriction>
</xsd:simpleType>

This definition can be adjusted to make it case-insensitive in a relatively straightforward way:

<xsd:simpleType name="langtag3">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern value="[eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE]"/>
  </xsd:restriction>
</xsd:simpleType>

Voilà, case-insensitive language tags. The pattern is not quite four times larger than the old pattern, but the declaration is still smaller than the first one using enumerations.

A side benefit of using the pattern instead of the enumeration is that it’s easier to allow for subtags (so we can accept “en-US” and “en-UK”, etc., as well as just “en”) by expanding on the pattern:

<xsd:simpleType name="langtag4">
  <xsd:annotation>
    <xsd:documentation>
      <p xmlns="http://www.w3.org/1999/xhtml">The <code>langtag</code>
      type lists the language codes that Amalgamated Interkluge
      accepts:  English (en), French (fr), Japanese (ja), or
      German (de).</p>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:language">
    <xsd:pattern
      value="([eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE])(-[a-zA-Z0-9]{1,8})*"/>
  </xsd:restriction>
</xsd:simpleType>

In a perfect system, there would be some way to signal that the four upper-, lower-, and mixed-case forms of “en” all mean the same thing and map to the same value. This technique does not provide that. But then, I don’t know any good way to provide it. (I do know ways to provide it, just not any ones I think are good.) In an imperfect world, if I want case-insensitive language tags, I suppose I should be happy that I can find a way to define them without much inconvenience. And that, this technique provides.
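
If the list of accepted tags were longer, the case-insensitive pattern would be tedious to write by hand, but it is easy to generate. A short Python sketch of such a generator (the function is mine, not part of any XSD tooling):

def case_insensitive_pattern(tags):
    # Turn each letter into a two-character class, e.g. 'en' -> '[eE][nN]',
    # and allow optional subtags of one to eight alphanumeric characters.
    def fold(tag):
        return ''.join('[%s%s]' % (c.lower(), c.upper()) if c.isalpha() else c
                       for c in tag)
    alternatives = '|'.join(fold(t) for t in tags)
    return '(%s)(-[a-zA-Z0-9]{1,8})*' % alternatives

print(case_insensitive_pattern(["en", "fr", "ja", "de"]))
# ([eE][nN]|[fF][rR]|[jJ][aA]|[dD][eE])(-[a-zA-Z0-9]{1,8})*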

Namespace documents (kudos to XHTML)

[28 July 2008]

Lately I’ve had occasion to spend some time dereferencing namespaces and looking at what you get when you do so. If, for example, you have encountered some qualified name and want to know what it might mean, the “follow-your-nose” principle says you should be able to find out by dereferencing the namespace name. (The follow-your-nose principle was introduced to me under that name by Dan Connolly, but I think he’d prefer to think of it as a general principle of Web architecture rather than as an invention of his own. And indeed the Architecture of the World Wide Web, as documented by the W3C’s Technical Architecture Group, explicitly recommends that namespace documents be provided for all namespaces.)

The upshot of my recent examinations is that for some namespaces, even otherwise exemplary applications and demos fail to provide namespace documents. For others, the only namespace document is a machine-readable document (e.g. an OWL ontology) without any human-comprehensible documentation of what the terms in the namespace are intended to mean; for still others, there is useful human-readable description (sometimes only in a comment, but it’s there) if you can find it. And for a few, there is something approaching a document intended to be accessible to a human reader.

So far, however, the best namespace document I’ve seen recently is the one produced by the XHTML Working Group for the namespace http://www.w3.org/1999/xhtml/vocab — human-readable, and reasonably clear. Not perfect (no document date? no description of whether the vocabulary is subject to change?) but far, far, better than average.

Kudos to the XHTML Working Group!


RDF and Wittgenstein

[27 July 2008]

The more I think about RDF’s goal of providing a monotonic semantics for RDF graphs (under pressure in part from my colleague Thomas Roessler, to whom thanks), the more the RDF triple seems to be an attempt to operationalize Wittgenstein’s notion of atomic fact, with all the advantages and disadvantages that that entails. Is this an insight, or just a blazing truism? Or false?

Interesting that this possibility seems to run counter to Steve Pepper’s remark “RDF/OWL is to Aristotle as Topic Maps is to Wittgenstein.” Perhaps SP has the Wittgenstein of the Untersuchungen in mind?