Simple proof that the URI grammar of RFC 3986 defines a regular language

[16 January 2009]

A while back, a colleague wrote to me to ask if I thought the language defined by the grammar for URIs in RFC 3986 is regular, or not. I don’t know for sure why he wonders; I think he is contemplating trying to reduce it to a regular expression.

If the language is regular, of course, then part of the rationale I gave John Cowan for making anyURI primitive, last January, falls apart. I wrote:

But that language [the language defined by the grammar of RFC 3986] can be defined for a derived type only if it’s regular. Is that language regular? I haven’t looked at it for a while, so I’m not really sure. At the very least, it’s not obvious that it’s regular. And it is obvious that reducing the ABNF of the RFC into a regular expression would be error prone and unlikely to produce a perspicuous result.

My colleague’s email poses the question anew. Is the language in fact regular? This morning a simple method of checking occurred to me, and I spent an hour or so today verifying that the language is in fact regular.

First, I made a set of Prolog facts relating the non-terminals of the grammar; the fact refers(X,Y) is asserted if and only if there is a production rule in the grammar with X on the left-hand side and Y somewhere on the right-hand side. My idea was to load the set of facts into SWI Prolog, use the handy GraphViewer tool (which ships with SWI Prolog as a demo) to draw the graph of the refers relation, and inspect the graph to see if it is cyclic. That turned out to be more awkward than I had expected (the graph is not that complicated, but too complicated to allow me to look at it and find a cycle immediately, or pronounce it acyclic with confidence).
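By way of illustration, the fact base contains entries of this sort (hand-derived here from the RFC’s ABNF; atoms containing upper-case letters or hyphens are quoted so that Prolog does not misread them):

refers('URI', scheme).
refers('URI', 'hier-part').
refers('URI', query).
refers('URI', fragment).
refers(scheme, 'ALPHA').
refers(scheme, 'DIGIT').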

But there turned out to be a simple alternative. This is what I did, after consulting my set of facts.

setof(V,X^(refers(V,X);refers(X,V)),Vs),
setof(L-R,refers(L,R),Es),
vertices_edges_to_ugraph(Vs,Es,G),
transitive_closure(G,Gc),
member(LHS-RHS,Gc),
member(LHS,RHS).

There were no solutions; from this I infer that the language of the grammar is regular. Let’s take it again, from the top, in a bit more detail.

The line

setof(V,X^(refers(V,X);refers(X,V)),Vs),

makes a set Vs of all terms in either argument position for refers. This is the set of vertices in the non-terminal reachability graph.
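For illustration (hand-abridged, not actual program output), the result has roughly this form, with the quoted upper-case atoms sorting first in the standard order of terms:

Vs = ['ALPHA', 'DIGIT', 'HEXDIG', 'URI', ..., authority, fragment, host, ..., scheme, segment, ...]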

The line

setof(L-R,refers(L,R),Es),

similarly makes a set of expressions of the form L-R for terms linked by the refers relation. For example, since the ABNF in the RFC includes the rule

URI    = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

the set Es (edges) will contain 'URI'-scheme, 'URI'-'hier-part', 'URI'-query, and 'URI'-fragment. Prolog quotes URI here to avoid having it taken as a variable.

The line

vertices_edges_to_ugraph(Vs,Es,G),

uses an SWI utility to turn the lists Vs and Es of vertices and edges into an unweighted graph G, in the form used by the ugraph library (written by Richard O’Keefe and Vitor Santos Costa, to whom thanks!) that ships with SWI Prolog.

G has the form of a list of pairs X-[Y1,Y2,…], where there are edges from X to Y1, Y2, … in the graph.

In grammatical terms, the graph G has an edge from X to Y if and only if Y can be an immediate constituent of X (i.e. there is a grammar rule of the form X = ... Y ...).
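For the URI rule quoted above, for example, the entry for 'URI' in G would be (again hand-constructed for illustration):

'URI'-[fragment, 'hier-part', query, scheme]

that is, the vertex 'URI' paired with the list of symbols it refers to directly, sorted into standard order.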

The line

transitive_closure(G,Gc),

takes the transitive closure of graph G and assigns it to variable Gc. In graph terms, Gc has an edge from X to Y iff Y is reachable from X in graph G. In grammatical terms, Gc has an edge from X to Y if a Y can appear anywhere in an X, at any level. An edge from any symbol S to itself in Gc indicates that that symbol is recursive in the original grammar: in expanding S, we may find new occurrences of S appearing in our sentential form.
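To see what a solution would have looked like, consider a deliberately recursive toy grammar (my example, not part of the URI grammar), say expr = "(" expr ")" / "a". Its fact base would contain refers(expr, expr), and the goal below would then find the self-loop at once; SWI Prolog answers with something like:

?- vertices_edges_to_ugraph([expr], [expr-expr], G),
   transitive_closure(G, Gc),
   member(LHS-RHS, Gc),
   member(LHS, RHS).
G = Gc, Gc = [expr-[expr]],
LHS = expr, RHS = [expr].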

The final lines

member(LHS-RHS,Gc),
member(LHS,RHS).

seek a left-hand side LHS in Gc which has an edge pointing to itself, which would indicate that in the grammar, non-terminal LHS is reachable from non-terminal LHS — or, in other words, that the grammar for LHS is recursive.

Since there are no solutions to the Prolog goal, we know that the grammar is not recursive. If the grammar is not recursive, then the language it defines is clearly regular.

Q.E.D.

It occurs to me to wonder: how do I know that a non-recursive context-free grammar necessarily defines a regular language? I think I learned it from Niklaus Wirth’s book Grundlagen und Techniken des Compilerbaus (Bonn: Addison-Wesley, 1996), also in English as Compiler Construction (Harlow, England: Addison-Wesley, 1996). He writes:

Eine Sprache ist regulär, wenn sich ihre Syntax durch eine einzige EBNF-Regel ohne Rekursion ausdrücken läßt.

Or:

A language is regular, if its syntax can be expressed by a single EBNF rule without recursion.

(It would be interesting to try to prove this statement from first principles, but this blog post is already too long.)

In the general case, the presence of recursion in a grammar does not prove that the grammar is not regular (some recursions in BNF can be expressed in EBNF without recursion). But the absence of recursion in a CFG does prove that the language is regular.
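Here is a minimal sketch of why non-recursive grammars reduce to single expressions, using a hypothetical three-rule toy grammar (the rule/2 facts and the expand/2 predicate are mine, purely illustrative, and not part of the exercise above). Because nothing is recursive, replacing each non-terminal by its body must bottom out in terminals:

% Hypothetical toy grammar: each body is a list of literals
% and non-terminals.
rule(uri,    [scheme, lit(':'), path]).
rule(scheme, [lit(http)]).
rule(path,   [lit('/'), lit(abc)]).

% expand(+Symbol, -Atom): replace a symbol by the terminal string
% it derives; terminates because the grammar has no recursion.
expand(lit(A), A).
expand(NT, Expansion) :-
    rule(NT, Body),
    maplist(expand, Body, Parts),
    atomic_list_concat(Parts, Expansion).

% ?- expand(uri, E).
% E = 'http:/abc'.

Alternation and repetition in the rule bodies would map onto | and * in the resulting regular expression in the obvious way; the point is only that the substitution process terminates.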

So it really should be possible to generate an XSD regular expression for legal URIs (and possibly for legal IRIs). Stay tuned.

Having worked this out, I asked my colleague, Dan Connolly, what line of reasoning he had followed in answering the question. “Well, I just tried the construction, and it worked.” He has a Web page with Javascript that performs the translation from the ABNF of the RFC into Javascript regexes, and allows the user to test strings to see if they match that grammar. If you are interested in this kind of question, you may find that page fun to play with.

Baby, you should fix your site

[5 January 2009]

Maybe it was an hallucination, or just an after-effect of seeing the article on T. V. Raman in the Sunday New York Times yesterday. But my evil twin Enrique showed up this evening.

“I’ve written a song,” he said. “You have, have you?” I said, guardedly. “Want to hear it?” he asked. “I doubt it.” But there was no stopping him.

[Before you read this, you may need to be reminded that “WAI” (pronounced “way”) is the W3C’s Web Accessibility Initiative. And “section 508” refers to the section of the U.S. legal code that requires at least some Web sites to be accessible to the disabled. There are analogous requirements in many civilized countries, but the girl in this song appears to be an American.]

I asked a girl what she wanted to do,
And she said “Baby, surf the Web, just like you.
I got some software, it’s readin’ my screen,
but your damn page breaks the whole machine!

Chorus:
“Baby, you should fix your site!
WAI can help you make it right.
Baby, if you fix your site,
Then maybe I’ll love you.”

I told that girl that my pages were good.
And she said “Listen, babe, they don’t do what they should.
You know, there’s rules, section 508
Fix that page, if you want a date!

Chorus:
“Baby, you should fix your site!
WAI can help you make it right.
Baby, if you fix your site,
Then maybe I’ll love you.”

[Instrumental.]

Chorus:
“Baby, you should fix your site!
WAI can help you make it right.
Baby, if you fix your site,
Then maybe I’ll love you.”

I asked that girl how to start right away,
And she said, “Baby, do Level A.
You won’t stop there, not if you are smart.
But Level A, baby, now that’s a start!

Chorus:
“Baby, you can make it right.
WAI can help you fix your site!
Baby, if you see the light,
Then, baby, I’ll love you!”

I hate to encourage Enrique in his song-writing career, and especially I hate to encourage further songs using Beatles tunes (if you didn’t recognize “Baby, you can drive my car” here, then you are a lot younger than I am), but well, you know, that girl has got a point.

If your site is not accessible, you really should fix it.

My management asks me to point out that we do not warrant that making your Web site more accessible will actually get you more dates. But it’s still the right thing to do.

Spam Karma 2, again

[2 January 2009]

As noted earlier, I hesitated to try to install Spam Karma 2 in the new location of this blog, since I wasn’t sure it would work with WordPress 2.6.5.

But after a couple of weeks with only Bad Behavior minding the fort, I became desperate. (BB does filter out a lot — my only problem was that almost everything it let through for me to moderate was in fact spam. I have better things to do than get ten or twenty messages in a day asking me to moderate comment spam.)

So I’ve installed Spam Karma 2 again, and so far it appears to work well with WordPress 2.6.5.

And my life is a bit quieter. (Actually, it would be quieter even without SK2, since the day with twenty comment spams appears to have been a fluke; SK2 hasn’t been catching twenty a day since its installation. But the next time the bots find me, SK2 will be in place.)

An ineffable moment

[2008-12-31+07:00 / 2009-01-01Z]

If we add it all up, the XML Schema Working Group has spent a lot of time, since the beginning of the group, worrying about leap seconds.

XSD 1.0 attempts to accommodate them in its descriptions of the date/time types, but leaves some aspects of behavior unspecified. Accordingly, implementations of XSD 1.0 vary wildly in how carefully they handle leap seconds; not every implementation comes with built-in knowledge of when leap seconds have occurred in the past, and not every implementation enforces the rules which specify that, when they occur in the future, leap seconds will occur only at midnight, UTC, at the end of a month. You can read some things on the Web that seem to imply leap seconds can only occur at the end of June or December, or possibly also March and September, but that’s not what the relevant spec says. The quarter days are to be preferred, but in principle a leap second could occur at the end of any month. You can also read things on the Web that suggest many people are uncertain whether the responsible authorities could in principle insert (or delete) two leap seconds at a time, or even more. I was unsure myself until, some time after I read the relevant specification, it finally dawned on me that the answer is no.

[“The relevant spec?” asked Enrique. “Which is that?” “Well, if you want to be precise, it’s Recommendation ITU-R TF.460-6: Standard-frequency and time-signal emissions, published by the International Telecommunication Union (Geneva: ITU, February 2002). I don’t remember how I managed to acquire a copy.” “And how can you be sure there can never be adjacent leap seconds, if it doesn’t say that flat out?” “What it says is that leap seconds are to be added at the end of some month, UTC, in order to keep UTC within 0.8 seconds of the appropriate solar time measure (maybe UT1 or UT2, but it’s been a while and I forget the details). If after adding two leap seconds, UTC is within 0.8 seconds of solar time, then before they were added it must have been more than 0.8 seconds off. But it’s not allowed to be more than 0.8 seconds off — that’s the point of adding or deleting leap seconds.” “What if there was a large change between January and June?” insisted Enrique. “Then the spec implies that a leap second should be added between January and June. The spec does not limit the insertion of leap seconds to December 31 and June 30, it just says to prefer those dates. I think the implication is pretty clear that if you need to add a leap second at the end of May, you are supposed to do so.”]

[“Yeah, but what if the world slowed down by two seconds in the course of a single month? Isn’t that logically possible?” “Logically possible, yeah, I guess so. But astronomically implausible. If the rotation of the earth starts to vary that much, it’s likely to be because a large asteroid just hit us, or something. Under those circumstances, schema-validity is likely to be the least of our worries.” “Well, my point exactly,” said Enrique. “If the world is falling apart, that’s the last time you want your systems to start failing because the schema validator doesn’t like your time stamps. There will be more important things to be worrying about!”]

In developing XSD 1.1, we spent a lot of time trying to nail things down better, but ultimately reached the conclusion that there just was no good way to allow all real leap seconds and only real leap seconds, to handle validation of dateTime values for the future, and to maintain the principle that a document’s schema-validity against a given schema is the same today and in the future; it should not change from day to day depending on decisions made by the managers of Coordinated Universal Time. In the end, we said that XSD 1.1 processors just don’t handle leap seconds at all: the moments in the global time-line which are occupied by leap seconds do not correspond to values in the xsd:dateTime value space.

It’s an important principle of schema design (and of the use of other formalisms as well, I think) that in the general case, what the formal notation can express may be only an approximation to the reality you are modeling. Some things may exist without being able to be spoken. Mostly we mean by that that specific rules that apply in a given context may not be expressible in a given formal notation, since the expressive power of the formalism may be hobbled in order to preserve its tractability. It’s nice, I think, that the principle is also instantiated by the dateTime type: there are some moments of UTC time that cannot be captured as values in the dateTime value space.

All this is on my mind, of course, because one of those moments is scheduled to occur today. At midnight UTC. Any moment, now, in fact.

[Pause.]

[“Shouldn’t it be midnight local time?” hissed Enrique. “No, you’re thinking of shifting to and from Daylight Savings Time. Leap seconds are inserted at the same moment all around the globe. Hush, now, don’t spoil it. Just wait and watch.”]

And there it went. Midnight UTC has passed, and the sequence of seconds shown by the applet at http://www.time.gov for Mountain Time went:

  • 16:59:56
  • 16:59:57
  • 16:59:58
  • 16:59:59
  • 16:59:60 [That’s it! That’s it!! “Hey, come look at this!” I wanted to call to my wife.]
  • 17:00:00 [“No, wait, never mind. It’s over already.”]
  • 17:00:01

All of these past weeks, as events in W3C and in the economy and in the world have gone from bad to worse, I’ve been waiting impatiently to shake the dust of this year from my feet, and yearning for 2009 and a new leaf. The new year will surely be hard in many ways, I tell myself, but it cannot be as bad as the year just ending. As far as I can tell, I am not alone; 2009 has a heavy freight of hope and expectations to carry. A heavier freight than it’s fair to ask any year to bear.

So I like the idea that between the old year, so widely and deservedly anathematized, and the new one which carries so much fragile hope, time paused, just for a second, to gather its forces before picking up its burden and marching forward again.

Happy New Year, o my readers. Happy New Year.

Moving a WordPress blog to a new domain

[12 December 2008]

Having just moved this blog from people.w3.org to cmsmcq.com, I think it might be useful (to others, or to me down the road) to record what I did in order to make the move relatively smooth.

I started out thinking that I would have to export all the data from MySQL on people.w3.org (using my handy backup routine), move the resulting mib.sql file to cmsmcq.com, and load it from the command line.

Reading the various documents about moving blogs from one site to another, or within a site, however, I discovered that that wasn’t what anyone recommended. Export from the old site, I read, and then import into the new site. WordPress has developed a handy export format that can be used conveniently for this purpose.

I tried it. Export worked fine, and I edited the resulting XML document to change all occurrences of “http://people.w3.org/~cmsmcq/blog” to “http://cmsmcq.com/mib”. Then I imported the file to the new WordPress installation. Several articles loaded successfully, but by no means all, and those that did load did not keep their original query strings in their URIs. Article http://people.w3.org/~cmsmcq/blog/?p=12 might appear as http://cmsmcq.com/mib/?p=3, not as ...?p=12. That’s a pain, because I’d like to redirect from the old locations to the new, so existing references to the blog don’t break. I know I can build a table containing all the URIs of everything in the blog, and map each to the appropriate URI on the new host, but I’d really rather not have to spend time on that.

I never did figure out why only part of the data was loading successfully; deleting spam from the site, and then re-exporting helped some (more of the posts loaded), but I never got everything to load.

So I reconsidered. I made a new SQL dump of the database on the old site, and edited it to change URIs from the old name to the new. (I also deleted the commands to load data into the user and options tables, since I didn’t want to overwrite them. I deleted the Spam Karma 2 tables, too, since my new host has a newer version of WordPress and the existing SK2, which is no longer maintained, may or may not work with it. I’ll install Bad Behavior instead.)

I tried to load this edited SQL dump to the new host by using the Web interface to MySQL provided by phpMyAdmin; it complained about a problem, and after I fixed that the process kept hanging.

So I split the file into smaller pieces, to evade any timeout and data-volume restrictions, and tried importing each in turn. Either the host choked, or my name service went away about this time; I think some of the smaller SQL files were successfully imported, but not all. Tried again the next day, and it hung again.

So I went back to Plan A: I copied the entire edited SQL file to the new host and loaded it in MySQL from the command line — took about five minutes (including the file transfer), if you don’t count the six or eight hours of time I burned trying to follow other people’s directions.

For the sake of keeping the old URIs stable, I then added a .htaccess file to the blog directory on people.w3.org (~cmsmcq/public_html/blog) to redirect from the old addresses to the new.

Concisely, what worked best for me was:

  1. Export the data from the old server. I did this with the command mysqldump --verbose --add-drop-table --all --extended-insert --quick --skip-lock-tables --user mysql-userid --password dbname > mib.sql, but it might have been better to export individual tables more selectively.
  2. Edit the mib.sql file, changing the old address (in this case “http://people.w3.org/~cmsmcq/blog”) to the new address (in this case “http://cmsmcq.com/mib”) wherever it occurs (it will occur primarily in cross references from one post to another). Some authorities also recommend doing a global search and replace on your old email address. I also took this opportunity to delete tables I didn’t want in the dump: wp_options, wp_usermeta, wp_users, and the tables used by Spam Karma 2 (RIP). And I modified the wp_ prefix in the table names to match the one provided by my hosting service’s auto-install of WordPress. (A sed sketch for this step appears after this list.)
  3. Copy the edited SQL dump file (in my case named mib.edited.sql) to the new host.
  4. Invoke MySQL from the command line in the obvious way: mysql -h hostname -u username -p dbname < mib.edited.sql
  5. On the original host, add a .htaccess file to the blog directory (here “~cmsmcq/public_html/blog”) including
    RedirectMatch permanent ^/~cmsmcq/blog/(.*)$ http://cmsmcq.com/mib/$1
    RedirectMatch permanent ^/~cmsmcq/blog$ http://cmsmcq.com/mib
    
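For step 2, something like this sed invocation does the bulk of the mechanical editing (a sketch, not the exact command I used; the email addresses are placeholders):

# Rewrite the blog’s base URI (and, if you like, your email
# address) throughout the dump.
sed -e 's|http://people.w3.org/~cmsmcq/blog|http://cmsmcq.com/mib|g' \
    -e 's|old-address@example.org|new-address@example.org|g' \
    mib.sql > mib.edited.sql

Deleting the unwanted tables and changing the wp_ prefix still call for hand editing (or further scripting) on the result.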

No WordPress export/import, no phpMyAdmin, just command-line tools. I’m all in favor of Web interfaces, and I do think WordPress export and import, and phpMyAdmin, are great ideas; they just didn’t work at all well for me in this situation. But one possible take-home message is: it pays to be comfortable with the command line.