The Well-Formed Web

Exploring the limits of XML and HTTP

The Well-Formed Web is now cool.

That is to say, all the URLs on this site are now cool, with cool being defined by Tim Berners-Lee in his article Cool URIs don't change. In the article he argues that URLs should never change, and that the best way to achieve that is to do some up-front design of your URLs so that you won't need to change them in the future.

The URLs for the weblog entries on this site used to be of the form:

/RESTLog.cgi/1

In the article Tim Berners-Lee gives a list of things to leave out of your cool URL. My URLs were mostly cool already, because they didn't include such information as the author's name, the subject, the status, or the file name extension. The only information embedded in that URL is the software mechanism: .cgi. It has to go. If a new and better method comes along to serve up my content and I deploy it, then I end up breaking my URLs, and that's not cool. So I need a way to remove any reference to .cgi while also allowing the old URLs, which are already linked to out in the wild of the internet, to keep working. The server I am running on requires the use of the .cgi extension, so just renaming the file won't work.

Apache to the rescue

The Apache module mod_rewrite comes to the rescue. This powerful module allows rewriting of URLs on the fly. So I have two ugly URLs to contend with, /RESTLog.cgi and /stories/RESTLog.cgi. The first I want mapped to /news and the second I want mapped to /story. Here is the section of my .htaccess file that accomplishes that rewriting:

RewriteEngine on
RewriteBase /
RewriteRule ^news(.*) /RESTLog.cgi$1  [L]
RewriteRule ^story(.*) /stories/RESTLog.cgi$1  [L]

Note that these rules only modify a URI if it starts with "news" or "story", so the old URIs will still work even after I switch to using the new URLs. That means no broken links. Now that's cool.
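To make the matching behaviour concrete, here is a rough Python sketch that imitates the two rules above with regular expressions. It is only a simulation and not how Apache itself works internally: the rewrite() function and the RULES table are names made up for this illustration, and the sketch assumes the .htaccess file sits at the site root, so that the leading slash is stripped before matching.

```python
import re

# The two RewriteRule lines from the .htaccess excerpt, written as
# (pattern, substitution) pairs. RULES and rewrite() are hypothetical
# names used only for this illustration.
RULES = [
    (re.compile(r"^news(.*)"), r"/RESTLog.cgi\1"),
    (re.compile(r"^story(.*)"), r"/stories/RESTLog.cgi\1"),
]


def rewrite(path):
    """Return the rewritten path, or the path unchanged if no rule matches."""
    # Assumption: per-directory context at the site root, so the pattern
    # is matched against the path without its leading slash.
    relative = path.lstrip("/")
    for pattern, substitution in RULES:
        if pattern.match(relative):
            # Like the [L] flag, stop at the first rule that matches.
            return pattern.sub(substitution, relative, count=1)
    return path


if __name__ == "__main__":
    for url in ("/news/1", "/story/3", "/RESTLog.cgi/1", "/stories/RESTLog.cgi/3"):
        print(url, "->", rewrite(url))
```

The new-style paths are mapped onto the CGI scripts, while the old-style paths match neither pattern and pass through untouched, which is why existing links keep working.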

This change also required some changes to the server-side code. The code was extended by adding a base_uri__ variable to RESTLogImpl.py that is used as the base URI for all generated URLs. This fixes a problem that shows up when the web site is accessed with ModRewrite in use. SCRIPT_NAME used to be used to generate the URLs, which was easy to do but is not robust. For example, I want all the URLs on WellFormedWeb to be of the form:

/news/N

but the server is only configured to execute scripts if the filename ends in .cgi so I am stuck with:

/RESTLog.cgi/N

I can use ModRewrite to accept /news:

RewriteEngine on
RewriteBase /cgi-bin/
RewriteRule ^cgi-bin/news(.*) /cgi-bin/RESTLog.cgi$1  [L]

But even with this rewrite in place, SCRIPT_NAME still points to RESTLog.cgi, so the permalinks generated have RESTLog.cgi in them and not news. The answer I came up with is to set the base URI in the main script, RESTLog.cgi. The alternative was to look for other CGI environment variables that are present only when a rewrite was done, but then I realized that this would create a whole new problem: if I had two installs of pamphlet and they were configured differently, one for /RESTLog.cgi and the other for /news, then posts from each install would have a different form of permalink. Yuk.
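The difference between the two approaches can be sketched in a few lines of Python. This is not the actual code from RESTLogImpl.py: BASE_URI, permalink() and permalink_from_script_name() are hypothetical names, and the SCRIPT_NAME value in the usage lines is an assumed example of what the server might report after a rewrite.

```python
# Hypothetical configuration value, set once in the main script.
# The real variable in RESTLogImpl.py may be named and used differently.
BASE_URI = "/news"


def permalink(entry_id, base_uri=BASE_URI):
    """Build a permalink from the configured base URI."""
    return "%s/%s" % (base_uri.rstrip("/"), entry_id)


def permalink_from_script_name(entry_id, environ):
    """The older approach: derive the base from the CGI SCRIPT_NAME variable."""
    return "%s/%s" % (environ.get("SCRIPT_NAME", ""), entry_id)


if __name__ == "__main__":
    # Assumed environment after a rewrite: SCRIPT_NAME still names the script.
    environ = {"SCRIPT_NAME": "/RESTLog.cgi"}
    print(permalink_from_script_name(7, environ))  # uses the .cgi name
    print(permalink(7))                            # uses the configured base
```

Because the base is configured rather than sniffed from the request, every permalink an install generates has the same form no matter which URL the visitor used to reach the script.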

Postscript: Many thanks to Mark Pilgrim for sending me some of his .htaccess files as examples and for pointing me to A Users Guide to URL Rewriting with the Apache Webserver, which is also loaded with examples.

2003-01-04 23:45