The Well-Formed Web

Exploring the limits of XML and HTTP

The Well-Formed Web

Over a month ago Paul Ford published a great essay entitled How Google beat Amazon and Ebay to the Semantic Web. After reading it the first time I thought it was a great introduction to the Semantic Web, an idea I had been trying to wrap my head around ever since encountering RDF as it is baked into RSS 1.0. I had seen the light and bought into the promise of the Semantic Web.

Time passes...

With Dave Winer's floating of the idea of RSS 2.0, discussions ensue about the RDF in RSS 1.0. After spending some time badgering poor Bill Kearney for a concrete benefit of having RDF in RSS 1.0, and not getting a really satisfactory answer, I went back and read Paul Ford's essay again. I wanted to get that old religious feeling back again. It didn't work. The magic was gone.

Jump back to a month ago: Mark Pilgrim and I were having a discussion about news aggregators accepting non-well-formed XML. Well-formed is a strictly defined term in the XML specification. It is a series of constraints a file must pass before it can be considered XML. Failing to meet any of the constraints means the file is not an XML file. It is the minimum threshold for XML and an important measure because it means that the XML file can be loaded into any number of XML tools or libraries and manipulated programmatically.

Now don't get me wrong, any text file can be manipulated programmatically: just load the file as a string and do search and replace using regular expressions. The advantage of XML is that it imposes a structure that you can navigate using XML tools. And if there are many files of the same format, for example RSS files, then it becomes easy to process a great many of these files at one time and extract useful information. Just the kind of processing done by news aggregators. That's the idea of what I call the "Well-Formed Web": instead of a web of ill-formed and difficult to decipher HTML pages, the Well-Formed Web is all those HTML pages backed up with Well-Formed XML documents in well-known formats.
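To make the aggregator case concrete, here is a minimal sketch in Python using only the standard library. The function name item_titles and the two sample feeds are made up for illustration. The point is that an XML parser either hands back a navigable tree or rejects the file outright; there is no in-between.

```python
import xml.etree.ElementTree as ET

def item_titles(feed_text):
    """Return the titles of all 'item' elements, or None if the text is not well-formed XML."""
    try:
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        # Not well-formed, so by definition not XML: refuse to guess.
        return None
    return [item.findtext("title", default="") for item in root.iter("item")]

# Hypothetical sample feeds: the second has a mismatched closing tag.
GOOD = ("<rss version='2.0'><channel>"
        "<item><title>First post</title></item>"
        "<item><title>Second post</title></item>"
        "</channel></rss>")
BAD = "<rss><channel><item><title>Unclosed</item></channel></rss>"
```

Calling item_titles(GOOD) gives back the two titles, while item_titles(BAD) gives None because the parser stops at the first broken constraint.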

It's the simple power of well-formed XML documents and the ability to easily process them that took the sheen off the Semantic Web for me. Go back and read Paul Ford's essay again, but this time every time you see "RDF" substitute it with "XML" and every time he mentions "Semantic Web" replace it with "Well-Formed Web".

Go read it. I'll wait...

So what would be really needed to make Paul's vision come true? Google already indexes XML documents. So we need an XML format for selling items, for example the following file could be posted to my web site as /forsale.xml:

<forsale>
    <item id="guitar1">
        <minimumBid currency="dollars">300</minimumBid>
        <description>Guitar, Electric. Barely used.</description>
        <image>http://...jpg</image>

        <biddingEnds>2002-09-29T22:49:10-04:00</biddingEnds>
    </item>
    ...
    <item id="amp1">
    </item>
</forsale>

And a bid, recorded in the following format, could be posted on a bidder's web site at http://iwantit.....org/bids.xml:

<bid>
    <bidder>
        <email>joe@bitw..</email>
    </bidder>

    <item>
        <reference>/forsale.xml#guitar1</reference>
        <offered currency="dollars">350</offered>
    </item>
</bid>
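Here is a rough sketch of how software might tie a 'bid' back to the 'forsale' item it refers to. It assumes the fragment after the '#' in 'reference' names the item's id attribute, which is how the two samples above line up; the function name check_bid and the trimmed-down sample documents are my own for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two formats above, used only as sample data.
FORSALE = """<forsale>
    <item id="guitar1">
        <minimumBid currency="dollars">300</minimumBid>
        <description>Guitar, Electric. Barely used.</description>
    </item>
</forsale>"""

BID = """<bid>
    <item>
        <reference>/forsale.xml#guitar1</reference>
        <offered currency="dollars">350</offered>
    </item>
</bid>"""

def check_bid(forsale_text, bid_text):
    """Return True if the bid names an existing item and meets its minimum bid."""
    forsale = ET.fromstring(forsale_text)
    bid_item = ET.fromstring(bid_text).find("item")
    # Assumption: the fragment after '#' is the id of the item being bid on.
    _, _, item_id = bid_item.findtext("reference").partition("#")
    offered = bid_item.find("offered")
    for item in forsale.iter("item"):
        if item.get("id") == item_id:
            minimum = item.find("minimumBid")
            same_currency = minimum.get("currency") == offered.get("currency")
            return same_currency and float(offered.text) >= float(minimum.text)
    return False  # the bid refers to an item that is not for sale
```

Nothing here needs more than well-formed XML and a shared understanding of the element names.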

Now the bidder could have found the 'forsale' file by searching Google, but how is he going to notify the seller that he's posted a bid for it? By using referer logs. That is, the application that creates and posts these XML files (you didn't expect to do this by hand, did you?) can also request the 'forsale' file from the seller's site, and when it does it fills in the referer information with a URL that points back to the bidder's 'bid' file. Now the seller's software can do what Mark Pilgrim's Automatic linkback software does and collect referer log entries every hour and update the 'forsale' document to list the highest bids:

<forsale>
    <item id="guitar1">
        <minimumBid currency="dollars">300</minimumBid>

        <description>Guitar, Electric. Barely used.</description>
        <image>http://...jpg</image>
        <biddingEnds>2002-09-29T22:49:10-04:00</biddingEnds>
        <highestBid>http://iwantit.....org/bids.xml</highestBid>

    </item>
    ...
    <item id="amp1">
    </item>
</forsale>
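Both halves of that exchange can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names and the example.org URLs are hypothetical, and the seller's software is assumed to have already fetched each referring 'bid' file and reduced it to a (URL, amount) pair. The Referer header is easy to fake, so real software would have to verify the bid file itself rather than trust the log entry.

```python
import urllib.request
import xml.etree.ElementTree as ET

def fetch_request(forsale_url, bid_url):
    """Bidder side: build (but do not send) a GET whose Referer points at the bid file."""
    request = urllib.request.Request(forsale_url)
    request.add_header("Referer", bid_url)
    return request

def record_highest_bid(forsale_root, item_id, bids):
    """Seller side: write the URL of the highest bid into the matching item.

    bids is a list of (bid_url, amount) pairs gathered from the referer log.
    Returns the winning URL, or None if there are no bids or no such item.
    """
    if not bids:
        return None
    best_url, _ = max(bids, key=lambda bid: bid[1])
    for item in forsale_root.iter("item"):
        if item.get("id") == item_id:
            highest = item.find("highestBid")
            if highest is None:
                highest = ET.SubElement(item, "highestBid")
            highest.text = best_url
            return best_url
    return None
```

Passing the request to urllib.request.urlopen would perform the actual fetch and leave the bid URL in the seller's referer log.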

Sure, not everybody has a web site to post 'bid's or 'forsale's, so web-hosting services can do "micro-hosting". For $5 a year they'll give you one of these selling/bidding apps to run from home and a minuscule amount of file space to post your files. The bigger and more frequently updated sites get searched by Google more frequently and a whole new service category springs to life.

Distributed eBay. Now that's a web service. Very RESTian. No RDF.

This is just one example of the possibilities of the Well-Formed Web. It can be built today with current tools and with no need for RDF, 3-tuples or ontologies.

2002-11-24 23:28 Comments (0)

To Do

My list of things to do on the WellFormedWeb.

File Formats
Document the Archive and Template file formats each in their own location then change the PURLs that point to them.

Done.

Blosxom
Example in blosxom that returns Archive format.

References
Build a page of references to external resources, such as the REST Wiki and Monastic XML.

Done.

2002-12-15 00:47 Comments (0)

Download

The latest versions of my RESTLog implementations, called Bulu (server) and Pamphlet (client), can be downloaded:

Bulu Version 0.95
The server is a set of Python scripts.
Pamphlet Version 0.2
The client is a .Net application written in C#. The download contains all the source and a pre-built executable. It has not been tested against the server recently and is only left here for historical purposes.

The curious can browse and download all the old versions.

2002-12-13 23:39 Comments (0)

The Comment API

OK, there are a lot of interfaces now circulating: TrackBack, Ping-Back, Post-It. All of these are ways of commenting on an item. The only thing missing from the mix is a way to do 'comments' themselves. That is where this specification enters. It is intended to be a roll-up of all the above specifications and to cover comments as well.

As usual I am going to try to re-use as much prior art as possible. In this case I am going to re-use RSS 2.0 and have the payload for this type of message be an 'item' fragment from an RSS 2.0 feed.

If you want to add a comment to a story you just POST an RSS 'item' fragment to the URL specified for comments. How to find that URL is covered later in this document.

POST /news/comments/5 HTTP/1.1
Content-Type: text/xml

<?xml version="1.0" encoding='iso-8859-1'?>
<item>
  <title>Foo Bar</title>
  <author>[email protected]</author>
  <link>http://www.bar.com/</link>
  <description>My Excerpt</description>
</item>

The only response required from the server is the HTTP status code:

HTTP/1.1 200 OK

Note that any appropriate status code can be returned. For example, a 303 may be returned with a Location: header pointing to the URL of the newly created comment.

Further note that the contents of the 'item' element are guided by the RSS 2.0 specification, which states that all elements are optional, but requires that at least one of 'title' or 'description' is present.
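A client for this can be very small. The following is a sketch, not a reference implementation: the function name build_comment_request and the example.org URL in the usage note are hypothetical, and the request is only built, not sent. It enforces the one rule quoted above, that an item needs a 'title' or a 'description'.

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_comment_request(comment_url, title=None, author=None, link=None, description=None):
    """Build a POST carrying an RSS 2.0 'item' fragment as its body."""
    if not (title or description):
        raise ValueError("an RSS 2.0 item needs at least a title or a description")
    item = ET.Element("item")
    fields = (("title", title), ("author", author), ("link", link), ("description", description))
    for tag, value in fields:
        if value:
            ET.SubElement(item, tag).text = value
    # ET.tostring() escapes markup characters and returns ASCII-safe bytes.
    body = ET.tostring(item)
    return urllib.request.Request(
        comment_url, data=body, method="POST", headers={"Content-Type": "text/xml"}
    )
```

For example, build_comment_request("http://www.example.org/news/comments/5", title="Foo Bar", description="My Excerpt") yields a request that urllib.request.urlopen could send; the only thing the client then needs to look at is the status code.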

Examples

Examples are worth a thousand words, so here are examples of TrackBack, Ping-Back and Post-It items all re-factored into CommentAPI format.

Post-It

Here is an example of a Post-It. Note that line breaks have been inserted for readability.

POST http://www.foo.com/mt-tb.cgi/5
Content-Type: application/x-www-form-urlencoded

comment=My+Excerpt
&email=[email protected]
&name=Foo+Bar
&url=http://www.bar.com/
&agent=send-cb.pl+(Version+0.1)

In the Comment API this becomes:

POST /news/comments/5 HTTP/1.1
Content-Type: text/xml

<item>
  <title>Foo Bar</title>
  <author>[email protected]</author>
  <link>http://www.bar.com/</link>
  <description>My Excerpt</description>
</item>

TrackBack

Here is an example of TrackBack. Note that line breaks have been inserted for readability.

POST http://www.foo.com/mt-tb.cgi/5
Content-Type: application/x-www-form-urlencoded

title=Foo+Bar
&url=http://www.bar.com/
&excerpt=My+Excerpt
&blog_name=Foo

In the Comment API this becomes:

POST /news/comments/5 HTTP/1.1
Content-Type: text/xml

<item>
  <title>Foo Bar</title>
  <link>http://www.bar.com/</link>
  <description>My Excerpt</description>
  <source>Foo</source>
</item>
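The two form-encoded interfaces can be re-factored mechanically. Here is a sketch of a converter, assuming the field names used in the Post-It and TrackBack examples (with 'email' as the Post-It field that feeds 'author'); the names FIELD_MAPS and form_to_item are made up for this illustration. Fields with no CommentAPI counterpart, such as 'agent', are simply dropped.

```python
from urllib.parse import parse_qs
import xml.etree.ElementTree as ET

# Form field name -> RSS 'item' child element, per interface.
FIELD_MAPS = {
    "trackback": {"title": "title", "url": "link",
                  "excerpt": "description", "blog_name": "source"},
    "postit": {"name": "title", "email": "author",
               "url": "link", "comment": "description"},
}

def form_to_item(form_body, kind):
    """Turn an application/x-www-form-urlencoded body into an 'item' element."""
    fields = parse_qs(form_body)  # decodes '+' and %xx escapes
    item = ET.Element("item")
    for form_name, element_name in FIELD_MAPS[kind].items():
        if form_name in fields:
            ET.SubElement(item, element_name).text = fields[form_name][0]
    return item
```

Serializing the result with ET.tostring(item, encoding="unicode") gives the fragment that would be POSTed to the comment URL.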

Ping-Back

Here is an example of Ping-Back.

POST /news HTTP/1.1
Content-Type: text/xml


<methodCall>
  <methodName>pingback.ping</methodName>
  <params>
    <param>
      <value>http://www.bar.com/</value>
      <name>sourceURI</name>
    </param>
    <param>
      <value>http://www.somebar.com/news</value>
      <name>targetURI</name>
    </param>
  </params>
</methodCall>

In the Comment API this becomes:

POST /news/comments/5 HTTP/1.1
Content-Type: text/xml

<item>
  <link>http://www.bar.com/</link>
</item>

In this case the targetURI doesn't appear since it is implicitly embedded in the URI given for the Comment to be posted to.

Summary Table

Here is a summary of all the above interfaces and how they map to RSS 'item' child elements. Also included in the last column are the elements that are used when posting a comment.

Table 1
How other interfaces map to the CommentAPI
CommentAPI Element Ping-back Track-back Post-It A Comment
title title name name title
link sourceURI url url Link to home page of the comment author.
description excerpt comment The text of the comment.
author email email
source blog_name
dc:creator name

When looking at the above table it is important to keep in mind the context of the data. That is, remember that this is what the data looks like when it arrives at the server. This in no way constrains the CommentAPI server as to how to interpret the data. For example, if you produce an RSS feed of the last 20 comments on your site, then there should be no expectation that the 'item's in that RSS feed be formatted exactly as they arrived. That is because they appear in a different context.

Notes

  1. XML-RPC makes it awkward to refer to things. What this entry means is that the 'value' paired with the 'sourceURI' name is what is submitted in the 'link' element.

Auto-Discovery

Two mechanisms are available for discovering the URI that is the target of the POST. The first is a way to put that information in HTML, the second is a way to embed that information in an RSS feed.

HTML

Lots of options here, but the <link> element has been so successful in finding RSS feeds that I'm going to use it here for discovering the Comment interface URI in HTML pages. In this case the form is:

<link rel="service.comment" type="text/xml" href="url goes here" title="Comment Interface">

Where href should be set to the URL that understands the CommentAPI. Applications looking for a comment URI need to parse out the headers of the web page and look for a link tag that has a relation rel of "service.comment" and a mime-type of "text/xml".

N.B. The URI given is not the URI of the web page that will present an HTML form for posting but instead it is the URI that will take POSTs of RSS 2.0 'item' fragments and will interpret them as comments.
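Discovery in HTML might look like the following sketch, which uses Python's standard html.parser so that it copes with pages that are not well-formed. It matches on the rel value "service.comment" as shown in the link element above; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class CommentLinkFinder(HTMLParser):
    """Remember the href of the first <link> that advertises a comment interface."""

    def __init__(self):
        super().__init__()
        self.comment_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.comment_url is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("rel") == "service.comment"
                and attributes.get("type") == "text/xml"):
            self.comment_url = attributes.get("href")

def find_comment_url(html_text):
    """Return the CommentAPI URL advertised by the page, or None if there is none."""
    finder = CommentLinkFinder()
    finder.feed(html_text)
    return finder.comment_url
```

A page that carries both an RSS feed link and a comment interface link is handled correctly, since only the link with the matching rel and type is picked up.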

RSS

A new item level element named 'comment' in the namespace /CommentAPI/ is used to provide the location of the CommentAPI endpoint to aggregator software. This is providing the same information as the link tag does in HTML. Here is an example:

<wfw:comment xmlns:wfw="/CommentAPI/">
  /news/comments/52
</wfw:comment>

N.B. As stated for the link tag, the URI given is not the URI of the web page that will present an HTML form for posting but instead it is the URI that will take a POST of an RSS 2.0 'item' fragment and will interpret it as a comment.
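For aggregator authors, pulling the endpoint out of a feed is a namespace-qualified lookup. The sketch below uses the namespace string exactly as it is written in the example above; the function name comment_endpoints is hypothetical, and keying the result by item title is just a convenience for the illustration.

```python
import xml.etree.ElementTree as ET

WFW_NS = "/CommentAPI/"  # namespace string copied from the example above

def comment_endpoints(feed_text):
    """Map each item's title to its CommentAPI URL, skipping items without one."""
    root = ET.fromstring(feed_text)
    endpoints = {}
    for item in root.iter("item"):
        url = item.findtext("{%s}comment" % WFW_NS)
        # The element content may be padded with whitespace and line breaks.
        if url and url.strip():
            endpoints[item.findtext("title", default="")] = url.strip()
    return endpoints
```

Matching on the namespace rather than on the 'wfw' prefix matters, because a feed is free to bind that namespace to any prefix it likes.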

Revision History

13-March-2003
Added anchors for each section. Also updated the RSS Auto-find to include suggestions from this discussion.
20-March-2003
Changed the namespace URI for wfw to be more specific.

2003-01-17 14:14 Comments (0)

RESTLog File API

The RESTLog File API is an HTTP interface for uploading and managing files, whether they are photographs, icons, Word Documents, HTML pages, etc.

N.B. This interface was originally called the RESTLog Image API but as was pointed out on [ucapi-discuss] this could be extended by accepting more media types, without coming close to the capabilities of WebDAV. Providing a limited interface makes the RESTLog File API operate like a WebDAV-lite.

Table 1
The RESTLog File Interface
URL | Verb | Type | Description | Format
/.../albumID | GET | html | An HTML page for all the documents in this album. Includes thumbnails. |
/.../albumID | GET | xml | A list of all the documents and sub-albums. See Notes. | Archive
/.../albumID | POST | xml | Create a new sub-album. | <album><name>albumID</name></album>
/.../albumID | DELETE | - | Delete the album. |
/.../docID | GET | varies | The document. | Whatever is stored there: PNG, GIF, Word Document, PDF, etc.
/.../docID | PUT | varies | Creates or overwrites a document. | Whatever is used: PNG, GIF, etc.
/.../docID | DELETE | - | Delete the document. |

Note 1: This specification is recursive. That is, it defines the operations available on an album, which include the creation of sub-albums. The sub-albums must then support this API.

Note 2: I have intentionally used the term 'album' instead of directory throughout this specification to avoid confusing implementation and specification. When implementing the specification the albums could be implemented as sub-directories. On the other hand, an implementation could store all the files in an SQL database.

Note 3: In the Archive format documents appear in 'res' elements while sub-albums appear as 'more' elements.
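Until proper examples are written, here is a sketch of what a client for the interface in Table 1 might build. The helper names and the choice of "text/xml" as the media type for the album POST are assumptions of mine, not part of the specification, and the requests are only constructed, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

def create_sub_album(album_url, name):
    """POST an <album> document to an album URL to create a sub-album."""
    album = ET.Element("album")
    ET.SubElement(album, "name").text = name
    return urllib.request.Request(
        album_url, data=ET.tostring(album), method="POST",
        headers={"Content-Type": "text/xml"},  # assumed media type
    )

def put_document(doc_url, content, media_type):
    """PUT raw bytes to a document URL, creating or overwriting the document."""
    return urllib.request.Request(
        doc_url, data=content, method="PUT",
        headers={"Content-Type": media_type},
    )

def delete(url):
    """DELETE an album or a document; the verb is the same for both."""
    return urllib.request.Request(url, method="DELETE")
```

Because the specification is recursive, the URL of a newly created sub-album can be passed straight back into create_sub_album or used as the parent for put_document.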

Examples to go here.

2003-01-07 22:30 Comments (0)