The Annotator

	Update: The CritSuite Toolset Project has been completed. This page is now part of an archive of CritSuite web pages. The domain http://crit.org no longer belongs to this project or to Foresight Institute. For current information on CritSuite, please see the site maintained by the author of the software, Ka-Ping Yee: http://zesty.ca/crit
	Enhancing the World Wide Web: Social Software for the Evolution of Knowledge

The Annotator Project

Terry Stanley

(based upon an annotation system developed by Wayne Gramlich)

Update: The Web Enhancement group has since produced Crit, a web-based service enabling annotation and backlinking on any public web page. Try it out and obtain the software at http://crit.org/


			The Criteria
			The Approach: Work with the Web
			Proxies and Firewalls
			Scaling Issues
			Further scaling issues:

The Criteria

Early in the Web Enhancement Project, a set of link properties was defined. These properties are considered to be either required or highly desirable in building a hypertext backlinks system. Evaluating candidate systems for a hypertext publishing system designed to aid the evolution of knowledge therefore meant asking, "How many of the link properties does this system support?"

A system designed and implemented by Wayne Gramlich appears to meet the criteria. This backlinks system (referred to here as the "annotator") will be used in the Computer Security Debate.

The Approach: Work with the Web

The annotator was designed to fit in with the existing Web architecture. There are near-term and long-term advantages to using the standard protocols and browsers. The annotator is immediately accessible by anyone having Web access and a standard Web browser. Over the long term, as protocols improve and the browser wars continue, our backlinks extension benefits from the development efforts of others.

The annotator hooks into the Web through the proxy interface. The configuration shown below is common and well supported. Proxies exist to funnel all requests through a firewall; incoming responses are restricted to a single host: "the proxy".

[Figure: Relation of proxy server to workstations and the Internet]

Proxies have the useful property of chaining: a network can include a single proxy as well as a series of proxies. This property led to the following design.

[Figure: Relation of annotator to workstation and proxy server]

The proxy interface provides the hooks needed to add backlinks to the Web (see Proxies and Firewalls). The annotator, as a member of the proxy chain, can listen to outgoing requests and intercept incoming responses. This simple program (roughly 20 pages of C code) does little more than what is required to annotate documents. Rather than implement the full proxy interface, it runs alongside a proxy server. For example, on my laptop, I run the annotator with the Apache Web server configured as a proxy server. Together they make an "annotating proxy".

When the annotator receives a browser request, it simply passes this along to the proxy server. The document is fetched and returned to the annotator. At this point, before the annotator returns the document to the browser, annotations are added.

The annotator, having received a document, searches available annotation sets for links to this document. Annotations are specified with:

the URL of the document being annotated
a text pattern
the text written by the author of the annotation

A backlink is added for each annotation found. If a pattern is given, the document is searched for matching text (a single word, a phrase, a paragraph) and if found, the backlink is attached here. If no pattern is given or if matching text is not found, the backlink is inserted at the beginning of the document.

The annotator generates the HTML for the backlinks using standard HTML tags, interleaves the new HTML with the original, and returns the document to the browser for viewing. (The original document is not modified; the annotations appear only in the presentation.)
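
The insertion rule described above can be sketched in a few lines. This is only an illustration in Python, not the annotator's actual C code: the names Annotation, backlink_html, and annotate are hypothetical, the backlink markup is invented for the example, and placing the backlink immediately after the matched text is an assumption (the description says only that it is attached at the match).

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class Annotation:
    url: str                 # URL of the document being annotated
    pattern: Optional[str]   # text to anchor on, or None when no pattern is given
    text: str                # the text written by the annotation's author


def backlink_html(note: Annotation) -> str:
    # Hypothetical markup; the source says only that standard HTML tags are used.
    return '<a title="' + escape(note.text, quote=True) + '">[note]</a>'


def annotate(doc_url: str, html: str, notes: list[Annotation]) -> str:
    """Return a copy of html with one backlink per annotation for doc_url."""
    for note in notes:
        if note.url != doc_url:
            continue
        link = backlink_html(note)
        pos = html.find(note.pattern) if note.pattern else -1
        if pos >= 0:
            # Pattern found: attach the backlink right after the matched text.
            end = pos + len(note.pattern)
            html = html[:end] + link + html[end:]
        else:
            # No pattern, or no matching text: insert at the beginning.
            html = link + html
    return html
```

The function returns a new string and leaves its input alone, mirroring the point that the original document is never modified. A plain substring search can also match text inside tags or inside a backlink added earlier; a real implementation would need to guard against that.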

Proxies and Firewalls

The prevalence of proxy-based firewalls led to browser support for proxy redirection. By simply setting a browser option, the user enables an efficient, transparent mechanism for the redirection of requests.

During the alpha phase, use of the browser option is limited. This option cannot be used if a firewall is between the browser and the annotating proxy.

Fortunately, there is an alternative: pseudo proxies. The technique is a hack, but a number of useful extensions are built on it; anonymizers, in particular, use it to shield the identity of users browsing the Web.

Instead of transparent redirection, the pseudo proxy's URL is prepended to the URL being requested. This pseudo proxy sees all requests and intercepts all responses. To make the initial "connection" the user manually edits the requested URL, but only for the first request. The pseudo proxy fetches the document, but before returning it, it scans the document for links and rewrites the URLs to point at itself. If the user follows a link from this document, the request is redirected to the pseudo proxy.
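
As a rough sketch of the link rewriting just described (in Python, under stated assumptions): the address http://annotator.example/ and the function names rewrite_links and original_url are made up for the example, and only double-quoted href attributes are handled. A real pseudo proxy would use a proper HTML parser and would also cover forms, images, and frames.

```python
import re
from urllib.parse import urljoin

PSEUDO_PROXY = "http://annotator.example/"  # hypothetical pseudo-proxy address

# Matches href="..." with double quotes only (a simplification).
HREF = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)


def rewrite_links(html, base_url, proxy=PSEUDO_PROXY):
    """Rewrite every matched href so that it points at the pseudo proxy."""
    def repl(m):
        target = urljoin(base_url, m.group(2))   # resolve relative links first
        if not target.startswith(("http://", "https://")):
            return m.group(0)                    # leave mailto: and the like alone
        return m.group(1) + proxy + target + m.group(3)
    return HREF.sub(repl, html)


def original_url(request_url, proxy=PSEUDO_PROXY):
    """Recover the URL the user actually wants from a pseudo-proxied one."""
    if not request_url.startswith(proxy):
        raise ValueError("not a pseudo-proxy URL")
    return request_url[len(proxy):]
```

Because every link in a returned page already carries the prefix, the user has to edit a URL by hand only once, for the first request.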

If the browser option limitation during alpha is unacceptable, the annotating proxy can easily be modified to implement the pseudo-proxy technique.

A post-alpha option is to port the Linux-based annotating proxy to desktop machines. This "Personal Annotator" could run on the Windows, Macintosh, or Java platforms.

Scaling Issues

The basic summary for how to make annotation sets scale is `spread it out!' The annotation system above does not have good scaling properties because it is too centralized. However, an annotation system in the future would be much more decentralized and do all of the annotation either directly in the browser or in a process running on or near the user's machine. Let's talk about how this future annotation system would scale.

For each annotation set, sort all of the URLs for annotated documents. Further reduce this list to just the names of the annotated hosts. When the annotator starts up, it fetches this host name list and stores it locally. For everything except the most gargantuan/humungous/enormous annotation sets, this is a fairly modest amount of data to fetch. The host name list can be cached in the file system so that it does not have to be refetched each time the user reboots the machine. This cost is paid only at annotator start-up time.
The annotator merges all of the sorted host name lists for all of its annotation sets into a single sorted list.
When the annotator is presented with a URL to fetch, it performs a binary search of the merged host name list to figure out whether any annotation sets may pertain.
If the binary search finds a match, the annotator now knows which annotation sets may have annotations for the requested document.
The annotator goes to each of those annotation sets and fetches the list of documents on that host that have annotations. Except for truly large sites that serve up millions of documents, this list will be downloaded fairly quickly.
Now the annotator knows whether or not the requested document has any annotations. If so, it fetches the specific annotations, merges them in, and returns the modified document to the Web browser.

Please note that in this process there is an initial pause when the first document is fetched from a given host (to download the annotated document list). As the user bounces around the Web site, he gets quick response time, since the annotated document list has already been downloaded.
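
The host name steps of this scheme (reduce to host names, merge, binary search) might look something like the following Python sketch. The function names and the representation of the merged list as (host, set name) pairs are assumptions made for illustration; fetching and caching the lists over the network is left out.

```python
import bisect
import heapq
from urllib.parse import urlsplit


def host_list(annotated_urls):
    """Reduce one annotation set's URLs to a sorted list of distinct host names."""
    hosts = {urlsplit(u).hostname for u in annotated_urls}
    hosts.discard(None)          # drop URLs that carry no host name
    return sorted(hosts)


def merge_host_lists(lists_by_set):
    """Merge the per-set sorted host lists into one sorted list of (host, set) pairs."""
    tagged = ([(h, name) for h in hosts] for name, hosts in lists_by_set.items())
    return list(heapq.merge(*tagged))


def sets_for(url, merged):
    """Binary-search the merged list; return the sets that may annotate this URL."""
    host = urlsplit(url).hostname
    if host is None:
        return []
    i = bisect.bisect_left(merged, (host, ""))   # leftmost entry for this host
    found = []
    while i < len(merged) and merged[i][0] == host:
        found.append(merged[i][1])
        i += 1
    return found
```

An empty result means the document can be passed through without contacting any annotation set server; a non-empty result tells the annotator which servers to ask for the per-host document list.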

Further scaling issues:

Really popular annotation sets will get a lot of hits (just like popular web sites). How do we deal with this? There are two answers -- geographic distribution and load balancing:

Geographic distribution: Geographic distribution is simple: just put up a mirrored annotation set server at multiple geographic sites. For example, one in Europe, a couple in the US, one in Japan and one in Australia. When an annotator first visits an annotation set server, it asks 'where are your geographic mirroring sites?' The server returns the latitude and longitude for each mirror. The annotator can compare its own latitude and longitude with those of the various sites and find the geographically closest one. The network routing people do not like this answer, since sometimes geographic proximity does not mean network proximity. Tough! Right now this is how people do geographic distribution; they ask users to click on the web page that is closest to them.
Load balancing: It may still be the case that a given server in a given geographic location is getting pounded into oblivion. The solution is to design annotation sets so that they can load balance. What you do is mirror the annotation set across N servers. The annotator goes to the geographic site and asks 'how many mirrored sites do you have and what are their names?' Each annotator then computes a hash of its Internet address, takes the remainder after dividing the hash by N, and talks to the annotation set server with that index.
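
Both selection rules are small enough to sketch. The Python below is illustrative only: the text does not say how distance is measured or which hash is used, so the haversine great-circle formula and CRC-32 are assumptions, as are the function names and the shape of the site and mirror lists.

```python
import math
import zlib


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_site(my_lat, my_lon, sites):
    """sites maps a site name to (latitude, longitude); return the closest name."""
    return min(sites, key=lambda s: great_circle_km(my_lat, my_lon, *sites[s]))


def pick_mirror(client_ip, mirrors):
    """Hash the client's Internet address and take the remainder modulo N."""
    n = len(mirrors)
    return mirrors[zlib.crc32(client_ip.encode("ascii")) % n]
```

Because the hash depends only on the client's address, a given annotator keeps talking to the same mirror for as long as N stays the same, while different annotators spread across all N servers.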

The final component to the scaling issue is 'what about gigantic annotation sets?' An example of a gigantic annotation set is one that attempts to keep track of all back links. This is an annotation set that would basically span the entire web. First, you need lots and lots and lots of hardware. This is essentially what Digital's Alta Vista is trying to do. Second, the strategy of downloading the host name list is basically a waste of time; the solution is to not do it. Instead, you go to the annotation set server each time you visit a host and fetch the annotated documents list. Likewise, for sites that have huge document sets, downloading the list of annotated documents can be a waste of time. Again, the solution is to not download it, but instead to query the annotation set server each time you fetch a document. There is no magic here.

Terry Stanley is the lead programmer on the Annotator project and can be reached at tstanley@best.com.


Web site developed by Stephan Spencer and Netconcepts; maintained by James B. Lewis Enterprises.