Update: The CritSuite Toolset Project has been completed. This page is now part of an archive of CritSuite web pages. The domain http://crit.org no longer belongs to this project or to Foresight Institute. For current information on CritSuite, please see the site maintained by the author of the software, Ka-Ping Yee: http://zesty.ca/crit
(based upon an annotation system
developed by Wayne Gramlich)
Update: The Web Enhancement group has since produced Crit, a web-based service enabling annotation and backlinking on any public web page. Try it out and obtain the software at http://crit.org/
A system designed and implemented by Wayne Gramlich
appears to meet the criteria. This backlinks system (referred to
here as the "annotator") will be used in the Computer Security Debate.
The Approach: Work with the Web
The annotator
was designed to fit in with the existing Web architecture.
There are near-term and long-term advantages to using the
standard protocols and browsers. The annotator is immediately
accessible by anyone having Web access and a standard Web
browser. Over the long term, as protocols improve and the browser
wars continue, our backlinks extension benefits from the
development efforts of others.
The annotator hooks into the Web through the proxy interface.
The configuration shown below is common and well supported.
Proxies exist to funnel all requests through a firewall; incoming
responses are restricted to a single host: "the proxy".
Proxies have the useful property of chaining: a network can
include a single proxy as well as a series of proxies. This
property led to the following design.
The proxy interface provides the hooks needed to add backlinks
to the Web (see Proxies and Firewalls).
The annotator, as a member of the proxy chain, can listen to
outgoing requests and intercept incoming responses. This simple
program (roughly, 20 pages of C code) does little more than what
is required to annotate documents. Rather than implement the full
proxy interface, it runs with a proxy server. For example, on my
laptop, I run the annotator with the Apache Web server configured
as a proxy server. Together they make an "annotating
proxy".
When the annotator receives a browser request, it simply passes
this along to the proxy server. The document is fetched and
returned to the annotator. At this point, before the annotator
returns the document to the browser, annotations are added.
The annotator, having received a document, searches available
annotation sets for links to this document. Annotations are
specified with:
the URL of the document being annotated
a text pattern
the text written by the author of the annotation
A backlink is added for each annotation found. If a pattern is
given, the document is searched for matching text (a single word,
a phrase, a paragraph) and, if it is found, the backlink is attached
there. If no pattern is given or if matching text is not found,
the backlink is inserted at the beginning of the document.
The annotator generates the HTML for the backlinks using standard
HTML tags, interleaves the new HTML with the original, and
returns the document to the browser for viewing. (The original
document is not modified; the annotations appear only in the
presentation.)
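The placement rule described above can be sketched as follows. This is a minimal illustration in Python, not the annotator's actual C code: the Annotation class, the link field (where the annotation itself lives), and the function names are assumptions made for the example, and "the beginning of the document" is simplified to the start of the HTML string.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Optional


@dataclass
class Annotation:
    url: str                # URL of the document being annotated
    pattern: Optional[str]  # text to attach the backlink to, or None
    text: str               # text written by the author of the annotation
    link: str               # hypothetical field: address of the annotation itself


def backlink_html(note: Annotation) -> str:
    """Build a backlink using only a standard anchor tag."""
    return f' <a href="{escape(note.link, quote=True)}">[{escape(note.text)}]</a>'


def annotate(doc_url: str, html: str, notes: List[Annotation]) -> str:
    """Return a copy of html with backlinks interleaved; the input is not modified."""
    unplaced = []
    for note in notes:
        if note.url != doc_url:
            continue  # annotation belongs to some other document
        pos = html.find(note.pattern) if note.pattern else -1
        if pos >= 0:
            # Pattern found: attach the backlink right after the matched text.
            end = pos + len(note.pattern)
            html = html[:end] + backlink_html(note) + html[end:]
        else:
            # No pattern, or no match: the backlink goes at the beginning.
            unplaced.append(backlink_html(note))
    return "".join(unplaced) + html
```

A real implementation would need to avoid matching inside tags and would place the leading backlinks after the opening body tag; the sketch ignores both points.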
Proxies and Firewalls
The prevalence of proxy-based firewalls led to browser support
for proxy redirection. By simply setting a browser option, the
user enables an efficient, transparent mechanism for the
redirection of requests.
During the alpha phase, use of the browser option is limited.
This option cannot be used if a firewall is between the browser
and the annotating proxy.
Fortunately, there is an alternative: pseudo proxies. This is a
hack, but it is used in implementing a number of useful
extensions. In particular, anonymizers use it to shield the
identity of users browsing the Web.
Instead of transparent redirection, the pseudo proxy's URL is
prepended to the URL being requested. This pseudo proxy sees all
requests and intercepts all responses. To make the initial
"connection" the user manually edits the requested URL,
but only for the first request. The pseudo proxy fetches the
document, but before returning it, it scans the document for
links and rewrites the URLs to point at itself. If the user
follows a link from this document, the request is redirected to
the pseudo proxy.
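The link-rewriting step of a pseudo proxy might look like the sketch below. The proxy address annotator.example.org and the function name are placeholders invented for the example, and the regular expression only handles double-quoted href attributes; a real pseudo proxy would use an HTML parser and would also rewrite forms, images, and frames.

```python
import re
from urllib.parse import urljoin

# Hypothetical address of the pseudo proxy; not a real service.
PSEUDO_PROXY = "http://annotator.example.org/"

# Simplification: only double-quoted href attributes are recognised.
HREF = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)


def rewrite_links(html: str, base_url: str, proxy: str = PSEUDO_PROXY) -> str:
    """Rewrite every link so that following it sends the request to the proxy."""

    def redirect(match):
        # Resolve relative links against the URL of the fetched document.
        absolute = urljoin(base_url, match.group(2))
        if not absolute.startswith(("http://", "https://")):
            return match.group(0)  # leave mailto: and similar links alone
        # Prepend the pseudo proxy's URL to the URL being requested.
        return match.group(1) + proxy + absolute + match.group(3)

    return HREF.sub(redirect, html)
```

With this in place the user edits only the first URL by hand; every link in a returned page already points back at the pseudo proxy.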
If the browser option limitation during alpha is unacceptable,
the annotating proxy can easily be modified to implement the
pseudo-proxy technique.
A post-alpha option is to port the Linux-based annotating proxy
to desktop machines. This "Personal Annotator" could
run on the Windows, Macintosh, or Java platforms.
Scaling Issues
The basic rule for making annotation sets scale is
'spread it out!' The annotation system above does not have good
scaling properties because it is too centralized. However, an
annotation system in the future would be much more decentralized
and do all of the annotation either directly in the browser or in
a process running on or near the user's machine. Let's talk about
how this future annotation system would scale.
For each annotation set, sort all of the URLs for
annotated documents. Further reduce this list to just the
names of the annotated hosts. When the annotator starts
up, it fetches this host name list and stores it locally.
For everything except the most gargantuan annotation
sets, this is a fairly modest amount of data to fetch.
The host name list can be cached in the file system so
that it does not have to be refetched each time the user
reboots the machine. This cost is paid only at annotator
start-up time.
The annotator merges all of the sorted host name lists
for all of its annotation sets into a single sorted list.
When the annotator is presented with a URL to fetch, it
performs a binary search of the merged host name list to
figure out if there are any annotation sets that may
pertain.
If there is a match after the binary search, the
annotator now knows which annotation sets may
have annotations for the requested document.
The annotator goes to each annotation set and fetches the
list of the documents for the host that have annotations.
Except for truly large sites that serve up millions of
documents, this will be downloaded fairly quickly.
Now the annotator knows whether or not the document that
has been requested has any annotations. If so, it goes
off and fetches the specific annotations, merges them in
and returns the modified document to the web-browser.
Please note that in this process there is an initial
pause when the first document is fetched from a given host (to
download the annotated document list). As the user bounces
around the web site, response time is quick, since the
annotated document list has already been downloaded.
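The lookup steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function and class names are invented, the host name lists are passed in as already-sorted Python lists rather than fetched over the network, and the fetch callable stands in for whatever request an annotation set server would actually answer.

```python
import heapq
from bisect import bisect_left
from itertools import groupby
from urllib.parse import urlsplit


def merge_host_lists(host_lists):
    """Merge sorted per-set host lists into one sorted list.

    host_lists maps an annotation set name to its sorted host names.
    Returns (hosts, owners): owners[i] lists the sets that mention hosts[i].
    """
    tagged = [[(host, name) for host in hosts] for name, hosts in host_lists.items()]
    hosts, owners = [], []
    for host, group in groupby(heapq.merge(*tagged), key=lambda pair: pair[0]):
        hosts.append(host)
        owners.append([name for _, name in group])
    return hosts, owners


def candidate_sets(url, hosts, owners):
    """Binary-search the merged list for the host of the requested URL."""
    host = urlsplit(url).hostname or ""
    i = bisect_left(hosts, host)
    if i < len(hosts) and hosts[i] == host:
        return owners[i]
    return []


class DocumentListCache:
    """Fetch a host's annotated-document list once, then answer locally."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable: (set_name, host) -> set of URLs
        self._lists = {}

    def has_annotations(self, set_name, host, url):
        key = (set_name, host)
        if key not in self._lists:
            # The initial pause: first document requested from this host.
            self._lists[key] = self._fetch(set_name, host)
        return url in self._lists[key]
```

Only the first request to a host pays for the download; later requests to the same host are answered from the cached list.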
Further scaling issues:
Really popular annotation sets will get a lot of hits (just
like popular web sites). How do we deal with this? There are two
answers -- geographic distribution and load balancing:
Geographic distribution:
Geographic distribution is simple: just put up a mirrored
annotation set server at multiple geographic sites, for
example one in Europe, a couple in the US, one in Japan,
and one in Australia. When an annotator first visits an
annotation set server, it asks 'where are your geographic
mirroring sites?' The server returns the latitude and
longitude of each mirror. The annotator can compare its
own latitude and longitude with those of the mirrors and
pick the geographically closest one. The network routing
people do not like this answer, since geographic
proximity does not always mean network proximity.
Tough! Right now this is how people do geographic
distribution; they ask users to click on the link for the
mirror site that is closest to them.
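Choosing the closest mirror from a list of coordinates could be done as in the sketch below. The host names and coordinates are made up for illustration, and the great-circle (haversine) distance is one reasonable choice of "closest"; the text does not say which distance measure was intended.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_mirror(my_lat, my_lon, mirrors):
    """mirrors is a list of (host, latitude, longitude) as returned by the server."""
    return min(mirrors, key=lambda m: great_circle_km(my_lat, my_lon, m[1], m[2]))[0]


# Hypothetical mirror list; names and positions are placeholders.
MIRRORS = [
    ("eu.example.org", 48.9, 2.3),
    ("us-west.example.org", 37.4, -122.1),
    ("jp.example.org", 35.7, 139.7),
    ("au.example.org", -33.9, 151.2),
]
```

As the text notes, the geographically closest mirror is not always the closest one on the network, so this is a heuristic rather than a guarantee.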
Load balancing:
It may still be the case that a given server in a given
geographic location is getting pounded into oblivion. The
solution is to design annotation sets so that they can
load balance. What you do is mirror the annotation set
across N servers. The annotator goes to the geographic
site and asks 'how many mirrored sites do you have and
what are their names?' Then each annotator takes its
Internet address, computes a hash, divides it by N, and
uses the remainder to pick which annotation set server
to talk to.
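The hash-and-remainder rule can be sketched as follows. The text only says "computes a hash"; SHA-256 is used here as an assumption because it gives the same answer on every machine, and the function name and server names are invented for the example.

```python
import hashlib
import ipaddress
from typing import List


def pick_server(client_ip: str, servers: List[str]) -> str:
    """Map a client's Internet address to one of N mirrored servers."""
    # Normalise the address to its packed bytes (works for IPv4 and IPv6).
    packed = ipaddress.ip_address(client_ip).packed
    # Assumption: any stable hash would do; SHA-256 is just a convenient one.
    digest = hashlib.sha256(packed).digest()
    # The remainder after dividing by N selects the server.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Because the choice depends only on the client's address and N, a given annotator keeps talking to the same server, while different annotators spread out across all N of them. Note that changing N reassigns most clients.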
The final component to the scaling issue is 'what about
gigantic annotation sets?' An example of a gigantic annotation
set is one that attempts to keep track of all back links. This is
an annotation set that would basically span the entire web.
First, you need lots and lots and lots of hardware. This is
essentially what Digital's Alta Vista is trying to do. Second,
the strategy of downloading the host name list is basically a
waste of time; the solution is to not do it. Instead, you go to
the annotation set server each time you visit a host and fetch
the annotated documents list. Likewise, for sites that have huge
document sets, downloading the list of documents that are
annotated can be a waste of time. Again, the solution is to not
download it, but instead to query the annotation set server each
time you fetch a document. There is no magic here.
Terry Stanley is the lead programmer on the Annotator project
and can be reached at tstanley@best.com.