Life of a Computer Scientist: Super-Sized Del.icio.us, or Down-Sized Wayback Machine

The problem with URL bookmarks is that they can always become dead links. Here are some of the common reasons:

It takes people and resources to keep web servers running. When the company or institution that maintains a web site goes away, the web site goes with it.
The page disappears because the webmaster no longer deems the page worthy of keeping. It could happen when a web site is revamped, so some pages just disappear while others have their URL changed to fit the new site structure.
Pages can be moved around as part of a minor reorganization effort.
A content agreement mandates that the page be accessible only for a limited period of time.

Furthermore, even if the page still exists, it may have been edited. The information that you wanted to bookmark may no longer be there. The Internet Archive Wayback Machine solves part of this problem by crawling a given website repeatedly over time and letting you retrieve the version from an earlier date. However, it has its own problems:

It only does best-effort crawling. In particular, it may take up to 6 months for the URL you submitted to show up.
It is a web crawler, so it voluntarily observes robots.txt.
The web page it archives is often missing images and other constituent material.
It has to rewrite hyperlinks in the web page.
It doesn't work with pages that rely on JavaScript and XMLHttpRequest (AJAX).

Furthermore, the Wayback Machine often crawls redundantly and unpredictably. It may crawl other parts of the website that you're not interested in while omitting the part you actually want. It may crawl the URL when you don't need it, and not crawl it when you do.

Here I propose a service that takes a complete snapshot of a web page in the manner of a bookmark. This would be an immensely helpful research tool for historians and other people who need to keep a small artifact of a web page they're interested in.

Web browsers have long been able to cache web pages on disk, so static content doesn't have to be downloaded again every time the web page is rendered for display. The idea is to build a more sophisticated caching feature on top of a web browser, with the following features:

The snapshot memoizes all HTTP connections, so JavaScript and even some Flash content will work.
Support multiple snapshots of the same URL taken at different times, and display them side by side for comparison. This is not something a proxy server can do.
Snapshots are searchable.
Schedule regular snapshots or prefetch web pages.
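
To make the features above more concrete, here is a minimal sketch of how a snapshot might memoize HTTP responses and how several snapshots of the same URL could be kept and searched. The names Snapshot, SnapshotStore, record, replay, versions and search are hypothetical, invented for illustration; a real implementation would hook into the browser's network layer and would also need to store headers and status codes.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """All HTTP exchanges recorded while loading one page at one time."""
    page_url: str
    taken_at: float  # seconds since the epoch
    # Maps (method, url) to the recorded response body (bytes).
    responses: dict = field(default_factory=dict)

    def record(self, method, url, body):
        # Memoize the response so later requests never touch the network.
        self.responses[(method.upper(), url)] = body

    def replay(self, method, url):
        # Return the memoized body, or None if it was never recorded.
        return self.responses.get((method.upper(), url))


class SnapshotStore:
    """Keeps every snapshot of every page, oldest first."""

    def __init__(self):
        self._by_page = {}

    def add(self, snapshot):
        versions = self._by_page.setdefault(snapshot.page_url, [])
        versions.append(snapshot)
        versions.sort(key=lambda s: s.taken_at)

    def versions(self, page_url):
        # All snapshots of one URL, ready for side-by-side comparison.
        return list(self._by_page.get(page_url, []))

    def search(self, term):
        # Naive case-insensitive substring search over recorded bodies.
        needle = term.lower().encode()
        return [
            snap
            for versions in self._by_page.values()
            for snap in versions
            if any(needle in body.lower() for body in snap.responses.values())
        ]
```

Because replay answers from the recorded responses rather than from the network, scripts that issue XMLHttpRequest calls would see the same data they saw when the snapshot was taken, as long as they repeat the same requests.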

Possible advanced features:

Store snapshots on an external server.
  Pro: can consolidate snapshots of the same web page taken by two different people, if the snapshots are identical.
  Pro: can allow public access to snapshots.
  Pro: high availability, because customers don't have to manage their own storage.
  Pro: high mobility, because snapshots live in "the cloud."
  Pro: centralized search and indexing.
  Con: running the storage server infrastructure costs money, so the service will not be free.
  Con: unlikely to be supported by ad revenue.
Allow public access to the snapshots.
  Pro: any website can link to a snapshot to avoid the external dead link problem.
  Con: will run into copyright issues when the original web site's owner decides that the snapshot is costing them profit.
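
One way to consolidate identical snapshots taken by different people is content-addressed storage: key each stored blob by a hash of its bytes, so identical content is stored only once. The sketch below is an assumption about how this could work, not part of the proposal itself; DedupStore and its methods are hypothetical names.

```python
import hashlib


class DedupStore:
    """Stores each distinct snapshot body once, keyed by its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}  # digest -> snapshot bytes
        self._refs = {}   # (user, url) -> digest

    def put(self, user, url, content):
        digest = hashlib.sha256(content).hexdigest()
        # setdefault keeps the existing blob if this content was seen before.
        self._blobs.setdefault(digest, content)
        self._refs[(user, url)] = digest
        return digest

    def get(self, user, url):
        digest = self._refs.get((user, url))
        return None if digest is None else self._blobs[digest]

    def blob_count(self):
        # Number of distinct bodies actually stored.
        return len(self._blobs)
```

Two users who snapshot the same unchanged page get the same digest, so the server keeps one copy and two references. In practice, pages often embed timestamps or session tokens, so byte-identical snapshots may be rarer than one would hope.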

Technologies that can be leveraged:

Qt for a cross-platform user interface. It doesn't require a commercial license; the LGPL is fine.
WebKit (available under the LGPL) or Presto (commercial proprietary license) as the layout engine for rendering web pages.

Wednesday, June 17, 2009
