- It takes people and resources to keep web servers running, and the company or institution that maintained the web site simply went away, taking its web site with it.
- The page disappears because the webmaster no longer deems it worth keeping. This can happen when a web site is revamped: some pages simply disappear, while others have their URLs changed to fit the new site structure.
- Pages can be moved around as part of a minor reorganization effort.
- A content agreement mandates that the page be accessible only for a certain period of time.
- It only does best-effort crawling. In particular, it may take up to 6 months for the URL you submitted to show up.
- It is a web crawler, so it voluntarily observes robots.txt.
- The web pages it archives are often missing images and other constituent material.
- It has to rewrite hyperlinks in the web page.
- It won't work with JavaScript and XMLHttpRequest (AJAX).
Here I propose a service that takes a complete snapshot of a web page in the manner of a bookmark. This would be an immensely helpful research tool for historians and other people who need to keep a small artifact of a web page they're interested in.
Web browsers have long been able to cache web pages on disk, so static content doesn't have to be redownloaded every time a page needs to be rendered for display. The idea is to build a more sophisticated caching feature on top of a web browser.
- All HTTP connections are memoized by the snapshot, so JavaScript and even some Flash content will work.
- Support multiple snapshots of the same URL taken at different times, and be able to display them side by side for comparison. This is not something a proxy server can do.
- Snapshots are searchable.
- Schedule regular snapshots or prefetch web pages.
- Store snapshots on an external server. Pro: can consolidate snapshots of the same web page taken by two different people, if they are identical. Pro: can allow public access to snapshots. Pro: high availability because customers don't have to manage their own storage. Pro: high mobility because snapshots live in "the cloud." Pro: centralized search and indexing. Con: it will cost money to run the storage server infrastructure, so the service will not be free. Con: not likely to be supported by ad revenue.
- Allow public access to the snapshots. Pro: any web site can link to a snapshot to avoid the external dead link problem. Con: will run into copyright issues when the original web site owner decides he is losing profit because of the snapshot.
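
To make a few of these features concrete, here is a minimal sketch in Python of how memoized HTTP responses, multiple snapshots per URL, and consolidation of identical content might fit together. Everything in it is hypothetical: the names `SnapshotStore`, `record`, `replay`, and `changed` are mine, the list of requests a page makes is passed in by hand (a real browser would discover it while loading the page), and `fetch` is a stand-in for the browser's network layer. It illustrates the idea and is not a finished design.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the browser's network layer:
# maps (method, url) to the raw response body.
Fetcher = Callable[[str, str], bytes]
Request = Tuple[str, str]


@dataclass
class Snapshot:
    page_url: str
    taken_at: float
    # Each memoized request points at the hash of its recorded response.
    responses: Dict[Request, str] = field(default_factory=dict)


class SnapshotStore:
    def __init__(self) -> None:
        # Response bodies keyed by SHA-256, so identical content is stored once,
        # no matter how many snapshots (or users) recorded it.
        self.blobs: Dict[str, bytes] = {}
        # Page URL -> its snapshots, in the order they were taken.
        self.snapshots: Dict[str, List[Snapshot]] = {}

    def record(self, page_url: str, requests: List[Request],
               fetch: Fetcher, taken_at: float) -> Snapshot:
        """Fetch every request the page makes and memoize the responses."""
        snap = Snapshot(page_url, taken_at)
        for method, url in requests:
            body = fetch(method, url)
            digest = hashlib.sha256(body).hexdigest()
            self.blobs.setdefault(digest, body)
            snap.responses[(method, url)] = digest
        self.snapshots.setdefault(page_url, []).append(snap)
        return snap

    def replay(self, snap: Snapshot, method: str, url: str) -> bytes:
        """Answer a request from the snapshot instead of the network."""
        return self.blobs[snap.responses[(method, url)]]

    def changed(self, old: Snapshot, new: Snapshot) -> List[Request]:
        """List the requests whose responses differ between two snapshots."""
        keys = set(old.responses) | set(new.responses)
        return sorted(k for k in keys
                      if old.responses.get(k) != new.responses.get(k))
```

Taking two snapshots of the same page with `record` and passing them to `changed` gives the starting point for a side-by-side comparison, and because unchanged images and scripts hash to the same value, a second snapshot only costs storage for what actually changed.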