From Archivematica


Significant characteristics

See overview discussion at Significant characteristics of websites

Preservation format


Access format


Format registry information


Capture of WARC files

Heritrix, an open-source tool developed by the Internet Archive, can be used to crawl selected websites and store the harvested content as WARC files. Each crawl of the same website is saved as its own WARC file, with a timestamp in the filename to distinguish it from other crawls.
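Because each crawl is distinguished by the timestamp in its filename, a set of crawl files can be sorted and grouped by date without opening them. The following is a minimal sketch of that idea, not Heritrix's own naming logic: it assumes that each filename contains a 14-digit timestamp (YYYYMMDDhhmmss), and the filenames and the helper names `crawl_timestamp` and `group_by_year` are invented for illustration.

```python
import re
from collections import defaultdict

# Assumption: each crawl file name carries a 14-digit timestamp
# (YYYYMMDDhhmmss). The pattern requires exactly 14 consecutive digits.
TIMESTAMP = re.compile(r"(?<!\d)(\d{14})(?!\d)")


def crawl_timestamp(filename):
    """Return the 14-digit timestamp embedded in a file name, or None."""
    match = TIMESTAMP.search(filename)
    return match.group(1) if match else None


def group_by_year(filenames):
    """Map each crawl year to its file names, oldest crawl first."""
    crawls = defaultdict(list)
    for name in filenames:
        stamp = crawl_timestamp(name)
        if stamp is not None:  # skip files without a timestamp
            crawls[stamp[:4]].append((stamp, name))
    return {
        year: [name for _, name in sorted(items)]
        for year, items in sorted(crawls.items())
    }


# Hypothetical file names, invented for this example.
files = [
    "example-20110315120000-00000.warc.gz",
    "example-20100601093000-00000.warc.gz",
    "example-20110101080000-00000.warc.gz",
    "notes.txt",
]
by_year = group_by_year(files)
```

Here `by_year` maps "2010" to one file and "2011" to two files in chronological order, while `notes.txt` is ignored because it carries no timestamp.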

Heritrix users configure, run and monitor jobs from a web-based administrative console (see illustration below):


Rendering of WARC files

As a compressed, encapsulated object, a WARC file cannot be rendered in the same way as a live website. A tool called the Wayback Machine, an open-source Java implementation of the Internet Archive Wayback Machine, can be used to render WARC files so that a website is displayed with its appearance and functionality intact. This screenshot (from a virtual machine version of the Wayback Machine packaged by Artefactual) shows the Wayback Machine's standard search screen, which allows the user to select crawls by year:


This is an example of a web page rendered using the Wayback Machine; note the collapsible header indicating the crawl date, which can be used to navigate multiple harvests of the same website:


Indexing and searching WARC files

Websites captured as WARC files and rendered using the Wayback Machine lose their search functionality, which in the live website environment is typically handled by tools outside of the capture scope. For example, a test crawl conducted by Artefactual did not render with a functioning search box because the host organization's search engine is external to its website. Indexing and searching of WARC files is accomplished by adding and integrating external tools. The most commonly used tool is NutchWAX, short for Nutch (W)eb (A)rchive e(X)tensions. Nutch is an open-source Java tool which uses Lucene as its indexing and search component; the web archive extensions adapt its functionality to archived websites.

Other tools which could be adapted to index and search WARC files include Solr and Elasticsearch, either of which could be integrated with the Wayback Machine. Solr or Elasticsearch may prove more versatile than NutchWAX, which is designed to be implemented in large multi-server environments.
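To illustrate what an indexing and search component contributes, the following is a toy inverted index written with the Python standard library only. It is a sketch of the general technique and does not use or reproduce the NutchWAX, Lucene, Solr or Elasticsearch APIs. It assumes that page text has already been extracted from the archived content; the URLs, the page text and the function names `tokenize`, `build_index` and `search` are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # Intersect the URL sets of all query words (AND semantics).
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical text, assumed to be already extracted from archived pages.
pages = {
    "http://example.org/": "Welcome to the city archives",
    "http://example.org/hours": "Archives reading room hours",
    "http://example.org/contact": "Contact the city clerk",
}
index = build_index(pages)
results = search(index, "city archives")
```

In this example `results` contains only the home page, since it is the one page whose text includes both query words. A production tool adds relevance ranking, stemming and storage that scales beyond memory, which is where the tools named above come in.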

More information
