Significant characteristics of websites
Archivematica has no formal default policy for websites. The summary below draws on research done for clients and may inform future policy development.
The goal of website archiving is to capture, preserve and render complete websites. An end user should be able to navigate the preserved website in the same way as the original and, as far as possible, should see the same content and experience the same functionality. Website preservation involves a number of steps, each requiring its own tools and procedures: capturing ("crawling" or "harvesting") a website, storing it in an archival format, applying preservation planning over time, rendering it, indexing it and providing keyword search capabilities for all of the archived content.
A number of institutions have undertaken website archiving on a large scale. The best known of these is probably the Internet Archive, founded by Brewster Kahle in 1996. The Internet Archive gathers and makes available a vast number of websites at no charge; it also offers a third-party web archiving service called Archive-It, available for a fee. The Library of Congress has been preserving websites since 2000, acquiring government and private websites based on selected themes, events and subject areas. California Digital Library collects a wide variety of websites and makes them available online; like the Internet Archive, it offers a third-party web archiving service for a fee. Numerous national archives and libraries also have web archiving projects, including the British Library (via the UK Web Archive), Library and Archives Canada, the National Library of New Zealand and the National Archives of Australia. (Lists of major website archiving initiatives are maintained at http://netpreserve.org/about/archiveList.php and http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives.) International efforts to develop web archiving tools, standards and practices are managed by the International Internet Preservation Consortium (IIPC), established in 2003 by the Library of Congress, the Internet Archive and the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden and the UK. (A complete list of current members is at http://netpreserve.org/about/memberList.php.)
The tools and processes these large institutions use scale down to smaller institutions. For example, the most popular website crawler, Heritrix, was developed by the Internet Archive and has been released as an open-source tool which can be used by any organization, large or small. Similarly, the WARC format, described in detail below, was developed by a consortium of large institutions, has been approved by the International Organization for Standardization (ISO), and can now be used by smaller institutions as their preservation format for harvested websites. A pragmatic approach is to follow in the footsteps of the large institutions which have invested a great deal of time and resources into developing web archiving standards, procedures and tools, since the problems experienced by smaller institutions, now or in the future, are likely to be the same problems these large institutions are already tackling.
WARC for website preservation
WARC (Web ARChive file format) is an extension of the ARC (Internet Archive ARC_IA) format, which was developed by the Internet Archive in the mid-1990s. The ARC format stores simple content block sequences (representing objects such as HTML files and images) with additional text headers in a self-contained file. The ARC format captures responses to HTTP requests only; WARC extends this by capturing content types such as assigned metadata and duplicate detection events (to reduce storage of identical resources). WARC also provides expanded metadata support in the text header. WARC was accepted as an international standard in 2009 (ISO 28500:2009) and is the Library of Congress's preferred format for harvested websites.1 Although ARC is still widely used, it is being replaced by WARC at a number of leading institutions, including the Internet Archive.
A WARC file stores multiple archived resources in a single file in order to avoid managing a large number of digital objects in numerous directories. A WARC file consists of any number of WARC records, which are single blocks of content, each with its own text header. These text headers can be extracted from the WARC file and stored separately for efficient indexing.
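As an illustration of the record structure described above, the following Python sketch walks through a WARC byte stream and collects each record's text header without reading the content blocks. It is a minimal sketch under stated assumptions, not a reference parser: it assumes an uncompressed, seekable file in which every record carries a Content-Length header, and it does not handle gzip-compressed files, which are common for harvested content. The names iter_warc_headers and make_record are hypothetical helpers introduced here for the example.

```python
import io


def iter_warc_headers(stream):
    """Yield (headers, content_offset) for each record in an uncompressed WARC stream.

    Assumes a seekable binary stream and a Content-Length header on every record.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the text header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        offset = stream.tell()  # where the content block starts
        stream.seek(int(headers["Content-Length"]), io.SEEK_CUR)  # skip the content
        yield headers, offset


def make_record(uri, payload):
    """Build a tiny illustrative record (hypothetical helper, test data only)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


sample = io.BytesIO(
    make_record("http://example.org/a.jpg", b"\xff\xd8fake-jpeg-bytes")
    + make_record("http://example.org/index.html", b"<html></html>")
)
index = [headers for headers, _ in iter_warc_headers(sample)]
for headers in index:
    print(headers["WARC-Target-URI"], headers["Content-Length"])
```

Because only the headers are decoded and the content blocks are skipped by offset, the resulting list can be stored separately and used as a lightweight index into the file.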
In this diagram, the WARC file represented on the left can contain any number of WARC records. Each WARC record contains the content, in this case a JPEG image, and a text header. The sample text header shows the type of resource, the URI of the image, the date of capture, a unique identifier for the WARC record, the content MIME type, checksums for the WARC record and the JPEG image, and the content length in octets (bytes). (Sample text header content taken from WARC file format version 0.18, 2008-06-06. ISO/DIS 28500, bibnum.bnf.fr/WARC/WARC_ISO_DIS_28500.pdf.)
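To make the header fields listed above more concrete, here is a hedged Python sketch that assembles a text header for a "resource" record. The field names follow ISO 28500; the "WARC/1.0" version line and the "sha1:" label with a Base32-encoded digest follow the style of the examples in the standard rather than a requirement stated in this document. The functions sha1_base32 and build_resource_header are hypothetical names introduced for the example, and the sketch assumes that for a resource record the payload and the record block are the same bytes, so both digests coincide.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def sha1_base32(data):
    """Return a digest string in the 'sha1:<Base32>' style (an assumed convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def build_resource_header(uri, mime_type, payload, captured_at=None):
    """Assemble the text header for one 'resource' record (illustrative only)."""
    captured_at = (captured_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    digest = sha1_base32(payload)
    fields = [
        ("WARC-Type", "resource"),                            # type of resource
        ("WARC-Target-URI", uri),                             # URI of the image
        ("WARC-Date", captured_at.strftime("%Y-%m-%dT%H:%M:%SZ")),  # date of capture
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),     # unique record identifier
        ("Content-Type", mime_type),                          # content MIME type
        ("WARC-Payload-Digest", digest),                      # checksum of the image
        ("WARC-Block-Digest", digest),                        # checksum of the record block
        ("Content-Length", str(len(payload))),                # length in octets
    ]
    lines = ["WARC/1.0"] + [f"{name}: {value}" for name, value in fields]
    return "\r\n".join(lines) + "\r\n\r\n"


header = build_resource_header(
    "http://example.org/logo.jpg",
    "image/jpeg",
    b"abc",  # stand-in bytes, not a real JPEG
    datetime(2006, 9, 19, 17, 20, 24, tzinfo=timezone.utc),
)
print(header)
```

The record identifier is random on each run, so only the other fields are reproducible for a given payload and capture date.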
A WARC file can contain many types of content which may require migrating to preservation-friendly formats in the future. A Heritrix crawl result includes a report called mimetype-report.txt that provides a summary list of formats in the crawl. These summary reports can be used to monitor ingested formats at a high level in order to help inform decisions on at-risk formats and preservation planning. Log files included in the crawl results provide the URIs for each captured file. A preservation workflow for WARC files could consist of extracting content blocks (that is, files such as images and PDF files) from the WARC file, normalizing selected content based on a format risk assessment, and generating a new WARC file containing the normalized content. (See Migrating content in WARC files, Stephan Strodl, Peter Paul Beran and Andreas Rauber, The 9th International Web Archiving Workshop proceedings, 2009, available at http://iwaw.europarchive.org/09/index.html.)
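The format monitoring described above could be sketched in Python as follows. This is an illustration under assumptions, not a description of any Archivematica feature: it assumes the report consists of a bracketed header line followed by whitespace-separated rows of URL count, byte count and MIME type, which should be checked against the output of the Heritrix version in use. The AT_RISK set is a hypothetical watch list standing in for the result of a real format risk assessment, and the sample figures are invented.

```python
def parse_mimetype_report(text):
    """Parse rows of '<urls> <bytes> <mime-type>' into (mime, urls, bytes) tuples.

    The column layout is an assumption about mimetype-report.txt.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the bracketed header line
        parts = line.split(None, 2)
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # ignore rows that do not match the assumed layout
        urls, size, mime = parts
        rows.append((mime, int(urls), int(size)))
    return rows


# Hypothetical watch list; a real one would come from a format risk assessment.
AT_RISK = {"application/x-shockwave-flash", "application/x-director"}


def flag_at_risk(rows, at_risk=AT_RISK):
    """Return only the rows whose MIME type is on the watch list."""
    return [row for row in rows if row[0] in at_risk]


# Invented sample data in the assumed layout.
sample_report = """\
[#urls] [#bytes] [mime-types]
120 2400000 text/html
45 9100000 image/jpeg
3 780000 application/x-shockwave-flash
"""

rows = parse_mimetype_report(sample_report)
flagged = flag_at_risk(rows)
for mime, urls, size in flagged:
    print(f"{mime}: {urls} URLs, {size} bytes")
```

Running such a check on each crawl gives a simple, high-level view of whether any ingested formats appear on the watch list, which can then feed into decisions about normalization.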
However, if crawled web content consists only of standard, widely used formats, preservation actions need not be taken immediately. In fact, taking a wait-and-see approach to preservation planning for WARC files would be highly practical for a small- to medium-sized institution, since a large number of well-established institutions (such as the Library of Congress, the Internet Archive and California Digital Library) use WARC and will likely start to put resources into developing preservation strategies and tools when standard web formats become at risk of obsolescence.