Significant characteristics of websites
As Archivematica has no formal default format policy for websites, the following summarizes research done for clients that may inform future policy development.
Overview
The goal of website archiving is to capture, preserve and render complete websites. An end user should be able to navigate the preserved website in the same way the original website was navigated, and as much as possible should see the same content and experience the same functionality. Website preservation involves a number of steps, each requiring its own tools and procedures: capturing ("crawling" or "harvesting") a website, storing it in an archival format, applying preservation planning over time, rendering it, indexing it and providing keyword search across all of the archived content.
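The capture step is, at its core, a traversal of a site's link graph: fetch a page, extract its links, and queue them for fetching in turn. As a rough sketch of that link-extraction core (not the implementation of any particular crawler; the class name is illustrative), using only Python's standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href> attributes in one fetched page.

    A real harvester such as Heritrix layers politeness rules, scoping,
    and deduplication on top of this basic step.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


# Example: extract the links a crawler would queue next.
page = '<a href="/about">About</a> <a href="http://other.example/">Other</a>'
extractor = LinkExtractor("http://example.com/")
extractor.feed(page)
```

After `feed()` runs, `extractor.links` holds the absolute URLs found on the page; a crawler would filter these against its scope rules before fetching them.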
A number of institutions have undertaken website archiving on a large scale. Probably the best-known of these is the Internet Archive, founded by Brewster Kahle in 1996. The Internet Archive gathers and makes available a vast number of websites at no charge; it also offers a third-party web archiving service called Archive-It, available for a fee. The Library of Congress has been preserving websites since 2000, acquiring government and private websites based on selected themes, events and subject areas. California Digital Library collects a wide variety of websites and makes them available online; like the Internet Archive, it offers a third-party web archiving service for a fee. Numerous national archives and libraries also have web archiving projects, including the British Library (via the UK Web Archive), Library and Archives Canada, the National Library of New Zealand and the National Archives of Australia. (Lists of major website archiving initiatives are maintained at http://netpreserve.org/about/archiveList.php and http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives.) International efforts to develop web archiving tools, standards and practices are managed by the International Internet Preservation Consortium (IIPC), established in 2003 by the Library of Congress, the Internet Archive and the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden and the UK. (A complete list of current members is at http://netpreserve.org/about/memberList.php.)
The tools and processes these large institutions use scale down to smaller institutions. For example, the most popular website crawler, Heritrix, was developed by the Internet Archive and has been released as an open-source tool that any organization, large or small, can use. Similarly, the WARC archiving format, described in detail below, was developed by a consortium of large institutions, has been approved by the International Organization for Standardization (ISO), and can now be used by smaller institutions as their preservation format for harvested websites. A pragmatic approach is to follow in the footsteps of the large institutions that have invested a great deal of time and resources in developing web archiving standards, procedures and tools, since any problems smaller institutions encounter, now or in the future, are the same problems these large institutions are already tackling.
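A WARC file is a concatenation of records, each consisting of a plain-text header block (version line plus named fields), a blank line, the captured payload, and a trailing blank line. As a minimal sketch of that record layout (the function name is illustrative, and real harvesters also write `warcinfo`, `request` and `metadata` records, which are omitted here):

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, http_payload):
    """Build a single WARC/1.0 'response' record as bytes.

    A simplified illustration of the record layout standardized in
    ISO 28500, not a full implementation of the specification.
    """
    body = http_payload.encode("utf-8")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(body)}",
    ]
    # Header block, blank line, payload, then a blank line ends the record.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + body + b"\r\n\r\n"


# Example: wrap a captured HTTP response in a WARC response record.
payload = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
record = build_warc_record("http://example.com/", payload)
```

Because records are simply appended one after another, an entire crawl (thousands of resources) can be stored, copied and checksummed as a small number of large WARC files, which is what makes the format attractive for preservation.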