Difference between revisions of "Significant characteristics of websites"
Revision as of 17:38, 13 February 2013
As there is no formal default policy for websites in Archivematica, below is a summary of research done for clients that may inform future policy generation.
Overview
The goal of website archiving is to capture, preserve and render complete websites. An end user should be able to navigate the preserved website in the same way that the original website was navigated and, as far as possible, should see the same content and experience the same functionality. Website preservation involves a number of steps, each requiring its own tools and procedures: capturing ("crawling" or "harvesting") a website, storing it in an archival format, applying preservation planning over time, rendering it, indexing it and providing keyword search capabilities for all of the archived content.
A number of institutions have undertaken website archiving on a large scale. Probably the best-known of these is the Internet Archive, founded by Brewster Kahle in 1996. The Internet Archive gathers and makes available a vast number of websites at no charge; it also offers a third-party web archiving service called Archive-It, available for a fee. The Library of Congress has been preserving websites since 2000, acquiring government and private websites based on selected themes, events and subject areas. California Digital Library collects a wide variety of websites and makes them available online; like the Internet Archive, it offers a third-party web archiving service for a fee. Numerous national archives and libraries also have web archiving projects, including the British Library (via the UK Web Archive), Library and Archives Canada, the National Library of New Zealand and the National Archives of Australia. (Lists of major website archiving initiatives are maintained at http://netpreserve.org/about/archiveList.php and http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives.) International efforts to develop web archiving tools, standards and practices are managed by the International Internet Preservation Consortium (IIPC), established in 2003 by the Library of Congress, the Internet Archive and the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden and the UK. (A complete list of current members is at http://netpreserve.org/about/memberList.php.)
The tools and processes these large institutions use are scalable downward to smaller institutions. For example, the most popular website crawler, Heritrix, was developed by the Internet Archive and has been released as an open-source tool which can be used by any organization, large or small. Similarly, the WARC archiving format, described in detail below, was developed by a consortium of large institutions, has been approved by the International Organization for Standardization (ISO), and can now be used by smaller institutions as their preservation format for harvested websites. A pragmatic approach is to follow in the footsteps of the large institutions that have invested a great deal of time and resources in developing web archiving standards, procedures and tools, since the problems experienced by smaller institutions, now or in the future, are likely to be the same problems these large institutions are already tackling.
WARC for website preservation
WARC (Web ARChive file format) is an extension of the ARC (Internet Archive ARC_IA) format, which was developed by the Internet Archive in the mid-1990s. The ARC format stores simple content block sequences (representing objects such as HTML files and images) with additional text headers in a self-contained file. The ARC format captures responses to HTTP requests only; WARC extends this by capturing content types such as assigned metadata and duplicate detection events (to reduce storage of identical resources). WARC also provides expanded metadata support in the text header. WARC was accepted as an international standard in 2009 (ISO 28500:2009) and is the Library of Congress' preferred format for harvested websites.[1] Although ARC is still widely used, it is being replaced by WARC at a number of leading institutions, including the Internet Archive.
A WARC file stores multiple archived resources in a single file in order to avoid managing a large number of digital objects in numerous directories. A WARC file consists of any number of WARC records, which are single blocks of content, each with its own text header. These text headers can be extracted from the WARC file and stored separately for efficient indexing.
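The record structure described above can be illustrated with a minimal Python sketch that walks an uncompressed WARC stream and collects each record's text header for indexing. This is a simplified illustration under stated assumptions, not a full implementation of the standard: it assumes one "Name: value" field per header line (no folded lines) and no compression, and the names `iter_warc_records` and `make_record` are hypothetical helpers invented for this example.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, content) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)
        headers = {"version": version}
        # Read "Name: value" header lines until the blank line that ends the header.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length octets long.
        content = stream.read(int(headers["Content-Length"]))
        yield headers, content


def make_record(uri, content):
    """Build a tiny illustrative record (hypothetical helper, minimal fields only)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        "WARC-Target-URI: " + uri + "\r\n"
        "Content-Length: " + str(len(content)) + "\r\n"
        "\r\n"
    ).encode("utf-8")
    return header + content + b"\r\n\r\n"


# Two records in one file; the header index is built without keeping the content.
data = make_record("http://example.org/a.jpg", b"\xff\xd8 fake\r\njpeg") + \
       make_record("http://example.org/b.html", b"<html></html>")
header_index = [headers for headers, _ in iter_warc_records(io.BytesIO(data))]
```

Because the content block is read by length rather than by scanning for a delimiter, binary content that happens to contain line breaks does not confuse the parser. In practice, WARC files are often gzip-compressed, which this sketch does not handle.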
In this diagram, the WARC file represented on the left contains any number of WARC records; each WARC record contains the content, in this case a JPEG image, and the text header. The sample text header shows the type of resource, the URI of the image, the date of capture, a unique identifier for the WARC record, the content MIME type, checksums for the WARC record and JPEG image, and the content length in number of octets (bytes). (Sample text header content taken from WARC file format version 0.18, 2008-06-06. ISO/DIS 28500, bibnum.bnf.fr/WARC/WARC_ISO_DIS_28500.pdf.)
[[File:WARCdiagram.png|600px|thumb|center|]]
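As a rough illustration of the header fields listed above, the following Python sketch assembles a "resource" record for an image. The field names follow the WARC standard's named fields; the "sha1:" plus Base32 digest notation is a common convention assumed here rather than a requirement, and the function names, sample URI and sample bytes are hypothetical.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def sha1_base32(data):
    """Return a digest label in the assumed 'sha1:<Base32>' notation."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def build_resource_record(uri, content, mime_type, when=None):
    """Assemble one WARC 'resource' record (text header plus content block)."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # In a resource record the whole block is the resource itself,
    # so the block digest and payload digest are the same value here.
    digest = sha1_base32(content)
    fields = [
        ("WARC-Type", "resource"),                          # type of resource
        ("WARC-Target-URI", uri),                           # URI of the image
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),  # date of capture
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),  # unique identifier
        ("Content-Type", mime_type),                        # content MIME type
        ("WARC-Block-Digest", digest),                      # checksum of the block
        ("WARC-Payload-Digest", digest),                    # checksum of the image
        ("Content-Length", str(len(content))),              # length in octets
    ]
    header = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % f for f in fields) + "\r\n"
    return header.encode("utf-8") + content + b"\r\n\r\n"


# Placeholder bytes standing in for a real JPEG file.
sample = build_resource_record(
    "http://example.org/logo.jpg",
    b"\xff\xd8\xff\xe0fake",
    "image/jpeg",
    datetime(2013, 2, 13, 17, 38, tzinfo=timezone.utc),
)
```

The content length is computed from the bytes actually written, which is what lets a reader skip from one record to the next without inspecting the content.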