Improvements/warc

Revision as of 12:30, 9 September 2016

Synopsis

Improvements to Archivematica's handling of WARC files could go in a number of directions, most of which involve better extraction of technical and provenance metadata into Archivematica's METS file, which would improve the understanding and preservation of the WARC files over time.

User story

Status

Some code is in a development branch of Archivematica (https://github.com/artefactual/archivematica/tree/dev/issue-8634-warc-mets) that reads certain elements of the WARC header. This lays the groundwork for writing this descriptive information to the METS file. The code is based on an Archivematica branch that introduces external agents to the METS file, which in turn lays the groundwork for describing the software agent that created the WARC (e.g. Archive-It, wget, a Chrome extension).

Analysis

WARCat

We tested and evaluated the tool WARCat for verifying, validating and extracting content from WARC files: https://github.com/chfoo/warcat

Here's what we found about how WARCat verifies WARC files:

  • Iterates through a (possibly gzipped) WARC file
    • During iteration, uses the Content-Length and looks for delimiters (typically newlines) to verify that it is reading each block correctly.
    • archive-it.warc, chrome.warc and wget.warc all fail this iteration check
  • The verify command runs many checks, mostly on the record headers.
  • Checks that 'WARC-Record-ID', 'Content-Length', 'WARC-Date' and 'WARC-Type' are in the headers
    • If 'WARC-Block-Digest' is present, checks the block checksum
    • If 'WARC-Payload-Digest' is present, checks the payload checksum
    • Checks the record ID has not been seen before in this WARC file
    • Checks there is no whitespace in the record ID
    • Checks a record with 'Content-Length' also has 'Content-Type'
    • If 'WARC-Concurrent-To' or 'WARC-Refers-To' is present, checks 'WARC-Type' is not 'warcinfo', 'conversion' or 'continuation', and that the record ID referred to has been seen before
    • If 'WARC-Type' is 'response', 'resource', 'request', 'revisit', 'conversion' or 'continuation', checks 'WARC-Target-URI'
    • If 'WARC-Type' is 'warc_info', checks no 'WARC-Target-URI' *
    • If 'WARC-Target-URI' is present, checks it contains no whitespace
    • If 'WARC-Type' is 'warcinfo', checks no 'WARC-Filename'
    • If 'WARC-Type' is 'revisit', checks no 'WARC-Profile'
    • If 'WARC-Type' is 'continuation', checks 'WARC-Segment-Origin-ID' and 'WARC-Segment-Total-Length'
    • If 'WARC-Type' is not 'continuation', checks no 'WARC-Segment-Origin-ID' or 'WARC-Segment-Total-Length'

* We think there's a typo in this check: every other place refers to a 'WARC-Type' of 'warcinfo', and this is the only place that refers to 'warc_info'.
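To make a few of the header checks listed above concrete, here is a minimal Python sketch. It is not WARCat's code: the function name `check_record_headers`, the plain-dict representation of headers and the problem strings are all assumptions made for illustration, and only a subset of the checks is covered.

```python
REQUIRED_FIELDS = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")
TARGET_URI_TYPES = {
    "response", "resource", "request", "revisit", "conversion", "continuation",
}


def check_record_headers(headers, seen_ids):
    """Return a list of problem strings for one record's header dict.

    `headers` maps field names to values (hypothetical representation).
    `seen_ids` is a set of record IDs already seen in this file; it is
    updated in place.
    """
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in headers:
            problems.append("missing " + name)

    record_id = headers.get("WARC-Record-ID", "")
    if record_id:
        if any(ch.isspace() for ch in record_id):
            problems.append("whitespace in WARC-Record-ID")
        if record_id in seen_ids:
            problems.append("duplicate WARC-Record-ID")
        seen_ids.add(record_id)

    warc_type = headers.get("WARC-Type")
    if warc_type in TARGET_URI_TYPES and "WARC-Target-URI" not in headers:
        problems.append("missing WARC-Target-URI")
    # This sketch compares against 'warcinfo' rather than 'warc_info'.
    if warc_type == "warcinfo" and "WARC-Target-URI" in headers:
        problems.append("WARC-Target-URI on warcinfo record")
    return problems
```

Passing the same `seen_ids` set for every record in a file is what lets the duplicate-ID check work across records.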


The code for the checks is found here: https://github.com/chfoo/warcat/blob/master/warcat/tool.py#L262-L406

and the checksum verification is here: https://github.com/chfoo/warcat/blob/master/warcat/verify.py#L38-L67
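As an illustration of what a block-digest check involves, the sketch below recomputes a digest over the raw block bytes and compares it with a labelled value of the form `algorithm:value`. This is not the code at the link above; the function name `block_digest_matches` is hypothetical, and accepting both Base32 and hexadecimal encodings is an assumption, since writers differ in which they emit.

```python
import base64
import hashlib


def block_digest_matches(block, digest_field):
    """Compare raw block bytes with a labelled digest such as 'sha1:<value>'."""
    algorithm, _, expected = digest_field.partition(":")
    # hashlib.new raises ValueError for an algorithm name it does not know.
    hasher = hashlib.new(algorithm.strip().lower())
    hasher.update(block)
    raw = hasher.digest()
    expected = expected.strip()
    as_base32 = base64.b32encode(raw).decode("ascii")
    as_hex = raw.hex()
    return expected.upper() == as_base32 or expected.lower() == as_hex
```

The same function could be applied to the payload bytes for 'WARC-Payload-Digest', provided the caller has already separated the payload from any protocol headers in the block.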

Iterating through the records is done here: https://github.com/chfoo/warcat/blob/master/warcat/model/warc.py#L62-L89
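The iteration check described earlier (read Content-Length bytes, then expect the record delimiter) can be sketched in a few lines of Python. This is an illustrative reader, not WARCat's implementation: `iter_records` is a hypothetical name, it assumes an uncompressed binary stream, and it does not handle folded (multi-line) header values.

```python
import io


def iter_records(stream):
    """Yield (headers, block) pairs from an uncompressed binary WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of file
        if not version.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        length = int(headers["Content-Length"])
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("block shorter than Content-Length")
        # Each record's block should be followed by two CRLF pairs.
        if stream.read(4) != b"\r\n\r\n":
            raise ValueError("record delimiter not found after block")
        yield headers, block
```

A file that raises `ValueError` here is failing the same kind of check: the bytes after the declared block length are not the expected delimiter. For gzipped input, the file could be opened with `gzip.open(path, "rb")` and the resulting stream passed in.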

Changes to METS

Below is a mock-up of an AIP METS file with enhancements for recording WARC files.


<?xml version='1.0' encoding='ASCII'?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
  <mets:metsHdr CREATEDATE="2015-11-27T17:17:29"/>
  <mets:dmdSec ID="dmdSec_1">
    <mets:mdWrap MDTYPE="DC">
      <mets:xmlData>
        <dcterms:dublincore xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd">
          <dc:identifier>urn:uuid:33bf5acf-4584-446f-b187-ce4f6ad79af9</dc:identifier>
          <!-- source = WARC-Record-ID -->
          <dc:isPartOf>Sooke Artworks Exhibit 2014</dc:isPartOf>
          <!-- source = collectionName (Archive-It only) -->
          <dc:isPartOf>4867-20141008190114161</dc:isPartOf>
          <!-- source = isPartOf (Archive-It and Chrome only) -->
          <dc:rights>collectionPublic=false</dc:rights>
          <!-- source = description (Archive-It only) -->
        </dcterms:dublincore>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:amdSec ID="amdSec_1">
    <mets:techMD ID="techMD_1">
      <mets:mdWrap MDTYPE="PREMIS:OBJECT">
        <mets:xmlData>
          <premis:object xmlns:premis="info:lc/xmlns/premis-v2" xsi:type="premis:file" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:objectIdentifier>
              <premis:objectIdentifierType>UUID</premis:objectIdentifierType>
              <premis:objectIdentifierValue>9a6db35f-b444-4295-a1b9-c0c94665c778</premis:objectIdentifierValue>
            </premis:objectIdentifier>
            <premis:objectCharacteristics>
              <premis:compositionLevel>0</premis:compositionLevel>
              <premis:fixity>
                <premis:messageDigestAlgorithm>sha256</premis:messageDigestAlgorithm>
                <premis:messageDigest>b8ed228653bbe2fc73f5a4711daaab3b427bc57920fc00778b9b96da35d5cbd9</premis:messageDigest>
              </premis:fixity>
              <premis:size>77038680</premis:size>
              <premis:format>
                <premis:formatDesignation>
                  <premis:formatName>WARC (Web ARChive)</premis:formatName>
                  <premis:formatVersion>ISO 28500</premis:formatVersion>
                </premis:formatDesignation>
                <premis:formatRegistry>
                  <premis:formatRegistryName>PRONOM</premis:formatRegistryName>
                  <premis:formatRegistryKey>fmt/289</premis:formatRegistryKey>
                </premis:formatRegistry>
              </premis:format>
              <premis:objectCharacteristicsExtension>
              <!-- tool output -->
              </premis:objectCharacteristicsExtension>
            </premis:objectCharacteristics>
            <premis:originalName>%transferDirectory%objects/ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc</premis:originalName>
          </premis:object>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
    <mets:digiprovMD ID="digiprovMD_1">
      <mets:mdWrap MDTYPE="PREMIS:EVENT">
        <mets:xmlData>
          <premis:event xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:eventIdentifier>
              <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
              <premis:eventIdentifierValue>670799cf-5ca0-4869-b0ba-7d1d951e3857</premis:eventIdentifierValue>
            </premis:eventIdentifier>
            <premis:eventType>creation</premis:eventType>
            <premis:eventDateTime>2015-11-27T17:14:59</premis:eventDateTime>
            <premis:eventDetail>software: Heritrix/3.3.0-SNAPSHOT-20140912-0039 http://crawler.archive.org
ip: 207.241.226.89
hostname: wbgrp-crawl052.us.archive.org
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: 4867-20141008190114161
description: recurrence=NONE, maxDuration=3600, maxDocumentCount=null, isTestCrawl=false, isPatchCrawl=false, oneTimeSubtype=CRAWL_SELECTED_SEEDS, seedCount=1, accountId=739, accountType=SUBSCRIBER, organizationName="Not a Real Institution", collectionId=4867, collectionName="Sooke Artworks Exhibit 2014", collectionPublic=false
robots: obey
http-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot; Archive-It; +http://archive-it.org/files/site-owners.html)
            </premis:eventDetail>
            <!-- source = text block starting with software -->
            <premis:eventOutcomeInformation>
              <premis:eventOutcome></premis:eventOutcome>
              <premis:eventOutcomeDetail>
                <premis:eventOutcomeDetailNote></premis:eventOutcomeDetailNote>
              </premis:eventOutcomeDetail>
            </premis:eventOutcomeInformation>
            <premis:linkingAgentIdentifier>
              <premis:linkingAgentIdentifierType>URI</premis:linkingAgentIdentifierType>
              <premis:linkingAgentIdentifierValue>http://crawler.archive.org</premis:linkingAgentIdentifierValue>
              <!-- source = software -->
            </premis:linkingAgentIdentifier>
          </premis:event>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="digiprovMD_7">
      <mets:mdWrap MDTYPE="PREMIS:AGENT">
        <mets:xmlData>
          <premis:agent xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:agentIdentifier>
              <premis:agentIdentifierType>URI</premis:agentIdentifierType>
              <premis:agentIdentifierValue>http://crawler.archive.org</premis:agentIdentifierValue>
              <!-- source = software -->
            </premis:agentIdentifier>
            <premis:agentName>Heritrix/3.3.0-SNAPSHOT-20140912-0039</premis:agentName>
            <!-- source = software -->
            <premis:agentType>software</premis:agentType>
          </premis:agent>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file GROUPID="Group-9a6db35f-b444-4295-a1b9-c0c94665c778" ID="file-9a6db35f-b444-4295-a1b9-c0c94665c778" ADMID="amdSec_1" DMDID="dmdSec_1">
        <mets:FLocat xlink:href="objects/ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical">
    <mets:div LABEL="WARC_file-b681af4b-8e17-479e-8a1f-0e9443415d5e" TYPE="Directory">
      <mets:div LABEL="objects" TYPE="Directory">
        <mets:div LABEL="ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc" TYPE="Item">
          <mets:fptr FILEID="file-9a6db35f-b444-4295-a1b9-c0c94665c778"/>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
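
The mock-up derives the PREMIS agent from the 'software' line of the warcinfo text block. A minimal Python sketch of that mapping follows. The function names (`parse_warcinfo_fields`, `split_software`, `build_agent`) are hypothetical and this is not the code in the development branch; it assumes the block is a series of 'name: value' lines and that the software value has the form 'Name/version URL', as in the example above.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xmlns/premis-v2"


def parse_warcinfo_fields(text):
    """Parse 'name: value' lines from a warcinfo block into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def split_software(software):
    """Split 'Name/version URL' into (agent name, agent identifier)."""
    name, _, uri = software.partition(" ")
    return name, uri.strip()


def build_agent(name, uri):
    """Build a premis:agent element like the one in the mock-up."""
    def tag(local):
        return "{%s}%s" % (PREMIS_NS, local)

    agent = ET.Element(tag("agent"))
    identifier = ET.SubElement(agent, tag("agentIdentifier"))
    ET.SubElement(identifier, tag("agentIdentifierType")).text = "URI"
    ET.SubElement(identifier, tag("agentIdentifierValue")).text = uri
    ET.SubElement(agent, tag("agentName")).text = name
    ET.SubElement(agent, tag("agentType")).text = "software"
    return agent
```

For the Archive-It example, `split_software(fields["software"])` would give the Heritrix name/version string as the agent name and the crawler URL as the agent identifier. WARC files from other tools may format the software line differently, so the split would need checking per source.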