Improvements/warc



Synopsis

Archivematica's handling of WARC files could be improved in several directions. Most of them involve extracting more technical and provenance metadata into Archivematica's METS file, which would improve the understanding and preservation of the WARC files over time.

User story

As an archivist, I want the software agent that created the WARC file to be recorded as an external agent in the METS file.

As an archivist, I want a creation event for the WARC file to be recorded in the METS file.

As an archivist, I want to record relevant Dublin Core metadata (identifier, isPartOf, rights) in the dmdSec of the METS file.

As an archivist, I want a validation rule in the FPR that uses Warcat, so that I get more meaningful validation output.

Status

Some code in a development branch of Archivematica (https://github.com/artefactual/archivematica/tree/dev/issue-8634-warc-mets) reads certain elements of the WARC header. This lays the groundwork for writing that descriptive information to the METS file. The code is based on an Archivematica branch that introduces external agents to the METS file, which in turn lays the groundwork for describing the software agent that created the WARC (e.g. Archive-It, wget, a Chrome extension).
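
To illustrate the kind of header reading described above: the body of a warcinfo record is a series of "name: value" lines, which can be split into a dictionary. This is a minimal sketch and not the code in the development branch; the function name parse_warcinfo_fields is hypothetical, and the sample values are copied from the METS mock-up further down this page.

```python
def parse_warcinfo_fields(block_text):
    """Parse 'name: value' lines from a warcinfo record body into a dict."""
    fields = {}
    for line in block_text.splitlines():
        # Split on the first colon only, so values containing URLs stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        fields[name.strip()] = value.strip()
    return fields


# Sample warcinfo body, using values from the mock-up on this page.
sample = (
    "software: Heritrix/3.3.0-SNAPSHOT-20140912-0039 http://crawler.archive.org\r\n"
    "format: WARC File Format 1.0\r\n"
    "isPartOf: 4867-20141008190114161\r\n"
    "robots: obey\r\n"
)
fields = parse_warcinfo_fields(sample)
print(fields["software"])
```

A dictionary like this would be a convenient intermediate form for populating the dmdSec, the creation event and the agent in the METS file.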

Analysis

WARCat

We tested and evaluated the tool WARCat for verifying, validating and extracting content from WARC files: https://github.com/chfoo/warcat

Here's what we found about how warcat verifies WARC files:

  • Iterates through a (possibly gzipped) WARC file
    • During iteration, uses the Content-Length value and looks for delimiters (typically newlines) to verify that it is reading each block correctly.
    • archive-it.warc, chrome.warc and wget.warc all fail this iteration check
  • The verify command checks many things, mostly related to the various headers.
  • Checks 'WARC-Record-ID', 'Content-Length', 'WARC-Date', 'WARC-Type' in headers
    • If 'WARC-Block-Digest', checks block checksum
    • If 'WARC-Payload-Digest', checks payload checksum
    • Checks record ID has not been seen before in this WARC file
    • Checks no whitespace in record ID
    • Checks 'Content-Length' also has 'Content-Type'
    • If 'WARC-Concurrent-To' or 'WARC-Refers-To', checks 'WARC-Type' not 'warcinfo', 'conversion' or 'continuation' and that concurrent/refers to record ID has been seen before
    • If 'WARC-Type' is 'response', 'resource', 'request', 'revisit', 'conversion' or 'continuation', checks 'WARC-Target-URI'
    • If 'WARC-Type' is 'warc_info', checks no 'WARC-Target-URI' *
    • If 'WARC-Target-URI' checks no whitespace
    • If 'WARC-Type' is 'warcinfo', checks no 'WARC-Filename'
    • If 'WARC-Type' is 'revisit', checks no 'WARC-Profile'
    • If 'WARC-Type' is 'continuation', checks 'WARC-Segment-Origin-ID' and 'WARC-Segment-Total-Length'
    • If 'WARC-Type' is not 'continuation', checks no 'WARC-Segment-Origin-ID' or 'WARC-Segment-Total-Length'

* We think there's a typo in the 'warc_info' check: other places refer to a 'WARC-Type' of 'warcinfo', and this is the only place that refers to 'warc_info'.
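
A few of the header rules listed above can be sketched in plain Python. This is an illustration written for this page and not WARCat's code: the function name check_record_headers, the constant names and the problem messages are all hypothetical, only a subset of the rules is covered, and the sketch uses the 'warcinfo' spelling.

```python
REQUIRED = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")
NEEDS_TARGET_URI = {
    "response", "resource", "request", "revisit", "conversion", "continuation",
}
SEGMENT_FIELDS = ("WARC-Segment-Origin-ID", "WARC-Segment-Total-Length")


def check_record_headers(headers, seen_ids):
    """Return a list of problems found in one record's header dict.

    seen_ids is a set of record IDs from earlier records; it is updated here.
    """
    problems = []
    for name in REQUIRED:
        if name not in headers:
            problems.append("missing " + name)

    record_id = headers.get("WARC-Record-ID")
    if record_id is not None:
        if any(ch.isspace() for ch in record_id):
            problems.append("whitespace in WARC-Record-ID")
        if record_id in seen_ids:
            problems.append("duplicate WARC-Record-ID")
        seen_ids.add(record_id)

    warc_type = headers.get("WARC-Type")
    if warc_type in NEEDS_TARGET_URI and "WARC-Target-URI" not in headers:
        problems.append("missing WARC-Target-URI")
    if warc_type == "warcinfo" and "WARC-Target-URI" in headers:
        problems.append("warcinfo record has WARC-Target-URI")

    for name in SEGMENT_FIELDS:
        if warc_type == "continuation" and name not in headers:
            problems.append("missing " + name)
        if warc_type != "continuation" and name in headers:
            problems.append("unexpected " + name)
    return problems
```

Returning a list of problems, rather than stopping at the first one, would let a validation report show every issue found in a record at once.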


The code for the checks is found here: https://github.com/chfoo/warcat/blob/master/warcat/tool.py#L262-L406

and the checksum verification is here: https://github.com/chfoo/warcat/blob/master/warcat/verify.py#L38-L67
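
The idea behind the digest check can be sketched with the Python standard library. This is not WARCat's implementation. It assumes the digest header has the form "algorithm:value" and that the value is either Base32 (as Heritrix commonly writes for SHA-1) or hexadecimal; the function name digest_matches is hypothetical.

```python
import base64
import hashlib


def digest_matches(labelled_digest, data):
    """Check data against a digest header such as 'sha1:<value>'.

    Assumption: the value is Base32- or hex-encoded; other encodings
    are not handled by this sketch.
    """
    algorithm, sep, expected = labelled_digest.partition(":")
    if not sep:
        raise ValueError("expected 'algorithm:value'")
    hasher = hashlib.new(algorithm.lower())
    hasher.update(data)
    raw = hasher.digest()
    candidates = {
        base64.b32encode(raw).decode("ascii").upper(),
        raw.hex().upper(),
    }
    # Compare case-insensitively, since both encodings ignore letter case.
    return expected.strip().upper() in candidates
```

For 'WARC-Block-Digest' the data would be the whole record block; for 'WARC-Payload-Digest' it would be the payload only, for example an HTTP response body without its HTTP headers.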

Iterating through the records is done here: https://github.com/chfoo/warcat/blob/master/warcat/model/warc.py#L62-L89
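
The iteration check described earlier (read Content-Length bytes, then expect the record delimiter) can be sketched as follows. This is a simplified illustration for uncompressed WARC data and not WARCat's code; the function name iter_records is hypothetical, and it assumes each record ends with two CRLF pairs.

```python
import io


def iter_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of file
        if not version.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the header
            if not line:
                raise ValueError("file ended inside a record header")
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers["Content-Length"])
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("block shorter than Content-Length")
        # If Content-Length is wrong, the delimiter will not be where expected.
        if stream.read(4) != b"\r\n\r\n":
            raise ValueError("record not followed by two CRLFs")
        yield headers, block


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"Content-Length: 14\r\n"
    b"\r\n"
    b"robots: obey\r\n"
    b"\r\n\r\n"
)
records = list(iter_records(io.BytesIO(sample)))
```

For a gzipped file, the same function could be given the object returned by gzip.open(path, "rb"), which reads concatenated gzip members as one stream.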

Changes to METS

Below is a mock-up of an AIP METS file with enhancements for recording WARC files.


<?xml version='1.0' encoding='ASCII'?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
  <mets:metsHdr CREATEDATE="2015-11-27T17:17:29"/>
  <mets:dmdSec ID="dmdSec_1">
    <mets:mdWrap MDTYPE="DC">
      <mets:xmlData>
        <dcterms:dublincore xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd">
          <dc:identifier>urn:uuid:33bf5acf-4584-446f-b187-ce4f6ad79af9</dc:identifier>
          <!-- source = WARC-Record-ID -->
          <dc:isPartOf>Sooke Artworks Exhibit 2014</dc:isPartOf>
          <!-- source = collectionName (Archive-It only) -->
          <dc:isPartOf>4867-20141008190114161</dc:isPartOf>
          <!-- source = isPartOf (Archive-It and Chrome only) -->
          <dc:rights>collectionPublic=false</dc:rights>
          <!-- source = description (Archive-It only) -->
        </dcterms:dublincore>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:amdSec ID="amdSec_1">
    <mets:techMD ID="techMD_1">
      <mets:mdWrap MDTYPE="PREMIS:OBJECT">
        <mets:xmlData>
          <premis:object xmlns:premis="info:lc/xmlns/premis-v2" xsi:type="premis:file" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:objectIdentifier>
              <premis:objectIdentifierType>UUID</premis:objectIdentifierType>
              <premis:objectIdentifierValue>9a6db35f-b444-4295-a1b9-c0c94665c778</premis:objectIdentifierValue>
            </premis:objectIdentifier>
            <premis:objectCharacteristics>
              <premis:compositionLevel>0</premis:compositionLevel>
              <premis:fixity>
                <premis:messageDigestAlgorithm>sha256</premis:messageDigestAlgorithm>
                <premis:messageDigest>b8ed228653bbe2fc73f5a4711daaab3b427bc57920fc00778b9b96da35d5cbd9</premis:messageDigest>
              </premis:fixity>
              <premis:size>77038680</premis:size>
              <premis:format>
                <premis:formatDesignation>
                  <premis:formatName>WARC (Web ARChive)</premis:formatName>
                  <premis:formatVersion>ISO 28500</premis:formatVersion>
                </premis:formatDesignation>
                <premis:formatRegistry>
                  <premis:formatRegistryName>PRONOM</premis:formatRegistryName>
                  <premis:formatRegistryKey>fmt/289</premis:formatRegistryKey>
                </premis:formatRegistry>
              </premis:format>
              <premis:objectCharacteristicsExtension>
              <!-- tool output -->
              </premis:objectCharacteristicsExtension>
            </premis:objectCharacteristics>
            <premis:originalName>%transferDirectory%objects/ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc</premis:originalName>
          </premis:object>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
    <mets:digiprovMD ID="digiprovMD_1">
      <mets:mdWrap MDTYPE="PREMIS:EVENT">
        <mets:xmlData>
          <premis:event xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:eventIdentifier>
              <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
              <premis:eventIdentifierValue>670799cf-5ca0-4869-b0ba-7d1d951e3857</premis:eventIdentifierValue>
            </premis:eventIdentifier>
            <premis:eventType>creation</premis:eventType>
            <premis:eventDateTime>2015-11-27T17:14:59</premis:eventDateTime>
            <premis:eventDetail>software: Heritrix/3.3.0-SNAPSHOT-20140912-0039 http://crawler.archive.org
ip: 207.241.226.89
hostname: wbgrp-crawl052.us.archive.org
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: 4867-20141008190114161
description: recurrence=NONE, maxDuration=3600, maxDocumentCount=null, isTestCrawl=false, isPatchCrawl=false, oneTimeSubtype=CRAWL_SELECTED_SEEDS, seedCount=1, accountId=739, accountType=SUBSCRIBER, organizationName="Not a Real Institution", collectionId=4867, collectionName="Sooke Artworks Exhibit 2014", collectionPublic=false
robots: obey
http-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot; Archive-It; +http://archive-it.org/files/site-owners.html)
            </premis:eventDetail>
            <!-- source = text block starting with software -->
            <premis:eventOutcomeInformation>
              <premis:eventOutcome></premis:eventOutcome>
              <premis:eventOutcomeDetail>
                <premis:eventOutcomeDetailNote></premis:eventOutcomeDetailNote>
              </premis:eventOutcomeDetail>
            </premis:eventOutcomeInformation>
            <premis:linkingAgentIdentifier>
              <premis:linkingAgentIdentifierType>URI</premis:linkingAgentIdentifierType>
              <premis:linkingAgentIdentifierValue>http://crawler.archive.org</premis:linkingAgentIdentifierValue>
              <!-- source = software -->
            </premis:linkingAgentIdentifier>
          </premis:event>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="digiprovMD_7">
      <mets:mdWrap MDTYPE="PREMIS:AGENT">
        <mets:xmlData>
          <premis:agent xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:agentIdentifier>
              <premis:agentIdentifierType>URI</premis:agentIdentifierType>
              <premis:agentIdentifierValue>http://crawler.archive.org</premis:agentIdentifierValue>
              <!-- source = software -->
            </premis:agentIdentifier>
            <premis:agentName>Heritrix/3.3.0-SNAPSHOT-20140912-0039</premis:agentName>
            <!-- source = software -->
            <premis:agentType>software</premis:agentType>
          </premis:agent>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file GROUPID="Group-9a6db35f-b444-4295-a1b9-c0c94665c778" ID="file-9a6db35f-b444-4295-a1b9-c0c94665c778" ADMID="amdSec_1" DMDID="dmdSec_1">
        <mets:FLocat xlink:href="objects/ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical">
    <mets:div LABEL="WARC_file-b681af4b-8e17-479e-8a1f-0e9443415d5e" TYPE="Directory">
      <mets:div LABEL="objects" TYPE="Directory">
        <mets:div LABEL="ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc" TYPE="Item">
          <mets:fptr FILEID="file-9a6db35f-b444-4295-a1b9-c0c94665c778"/>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
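
The mapping noted in the comments of the mock-up (WARC-Record-ID to dc:identifier, collectionName and isPartOf to dc:isPartOf, collectionPublic to dc:rights, software to the agent name and identifier) could be sketched as below. This is an assumption-laden illustration using the standard library and not Archivematica's METS code: the function names are hypothetical, the element names simply mirror the mock-up, and the description parser naively splits on ", ", so it would break on values that contain a comma.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def parse_description(description):
    """Split an Archive-It style 'key=value, key=value' string into a dict."""
    pairs = {}
    for item in description.split(", "):
        key, sep, value = item.partition("=")
        if sep:
            pairs[key.strip()] = value.strip().strip('"')
    return pairs


def split_software(software):
    """Split 'Name/version http://uri' into (name, uri); uri may be None."""
    name, _, last = software.rpartition(" ")
    if name and last.startswith(("http://", "https://")):
        return name, last
    return software, None


def build_dublincore(record_id, fields):
    """Build the dublincore element of the mock-up from warcinfo fields."""
    root = ET.Element("{%s}dublincore" % DCTERMS_NS)
    ET.SubElement(root, "{%s}identifier" % DC_NS).text = record_id
    details = parse_description(fields.get("description", ""))
    if "collectionName" in details:  # Archive-It only
        ET.SubElement(root, "{%s}isPartOf" % DC_NS).text = details["collectionName"]
    if "isPartOf" in fields:  # Archive-It and Chrome only
        ET.SubElement(root, "{%s}isPartOf" % DC_NS).text = fields["isPartOf"]
    if "collectionPublic" in details:  # Archive-It only
        rights = "collectionPublic=" + details["collectionPublic"]
        ET.SubElement(root, "{%s}rights" % DC_NS).text = rights
    return root


# Values copied from the mock-up above (description shortened).
fields = {
    "software": "Heritrix/3.3.0-SNAPSHOT-20140912-0039 http://crawler.archive.org",
    "isPartOf": "4867-20141008190114161",
    "description": 'seedCount=1, collectionId=4867, '
                   'collectionName="Sooke Artworks Exhibit 2014", collectionPublic=false',
}
root = build_dublincore("urn:uuid:33bf5acf-4584-446f-b187-ce4f6ad79af9", fields)
agent_name, agent_uri = split_software(fields["software"])
print(ET.tostring(root, encoding="unicode"))
```

Fields that a given crawler does not write are simply skipped, which matches the "Archive-It only" and "Archive-It and Chrome only" notes in the mock-up.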