Difference between revisions of "Improvements/warc"
Revision as of 11:30, 9 September 2016
Synopsis
Improvements to Archivematica's handling of WARC files could go in a number of directions, most of which involve better extraction of technical and provenance metadata to Archivematica's METS file, which would improve the understanding and preservation of the WARC files over time.
User story
Status
Some code is in a development branch of Archivematica (https://github.com/artefactual/archivematica/tree/dev/issue-8634-warc-mets) which will read certain elements of the WARC header. This lays the groundwork for parsing this descriptive information into the METS file. The code is based on an Archivematica branch that introduces external agents to the METS file, which in turn lays the groundwork for describing the software agent that created the WARC (e.g. Archive-It, wget, or a Chrome extension).
Analysis
WARCat
We tested and evaluated the tool WARCat for verifying, validating and extracting content from WARC files: https://github.com/chfoo/warcat
Here's what we found about how warcat verifies WARC files:
- Iterates through a (possibly gzipped) WARC file
  - During iteration, uses the Content-Length header and looks for delimiters (typically newlines) to verify that it is reading each block correctly.
  - archive-it.warc, chrome.warc and wget.warc all fail this iteration check.
- The verify command checks many things, mostly related to the various headers.
- Checks for 'WARC-Record-ID', 'Content-Length', 'WARC-Date' and 'WARC-Type' in headers
  - If 'WARC-Block-Digest' is present, checks the block checksum
  - If 'WARC-Payload-Digest' is present, checks the payload checksum
  - Checks the record ID has not been seen before in this WARC file
  - Checks there is no whitespace in the record ID
  - Checks a record with 'Content-Length' also has 'Content-Type'
  - If 'WARC-Concurrent-To' or 'WARC-Refers-To' is present, checks 'WARC-Type' is not 'warcinfo', 'conversion' or 'continuation', and that the referenced record ID has been seen before
  - If 'WARC-Type' is 'response', 'resource', 'request', 'revisit', 'conversion' or 'continuation', checks for 'WARC-Target-URI'
  - If 'WARC-Type' is 'warc_info', checks no 'WARC-Target-URI' *
  - If 'WARC-Target-URI' is present, checks it contains no whitespace
  - If 'WARC-Type' is 'warcinfo', checks no 'WARC-Filename'
  - If 'WARC-Type' is 'revisit', checks no 'WARC-Profile'
  - If 'WARC-Type' is 'continuation', checks for 'WARC-Segment-Origin-ID' and 'WARC-Segment-Total-Length'
  - If 'WARC-Type' is not 'continuation', checks no 'WARC-Segment-Origin-ID' or 'WARC-Segment-Total-Length'
We think there is a typo in the check marked with an asterisk (*): every other place refers to a 'WARC-Type' of 'warcinfo', and this is the only one that refers to 'warc_info'.
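To make a few of the header checks listed above concrete, here is a minimal Python sketch. It is not warcat's code: the function name `check_record_headers`, the constants, and the problem messages are all hypothetical, the headers are assumed to have already been parsed into a plain dict, and only a subset of the checks is shown. It uses the 'warcinfo' spelling, on the assumption that this is what the check with the suspected typo intended.

```python
# Hypothetical names throughout; this is an illustration, not warcat's API.
REQUIRED_FIELDS = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")
URI_REQUIRED_TYPES = {
    "response", "resource", "request", "revisit", "conversion", "continuation",
}


def check_record_headers(headers, seen_ids):
    """Return a list of problem strings for one record's header dict.

    `headers` maps field names to values. `seen_ids` is a set of record IDs
    already encountered in this file and is updated in place.
    """
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in headers:
            problems.append("missing " + name)

    record_id = headers.get("WARC-Record-ID")
    if record_id is not None:
        if any(ch.isspace() for ch in record_id):
            problems.append("whitespace in WARC-Record-ID")
        if record_id in seen_ids:
            problems.append("duplicate WARC-Record-ID")
        seen_ids.add(record_id)

    warc_type = headers.get("WARC-Type")
    if warc_type in URI_REQUIRED_TYPES and "WARC-Target-URI" not in headers:
        problems.append("missing WARC-Target-URI")
    # Assumed intent of the check that is spelled 'warc_info' in warcat.
    if warc_type == "warcinfo" and "WARC-Target-URI" in headers:
        problems.append("warcinfo record has WARC-Target-URI")
    return problems
```

Passing the same `seen_ids` set for every record in a file is what lets the duplicate-ID check work across records.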
The code for the checks is found here: https://github.com/chfoo/warcat/blob/master/warcat/tool.py#L262-L406
and the checksum verification is here: https://github.com/chfoo/warcat/blob/master/warcat/verify.py#L38-L67
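As a rough illustration of what a block checksum check involves, the sketch below recomputes a digest over the block bytes and compares it with a header value of the form `algorithm:digest`. This is an assumption-laden sketch rather than warcat's implementation: the function name `block_digest_matches` is hypothetical, the algorithm label is assumed to be a name that Python's `hashlib` accepts (such as `sha1`), and the digest is assumed to be encoded as either Base32 or hexadecimal.

```python
import base64
import hashlib


def block_digest_matches(header_value, block):
    """Compare an 'algorithm:digest' header value against the block bytes.

    Assumes the label is a hashlib algorithm name and that the digest is
    Base32 or hexadecimal; both are assumptions made for this sketch.
    """
    algorithm, _, expected = header_value.partition(":")
    digest = hashlib.new(algorithm.strip().lower(), block).digest()
    expected = expected.strip()
    base32_form = base64.b32encode(digest).decode("ascii")
    hex_form = digest.hex()
    return expected.upper() == base32_form or expected.lower() == hex_form
```

A payload digest check would follow the same pattern, but over the payload only (for an HTTP response record, the body after the HTTP headers) rather than the whole block.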
Iterating through the records is done here: https://github.com/chfoo/warcat/blob/master/warcat/model/warc.py#L62-L89
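The iteration check described earlier (use Content-Length to read the block, then confirm the expected delimiter follows) can be sketched as below. This is a simplified illustration, not warcat's code: `iter_records` is a hypothetical name, the input is assumed to be an already decompressed binary stream, header continuation lines are ignored, and the delimiter is assumed to be two CRLF pairs after each block.

```python
import io


def iter_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of file
        if not version.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the header
            if not line:
                raise ValueError("file ended inside a record header")
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        length = int(headers["Content-Length"])
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("block is shorter than Content-Length")
        # The block should be followed by two CRLF pairs before the next record.
        if stream.read(4) != b"\r\n\r\n":
            raise ValueError("record delimiter not found after block")
        yield headers, block
```

A file that raises `ValueError` here would be one where Content-Length and the delimiters disagree, which is the kind of failure reported for the three sample files. For a gzipped file, the stream could be opened with `gzip.open(path, "rb")` before being passed in.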
Changes to METS
Below is a mock-up of an AIP METS file with enhancements for recording WARC files.
<?xml version='1.0' encoding='ASCII'?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
  <mets:metsHdr CREATEDATE="2015-11-27T17:17:29"/>
  <mets:dmdSec ID="dmdSec_1">
    <mets:mdWrap MDTYPE="DC">
      <mets:xmlData>
        <dcterms:dublincore xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd">
          <dc:identifier>urn:uuid:33bf5acf-4584-446f-b187-ce4f6ad79af9</dc:identifier>
          <!-- source = WARC-Record-ID -->
          <dc:isPartOf>Sooke Artworks Exhibit 2014</dc:isPartOf>
          <!-- source = collectionName (Archive-It only) -->
          <dc:isPartOf>4867-20141008190114161</dc:isPartOf>
          <!-- source = isPartOf (Archive-It and Chrome only) -->
          <dc:rights>collectionPublic=false</dc:rights>
          <!-- source = description (Archive-It only) -->
        </dcterms:dublincore>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:amdSec ID="amdSec_1">
    <mets:techMD ID="techMD_1">
      <mets:mdWrap MDTYPE="PREMIS:OBJECT">
        <mets:xmlData>
          <premis:object xmlns:premis="info:lc/xmlns/premis-v2" xsi:type="premis:file" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:objectIdentifier>
              <premis:objectIdentifierType>UUID</premis:objectIdentifierType>
              <premis:objectIdentifierValue>9a6db35f-b444-4295-a1b9-c0c94665c778</premis:objectIdentifierValue>
            </premis:objectIdentifier>
            <premis:objectCharacteristics>
              <premis:compositionLevel>0</premis:compositionLevel>
              <premis:fixity>
                <premis:messageDigestAlgorithm>sha256</premis:messageDigestAlgorithm>
                <premis:messageDigest>b8ed228653bbe2fc73f5a4711daaab3b427bc57920fc00778b9b96da35d5cbd9</premis:messageDigest>
              </premis:fixity>
              <premis:size>77038680</premis:size>
              <premis:format>
                <premis:formatDesignation>
                  <premis:formatName>WARC (Web ARChive)</premis:formatName>
                  <premis:formatVersion>ISO 28500</premis:formatVersion>
                </premis:formatDesignation>
                <premis:formatRegistry>
                  <premis:formatRegistryName>PRONOM</premis:formatRegistryName>
                  <premis:formatRegistryKey>fmt/289</premis:formatRegistryKey>
                </premis:formatRegistry>
              </premis:format>
              <premis:objectCharacteristicsExtension>
                <!-- tool output -->
              </premis:objectCharacteristicsExtension>
            </premis:objectCharacteristics>
            <premis:originalName>%transferDirectory%objects/ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc</premis:originalName>
          </premis:object>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
    <mets:digiprovMD ID="digiprovMD_1">
      <mets:mdWrap MDTYPE="PREMIS:EVENT">
        <mets:xmlData>
          <premis:event xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:eventIdentifier>
              <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
              <premis:eventIdentifierValue>670799cf-5ca0-4869-b0ba-7d1d951e3857</premis:eventIdentifierValue>
            </premis:eventIdentifier>
            <premis:eventType>creation</premis:eventType>
            <premis:eventDateTime>2015-11-27T17:14:59</premis:eventDateTime>
            <premis:eventDetail>software: Heritrix/3.3.0-SNAPSHOT-20140912-0039 http://crawler.archive.org
ip: 207.241.226.89
hostname: wbgrp-crawl052.us.archive.org
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: 4867-20141008190114161
description: recurrence=NONE, maxDuration=3600, maxDocumentCount=null, isTestCrawl=false, isPatchCrawl=false, oneTimeSubtype=CRAWL_SELECTED_SEEDS, seedCount=1, accountId=739, accountType=SUBSCRIBER, organizationName="Not a Real Institution", collectionId=4867, collectionName="Sooke Artworks Exhibit 2014", collectionPublic=false
robots: obey
http-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot; Archive-It; +http://archive-it.org/files/site-owners.html)
            </premis:eventDetail>
            <!-- source = text block starting with software -->
            <premis:eventOutcomeInformation>
              <premis:eventOutcome></premis:eventOutcome>
              <premis:eventOutcomeDetail>
                <premis:eventOutcomeDetailNote></premis:eventOutcomeDetailNote>
              </premis:eventOutcomeDetail>
            </premis:eventOutcomeInformation>
            <premis:linkingAgentIdentifier>
              <premis:linkingAgentIdentifierType>URI</premis:linkingAgentIdentifierType>
              <premis:linkingAgentIdentifierValue>http://crawler.archive.org</premis:linkingAgentIdentifierValue>
              <!-- source = software -->
            </premis:linkingAgentIdentifier>
          </premis:event>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="digiprovMD_7">
      <mets:mdWrap MDTYPE="PREMIS:AGENT">
        <mets:xmlData>
          <premis:agent xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:agentIdentifier>
              <premis:agentIdentifierType>URI</premis:agentIdentifierType>
              <premis:agentIdentifierValue>http://crawler.archive.org</premis:agentIdentifierValue>
              <!-- source = software -->
            </premis:agentIdentifier>
            <premis:agentName>Heritrix/3.3.0-SNAPSHOT-20140912-0039</premis:agentName>
            <!-- source = software -->
            <premis:agentType>software</premis:agentType>
          </premis:agent>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file GROUPID="Group-9a6db35f-b444-4295-a1b9-c0c94665c778" ID="file-9a6db35f-b444-4295-a1b9-c0c94665c778" ADMID="amdSec_1" DMDID="dmdSec_1">
        <mets:FLocat xlink:href="objects/ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap ID="structMap_1" LABEL="Archivematica default" TYPE="physical">
    <mets:div LABEL="WARC_file-b681af4b-8e17-479e-8a1f-0e9443415d5e" TYPE="Directory">
      <mets:div LABEL="objects" TYPE="Directory">
        <mets:div LABEL="ARCHIVEIT-4867-NONE-15219-20141008190130659-00000-wbgrp-crawl052.us.archive.org-6442.warc" TYPE="Item">
          <mets:fptr FILEID="file-9a6db35f-b444-4295-a1b9-c0c94665c778"/>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
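The mock-up derives the PREMIS agent from the `software` line of the warcinfo text block. A minimal Python sketch of that mapping follows. It is not the code in the Archivematica development branch: the function names `parse_warcinfo` and `premis_agent` are hypothetical, and the sketch assumes the `software` value is a name followed by a single space and a URL, as in the Heritrix example above. Files written by other tools (wget, a Chrome extension) may not follow that pattern.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xmlns/premis-v2"


def parse_warcinfo(text):
    """Turn the 'key: value' lines of a warcinfo block into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def premis_agent(fields):
    """Build a premis:agent element from the warcinfo 'software' field.

    Assumes 'software' looks like '<name/version> <URL>'.
    """
    name, _, uri = fields["software"].partition(" ")
    ET.register_namespace("premis", PREMIS_NS)
    agent = ET.Element("{%s}agent" % PREMIS_NS, {"version": "2.2"})
    identifier = ET.SubElement(agent, "{%s}agentIdentifier" % PREMIS_NS)
    ET.SubElement(identifier, "{%s}agentIdentifierType" % PREMIS_NS).text = "URI"
    ET.SubElement(identifier, "{%s}agentIdentifierValue" % PREMIS_NS).text = uri
    ET.SubElement(agent, "{%s}agentName" % PREMIS_NS).text = name
    ET.SubElement(agent, "{%s}agentType" % PREMIS_NS).text = "software"
    return agent
```

Calling `ET.tostring(premis_agent(fields), encoding="unicode")` would serialize the element for inspection. Wrapping it in the `mets:digiprovMD` and `mets:mdWrap` elements is left out of this sketch.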