LOCKSS Integration

From Archivematica

Revision as of 10:42, 14 March 2014


Status

Sponsored: SFU Library
Development Status: In Progress
Public Release: estimated April 2014


Overview

LOCKSS (Lots of Copies Keeps Stuff Safe) http://www.lockss.org/ is treated by Archivematica as a storage sub-system. The Archivematica Storage Service can be configured to store AIPs in a LOCKSS network. The procedure for storing AIPs follows the SWORD v2 protocol: http://swordapp.org/sword-v2/ .

The Archivematica Storage Service contains an implementation of a SWORD client, which communicates with a SWORD server. The implementation of the SWORD server is provided by LOCKSS-O-Matic https://github.com/mjordan/lockss-o-matic. LOCKSS-O-Matic is responsible for communicating with the Private LOCKSS Network (PLN).

Basic Workflow

  • Digital Objects are processed and packaged into AIPs in an Archivematica pipeline
  • Archivematica pipeline sends the final AIP to the Storage Service using the 'Store AIP' micro-service
  • Storage Service keeps a local copy of the AIP
  • Storage Service POSTs the AIP to LOCKSS-O-Matic
  • LOCKSS-O-Matic edits the configuration files of all the LOCKSS boxes in the Private LOCKSS Network
  • LOCKSS boxes GET the AIP from the Storage Service
  • Storage Service polls LOCKSS-O-Matic (GETs the State document) to determine when the AIP has been stored in LOCKSS
  • If/When the Storage Service wants to remove its local copy of the AIP, it first POSTs a metadata update to LOCKSS-O-Matic
  • LOCKSS-O-Matic then updates the configuration files in LOCKSS boxes so they will stop harvesting the AIP

Technical Details

This document outlines a minimalist SWORD API where LOCKSS-O-Matic is the server, and the Archivematica Storage Service is the client. @todos and questions appear throughout.

In the examples below, Archivematica is at http://archivematica.example.org and LOCKSS-O-Matic is at http://lockssomatic.example.org. The content file being managed by the SWORD deposit is an Archivematica AIP with the UUID dd3e3247-8466-4f2a-bb32-22a210cfce60. LOCKSS-O-Matic is managing a small Private LOCKSS Network containing two boxes, http://lockss1.example.org and http://lockss2.example.org.

Service Document

Archivematica issues a GET to the SD-IRI (http://lockssomatic.example.org/api/sword/2.0/sd-iri). The request carries an HTTP header, ‘LOM-Content-Provider’, whose value is the Archivematica instance’s content provider ID in the target LOCKSS-O-Matic (LOM) instance:

LOM-Content-Provider: 12

LOCKSS-O-Matic responds with a Service Document like:

<service xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:sword="http://purl.org/net/sword/terms/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:lom="http://lockssomatic.info/SWORD2"
    xmlns="http://www.w3.org/2007/app">

    <sword:version>2.0</sword:version>
    <!-- maxUploadSize is configurable in LOCKSS-O-Matic. -->
    <!-- measured in kB -->
    <sword:maxUploadSize>102400</sword:maxUploadSize>
   
    <!-- uploadChecksumType chosen from a list - suggested: md5, sha1, sha256 -->
    <lom:uploadChecksumType>md5</lom:uploadChecksumType>

    <workspace>
        <atom:title>LOCKSS-O-Matic at Simon Fraser University</atom:title>     
        <!-- Each LOCKSS-O-Matic content provider will have its own SWORD collection; 
            in this case, http://archivematica.example.org  is collection 12. -->
        <collection href="http://lockssomatic.example.org/api/sword/2.0/col-iri/12">
            <atom:title>SFU Archivematica content provider</atom:title>
            <accept>application/atom+xml;type=entry</accept>
            <sword:mediation>true</sword:mediation>
        </collection>
    </workspace>
</service>
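A client needs three values from this document before it can deposit: the Col-IRI, the maximum upload size, and the checksum type. The following is a minimal Python sketch of how a client might prepare the request and read those values. The function names and the returned dictionary keys are illustrative assumptions, not part of Archivematica or LOCKSS-O-Matic, and the request is built but not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace prefixes used in the Service Document shown above.
NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/terms/",
    "lom": "http://lockssomatic.info/SWORD2",
}

SAMPLE_SERVICE_DOCUMENT = """\
<service xmlns:sword="http://purl.org/net/sword/terms/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:lom="http://lockssomatic.info/SWORD2"
    xmlns="http://www.w3.org/2007/app">
    <sword:version>2.0</sword:version>
    <sword:maxUploadSize>102400</sword:maxUploadSize>
    <lom:uploadChecksumType>md5</lom:uploadChecksumType>
    <workspace>
        <atom:title>LOCKSS-O-Matic at Simon Fraser University</atom:title>
        <collection href="http://lockssomatic.example.org/api/sword/2.0/col-iri/12">
            <atom:title>SFU Archivematica content provider</atom:title>
            <accept>application/atom+xml;type=entry</accept>
            <sword:mediation>true</sword:mediation>
        </collection>
    </workspace>
</service>
"""


def build_service_document_request(sd_iri, provider_id):
    """Prepare (but do not send) the GET request for the Service Document."""
    return urllib.request.Request(
        sd_iri,
        headers={"LOM-Content-Provider": str(provider_id)},
        method="GET",
    )


def parse_service_document(xml_text):
    """Extract the settings a client needs before making a deposit."""
    root = ET.fromstring(xml_text)
    collection = root.find("app:workspace/app:collection", NS)
    return {
        # The document states that maxUploadSize is measured in kB.
        "max_upload_kb": int(root.findtext("sword:maxUploadSize", namespaces=NS)),
        "checksum_type": root.findtext("lom:uploadChecksumType", namespaces=NS),
        "col_iri": collection.get("href"),
    }


settings = parse_service_document(SAMPLE_SERVICE_DOCUMENT)
print(settings)
```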


Creating a Resource with an Atom Entry

Archivematica issues a POST to the Col-IRI, ensuring that the <id> element in the Atom entry contains the UUID of the AIP:

<entry xmlns="http://www.w3.org/2005/Atom"    
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:lom="http://lockssomatic.info/SWORD2">
    <title>Some AIP</title>
    <id>urn:uuid:dd3e3247-8466-4f2a-bb32-22a210cfce60</id>
    <updated>2013-10-07T17:17:08Z</updated>
    <author><name>Name of PREMIS agent owning AIP</name></author>
    <summary type="text">The AIP’s dc:description if it has one. If not, use a generic summary.</summary>
    <lom:content size="102400" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001</lom:content>
    <lom:content size="46899" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
</entry>
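As a rough illustration, the entry above could be assembled with Python's standard library as follows. This is a sketch only: `build_deposit_entry`, `md5_hex`, and the shape of the `chunks` list are hypothetical names chosen for this example, and the chunk sizes are passed in by the caller rather than computed, because this page does not state the unit of the `size` attribute.

```python
import hashlib
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
LOM_NS = "http://lockssomatic.info/SWORD2"


def md5_hex(data):
    """Return the hex MD5 digest of a chunk's bytes (for checksumValue)."""
    return hashlib.md5(data).hexdigest()


def build_deposit_entry(aip_uuid, title, author, summary, updated, chunks):
    """Serialize an Atom entry describing the chunks of one AIP.

    `chunks` is a list of dicts with the keys "url", "size" and "checksum"
    (a hypothetical structure used only in this sketch).
    """
    ET.register_namespace("", ATOM_NS)      # Atom as the default namespace
    ET.register_namespace("lom", LOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # The <id> element must carry the UUID of the AIP.
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = f"urn:uuid:{aip_uuid}"
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    ET.SubElement(entry, f"{{{ATOM_NS}}}summary", {"type": "text"}).text = summary
    for chunk in chunks:
        content = ET.SubElement(
            entry,
            f"{{{LOM_NS}}}content",
            {
                "size": str(chunk["size"]),
                "checksumType": "md5",
                "checksumValue": chunk["checksum"],
            },
        )
        content.text = chunk["url"]
    return ET.tostring(entry, encoding="unicode")


BASE = "http://archivematica.example.org/archival-storage/download/aip/"
AIP_UUID = "dd3e3247-8466-4f2a-bb32-22a210cfce60"
entry_xml = build_deposit_entry(
    AIP_UUID,
    "Some AIP",
    "Name of PREMIS agent owning AIP",
    "A generic summary.",
    "2013-10-07T17:17:08Z",
    [
        {"url": f"{BASE}{AIP_UUID}.001", "size": 102400,
         "checksum": "bd4a9b642562547754086de2dab26b7d"},
        {"url": f"{BASE}{AIP_UUID}.002", "size": 46899,
         "checksum": "226190d94b21d1b0c7b1a42d855e419d"},
    ],
)
print(entry_xml)
```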

LOCKSS-O-Matic responds with HTTP 201 (Created), a Location header containing the entry’s Edit-IRI, and the deposit receipt.

Deposit Receipt

The deposit receipt contains the Cont-IRI, the EM-IRI, the SE-IRI, the Edit-IRI, and the State-IRI.

<entry xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/terms/">
    <sword:treatment>Stored in LOCKSS via LOCKSS-O-Matic</sword:treatment>
    <content type="application/x-7z-compressed" src="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />

    <!-- EM-IRI. The EM-IRI and Cont-IRI can (and in LOCKSS-O-Matic, will) have the same value.-->
    <link rel="edit-media" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />

    <!-- SE-IRI (can be same as Edit-IRI) -->
    <!-- Archivematica will POST to this IRI when deleting local content -->
    <link rel="http://purl.org/net/sword/terms/add" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
    <!-- Edit-IRI -->
    <link rel="edit" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
    <!-- In LOCKSS-O-Matic, the State-IRI will be the EM-IRI/Cont-IRI with the string ‘/state’ appended. -->
    <link rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml;type=feed" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/state" />
</entry>
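A client will typically keep the IRIs from the receipt for later requests. The sketch below shows one way to pull them out by `rel` value using Python's standard library. The function name, the dictionary keys, and the shortened sample receipt are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Maps each link relation in the receipt to a short, illustrative key.
REL_NAMES = {
    "edit-media": "em_iri",
    "edit": "edit_iri",
    "http://purl.org/net/sword/terms/add": "se_iri",
    "http://purl.org/net/sword/terms/statement": "state_iri",
}

RESOURCE = ("http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/"
            "dd3e3247-8466-4f2a-bb32-22a210cfce60")

SAMPLE_RECEIPT = f"""\
<entry xmlns="http://www.w3.org/2005/Atom">
    <content type="application/x-7z-compressed" src="{RESOURCE}" />
    <link rel="edit-media" href="{RESOURCE}" />
    <link rel="http://purl.org/net/sword/terms/add" href="{RESOURCE}/edit" />
    <link rel="edit" href="{RESOURCE}/edit" />
    <link rel="http://purl.org/net/sword/terms/statement"
          type="application/atom+xml;type=feed" href="{RESOURCE}/state" />
</entry>
"""


def parse_deposit_receipt(xml_text):
    """Return the IRIs found in a deposit receipt, keyed by short name."""
    root = ET.fromstring(xml_text)
    iris = {}
    content = root.find(f"{ATOM}content")
    if content is not None:
        iris["cont_iri"] = content.get("src")
    for link in root.findall(f"{ATOM}link"):
        name = REL_NAMES.get(link.get("rel"))
        if name is not None:
            iris[name] = link.get("href")
    return iris


iris = parse_deposit_receipt(SAMPLE_RECEIPT)
print(iris["state_iri"])
```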

SWORD Statement

State-IRI is defined in the deposit receipt. GET requests to the State-IRI for a resource will return an Atom feed as described below.

The allowed state terms and their messages, using the namespace http://lockssomatic.info/SWORD2, are:

Term          Message (one of)
failed        Content cannot be harvested by LOCKSS.
disagreement  LOCKSS network is not in agreement on content checksums.
agreement     LOCKSS network agrees internally on content checksums.

If LOCKSS-O-Matic reports ‘agreement’, Archivematica may delete the AIP from local storage.

Sample SWORD statement serialized as an Atom feed:

<atom:feed xmlns:sword="http://purl.org/net/sword/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:lom="http://lockssomatic.info/SWORD2">
  <atom:entry>
    <atom:category scheme="http://purl.org/net/sword/terms/" term="http://purl.org/net/sword/terms/originalDeposit" label="Original Deposit"/>

    <!-- This content node is for a chunk (i.e. a LOM content entry). -->
    <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001">
      <lom:serverlist>
        <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
        <lom:server id="2" state="failed" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
      </lom:serverlist>
    </lom:content>

    <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002">
      <lom:serverlist>
        <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
        <lom:server id="2" state="disagreement" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
      </lom:serverlist>
    </lom:content>

  </atom:entry>
</atom:feed>
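While polling the State-IRI, a client has to decide whether the statement as a whole counts as ‘agreement’. The Python sketch below takes a conservative reading: it treats the AIP as safe to delete only when every server reports ‘agreement’ for every chunk. That rule is an assumption of this example, since this page does not say how per-server states are combined, and the function names and the abbreviated sample feed are likewise illustrative.

```python
import xml.etree.ElementTree as ET

LOM = "{http://lockssomatic.info/SWORD2}"
CHUNK = ("http://archivematica.example.org/archival-storage/download/aip/"
         "dd3e3247-8466-4f2a-bb32-22a210cfce60")

# Abbreviated version of the statement above (src and checksum attributes omitted).
SAMPLE_STATEMENT = f"""\
<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:lom="http://lockssomatic.info/SWORD2">
  <atom:entry>
    <lom:content id="{CHUNK}.001">
      <lom:serverlist>
        <lom:server id="1" state="agreement" />
        <lom:server id="2" state="failed" />
      </lom:serverlist>
    </lom:content>
    <lom:content id="{CHUNK}.002">
      <lom:serverlist>
        <lom:server id="1" state="agreement" />
        <lom:server id="2" state="disagreement" />
      </lom:serverlist>
    </lom:content>
  </atom:entry>
</atom:feed>
"""


def chunk_states(xml_text):
    """Map each chunk URL to the list of states reported by the LOCKSS boxes."""
    root = ET.fromstring(xml_text)
    states = {}
    for content in root.iter(f"{LOM}content"):
        states[content.get("id")] = [
            server.get("state") for server in content.iter(f"{LOM}server")
        ]
    return states


def safe_to_delete(states):
    """Assumed rule: every box must report 'agreement' for every chunk."""
    return bool(states) and all(
        servers and all(state == "agreement" for state in servers)
        for servers in states.values()
    )


states = chunk_states(SAMPLE_STATEMENT)
print(safe_to_delete(states))
```

With the sample statement, one box reports ‘failed’ and another ‘disagreement’, so under this rule the local copy would be kept.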

Updating Metadata

Archivematica informs LOCKSS-O-Matic that it wants to delete its local copy of the AIP and that LOM should no longer harvest it.

Archivematica issues a POST to the SE-IRI with an Atom entry like:

 
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:lom="http://lockssomatic.info/SWORD2">
    <id>urn:uuid:dd3e3247-8466-4f2a-bb32-22a210cfce60</id>
    <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001</lom:content>
    <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
</entry>
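The update entry can be generated from the AIP's UUID and its chunk URLs. The following Python sketch shows one possible way to do so. `build_recrawl_update` is a hypothetical helper name, and the sketch omits the unused dcterms namespace declaration from the example above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
LOM_NS = "http://lockssomatic.info/SWORD2"


def build_recrawl_update(aip_uuid, chunk_urls):
    """Serialize an Atom entry marking every chunk with recrawl="false"."""
    ET.register_namespace("", ATOM_NS)      # Atom as the default namespace
    ET.register_namespace("lom", LOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = f"urn:uuid:{aip_uuid}"
    for url in chunk_urls:
        content = ET.SubElement(entry, f"{{{LOM_NS}}}content", {"recrawl": "false"})
        content.text = url
    return ET.tostring(entry, encoding="unicode")


AIP_UUID = "dd3e3247-8466-4f2a-bb32-22a210cfce60"
BASE = "http://archivematica.example.org/archival-storage/download/aip/"
update_xml = build_recrawl_update(
    AIP_UUID, [f"{BASE}{AIP_UUID}.001", f"{BASE}{AIP_UUID}.002"]
)
print(update_xml)
```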

LOM can return:

  • HTTP 202 (Accepted), meaning LOM is updating the LOCKSS config files so that the content is no longer harvested, but the update is not yet complete
  • HTTP 200 (OK), meaning all config updates are complete
  • HTTP 204 (No Content), if there is no matching AIP
  • HTTP 409 (Conflict), if there are files in the LOCKSS AU that do not have ‘recrawl=false’
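On the client side, these status codes could be mapped to outcomes along the following lines. This Python sketch is illustrative only: the outcome labels and function names are invented for the example, and treating 200 as the only response after which the local copy may be removed is an assumption drawn from the workflow described earlier, not a documented rule.

```python
# Outcome labels are hypothetical; only the status codes come from the list above.
UPDATE_OUTCOMES = {
    202: "pending",      # LOM is still updating the LOCKSS config files
    200: "complete",     # all config updates are complete
    204: "unknown_aip",  # there is no matching AIP
    409: "conflict",     # some files in the AU lack recrawl=false
}


def interpret_update_response(status):
    """Translate the HTTP status of a metadata update into an outcome label."""
    return UPDATE_OUTCOMES.get(status, "error")


def may_delete_local_copy(status):
    """Assumed policy: delete only once LOM confirms the update is complete."""
    return interpret_update_response(status) == "complete"


for code in (202, 200, 204, 409, 500):
    print(code, interpret_update_response(code), may_delete_local_copy(code))
```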

“Once the server has processed the request it MUST return an HTTP status code 200 (OK) or 204 (No Content), or an appropriate error code.”

Comments/questions

Can something be deleted from LOCKSS-O-Matic? How?
Not really, since LOCKSS doesn’t let you delete content, not directly anyway (it can be done on the boxes’ command lines). In LOM we could have a flag that specifies that content should not be harvested into the LOCKSS network.


Is there any sort of authentication/authorization on the API?
The API should support HTTP Basic Auth (which is a “SHOULD” in SWORD). At a minimum, it should support IP whitelisting.

LOCKSS itself relies on both whitelisting and HTTP Basic Auth.

Voting
Each LOCKSS box initiates a ‘poll’ at random intervals bounded by two box-specific configuration settings:
 <property name="contentpoll.min" value="30m" />
 <property name="contentpoll.max" value="2d" />
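To make the effect of these two bounds concrete, the Python sketch below converts the values to seconds and draws a random delay between them. It is not LOCKSS code: the uniform distribution, the function names, and the set of accepted unit suffixes (only `m` and `d` appear in the example above) are all assumptions.

```python
import random
import re

# Seconds per unit suffix. Only "m" and "d" appear in the settings above;
# the other suffixes are assumed for this sketch.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(value):
    """Convert a duration such as "30m" or "2d" to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def next_poll_delay(min_value, max_value, rng=random):
    """Pick a delay in seconds between the two bounds (uniform, by assumption)."""
    low, high = parse_duration(min_value), parse_duration(max_value)
    if low > high:
        raise ValueError("minimum interval exceeds maximum interval")
    return rng.uniform(low, high)


delay = next_poll_delay("30m", "2d")
print(f"next poll in about {delay / 3600:.1f} hours")
```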
How long does it take for 100% agreement on an Archival Unit (AU)?
Tom Lipkis, LOCKSS lead developer, says:

"Because each box recrawls AUs (looking for changes/additions) on its own schedule, AUs that are constantly changing likely won't achieve 100% agreement until they stop changing. We have some solutions for this in mind (the simplest of which is to exclude from polling files that appeared more recently than the recrawl interval), but at present monitoring systems shouldn't be alarmed at less than 100% agreement in this situation. If all the copies of the AU are the same when it settles down it will reach 100% agreement once all boxes have called and completed a poll [....] If the copies aren't identical (usually because of transient crawl errors) it may take a couple cycles of polling to resolve the differences."