LOCKSS Integration


Status

Sponsored: SFU Library
Development Status: In Progress
Public Release: estimated April 2014


Overview

LOCKSS (Lots of Copies Keep Stuff Safe) http://www.lockss.org/ is treated by Archivematica as a storage sub-system. The Archivematica Storage Service can be configured to store AIPs in a LOCKSS network. The procedure for storing AIPs follows the SWORD v2 protocol: http://swordapp.org/sword-v2/ .

The Archivematica Storage Service contains an implementation of a SWORD client, which communicates with a SWORD server. The SWORD server implementation is provided by LOCKSS-O-Matic https://github.com/mjordan/lockss-o-matic. LOCKSS-O-Matic is responsible for communicating with the Private LOCKSS Network (PLN).

Basic Workflow

  • Digital Objects are processed and packaged into AIPs in an Archivematica pipeline
  • Archivematica pipeline sends the final AIP to the Storage Service using the 'Store AIP' micro-service
  • Storage Service keeps a local copy of the AIP
  • Storage Service POSTs the list of files (e.g., a .7z file or parts of one created with split) that make up the AIP to LOCKSS-O-Matic. This list is in the form of an Atom document.
  • LOCKSS-O-Matic parses the list of files in the Atom document and registers them in its database.
  • LOCKSS-O-Matic edits the configuration files of all the LOCKSS boxes in the Private LOCKSS Network
  • LOCKSS boxes harvest the AIP files from the Storage Service
  • Storage Service polls LOCKSS-O-Matic (GETs the State document) to determine when the AIP has been stored in LOCKSS
  • If/When the Storage Service wants to remove its local copy of the AIP, it first POSTs a metadata update to LOCKSS-O-Matic
  • LOCKSS-O-Matic then updates the configuration files in LOCKSS boxes so they will stop harvesting the AIP
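
The polling step in the workflow above can be sketched as a small loop. This is an illustration only, not the Storage Service's actual code: the function name `wait_for_agreement`, the `get_state` callable (standing in for a GET of the State document) and the retry parameters are all hypothetical.

```python
import time
from typing import Callable


def wait_for_agreement(get_state: Callable[[], str],
                       interval_seconds: float = 0.0,
                       max_attempts: int = 10) -> bool:
    """Poll until the reported state is 'agreement' or attempts run out.

    get_state is a hypothetical callable that fetches the State document
    and reduces it to a single state term.
    """
    for attempt in range(max_attempts):
        if get_state() == "agreement":
            return True
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    return False


# Simulated sequence of answers from LOCKSS-O-Matic.
states = iter(["disagreement", "disagreement", "agreement"])
stored = wait_for_agreement(lambda: next(states))
```

A real client would probably use a much longer interval, since agreement depends on the polling schedule of the LOCKSS boxes.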

Technical Details

This document outlines a minimalist SWORD API where LOCKSS-O-Matic is the server, and the Archivematica Storage Service is the client. @todos and questions appear throughout.

In the examples below, Archivematica is at http://archivematica.example.org and LOCKSS-O-Matic is at http://lockssomatic.example.org. The content file being managed by the SWORD deposit is an Archivematica AIP with the UUID dd3e3247-8466-4f2a-bb32-22a210cfce60. LOCKSS-O-Matic is managing a small Private LOCKSS Network containing two boxes, http://lockss1.example.org and http://lockss2.example.org.

Service Document

Archivematica issues a GET to the SD-IRI (http://lockssomatic.example.org/api/sword/2.0/sd-iri) with an ‘On-Behalf-Of’ HTTP header whose value is the Archivematica instance’s content provider ID in the target LOCKSS-O-Matic (LOM) instance:

On-Behalf-Of: 12

LOCKSS-O-Matic responds with a Service Document like:

<service xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:sword="http://purl.org/net/sword/terms/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:lom="http://lockssomatic.info/SWORD2"
    xmlns="http://www.w3.org/2007/app">

    <sword:version>2.0</sword:version>
    <!-- maxUploadSize is configurable in LOCKSS-O-Matic. -->
    <!-- measured in kB -->
    <sword:maxUploadSize>102400</sword:maxUploadSize>
   
    <!-- uploadChecksumType chosen from a list - suggested: md5, sha1, sha256 -->
    <lom:uploadChecksumType>md5</lom:uploadChecksumType>

    <workspace>
        <atom:title>LOCKSS-O-Matic at Simon Fraser University</atom:title>     
        <!-- Each LOCKSS-O-Matic content provider will have its own SWORD collection; 
            in this case, http://archivematica.example.org  is collection 12. -->
        <collection href="http://lockssomatic.example.org/api/sword/2.0/col-iri/12">
            <atom:title>SFU Archivematica content provider</atom:title>
            <accept>application/atom+xml;type=entry</accept>
            <sword:mediation>true</sword:mediation>
         </collection>
    </workspace>
</service>
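
As a minimal sketch of how a client might read the values it needs from such a Service Document, the following uses Python's standard `xml.etree.ElementTree`. The function name `parse_service_document` and the returned dictionary keys are hypothetical, and the embedded sample is a trimmed copy of the document above.

```python
import xml.etree.ElementTree as ET

NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/terms/",
    "lom": "http://lockssomatic.info/SWORD2",
}

# Trimmed copy of the sample Service Document.
SERVICE_DOCUMENT = """<service xmlns:sword="http://purl.org/net/sword/terms/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:lom="http://lockssomatic.info/SWORD2"
    xmlns="http://www.w3.org/2007/app">
    <sword:version>2.0</sword:version>
    <sword:maxUploadSize>102400</sword:maxUploadSize>
    <lom:uploadChecksumType>md5</lom:uploadChecksumType>
    <workspace>
        <atom:title>LOCKSS-O-Matic at Simon Fraser University</atom:title>
        <collection href="http://lockssomatic.example.org/api/sword/2.0/col-iri/12">
            <atom:title>SFU Archivematica content provider</atom:title>
            <accept>application/atom+xml;type=entry</accept>
            <sword:mediation>true</sword:mediation>
        </collection>
    </workspace>
</service>"""


def parse_service_document(xml_text):
    """Extract the upload limit (kB), checksum type and Col-IRI."""
    root = ET.fromstring(xml_text)
    collection = root.find("app:workspace/app:collection", NS)
    return {
        "max_upload_kb": int(root.findtext("sword:maxUploadSize", namespaces=NS)),
        "checksum_type": root.findtext("lom:uploadChecksumType", namespaces=NS),
        "col_iri": collection.get("href"),
    }


settings = parse_service_document(SERVICE_DOCUMENT)
```

The Col-IRI obtained this way is the target of the POST described in the next section.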


Creating a Resource with an Atom Entry

Archivematica issues a POST to the Col-IRI, ensuring that the <id> element in the Atom entry contains the UUID of the AIP:

<entry xmlns="http://www.w3.org/2005/Atom"    
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:lom="http://lockssomatic.info/SWORD2">
    <title>Some AIP</title>
    <id>urn:uuid:dd3e3247-8466-4f2a-bb32-22a210cfce60</id>
    <updated>2013-10-07T17:17:08Z</updated>
    <author><name>Name of PREMIS agent owning AIP</name></author>
    <summary type="text">The AIP’s dc:description if it has one. If not, use a generic summary.</summary>
    <lom:content size="102400" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d">
        http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001
    </lom:content>
    <lom:content size="46899" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d">
        http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002
    </lom:content>
</entry>
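
An entry of this shape could be assembled as sketched below. This is a hedged illustration, not the Storage Service's code: `build_deposit_entry` and its parameters are hypothetical, the `author` and `summary` elements are left out for brevity, and the caller is assumed to supply each chunk's size and MD5 checksum already computed.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
LOM = "http://lockssomatic.info/SWORD2"

# Prefixes are cosmetic; registering them keeps the output readable.
ET.register_namespace("atom", ATOM)
ET.register_namespace("lom", LOM)


def build_deposit_entry(aip_uuid, title, updated, chunks):
    """Return an Atom entry (as a string) listing the chunks of one AIP.

    chunks is a list of (url, size, md5_hex) tuples supplied by the caller.
    """
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}id").text = f"urn:uuid:{aip_uuid}"
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    for url, size, md5_hex in chunks:
        content = ET.SubElement(entry, f"{{{LOM}}}content", {
            "size": str(size),
            "checksumType": "md5",
            "checksumValue": md5_hex,
        })
        content.text = url
    return ET.tostring(entry, encoding="unicode")


BASE = "http://archivematica.example.org/archival-storage/download/aip/"
UUID = "dd3e3247-8466-4f2a-bb32-22a210cfce60"
entry_xml = build_deposit_entry(
    UUID, "Some AIP", "2013-10-07T17:17:08Z",
    [(BASE + UUID + ".001", 102400, "bd4a9b642562547754086de2dab26b7d"),
     (BASE + UUID + ".002", 46899, "226190d94b21d1b0c7b1a42d855e419d")],
)
```

The output uses an `atom:` prefix rather than a default namespace, which is equivalent XML.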

LOCKSS-O-Matic responds with a 201 Created and the Location header with the entry’s Edit-IRI, and the deposit receipt.

Deposit Receipt

The deposit receipt contains the Cont-IRI, the EM-IRI, the SE-IRI, the Edit-IRI, and the State-IRI.

<entry xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/terms/">
    <sword:treatment>Stored in LOCKSS via LOCKSS-O-Matic</sword:treatment>
    <content type="application/x-7z-compressed" src="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />

    <!-- EM-IRI. The EM-IRI and Cont-IRI can (and in LOCKSS-O-Matic, will) have the same value.-->
    <link rel="edit-media" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />

    <!-- SE-IRI (can be same as Edit-IRI) -->
    <!-- Archivematica will POST to this iri when deleting local content -->
    <link rel="http://purl.org/net/sword/terms/add" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
    <!-- Edit-IRI -->
    <link rel="edit" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
    <!-- In LOCKSS-O-Matic, the State-IRI will be the EM-IRI/Cont-IRI with the string ‘/state’ appended. -->
    <link rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml;type=feed" 
      href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/state" />
</entry>
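
A client needs to keep the IRIs from the receipt for later requests. The sketch below shows one way to pull them out by their `rel` values; the function name `receipt_iris` and the dictionary keys are hypothetical, and the sample is a trimmed copy of the receipt above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
ADD_REL = "http://purl.org/net/sword/terms/add"
STATEMENT_REL = "http://purl.org/net/sword/terms/statement"

BASE = ("http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/"
        "dd3e3247-8466-4f2a-bb32-22a210cfce60")

# Trimmed copy of the sample deposit receipt.
RECEIPT = f"""<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sword="http://purl.org/net/sword/terms/">
    <sword:treatment>Stored in LOCKSS via LOCKSS-O-Matic</sword:treatment>
    <content type="application/x-7z-compressed" src="{BASE}" />
    <link rel="edit-media" href="{BASE}" />
    <link rel="{ADD_REL}" href="{BASE}/edit" />
    <link rel="edit" href="{BASE}/edit" />
    <link rel="{STATEMENT_REL}" type="application/atom+xml;type=feed"
          href="{BASE}/state" />
</entry>"""


def receipt_iris(xml_text):
    """Map each IRI in a deposit receipt to a short key."""
    root = ET.fromstring(xml_text)
    links = {link.get("rel"): link.get("href")
             for link in root.findall("atom:link", ATOM_NS)}
    return {
        "cont_iri": root.find("atom:content", ATOM_NS).get("src"),
        "em_iri": links.get("edit-media"),
        "se_iri": links.get(ADD_REL),
        "edit_iri": links.get("edit"),
        "state_iri": links.get(STATEMENT_REL),
    }


iris = receipt_iris(RECEIPT)
```

Looking links up by `rel` rather than by position means the client does not depend on the order of the `link` elements.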

SWORD Statement

State-IRI is defined in the deposit receipt. GET requests to the State-IRI for a resource will return an Atom feed as described below.

The allowed state terms and message values, using the namespace http://lockssomatic.info/SWORD2, are:

Term            Message (one of)
failed          Content cannot be harvested by LOCKSS.
disagreement    LOCKSS network is not in agreement on content checksums.
agreement       LOCKSS network agrees internally on content checksums.

If LOCKSS-O-Matic reports ‘agreement’, Archivematica may delete the AIP from local storage.

Sample SWORD statement serialized as an Atom feed:

<atom:feed xmlns:sword="http://purl.org/net/sword/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:lom="http://lockssomatic.info/SWORD2">
  <atom:entry>
    <atom:category scheme="http://purl.org/net/sword/terms/" term="http://purl.org/net/sword/terms/originalDeposit" label="Original Deposit"/>

    <!-- This content node is for a chunk (i.e. a LOM content entry). -->
    <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001">
      <lom:serverlist>
        <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
        <lom:server id="2" state="failed" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
      </lom:serverlist>
    </lom:content>

    <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002">
      <lom:serverlist>
        <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
        <lom:server id="2" state="disagreement" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
       </lom:serverlist>
    </lom:content>

  </atom:entry>
</atom:feed>
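
The statement can be reduced to a per-chunk list of server states, as in the following sketch. The function names are hypothetical. The rule used here, that every server must report ‘agreement’ for every chunk before the local copy is considered safe to delete, is an assumption made for the example; the text above only says that deletion may follow a report of ‘agreement’. The sample feed is trimmed to the attributes the code reads.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "lom": "http://lockssomatic.info/SWORD2",
}

URL = ("http://archivematica.example.org/archival-storage/download/aip/"
       "dd3e3247-8466-4f2a-bb32-22a210cfce60")

# Trimmed copy of the sample statement (src and checksum attributes omitted).
STATEMENT = f"""<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"
           xmlns:lom="http://lockssomatic.info/SWORD2">
  <atom:entry>
    <lom:content id="{URL}.001">
      <lom:serverlist>
        <lom:server id="1" state="agreement" />
        <lom:server id="2" state="failed" />
      </lom:serverlist>
    </lom:content>
    <lom:content id="{URL}.002">
      <lom:serverlist>
        <lom:server id="1" state="agreement" />
        <lom:server id="2" state="disagreement" />
      </lom:serverlist>
    </lom:content>
  </atom:entry>
</atom:feed>"""


def chunk_states(xml_text):
    """Return {chunk URL: [state reported by each LOCKSS box]}."""
    root = ET.fromstring(xml_text)
    return {
        content.get("id"): [server.get("state") for server in
                            content.iterfind("lom:serverlist/lom:server", NS)]
        for content in root.iterfind(".//lom:content", NS)
    }


def all_in_agreement(states):
    """True only if every chunk has servers and all of them agree."""
    return bool(states) and all(
        bool(server_states) and all(s == "agreement" for s in server_states)
        for server_states in states.values()
    )


states = chunk_states(STATEMENT)
safe_to_delete = all_in_agreement(states)
```

For the sample statement the result is negative, because box 2 reports ‘failed’ for the first chunk and ‘disagreement’ for the second.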

Updating Metadata

This request informs LOCKSS-O-Matic that Archivematica wants to delete its local copy of the AIP, and that LOM should no longer harvest it.

POST SE-IRI

 
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:lom="http://lockssomatic.info/SWORD2">
    <id>urn:uuid:dd3e3247-8466-4f2a-bb32-22a210cfce60</id>
    <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001</lom:content>
    <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
</entry>
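
A body like the one above could be generated as follows. This is a sketch under assumptions: `build_stop_harvest_entry` is a hypothetical helper name, and it simply marks every chunk URL it is given with `recrawl="false"`.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
LOM = "http://lockssomatic.info/SWORD2"

# Prefixes are cosmetic; registering them keeps the output readable.
ET.register_namespace("atom", ATOM)
ET.register_namespace("lom", LOM)


def build_stop_harvest_entry(aip_uuid, chunk_urls):
    """Return an Atom entry asking LOM to stop harvesting every chunk."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = f"urn:uuid:{aip_uuid}"
    for url in chunk_urls:
        content = ET.SubElement(entry, f"{{{LOM}}}content", {"recrawl": "false"})
        content.text = url
    return ET.tostring(entry, encoding="unicode")


BASE = "http://archivematica.example.org/archival-storage/download/aip/"
UUID = "dd3e3247-8466-4f2a-bb32-22a210cfce60"
update_xml = build_stop_harvest_entry(
    UUID, [BASE + UUID + ".001", BASE + UUID + ".002"])
```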

LOM can return:

  • HTTP 202 (Accepted): LOM is updating the LOCKSS config files so that the boxes stop harvesting this content, but the update is not yet complete.
  • HTTP 200 (OK): all config updates are complete.
  • HTTP 204 (No Content): there is no matching AIP.
  • HTTP 409 (Conflict): there are files in the LOCKSS Archival Unit (AU) that do not have ‘recrawl=false’.

“Once the server has processed the request it MUST return an HTTP status code 200 (OK) or 204 (No Content), or an appropriate error code.”
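
On the client side, the four responses listed above might be handled with a simple lookup. The outcome labels and the function name `interpret_update_status` are hypothetical; only the status codes and their meanings come from the list.

```python
# Status codes LOM may return for a metadata update, mapped to
# hypothetical outcome labels for the client.
UPDATE_OUTCOMES = {
    200: "complete",         # all config updates are done
    202: "pending",          # config updates still in progress
    204: "no-matching-aip",  # LOM does not know this AIP
    409: "conflict",         # some files in the AU lack recrawl=false
}


def interpret_update_status(status_code):
    """Translate an HTTP status code into an outcome label."""
    try:
        return UPDATE_OUTCOMES[status_code]
    except KeyError:
        raise ValueError(
            f"Unexpected status code from LOM: {status_code}") from None


outcome = interpret_update_status(202)
```

Raising on any other code keeps an unexpected server response from being silently treated as success.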

Comments/questions

Can an AIP be deleted from LOCKSS-O-Matic? How?
Not really, since LOCKSS doesn’t let you delete content, not directly anyway (it can be done on the boxes’ command lines). In LOM we could have a flag that specifies that content should not be harvested into the LOCKSS network. In the Archivematica dashboard, an AIP deletion request can be made. This deletion request goes to the Storage Service, which would tell LOM to stop harvesting the AIP, and then delete the local copy. The copies in the PLN would not be deleted, and would have to be manually deleted by a LOCKSS administrator.
Can an AIP be retrieved from LOCKSS via the Archivematica Dashboard?
Yes - with the following caveat. All AIPs stored in the Storage Service are searchable via the Archival Storage tab in the dashboard. The Storage Service keeps a local copy of AIPs that are stored in LOCKSS, so when a user clicks on an AIP in the Archival Storage tab, that local copy will be delivered. If the local copy has been deleted (which will eventually happen; exactly when depends on Storage Service configuration settings), the dashboard user will get a 'local copy not available, contact your storage administrator' message. From the Storage Service, a storage administrator can click a button to download a copy of the AIP directly from one of the LOCKSS boxes to the Storage Service. Once that is done, the AIP will again be retrievable from the dashboard.
Is there any sort of authentication/authorization on the API?
The API should support HTTP Basic Auth (which is a “SHOULD” in SWORD). At a minimum, it should support IP whitelisting.

LOCKSS itself relies on both whitelisting and HTTP Basic Auth.
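
If HTTP Basic Auth is adopted, a Service Document request might be prepared as in this sketch. The function name and the credentials are hypothetical, and the request is only constructed, not sent.

```python
import base64
import urllib.request


def build_service_document_request(sd_iri, provider_id, username, password):
    """Build (but do not send) a GET for the Service Document."""
    # HTTP Basic Auth: base64 of "username:password".
    token = base64.b64encode(
        f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        sd_iri,
        headers={
            "On-Behalf-Of": str(provider_id),
            "Authorization": f"Basic {token}",
        },
        method="GET",
    )


request = build_service_document_request(
    "http://lockssomatic.example.org/api/sword/2.0/sd-iri",
    12, "user", "pass")
```

Basic Auth sends credentials in a reversible encoding, so in practice it would normally be combined with HTTPS.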

Voting
Each LOCKSS box initiates a ‘poll’ at random intervals bounded by two box-specific configuration settings:
 <property name="contentpoll.min" value="30m" />
 <property name="contentpoll.max" value="2d" />
How long does it take for 100% agreement on an Archival Unit (AU)?
Because each box recrawls AUs (looking for changes/additions) on its own schedule, AUs that are constantly changing likely won't achieve 100% agreement until they stop changing. If all the copies of the AU are the same when it settles down, polling will reach 100% agreement once all boxes have called and completed a poll. If the copies aren't identical (usually because of random crawl errors), it may take a couple of polling cycles to resolve the differences.
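
To make the effect of the two settings concrete, the sketch below converts values such as "30m" and "2d" to seconds and draws a delay between them. It is an illustration only: the set of unit suffixes and the uniform distribution are assumptions, and LOCKSS's actual scheduling logic may differ.

```python
import random

# Assumed unit suffixes; only "m" and "d" appear in the settings above.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(value):
    """Convert a string such as '30m' or '2d' to seconds."""
    number, unit = value[:-1], value[-1]
    if unit not in UNIT_SECONDS or not number.isdigit():
        raise ValueError(f"Unrecognized duration: {value!r}")
    return int(number) * UNIT_SECONDS[unit]


def next_poll_delay(min_value, max_value, rng=random):
    """Pick a delay in seconds between the two bounds (uniform, by assumption)."""
    return rng.uniform(parse_duration(min_value), parse_duration(max_value))


delay = next_poll_delay("30m", "2d")
```

With these bounds a box would wait between 1,800 and 172,800 seconds before starting its next poll, which helps explain why full agreement across a network can take days.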