LOCKSS Integration

Revision as of 10:40, 14 March 2014 by Jhs


Development Status: In Progress
Public Release: estimated April 2014

Overview

This document outlines a minimalist SWORD API in which LOCKSS-O-Matic is the server and the Archivematica Storage Service is the client. @todos and questions appear throughout.

In the examples below, Archivematica is at http://archivematica.example.org and LOCKSS-O-Matic is at http://lockssomatic.example.org. The content file being managed by the SWORD deposit is an Archivematica AIP with the UUID dd3e3247-8466-4f2a-bb32-22a210cfce60. LOCKSS-O-Matic is managing a small Private LOCKSS Network containing two boxes, http://lockss1.example.org and http://lockss2.example.org.

Service Document

Archivematica issues a GET to the SD-IRI (http://lockssomatic.example.org/api/sword/2.0/sd-iri). The request includes an HTTP header named ‘LOM-Content-Provider’ whose value is the Archivematica instance’s content provider ID in the target LOM instance. LOCKSS-O-Matic responds with a Service Document like:

LOM-Content-Provider: 12

<service xmlns:dcterms="http://purl.org/dc/terms/"

       <atom:title>LOCKSS-O-Matic at Simon Fraser University</atom:title>     
       <collection href="http://lockssomatic.example.org/api/sword/2.0/col-iri/12">
           <atom:title>SFU Archivematica content provider</atom:title>
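As a rough illustration of the request described in this section, the Python sketch below builds (but does not send) the GET to the SD-IRI with the LOM-Content-Provider header. The function name is hypothetical and the provider ID 12 is just the sample value used here; only the header name and the SD-IRI come from this document.

```python
from urllib.request import Request

# SD-IRI from the example deployment described in this document.
SD_IRI = "http://lockssomatic.example.org/api/sword/2.0/sd-iri"


def build_service_document_request(provider_id, sd_iri=SD_IRI):
    """Build (without sending) the GET request for the Service Document.

    provider_id is the Archivematica instance's content provider ID
    in the target LOCKSS-O-Matic instance.
    """
    return Request(
        sd_iri,
        headers={"LOM-Content-Provider": str(provider_id)},
        method="GET",
    )


request = build_service_document_request(12)
print(request.get_method(), request.get_full_url())
```

Passing the resulting object to urllib.request.urlopen would perform the actual call against a live LOCKSS-O-Matic instance.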


Creating a Resource with an Atom Entry

Archivematica issues a POST to the Col-IRI, ensuring that the <id> element in the Atom entry contains the UUID of the AIP:

<entry xmlns="http://www.w3.org/2005/Atom"

   <title>Some AIP</title>
   <author><name>Name of PREMIS agent owning AIP</name></author>
   <summary type="text">The AIP’s dc:description if it has one. If not, use a generic summary.</summary>
   <lom:content size="102400" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001</lom:content>
   <lom:content size="46899" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>


LOCKSS-O-Matic responds with 201 Created, a Location header containing the entry’s Edit-IRI, and the deposit receipt.
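The deposit entry can be assembled programmatically. The following is a minimal sketch using Python's standard library, not the Storage Service's actual code. The function name is hypothetical, and writing the UUID as a urn:uuid: URI is an assumption, since this document only says that the <id> element must contain the UUID.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
LOM_NS = "http://lockssomatic.info/SWORD2"


def build_deposit_entry(aip_uuid, title, author, summary, files):
    """Return the Atom entry for a new deposit as a string.

    files is a list of (url, size_in_bytes, md5_hex) tuples,
    one tuple per AIP chunk.
    """
    ET.register_namespace("atom", ATOM_NS)
    ET.register_namespace("lom", LOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # Assumption: the UUID is expressed as a urn:uuid: URI.
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = f"urn:uuid:{aip_uuid}"
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    summary_el = ET.SubElement(entry, f"{{{ATOM_NS}}}summary", {"type": "text"})
    summary_el.text = summary
    for url, size, md5 in files:
        attributes = {
            "size": str(size),
            "checksumType": "md5",
            "checksumValue": md5,
        }
        ET.SubElement(entry, f"{{{LOM_NS}}}content", attributes).text = url
    return ET.tostring(entry, encoding="unicode")


AIP_UUID = "dd3e3247-8466-4f2a-bb32-22a210cfce60"
AIP_URL = f"http://archivematica.example.org/archival-storage/download/aip/{AIP_UUID}"
entry_xml = build_deposit_entry(
    AIP_UUID,
    "Some AIP",
    "Name of PREMIS agent owning AIP",
    "A generic summary.",
    [
        (f"{AIP_URL}.001", 102400, "bd4a9b642562547754086de2dab26b7d"),
        (f"{AIP_URL}.002", 46899, "226190d94b21d1b0c7b1a42d855e419d"),
    ],
)
print(entry_xml)
```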

Deposit Receipt

The deposit receipt contains the Cont-IRI, the EM-IRI, and the State-IRI.

<entry xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/">

   <sword:treatment>Stored in LOCKSS via LOCKSS-o-matic</sword:treatment>
   <content type="application/x-7z-compressed" src="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />
   <link rel="edit-media" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />
   <link rel="http://purl.org/net/sword/terms/add" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
   <link rel="edit" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
   <link rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml;type=feed" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/state" />
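A client needs to pull the IRIs out of the receipt before it can continue. The sketch below shows one way to do that in Python. The function name is hypothetical, and the sample receipt is a trimmed, well-formed stand-in modelled on the example in this section (the closing tag is added so that it parses).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
STATEMENT_REL = "http://purl.org/net/sword/terms/statement"
BASE = (
    "http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/"
    "dd3e3247-8466-4f2a-bb32-22a210cfce60"
)

# Trimmed stand-in for a deposit receipt; not a complete receipt.
SAMPLE_RECEIPT = f"""<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sword="http://purl.org/net/sword/">
  <link rel="edit-media" href="{BASE}" />
  <link rel="edit" href="{BASE}/edit" />
  <link rel="{STATEMENT_REL}" type="application/atom+xml;type=feed"
        href="{BASE}/state" />
</entry>"""


def extract_iris(receipt_xml):
    """Map each link's rel value to its href in a deposit receipt."""
    root = ET.fromstring(receipt_xml)
    return {
        link.get("rel"): link.get("href")
        for link in root.findall(f"{{{ATOM_NS}}}link")
    }


iris = extract_iris(SAMPLE_RECEIPT)
print(iris[STATEMENT_REL])
```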


SWORD Statement

The State-IRI is defined in the deposit receipt. GET requests to the State-IRI for a resource return an Atom feed as described below.

The allowed state terms and message values, using the namespace http://lockssomatic.info/SWORD2, are:

Term            Message (one of)
failed          Content cannot be harvested by LOCKSS.
disagreement    LOCKSS network is not in agreement on content checksums.
agreement       LOCKSS network agrees internally on content checksums.

If LOCKSS-O-Matic reports ‘agreement’, Archivematica may delete the AIP from local storage.

Sample SWORD statement serialized as an Atom feed:

<atom:feed xmlns:sword="http://purl.org/net/sword/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:lom="http://lockssomatic.info/SWORD2">

   <atom:category scheme="http://purl.org/net/sword/terms/" term="http://purl.org/net/sword/terms/originalDeposit" label="Original Deposit"/>
   <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001">
       <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
       <lom:server id="2" state="failed" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
   <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002">
       <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
       <lom:server id="2" state="disagreement" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
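To decide whether the local AIP may be deleted, a client has to read the per-box states out of the statement. The sketch below is one possible reading, not a confirmed rule: it assumes that ‘agreement’ must be reported by every box for every content file. The function names are hypothetical, and the sample feed is a trimmed stand-in for the statement in this section, with closing tags added so that it parses.

```python
import xml.etree.ElementTree as ET

LOM_NS = "http://lockssomatic.info/SWORD2"
AIP_URL = (
    "http://archivematica.example.org/archival-storage/download/aip/"
    "dd3e3247-8466-4f2a-bb32-22a210cfce60"
)

# Trimmed stand-in for a SWORD statement (src and checksum attributes omitted).
SAMPLE_STATEMENT = f"""<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"
           xmlns:lom="http://lockssomatic.info/SWORD2">
  <lom:content id="{AIP_URL}.001">
    <lom:server id="1" state="agreement" />
    <lom:server id="2" state="failed" />
  </lom:content>
  <lom:content id="{AIP_URL}.002">
    <lom:server id="1" state="agreement" />
    <lom:server id="2" state="disagreement" />
  </lom:content>
</atom:feed>"""


def states_by_file(statement_xml):
    """Return {content id: {server id: state}} for a statement feed."""
    root = ET.fromstring(statement_xml)
    result = {}
    for content in root.findall(f"{{{LOM_NS}}}content"):
        servers = content.findall(f"{{{LOM_NS}}}server")
        result[content.get("id")] = {s.get("id"): s.get("state") for s in servers}
    return result


def safe_to_delete(statement_xml):
    """Assumption: deletion is safe only if every box agrees on every file."""
    states = states_by_file(statement_xml)
    return bool(states) and all(
        bool(servers) and all(state == "agreement" for state in servers.values())
        for servers in states.values()
    )


print(safe_to_delete(SAMPLE_STATEMENT))
```

With the sample feed, one box reports ‘failed’ and another ‘disagreement’, so under this strict reading the AIP would be kept.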


Updating Metadata

Archivematica informs LOCKSS-O-Matic that it wants to delete the local copy and that LOM should no longer harvest it.

POST SE-IRI
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:lom="http://lockssomatic.info/SWORD2">

   <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-


   <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
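The update entry can be generated in the same way as the deposit entry. This is a minimal Python sketch with a hypothetical function name. It emits only the lom:content elements with recrawl="false", and the content URLs used are the two chunk URLs from the deposit example earlier in this document.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
LOM_NS = "http://lockssomatic.info/SWORD2"


def build_no_recrawl_entry(content_urls):
    """Return an Atom entry marking each content URL with recrawl="false"."""
    ET.register_namespace("atom", ATOM_NS)
    ET.register_namespace("lom", LOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    for url in content_urls:
        content = ET.SubElement(entry, f"{{{LOM_NS}}}content", {"recrawl": "false"})
        content.text = url
    return ET.tostring(entry, encoding="unicode")


AIP_BASE = (
    "http://archivematica.example.org/archival-storage/download/aip/"
    "dd3e3247-8466-4f2a-bb32-22a210cfce60"
)
update_xml = build_no_recrawl_entry([f"{AIP_BASE}.001", f"{AIP_BASE}.002"])
print(update_xml)
```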


LOM can return:
- HTTP 202 (Accepted): LOM is updating the LOCKSS config files so that this content is no longer harvested, but the update is not yet complete.
- HTTP 200 (OK): all config updates are complete.
- HTTP 204 (No Content): there is no matching AIP.
- HTTP 409 (Conflict): there are files in the LOCKSS AU that do not have ‘recrawl=false’.
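On the client side, the status codes listed above could be handled with a small lookup. This is a sketch only: the names are hypothetical, and treating 202 as the only code that calls for checking back later is an assumption rather than documented behaviour.

```python
# Meanings taken from the list of LOM responses in this section.
UPDATE_RESPONSES = {
    200: "complete: all LOCKSS config updates are done",
    202: "pending: LOM is still updating the LOCKSS config files",
    204: "not found: no matching AIP",
    409: "conflict: some files in the AU lack recrawl=false",
}


def describe_update_response(status_code):
    """Translate a LOM status code into a short, human-readable outcome."""
    return UPDATE_RESPONSES.get(status_code, f"unexpected status {status_code}")


def should_poll_again(status_code):
    """Assumption: only 202 (Accepted) means the client should check back later."""
    return status_code == 202


for code in (202, 200, 204, 409):
    print(code, describe_update_response(code))
```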

“Once the server has processed the request it MUST return an HTTP status code 200 (OK) or 204 (No Content), or an appropriate error code.”

Comments/questions

Can something be deleted from LOCKSS-O-Matic? How?
- Not really, since LOCKSS doesn’t let you delete content, at least not directly (it can be done on the boxes’ command lines). In LOM we could have a flag that specifies that content should not be harvested into the LOCKSS network.
- Could be useful.

Is there any sort of authentication/authorization on the API?
- At a minimum, IP whitelisting. We can also do HTTP Basic (which is a “SHOULD” in SWORD).
- How are you handling it with Islandora?
- I’m not working on Islandora; I’ll have to ask MikeC.
- LOCKSS itself relies on both whitelisting and HTTP Basic.

Voting

Each LOCKSS box initiates a ‘poll’ at random intervals bounded by two box-specific configuration settings:

<property name="contentpoll.min" value="30m" />
<property name="contentpoll.max" value="2d" />
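To make the bounds concrete, the sketch below converts the two values to seconds and draws a delay between them. It is an illustration, not LOCKSS code. The set of unit suffixes (s, m, h, d, w) and the uniform draw are assumptions, since this document does not say how a box picks its interval, and the function names are hypothetical.

```python
import random
import re

# Assumed suffixes; this document only shows "m" (minutes) and "d" (days).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_interval(value):
    """Convert a value such as "30m" or "2d" to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", value.strip())
    if match is None:
        raise ValueError(f"unrecognised interval: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]


def next_poll_delay(min_value, max_value, rng=random):
    """Pick a delay in seconds between the two bounds (uniform: an assumption)."""
    low, high = parse_interval(min_value), parse_interval(max_value)
    if low > high:
        raise ValueError("minimum interval exceeds maximum interval")
    return rng.uniform(low, high)


delay = next_poll_delay("30m", "2d")
print(f"next poll in about {delay / 3600:.1f} hours")
```

With the settings shown, any delay from 1800 seconds (30 minutes) to 172800 seconds (2 days) is possible, which is one reason boxes may take a while to all complete a poll.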

How long does it take for 100% agreement on an Archival Unit (AU)? Tom Lipkis, LOCKSS lead developer, says:

"Because each box recrawls AUs (looking for changes/additions) on its own schedule, AUs that are constantly changing likely won't achieve 100% agreement until they stop changing. We have some solutions for this in mind (the simplest of which is to exclude from polling files that appeared more recently than the recrawl interval), but at present monitoring systems shouldn't be alarmed at less than 100% agreement in this situation. If all the copies of the AU are the same when it settles down it will reach 100% agreement once all boxes have called and completed a poll [....] If the copies aren't identical (usually because of transient crawl errors) it may take a couple cycles of polling to resolve the differences."