Difference between revisions of "LOCKSS Integration"

From Archivematica
(Created page with "Main Page > Development > Development documentation > LOCKSS Integration Development Status: In Progress Public Release: estim...")
 
Line 1: Line 1:
 
[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > LOCKSS Integration
 
[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > LOCKSS Integration
  
Development Status: In Progress
+
===Status===
Public Release: estimated April 2014
 
  
Overview
+
'''Sponsored''': SFU Library<br>
 +
'''Development Status''': In Progress<br>
 +
'''Public Release''': estimated April 2014
 +
 
 +
 
 +
===Overview===
 
This document outlines a minimalist SWORD API where LOCKSS-O-Matic is the server, and the Archivematica Storage Service is the client. @todos and questions appear in <!-- comments --> throughout.
 
This document outlines a minimalist SWORD API where LOCKSS-O-Matic is the server, and the Archivematica Storage Service is the client. @todos and questions appear in <!-- comments --> throughout.
  
 
In the examples below, Archivematica is at http://archivematica.example.org and LOCKSS-O-Matic is at http://lockssomatic.example.org. The content file being managed by the SWORD deposit is an Archivematica AIP with the UUID dd3e3247-8466-4f2a-bb32-22a210cfce60. LOCKSS-O-Matic is managing a small Private LOCKSS Network containing two boxes, http://lockss1.example.org and http://lockss2.example.org.
 
In the examples below, Archivematica is at http://archivematica.example.org and LOCKSS-O-Matic is at http://lockssomatic.example.org. The content file being managed by the SWORD deposit is an Archivematica AIP with the UUID dd3e3247-8466-4f2a-bb32-22a210cfce60. LOCKSS-O-Matic is managing a small Private LOCKSS Network containing two boxes, http://lockss1.example.org and http://lockss2.example.org.
  
Service Document
+
===Service Document===
 
Archivematica issues a GET (with an HTTP header of ‘LOM-Content-Provider’ with the value of the Archivematica instance’s content provider ID in the target LOM instance) to the SD-IRI (http://lockssomatic.example.org/api/sword/2.0/sd-iri). LOCKSS-O-Matic responds with a Service Document like:
 
Archivematica issues a GET (with an HTTP header of ‘LOM-Content-Provider’ with the value of the Archivematica instance’s content provider ID in the target LOM instance) to the SD-IRI (http://lockssomatic.example.org/api/sword/2.0/sd-iri). LOCKSS-O-Matic responds with a Service Document like:
  
LOM-Content-Provider: 12
+
''LOM-Content-Provider: 12''
  
 +
<pre>
 
<service xmlns:dcterms="http://purl.org/dc/terms/"
 
<service xmlns:dcterms="http://purl.org/dc/terms/"
 
     xmlns:sword="http://purl.org/net/sword/terms/"
 
     xmlns:sword="http://purl.org/net/sword/terms/"
Line 39: Line 44:
 
     </workspace>
 
     </workspace>
 
</service>
 
</service>
 +
</pre>
  
  
 
+
===Creating a Resource with an Atom Entry===
Creating a Resource with an Atom Entry
 
 
Archivematica issues a POST to the Col-IRI, ensuring that the <id> element in the Atom entry contains the UUID of the AIP:
 
Archivematica issues a POST to the Col-IRI, ensuring that the <id> element in the Atom entry contains the UUID of the AIP:
  
 +
<pre>
 
<entry xmlns="http://www.w3.org/2005/Atom"     
 
<entry xmlns="http://www.w3.org/2005/Atom"     
 
         xmlns:dcterms="http://purl.org/dc/terms/"
 
         xmlns:dcterms="http://purl.org/dc/terms/"
Line 56: Line 62:
 
     <lom:content size="46899" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
 
     <lom:content size="46899" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
 
</entry>
 
</entry>
 
+
</pre>
  
 
LOCKSS-O-Matic responds with a 201 Created and the Location header with the entry’s Edit-IRI, and the deposit receipt.  
 
LOCKSS-O-Matic responds with a 201 Created and the Location header with the entry’s Edit-IRI, and the deposit receipt.  
  
Deposit Receipt
+
===Deposit Receipt===
 
Contains the Cont-IRI, the EM-IRI, and the State-IRI.
 
Contains the Cont-IRI, the EM-IRI, and the State-IRI.
  
 +
<pre>
 
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/">
 
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/">
 
     <sword:treatment>Stored in LOCKSS via LOCKSS-o-matic</sword:treatment>
 
     <sword:treatment>Stored in LOCKSS via LOCKSS-o-matic</sword:treatment>
Line 78: Line 85:
 
     <link rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml;type=feed" href="http://lockssomatic.example.org/api/sword/2.0/cont-IRI/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/state" />
 
     <link rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml;type=feed" href="http://lockssomatic.example.org/api/sword/2.0/cont-IRI/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/state" />
 
</entry>
 
</entry>
 +
</pre>
  
 
+
===SWORD Statement===
SWORD Statement
 
 
State-IRI is defined in the deposit receipt. GET requests to the State-IRI for a resource will return an Atom feed as described below.
 
State-IRI is defined in the deposit receipt. GET requests to the State-IRI for a resource will return an Atom feed as described below.
  
 
These state terms apply to the Allowed state term and message values, using the namespace http://lockssomatic.info/SWORD2,  are:
 
These state terms apply to the Allowed state term and message values, using the namespace http://lockssomatic.info/SWORD2,  are:
  
Term
+
{| class="wikitable"
Message (one of)
+
|-
failed
+
! Term
Content cannot be harvested by LOCKSS.
+
! Message (one of)
 
+
|-
 
+
| failed
disagreement
+
| Content cannot be harvested by LOCKSS.
LOCKSS network is not in agreement on content checksums.
+
|-
agreement
+
| disagreement
LOCKSS network agrees internally on content checksums.
+
| LOCKSS network is not in agreement on content checksums.
 +
|-
 +
| agreement
 +
| LOCKSS network agrees internally on content checksums.
 +
|}
  
 
If LOCKSS-O-Matic reports ‘agreement’, Archivematica may delete the AIP from local storage.
 
If LOCKSS-O-Matic reports ‘agreement’, Archivematica may delete the AIP from local storage.
  
 
Sample SWORD statement serialized as an Atom feed:
 
Sample SWORD statement serialized as an Atom feed:
 
+
<pre>
 
<atom:feed xmlns:sword="http://purl.org/net/sword/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:lom="http://lockssomatic.info/SWORD2">
 
<atom:feed xmlns:sword="http://purl.org/net/sword/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:lom="http://lockssomatic.info/SWORD2">
 
   <atom:entry>
 
   <atom:entry>
Line 121: Line 132:
 
   </atom:entry>
 
   </atom:entry>
 
</atom:feed>
 
</atom:feed>
 +
</pre>
  
 
+
===Updating Metadata===
Updating Metadata
 
 
Informing LOCKSS-o-matic that Archivematica wants to delete the local copy , and that LOM should not harvest it anymore.
 
Informing LOCKSS-o-matic that Archivematica wants to delete the local copy , and that LOM should not harvest it anymore.
  
POST SE-IRI  
+
POST SE-IRI
 +
<pre>
 
<?xml version="1.0"?>
 
<?xml version="1.0"?>
 
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:lom="http://lockssomatic.info/SWORD2">
 
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:lom="http://lockssomatic.info/SWORD2">
Line 134: Line 146:
 
     <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
 
     <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
 
</entry>
 
</entry>
 
+
</pre>
 
LOM can return:
 
LOM can return:
- HTTP 202 (Accepted) meaning LOM is updating the LOCKSS config files saying not to harvest this, but it is not done yet.
+
*- HTTP 202 (Accepted) meaning LOM is updating the LOCKSS config files saying not to harvest this, but it is not done yet.
- HTTP 200 (OK) meaning all config updates are complete
+
*- HTTP 200 (OK) meaning all config updates are complete
- HTTP 204 (No Content) if there is no matching aip
+
*- HTTP 204 (No Content) if there is no matching aip
- HTTP 409 (Conflict) There are files in the LOCKSS AU that do not have ‘recrawl=false’.
+
*- HTTP 409 (Conflict) There are files in the LOCKSS AU that do not have ‘recrawl=false’.
  
 
“Once the server has processed the request it MUST return an HTTP status code 200 (OK) or 204 (No Content), or an appropriate error code. “
 
“Once the server has processed the request it MUST return an HTTP status code 200 (OK) or 204 (No Content), or an appropriate error code. “
Comments/questions
 
  
Can something be deleted from LOCKSS-o-matic? How?  
+
===Comments/questions===
Not really, since LOCKSS doesn’t let you delete content, not directly anyway (it can be done on the boxes’ command lines). In LOM we could have a flag that specifies that content should not be harvested into the LOCKSS network.  
+
 
 +
;'''Can something be deleted from LOCKSS-o-matic? How?  
 +
:Not really, since LOCKSS doesn’t let you delete content, not directly anyway (it can be done on the boxes’ command lines). In LOM we could have a flag that specifies that content should not be harvested into the LOCKSS network.  
  
Could be useful.
+
 
Is there any sort of authentication/authorization on the API?
+
;'''Is there any sort of authentication/authorization on the API?
At a minimum IP whitelisting. We can also do HTTP Basic (which is a “SHOULD” in SWORD). How are you handling it with Islandora?
+
: This should support HTTP Basic Auth (which is a “SHOULD” in SWORD). At a minimum IP whitelisting.
I’m not working on Islandora - I’ll have to ask MikeC
 
 
LOCKSS itself relies on both whitelisting and HTTP Basic
 
LOCKSS itself relies on both whitelisting and HTTP Basic
  
Voting
+
;'''Voting
Each LOCKSS box initiates a ‘poll’ at random intervals bounded by two box-specific configuration settings:
+
:Each LOCKSS box initiates a ‘poll’ at random intervals bounded by two box-specific configuration settings:
  
 +
<pre>
 
  <property name="contentpoll.min" value="30m" />
 
  <property name="contentpoll.min" value="30m" />
      <property name="contentpoll.max" value="2d" />
+
<property name="contentpoll.max" value="2d" />
 
+
</pre>
How long does it take for 100% agreement on an Archival Unit (AU)?
 
Tom Lipkis, LOCKSS lead developer, says:
 
 
 
"Because each box recrawls AUs (looking for changes/additions) on its own schedule, AUs that are constantly changing likely won't achieve 100% agreement until they stop changing.  We have some solutions for this in mind (the simplest of which is to exclude from polling files that appeared more recently than the recrawl interval), but at present monitoring systems shouldn't be alarmed at less than 100% agreement in this situation. If all the copies of the AU are the same when it settles down it will reach 100% agreement once all boxes have called and completed a poll [....] If the copies aren't identical (usually because of transient crawl errors) it may take a couple cycles of polling to resolve the differences."
 
 
 
  
 +
;'''How long does it take for 100% agreement on an Archival Unit (AU)?
 +
:Tom Lipkis, LOCKSS lead developer, says:
 +
<blockquote>
 +
"Because each box recrawls AUs (looking for changes/additions) on its own schedule, AUs that are constantly changing likely won't achieve 100% agreement until they stop changing.  We have some solutions for this in mind (the simplest of which is to exclude from polling files that appeared more recently than the recrawl interval), but at present monitoring systems shouldn't be alarmed at less than 100% agreement in this situation. If all the copies of the AU are the same when it settles down it will reach 100% agreement once all boxes have called and completed a poll [....] If the copies aren't identical (usually because of transient crawl errors) it may take a couple cycles of polling to resolve the differences."</blockquote>
  
  
 
[[Category:Development documentation]]
 
[[Category:Development documentation]]

Revision as of 10:08, 14 March 2014


Status

Sponsored: SFU Library
Development Status: In Progress
Public Release: estimated April 2014


Overview

This document outlines a minimalist SWORD API where LOCKSS-O-Matic is the server, and the Archivematica Storage Service is the client. @todos and questions appear in <!-- comments --> throughout.

In the examples below, Archivematica is at http://archivematica.example.org and LOCKSS-O-Matic is at http://lockssomatic.example.org. The content file being managed by the SWORD deposit is an Archivematica AIP with the UUID dd3e3247-8466-4f2a-bb32-22a210cfce60. LOCKSS-O-Matic is managing a small Private LOCKSS Network containing two boxes, http://lockss1.example.org and http://lockss2.example.org.

Service Document

Archivematica issues a GET to the SD-IRI (http://lockssomatic.example.org/api/sword/2.0/sd-iri), sending an HTTP header named ‘LOM-Content-Provider’ whose value is the Archivematica instance’s content provider ID in the target LOM instance, for example:

LOM-Content-Provider: 12

LOCKSS-O-Matic responds with a Service Document like:

<service xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:sword="http://purl.org/net/sword/terms/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:lom="http://lockssomatic.info/SWORD2"
    xmlns="http://www.w3.org/2007/app">

    <sword:version>2.0</sword:version>
    <!-- maxUploadSize is configurable in LOCKSS-O-Matic. -->
    <!-- measured in kB -->
    <sword:maxUploadSize>102400</sword:maxUploadSize>
   
    <!-- uploadChecksumType chosen from a list - suggested: md5, sha1, sha256 -->
    <lom:uploadChecksumType>md5</lom:uploadChecksumType>

    <workspace>
        <atom:title>LOCKSS-O-Matic at Simon Fraser University</atom:title>     
        <!-- Each LOCKSS-O-Matic content provider will have its own SWORD collection; 
            in this case, http://archivematica.example.org  is collection 12. -->
        <collection href="http://lockssomatic.example.org/api/sword/2.0/col-iri/12">
            <atom:title>SFU Archivematica content provider</atom:title>
            <accept>application/atom+xml;type=entry</accept>
            <sword:mediation>true</sword:mediation>
        </collection>
    </workspace>
</service>
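
The request and response above can be sketched from the client side as follows. This is a minimal illustration, not the Storage Service's actual code: the function names and the trimmed sample document are hypothetical, and the request is only constructed, never sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace prefixes used when searching the Service Document.
NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/terms/",
    "lom": "http://lockssomatic.info/SWORD2",
}

# Trimmed copy of the Service Document shown above.
SAMPLE_SERVICE_DOCUMENT = """\
<service xmlns="http://www.w3.org/2007/app"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sword="http://purl.org/net/sword/terms/"
    xmlns:lom="http://lockssomatic.info/SWORD2">
    <sword:version>2.0</sword:version>
    <sword:maxUploadSize>102400</sword:maxUploadSize>
    <lom:uploadChecksumType>md5</lom:uploadChecksumType>
    <workspace>
        <collection href="http://lockssomatic.example.org/api/sword/2.0/col-iri/12">
            <atom:title>SFU Archivematica content provider</atom:title>
        </collection>
    </workspace>
</service>
"""


def build_service_document_request(sd_iri, provider_id):
    """Prepare (but do not send) the GET request for the Service Document."""
    return urllib.request.Request(
        sd_iri,
        headers={"LOM-Content-Provider": str(provider_id)},
        method="GET",
    )


def parse_service_document(xml_text):
    """Extract the values a client needs before making a deposit."""
    root = ET.fromstring(xml_text)
    collection = root.find("app:workspace/app:collection", NS)
    return {
        "max_upload_kb": int(root.findtext("sword:maxUploadSize", namespaces=NS)),
        "checksum_type": root.findtext("lom:uploadChecksumType", namespaces=NS),
        "col_iri": collection.get("href"),
    }


if __name__ == "__main__":
    print(parse_service_document(SAMPLE_SERVICE_DOCUMENT))
```

The Col-IRI returned here is the target of the deposit POST described in the next section.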


Creating a Resource with an Atom Entry

Archivematica issues a POST to the Col-IRI, ensuring that the <id> element in the Atom entry contains the UUID of the AIP:

<entry xmlns="http://www.w3.org/2005/Atom"    
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:lom="http://lockssomatic.info/SWORD2">
    <title>Some AIP</title>
    <id>urn:uuid:dd3e3247-8466-4f2a-bb32-22a210cfce60</id>
    <updated>2013-10-07T17:17:08Z</updated>
    <author><name>Name of PREMIS agent owning AIP</name></author>
    <summary type="text">The AIP’s dc:description if it has one. If not, use a generic summary.</summary>
    <lom:content size="102400" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001</lom:content>
    <lom:content size="46899" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
</entry>

LOCKSS-O-Matic responds with HTTP 201 (Created), a Location header containing the entry’s Edit-IRI, and the deposit receipt.
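
A client might assemble the deposit entry roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, and because this document does not state the unit of the size attribute, the sketch reports kB (rounded up) on the assumption that it matches maxUploadSize. The output uses an explicit atom: prefix rather than a default namespace, which is equivalent XML.

```python
import hashlib
import math
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
LOM = "http://lockssomatic.info/SWORD2"

# Use readable prefixes instead of ns0/ns1 when serializing.
ET.register_namespace("atom", ATOM)
ET.register_namespace("lom", LOM)


def build_deposit_entry(aip_uuid, title, chunks, updated):
    """Build the Atom entry for a deposit.

    `chunks` is a list of (download_url, chunk_bytes) pairs, one per
    lom:content element. `updated` is an RFC 3339 timestamp string.
    """
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    # The <id> element carries the UUID of the AIP.
    ET.SubElement(entry, f"{{{ATOM}}}id").text = f"urn:uuid:{aip_uuid}"
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    for url, data in chunks:
        content = ET.SubElement(
            entry,
            f"{{{LOM}}}content",
            {
                # Assumption: size is in kB, like maxUploadSize.
                "size": str(math.ceil(len(data) / 1024)),
                "checksumType": "md5",
                "checksumValue": hashlib.md5(data).hexdigest(),
            },
        )
        content.text = url
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    base = "http://archivematica.example.org/archival-storage/download/aip/"
    uuid = "dd3e3247-8466-4f2a-bb32-22a210cfce60"
    print(build_deposit_entry(
        uuid,
        "Some AIP",
        [(base + uuid + ".001", b"first chunk"), (base + uuid + ".002", b"second chunk")],
        "2013-10-07T17:17:08Z",
    ))
```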

Deposit Receipt

The deposit receipt contains the Cont-IRI, the EM-IRI, the SE-IRI, the Edit-IRI, and the State-IRI.

<entry xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/terms/">
    <sword:treatment>Stored in LOCKSS via LOCKSS-o-matic</sword:treatment>
    <content type="application/x-7z-compressed" src="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />

    <!-- EM-IRI. The EM-IRI and Cont-IRI can (and in LOCKSS-O-Matic, will) have the same value.-->
    <link rel="edit-media" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60" />

    <!-- SE-IRI (can be same as Edit-IRI) -->
    <!-- Archivematica will POST to this iri when deleting local content -->
    <link rel="http://purl.org/net/sword/terms/add" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
    <!-- Edit-IRI -->
    <link rel="edit" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/edit" />
    <!-- In LOCKSS-O-Matic, the State-IRI will be the EM-IRI/Cont-IRI with the string ‘/state’ appended. -->
    <link rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml;type=feed" href="http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60/state" />
</entry>
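
One way a client could pull the IRIs out of a receipt like the one above is to key on each link's rel value. The sketch below is illustrative only: the function name, the result keys, and the trimmed sample receipt are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Map each link relation in the receipt to a short key for the result dict.
REL_NAMES = {
    "edit-media": "em_iri",
    "edit": "edit_iri",
    "http://purl.org/net/sword/terms/add": "se_iri",
    "http://purl.org/net/sword/terms/statement": "state_iri",
}

_BASE = "http://lockssomatic.example.org/api/sword/2.0/cont-iri/12/dd3e3247-8466-4f2a-bb32-22a210cfce60"

# Trimmed copy of the deposit receipt shown above.
SAMPLE_RECEIPT = f"""\
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/terms/">
    <sword:treatment>Stored in LOCKSS via LOCKSS-o-matic</sword:treatment>
    <content type="application/x-7z-compressed" src="{_BASE}" />
    <link rel="edit-media" href="{_BASE}" />
    <link rel="http://purl.org/net/sword/terms/add" href="{_BASE}/edit" />
    <link rel="edit" href="{_BASE}/edit" />
    <link rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml;type=feed" href="{_BASE}/state" />
</entry>
"""


def parse_deposit_receipt(xml_text):
    """Return a dict of the IRIs found in a deposit receipt."""
    root = ET.fromstring(xml_text)
    iris = {}
    content = root.find(f"{ATOM}content")
    if content is not None:
        iris["cont_iri"] = content.get("src")
    for link in root.iter(f"{ATOM}link"):
        name = REL_NAMES.get(link.get("rel"))
        if name:
            iris[name] = link.get("href")
    return iris


if __name__ == "__main__":
    for name, iri in parse_deposit_receipt(SAMPLE_RECEIPT).items():
        print(name, iri)
```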

SWORD Statement

State-IRI is defined in the deposit receipt. GET requests to the State-IRI for a resource will return an Atom feed as described below.

The allowed state terms and message values, using the namespace http://lockssomatic.info/SWORD2, are:

{| class="wikitable"
|-
! Term
! Message (one of)
|-
| failed
| Content cannot be harvested by LOCKSS.
|-
| disagreement
| LOCKSS network is not in agreement on content checksums.
|-
| agreement
| LOCKSS network agrees internally on content checksums.
|}

If LOCKSS-O-Matic reports ‘agreement’, Archivematica may delete the AIP from local storage.

Sample SWORD statement serialized as an Atom feed:

<atom:feed xmlns:sword="http://purl.org/net/sword/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:lom="http://lockssomatic.info/SWORD2">
  <atom:entry>
    <atom:category scheme="http://purl.org/net/sword/terms/" term="http://purl.org/net/sword/terms/originalDeposit" label="Original Deposit"/>

    <!-- This content node is for a chunk (i.e. a LOM content entry). -->
    <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001">
      <lom:serverlist>
        <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
        <lom:server id="2" state="failed" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.001" checksumType="md5" checksumValue="bd4a9b642562547754086de2dab26b7d" />
      </lom:serverlist>
    </lom:content>

    <lom:content id="http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002">
      <lom:serverlist>
        <lom:server id="1" state="agreement" src="http://lockss1.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
        <lom:server id="2" state="disagreement" src="http://lockss2.example.org:8083/ServeContent?url=http://archivematicastorage.example.com/lockssomatic/dd3e3247-8466-4f2a-bb32-22a210cfce60.002" checksumType="md5" checksumValue="226190d94b21d1b0c7b1a42d855e419d" />
      </lom:serverlist>
    </lom:content>

  </atom:entry>
</atom:feed>
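
A client could read a statement like this one to decide whether the local AIP may be deleted. The sketch below assumes a strict policy that this document does not itself mandate: every LOCKSS box must report ‘agreement’ for every chunk. The function names and the trimmed sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "lom": "http://lockssomatic.info/SWORD2",
}

_CHUNK = "http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001"

# Trimmed statement: one chunk, two boxes, one of which has failed.
SAMPLE_STATEMENT = f"""\
<atom:feed xmlns:atom="http://www.w3.org/2005/Atom" xmlns:lom="http://lockssomatic.info/SWORD2">
  <atom:entry>
    <lom:content id="{_CHUNK}">
      <lom:serverlist>
        <lom:server id="1" state="agreement" />
        <lom:server id="2" state="failed" />
      </lom:serverlist>
    </lom:content>
  </atom:entry>
</atom:feed>
"""


def chunk_states(statement_xml):
    """Map each chunk URL to the list of states reported by the LOCKSS boxes."""
    root = ET.fromstring(statement_xml)
    states = {}
    for content in root.iterfind(".//lom:content", NS):
        states[content.get("id")] = [
            server.get("state")
            for server in content.iterfind("lom:serverlist/lom:server", NS)
        ]
    return states


def safe_to_delete(statement_xml):
    """True only if every box reports 'agreement' for every chunk.

    Assumption: a statement with no chunks, or a chunk with no servers,
    is treated as not safe.
    """
    states = chunk_states(statement_xml)
    return bool(states) and all(
        bool(box_states) and all(state == "agreement" for state in box_states)
        for box_states in states.values()
    )


if __name__ == "__main__":
    print(chunk_states(SAMPLE_STATEMENT))
    print(safe_to_delete(SAMPLE_STATEMENT))
```

With the sample above, the second box's ‘failed’ state means the AIP would be kept locally.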

Updating Metadata

This request informs LOCKSS-O-Matic that Archivematica wants to delete the local copy and that LOM should no longer harvest it.

POST SE-IRI

 
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:lom="http://lockssomatic.info/SWORD2">
    <id>urn:uuid:dd3e3247-8466-4f2a-bb32-22a210cfce60</id>
    <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.001</lom:content>
    <lom:content recrawl="false">http://archivematica.example.org/archival-storage/download/aip/dd3e3247-8466-4f2a-bb32-22a210cfce60.002</lom:content>
</entry>

LOM can return:

  • HTTP 202 (Accepted), meaning LOM is updating the LOCKSS config files to stop harvesting this content, but the update is not yet complete.
  • HTTP 200 (OK), meaning all config updates are complete.
  • HTTP 204 (No Content), meaning there is no matching AIP.
  • HTTP 409 (Conflict), meaning there are files in the LOCKSS AU that do not have ‘recrawl=false’.

“Once the server has processed the request it MUST return an HTTP status code 200 (OK) or 204 (No Content), or an appropriate error code.”
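
On the client side, the status codes listed above could be handled roughly as follows. The enum and function names are hypothetical, and treating only 200 as the signal to go ahead with local deletion is an assumption of this sketch, not something this document specifies.

```python
from enum import Enum


class UpdateOutcome(Enum):
    PENDING = "pending"            # 202: config update still in progress
    COMPLETE = "complete"          # 200: all config updates are done
    NO_MATCHING_AIP = "no match"   # 204: LOM does not know this AIP
    CONFLICT = "conflict"          # 409: some files lack recrawl=false


_STATUS_TO_OUTCOME = {
    202: UpdateOutcome.PENDING,
    200: UpdateOutcome.COMPLETE,
    204: UpdateOutcome.NO_MATCHING_AIP,
    409: UpdateOutcome.CONFLICT,
}


def interpret_update_status(status_code):
    """Translate LOM's HTTP status for the SE-IRI POST into an outcome."""
    try:
        return _STATUS_TO_OUTCOME[status_code]
    except KeyError:
        raise ValueError(f"unexpected HTTP status from LOM: {status_code}") from None


def may_delete_local_copy(status_code):
    """Assumption: delete locally only once LOM reports the update complete."""
    return interpret_update_status(status_code) is UpdateOutcome.COMPLETE


if __name__ == "__main__":
    for code in (202, 200, 204, 409):
        print(code, interpret_update_status(code).name, may_delete_local_copy(code))
```

A client that receives 202 would presumably retry or poll later rather than delete immediately.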

Comments/questions

Can something be deleted from LOCKSS-o-matic? How?
Not directly: LOCKSS doesn’t let you delete content, although it can be done on the boxes’ command lines. In LOM we could have a flag that specifies that content should not be harvested into the LOCKSS network.


Is there any sort of authentication/authorization on the API?
The API should support HTTP Basic Auth (which is a “SHOULD” in SWORD). At a minimum, it should use IP whitelisting.

LOCKSS itself relies on both whitelisting and HTTP Basic authentication.

Voting
Each LOCKSS box initiates a ‘poll’ at random intervals bounded by two box-specific configuration settings:
 <property name="contentpoll.min" value="30m" />
 <property name="contentpoll.max" value="2d" />
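
To make the effect of these two bounds concrete, the sketch below picks a delay between them. It rests on assumptions not stated in this document: that the suffixes s, m, h, d and w mean seconds, minutes, hours, days and weeks, and that the delay is drawn uniformly. It is not LOCKSS's actual scheduling code, and the function names are hypothetical.

```python
import random
import re

# Assumed meaning of the duration suffixes, in seconds.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text):
    """Convert a value such as '30m' or '2d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if not match:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def next_poll_delay(min_text, max_text, rng=random):
    """Pick a delay in seconds between the two configured bounds.

    Assumption: a uniform draw; the real distribution may differ.
    """
    low, high = parse_duration(min_text), parse_duration(max_text)
    if low > high:
        raise ValueError("contentpoll.min must not exceed contentpoll.max")
    return rng.uniform(low, high)


if __name__ == "__main__":
    delay = next_poll_delay("30m", "2d")
    print(f"next poll in about {delay / 3600:.1f} hours")
```

With the values shown, each box would wait somewhere between 30 minutes and 2 days before starting its next poll.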

How long does it take for 100% agreement on an Archival Unit (AU)?
Tom Lipkis, LOCKSS lead developer, says:

"Because each box recrawls AUs (looking for changes/additions) on its own schedule, AUs that are constantly changing likely won't achieve 100% agreement until they stop changing. We have some solutions for this in mind (the simplest of which is to exclude from polling files that appeared more recently than the recrawl interval), but at present monitoring systems shouldn't be alarmed at less than 100% agreement in this situation. If all the copies of the AU are the same when it settles down it will reach 100% agreement once all boxes have called and completed a poll [....] If the copies aren't identical (usually because of transient crawl errors) it may take a couple cycles of polling to resolve the differences."