Difference between revisions of "Storage Service"

From Archivematica
Revision as of 19:14, 10 March 2017

The Archivematica Storage Service is a standalone web application that moves files to Archivematica for processing and from Archivematica into long-term storage, and keeps track of their location for later retrieval.

There are 2 main configuration levels in the Storage Service: Spaces and Locations.

  • Space: where the files are stored. This is the protocol used to fetch and store the files in a storage system. Examples: Local filesystem, Duracloud. Spaces contain Locations.
  • Location: why the files are there. This is what purpose Archivematica is using them for. Examples: Transfer Source, AIP Storage. Locations are inside Spaces.


Space

A storage Space contains all the information necessary to connect to the physical storage. It is where the files are stored. Protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored here. All locations must be contained in a space.


Because Spaces deal with many different protocols and transportation needs, there are many different types of them. Each different Space type defines its own Django model (class), which has an associated Space instance.

For path-based spaces, the Space is the immediate parent of the Location folders. For example, if you had transfer source locations at /home/artefactual/archivematica-sampledata-2013-10-10-09-17-20 and /home/artefactual/maildir_transfers, the Space’s path could be /home/artefactual/
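
As a minimal sketch of how a path-based Space and its Locations combine, the snippet below joins the Space path from the example above with each Location's relative path using os.path.join. The function name full_path is a hypothetical helper for illustration, not part of the Storage Service code.

```python
import os.path

# Values taken from the example above.
space_path = "/home/artefactual/"
relative_paths = [
    "archivematica-sampledata-2013-10-10-09-17-20",
    "maildir_transfers",
]


def full_path(space_path, relative_path):
    """Hypothetical helper: join a Space path and a Location relative path."""
    return os.path.join(space_path, relative_path)


if __name__ == "__main__":
    for rel in relative_paths:
        print(full_path(space_path, rel))
```

Note that os.path.join discards the first argument when the second is an absolute path, which is one reason the implicit join is awkward for Spaces that are not path-based.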

All protocols require a staging path. This is a temporary location on the Storage Service server that is used when moving files. The storage service moves files by first copying them to the destination Space's staging directory, and then to the actual destination space. This reduces complexity, because each Space only needs to know how to get files between the locally-accessible staging directory & its own protocol, not between all other protocols.
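
The two-step move described above can be sketched as follows. This is an illustration under stated assumptions, not the Storage Service implementation: LocalSpace, to_staging, from_staging and move_file are hypothetical names, and both Spaces are plain local directories so the example can run anywhere.

```python
import os
import shutil
import tempfile


class LocalSpace:
    """Hypothetical path-based Space with its own staging directory."""

    def __init__(self, root, staging):
        self.root = root
        self.staging = staging

    def to_staging(self, relative_path, staging_dir):
        """Copy a file from this Space into the given staging directory."""
        dest = os.path.join(staging_dir, relative_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(os.path.join(self.root, relative_path), dest)
        return dest

    def from_staging(self, relative_path):
        """Copy a file from this Space's staging directory into the Space."""
        dest = os.path.join(self.root, relative_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(os.path.join(self.staging, relative_path), dest)
        return dest


def move_file(source, destination, relative_path):
    """Source -> destination staging -> destination Space."""
    source.to_staging(relative_path, destination.staging)
    return destination.from_staging(relative_path)


def demo():
    """Move one file between two temporary Spaces and return its content."""
    base = tempfile.mkdtemp()
    src = LocalSpace(os.path.join(base, "src"), os.path.join(base, "src_staging"))
    dst = LocalSpace(os.path.join(base, "dst"), os.path.join(base, "dst_staging"))
    for space in (src, dst):
        os.makedirs(space.root)
        os.makedirs(space.staging)
    with open(os.path.join(src.root, "aip.7z"), "w") as f:
        f.write("payload")
    stored = move_file(src, dst, "aip.7z")
    with open(stored) as f:
        content = f.read()
    shutil.rmtree(base)
    return content


if __name__ == "__main__":
    print(demo())
```

Each Space only implements the hop between a local staging directory and its own storage, so adding a new protocol does not require any knowledge of the others.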


Improvement Note: Currently, the Spaces are distinct models with a OneToOneField back to Space. This was done because of warnings against concrete inheritance in models [1] [2]. However, in the Storage Service we never want to access a child space without also accessing its parent, so that concern is probably unfounded. A better future design would use concrete multi-table inheritance for the different types of Spaces.

Improvement Note: When originally written, the only Spaces conceived of were path-based (local filesystem, NFS, etc), so the Space/Location information reflected that. However, most new Spaces are object-based, or otherwise don’t use Space.path & Location.relative_path, and shouldn’t use os.path.join to join the two. A better future design would move all path-related features out of Space into LocalFilesystem etc and remove the implicit os.path.join with Location.relative_path.

Spaces are sorted alphabetically in the docs.

Arkivum

  • Uses Space.path: Yes
  • Supported purposes: AIP Storage

This uses Arkivum's A-Stor. A-Stor exposes a CIFS share, which is mounted on the Storage Service and treated like a local filesystem. After files are copied to the share, a release request is sent to A-Stor to start its internal processing (copying the files to multiple datapools). While the mount is exposed as a CIFS share, it is only imitating the behaviour. The files may not exist, and the package info must be checked before accessing files to ensure they are actually present and avoid long waits.

Dataverse

  • Uses Space.path: No
  • Supported purposes: Transfer Source

Dataverse is a prototype Space that requires support in Archivematica that is not yet merged. It uses the search interface to 'browse' datasets by querying with the provided path. It uses the returned JSON to fetch all files associated with that dataset, and stores the returned JSON as dataset.json.

When fetching datasets to start a transfer with, it assumes that the identifier is a digit, and the Location & Space paths contain no digits. This is a likely source of bugs.

Duracloud

  • Uses Space.path: No
  • Supported purposes: Transfer Source, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery

Each Storage Service Space corresponds to a single space in Duracloud, so to support multiple Duracloud spaces, multiple Storage Service Spaces must be created. The Location path is used as a prefix to the path. Duracloud is used in hosted Archivematica.

Improvement Note: The Location path should track which Space in Duracloud to upload to, instead of Duracloud.duraspace. This would also remove the unnecessary prefixes in paths when uploading. Care would have to be taken to migrate existing Duracloud configurations correctly. An optional path prefix could be useful. This would benefit from having Space.path removed first (see above improvement notes).

DSpace

  • Uses Space.path: No
  • Supported purposes: AIP Storage

DSpace uses the SWORD2 API & DSpace REST API to upload the AIP. Before uploading, the AIP is split into two packages: one containing the objects, and one containing everything else (metadata, logs, bagit structure). It also uploads Dublin Core information to DSpace if available. For the Dublin Core upload to work, some configuration changes in DSpace are required.

Improvement Note: DSpace does not support fetching files from DSpace, so downloading the AIP, fixity check, and AIP reingest do not work. This should be implemented.

FEDORA via SWORD2

  • Uses Space.path: Yes
  • Supported purposes: FEDORA Deposit

This offers a SWORD2 server API to allow another system (developed for Archidora, but could be others) to deposit content into Archivematica and trigger a Transfer. This contrasts with the other Spaces, which require Archivematica to initiate contact. Examples and documentation are available at Sword_API.

Local Filesystem

  • Uses Space.path: Yes
  • Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery, Storage Service Internal

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. This can be a hard drive, or a mounted remote filesystem. This is the default configured space.

LOCKSS-o-matic

  • Supported purposes: AIP Storage

This supports storing AIPs in a LOCKSS network via LOCKSS-O-Matic, which uses SWORD to communicate between the Storage Service and a Private LOCKSS Network (PLN). The Space.path is used as a staging location when making files available for harvesting.

NFS

NFS is a stub space. It was intended to support auto mounting NFS shares, but is not significantly different from Local Filesystem. Currently, NFS handling should be done outside of Archivematica.

Pipeline Local Filesystem

  • Uses Space.path: Yes
  • Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host. This is the easiest way to support having the Pipeline and Storage Service on different machines.

Swift

  • Supported purposes: Transfer Source, Transfer Backlog, AIP Storage, DIP Storage

This stores in OpenStack's Swift using the swiftclient library.

Locations

A storage Location is contained in a Space, and knows its purpose in the Archivematica system. This is why the files are there. A Location allows Archivematica to query for only storage that has been marked for a particular purpose.

Each Location should be associated with at least one pipeline. A pipeline can have multiple instances of any location, except for Backlog and Currently Processing locations, of which there should be only one. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.

Not all Spaces support all Location purposes. For example, several Spaces only allow AIP storage, because they are only suitable for long term storage and do not provide temporary storage (eg Transfer backlog) or easy access to files (eg Transfer source). When creating a new Location, only allowed purposes are selectable in the menu.
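
The restriction above can be pictured as a lookup from Space type to allowed purposes. The sketch below is hypothetical: the purpose sets are the ones documented for these four Space types, but the dictionary and function names are invented for illustration and do not reflect how the Storage Service stores this information.

```python
# Purposes documented for four of the Space types.
SUPPORTED_PURPOSES = {
    "Arkivum": {"AIP Storage"},
    "Dataverse": {"Transfer Source"},
    "DSpace": {"AIP Storage"},
    "Swift": {"Transfer Source", "Transfer Backlog", "AIP Storage", "DIP Storage"},
}


def selectable_purposes(space_type):
    """Purposes that would be offered in the menu for a new Location."""
    return sorted(SUPPORTED_PURPOSES.get(space_type, set()))


def can_create_location(space_type, purpose):
    """True if a Location with this purpose is allowed in this Space type."""
    return purpose in SUPPORTED_PURPOSES.get(space_type, set())


if __name__ == "__main__":
    print(selectable_purposes("Swift"))
    print(can_create_location("Arkivum", "Transfer Source"))
```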

Locations are sorted by order of appearance in processing in the docs.

Transfer Source

  • Purpose: Input into Archivematica
  • Required: Yes
  • Multiples allowed: Yes
  • Database code: TS

Transfer source locations are where Transfers can be started from and where metadata files can be added to a unit from. Transfer source locations display in Archivematica’s Transfer tab. Any folder in a transfer source can be selected to become a Transfer. The default value is /home in a Local Filesystem.

Currently Processing

  • Purpose: For Archivematica's internal processing
  • Required: Yes
  • Multiples allowed: No
  • Database code: CP

During processing, Archivematica uses the currently processing location associated with that pipeline. Exactly one currently processing location should be associated with a given pipeline. The default value is /var/archivematica/sharedDirectory in a Local Filesystem. This is required for Archivematica to run.

Transfer Backlog

  • Purpose: Store Transfers in backlog
  • Required: No (Yes if using Backlog)
  • Multiples allowed: No
  • Database code: BL

Transfer backlog stores transfers until such a time that the user continues processing them. The default value is /var/archivematica/sharedDirectory/www/AIPsStore/transferBacklog in a Local Filesystem. This is required to store and retrieve transfers in backlog.

AIP Storage

  • Purpose: Store AIPs for long term storage
  • Required: Yes
  • Multiples allowed: Yes
  • Database code: AS

AIP storage locations are where the completed AIPs are put for long-term storage. The default value is /var/archivematica/sharedDirectory/www/AIPsStore in a Local Filesystem. This is required to store and retrieve AIPs.

DIP Storage

  • Purpose: Store DIPs before uploading to access systems
  • Required: No
  • Multiples allowed: Yes
  • Database code: DS

DIP storage is used for storing DIPs until such a time that they can be uploaded to an access system. The default value is /var/archivematica/sharedDirectory/www/DIPsStore in a Local Filesystem. This is required to store and retrieve DIPs. This is not required to upload DIPs to access systems.

AIP Recovery

  • Purpose: Recover a corrupted AIP
  • Required: No
  • Multiples allowed: No
  • Database code: AR

AIP Recovery is where the AIP recovery feature looks for an AIP to recover. No more than one AIP recovery location should be associated with a given pipeline. The default value is /var/archivematica/storage_service/recover in a Local Filesystem. This is only required if AIP recovery is used.

Needs clarification: Does this location store the corrupted AIP, or a duplicate copy of the AIP that it can be recovered from, or something else?

Storage Service Internal

  • Purpose: Internal staging area for the Storage Service
  • Required: Yes
  • Multiples allowed: No
  • Database code: SS
  • Associated with a pipeline: No

There should be exactly one Storage Service Internal Processing location for each Storage Service installation. The default value is /var/archivematica/storage_service in a Local Filesystem. This is required for the Storage Service to run, and must be locally available to the Storage Service. It should not be associated with any pipelines.

FEDORA Deposit

  • Purpose: Store deposited transfers from Archidora before starting as a Transfer.
  • Required: No
  • Multiples allowed: Yes
  • Database code: SD

FEDORA Deposit is used with the Archidora plugin to ingest material from Islandora. This is only available to the FEDORA Space, and is only required for that space.
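
For quick reference, the database codes listed for each purpose above can be collected in one place. This is a sketch only: the two-letter codes and the "Multiples allowed" values come from the lists above, while the Purpose enum, its member names, and allows_multiples are hypothetical and are not taken from the Storage Service source.

```python
from enum import Enum


class Purpose(Enum):
    """Location purposes keyed by their documented database codes."""

    TRANSFER_SOURCE = "TS"
    CURRENTLY_PROCESSING = "CP"
    TRANSFER_BACKLOG = "BL"
    AIP_STORAGE = "AS"
    DIP_STORAGE = "DS"
    AIP_RECOVERY = "AR"
    STORAGE_SERVICE_INTERNAL = "SS"
    FEDORA_DEPOSIT = "SD"


# Purposes marked "Multiples allowed: Yes" above.
MULTIPLES_ALLOWED = {
    Purpose.TRANSFER_SOURCE,
    Purpose.AIP_STORAGE,
    Purpose.DIP_STORAGE,
    Purpose.FEDORA_DEPOSIT,
}


def allows_multiples(code):
    """Look up a purpose by database code and report if multiples are allowed."""
    return Purpose(code) in MULTIPLES_ALLOWED


if __name__ == "__main__":
    for purpose in Purpose:
        print(purpose.value, purpose.name, allows_multiples(purpose.value))
```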


Pipelines

Archivematica installations are tracked in the Storage Service as Pipelines. Locations are associated with Pipelines, which gives them access to that storage. If a Location is not associated with a Pipeline, it doesn't exist as far as that pipeline is concerned.
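
A minimal sketch of this visibility rule, assuming a simplified in-memory model: the Location dataclass, its fields, and locations_for are hypothetical and stand in for the real Django models. Only the paths are taken from the defaults documented above.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Hypothetical stand-in for a Storage Service Location."""

    purpose: str
    path: str
    pipelines: set = field(default_factory=set)


def locations_for(pipeline, locations, purpose=None):
    """Return only the Locations associated with the given pipeline."""
    return [
        loc
        for loc in locations
        if pipeline in loc.pipelines and (purpose is None or loc.purpose == purpose)
    ]


if __name__ == "__main__":
    locations = [
        Location("AS", "/var/archivematica/sharedDirectory/www/AIPsStore", {"pipeline-a"}),
        Location("TS", "/home", {"pipeline-a", "pipeline-b"}),
        Location("DS", "/var/archivematica/sharedDirectory/www/DIPsStore"),
    ]
    # pipeline-b sees only the transfer source; the DIP store is invisible to both.
    print([loc.path for loc in locations_for("pipeline-b", locations)])
```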


Troubleshooting

The Storage Service keeps a log at /var/log/archivematica/storage-service.log. Errors may also be logged to the nginx and uwsgi logs; the uwsgi log is at /var/log/uwsgi/app/storage.log

See also