Storage Service


This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

The Archivematica Storage Service is a standalone web application that moves files to Archivematica for processing and from Archivematica into long-term storage, and keeps track of their location for later retrieval.

There are 2 main configuration levels in the Storage Service: Spaces and Locations.

  • Space: where the files are stored. This is the protocol used to fetch and store the files in a storage system. Examples: Local filesystem, Duracloud. Spaces contain Locations.
  • Location: why the files are there. This is what purpose Archivematica is using them for. Examples: Transfer Source, AIP Storage. Locations are inside Spaces.

Initial design requirements and API documentation also exist.


Space

A storage Space contains all the information necessary to connect to the physical storage. It is where the files are stored. Protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored here. All locations must be contained in a space.


Because Spaces deal with many different protocols and transport needs, there are many types of Space. Each Space type defines its own Django model (class), which has an associated Space instance.

For path-based Spaces, the Space is the immediate parent of the Location folders. For example, if you had transfer source Locations at /home/artefactual/archivematica-sampledata-2013-10-10-09-17-20 and /home/artefactual/maildir_transfers, the Space's path could be /home/artefactual/.
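As a minimal sketch of how a path-based Location could resolve to a directory on disk, assuming the Space path and the Location's relative path are simply joined. The function name location_full_path is hypothetical and is not part of the Storage Service:

```python
import os.path


def location_full_path(space_path, relative_path):
    """Join a Space's path with a Location's relative path (illustrative only)."""
    # Strip a leading slash so os.path.join does not discard space_path.
    return os.path.normpath(os.path.join(space_path, relative_path.lstrip("/")))


space_path = "/home/artefactual/"
for relative in ("archivematica-sampledata-2013-10-10-09-17-20", "maildir_transfers"):
    print(location_full_path(space_path, relative))
```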

All protocols require a staging path. This is a temporary location on the Storage Service server that is used when moving files. The Storage Service moves files by first copying them to the destination Space's staging directory, and then to the actual destination Space. This reduces complexity, because each Space only needs to know how to get files between the locally accessible staging directory and its own protocol, not between all other protocols.
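The two-step move through staging can be sketched as follows. This is an illustration under stated assumptions, not the Storage Service implementation: LocalDirSpace, transfer and demo are hypothetical names, both Spaces are plain local directories, and removing the staged copy afterwards is an assumption.

```python
import os
import shutil
import tempfile


class LocalDirSpace:
    """Hypothetical stand-in for a Space whose storage is a local directory."""

    def __init__(self, root, staging_path):
        self.root = root                  # where this Space keeps its files
        self.staging_path = staging_path  # local staging directory for this Space

    def move_to_storage_service(self, name, staging_dir):
        # This Space's storage -> a locally accessible staging directory.
        shutil.copy2(os.path.join(self.root, name), os.path.join(staging_dir, name))

    def move_from_storage_service(self, name, staging_dir):
        # Staging directory -> this Space's storage.
        shutil.copy2(os.path.join(staging_dir, name), os.path.join(self.root, name))


def transfer(name, source, destination):
    """Move a file between Spaces via the destination's staging directory."""
    staging = destination.staging_path
    source.move_to_storage_service(name, staging)
    destination.move_from_storage_service(name, staging)
    os.remove(os.path.join(staging, name))  # assumption: staged copy is discarded


def demo():
    """Run a transfer between temporary directories and return the result."""
    base = tempfile.mkdtemp()
    paths = {}
    for label in ("src", "src_staging", "dst", "dst_staging"):
        paths[label] = os.path.join(base, label)
        os.mkdir(paths[label])
    with open(os.path.join(paths["src"], "aip.7z"), "w") as handle:
        handle.write("example")
    source = LocalDirSpace(paths["src"], paths["src_staging"])
    destination = LocalDirSpace(paths["dst"], paths["dst_staging"])
    transfer("aip.7z", source, destination)
    return os.listdir(paths["dst"]), os.listdir(paths["dst_staging"])
```

Neither Space needs to know anything about the other's protocol; each only reads from or writes to a local staging directory.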

There are 4 core methods that a Space should implement:

  • browse
    • Allows seeing the files available or stored here.
  • move_to_storage_service
    • Moves files from the remote storage to the staging path
  • move_from_storage_service
    • Moves files from the staging path to the remote storage.
  • delete_path
    • Deletes files at a path.

Additionally, several other functions enable custom behaviour:

  • post_move_to_storage_service
    • Allows post-processing after fetching files, e.g. putting a split package back together.
  • post_move_from_storage_service
    • Allows post-processing after storing files, e.g. notifying the storage system of the new files.
  • update_package_status
    • Allows the status of a package in this system to be checked and updated, e.g. to ensure replication has completed.
  • check_package_fixity
    • Allows a Space's own fixity check to be called, instead of downloading the package and running bagit manually.
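The methods listed above could be laid out roughly like this. It is a sketch only: the method names come from this page, but the argument names, the default behaviour of the optional hooks, and the DictSpace toy class are assumptions made for illustration.

```python
class SpaceProtocol:
    """Illustrative interface; argument names and defaults are assumptions."""

    # Core methods: each Space type is expected to provide these.
    def browse(self, path):
        raise NotImplementedError

    def move_to_storage_service(self, path, staging):
        raise NotImplementedError

    def move_from_storage_service(self, path, staging):
        raise NotImplementedError

    def delete_path(self, path):
        raise NotImplementedError

    # Optional hooks: here they default to doing nothing.
    def post_move_to_storage_service(self, path, staging):
        pass

    def post_move_from_storage_service(self, path, staging):
        pass

    def update_package_status(self, package):
        pass

    def check_package_fixity(self, package):
        # Assumption: a Space without its own check signals that the caller
        # should fall back to downloading the package and running bagit.
        raise NotImplementedError


class DictSpace(SpaceProtocol):
    """Toy Space that keeps 'remote' files in a dict; staging is a dict too."""

    def __init__(self):
        self.remote = {}

    def browse(self, path):
        return sorted(p for p in self.remote if p.startswith(path))

    def move_to_storage_service(self, path, staging):
        staging[path] = self.remote[path]

    def move_from_storage_service(self, path, staging):
        self.remote[path] = staging[path]

    def delete_path(self, path):
        del self.remote[path]
```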

Spaces can also be configured with a maximum quota, and track how much data has been stored in them through the Storage Service.

Improvement Note: Quota and storage tracking is a stub feature, and does not properly track how much has been stored, especially with uncompressed AIPs and deleting AIPs.
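A quota check of the kind described here might look like the sketch below. The function name can_store and the convention that a quota of None means "no limit" are assumptions; sizes are in bytes.

```python
def can_store(size, used, quota=None):
    """Return True if `size` more bytes fit within the quota."""
    if quota is None:
        # Assumption: no quota configured means unlimited storage.
        return True
    return used + size <= quota
```

For the tracking to be accurate, `used` would also have to be reduced when an AIP is deleted, which is one of the gaps the note above describes.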


Improvement Note: Currently, the Space types are distinct models with a OneToOneField back to Space. This was done because of warnings against concrete inheritance in models [1] [2]. However, in the Storage Service we never want to access a child Space without also accessing its parent, so that concern is probably not founded. A better future design would use concrete multi-table inheritance for the different types of Spaces.

Improvement Note: When originally written, the only Spaces conceived of were path-based (local filesystem, NFS, etc.), so the Space/Location information reflected that. However, most new Spaces are object-based, or otherwise don't use Space.path and Location.relative_path, and shouldn't use os.path.join to join the two. A better future design would move all path-related features out of Space into LocalFilesystem etc. and remove the implicit os.path.join with Location.relative_path.

Spaces are sorted alphabetically in the docs.

Arkivum

  • Uses Space.path: Yes
  • Supported purposes: AIP Storage
  • Database code: ARKIVUM

This uses Arkivum's A-Stor. A-Stor exposes a CIFS share, which is mounted on the Storage Service and treated like a local filesystem. After files are copied to the share, a release request is sent to A-Stor to start its internal processing (copying the files to multiple datapools). Although the mount is exposed as a CIFS share, it only imitates that behaviour. The files may not actually be present, so the package info must be checked before accessing them to avoid long waits.

Dataverse

Dataverse is a prototype Space that requires support in Archivematica that is not yet merged. It uses the search interface to 'browse' datasets by querying with the provided path. It uses the returned JSON to fetch all files associated with that dataset, and the returned JSON is stored as dataset.json.

When fetching datasets to start a transfer with, it assumes that the identifier is a digit, and the Location & Space paths contain no digits. This is a likely source of bugs.
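To illustrate why this assumption is fragile, here is a hypothetical sketch. It is not the actual Dataverse Space code; it only shows how pulling "the digits" out of a path goes wrong once the Location or Space path itself contains digits. The function name guess_dataset_id is invented for this example.

```python
import re


def guess_dataset_id(path):
    """Return the first run of digits in the path, or None (illustrative only)."""
    match = re.search(r"\d+", path)
    return match.group() if match else None


print(guess_dataset_id("dataverse/42"))    # the identifier, as intended
print(guess_dataset_id("archive2019/42"))  # picks up digits from the path instead
```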

Duracloud

Each space in Duracloud corresponds to one Space in the Storage Service, so to support multiple Duracloud spaces, multiple Storage Service Spaces must be created. The Location path is used as a prefix to the path. Duracloud is used in hosted Archivematica.

Improvement Note: The Location path should track which Space in Duracloud to upload to, instead of Duracloud.duraspace. This would also remove the unnecessary prefixes in paths when uploading. Care would have to be taken to migrate existing Duracloud configurations correctly. An optional path prefix could be useful. This would benefit from having Space.path removed first (see above improvement notes).

DSpace

  • Uses Space.path: No
  • Supported purposes: AIP Storage
  • Database code: DSPACE

DSpace uses the SWORD2 API & DSpace REST API to upload the AIP. Before uploading, the AIP is split into two packages: one containing the objects, and one containing everything else (metadata, logs, bagit structure). It also uploads Dublin Core information to DSpace if available. For the Dublin Core upload to work, some configuration changes in DSpace are required.
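The split into two packages can be pictured as a partition of the AIP's file list. This is a sketch only: it assumes the objects live under data/objects/ inside the AIP, and the function name split_aip is hypothetical.

```python
def split_aip(paths, objects_prefix="data/objects/"):
    """Partition AIP file paths into (objects, everything else)."""
    objects = [p for p in paths if p.startswith(objects_prefix)]
    metadata = [p for p in paths if not p.startswith(objects_prefix)]
    return objects, metadata


aip_files = [
    "bagit.txt",
    "manifest-sha256.txt",
    "data/logs/transfer.log",
    "data/objects/report.pdf",
]
objects, metadata = split_aip(aip_files)
print(objects)
print(metadata)
```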

To make DSpace fit with the path-based structure of Spaces, the Space.path is unused, and the service document URL is stored in the Service Document IRI field. The Location.relative_path is overloaded to represent the collection.

Improvement Note: DSpace does not support fetching files from DSpace, so downloading the AIP, fixity check, and AIP reingest do not work. This should be implemented.

FEDORA via SWORD2

  • Uses Space.path: Yes
  • Supported purposes: FEDORA Deposit
  • Database code: FEDORA

This offers a SWORD2 server API to allow another system (developed for Archidora, but could be others) to deposit content into Archivematica and trigger a Transfer. This contrasts with the other Spaces, which require Archivematica to initiate contact. Examples and documentation are at Sword API.

Local Filesystem

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. This can be a hard drive, or a mounted remote filesystem. This is the default configured space.

LOCKSS-o-matic

  • Uses Space.path: Yes
  • Supported purposes: AIP Storage
  • Database code: LOM

This supports storing AIPs in a LOCKSS network via LOCKSS-O-Matic, which uses SWORD to communicate between the Storage Service and a Private LOCKSS Network (PLN). The Space.path is used as a staging location when making files available for harvesting.

NFS

NFS is a stub space. It was intended to support auto mounting NFS shares, but is not significantly different from Local Filesystem. Currently, NFS handling should be done outside of Archivematica.

Pipeline Local Filesystem

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host. This is the easiest way to support having the Pipeline and Storage Service on different machines.

Swift

This Space stores files in OpenStack's Swift using the swiftclient library.

Locations

A storage Location is contained in a Space, and knows its purpose in the Archivematica system. This is why the files are there. A Location allows Archivematica to query for only storage that has been marked for a particular purpose.

Each Location should be associated with at least one pipeline. A pipeline can have multiple Locations of any purpose, except for Transfer Backlog and Currently Processing, of which there should be only one each. If you want the same directory on disk to serve multiple purposes, create multiple Locations with different purposes.
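A check for this rule might look like the following sketch. It uses the database codes and the "Multiples allowed: No" entries from the purpose sections on this page (CP, BL and AR; SS is left out because it is not associated with pipelines). The function name duplicate_purposes is hypothetical.

```python
from collections import Counter

# Purposes marked "Multiples allowed: No" that are tied to a pipeline.
SINGLE_PER_PIPELINE = {"CP", "BL", "AR"}


def duplicate_purposes(location_purposes):
    """Return purposes that occur more than once but should be unique per pipeline."""
    counts = Counter(location_purposes)
    return sorted(
        purpose
        for purpose, count in counts.items()
        if count > 1 and purpose in SINGLE_PER_PIPELINE
    )


# Purposes of the Locations associated with one pipeline.
print(duplicate_purposes(["TS", "TS", "CP", "AS", "AS"]))
print(duplicate_purposes(["TS", "CP", "CP", "BL"]))
```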

Not all Spaces support all Location purposes. For example, several Spaces only allow AIP storage, because they are only suitable for long-term storage and do not provide temporary storage (e.g. Transfer backlog) or easy access to files (e.g. Transfer source). When creating a new Location, only allowed purposes are selectable in the menu.
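The restriction can be expressed as a lookup from Space type to allowed purposes. The sketch below only covers the Spaces for which this page lists "Supported purposes", using the database codes given on this page; the table and function names are hypothetical.

```python
# Space database code -> allowed Location purpose codes, as listed on this page.
ALLOWED_PURPOSES = {
    "ARKIVUM": {"AS"},  # AIP Storage
    "DSPACE": {"AS"},   # AIP Storage
    "FEDORA": {"SD"},   # FEDORA Deposit
    "LOM": {"AS"},      # AIP Storage
}


def is_allowed(space_code, purpose_code):
    """Return True if the Space type accepts Locations with this purpose.

    Raises KeyError for Space types this page does not list purposes for.
    """
    return purpose_code in ALLOWED_PURPOSES[space_code]


print(is_allowed("ARKIVUM", "AS"))
print(is_allowed("ARKIVUM", "TS"))
```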

Locations are sorted by order of appearance in processing in the docs.

Transfer Source

  • Purpose: Input into Archivematica
  • Required: Yes
  • Multiples allowed: Yes
  • Database code: TS

Transfer source locations are where Transfers can be started from and where metadata files can be added to a unit from. Transfer source locations display in Archivematica's Transfer tab. Any folder in a transfer source can be selected to become a Transfer. The default value is /home in a Local Filesystem.

Currently Processing

  • Purpose: For Archivematica's internal processing
  • Required: Yes
  • Multiples allowed: No
  • Database code: CP

During processing, Archivematica uses the currently processing location associated with that pipeline. Exactly one currently processing location should be associated with a given pipeline. The default value is /var/archivematica/sharedDirectory in a Local Filesystem. This is required for Archivematica to run.

Transfer Backlog

  • Purpose: Store Transfers in backlog
  • Required: No (Yes if using Backlog)
  • Multiples allowed: No
  • Database code: BL

Transfer backlog stores transfers until the user continues processing them. The default value is /var/archivematica/sharedDirectory/www/AIPsStore/transferBacklog in a Local Filesystem. This is required to store and retrieve transfers in backlog.

AIP Storage

  • Purpose: Store AIPs for long term storage
  • Required: Yes
  • Multiples allowed: Yes
  • Database code: AS

AIP storage locations are where the completed AIPs are put for long-term storage. The default value is /var/archivematica/sharedDirectory/www/AIPsStore in a Local Filesystem. This is required to store and retrieve AIPs.

DIP Storage

  • Purpose: Store DIPs before uploading to access systems
  • Required: No
  • Multiples allowed: Yes
  • Database code: DS

DIP storage holds DIPs until they can be uploaded to an access system. The default value is /var/archivematica/sharedDirectory/www/DIPsStore in a Local Filesystem. This is required to store and retrieve DIPs, but is not required to upload DIPs to access systems.

AIP Recovery

  • Purpose: Recover a corrupted AIP
  • Required: No
  • Multiples allowed: No
  • Database code: AR

AIP Recovery is where the AIP recovery feature looks for an AIP to recover. No more than one AIP recovery location should be associated with a given pipeline. The default value is /var/archivematica/storage_service/recover in a Local Filesystem. This is only required if AIP recovery is used.

Needs clarification: Is this for storing the corrupted AIP, or for storing a duplicate copy of the AIP from which it can be recovered, or for something else?

Storage Service Internal

  • Purpose: Internal staging area for the Storage Service
  • Required: Yes
  • Multiples allowed: No
  • Database code: SS
  • Associated with a pipeline: No

There should be exactly one Storage Service Internal Processing location for each Storage Service installation. The default value is /var/archivematica/storage_service in a Local Filesystem. This is required for the Storage Service to run, and must be locally available to the Storage Service. It should not be associated with any pipelines.

FEDORA Deposit

  • Purpose: Store deposited transfers from Archidora before starting as a Transfer.
  • Required: No
  • Multiples allowed: Yes
  • Database code: SD

FEDORA Deposit is used with the Archidora plugin to ingest material from Islandora. This is only available to the FEDORA Space, and is only required for that space.


Pipelines

Archivematica installations are tracked in the Storage Service as Pipelines. Locations are associated with Pipelines, which gives them access to that storage. If a Location is not associated with a Pipeline, it doesn't exist as far as that pipeline is concerned.
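The effect of the association can be sketched as a filter: a pipeline only ever sees the Locations it is associated with. The Location dataclass and the locations_for function below are hypothetical stand-ins, not the Storage Service models.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    purpose: str        # database code, e.g. "TS" or "AS"
    relative_path: str
    pipelines: set = field(default_factory=set)  # associated pipeline identifiers


def locations_for(pipeline_id, locations, purpose=None):
    """Return the Locations visible to a pipeline, optionally by purpose."""
    return [
        location
        for location in locations
        if pipeline_id in location.pipelines
        and (purpose is None or location.purpose == purpose)
    ]


all_locations = [
    Location("TS", "home", {"pipeline-a"}),
    Location("AS", "aips", {"pipeline-a", "pipeline-b"}),
    Location("AS", "other-aips", {"pipeline-b"}),
]
print([loc.relative_path for loc in locations_for("pipeline-a", all_locations, "AS")])
```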

Package

A Package is a file or directory (collection of files) that Archivematica knows about and can track. Most Packages are AIPs in long-term storage, but they could also be Transfers in backlog or stored DIPs.

Most of the additional functionality in the storage service not directly related to moving files around is implemented on the Package class.

Status

Packages can have several different statuses, indicating where they are in terms of being stored.

  • Upload Pending: Still on Archivematica
  • Staged on Storage Service: In Storage Service staging directory
  • Uploaded: In final storage location
  • Verified: Verified to be in final storage location
  • Failed: Error occurred - may or may not be at final location
  • Delete requested: Delete requested, package state unchanged
  • Deleted: Storage service tried to delete it, considers it gone
  • Deposit Finalized: For SWORD API accepting deposits (unused for AIPs)

"Upload Pending" is set when storing an AIP, before it's been moved from the pipeline to the staging directory of the final destination.

"Staged on Storage Service" is set when storing an AIP after it's been moved to the staging directory of the destination Space.

"Uploaded" is set when storing an AIP after it has been moved to the final location, unless it's LOCKSS or Arkivum. Arkivum packages move to "Uploaded" once the replication status is "green", and LOCKSS packages once all server states are "agreement".

"Verified" exists but is never used.

"Failed" is the default state if none is set (e.g. if the package doesn't get to Upload Pending), but is otherwise unused.

"Delete requested" is set when deletion of a package is requested. Note that this replaces the previous state, so if the delete request is rejected the previous state is lost.

"Deleted" is set when a delete request is approved. The package entry still exists, but the package is assumed to be gone. A package can be in this state but still exist if the delete failed. The Storage Service doesn't prevent interaction with deleted packages.

"Deposit Finalized" is the state of a FEDORA transfer after the deposit is finalized and the transfer has been started.
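The statuses and the single-valued behaviour described above can be sketched as follows. The status names come from the list on this page; the Status and Package classes here are illustrative stand-ins, not the Storage Service models.

```python
from enum import Enum


class Status(Enum):
    UPLOAD_PENDING = "Upload Pending"
    STAGED = "Staged on Storage Service"
    UPLOADED = "Uploaded"
    VERIFIED = "Verified"
    FAILED = "Failed"
    DELETE_REQUESTED = "Delete requested"
    DELETED = "Deleted"
    DEPOSIT_FINALIZED = "Deposit Finalized"


class Package:
    """Toy package holding exactly one status at a time."""

    def __init__(self):
        # "Failed" is the default state if none is set.
        self.status = Status.FAILED

    def request_delete(self):
        # Overwrites whatever status the package had before;
        # nothing here remembers the earlier value.
        self.status = Status.DELETE_REQUESTED


package = Package()
package.status = Status.UPLOADED
package.request_delete()
print(package.status.value)
```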

Improvement Note: Make "Uploaded" more consistent: the state should always be "Uploaded" after the AIP has been moved to the final location, and "Verified" should be used for what Arkivum and LOCKSS currently use "Uploaded" for. Perhaps "Uploaded" should be renamed, since in the case of LOCKSS/Arkivum the package may not have actually been uploaded yet. This is potentially confusing though, because the 'completely stored' state (Uploaded vs Verified) would differ depending on the Space.

Improvement Note: Start using "Failed". Use cases might include Arkivum status going from Green to Red, or otherwise failing a fixity check.

Improvement Note: Store the package state as a set or a list. This would allow a package to be Uploaded and Delete requested, instead of losing that state. We could also mark something as both Uploaded and Verified, resolving how to handle Arkivum status. Failed could coexist with the last known good state.
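The set-based idea could look something like the sketch below. It describes a possible future design, not existing behaviour, and the PackageState class and its method names are hypothetical.

```python
class PackageState:
    """Toy package state that can hold several statuses at once."""

    def __init__(self, *flags):
        self.flags = set(flags)

    def request_delete(self):
        # Added alongside the existing statuses instead of replacing them.
        self.flags.add("Delete requested")

    def reject_delete(self):
        # Dropping the request leaves the earlier statuses intact.
        self.flags.discard("Delete requested")


state = PackageState("Uploaded", "Verified")
state.request_delete()
print(sorted(state.flags))
state.reject_delete()
print(sorted(state.flags))
```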

API

See Storage Service API

Troubleshooting

The Storage Service keeps a log at /var/log/archivematica/storage-service.log. Errors may also be logged to the nginx and uwsgi logs, at /var/log/uwsgi/app/storage.log.
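When the log is long, a small helper can pull out the most recent error lines. This is a sketch: it assumes error lines contain the text "ERROR", which depends on how logging is configured, and the function names are hypothetical.

```python
from collections import deque

LOG_PATH = "/var/log/archivematica/storage-service.log"


def recent_errors(lines, limit=20, marker="ERROR"):
    """Return the last `limit` lines that contain `marker`."""
    matching = (line.rstrip("\n") for line in lines if marker in line)
    return list(deque(matching, maxlen=limit))


def scan_log(path=LOG_PATH, limit=20):
    """Read a log file and return its most recent error lines."""
    with open(path, errors="replace") as handle:
        return recent_errors(handle, limit)
```

Calling scan_log() with no arguments reads the Storage Service log; passing the uwsgi log path checks that file instead.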

See also