Difference between revisions of "Storage Service"
Revision as of 16:37, 13 March 2017
The Archivematica Storage Service is a standalone web application that moves files to Archivematica for processing and from Archivematica into long-term storage, and keeps track of their location for later retrieval.
There are 2 main configuration levels in the Storage Service: Spaces and Locations.
- Space: where the files are stored. This is the protocol used to fetch and store the files in a storage system. Examples: Local filesystem, Duracloud. Spaces contain Locations.
- Location: why the files are there. This is what purpose Archivematica is using them for. Examples: Transfer Source, AIP Storage. Locations are inside Spaces.
Initial design requirements and API documentation also exist.
Space
A storage Space contains all the information necessary to connect to the physical storage. It is where the files are stored. Protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored here. All locations must be contained in a space.
Because Spaces deal with many different protocols and transportation needs, there are many different types of them. Each different Space type defines its own Django model (class), which has an associated Space instance.
For path-based spaces, the Space is the immediate parent of the Location folders. For example, if you had transfer source locations at /home/artefactual/archivematica-sampledata-2013-10-10-09-17-20 and /home/artefactual/maildir_transfers, the Space’s path could be /home/artefactual/
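As a minimal sketch of the example above, the full path of a path-based Location can be thought of as the Space's path joined with the Location's relative path. The helper name `full_location_path` is hypothetical and is not part of the Storage Service code. `posixpath` is used so that the result is the same on any platform; on Linux it is what `os.path` resolves to.

```python
import posixpath

# Values taken from the example above.
space_path = "/home/artefactual/"
relative_paths = [
    "archivematica-sampledata-2013-10-10-09-17-20",
    "maildir_transfers",
]


def full_location_path(space_path, relative_path):
    """Hypothetical helper: join a Space path and a Location relative path."""
    return posixpath.join(space_path, relative_path)


location_paths = [full_location_path(space_path, p) for p in relative_paths]
```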
All protocols require a staging path. This is a temporary location on the Storage Service server that is used when moving files. The storage service moves files by first copying them to the destination Space's staging directory, and then to the actual destination space. This reduces complexity, because each Space only needs to know how to get files between the locally-accessible staging directory & its own protocol, not between all other protocols.
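The two-step move described above can be sketched with plain local directories. This is an illustration only, assuming both ends are reachable as local paths; the function name `move_via_staging` and the directory names are hypothetical, and the real Storage Service delegates each step to the Space's own protocol.

```python
import os
import shutil
import tempfile


def move_via_staging(source_file, staging_dir, destination_dir):
    """Copy a file into the staging directory, then on to the destination."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(destination_dir, exist_ok=True)
    # Step 1: source -> locally accessible staging directory.
    staged = shutil.copy2(source_file, staging_dir)
    # Step 2: staging directory -> destination.
    final = shutil.copy2(staged, destination_dir)
    # The staged copy is temporary, so clean it up.
    os.remove(staged)
    return final


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "pipeline", "aip.7z")
    os.makedirs(os.path.dirname(src))
    with open(src, "w") as handle:
        handle.write("example")

    stored = move_via_staging(
        src, os.path.join(root, "staging"), os.path.join(root, "store")
    )
    stored_name = os.path.basename(stored)
    with open(stored) as handle:
        stored_content = handle.read()
    staging_left = os.listdir(os.path.join(root, "staging"))
```

Because every protocol only talks to the staging directory, adding a new Space type means writing two transfer directions rather than one per existing protocol.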
There are 4 core methods that a Space should implement
browse
- Allows seeing the files available or stored here.
move_to_storage_service
- Moves files from the remote storage to the staging path
move_from_storage_service
- Moves files from the staging path to the remote storage.
delete_path
- Deletes files at a path.
Additionally, several other functions enable custom behaviour
post_move_to_storage_service
- Allows post-processing after fetching files. E.g. putting a split package back together
post_move_from_storage_service
- Allows post-processing after storing files. E.g. Notifying the storage system of the new files
update_package_status
- Allows the status of a package in this system to be checked and updated. E.g. ensuring replication has completed
check_package_fixity
- Allows a space's own fixity check to be called, instead of downloading the package and running bagit manually
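The methods listed above could be expressed as an abstract interface along the following lines. This is a sketch, not the Storage Service's actual class: the class names `SpaceProtocol` and `InMemorySpace` and all argument names are assumptions made for illustration, and dictionaries stand in for the staging directory and the remote storage.

```python
from abc import ABC, abstractmethod


class SpaceProtocol(ABC):
    """Hypothetical interface mirroring the method names listed above."""

    @abstractmethod
    def browse(self, path):
        """List what is available or stored under path."""

    @abstractmethod
    def move_to_storage_service(self, source_path, destination_path):
        """Fetch from the remote storage into the staging path."""

    @abstractmethod
    def move_from_storage_service(self, source_path, destination_path):
        """Store from the staging path into the remote storage."""

    @abstractmethod
    def delete_path(self, path):
        """Delete whatever is stored at path."""

    # Optional hooks: a Space overrides these only if it needs them.
    def post_move_to_storage_service(self, staging_path):
        return None

    def post_move_from_storage_service(self, staging_path, destination_path):
        return None


class InMemorySpace(SpaceProtocol):
    """Toy Space that keeps 'remote' files in a dict."""

    def __init__(self, staging):
        self.staging = staging  # stands in for the staging directory
        self.remote = {}        # stands in for the remote storage

    def browse(self, path):
        return sorted(p for p in self.remote if p.startswith(path))

    def move_to_storage_service(self, source_path, destination_path):
        self.staging[destination_path] = self.remote[source_path]

    def move_from_storage_service(self, source_path, destination_path):
        self.remote[destination_path] = self.staging[source_path]

    def delete_path(self, path):
        for stored in [p for p in self.remote if p.startswith(path)]:
            del self.remote[stored]


staging = {"staging/aip.7z": b"data"}
space = InMemorySpace(staging)
space.move_from_storage_service("staging/aip.7z", "aips/aip.7z")
listing = space.browse("aips/")
space.delete_path("aips/")
after_delete = space.browse("aips/")
```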
Improvement Note: Currently, the Spaces are distinct models with a OneToOneField back to Space. This was done because of warnings against concrete inheritance in models [1] [2]. However, in the Storage Service we never want to access a child space without also accessing its parent, so that concern probably does not apply. A better future design would use concrete multi-table inheritance for the different types of Spaces.
Improvement Note: When originally written, the only Spaces conceived of were path-based (local filesystem, NFS, etc.), so the Space/Location information reflected that. However, most new Spaces are object-based, or otherwise don’t use Space.path and Location.relative_path, and shouldn’t use os.path.join to join the two. A better future design would move all path-related features out of Space into LocalFilesystem etc. and remove the implicit os.path.join with Location.relative_path.
Spaces are sorted alphabetically in the docs.
Arkivum
- Uses Space.path: Yes
- Supported purposes: AIP Storage
- Database code: ARKIVUM
This uses Arkivum's A-Stor. A-Stor exposes a CIFS share, which is mounted on the Storage Service and treated like a local filesystem. After files are copied to the share, a release request is sent to A-Stor to start its internal processing (copying the files to multiple datapools). While the mount is exposed as a CIFS share, it is only imitating the behaviour. The files may not exist, and the package info must be checked before accessing files to ensure they are actually present and avoid long waits.
Dataverse
- Uses Space.path: No
- Supported purposes: Transfer Source
- Database code: DV
Dataverse is a prototype Space that requires support in Archivematica that is not yet merged. It uses the search interface to 'browse' datasets by querying with the provided path. It uses the returned JSON to fetch all files associated with that dataset, and the returned JSON is stored as dataset.json.
When fetching datasets to start a transfer with, it assumes that the identifier consists of digits, and that the Location and Space paths contain no digits. This is a likely source of bugs.
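To illustrate why that assumption is fragile, here is a deliberately naive sketch. It is not the Storage Service's actual parsing code; the function `extract_dataset_id` and the example paths are hypothetical. It simply takes the first run of digits in a path as the dataset identifier, which goes wrong as soon as a Space or Location path contains digits of its own.

```python
import re


def extract_dataset_id(path):
    """Naive, hypothetical parser: treat the first run of digits as the id."""
    match = re.search(r"\d+", path)
    return int(match.group()) if match else None


# Works while the surrounding path has no digits.
expected_id = extract_dataset_id("dataverse/datasets/42")

# Breaks when the Location path itself contains digits.
wrong_id = extract_dataset_id("dataverse2017/datasets/42")
```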
Duracloud
- Uses Space.path: No
- Supported purposes: Transfer Source, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery
- Database code: DC
A Duracloud Space corresponds with a Space in Archivematica, so to support multiple Duracloud Spaces, multiple storage service spaces must be created. The Location path is used as a prefix to the path. Duracloud is used in hosted Archivematica.
Improvement Note: The Location path should track which Space in Duracloud to upload to, instead of Duracloud.duraspace. This would also remove the unnecessary prefixes in paths when uploading. Care would have to be taken to migrate existing Duracloud configurations correctly. An optional path prefix could be useful. This would benefit from having Space.path removed first (see above improvement notes).
DSpace
- Uses Space.path: No
- Supported purposes: AIP Storage
- Database code: DSPACE
DSpace uses the SWORD2 API & DSpace REST API to upload the AIP. Before uploading, the AIP is split into two packages: one containing the objects, and one containing everything else (metadata, logs, bagit structure). It also uploads Dublin Core information to DSpace if available. For the Dublin Core upload to work, some configuration changes in DSpace are required.
Improvement Note: The DSpace Space does not support fetching files from DSpace, so downloading the AIP, fixity checks, and AIP reingest do not work. This should be implemented.
FEDORA via SWORD2
- Uses Space.path: Yes
- Supported purposes: FEDORA Deposit
- Database code: FEDORA
This offers a SWORD2 server API to allow another system (developed for Archidora, but could be others) to deposit content into Archivematica and trigger a Transfer. This contrasts with the other Spaces, which require Archivematica to initiate contact. Examples and documentation are at Sword API.
Local Filesystem
- Uses Space.path: Yes
- Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery, Storage Service Internal
- Database code: FS
Local Filesystem spaces handle storage that is available locally on the machine running the storage service. This can be a hard drive, or a mounted remote filesystem. This is the default configured space.
LOCKSS-o-matic
- Uses Space.path: Yes
- Supported purposes: AIP Storage
- Database code: LOM
This supports storing AIPs in a LOCKSS network via LOCKSS-O-Matic, which uses SWORD to communicate between the Storage Service and a Private LOCKSS Network (PLN). The Space.path is used as a staging location when making files available for harvesting.
NFS
- Uses Space.path: Yes
- Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery, Storage Service Internal
- Database code: NFS
NFS is a stub space. It was intended to support auto mounting NFS shares, but is not significantly different from Local Filesystem. Currently, NFS handling should be done outside of Archivematica.
Pipeline Local Filesystem
- Uses Space.path: Yes
- Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery
- Database code: PIPE_FS
Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host. This is the easiest way to support having the Pipeline and Storage Service on different machines.
Swift
- Uses Space.path: No
- Supported purposes: Transfer Source, Transfer Backlog, AIP Storage, DIP Storage
- Database code: SWIFT
This stores files in OpenStack's Swift using the swiftclient library.
Locations
A storage Location is contained in a Space, and knows its purpose in the Archivematica system. This is why the files are there. A Location allows Archivematica to query for only storage that has been marked for a particular purpose.
Each Location should be associated with at least one pipeline. A pipeline can have multiple instances of any location, except for Backlog and Currently Processing locations, of which there should be only one. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.
Not all Spaces support all Location purposes. For example, several Spaces only allow AIP storage, because they are only suitable for long-term storage and do not provide temporary storage (e.g. Transfer Backlog) or easy access to files (e.g. Transfer Source). When creating a new Location, only allowed purposes are selectable in the menu.
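The restriction can be pictured as a lookup from Space type to allowed purposes. The table below is a sketch built from the database codes and supported purposes listed on this page; the names `ALLOWED_PURPOSES` and `can_create_location` are hypothetical and do not come from the Storage Service code.

```python
# Space database code -> set of allowed Location purpose codes,
# transcribed from the "Supported purposes" lists on this page.
ALLOWED_PURPOSES = {
    "ARKIVUM": {"AS"},
    "DV": {"TS"},
    "DC": {"TS", "BL", "AS", "DS", "AR"},
    "DSPACE": {"AS"},
    "FEDORA": {"SD"},
    "FS": {"TS", "CP", "BL", "AS", "DS", "AR", "SS"},
    "LOM": {"AS"},
    "NFS": {"TS", "CP", "BL", "AS", "DS", "AR", "SS"},
    "PIPE_FS": {"TS", "CP", "BL", "AS", "DS", "AR"},
    "SWIFT": {"TS", "BL", "AS", "DS"},
}


def can_create_location(space_code, purpose_code):
    """Return True if a Location with this purpose is allowed in this Space."""
    return purpose_code in ALLOWED_PURPOSES.get(space_code, set())


arkivum_aip = can_create_location("ARKIVUM", "AS")
arkivum_source = can_create_location("ARKIVUM", "TS")
swift_processing = can_create_location("SWIFT", "CP")
```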
Locations are sorted by order of appearance in processing in the docs.
Transfer Source
- Purpose: Input into Archivematica
- Required: Yes
- Multiples allowed: Yes
- Database code: TS
Transfer source locations are where Transfers can be started from and where metadata files can be added to a unit from. Transfer source locations display in Archivematica’s Transfer tab. Any folder in a transfer source can be selected to become a Transfer. The default value is /home in a Local Filesystem.
Currently Processing
- Purpose: For Archivematica's internal processing
- Required: Yes
- Multiples allowed: No
- Database code: CP
During processing, Archivematica uses the currently processing location associated with that pipeline. Exactly one currently processing location should be associated with a given pipeline. The default value is /var/archivematica/sharedDirectory in a Local Filesystem. This is required for Archivematica to run.
Transfer Backlog
- Purpose: Store Transfers in backlog
- Required: No (Yes if using Backlog)
- Multiples allowed: No
- Database code: BL
Transfer backlog stores transfers until the user continues processing them. The default value is /var/archivematica/sharedDirectory/www/AIPsStore/transferBacklog in a Local Filesystem. This is required to store and retrieve transfers in backlog.
AIP Storage
- Purpose: Store AIPs for long term storage
- Required: Yes
- Multiples allowed: Yes
- Database code: AS
AIP storage locations are where the completed AIPs are put for long-term storage. The default value is /var/archivematica/sharedDirectory/www/AIPsStore in a Local Filesystem. This is required to store and retrieve AIPs.
DIP Storage
- Purpose: Store DIPs before uploading to access systems
- Required: No
- Multiples allowed: Yes
- Database code: DS
DIP storage is used for storing DIPs until they can be uploaded to an access system. The default value is /var/archivematica/sharedDirectory/www/DIPsStore in a Local Filesystem. This is required to store and retrieve DIPs. This is not required to upload DIPs to access systems.
AIP Recovery
- Purpose: Recover a corrupted AIP
- Required: No
- Multiples allowed: No
- Database code: AR
AIP Recovery is where the AIP recovery feature looks for an AIP to recover. No more than one AIP recovery location should be associated with a given pipeline. The default value is /var/archivematica/storage_service/recover in a Local Filesystem. This is only required if AIP recovery is used.
Needs clarification: Does this location store the corrupted AIP, a duplicate copy of the AIP that it can be recovered from, or something else?
Storage Service Internal
- Purpose: Internal staging area for the Storage Service
- Required: Yes
- Multiples allowed: No
- Database code: SS
- Associated with a pipeline: No
There should be exactly one Storage Service Internal Processing location for each Storage Service installation. The default value is /var/archivematica/storage_service in a Local Filesystem. This is required for the Storage Service to run, and must be locally available to the storage service. It should not be associated with any pipelines.
FEDORA Deposit
- Purpose: Store deposited transfers from Archidora before starting as a Transfer.
- Required: No
- Multiples allowed: Yes
- Database code: SD
FEDORA Deposit is used with the Archidora plugin to ingest material from Islandora. This is only available to the FEDORA Space, and is only required for that space.
Pipelines
Archivematica installations are tracked in the Storage Service as Pipelines. Locations are associated with Pipelines, which gives them access to that storage. If a Location is not associated with a Pipeline, it doesn't exist as far as that pipeline is concerned.
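A rough way to picture this is a filter over Locations by pipeline and purpose. The sketch below uses plain dataclasses rather than the real Django models; the `Location` fields, the function `locations_for`, and the pipeline identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Hypothetical stand-in for a Storage Service Location."""
    purpose: str                      # purpose database code, e.g. "TS" or "AS"
    relative_path: str
    pipelines: set = field(default_factory=set)


def locations_for(pipeline_id, purpose, locations):
    """Return only the Locations that this pipeline is associated with."""
    return [
        loc for loc in locations
        if loc.purpose == purpose and pipeline_id in loc.pipelines
    ]


all_locations = [
    Location("TS", "home", {"pipeline-a", "pipeline-b"}),
    Location("AS", "aips-a", {"pipeline-a"}),
    Location("AS", "aips-b", {"pipeline-b"}),
    Location("AS", "aips-unassigned"),
]

visible_to_a = [loc.relative_path for loc in locations_for("pipeline-a", "AS", all_locations)]
```

The unassigned AIP Storage location never appears in any pipeline's results, which matches the statement that it does not exist as far as that pipeline is concerned.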
Package
A Package is a file or directory (collection of files) that Archivematica knows about and can track. Most Packages are AIPs in long-term storage, but they could also be Transfers in backlog or stored DIPs.
Most of the additional functionality in the storage service not directly related to moving files around is implemented on the Package class.
Status
Packages can have several different statuses, indicating where they are in terms of being stored.
- Upload Pending: Still on Archivematica
- Staged on Storage Service: In Storage Service staging directory
- Uploaded: In final storage location
- Verified: Verified to be in final storage location
- Failed: Error occurred - may or may not be at final location
- Delete requested: Delete requested, package state unchanged
- Deleted: Storage service tried to delete it, considers it gone
- Deposit Finalized: For SWORD API accepting deposits (unused for AIPs)
"Upload Pending" is set when storing an AIP, before it's been moved from the pipeline to the staging directory of the final destination.
"Staging" is set when storing an AIP after it's been moved to the staging directory of the destination Space.
"Uploaded" is set when storing an AIP after it has been moved to the final location, unless the Space is LOCKSS or Arkivum. Arkivum packages move to "Uploaded" once the replication status is "green", and LOCKSS packages once all server states are "agreement".
"Verified" exists but is never used.
"Failed" is the default state if none is set (eg if it doesn't get to Upload Pending), but is otherwise unused.
"Delete requested" is set when deletion of a package is requested. Note that this replaces the previous state, so if the delete request is rejected the previous state is lost.
"Deleted" is set when a delete request is approved. The package entry still exists, but the package is assumed to be gone. A package can be in this state but still exist if the delete failed. The storage service does not prevent interaction with deleted packages.
"Deposit Finalized" is the state of a FEDORA transfer after the deposit is finalized and the transfer has been started.
Improvement Note: Make "Uploaded" more consistent - state is always Uploaded after the AIP has been moved to the final location, and start using "Verified" for what Arkivum & LOCKSS currently use "Uploaded" for. Perhaps "Uploaded" should be renamed, since in the case of LOCKSS/Arkivum it may not have been actually uploaded yet. This is potentially confusing though, because the 'completely stored' state (Uploaded vs Verified) is different depending on the Space.
Improvement Note: Start using "Failed". Use cases might include Arkivum status going from Green to Red, or otherwise failing a fixity check.
Improvement Note: Store the package state as a set or a list. This would allow a package to be Uploaded and Delete requested, instead of losing that state. We could also mark something as both Uploaded and Verified, resolving how to handle Arkivum status. Failed could coexist with the last known good state.
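One possible shape for the set-based state suggested above is a Python `enum.Flag`, where several statuses can be combined in a single value. This is only a sketch of the idea, not an agreed design; the `PackageStatus` class and its member names are hypothetical renderings of the statuses listed in this section.

```python
from enum import Flag, auto


class PackageStatus(Flag):
    """Hypothetical combinable version of the package statuses."""
    UPLOAD_PENDING = auto()
    STAGED = auto()
    UPLOADED = auto()
    VERIFIED = auto()
    FAILED = auto()
    DELETE_REQUESTED = auto()
    DELETED = auto()
    DEPOSIT_FINALIZED = auto()


# A stored package for which deletion is then requested keeps both states.
status = PackageStatus.UPLOADED
status |= PackageStatus.DELETE_REQUESTED
still_uploaded = PackageStatus.UPLOADED in status

# Rejecting the request clears only that flag, so "Uploaded" is not lost.
status_after_reject = status & ~PackageStatus.DELETE_REQUESTED
```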
API
Troubleshooting
The Storage Service keeps a log at /var/log/archivematica/storage-service.log. Errors may also be logged to the nginx and uwsgi logs, for example at /var/log/uwsgi/app/storage.log
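When troubleshooting, it can help to look at the most recent lines of these logs. The following is a small convenience sketch, assuming the log paths given above and read permission on them; the `tail` helper is hypothetical and not part of the Storage Service.

```python
from collections import deque
from pathlib import Path

# Log paths as given above.
LOG_PATHS = [
    "/var/log/archivematica/storage-service.log",
    "/var/log/uwsgi/app/storage.log",
]


def tail(path, lines=20):
    """Return the last `lines` lines of a file, or [] if it cannot be read."""
    try:
        with Path(path).open(errors="replace") as handle:
            # deque with maxlen keeps only the most recent lines in memory.
            return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
    except OSError:
        return []


if __name__ == "__main__":
    for log_path in LOG_PATHS:
        print(f"== {log_path}")
        for line in tail(log_path):
            print(line)
```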