Storage Service

The Archivematica Storage Service is a standalone web application that moves files into Archivematica for processing, moves them from Archivematica into long-term storage, and keeps track of their location for later retrieval.

There are two main configuration levels in the Storage Service: Spaces and Locations.
* [[#Space | Space]]: '''where''' the files are stored. A Space describes the protocol used to fetch and store files in a storage system. Examples: Local Filesystem, Duracloud. Spaces contain Locations.
* [[#Locations | Location]]: '''why''' the files are there, that is, the purpose Archivematica uses them for. Examples: Transfer Source, AIP Storage. Locations are inside Spaces.

== Space ==

A storage Space contains all the information necessary to connect to the physical storage. It is '''where''' the files are stored. Protocol-specific information, such as an NFS export path and hostname, or the username for a system accessible only via SSH, is stored here. Every Location must be contained in a Space.

Because Spaces deal with many different protocols and transport needs, there are many types of Space. Each Space type defines its own Django model (class), and each instance of that model is linked to an associated Space instance.

For path-based Spaces, the Space is the immediate parent of the Location folders. For example, if you had Transfer Source locations at <code>/home/artefactual/archivematica-sampledata-2013-10-10-09-17-20</code> and <code>/home/artefactual/maildir_transfers</code>, the Space’s path could be <code>/home/artefactual/</code>.
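
As a minimal sketch of the path relationship described above (not the Storage Service's own code), the full path of a path-based Location can be thought of as the Space path joined with a path relative to it. The function and variable names are hypothetical; only the example paths come from the text.

```python
import posixpath

# Example values taken from the paragraph above; the names are illustrative only.
space_path = "/home/artefactual/"
relative_paths = [
    "archivematica-sampledata-2013-10-10-09-17-20",
    "maildir_transfers",
]


def full_location_path(space_path: str, relative_path: str) -> str:
    """Join a Space path with a Location path that is relative to it."""
    return posixpath.join(space_path, relative_path)


location_paths = [full_location_path(space_path, rel) for rel in relative_paths]
for path in location_paths:
    print(path)
```

Note that <code>posixpath.join</code> (like <code>os.path.join</code> on POSIX systems) discards the Space path if the second component is absolute, so this sketch assumes the Location path is stored without a leading slash.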

All protocols require a staging path. This is a temporary location on the Storage Service server that is used when moving files. The Storage Service moves files by first copying them to the destination Space's staging directory, and then to the actual destination Space. This reduces complexity, because each Space only needs to know how to move files between the locally accessible staging directory and its own protocol, not between itself and every other protocol.
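
The two-hop move described above can be sketched as follows. This is an illustration under stated assumptions, not the Storage Service implementation: <code>DirectorySpace</code>, <code>fetch_to_staging</code>, <code>store_from_staging</code> and <code>move_file</code> are hypothetical names, and the "protocol" here is simply a directory on disk.

```python
import shutil
import tempfile
from pathlib import Path


class DirectorySpace:
    """Hypothetical Space whose 'protocol' is just a directory on disk."""

    def __init__(self, root: Path, staging: Path):
        self.root = root
        self.staging = staging

    def fetch_to_staging(self, name: str, staging: Path) -> Path:
        """Protocol -> staging: copy a stored file into a staging directory."""
        staging.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(self.root / name, staging / name))

    def store_from_staging(self, staged_file: Path) -> Path:
        """Staging -> protocol: copy a staged file into this Space."""
        self.root.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(staged_file, self.root / staged_file.name))


def move_file(name: str, source: DirectorySpace, destination: DirectorySpace) -> Path:
    """Two hops: source -> destination's staging directory -> destination."""
    staged = source.fetch_to_staging(name, destination.staging)
    return destination.store_from_staging(staged)


# Demo with throwaway directories.
base = Path(tempfile.mkdtemp())
src = DirectorySpace(base / "src", base / "src_staging")
dst = DirectorySpace(base / "dst", base / "dst_staging")
src.root.mkdir(parents=True)
(src.root / "aip.7z").write_text("example")

stored = move_file("aip.7z", src, dst)
print(stored)
```

Each Space type only implements the two staging-facing methods, so adding a new protocol does not require any knowledge of the other protocols.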

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10";
| Improvement Note: Currently, each Space type is a distinct model with a <code>OneToOneField</code> back to Space. This was done because of warnings against concrete inheritance in models [https://jacobian.org/writing/concrete-inheritance/] [http://stackoverflow.com/questions/23466577/should-i-avoid-multi-table-concrete-inheritance-in-django-by-any-means]. However, in the Storage Service we never want to access a child space without also accessing its parent, so that concern probably does not apply. A better future design would use concrete multi-table inheritance for the different types of Spaces.
|}

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10";
| Improvement Note: When the Storage Service was originally written, the only Spaces conceived of were path-based (local filesystem, NFS, etc.), so the Space and Location information reflected that. However, most new Spaces are object-based, or otherwise do not use <code>Space.path</code> and <code>Location.relative_path</code>, and should not use <code>os.path.join</code> to join the two. A better future design would move all path-related features out of Space into LocalFilesystem and similar types, and remove the implicit <code>os.path.join</code> with <code>Location.relative_path</code>.
|}

=== Arkivum ===

* '''Uses Space.path''': Yes
* '''Supported purposes''': [[#AIP Storage | AIP Storage]]

=== Dataverse ===

* '''Uses Space.path''': No
* '''Supported purposes''': [[#Transfer Source | Transfer Source]]

=== Duracloud ===

* '''Uses Space.path''': No
* '''Supported purposes''': [[#Transfer Source | Transfer Source]], [[#Transfer Backlog | Transfer Backlog]], [[#AIP Storage | AIP Storage]], [[#DIP Storage | DIP Storage]], [[#AIP Recovery | AIP Recovery]]

=== DSpace ===

* '''Uses Space.path''': No
* '''Supported purposes''': [[#AIP Storage | AIP Storage]]

=== FEDORA via SWORD2 ===

* '''Uses Space.path''': Yes
* '''Supported purposes''': [[#FEDORA Deposit | FEDORA Deposit]]

=== Local Filesystem ===

* '''Uses Space.path''': Yes
* '''Supported purposes''': [[#Transfer Source | Transfer Source]], [[#Currently Processing | Currently Processing]], [[#Transfer Backlog | Transfer Backlog]], [[#AIP Storage | AIP Storage]], [[#DIP Storage | DIP Storage]], [[#AIP Recovery | AIP Recovery]], [[#Storage Service Internal | Storage Service Internal]]

=== LOCKSS-o-matic ===

* '''Uses Space.path''': Yes
* '''Supported purposes''': [[#AIP Storage | AIP Storage]]

The Space.path is used as a staging location when making files available for harvesting.

=== NFS ===

* '''Uses Space.path''': Yes
* '''Supported purposes''': [[#Transfer Source | Transfer Source]], [[#Currently Processing | Currently Processing]], [[#Transfer Backlog | Transfer Backlog]], [[#AIP Storage | AIP Storage]], [[#DIP Storage | DIP Storage]], [[#AIP Recovery | AIP Recovery]], [[#Storage Service Internal | Storage Service Internal]]

NFS is a stub Space. It was intended to support automounting NFS shares, but is not significantly different from [[#Local Filesystem | Local Filesystem]]. Currently, NFS mounts should be handled outside of Archivematica.

=== Pipeline Local Filesystem ===

* '''Uses Space.path''': Yes
* '''Supported purposes''': [[#Transfer Source | Transfer Source]], [[#Currently Processing | Currently Processing]], [[#Transfer Backlog | Transfer Backlog]], [[#AIP Storage | AIP Storage]], [[#DIP Storage | DIP Storage]], [[#AIP Recovery | AIP Recovery]]

=== Swift ===

* '''Uses Space.path''': No
* '''Supported purposes''': [[#Transfer Source | Transfer Source]], [[#Transfer Backlog | Transfer Backlog]], [[#AIP Storage | AIP Storage]], [[#DIP Storage | DIP Storage]]

== Locations ==

A storage Location is contained in a Space and knows its purpose in the Archivematica system. This is '''why''' the files are there. A Location allows Archivematica to query only for storage that has been marked for a particular purpose.

Each Location should be associated with at least one pipeline. A pipeline can have multiple Locations of any purpose, except for Transfer Backlog and Currently Processing, of which there should be only one each. If you want the same directory on disk to serve multiple purposes, create multiple Locations with different purposes.
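
The cardinality rules can be sketched as a small check over the purposes of the Locations attached to one pipeline. This is an illustrative sketch rather than Storage Service code: <code>PURPOSE_RULES</code> and <code>check_pipeline_locations</code> are hypothetical names, and the values are copied from the "Required" and "Multiples allowed" entries listed for each purpose below.

```python
from collections import Counter

# Purpose -> (required, multiples_allowed), copied from the per-purpose lists.
# Storage Service Internal is omitted because it is not associated with a pipeline.
PURPOSE_RULES = {
    "Transfer Source": (True, True),
    "Currently Processing": (True, False),
    "Transfer Backlog": (False, False),
    "AIP Storage": (True, True),
    "DIP Storage": (False, True),
    "AIP Recovery": (False, False),
    "FEDORA Deposit": (False, True),
}


def check_pipeline_locations(purposes):
    """Return a list of human-readable problems for one pipeline's Locations."""
    counts = Counter(purposes)
    problems = []
    for purpose, (required, multiples_allowed) in PURPOSE_RULES.items():
        n = counts.get(purpose, 0)
        if required and n == 0:
            problems.append(f"missing required location: {purpose}")
        if not multiples_allowed and n > 1:
            problems.append(f"more than one location: {purpose}")
    return problems


ok = ["Transfer Source", "Transfer Source", "Currently Processing", "AIP Storage"]
bad = ["Transfer Source", "Currently Processing", "Currently Processing"]

print(check_pipeline_locations(ok))
print(check_pipeline_locations(bad))
```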

Not all Spaces support all Location purposes. For example, several Spaces only allow AIP Storage, because they are only suitable for long-term storage and do not provide temporary storage (e.g. Transfer Backlog) or easy access to files (e.g. Transfer Source). When creating a new Location, only the allowed purposes are selectable in the menu.
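
The restriction can be pictured as a lookup table from Space type to allowed purposes. The sketch below copies the "Supported purposes" entries from the Space section into a plain dictionary; <code>SUPPORTED_PURPOSES</code>, <code>allowed_purposes</code> and <code>aip_only_spaces</code> are hypothetical names, and the Storage Service itself may represent this differently.

```python
# Purposes shared by the path-based Space types.
PATH_PURPOSES = {
    "Transfer Source", "Currently Processing", "Transfer Backlog",
    "AIP Storage", "DIP Storage", "AIP Recovery",
}

# Space type -> supported purposes, as listed in the Space section.
SUPPORTED_PURPOSES = {
    "Arkivum": {"AIP Storage"},
    "Dataverse": {"Transfer Source"},
    "Duracloud": {"Transfer Source", "Transfer Backlog", "AIP Storage",
                  "DIP Storage", "AIP Recovery"},
    "DSpace": {"AIP Storage"},
    "FEDORA via SWORD2": {"FEDORA Deposit"},
    "Local Filesystem": PATH_PURPOSES | {"Storage Service Internal"},
    "LOCKSS-o-matic": {"AIP Storage"},
    "NFS": PATH_PURPOSES | {"Storage Service Internal"},
    "Pipeline Local Filesystem": PATH_PURPOSES,
    "Swift": {"Transfer Source", "Transfer Backlog", "AIP Storage", "DIP Storage"},
}


def allowed_purposes(space_type):
    """Purposes that could be offered when creating a Location in this Space."""
    return sorted(SUPPORTED_PURPOSES[space_type])


def aip_only_spaces():
    """Space types that only support AIP Storage."""
    return sorted(
        name for name, purposes in SUPPORTED_PURPOSES.items()
        if purposes == {"AIP Storage"}
    )


print(allowed_purposes("Swift"))
print(aip_only_spaces())
```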

=== Transfer Source ===

* '''Purpose''': Input into Archivematica
* '''Required''': Yes
* '''Multiples allowed''': Yes

Transfer Source locations are where Transfers are started from, and from which metadata files can be added to a unit. Transfer Source locations are displayed in Archivematica’s Transfer tab. Any folder in a Transfer Source can be selected to become a Transfer. The default value is <code>/home</code> in a Local Filesystem.

=== Currently Processing ===

* '''Purpose''': For Archivematica's internal processing
* '''Required''': Yes
* '''Multiples allowed''': No

During processing, Archivematica uses the Currently Processing location associated with that pipeline. Exactly one Currently Processing location should be associated with a given pipeline. The default value is <code>/var/archivematica/sharedDirectory</code> in a Local Filesystem. This is required for Archivematica to run.

=== Transfer Backlog ===

* '''Purpose''': Store Transfers in backlog
* '''Required''': No (Yes if using Backlog)
* '''Multiples allowed''': No

Transfer Backlog stores transfers until the user resumes processing them. The default value is <code>/var/archivematica/sharedDirectory/www/AIPsStore/transferBacklog</code> in a Local Filesystem. This is required to store and retrieve transfers in backlog.

=== AIP Storage ===

* '''Purpose''': Store AIPs for long-term storage
* '''Required''': Yes
* '''Multiples allowed''': Yes

AIP Storage locations are where completed AIPs are put for long-term storage. The default value is <code>/var/archivematica/sharedDirectory/www/AIPsStore</code> in a Local Filesystem. This is required to store and retrieve AIPs.

=== DIP Storage ===

* '''Purpose''': Store DIPs before uploading to access systems
* '''Required''': No
* '''Multiples allowed''': Yes

DIP Storage is used for storing DIPs until they can be uploaded to an access system. The default value is <code>/var/archivematica/sharedDirectory/www/DIPsStore</code> in a Local Filesystem. This is required to store and retrieve DIPs, but not to upload DIPs to access systems.

=== AIP Recovery ===

* '''Purpose''': Recover a corrupted AIP
* '''Required''': No
* '''Multiples allowed''': No

AIP Recovery is where the AIP recovery feature looks for an AIP to recover. No more than one AIP Recovery location should be associated with a given pipeline. The default value is <code>/var/archivematica/storage_service/recover</code> in a Local Filesystem. This is only required if AIP recovery is used.

Needs clarification: Does this location store the corrupted AIP, store a duplicate copy of the AIP from which it can be recovered, or something else?

=== Storage Service Internal ===

* '''Purpose''': Internal staging area for the Storage Service
* '''Required''': Yes
* '''Multiples allowed''': No
* '''Associated with a pipeline''': No

There should be exactly one Storage Service Internal Processing location for each Storage Service installation. The default value is <code>/var/archivematica/storage_service</code> in a Local Filesystem. This is required for the Storage Service to run, and must be locally available to the Storage Service. It should not be associated with any pipelines.

=== FEDORA Deposit ===

* '''Purpose''': Store deposited transfers from Archidora before starting as a Transfer.
* '''Required''': No
* '''Multiples allowed''': Yes

FEDORA Deposit is used with the Archidora plugin to ingest material from Islandora. This purpose is only available to the FEDORA Space, and is only required for that Space.

== Pipelines ==

Archivematica installations are tracked in the Storage Service as Pipelines. Locations are associated with Pipelines, which gives those Pipelines access to that storage. If a Location is not associated with a Pipeline, it does not exist as far as that Pipeline is concerned.
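
A minimal sketch of this visibility rule, assuming a simplified in-memory model: the <code>Location</code> dataclass, its fields, the pipeline identifiers and <code>visible_locations</code> are all hypothetical and only illustrate that a pipeline sees the Locations it is associated with and nothing else. The paths are the default values given in the Locations section.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Hypothetical, simplified stand-in for a Storage Service Location."""
    purpose: str
    path: str
    pipelines: set = field(default_factory=set)


def visible_locations(locations, pipeline, purpose=None):
    """Locations associated with `pipeline`, optionally filtered by purpose."""
    return [
        loc for loc in locations
        if pipeline in loc.pipelines and (purpose is None or loc.purpose == purpose)
    ]


locations = [
    Location("AIP Storage", "/var/archivematica/sharedDirectory/www/AIPsStore",
             {"pipeline-a", "pipeline-b"}),
    Location("Transfer Source", "/home", {"pipeline-a"}),
    # Not associated with any pipeline, so no pipeline can see it.
    Location("Storage Service Internal", "/var/archivematica/storage_service"),
]

for loc in visible_locations(locations, "pipeline-b"):
    print(loc.purpose, loc.path)
```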

== Troubleshooting ==

The Storage Service keeps a log at <code>/var/log/archivematica/storage-service.log</code>. Errors may also be logged to the nginx and uwsgi logs, including <code>/var/log/uwsgi/app/storage.log</code>.
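
When looking for a recent failure, it can help to pull only the matching lines out of a log. The helper below is a generic sketch, not part of the Storage Service: <code>recent_matches</code> is a hypothetical name, and the assumption that error lines contain the text <code>ERROR</code> depends on how logging is configured. The demo runs against a throwaway file with made-up lines; in practice the path would be one of the log files named above.

```python
import tempfile
from pathlib import Path


def recent_matches(log_path, needle="ERROR", limit=20):
    """Return up to `limit` of the most recent lines containing `needle`."""
    matches = []
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line:
                matches.append(line.rstrip("\n"))
    return matches[-limit:]


# Demo on a temporary file; the line format here is invented for illustration.
demo = Path(tempfile.mkdtemp()) / "storage-service.log"
demo.write_text(
    "INFO starting\nERROR could not connect\nINFO done\nERROR disk full\n",
    encoding="utf-8",
)

errors = recent_matches(demo, limit=1)
print(errors)
```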
  
 
== See also ==

* [[Administrator manual 1.1#Storage service]]

[[Category:Development documentation]]

Revision as of 19:34, 10 March 2017
