Difference between revisions of "Storage Service"
Revision as of 18:34, 10 March 2017
The Archivematica Storage Service is a standalone web application that moves files to Archivematica for processing and from Archivematica into long-term storage, and keeps track of their location for later retrieval.
There are two main configuration levels in the Storage Service: Spaces and Locations.
- Space: where the files are stored. This is the protocol used to fetch and store the files in a storage system. Examples: Local filesystem, Duracloud. Spaces contain Locations.
- Location: why the files are there. This is what purpose Archivematica is using them for. Examples: Transfer Source, AIP Storage. Locations are inside Spaces.
Space
A storage Space contains all the information necessary to connect to the physical storage. It is where the files are stored. Protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored here. All locations must be contained in a space.
Because Spaces deal with many different protocols and transportation needs, there are many different types of them. Each different Space type defines its own Django model (class), which has an associated Space instance.
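The relationship between a protocol-specific Space type and its generic Space record can be sketched in plain Python. This is only an illustration of the composition described above, not the Storage Service's actual Django models; the field names (access_protocol, remote_name, remote_path) and the example values are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Space:
    """Generic settings shared by every Space type (field names are illustrative)."""
    access_protocol: str
    path: str
    staging_path: str


@dataclass
class NFS:
    """Protocol-specific record that points back at its generic Space."""
    space: Space
    remote_name: str  # hypothetical: hostname of the NFS server
    remote_path: str  # hypothetical: export path on that server


nfs = NFS(
    space=Space(
        access_protocol="NFS",
        path="/mnt/archive",
        staging_path="/var/archivematica/storage_service",
    ),
    remote_name="nfs.example.org",
    remote_path="/export/archive",
)

# Generic settings are always reached through the associated Space.
print(nfs.space.path)
```

Each protocol-specific object carries only the settings its protocol needs, while everything common to all Spaces lives on the single associated Space record.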
For path-based spaces, the Space is the immediate parent of the Location folders. For example, if you had transfer source locations at /home/artefactual/archivematica-sampledata-2013-10-10-09-17-20 and /home/artefactual/maildir_transfers, the Space’s path could be /home/artefactual/
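As a rough illustration of how a path-based Space and its Locations combine, the sketch below joins the example Space path with each Location's relative path. The helper name full_path is hypothetical; posixpath is used so the result is the same on any platform.

```python
import posixpath

SPACE_PATH = "/home/artefactual/"
LOCATION_RELATIVE_PATHS = [
    "archivematica-sampledata-2013-10-10-09-17-20",
    "maildir_transfers",
]


def full_path(space_path, relative_path):
    """Join a Space path and a Location relative path (hypothetical helper)."""
    return posixpath.join(space_path, relative_path)


for relative_path in LOCATION_RELATIVE_PATHS:
    print(full_path(SPACE_PATH, relative_path))
```

Note that a path join of this kind discards the Space path entirely if the second argument is absolute, so in this sketch the Location path must stay relative.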
All protocols require a staging path. This is a temporary location on the Storage Service server that is used when moving files. The Storage Service moves files by first copying them to the destination Space's staging directory, and then to the actual destination Space. This reduces complexity, because each Space only needs to know how to get files between the locally accessible staging directory and its own protocol, not between itself and every other protocol.
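The two-step move can be sketched as follows. This is a minimal local-filesystem illustration, assuming both steps are plain file copies; in a real Space the second step would use that Space's own protocol. The function name move_via_staging and the clean-up of the staged copy are assumptions of the sketch.

```python
import shutil
import tempfile
from pathlib import Path


def move_via_staging(source: Path, staging_dir: Path, destination: Path) -> Path:
    """Copy source into the staging directory, then on to the destination."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / source.name
    shutil.copy2(source, staged)  # step 1: source -> staging directory

    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(staged, destination)  # step 2: staging directory -> destination

    staged.unlink()  # assumption: the staged copy is removed afterwards
    return destination


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "source" / "aip.7z"
    source.parent.mkdir()
    source.write_text("example package")
    result = move_via_staging(source, root / "staging", root / "aipstore" / "aip.7z")
    print(result.read_text())
```

Because every transfer passes through the staging directory, supporting a new protocol only requires code for moving files between staging and that protocol.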
Improvement Note: Currently, each Space type is a distinct model with a OneToOneField back to Space. This was done because of warnings against concrete inheritance in models (https://jacobian.org/writing/concrete-inheritance/ and http://stackoverflow.com/questions/23466577/should-i-avoid-multi-table-concrete-inheritance-in-django-by-any-means). However, in the Storage Service we never want to access a child space without also accessing its parent, so that concern is probably not founded. A better future design would use concrete multi-table inheritance for the different types of Spaces.
Improvement Note: When originally written, the only Spaces conceived of were path-based (local filesystem, NFS, etc.), so the Space and Location information reflected that. However, most new Spaces are object-based, or otherwise don’t use Space.path and Location.relative_path, and shouldn’t use os.path.join to join the two. A better future design would move all path-related features out of Space into LocalFilesystem and the other path-based types, and remove the implicit os.path.join with Location.relative_path.
Arkivum
- Uses Space.path: Yes
- Supported purposes: AIP Storage
Dataverse
- Uses Space.path: No
- Supported purposes: Transfer Source
Duracloud
- Uses Space.path: No
- Supported purposes: Transfer Source, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery
DSpace
- Uses Space.path: No
- Supported purposes: AIP Storage
FEDORA via SWORD2
- Uses Space.path: Yes
- Supported purposes: FEDORA Deposit
Local Filesystem
- Uses Space.path: Yes
- Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery, Storage Service Internal
LOCKSS-o-matic
- Uses Space.path: Yes
- Supported purposes: AIP Storage
The Space.path is used as a staging location when making files available for harvesting.
NFS
- Uses Space.path: Yes
- Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery, Storage Service Internal
NFS is a stub space. It was intended to support auto mounting NFS shares, but is not significantly different from Local Filesystem. Currently, NFS handling should be done outside of Archivematica.
Pipeline Local Filesystem
- Uses Space.path: Yes
- Supported purposes: Transfer Source, Currently Processing, Transfer Backlog, AIP Storage, DIP Storage, AIP Recovery
Swift
- Uses Space.path: No
- Supported purposes: Transfer Source, Transfer Backlog, AIP Storage, DIP Storage
Locations
A storage Location is contained in a Space, and knows its purpose in the Archivematica system. This is why the files are there. A Location allows Archivematica to query for only storage that has been marked for a particular purpose.
Each Location should be associated with at least one pipeline. A pipeline can have multiple Locations of any purpose, except for Transfer Backlog and Currently Processing, of which it should have only one each. If you want the same directory on disk to have multiple purposes, you can create multiple Locations with different purposes.
Not all Spaces support all Location purposes. For example, several Spaces only allow AIP Storage, because they are only suitable for long-term storage and do not provide temporary storage (e.g. Transfer Backlog) or easy access to files (e.g. Transfer Source). When creating a new Location, only the allowed purposes are selectable in the menu.
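The supported purposes listed for each Space type in the Space section can be collected into a lookup table, which makes it easy to see which Spaces could host a Location of a given purpose. This is only a restatement of those lists as data; the names SUPPORTED_PURPOSES and spaces_supporting are made up for the sketch and are not part of the Storage Service code.

```python
# Common purposes of the fully path-based filesystem Spaces.
FILESYSTEM_PURPOSES = {
    "Transfer Source", "Currently Processing", "Transfer Backlog",
    "AIP Storage", "DIP Storage", "AIP Recovery",
}

# Space type -> purposes it supports, copied from the lists in the Space section.
SUPPORTED_PURPOSES = {
    "Arkivum": {"AIP Storage"},
    "Dataverse": {"Transfer Source"},
    "Duracloud": {
        "Transfer Source", "Transfer Backlog", "AIP Storage",
        "DIP Storage", "AIP Recovery",
    },
    "DSpace": {"AIP Storage"},
    "FEDORA via SWORD2": {"FEDORA Deposit"},
    "Local Filesystem": FILESYSTEM_PURPOSES | {"Storage Service Internal"},
    "LOCKSS-o-matic": {"AIP Storage"},
    "NFS": FILESYSTEM_PURPOSES | {"Storage Service Internal"},
    "Pipeline Local Filesystem": set(FILESYSTEM_PURPOSES),
    "Swift": {"Transfer Source", "Transfer Backlog", "AIP Storage", "DIP Storage"},
}


def spaces_supporting(purpose):
    """Return the Space types that allow a Location with this purpose."""
    return sorted(
        name for name, purposes in SUPPORTED_PURPOSES.items() if purpose in purposes
    )


print(spaces_supporting("Currently Processing"))
print(spaces_supporting("AIP Storage"))
```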
Transfer Source
- Purpose: Input into Archivematica
- Required: Yes
- Multiples allowed: Yes
Transfer source locations are where Transfers can be started from, and where metadata files can be added to a unit from. Transfer source locations display in Archivematica’s Transfer tab. Any folder in a transfer source can be selected to become a Transfer. The default value is /home in a Local Filesystem.
Currently Processing
- Purpose: For Archivematica's internal processing
- Required: Yes
- Multiples allowed: No
During processing, Archivematica uses the currently processing location associated with that pipeline. Exactly one currently processing location should be associated with a given pipeline. The default value is /var/archivematica/sharedDirectory in a Local Filesystem. This is required for Archivematica to run.
Transfer Backlog
- Purpose: Store Transfers in backlog
- Required: No (Yes if using Backlog)
- Multiples allowed: No
Transfer backlog stores transfers until the user continues processing them. The default value is /var/archivematica/sharedDirectory/www/AIPsStore/transferBacklog in a Local Filesystem. This is required to store and retrieve transfers in backlog.
AIP Storage
- Purpose: Store AIPs for long term storage
- Required: Yes
- Multiples allowed: Yes
AIP storage locations are where the completed AIPs are put for long-term storage. The default value is /var/archivematica/sharedDirectory/www/AIPsStore in a Local Filesystem. This is required to store and retrieve AIPs.
DIP Storage
- Purpose: Store DIPs before uploading to access systems
- Required: No
- Multiples allowed: Yes
DIP storage holds DIPs until they can be uploaded to an access system. The default value is /var/archivematica/sharedDirectory/www/DIPsStore in a Local Filesystem. This is required to store and retrieve DIPs, but it is not required to upload DIPs to access systems.
AIP Recovery
- Purpose: Recover a corrupted AIP
- Required: No
- Multiples allowed: No
AIP Recovery is where the AIP recovery feature looks for an AIP to recover. No more than one AIP recovery location should be associated with a given pipeline. The default value is /var/archivematica/storage_service/recover in a Local Filesystem. This is only required if AIP recovery is used.
Needs clarification: Does this location store the corrupted AIP, a duplicate copy of the AIP that it can be recovered from, or something else?
Storage Service Internal
- Purpose: Internal staging area for the Storage Service
- Required: Yes
- Multiples allowed: No
- Associated with a pipeline: No
There should be exactly one Storage Service Internal Processing location for each Storage Service installation. The default value is /var/archivematica/storage_service in a Local Filesystem. This is required for the Storage Service to run, and must be locally available to the Storage Service. It should not be associated with any pipelines.
FEDORA Deposit
- Purpose: Store deposited transfers from Archidora before starting as a Transfer.
- Required: No
- Multiples allowed: Yes
FEDORA Deposit is used with the Archidora plugin to ingest material from Islandora. This is only available to the FEDORA Space, and is only required for that space.
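The Required and Multiples allowed values listed for each purpose can be turned into a simple consistency check for the Locations attached to one pipeline. This is a sketch of such a check, not something the Storage Service is known to provide; the names PURPOSE_RULES and check_pipeline_locations are hypothetical. Storage Service Internal is handled separately because it should not be associated with any pipeline.

```python
from collections import Counter

# purpose -> (required, multiples allowed), taken from the sections above.
PURPOSE_RULES = {
    "Transfer Source": (True, True),
    "Currently Processing": (True, False),
    "Transfer Backlog": (False, False),
    "AIP Storage": (True, True),
    "DIP Storage": (False, True),
    "AIP Recovery": (False, False),
    "FEDORA Deposit": (False, True),
}


def check_pipeline_locations(purposes):
    """Return a list of problems for the Location purposes of one pipeline."""
    counts = Counter(purposes)
    problems = []
    for purpose, (required, multiples_allowed) in PURPOSE_RULES.items():
        count = counts.get(purpose, 0)
        if required and count == 0:
            problems.append(f"missing required location: {purpose}")
        if not multiples_allowed and count > 1:
            problems.append(f"more than one {purpose} location")
    if counts.get("Storage Service Internal", 0) > 0:
        problems.append("Storage Service Internal should not be tied to a pipeline")
    return problems


print(check_pipeline_locations(
    ["Transfer Source", "Currently Processing", "Currently Processing"]
))
```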
Pipelines
Archivematica installations are tracked in the Storage Service as Pipelines. Locations are associated with Pipelines, which gives those Pipelines access to that storage. If a Location is not associated with a Pipeline, it doesn't exist as far as that Pipeline is concerned.
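The visibility rule can be illustrated with a small filter: a pipeline only ever sees the Locations it is associated with, optionally narrowed to one purpose. The Location class, its fields, and the pipeline identifiers below are stand-ins invented for the sketch, not the Storage Service's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Illustrative stand-in for a Storage Service Location."""
    purpose: str
    relative_path: str
    pipelines: set = field(default_factory=set)


def locations_for(pipeline_id, locations, purpose=None):
    """Return the Locations visible to a pipeline, optionally for one purpose."""
    return [
        location
        for location in locations
        if pipeline_id in location.pipelines
        and (purpose is None or location.purpose == purpose)
    ]


all_locations = [
    Location("Transfer Source", "home", {"pipeline-a", "pipeline-b"}),
    Location("AIP Storage", "aipstore-a", {"pipeline-a"}),
    Location("AIP Storage", "aipstore-b", {"pipeline-b"}),
]

# pipeline-a never sees aipstore-b, because the two are not associated.
for location in locations_for("pipeline-a", all_locations, purpose="AIP Storage"):
    print(location.relative_path)
```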
Troubleshooting
The Storage Service keeps a log at /var/log/archivematica/storage-service.log. Errors may also be logged to the nginx and uwsgi logs, for example at /var/log/uwsgi/app/storage.log
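When investigating a problem, it can help to look at the most recent entries in these files. The sketch below prints the last few lines of each log that exists and is readable; the helper name tail and the line count are arbitrary choices, and reading files under /var/log may require elevated permissions.

```python
from collections import deque
from pathlib import Path

LOG_PATHS = [
    Path("/var/log/archivematica/storage-service.log"),
    Path("/var/log/uwsgi/app/storage.log"),
]


def tail(path, lines=20):
    """Return the last `lines` lines of a text file."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]


for log_path in LOG_PATHS:
    try:
        recent = tail(log_path)
    except OSError as error:  # missing file, permission denied, etc.
        print(f"cannot read {log_path}: {error}")
        continue
    print(f"--- {log_path} ---")
    print("\n".join(recent))
```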