Difference between revisions of "Improvements/AIP Packaging"
Revision as of 12:32, 14 August 2018
User story
As a repository manager, I require flexibility in how AIPs are packaged so they can be stored as one or more physical entities.
Status
Analysis is ongoing.
Interest
If you'd like to get involved in this development, please feel free to contribute to this wiki page or start a discussion on our user forum.
Analysis:
Currently, Archivematica can only package a single AIP as a single bag. This bag can be stored as a folder (referred to as 'uncompressed' in the Archivematica UI) or as a single file (by default a 7zip file).
This has limitations in some repository environments and does not allow archivists/repository managers flexibility in how AIPs are stored and accessed. For example, some storage systems have a maximum file size limitation, which an individual AIP may exceed. In other cases, an organisation may have a requirement to encrypt all content at rest.
Use case: AIP split into multiple parts
An AIP is split into multiple pieces (zipped packages, loose files, binary chunks) for storage and retrieval purposes. There needs to be a way to record metadata that indicates the existence and locations of all of the parts, to record PREMIS events for each transformation that was applied to the AIP, and to record details about how to reverse each transformation.
AIP Splitting scenarios
Archivematica already creates pointer files, which are METS files that describe an AIP. Pointer files record a PREMIS event when an AIP is compressed, for example. They can also be used to record metadata about AIP splitting.
Scenario 1: Simple splitting
An AIP is stored as a bag, and the bag is then turned into a .7z file. The .7z file is then split into multiple parts. This can be done with the unix split command (split man page: http://man7.org/linux/man-pages/man1/split.1.html), with the -v argument to 7z (7z volumes: https://sevenzip.osdn.jp/chm/cmdline/switches/volume.htm) or by some other method. The result would look like:
 └── AIP1 (folder)
     ├── AIP1.7z.001 (binary chunk)
     ├── AIP1.7z.002 (binary chunk)
     ├── AIP1.7z.003 (binary chunk)
     └── pointer.xml (xml file)
The pointer file would contain metadata outlining how to put the 3 parts back together into a single .7z file and unpack it. The result of this would be the original bag containing the AIP.
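The splitting and reassembly described in Scenario 1 can be sketched in Python. This is only an illustration, not how Archivematica implements it: the function names `split_file` and `join_parts` and the list-of-dicts record format are hypothetical, and a real pointer file would carry the equivalent information as METS/PREMIS metadata rather than as a Python structure. Each chunk is read fully into memory, so the chunk size is assumed to be modest.

```python
import hashlib
from pathlib import Path


def split_file(source: Path, out_dir: Path, chunk_size: int) -> list[dict]:
    """Split `source` into numbered chunks and return an ordered list of part records.

    The records (name + sha256) stand in for what a pointer file would need
    to describe; the format is an assumption made for this sketch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    index = 1
    with source.open("rb") as src:
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            # Produces names such as AIP1.7z.001, AIP1.7z.002, ...
            part = out_dir / f"{source.name}.{index:03d}"
            part.write_bytes(data)
            parts.append(
                {"name": part.name, "sha256": hashlib.sha256(data).hexdigest()}
            )
            index += 1
    return parts


def join_parts(parts: list[dict], parts_dir: Path, target: Path) -> None:
    """Check each part against its recorded checksum, then concatenate in order."""
    with target.open("wb") as out:
        for record in parts:
            data = (parts_dir / record["name"]).read_bytes()
            if hashlib.sha256(data).hexdigest() != record["sha256"]:
                raise ValueError(f"checksum mismatch for {record['name']}")
            out.write(data)
```

Recording a checksum per part means a damaged chunk can be detected before the .7z file is reassembled, rather than surfacing later as an unpacking error.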
Scenario 2: Splitting into a Bag
One problem with the first scenario is that the AIP1 folder is not structured according to any standard. Some storage systems may have a requirement to store content in bags. To satisfy this, this 2nd scenario adds an additional step - create a bag to hold the chunks:
 .
 └── AIP1 (folder)
     ├── bag-info.txt
     ├── bagit.txt
     ├── data
     │   ├── AIP1.7z.001 (binary chunk)
     │   ├── AIP1.7z.002 (binary chunk)
     │   ├── AIP1.7z.003 (binary chunk)
     │   └── pointer.xml
     │
     ├── manifest-md5.txt
     ├── manifest-sha256.txt
     ├── tagmanifest-md5.txt
     └── tagmanifest-sha256.txt
In this scenario, two bags are actually created: an outer bag that holds the chunks (parts 1 to 3) and the pointer file, and the original inner bag containing the AIP, which is recovered once the chunks are stitched back together and unpacked. The outer bag is useful for allowing checksum/integrity checking in a standards-compliant manner (by validating the bag). It also allows metadata about the entire AIP to be recorded in bag-info.txt, for example to conform to a storage system's requirement to use Bag Profiles.
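To illustrate the integrity checking that the outer bag allows, here is a deliberately simplified Python sketch. It writes only bagit.txt and manifest-sha256.txt, omitting bag-info.txt, the md5 manifest and the tag manifests shown in the tree above, so it is not a complete BagIt implementation; in practice a maintained BagIt library would normally be used. The function names `make_outer_bag` and `validate_outer_bag` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def make_outer_bag(payload_files: list[Path], bag_dir: Path) -> None:
    """Copy the chunks and pointer file into bag_dir/data and write a sha256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for src in payload_files:
        dest = data_dir / src.name
        shutil.copy2(src, dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        # Manifest entries pair a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{src.name}\n")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )


def validate_outer_bag(bag_dir: Path) -> bool:
    """Return True if every payload file still matches its manifest checksum."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True
```

Validation here only covers the outer bag's payload (the chunks and pointer file). The inner bag can be validated separately after the chunks are rejoined and unpacked.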
Scenario 3: Splitting into many Bags
This scenario is a bit more complicated than Scenario 2. The only advantage it brings is the ability to further transform each bag (e.g. compress, encrypt). This might be a requirement if using an object storage system, where it is desirable to store each bag as a single file. This is not possible in scenario 2 without exceeding the maximum file size of the storage system.
 .
 └── AIP1 (folder)
     ├── AIP1.001 (folder)
     │   ├── bag-info.txt
     │   ├── bagit.txt
     │   ├── data
     │   │   └── AIP1.7z.001
     │   ├── manifest-md5.txt
     │   ├── manifest-sha256.txt
     │   ├── tagmanifest-md5.txt
     │   └── tagmanifest-sha256.txt
     ├── AIP1.002 (folder)
     │   ├── bag-info.txt
     │   ├── bagit.txt
     │   ├── data
     │   │   └── AIP1.7z.002
     │   ├── manifest-md5.txt
     │   ├── manifest-sha256.txt
     │   ├── tagmanifest-md5.txt
     │   └── tagmanifest-sha256.txt
     ├── AIP1.003 (folder)
     │   ├── bag-info.txt
     │   ├── bagit.txt
     │   ├── data
     │   │   └── AIP1.7z.003
     │   ├── manifest-md5.txt
     │   ├── manifest-sha256.txt
     │   ├── tagmanifest-md5.txt
     │   └── tagmanifest-sha256.txt
     └── pointer.xml
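The "store each bag as a single file" step in Scenario 3 could look something like the following Python sketch, which packs every per-chunk bag folder into its own tar archive. The function name `pack_bags` and the choice of uncompressed tar are assumptions for illustration; compression or encryption could be applied to each archive in the same per-bag fashion.

```python
import shutil
from pathlib import Path


def pack_bags(parent: Path, out_dir: Path) -> list[Path]:
    """Pack each bag folder under `parent` into its own tar file in `out_dir`.

    `out_dir` is assumed to live outside `parent`, so that the archives
    being written are not mistaken for bag folders.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    # Loose files such as pointer.xml are skipped; only bag folders are packed.
    for bag_dir in sorted(p for p in parent.iterdir() if p.is_dir()):
        archive = shutil.make_archive(
            base_name=str(out_dir / bag_dir.name),
            format="tar",
            root_dir=str(parent),
            base_dir=bag_dir.name,
        )
        archives.append(Path(archive))
    return archives
```

Because each bag holds a single chunk, each resulting archive stays close to the chunk size, which is what keeps it under a storage system's maximum file size.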
Use case: Encryption
An AIP should be encrypted before it is stored, independent of where it is stored. The AIP pointer file needs to track the information required to decrypt the AIP on retrieval.