Improvements/AIP Packaging

From Archivematica
Revision as of 15:54, 11 February 2020 by Sallain (talk | contribs)
This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

User story

As a repository manager, I require flexibility in how AIPs are packaged so they can be stored as one or more physical entities.

Status

Analysis is ongoing.

Interest

If you'd like to get involved in this development, please feel free to contribute to this wiki page or start a discussion on our user forum.

Analysis

Currently, Archivematica can only package a single AIP as a single bag. This bag can be stored as a folder (referred to as 'uncompressed' in the Archivematica UI) or as a single file (by default a 7zip file).

This has limitations in some repository environments and does not allow archivists/repository managers flexibility in how AIPs are stored and accessed. For example, some storage systems have a maximum file size limitation, which an individual AIP may exceed. In other cases, an organisation may have a requirement to encrypt all content at rest.

Use case: AIP split into multiple parts

An AIP is split into multiple pieces (zipped packages, loose files, binary chunks) for storage and retrieval purposes. There needs to be a way to record metadata that indicates the existence and locations of all the parts, to record a PREMIS event for each transformation that was applied to the AIP, and to record details about how to reverse each transformation.

AIP Splitting scenarios

Archivematica already creates pointer files, which are METS files that describe an AIP. Pointer files record a PREMIS event when an AIP is compressed, for example. They can also be used to record metadata about AIP splitting.
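As a rough illustration of what such an event might look like, the sketch below builds a single PREMIS event element with Python's standard library. It assumes the PREMIS 3 namespace; the event type "splitting" and the wording of the detail text are hypothetical and are not taken from Archivematica's pointer file format.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumption: PREMIS version 3 namespace.
PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)


def premis_event(event_type: str, detail: str, outcome: str = "success") -> ET.Element:
    """Build a PREMIS <event> element describing one transformation of an AIP."""

    def q(tag: str) -> str:
        # Qualify a tag name with the PREMIS namespace.
        return f"{{{PREMIS_NS}}}{tag}"

    event = ET.Element(q("event"))
    identifier = ET.SubElement(event, q("eventIdentifier"))
    ET.SubElement(identifier, q("eventIdentifierType")).text = "UUID"
    ET.SubElement(identifier, q("eventIdentifierValue")).text = str(uuid.uuid4())
    ET.SubElement(event, q("eventType")).text = event_type
    ET.SubElement(event, q("eventDateTime")).text = datetime.now(timezone.utc).isoformat()
    detail_info = ET.SubElement(event, q("eventDetailInformation"))
    ET.SubElement(detail_info, q("eventDetail")).text = detail
    outcome_info = ET.SubElement(event, q("eventOutcomeInformation"))
    ET.SubElement(outcome_info, q("eventOutcome")).text = outcome
    return event


# Hypothetical event type and detail text, for illustration only.
event = premis_event(
    event_type="splitting",
    detail=(
        "Split AIP1.7z into AIP1.7z.001, AIP1.7z.002, AIP1.7z.003; "
        "reverse by concatenating the parts in order"
    ),
)
xml_text = ET.tostring(event, encoding="unicode")
```

In a real pointer file the event would sit inside the METS structure alongside the existing compression event; that wrapping is omitted here.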

Scenario 1: Simple splitting

An AIP is stored as a bag, and the bag is then turned into a .7z file. The .7z file is then split into multiple parts. This can be done with the Unix split command (see the split man page), with the -v argument to 7z (7z volumes), or by some other method. The result would look like:

.
└── AIP1 (folder)
    ├── AIP1.7z.001 (binary chunk)
    ├── AIP1.7z.002 (binary chunk)
    ├── AIP1.7z.003 (binary chunk)
    └── pointer.xml (xml file)

The pointer file would contain metadata outlining how to put the three parts back together into a single .7z file and unpack it. The result of this would be the original bag containing the AIP.
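The split and rejoin steps of Scenario 1 can be sketched in plain Python. This is a minimal illustration and not how Archivematica does it: the function names `split_file` and `join_chunks` are hypothetical, and the three-digit suffixes simply mirror the naming in the tree above.

```python
import shutil
from pathlib import Path
from typing import List


def split_file(source: Path, chunk_size: int, buffer_size: int = 1024 * 1024) -> List[Path]:
    """Split source into numbered chunks (name.001, name.002, ...) of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[Path] = []
    with source.open("rb") as src:
        index = 1
        while True:
            # Read the first block of the next chunk; an empty read means end of file.
            first = src.read(min(buffer_size, chunk_size))
            if not first:
                break
            chunk = source.with_name(f"{source.name}.{index:03d}")
            with chunk.open("wb") as out:
                out.write(first)
                remaining = chunk_size - len(first)
                # Stream the rest of the chunk in buffer-sized blocks.
                while remaining > 0:
                    block = src.read(min(buffer_size, remaining))
                    if not block:
                        break
                    out.write(block)
                    remaining -= len(block)
            chunks.append(chunk)
            index += 1
    return chunks


def join_chunks(chunks: List[Path], target: Path) -> None:
    """Concatenate chunks, ordered by file name, back into a single file."""
    with target.open("wb") as out:
        for chunk in sorted(chunks, key=lambda p: p.name):
            with chunk.open("rb") as part:
                shutil.copyfileobj(part, out)
```

Ordering by file name only works while the zero-padded suffixes have the same width, so a pointer file that lists the parts explicitly would be a safer source of truth for the rejoin order.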

Scenario 2: Splitting into a Bag

One problem with the first scenario is that the AIP1 folder is not structured according to any standard. Some storage systems may have a requirement to store content in bags. To satisfy this, the second scenario adds a step, creating a bag to hold the chunks:

.
└── AIP1 (folder)
    ├── bag-info.txt
    ├── bagit.txt
    ├── data
    │   ├── AIP1.7z.001 (binary chunk)
    │   ├── AIP1.7z.002 (binary chunk)
    │   ├── AIP1.7z.003 (binary chunk)
    │   └── pointer.xml 
    │
    ├── manifest-md5.txt
    ├── manifest-sha256.txt
    ├── tagmanifest-md5.txt
    └── tagmanifest-sha256.txt

In this scenario, two bags are created. The outer bag holds the chunks (parts 1 to 3) and the pointer file. The inner bag is the original bag containing the AIP, which is recovered once the chunks are stitched back together and unpacked. The outer bag is useful for allowing checksum/integrity checking in a standards-compliant manner (by validating the bag). It also allows metadata about the entire AIP to be recorded in bag-info.txt, for example to conform to a storage system's requirement to use Bag Profiles.
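To make the integrity-checking role of the outer bag concrete, here is a minimal sketch that wraps the chunks and pointer file in a bag and validates it. It is a simplification under stated assumptions: it writes only bagit.txt and a SHA-256 payload manifest, leaves out bag-info.txt and the tag manifests shown in the tree above, and does not handle file names that need percent-encoding. The function names are hypothetical; a production workflow would more likely use an established BagIt library.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def make_outer_bag(bag_dir: Path, payload: Iterable[Path]) -> None:
    """Move payload files into bag_dir/data and write bagit.txt plus a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    for item in payload:
        shutil.move(str(item), str(data_dir / item.name))
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    lines = [f"{sha256_of(p)}  data/{p.name}\n" for p in sorted(data_dir.iterdir())]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def validate_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest entry."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file() or sha256_of(target) != expected:
            return False
    return True
```

Validation here checks only the payload files listed in the manifest, so a corrupted or missing chunk is detected before any attempt to rejoin the parts.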

Scenario 3: Splitting into many Bags

This scenario is a bit more complicated than Scenario 2. The only advantage it brings is the ability to further transform each bag (e.g. compress, encrypt). This might be a requirement if using an object storage system, where it is desirable to store each bag as a single file. This is not possible in Scenario 2, because packing the single outer bag into one file would exceed the maximum file size of the storage system.

.
└── AIP1 (folder)
    ├── AIP1.001 (folder) 
    │   ├── bag-info.txt
    │   ├── bagit.txt
    │   ├── data
    │   │   └── AIP1.7z.001
    │   ├── manifest-md5.txt
    │   ├── manifest-sha256.txt
    │   ├── tagmanifest-md5.txt
    │   └── tagmanifest-sha256.txt
    ├── AIP1.002 (folder) 
    │   ├── bag-info.txt
    │   ├── bagit.txt
    │   ├── data
    │   │   └── AIP1.7z.002
    │   ├── manifest-md5.txt
    │   ├── manifest-sha256.txt
    │   ├── tagmanifest-md5.txt
    │   └── tagmanifest-sha256.txt
    ├── AIP1.003 (folder) 
    │   ├── bag-info.txt
    │   ├── bagit.txt
    │   ├── data
    │   │   └── AIP1.7z.003
    │   ├── manifest-md5.txt
    │   ├── manifest-sha256.txt
    │   ├── tagmanifest-md5.txt
    │   └── tagmanifest-sha256.txt
    └── pointer.xml
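Retrieval in this layout means visiting each per-chunk bag in order before the chunks can be rejoined. The sketch below shows one way that lookup could work. It assumes every bag has already been restored to an uncompressed folder, that each bag's data directory holds exactly one chunk, and that folder names sort in chunk order; `collect_chunks` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def collect_chunks(aip_dir: Path) -> List[Path]:
    """Return the single chunk from each per-chunk bag, ordered by bag folder name."""
    chunks: List[Path] = []
    # Only sub-folders are bags; pointer.xml sits beside them and is skipped.
    bags = sorted((p for p in aip_dir.iterdir() if p.is_dir()), key=lambda p: p.name)
    for bag in bags:
        payload = sorted((bag / "data").iterdir())
        if len(payload) != 1:
            raise ValueError(f"expected exactly one chunk in {bag.name}/data, found {len(payload)}")
        chunks.append(payload[0])
    return chunks
```

The returned list can then be concatenated in order to rebuild the .7z file, ideally after each bag has been validated on its own.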

Use case: Encryption

An AIP should be encrypted before storing, independent of where it is stored. The AIP pointer file needs to track the information required to decrypt the AIP on retrieval.

Use case: Oxford Common File Layout

https://ocfl.io/