Difference between revisions of "Improvements/AIP Packaging"

From Archivematica
Latest revision as of 15:54, 11 February 2020

This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation (https://www.archivematica.org/docs/latest/) for up-to-date information.

User story

As a repository manager, I require flexibility in how AIPs are packaged so they can be stored as one or more physical entities.

Status

Analysis is ongoing.

Interest

If you'd like to get involved in this development, please feel free to contribute to this wiki page or start a discussion on our user forum.

Analysis

Currently, Archivematica can only package a single AIP as a single bag. This bag can be stored as a folder (referred to as 'uncompressed' in the Archivematica UI) or as a single file (by default a 7zip file).

This has limitations in some repository environments and does not allow archivists/repository managers flexibility in how AIPs are stored and accessed. For example, some storage systems have a maximum file size limitation, which an individual AIP may exceed. In other cases, an organisation may have a requirement to encrypt all content at rest.

Use case: AIP split into multiple parts

An AIP is split into multiple pieces (zipped packages, loose files, binary chunks) for storage and retrieval purposes. There needs to be a way to record metadata that indicates the existence and locations of all the parts, to record a PREMIS event for each transformation that was applied to the AIP, and to record details about how to reverse each transformation.
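
The requirements above can be sketched as a small data model. This is only an illustration: the class names, field names and hint strings below are hypothetical and are not Archivematica's pointer file schema, which is METS with PREMIS events. The point it shows is that reversing the packaging means undoing the transformations in the opposite order to the one in which they were applied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transformation:
    """One step applied to the AIP, plus a note on how to undo it (hypothetical model)."""
    order: int          # position in the sequence of applied steps, starting at 1
    kind: str           # e.g. "compression" or "split"
    tool: str           # e.g. "7z" or "split"
    reverse_hint: str   # human-readable description of the inverse step


@dataclass
class PackageRecord:
    """Metadata about a packaged AIP: its parts and what was done to it."""
    aip_name: str
    parts: List[str] = field(default_factory=list)
    transformations: List[Transformation] = field(default_factory=list)

    def reversal_plan(self) -> List[str]:
        """Return the undo steps, last-applied transformation first."""
        ordered = sorted(self.transformations, key=lambda t: t.order, reverse=True)
        return [t.reverse_hint for t in ordered]


record = PackageRecord(
    aip_name="AIP1",
    parts=["AIP1.7z.001", "AIP1.7z.002", "AIP1.7z.003"],
    transformations=[
        Transformation(1, "compression", "7z", "extract AIP1.7z"),
        Transformation(2, "split", "split", "concatenate parts in name order"),
    ],
)

for step in record.reversal_plan():
    print(step)
```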

AIP Splitting scenarios

Archivematica already creates pointer files, which are METS files that describe an AIP. Pointer files record a PREMIS event when an AIP is compressed, for example. They can also be used to record metadata about AIP splitting.

Scenario 1: Simple splitting

An AIP is stored as a bag, and the bag is then turned into a .7z file. The .7z file is then split into multiple parts. This can be done with the Unix split command (split man page: http://man7.org/linux/man-pages/man1/split.1.html), with the -v argument to 7z (7z volumes: https://sevenzip.osdn.jp/chm/cmdline/switches/volume.htm) or by some other method. The result would look like:

.
└── AIP1 (folder)
    ├── AIP1.7z.001 (binary chunk)
    ├── AIP1.7z.002 (binary chunk)
    ├── AIP1.7z.003 (binary chunk)
    └── pointer.xml (xml file)

The pointer file would contain metadata outlining how to put the 3 parts back together into a single .7z file and unpack it. The result of this would be the original bag containing the AIP.
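
As a rough sketch of the split and reassembly steps, the following mimics a fixed-size split using only the Python standard library. The function names and the three-digit suffix scheme are assumptions chosen to match the tree above; this is not how Archivematica performs the split, and it stands in for the Unix split command or 7z volumes only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def split_file(source: Path, chunk_size: int) -> list:
    """Write source as numbered parts (name.001, name.002, ...) beside it.

    Each part is held in memory while it is written, so chunk_size
    should be chosen with available memory in mind.
    """
    parts = []
    with source.open("rb") as handle:
        index = 1
        while True:
            data = handle.read(chunk_size)
            if not data:
                break
            part = source.with_name(f"{source.name}.{index:03d}")
            part.write_bytes(data)
            parts.append(part)
            index += 1
    return parts


def join_parts(parts, target: Path) -> None:
    """Concatenate parts in name order into target.

    Name order equals numeric order only while the zero-padded
    suffix does not overflow (up to 999 parts here).
    """
    with target.open("wb") as out:
        for part in sorted(parts):
            out.write(part.read_bytes())
```

Comparing `sha256_of` for the original .7z file and the rejoined file is one way to confirm that reassembly succeeded before unpacking.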

Scenario 2: Splitting into a Bag

One problem with the first scenario is that the AIP1 folder is not structured according to any standard. Some storage systems may have a requirement to store content in bags. To satisfy this, the second scenario adds a step: creating a bag to hold the chunks:

.
└── AIP1 (folder)
    ├── bag-info.txt
    ├── bagit.txt
    ├── data
    │   ├── AIP1.7z.001 (binary chunk)
    │   ├── AIP1.7z.002 (binary chunk)
    │   ├── AIP1.7z.003 (binary chunk)
    │   └── pointer.xml 
    │
    ├── manifest-md5.txt
    ├── manifest-sha256.txt
    ├── tagmanifest-md5.txt
    └── tagmanifest-sha256.txt

In this scenario, two bags are created: the original bag containing the AIP, which is compressed and split, and an outer bag holding the chunks (parts 1 to 3) and the pointer file. Once the chunks are stitched back together and unpacked, the result would be the original bag containing the AIP. The outer bag is useful because it allows checksum/integrity checking in a standards-compliant manner (by validating the bag). It also allows metadata about the entire AIP to be recorded in bag-info.txt, for example to conform to a storage system's requirement to use Bag Profiles.
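
To illustrate the integrity check that the outer bag makes possible, here is a minimal sketch that writes a payload under data/, a bagit.txt declaration and a SHA-256 payload manifest, then re-checks the payload against that manifest. The function names are hypothetical. It deliberately omits bag-info.txt, the MD5 manifest and the tag manifests shown in the tree above, and it is not a full BagIt validator; a real implementation would more likely rely on an existing BagIt library.

```python
import hashlib
from pathlib import Path


def write_minimal_bag(bag_dir: Path, payload: dict) -> None:
    """Write payload files under data/ with bagit.txt and manifest-sha256.txt.

    payload maps a file name to its bytes (an assumption made for brevity;
    real chunks would be streamed from disk rather than held in memory).
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        checksum = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{checksum}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def payload_is_intact(bag_dir: Path) -> bool:
    """Return True if every file listed in the manifest exists and matches."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        checksum, relative_path = line.split(maxsplit=1)
        target = bag_dir / relative_path
        if not target.is_file():
            return False
        if hashlib.sha256(target.read_bytes()).hexdigest() != checksum:
            return False
    return True
```

A storage system could run a check like `payload_is_intact` on the outer bag without needing to reassemble or unpack the chunks.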

Scenario 3: Splitting into many Bags

This scenario is a bit more complicated than Scenario 2. The only advantage it brings is the ability to further transform each bag (e.g. compress, encrypt). This might be a requirement if using an object storage system, where it is desirable to store each bag as a single file. This is not possible in Scenario 2, because packaging its one bag as a single file would produce a file that exceeds the maximum file size of the storage system.

.
└── AIP1 (folder)
    ├── AIP1.001 (folder) 
    │   ├── bag-info.txt
    │   ├── bagit.txt
    │   ├── data
    │   │   └── AIP1.7z.001
    │   ├── manifest-md5.txt
    │   ├── manifest-sha256.txt
    │   ├── tagmanifest-md5.txt
    │   └── tagmanifest-sha256.txt
    ├── AIP1.002 (folder) 
    │   ├── bag-info.txt
    │   ├── bagit.txt
    │   ├── data
    │   │   └── AIP1.7z.002
    │   ├── manifest-md5.txt
    │   ├── manifest-sha256.txt
    │   ├── tagmanifest-md5.txt
    │   └── tagmanifest-sha256.txt
    ├── AIP1.003 (folder) 
    │   ├── bag-info.txt
    │   ├── bagit.txt
    │   ├── data
    │   │   └── AIP1.7z.003
    │   ├── manifest-md5.txt
    │   ├── manifest-sha256.txt
    │   ├── tagmanifest-md5.txt
    │   └── tagmanifest-sha256.txt
    └── pointer.xml

Use case: Encryption

An AIP should be encrypted before storing, independent of where it is stored. The AIP pointer file needs to track the information required to decrypt the AIP on retrieval.

Use case: Oxford Common File Layout

https://ocfl.io/