Difference between revisions of "Improvements/aipreadme"


Revision as of 13:03, 20 June 2017

User story

As a repository manager, I would like AIPs to be as self-describing as possible, so that future users, with little or no information about Archivematica or what an AIP is, will be able to understand the structure and contents of the AIPs I produce now.

Status

2017-06-01 - New Proposal

Interest

If you'd like to get involved in this development, please feel free to contribute to this wiki page or start a discussion on our user forum.

Analysis:

Currently, Archivematica AIPs are structured as a Bag (https://tools.ietf.org/html/draft-kunze-bagit-14) and contain a METS file, which describes the contents of the AIP. Details about the Archivematica AIP structure are here: https://www.archivematica.org/en/docs/archivematica-1.6/user-manual/archival-storage/aip-structure/

METS files are machine-readable, but they are not a human-friendly format.

Adding a human-readable index or description to an AIP would improve the chances of a future user understanding the structure.

Archivematica structures AIPs in a specific way, but that structure is not documented within the AIP itself. Adding more explicit documentation about the structure would help users test that AIPs are valid, and help them to understand the structure.
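As an illustration of the kind of validity test that documented structure would make possible, here is a minimal Python sketch. It assumes the typical layout shown in the sample README tree further down this page; the function name check_aip_layout and the exact list of expected entries are hypothetical, not part of Archivematica.

```python
import re
from pathlib import Path

# Assumption: UUIDs appear in lowercase hexadecimal, as in the examples on this page.
UUID_RE = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
METS_NAME = re.compile(rf"^METS\.{UUID_RE}\.xml$")

EXPECTED_BAG_FILES = ["bag-info.txt", "bagit.txt"]
EXPECTED_DATA_DIRS = ["logs", "objects", "thumbnails"]


def check_aip_layout(aip_root):
    """Return a list of problems; an empty list means the layout looks typical."""
    root = Path(aip_root)
    if not root.is_dir():
        return [f"not a directory: {root}"]

    problems = []
    if not re.search(rf"-{UUID_RE}$", root.name):
        problems.append("root directory name does not end in a UUID")

    for name in EXPECTED_BAG_FILES:
        if not (root / name).is_file():
            problems.append(f"missing bag file: {name}")

    # The checksum algorithm may differ between AIPs, so match on the prefix only.
    if not list(root.glob("manifest-*.txt")):
        problems.append("missing payload manifest (manifest-<algorithm>.txt)")
    if not list(root.glob("tagmanifest-*.txt")):
        problems.append("missing tag manifest (tagmanifest-<algorithm>.txt)")

    data = root / "data"
    for name in EXPECTED_DATA_DIRS:
        if not (data / name).is_dir():
            problems.append(f"missing data subdirectory: {name}")

    mets = []
    if data.is_dir():
        mets = [p for p in data.iterdir() if METS_NAME.match(p.name)]
    if len(mets) != 1:
        problems.append(f"expected exactly one METS file in data/, found {len(mets)}")

    return problems
```

This only checks that the expected entries exist. It does not verify checksums, which remains the job of a BagIt validator.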

There is a similar proposal outlined here: https://github.com/UTS-eResearch/datacrate

Use case: Add a README to each AIP

In the data/ directory (beside the METS file), add a README.html or README.md file. This is intended to be the first file opened by a person trying to examine an AIP.

The README file would include:

  • some boilerplate text, describing what an AIP is
  • links to the Archivematica documentation, to METS documentation, to PREMIS docs, etc.
  • a link to the METS file
  • optionally, a link to a CATALOG.html file that includes more detailed information about the contents of the AIP.
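The list above could be turned into a generated file along the following lines. This is only a sketch under stated assumptions: the function names build_readme and write_readme, the Markdown layout, the boilerplate wording, and the choice of link targets are all hypothetical and would need to be agreed on as part of this proposal.

```python
from pathlib import Path

# Assumption: placeholder boilerplate; the real text would come from the sample README.
BOILERPLATE = (
    "This readme file describes the basic structure of an AIP "
    "(Archival Information Package) generated by Archivematica."
)

# Assumption: illustrative link targets; verify before use.
LINKS = {
    "Archivematica documentation": "https://www.archivematica.org/en/docs/",
    "METS": "https://www.loc.gov/standards/mets/",
    "PREMIS": "https://www.loc.gov/standards/premis/",
}


def build_readme(mets_filename, catalog_filename=None):
    """Return README text in Markdown with relative links to files in data/."""
    lines = ["# README", "", BOILERPLATE, "", "## Further reading", ""]
    for label, url in LINKS.items():
        lines.append(f"- [{label}]({url})")
    lines += ["", "## Contents", "", f"- [METS file]({mets_filename})"]
    if catalog_filename:
        lines.append(f"- [Detailed catalog]({catalog_filename})")
    return "\n".join(lines) + "\n"


def write_readme(data_dir, catalog_filename=None):
    """Write README.md beside the METS file found in data_dir and return its path."""
    data = Path(data_dir)
    mets_files = sorted(data.glob("METS.*.xml"))
    if not mets_files:
        raise FileNotFoundError(f"no METS file found in {data}")
    target = data / "README.md"
    target.write_text(build_readme(mets_files[0].name, catalog_filename), encoding="utf-8")
    return target
```

Relative links are used so that the README keeps working wherever the AIP is copied.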

Sample README file text

This readme file describes the basic structure of an AIP generated by Archivematica.

Acronyms

AIP = Archival Information Package

METS = Metadata Encoding and Transmission Standard

OAIS = Open Archival Information System

PDI = Preservation Description Information

PREMIS = Preservation Metadata: Implementation Strategies

UUID = Universally Unique Identifier

Introduction

Archivematica is an open-source suite of tools designed to ingest diverse digital content and prepare AIPs for long-term storage. Once an AIP is generated it is not dependent on Archivematica for retrieval, and can be opened using any standard file browser. The concept of an AIP is derived from the ISO 14721:2012 Reference Model for an Open Archival Information System (OAIS), which defines it as “[a]n Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS.”

Content Information

In an Archivematica AIP, the Content Information consists primarily of the originally ingested digital objects and any preservation versions of the objects created to mitigate the risk of format obsolescence over time. The preservation copies typically have the same filenames as the original objects but with different file extensions and with UUIDs appended to the filename. For example, for an original file named BBhelmet.ai the preservation version may be named BBhelmet-e3a3988d-8149-49ea-adc5-c255fb68d4f9.pdf.

The originally ingested digital objects and any preservation versions are located in the objects directory of the AIP. There will be nested subdirectories in the objects directory if these subdirectories were included in the original transfer or added during SIP arrangement. The objects directory also includes a submissionDocumentation folder and a metadata folder. The submissionDocumentation folder contains documentation such as donor agreements and transfer forms, if these are included in the AIP, as well as a METS file that records the contents of the original transfer(s) from which the AIP was created. The metadata folder holds any metadata files included in the original transfer(s), and any OCR text files generated during processing.

Preservation Description Information (PDI)

The PDI in an Archivematica AIP is recorded in a METS XML file. METS is maintained by the Library of Congress, which defines it as “a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium.” In the Archivematica AIP the METS filename is composed of the name METS, followed by a UUID and the .xml file extension, separated by periods; for example METS.0ad8cdab-dbbf-4863-8a4d-9a675c227216.xml. The METS file typically consists of the following standard METS sections:

<mets:metsHdr> (METS header): basic information about the METS file;

<mets:dmdSec> (descriptive metadata section): descriptive metadata about the digital objects;

<mets:amdSec> (administrative metadata section): technical and provenance information about the digital objects;

<mets:fileSec> (file section): a list of the digital objects and an indication of their role in the AIP (original, preservation, metadata, submission documentation, license etc.);

<mets:structMap> (structural map): a physical or logical ordering of the digital objects.

The technical and provenance information in the METS amdSec is recorded as PREMIS metadata. PREMIS is also a Library of Congress standard, and is described as "the international standard for metadata to support the preservation of digital objects and ensure their long-term usability." The PREMIS entities are wrapped in the METS file as follows:

<mets:amdSec>

--<mets:techMD> (technical metadata)

----<premis:object> e.g. UUID, size, checksum, format, original name, extracted technical metadata

--<mets:digiprovMD> (digital provenance metadata)

----<premis:event> e.g. ingestion, message digest calculation, virus scan, format identification, validation, normalization, fixity check

----<premis:agent> for each PREMIS Event there are three associated Agents: the organization, the digital preservation system (e.g. Archivematica 1.x) and the logged-in user

--<mets:rightsMD> (rights metadata)

----<premis:rights> Rights pertaining to the preservation, reproduction and use of the preserved digital objects (only included if the user added rights metadata prior to or during ingest)

The fileSec and structMap use identifier attributes to link a digital object to its amdSec and (if used) dmdSec. For example, if a file entry in the fileSec has the attribute ADMID="amdSec_1", this means that the amdSec with the identifier amdSec_1 contains the administrative (i.e. technical and provenance) metadata for that file. The fileSec also uses a group identifier attribute to indicate relationships between digital objects. For example, if file A in the fileGrp with USE="original" and file B in the fileGrp with USE="preservation" both have the group identifier attribute GROUPID="Group-269b494d-01cb-451b-8d5e-590d57126d3d", then file B is a preservation version generated from file A.

AIP structure

An Archivematica AIP is packaged into a bag in accordance with the IETF Trust BagIt File Packaging Format, and contains some contents not described in the sections above. This tree structure depicts a typical Archivematica AIP:

(1) AIP-name-e3a3988d-8149-49ea-adc5-c255fb68d4f9 
(2)  ├── bag-info.txt 
(3)  ├── bagit.txt 
(4)  ├── manifest-sha512.txt 
(5)  ├── tagmanifest-md5.txt 
(6)  └── data 
(7)        ├── logs 
(8)        ├── objects 
(9)        ├── thumbnails 
(10)       ├── METS.0ad8cdab-dbbf-4863-8a4d-9a675c227216.xml 
(11)       └── README.txt

(1) AIP root directory, with an appended UUID

(2)-(5) Standard packaging files produced in accordance with the IETF Trust BagIt File Packaging Format specification.

(6) data directory - contains subdirectories logs, objects, and thumbnails, as well as the METS file

(7) logs directory - contains the log outputs of the various tools that Archivematica uses in generating the AIP

(8) objects directory - contains the original digital objects as well as normalized versions

(9) thumbnails directory - contains thumbnails generated from the original object for use in the Archivematica user interface

(10) the Archivematica METS file

(11) this README text file
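Note for implementers (not part of the README text): the group identifier linkage described in the sample text above can be followed programmatically. The sketch below uses only the Python standard library and the standard METS attributes USE and GROUPID; the function name files_by_group and the shape of the returned dictionary are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def files_by_group(mets_xml):
    """Map each GROUPID to {USE value: [file locations]} from a METS document string."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for file_grp in root.iterfind(".//mets:fileSec/mets:fileGrp", NS):
        use = file_grp.get("USE")
        for file_el in file_grp.iterfind("mets:file", NS):
            group_id = file_el.get("GROUPID")
            if group_id is None:
                continue  # files without a group identifier are not related to others
            flocat = file_el.find("mets:FLocat", NS)
            href = flocat.get(XLINK_HREF) if flocat is not None else None
            groups.setdefault(group_id, {}).setdefault(use, []).append(href)
    return groups
```

A group that has entries under both "original" and "preservation" pairs an ingested object with the preservation version generated from it.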

Use case: Create and Use a Bag Profile

https://github.com/ruebot/bagit-profiles

Archivematica could define a bag profile and reference it in the AIPs it produces. This would make AIPs more easily machine-readable.
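A rough idea of what such a profile might contain is sketched below in Python, serialized to JSON. The field names follow the bagit-profiles specification as we understand it and should be checked against the version in use. The profile URL, organization name, and version number are hypothetical placeholders, and the manifest algorithms and BagIt version are assumptions based on the sample tree and the BagIt draft linked above.

```python
import json

# Assumption: hypothetical identifier; a real profile needs a stable, resolvable URL.
PROFILE_URL = "https://example.org/archivematica-aip-profile.json"

PROFILE = {
    "BagIt-Profile-Info": {
        "BagIt-Profile-Identifier": PROFILE_URL,
        "Source-Organization": "Example Organization",  # placeholder
        "External-Description": "Profile for AIPs produced by Archivematica",
        "Version": "0.1",  # placeholder
    },
    # Assumption: algorithms taken from the sample tree (manifest-sha512.txt, tagmanifest-md5.txt).
    "Manifests-Required": ["sha512"],
    "Tag-Manifests-Required": ["md5"],
    # Assumption: BagIt version corresponding to the draft linked on this page.
    "Accept-BagIt-Version": ["0.97"],
}


def profile_json():
    """Return the profile as pretty-printed JSON, ready to publish at PROFILE_URL."""
    return json.dumps(PROFILE, indent=2)


def bag_info_line():
    """Return the bag-info.txt line with which an AIP would reference the profile."""
    return f"BagIt-Profile-Identifier: {PROFILE_URL}"
```

Each AIP would then carry the line returned by bag_info_line() in its bag-info.txt, so that a validator can fetch the profile and check the bag against it.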