Difference between revisions of "Dataset preservation"

From Archivematica

Revision as of 16:02, 24 June 2013

Workflow

  • Composition of AIPs: Large datasets may be divided into multiple transfers prior to ingest, so that one dataset ultimately consists of a number of AIPs. See Hierarchical AIC/AIP structure, below.
    • Note: a related Archivematica 1.1 requirement is to break up large files that exceed a configurable maximum file size into multiple AIPs tracked by an AIC.
  • Metadata ingest: Metadata will be created outside of Archivematica prior to ingest, and may be referenced from the dmdSec of the AIP METS file as an xlink reference. See Metadata, below.
  • Normalization: Some types of data files may require manual normalization; see https://projects.artefactual.com/issues/1499.
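
As a rough illustration of dividing a large dataset into multiple transfers before ingest, the following Python sketch greedily groups files so that each transfer stays under a size limit. The function name, the size limit and the file names are hypothetical; this is not Archivematica code, and it does not split individual files that exceed the limit.

```python
from typing import Dict, List


def split_into_transfers(file_sizes: Dict[str, int], max_bytes: int) -> List[List[str]]:
    """Greedily group files into transfers whose total size stays within max_bytes.

    A file larger than max_bytes ends up alone in its own transfer; this
    sketch makes no attempt to split individual files.
    """
    transfers: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name in sorted(file_sizes):
        size = file_sizes[name]
        # Close the current transfer if adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            transfers.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        transfers.append(current)
    return transfers


# Hypothetical file names and sizes (in bytes).
example = split_into_transfers(
    {"a.csv": 40, "b.csv": 50, "c.csv": 30, "d.csv": 200}, max_bytes=100
)
```

Each resulting group would become one transfer, and eventually one AIP, of the dataset.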


Metadata

METS and DDI/FGDC

  • DDI is Data Documentation Initiative, a metadata specification for the social and behavioral sciences; see http://www.ddialliance.org/.
  • FGDC is Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998]; see http://www.fgdc.gov/metadata/csdgm/
  • DDI and FGDC are considered descriptive metadata (dmdSec) in METS. From http://www.loc.gov/standards/mets/METSOverview.v2.html: "Valid values for the MDTYPE element [in dmdSec] include...DDI (Data Documentation Initiative), FGDC (Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998])."
    • In the Archivematica METS file, a DDI or FGDC file could be referenced from the dmdSec using mdRef, for example as follows: <mdRef LABEL="CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e" xlink:href="metadata/CCRI-CDN-Census1911V20110628.xml" MDTYPE="DDI" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>.
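
A minimal Python sketch of how such a dmdSec could be generated with the standard library. The attribute values are taken from the mdRef example above; the function name and the dmdSec ID value are assumptions made for illustration, not Archivematica's actual implementation.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_dmdsec(dmd_id: str, label: str, href: str, mdtype: str) -> ET.Element:
    """Build a dmdSec whose mdRef points at an external metadata file."""
    dmdsec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    ET.SubElement(
        dmdsec,
        f"{{{METS_NS}}}mdRef",
        {
            "LABEL": label,
            f"{{{XLINK_NS}}}href": href,
            "MDTYPE": mdtype,  # "DDI" or "FGDC"
            "LOCTYPE": "OTHER",
            "OTHERLOCTYPE": "SYSTEM",
        },
    )
    return dmdsec


# "dmdSec_1" is a hypothetical ID chosen for this sketch.
dmdsec = build_dmdsec(
    "dmdSec_1",
    "CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e",
    "metadata/CCRI-CDN-Census1911V20110628.xml",
    "DDI",
)
xml_text = ET.tostring(dmdsec, encoding="unicode")
```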


METS and other metadata standards


Hierarchical AIC/AIP structure

  • Because datasets can be large and heterogeneous, one "dataset" may be broken into multiple AIPs. In such cases, the multiple AIPs can be intellectually combined into one AIC, or Archival Information Collection, defined by the OAIS reference model as "[a]n Archival Information Package whose Content Information is an aggregation of other Archival Information Packages." (OAIS 1-9).
    • The AIC consists of a METS file containing a fileSec and a logical structMap listing all child AIPs (Note that this is based on Option 1 under Possible AIC/AIP designs, below).
    • In storage, a pointer.xml file gives storage and compression information for each AIC and AIP.
  • This diagram shows a storage area with standalone AIPs, an AIC with child AIPs, and related pointer.xml files.
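
The AIC METS structure described above (a fileSec plus a logical structMap listing the child AIPs) might be assembled roughly as follows. This is a sketch under assumptions: the div TYPE values, the file IDs, the LOCTYPE values and the AIP identifiers and paths are all hypothetical, and the exact layout Archivematica produces may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_aic_mets(aic_label: str, child_aips: List[Tuple[str, str]]) -> ET.Element:
    """Build a METS document with a fileSec and a logical structMap of child AIPs.

    child_aips is a list of (identifier, storage path) pairs.
    """
    mets = ET.Element(q("mets"))
    file_grp = ET.SubElement(ET.SubElement(mets, q("fileSec")), q("fileGrp"))
    struct_map = ET.SubElement(mets, q("structMap"), {"TYPE": "logical"})
    aic_div = ET.SubElement(struct_map, q("div"), {"TYPE": "AIC", "LABEL": aic_label})
    for index, (aip_id, aip_path) in enumerate(child_aips, start=1):
        file_id = f"file-{index}"
        # One fileSec entry per child AIP.
        file_el = ET.SubElement(file_grp, q("file"), {"ID": file_id})
        ET.SubElement(
            file_el,
            q("FLocat"),
            {"LOCTYPE": "OTHER", "OTHERLOCTYPE": "SYSTEM", f"{{{XLINK_NS}}}href": aip_path},
        )
        # One structMap div per child AIP, pointing back at the fileSec entry.
        child_div = ET.SubElement(aic_div, q("div"), {"TYPE": "AIP", "LABEL": aip_id})
        ET.SubElement(child_div, q("fptr"), {"FILEID": file_id})
    return mets


aic = build_aic_mets(
    "Example dataset",
    [("aip-one", "aips/aip-one.7z"), ("aip-two", "aips/aip-two.7z")],
)
```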


Archival storage area containing pointer files, AICs and AIPs


Possible AIC/AIP designs

Option 1 (preferred)


AIP1G.png


Description: An AIC consisting of only a fileSec and structMap; AIPs consisting of data files and metadata for those data files; an AIP consisting of project/program-level (i.e. dataset) metadata and documentation.


Workflow:

  1. User creates X number of AIPs and puts them in archival storage
    • One of these AIPs consists only of metadata and documentation about the program/project as a whole
    • The AIPs must have one or more common metadata elements that allow them to be identified as related
  2. User searches for AIPs in archival storage tab (using the common metadata element in the AIPs in the search query)
  3. Once search results are retrieved, user clicks "Create AIC" button
  4. AIC is created, containing only a METS file with a fileSec and a structMap listing all AIPs
  5. Over time, user can add new AIPs and re-create the AIC at any time; the new AIC will either replace or update the old one
  6. Over time, if needed, the user either updates the existing documentation AIP or adds new documentation AIPs (i.e. there can be more than one documentation AIP per dataset)
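
The search-then-create steps of this workflow can be sketched in Python with an in-memory stand-in for archival storage. The record layout, the metadata field name "project" and the function names are hypothetical; the real search would run against the archival storage tab's index.

```python
from typing import Dict, List


def search_aips(storage: List[Dict], field: str, value: str) -> List[str]:
    """Return identifiers of stored AIPs whose metadata field matches value."""
    return sorted(
        aip["id"] for aip in storage if aip.get("metadata", {}).get(field) == value
    )


def create_aic(storage: List[Dict], field: str, value: str) -> Dict:
    """Create an AIC record that only lists the matching child AIPs."""
    return {"query": {field: value}, "children": search_aips(storage, field, value)}


storage = [
    {"id": "aip-docs", "metadata": {"project": "census-1911"}},
    {"id": "aip-data-1", "metadata": {"project": "census-1911"}},
    {"id": "aip-other", "metadata": {"project": "unrelated"}},
]
aic_v1 = create_aic(storage, "project", "census-1911")

# Later, a new AIP is added and the AIC is re-created to replace the old one.
storage.append({"id": "aip-data-2", "metadata": {"project": "census-1911"}})
aic_v2 = create_aic(storage, "project", "census-1911")
```

Re-running the same query is what makes it easy to add AIPs over time: the AIC is derived entirely from the search results.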


Pros:

  • Don't have to duplicate program/project-level documentation in each AIP
  • Simple workflow for creating AIC
  • Easy to add new AIPs
  • If program/project documentation needs updating, only one AIP has to be re-processed, or user can add new documentation AIP(s)


Cons:

  • There is only a one-way link between the AIC and child AIPs - i.e. the AIC has a structMap listing all child AIPs, but there is nothing in a child AIP to indicate that it belongs to a given AIC.


Sample AIC METS file


METS AIC AIP.png


Sample pointer.xml file


Pointer6G.png
Pointer7G.png


Option 2


AIP2G.png


Description: An AIC consisting of a METS structMap and project/program-level (i.e. dataset) metadata and documentation; content AIPs consisting of data files and metadata about the data files. AIPs have information in the METS files (in the structMap?) linking them to the parent AIC.


Workflow: To be determined; probably a dashboard tab with a GUI that allows users to arrange existing AIPs into an AIC


Pros:

  • Don't have to duplicate program/project-level documentation in each AIP
  • AIPs have a link up to the AIC, so if an AIP is orphaned the relationship to the AIC can easily be reconstructed
  • If program/project-level metadata and documentation needs to be updated, only the AIC needs to be re-processed


Cons:

  • Workflow to create this structure may be complex
  • No obvious mechanism for adding new AIPs over time


Option 3


AIP3G.png


Description: An AIC with a unique identifier consisting of project/program-level (i.e. dataset) metadata and documentation only (no structMap); AIPs consisting of data files, metadata for those data files, and the same identifier as the AIC. The relationship between the AIC and AIPs in this scenario is inferred from the matching identifiers.


Workflow:

  1. User creates an AIC consisting of project/program-level (i.e. dataset) metadata and documentation
    • The AIC contains an identifier that distinguishes it from other AICs
  2. User creates AIPs consisting of data files and metadata for those data files
    • User includes the AIC identifier in each AIP
  3. Over time, if needed, the user can add more AIPs with the same identifier


Pros:

  • Don't have to duplicate program/project-level documentation in each AIP
  • Simple workflow
  • Minimal development requirements: a new metadata field for the identifier in the transfer tab, a corresponding entry in the AIC/AIP METS files, and the ability to search by AIC identifier in the archival storage tab
  • If program/project-level metadata and documentation needs to be updated, only the AIC needs to be re-processed
  • Easy to add more AIPs to the same AIC over time


Cons:

  • No structMap in the AIC means that there is no single source of information about how many AIPs are in the AIC


Option 4


AIP4G.png


Description: No AIC; project/program-level metadata and documentation duplicated in all AIPs; links between AIPs belonging to one dataset inferred from metadata only


Workflow: User creates any number of AIPs with complete copies of the project/program-level (i.e. dataset) metadata and documentation in each AIP


Pros:

  • Minimal Archivematica development required: only ensuring that matching metadata elements are parsed to the AIP METS files or otherwise made available to the ElasticSearch index
  • Easy to add new AIPs over time


Cons:

  • User has to maintain copies of project/program-level metadata and documentation outside of Archivematica so they can be added to each AIP
  • Updating the project/program-level metadata and documentation would require re-processing the AIPs
  • Relationships between AIPs would have to be inferred from matching metadata elements alone; if an AIP were lost, there would be no list of the AIPs belonging to the dataset that would reveal the loss
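
The last point can be made concrete with a small Python sketch. With an explicit list of child AIPs (as in an AIC structMap), a lost AIP shows up as a difference between the list and what is in storage; with inference from metadata alone, the inferred membership simply shrinks and nothing signals the loss. All names and identifiers here are hypothetical.

```python
from typing import Dict, List


def missing_aips(expected_ids: List[str], stored_ids: List[str]) -> List[str]:
    """Return the expected AIPs that are no longer present in storage."""
    stored = set(stored_ids)
    return [aip_id for aip_id in expected_ids if aip_id not in stored]


def infer_members(storage: List[Dict], field: str, value: str) -> List[str]:
    """Infer dataset membership from a matching metadata element only."""
    return [
        aip["id"] for aip in storage if aip.get("metadata", {}).get(field) == value
    ]


# Explicit list, as an AIC structMap would provide.
expected = ["aip-1", "aip-2", "aip-3"]

# Storage after "aip-2" has been lost.
storage = [
    {"id": "aip-1", "metadata": {"project": "census-1911"}},
    {"id": "aip-3", "metadata": {"project": "census-1911"}},
]

lost = missing_aips(expected, [aip["id"] for aip in storage])
inferred = infer_members(storage, "project", "census-1911")
```

Here `lost` identifies the missing AIP, while `inferred` contains only the two surviving AIPs and gives no hint that a third ever existed.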