Difference between revisions of "Dataset preservation"

From Archivematica

Revision as of 18:20, 14 February 2013

Workflow

  • Metadata ingest: Metadata will be created outside of Archivematica prior to ingest and added to the metadata folder of the transfer. See Metadata, below.
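  • Metadata validation: Archivematica should include a micro-service to validate metadata on ingest, using a tool such as xmllint. Sample validation command: xmllint --schema ddi:instance:3_1 metadata/CCRI-CDN-Census1911V20110628.xml. Note that xmllint's --schema option expects the path or URL of an XSD file; ddi:instance:3_1 is the DDI 3.1 instance namespace and stands in here for the location of that schema.
  • Normalization: Some datasets may require manual normalization; see https://projects.artefactual.com/issues/1499.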
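
The validation step above could be wrapped roughly as follows. This is a minimal sketch, not Archivematica's micro-service: the function names and the schema path are hypothetical, and it assumes xmllint is installed and that a local copy of the relevant XSD is available.

```python
import subprocess


def build_xmllint_command(schema_path, metadata_path):
    """Return the argument list for validating one metadata file against an XSD."""
    # --noout suppresses echoing the parsed document;
    # --schema names the XSD file (path or URL) to validate against.
    return ["xmllint", "--noout", "--schema", schema_path, metadata_path]


def validate_metadata(schema_path, metadata_path):
    """Run xmllint and return (is_valid, messages). Requires xmllint on PATH."""
    result = subprocess.run(
        build_xmllint_command(schema_path, metadata_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with status 0 only when the file parses and validates.
    return result.returncode == 0, result.stderr.strip()


# Hypothetical schema location; the metadata path is the one from the sample command.
command = build_xmllint_command(
    "schemas/ddi_3_1/instance.xsd",
    "metadata/CCRI-CDN-Census1911V20110628.xml",
)
print(" ".join(command))
```

Keeping the command construction separate from the subprocess call makes it possible to log or inspect the exact command without running the validator.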


Metadata

METS and DDI/FGDC

  • DDI is Data Documentation Initiative, a metadata specification for the social and behavioral sciences; see http://www.ddialliance.org/.
  • FGDC is Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998]; see http://www.fgdc.gov/metadata/csdgm/
  • DDI and FGDC are considered descriptive metadata (dmdSec) in METS. From http://www.loc.gov/standards/mets/METSOverview.v2.html: "Valid values for the MDTYPE element [in mdSec] include...DDI (Data Documentation Initiative), FGDC (Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998])."
    • In the Archivematica METS file, a DDI or FGDC file could be referenced from the dmdSec using mdRef, for example as follows: <mdRef LABEL="CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e" xlink:href="metadata/CCRI-CDN-Census1911V20110628.xml" MDTYPE="DDI" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>.
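
As an illustration, the mdRef shown above could be generated programmatically. The sketch below uses Python's standard xml.etree.ElementTree and assumes the reference sits inside a METS descriptive metadata section (dmdSec); the function name and the ID value "dmdSec_1" are hypothetical, and this is not necessarily how Archivematica builds its METS file.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_dmdsec(dmd_id, href, label, mdtype):
    """Build a dmdSec holding one mdRef that points at an external metadata file."""
    dmdsec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    ET.SubElement(
        dmdsec,
        f"{{{METS_NS}}}mdRef",
        {
            "LABEL": label,
            f"{{{XLINK_NS}}}href": href,
            "MDTYPE": mdtype,  # "DDI" or "FGDC" for the standards discussed here
            "LOCTYPE": "OTHER",
            "OTHERLOCTYPE": "SYSTEM",
        },
    )
    return dmdsec


dmdsec = build_dmdsec(
    "dmdSec_1",  # hypothetical ID
    "metadata/CCRI-CDN-Census1911V20110628.xml",
    "CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e",
    "DDI",
)
print(ET.tostring(dmdsec, encoding="unicode"))
```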


METS and other metadata standards


Hierarchical AIC/AIP structure

  • Because datasets can be large and heterogeneous, one "dataset" may be broken into multiple AIPs. In such cases, the multiple AIPs can be intellectually combined into one AIC, or Archival Information Collection, defined by the OAIS reference model as "[a]n Archival Information Package whose Content Information is an aggregation of other Archival Information Packages." (OAIS 1-9).
    • The AIC will consist entirely of a METS file with aggregate-level descriptive metadata (e.g. metadata for the dataset or study as a whole) plus a logical structMap listing all child AIPs.
    • Each child AIP will include a logical structMap pointing to the parent AIC. The aggregate-level descriptive metadata will NOT be duplicated in the child AIP.
    • In storage, a pointer.xml file gives the URI and extraction (e.g. unzipping) information for an AIC or stand-alone AIP. Question: does the pointer.xml file give the URI and extraction information for AIPs that are children of AICs, or is that information captured in the AIC?
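
The two logical structMaps described above might look something like the output of the following sketch. The page does not define the actual layout, so the div TYPE values, the nesting, the function names, and the identifiers are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

AIC_TYPE = "Archival Information Collection"
AIP_TYPE = "Archival Information Package"


def build_aic_structmap(aic_id, child_aip_ids):
    """Logical structMap for the AIC: one div per child AIP under the AIC div."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    aic_div = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div", {"TYPE": AIC_TYPE, "LABEL": aic_id}
    )
    for aip_id in child_aip_ids:
        ET.SubElement(
            aic_div, f"{{{METS_NS}}}div", {"TYPE": AIP_TYPE, "LABEL": aip_id}
        )
    return struct_map


def build_child_structmap(aip_id, parent_aic_id):
    """Logical structMap for a child AIP: the AIP div nested under its parent AIC."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    parent_div = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div", {"TYPE": AIC_TYPE, "LABEL": parent_aic_id}
    )
    ET.SubElement(parent_div, f"{{{METS_NS}}}div", {"TYPE": AIP_TYPE, "LABEL": aip_id})
    return struct_map


# Hypothetical identifiers, used only to show the shape of the output.
aic_map = build_aic_structmap("aic-0001", ["aip-0001", "aip-0002"])
child_map = build_child_structmap("aip-0001", "aic-0001")
print(ET.tostring(aic_map, encoding="unicode"))
print(ET.tostring(child_map, encoding="unicode"))
```

Neither structMap carries descriptive metadata, which matches the requirement that aggregate-level description lives only in the AIC's METS file and is not duplicated in the children.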


Dataset structure.png