Dataset preservation
Workflow
- Metadata ingest: Metadata will be created outside of Archivematica prior to ingest and added to the metadata folder of the transfer. See Metadata, below.
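- Metadata validation: Archivematica should include a micro-service to validate metadata on ingest, using a tool such as xmllint. Sample validation command: xmllint --schema ddi:instance:3_1 metadata/CCRI-CDN-Census1911V20110628.xml. Note that xmllint's --schema option expects the location of an XSD file, so ddi:instance:3_1 (the DDI 3.1 instance namespace) stands in here for a local copy of that schema.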
- Normalization: Some datasets may require manual normalization; see https://projects.artefactual.com/issues/1499.
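The validation step could be sketched as follows. This is a minimal illustration, not an existing Archivematica micro-service: the function names and the idea of looping over every .xml file in the metadata folder are assumptions, and the schema path must point to a locally available XSD file.

```python
import subprocess
from pathlib import Path


def build_xmllint_command(schema_path, metadata_path):
    """Return the xmllint argument list for validating one metadata file."""
    # --noout suppresses echoing the parsed document;
    # --schema takes the location of an XSD file.
    return ["xmllint", "--noout", "--schema", str(schema_path), str(metadata_path)]


def validate_metadata(schema_path, metadata_dir):
    """Validate every .xml file in metadata_dir; return {filename: passed}.

    Hypothetical helper: assumes xmllint is installed and on the PATH.
    """
    results = {}
    for xml_file in sorted(Path(metadata_dir).glob("*.xml")):
        completed = subprocess.run(
            build_xmllint_command(schema_path, xml_file),
            capture_output=True,
            text=True,
        )
        # xmllint exits with status 0 when the document validates.
        results[xml_file.name] = completed.returncode == 0
    return results
```

A call such as validate_metadata("schemas/instance.xsd", "metadata") would then report, per file, whether validation passed; the schema path shown is a placeholder.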
Metadata
METS and DDI/FGDC
- DDI is Data Documentation Initiative, a metadata specification for the social and behavioral sciences; see http://www.ddialliance.org/.
- FGDC is Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998]; see http://www.fgdc.gov/metadata/csdgm/
- DDI and FGDC are considered descriptive metadata (dmdSec) in METS. From http://www.loc.gov/standards/mets/METSOverview.v2.html: "Valid values for the MDTYPE element [in dmdSec] include...DDI (Data Documentation Initiative), FGDC (Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998])."
- In the Archivematica METS file, a DDI or FGDC file could be referenced from the dmdSec using mdRef, for example as follows: <mdRef LABEL="CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e" xlink:href="metadata/CCRI-CDN-Census1911V20110628.xml" MDTYPE="DDI" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>.
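As a rough sketch of how such a reference might be generated, the following builds a METS descriptive metadata section (dmdSec) containing an mdRef like the one above. The function name and the dmdSec ID value are hypothetical; the attribute values mirror the example in the text.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Use conventional prefixes when serializing.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_dmdsec(dmd_id, label, href, mdtype):
    """Return a dmdSec element whose mdRef points at an external metadata file.

    Hypothetical helper for illustration only.
    """
    dmdsec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    ET.SubElement(
        dmdsec,
        f"{{{METS_NS}}}mdRef",
        {
            "LABEL": label,
            f"{{{XLINK_NS}}}href": href,
            "MDTYPE": mdtype,
            # The file is located by its path inside the transfer.
            "LOCTYPE": "OTHER",
            "OTHERLOCTYPE": "SYSTEM",
        },
    )
    return dmdsec


dmdsec = build_dmdsec(
    "dmdSec_1",  # assumed ID, not taken from a real METS file
    "CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e",
    "metadata/CCRI-CDN-Census1911V20110628.xml",
    "DDI",
)
print(ET.tostring(dmdsec, encoding="unicode"))
```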
METS and other metadata standards
- Other metadata standards that could be used for ingested datasets include:
- North American Profile (NAP) of ISO 19115, for geospatial metadata: http://www.fgdc.gov/metadata/geospatial-metadata-standards
- SDMX for aggregate data: http://sdmx.org/?page_id=10
- EML, the Ecological Metadata Language: http://knb.ecoinformatics.org/software/eml/eml-2.1.1/index.html
- If these standards are used, the mdRef in the METS file would need to use OTHER as MDTYPE and name the standard in OTHERMDTYPE, for example: <mdRef LABEL="CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e" xlink:href="metadata/CCRI-CDN-Census1911V20110628.xml" MDTYPE="OTHER" OTHERMDTYPE="SDMX" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>.
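The choice between MDTYPE and OTHERMDTYPE could be captured in a small helper like the one below. This is only a sketch: the function name is hypothetical, and the set of recognized values is deliberately limited to the two standards named on this page, whereas the METS schema defines a longer list.

```python
# MDTYPE values this page identifies as valid in METS. The full METS list
# is longer; limiting it to these two is an assumption for illustration.
NATIVE_MDTYPES = {"DDI", "FGDC"}


def mdtype_attributes(standard):
    """Return the mdRef type attributes for a given metadata standard."""
    name = standard.strip().upper()
    if name in NATIVE_MDTYPES:
        return {"MDTYPE": name}
    # Anything else is declared as OTHER, with the standard named explicitly.
    return {"MDTYPE": "OTHER", "OTHERMDTYPE": name}


print(mdtype_attributes("DDI"))
print(mdtype_attributes("SDMX"))
```

The returned dictionary can be merged into the other mdRef attributes (LABEL, xlink:href, LOCTYPE, OTHERLOCTYPE) when the METS file is written.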
Hierarchical AIC/AIP structure
- Because datasets can be large and heterogeneous, one "dataset" may be broken into multiple AIPs. In such cases, the multiple AIPs can be intellectually combined into one AIC, or Archival Information Collection, defined by the OAIS reference model as "[a]n Archival Information Package whose Content Information is an aggregation of other Archival Information Packages." (OAIS 1-9).
- The AIC may consist of a METS file with aggregate-level descriptive metadata (e.g. metadata for the dataset or study as a whole) plus a logical structMap listing all child AIPs (see options and variations in Possible AIC/AIP designs, below).
- In storage, a pointer.xml file gives the URI and extraction (e.g. unzipping) information for an AIC or stand-alone AIP. Question: does the pointer.xml file give the URI and extraction info for AIPs that are children of AICs, or is that information captured in the AIC?
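One way to picture the logical structMap in an AIC METS file is sketched below. The function name, the div TYPE values, the use of LABEL to carry each child AIP's identifier, and the sample identifiers are all assumptions made for illustration; none of them reflects a settled design.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_aic_structmap(aic_label, child_aip_ids):
    """Return a logical structMap with one div per child AIP.

    Hypothetical helper: TYPE and LABEL values are illustrative only.
    """
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    # The top-level div represents the AIC as a whole.
    aic_div = ET.SubElement(
        struct_map,
        f"{{{METS_NS}}}div",
        {"TYPE": "Archival Information Collection", "LABEL": aic_label},
    )
    # Each nested div stands for one child AIP.
    for aip_id in child_aip_ids:
        ET.SubElement(
            aic_div,
            f"{{{METS_NS}}}div",
            {"TYPE": "Archival Information Package", "LABEL": aip_id},
        )
    return struct_map


struct_map = build_aic_structmap(
    "Example dataset",                # placeholder label
    ["child-aip-1", "child-aip-2"],   # placeholder identifiers
)
print(ET.tostring(struct_map, encoding="unicode"))
```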
Possible AIC/AIP designs
Option 1
Option 2
Option 3
Option 4