Dataset preservation
Workflow
- Metadata ingest: Metadata will be created outside of Archivematica prior to ingest and added to the metadata folder of the transfer. See Metadata, below.
- Metadata validation: Archivematica should include a micro-service to validate metadata on ingest, using a tool such as xmllint. Sample validation command: xmllint --schema ddi:instance:3_1 metadata/CCRI-CDN-Census1911V20110628.xml. Note that ddi:instance:3_1 is the DDI 3.1 instance namespace; in practice the --schema option expects the path or URL of the schema file itself.
- Normalization: Some datasets may require manual normalization; see https://projects.artefactual.com/issues/1499.
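The validation step above could be wrapped for use in a micro-service roughly as follows. This is only a sketch, not Archivematica code: the function names and the schema path in the usage note are hypothetical, and it assumes a local copy of the DDI schema is available and that xmllint is on the PATH.

```python
import shutil
import subprocess


def build_xmllint_command(schema_path, metadata_path):
    """Return the argv list for validating one XML file against an XSD."""
    # --noout suppresses echoing the parsed document;
    # --schema selects W3C XML Schema validation against the given file.
    return ["xmllint", "--noout", "--schema", str(schema_path), str(metadata_path)]


def validate_metadata(schema_path, metadata_path):
    """Run xmllint and return (is_valid, messages)."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint is not installed or not on PATH")
    result = subprocess.run(
        build_xmllint_command(schema_path, metadata_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with 0 when the document validates, non-zero otherwise;
    # validation messages are written to stderr.
    return result.returncode == 0, result.stderr.strip()
```

A call such as validate_metadata("schemas/ddi/instance.xsd", "metadata/CCRI-CDN-Census1911V20110628.xml") would then return a pass/fail flag plus any xmllint messages; the schema location shown here is a placeholder.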
Metadata
METS and DDI/FGDC
- DDI is the Data Documentation Initiative, a metadata specification for the social and behavioral sciences; see http://www.ddialliance.org/.
- FGDC is the Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998]; see http://www.fgdc.gov/metadata/csdgm/.
- DDI and FGDC are considered descriptive metadata (dmdSec) in METS. From http://www.loc.gov/standards/mets/METSOverview.v2.html: "Valid values for the MDTYPE element [in dmdSec] include ... DDI (Data Documentation Initiative), FGDC (Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998])."
- In the Archivematica METS file, a DDI or FGDC file could be referenced from the dmdSec using mdRef, for example as follows: <mdRef LABEL="CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e" xlink:href="metadata/CCRI-CDN-Census1911V20110628.xml" MDTYPE="DDI" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>.
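For illustration, an mdRef like the one above, wrapped in a descriptive metadata section (dmdSec in the METS schema), could be generated with Python's standard library. This is a minimal sketch: the function name and the dmdSec ID value are hypothetical, and the attribute set simply mirrors the example above rather than any confirmed Archivematica implementation.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_dmdsec(dmd_id, label, href, mdtype="DDI"):
    """Build a dmdSec element holding a single mdRef to an external file."""
    dmdsec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    ET.SubElement(
        dmdsec,
        f"{{{METS_NS}}}mdRef",
        {
            "LABEL": label,
            f"{{{XLINK_NS}}}href": href,
            "MDTYPE": mdtype,          # "DDI" or "FGDC" for the cases above
            "LOCTYPE": "OTHER",
            "OTHERLOCTYPE": "SYSTEM",
        },
    )
    return dmdsec


# Hypothetical ID; label and href are taken from the example above.
section = build_dmdsec(
    "dmdSec_1",
    "CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e",
    "metadata/CCRI-CDN-Census1911V20110628.xml",
)
print(ET.tostring(section, encoding="unicode"))
```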
METS and other metadata standards
- Other metadata standards that could be used for ingested datasets include:
- North American Profile (NAP) of ISO 19115, for geospatial metadata: http://www.fgdc.gov/metadata/geospatial-metadata-standards
- SDMX for aggregate data: http://sdmx.org/?page_id=10
- EML, the Ecological Metadata Language: http://knb.ecoinformatics.org/software/eml/eml-2.1.1/index.html
- If these standards are used, the mdRef in the METS file would need to set MDTYPE to OTHER and name the standard in OTHERMDTYPE, for example: <mdRef LABEL="CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e" xlink:href="metadata/CCRI-CDN-Census1911V20110628.xml" MDTYPE="OTHER" OTHERMDTYPE="SDMX" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>.
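The choice between a named MDTYPE and the OTHER/OTHERMDTYPE pair can be expressed as a small lookup. The sketch below is an assumption about how such a helper might look, not existing Archivematica code; the function name is hypothetical, and the set of recognised values is deliberately limited to the two standards discussed on this page (the METS schema defines more).

```python
# Subset of the MDTYPE values enumerated by the METS schema;
# only the standards discussed on this page are listed here.
KNOWN_MDTYPES = {"DDI", "FGDC"}


def mdtype_attributes(standard):
    """Return the MDTYPE-related attributes for an mdRef element."""
    name = standard.strip().upper()
    if name in KNOWN_MDTYPES:
        return {"MDTYPE": name}
    # Anything METS does not enumerate is recorded under OTHERMDTYPE.
    return {"MDTYPE": "OTHER", "OTHERMDTYPE": name}


print(mdtype_attributes("DDI"))
print(mdtype_attributes("SDMX"))
```

The returned dictionary can be merged with the LABEL, xlink:href, LOCTYPE and OTHERLOCTYPE attributes when the mdRef is written.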
Hierarchical AIC/AIP structure
- Because datasets can be large and heterogeneous, one "dataset" may be broken into multiple AIPs. In such cases, the multiple AIPs can be intellectually combined into one AIC, or Archival Information Collection, defined by the OAIS reference model as "[a]n Archival Information Package whose Content Information is an aggregation of other Archival Information Packages." (OAIS 1-9).
- The AIC will consist entirely of a METS file with aggregate-level descriptive metadata (e.g. metadata for the dataset or study as a whole) plus a logical structMap listing all child AIPs.
- Each child AIP will include a logical structMap pointing to the parent AIC. The aggregate-level descriptive metadata will NOT be duplicated in the child AIP.
- In storage, a pointer.xml file gives the URI and extraction (e.g. unzipping) information for an AIC or stand-alone AIP. Question: does the pointer.xml file give the URI and extraction information for AIPs that are children of AICs, or is that information captured in the AIC?
- Below is a sample METS file for a simple AIC consisting of one aggregate (i.e. dataset-level) metadata file and two AIPs. The dmdSec consists of a link to the metadata file (which is packaged as part of the AIC), and the structMap consists of links to the METS files in the constituent AIPs.
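As a rough illustration of the logical structMap described in this section, the list of child AIPs could be assembled programmatically. This sketch makes several assumptions that are not confirmed by this page: the function name, the div TYPE values, the use of the METS mptr element to point at each child METS file, and the example paths are all hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_aic_structmap(child_mets_paths):
    """Build a logical structMap with one div per child AIP."""
    structmap = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    aic_div = ET.SubElement(
        structmap,
        f"{{{METS_NS}}}div",
        {"TYPE": "Archival Information Collection"},  # assumed TYPE value
    )
    for path in child_mets_paths:
        aip_div = ET.SubElement(
            aic_div,
            f"{{{METS_NS}}}div",
            {"TYPE": "Archival Information Package"},  # assumed TYPE value
        )
        # mptr is the METS element for pointing at another METS document.
        ET.SubElement(
            aip_div,
            f"{{{METS_NS}}}mptr",
            {
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "SYSTEM",
                f"{{{XLINK_NS}}}href": path,
            },
        )
    return structmap


# Placeholder paths for two child AIPs.
aic_map = build_aic_structmap(["aip-1/METS.xml", "aip-2/METS.xml"])
print(ET.tostring(aic_map, encoding="unicode"))
```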