Dataset preservation

From Archivematica

Revision as of 17:56, 12 June 2013

Workflow

  • Metadata ingest: Metadata will be created outside of Archivematica prior to ingest and added to the metadata folder of the transfer. See Metadata, below.
  • Metadata validation: Archivematica should include a micro-service to validate metadata on ingest, using something like xmllint. Sample validation command: xmllint --schema ddi:instance:3_1 metadata/CCRI-CDN-Census1911V20110628.xml.
  • Normalization: Some datasets may require manual normalization; see https://projects.artefactual.com/issues/1499.
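
The validation step could be wrapped roughly as follows. This is only a sketch of what such a micro-service might do, not Archivematica's implementation: the function names are hypothetical, and it assumes the schema argument is a path to a local XSD file, which is what xmllint's --schema option expects.

```python
import subprocess


def build_xmllint_command(schema_path, metadata_path):
    """Return the xmllint argument list for validating one metadata file.

    schema_path is assumed to be a local XSD file (hypothetical location).
    """
    # --noout suppresses printing of the parsed document, so only
    # validation messages are emitted.
    return ["xmllint", "--noout", "--schema", schema_path, metadata_path]


def validate_metadata(schema_path, metadata_path):
    """Run xmllint; return (is_valid, messages).

    xmllint exits with status 0 when the document validates.
    """
    result = subprocess.run(
        build_xmllint_command(schema_path, metadata_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr
```

A caller would pass the schema location and a file from the transfer's metadata folder, for example validate_metadata("schemas/ddi/instance.xsd", "metadata/CCRI-CDN-Census1911V20110628.xml"), where the schema path is a placeholder.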


Metadata

METS and DDI/FGDC

  • DDI is Data Documentation Initiative, a metadata specification for the social and behavioral sciences; see http://www.ddialliance.org/.
  • FGDC is Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998]; see http://www.fgdc.gov/metadata/csdgm/
  • DDI and FGDC are considered descriptive metadata (dmdSec) in METS. From http://www.loc.gov/standards/mets/METSOverview.v2.html: "Valid values for the MDTYPE element [in dmdSec] include...DDI (Data Documentation Initiative), FGDC (Federal Geographic Data Committee Metadata Standard [FGDC-STD-001-1998])."
    • In the Archivematica METS file, a DDI or FGDC file could be referenced from the dmdSec using mdRef, for example: <mdRef LABEL="CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e" xlink:href="metadata/CCRI-CDN-Census1911V20110628.xml" MDTYPE="DDI" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>.
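
As an illustration, an mdRef like the one above could be generated with Python's standard library. This is a minimal sketch: the function name and the dmdSec ID value are hypothetical, and the mdRef is wrapped in a dmdSec, the METS element for descriptive metadata.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_md_ref(relative_path, label, mdtype):
    """Build a dmdSec holding one mdRef that points at an external file.

    mdtype would be "DDI" or "FGDC"; the ID value is a placeholder.
    """
    dmd_sec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": "dmdSec_1"})
    ET.SubElement(
        dmd_sec,
        f"{{{METS_NS}}}mdRef",
        {
            "LABEL": label,
            f"{{{XLINK_NS}}}href": relative_path,
            "MDTYPE": mdtype,
            "LOCTYPE": "OTHER",
            "OTHERLOCTYPE": "SYSTEM",
        },
    )
    return dmd_sec


section = build_md_ref(
    "metadata/CCRI-CDN-Census1911V20110628.xml",
    "CCRI-CDN-Census1911V20110628.xml-73b93b28-be1b-433f-861e-03bc321dfe7e",
    "DDI",
)
print(ET.tostring(section, encoding="unicode"))
```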


METS and other metadata standards


Hierarchical AIC/AIP structure

  • Because datasets can be large and heterogeneous, one "dataset" may be broken into multiple AIPs. In such cases, the multiple AIPs can be intellectually combined into one AIC, or Archival Information Collection, defined by the OAIS reference model as "[a]n Archival Information Package whose Content Information is an aggregation of other Archival Information Packages." (OAIS 1-9).
    • The AIC may consist of a METS file with aggregate-level descriptive metadata (e.g. metadata for the dataset or study as a whole) plus a logical structMap listing all child AIPs (see options and variations in Possible AIC/AIP designs, below).
    • In storage, a pointer.xml file gives the URI and extraction (e.g. unzipping) information for an AIC or stand-alone AIP. Question: does the pointer.xml file give the URI and extraction info for AIPs that are children of AICs, or is that information captured in the AIC?


Archival storage area containing pointer files, AICs and AIPs


Possible AIC/AIP designs

Option 1

Description: An AIC consisting of only a structMap; AIPs consisting of data files and metadata for those data files; an AIP consisting of project/program-level (i.e. dataset) metadata and documentation.


AIP1G.png


Workflow:

  1. User creates X number of AIPs and puts them in archival storage
    • One of these AIPs consists only of metadata and documentation about the program/project as a whole
    • The AIPs must have one or more common metadata elements that allow them to be identified as being related
  2. User searches for AIPs in archival storage tab
  3. Once search results are retrieved, user clicks "Create AIC" button
  4. AIC is created, containing only a METS structMap listing all AIPs
  5. Over time, user can add new AIPs and re-create the AIC at any time; the new AIC should replace the old one
  6. Over time, if needed, the user either updates the existing documentation AIP or adds new documentation AIPs (i.e. there can be more than one documentation AIP per dataset)
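
The search, create and re-create steps of this workflow could be sketched as follows. Everything here is an assumption made for illustration: AIPs are modelled as plain dictionaries, the metadata element "project" and all function names are hypothetical, and the structMap is built without METS namespaces to keep the example short.

```python
import xml.etree.ElementTree as ET


def find_related_aips(aips, element, value):
    """Select the AIPs that share a common metadata element value."""
    return [aip for aip in aips if aip["metadata"].get(element) == value]


def build_aic_struct_map(aic_label, related_aips):
    """Build a logical structMap with one div per child AIP."""
    struct_map = ET.Element("structMap", {"TYPE": "logical"})
    collection = ET.SubElement(
        struct_map, "div", {"TYPE": "AIC", "LABEL": aic_label}
    )
    for aip in related_aips:
        ET.SubElement(collection, "div", {"TYPE": "AIP", "LABEL": aip["uuid"]})
    return struct_map


def recreate_aic(storage, aic_label, aips, element, value):
    """Create the AIC, replacing any earlier AIC stored under the same label."""
    related = find_related_aips(aips, element, value)
    storage[aic_label] = build_aic_struct_map(aic_label, related)
    return storage[aic_label]


aips = [
    {"uuid": "aip-1", "metadata": {"project": "census-1911"}},
    {"uuid": "aip-2", "metadata": {"project": "other-study"}},
    {"uuid": "aip-docs", "metadata": {"project": "census-1911"}},
]
storage = {}
aic = recreate_aic(storage, "census-1911", aips, "project", "census-1911")
print(ET.tostring(aic, encoding="unicode"))
```

Because the AIC holds only the listing, adding an AIP later means calling recreate_aic again, and the entry in storage is simply overwritten.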


Pros:

  • Don't have to duplicate program/project-level documentation in each AIP
  • Simple workflow for creating AIC
  • Easy to add new AIPs
  • If program/project documentation needs updating, only one AIP has to be re-processed, or user can add new documentation AIP(s)


Cons:

  • There is only a one-way link between the AIC and child AIPs - i.e. the AIC has a structMap listing all child AIPs, but there is nothing in a child AIP to indicate that it belongs to a given AIC.


Questions:

  • How do we distinguish the documentation AIP from the content AIPs? Maybe through transfer naming conventions?


Option 2

Description: An AIC consisting of a METS structMap and project/program-level (i.e. dataset) metadata and documentation; content AIPs consisting of data files and metadata about the data files. AIPs have information in the METS files linking them to the parent AIC.


AIP2G.png


Workflow: To be determined - probably a dashboard tab with a GUI to allow users to arrange existing AIPs into an AIC


Pros:

  • Don't have to duplicate program/project-level documentation in each AIP
  • AIPs have a link up to the AIC, so if an AIP is orphaned the relationship to the AIC can easily be reconstructed
  • If program/project-level metadata and documentation need to be updated, only the AIC needs to be re-processed
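
To illustrate the second point, the upward link lets an AIC's listing be checked against the AIPs that claim it as a parent. This is a sketch only: the "parent_aic" key stands in for whatever linking information the AIP's METS file would carry, and the function name is hypothetical.

```python
def find_unlisted_children(aic_id, listed_uuids, aips):
    """Return UUIDs of AIPs that point up to aic_id but are missing
    from the AIC's structMap listing (i.e. orphaned AIPs)."""
    listed = set(listed_uuids)
    return [
        aip["uuid"]
        for aip in aips
        if aip.get("parent_aic") == aic_id and aip["uuid"] not in listed
    ]


aips = [
    {"uuid": "aip-1", "parent_aic": "aic-A"},
    {"uuid": "aip-2", "parent_aic": "aic-A"},
    {"uuid": "aip-3", "parent_aic": "aic-B"},
]
# The AIC's structMap lists only aip-1, so aip-2 is reported as orphaned.
orphans = find_unlisted_children("aic-A", ["aip-1"], aips)
```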


Cons:

  • Workflow to create this structure may be complex
  • No obvious mechanism for adding new AIPs over time


Option 3

Description: An AIC with a unique identifier consisting of project/program-level (i.e. dataset) metadata and documentation only (no structMap); AIPs consisting of data files, metadata for those data files, and the same identifier as the AIC. The relationship between the AIC and AIPs in this scenario is inferred from the matching identifiers.
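
Inferring the relationship from matching identifiers might look like the sketch below. The record layout ("kind", "identifier", "name") and the function name are assumptions made for the example, not a defined Archivematica structure.

```python
from collections import defaultdict


def group_by_identifier(packages):
    """Group packages by shared identifier: one AIC plus its AIPs.

    No structMap is consulted; membership is inferred from the identifier.
    """
    groups = defaultdict(lambda: {"aic": None, "aips": []})
    for package in packages:
        group = groups[package["identifier"]]
        if package["kind"] == "AIC":
            group["aic"] = package["name"]
        else:
            group["aips"].append(package["name"])
    return dict(groups)


packages = [
    {"kind": "AIC", "identifier": "ds-001", "name": "census-docs"},
    {"kind": "AIP", "identifier": "ds-001", "name": "census-data-1"},
    {"kind": "AIP", "identifier": "ds-001", "name": "census-data-2"},
    {"kind": "AIP", "identifier": "ds-002", "name": "unrelated-data"},
]
groups = group_by_identifier(packages)
```

An AIP whose identifier matches no AIC shows up as a group with "aic" set to None, which is one way unmatched packages could be detected.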


AIP3G.png


Workflow:

Option 4

AIP4G.png