Difference between revisions of "Research data management"

From Archivematica
(Update to GH link)
 
(27 intermediate revisions by 4 users not shown)
Line 10: Line 10:
 
*[[Dataverse]]
 
*[[Dataverse]]
 
*[http://digital-archiving.blogspot.ca Digital Archiving blog] written by archivists at University of York
 
*[http://digital-archiving.blogspot.ca Digital Archiving blog] written by archivists at University of York
 +
*[https://github.com/archivematica/archivematica-case-studies/blob/master/resources/2016-02-01-JiscDataSpringFinalReportphase2.pdf Filling the Digital Preservation Gap] written by archivists at the Universities of York and Hull
 +
 +
</br>
  
 
=Automated DIP generation=
 
=Automated DIP generation=
Line 22: Line 25:
 
These developments are necessary for research data management to meet use cases when research data is stored without the expectation that it will be re-used, but then subsequently a need for re-use arises and/or is approved by the creator of the data.
 
These developments are necessary for research data management to meet use cases when research data is stored without the expectation that it will be re-used, but then subsequently a need for re-use arises and/or is approved by the creator of the data.
  
==University of York workflow==
+
</br>
 +
 
 +
==Workflow==
  
 
[[File:York_DIP_gen_v1.png|820px]]
 
[[File:York_DIP_gen_v1.png|820px]]
Line 32: Line 37:
 
*When DIP stored by Archivematica,return DIP message goes to RDMonitor
 
*When DIP stored by Archivematica,return DIP message goes to RDMonitor
 
*If DIP creation/storage fails, failure message sent to RDMonitor.
 
*If DIP creation/storage fails, failure message sent to RDMonitor.
 +
 +
</br>
  
 
=METS parsing=
 
=METS parsing=
Line 40: Line 47:
 
|-
 
|-
 
!style="width:25%"|'''Question'''
 
!style="width:25%"|'''Question'''
!style="width:40%"|'''METS source source'''
+
!style="width:40%"|'''Information source'''
 +
!style="width:25%%"|'''Sample result'''
 +
|-
 +
|How many files are in this package?
 +
|
 +
|integer
 +
|-
 +
|How many original/preservation/metadata files are in this package?
 +
|mets:fileGrp USE="original" / mets:fileGrp USE="preservation" / mets:fileGrp USE="metadata"
 +
|integer
 +
|-
 +
|What is the total volume of files in this package?
 +
|premis:size
 +
|integer and unit, eg 17 GB
 +
|-
 +
|Is there a DIP for this package?
 +
|Storage Service
 +
|Location of DIP
 +
|-
 +
|How many files with PRONOM puid X are in this package?
 +
|premis:formatRegistryKey
 +
|integer
 +
|-
 +
|How many files with format name X are in this package?
 +
|premis:formatName
 +
|integer
 +
|-
 +
|How many files with PRONOM puid/format name X have been normalized?
 +
|premis:formatRegistryKey / premis:formatName; files '''with''' matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation"
 +
|integer
 +
|-
 +
|How many files with PRONOM puid/format name X have not been normalized?
 +
|premis:formatRegistryKey / premis:formatName; files '''without''' matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation"
 +
|integer
 +
|-
 +
|Does the package include descriptive metadata?
 +
|mets:dmdSec
 +
|Yes or No
 +
|-
 +
|Does the package include rights metadata?
 +
|mets:rightsMD
 +
|Yes or No
 +
|-
 +
|How many files are invalid/not well-formed?
 +
|premis:formatRegistryKey; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome>
 +
|integer
 +
|-
 +
|How many files with PRONOM puid/format name X are invalid/not well-formed?
 +
|premis:formatRegistryKey / premis:formatName; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome>
 +
|integer
 +
|-
 +
|What is the directory structure of the files in this package?
 +
|mets:structMap
 +
|?
 +
|-
 +
|What is the size of file X?
 +
|premis:size
 +
|integer and unit, eg 2.2 GB
 +
|-
 +
|What is the checksum format for file X?
 +
|premis:messageDigestAlgorithm
 +
|md5, sha256
 +
|-
 +
|What is the checksum for file X?
 +
|premis:messageDigest
 +
|d32d41f7481afc1ab48779e2608v08d93b5d05cd217a4372e6a93957767ae651
 +
|-
 +
|Has file X been normalized for preservation?
 +
|matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation"
 +
|Yes or no
 +
|-
 +
|To what format has file X been normalized for preservation?
 +
|matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation"; premis:formatName for preservation copy
 +
|TIFF
 +
|-
 +
|When was file X normalized for preservation?
 +
|matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation"; premis:eventType:creation and premis:eventDateTime for preservation copy
 +
|date
 +
|-
 +
|-
 +
|}
 +
 
 +
</br>
 +
 
 +
=Generic search REST API=
 +
 
 +
==METS questions==
 +
 
 +
{| border="1" cellpadding="10" cellspacing="0" width="100%"
 +
|-
 +
!style="width:25%"|'''Question'''
 +
!style="width:40%"|'''Information source'''
 
!style="width:25%%"|'''Sample result'''
 
!style="width:25%%"|'''Sample result'''
 
|-
 
|-
 
|How many files are in archival storage?
 
|How many files are in archival storage?
|
+
|Storage Service
 +
|integer
 +
|-
 +
|How many AIPs are in archival storage?
 +
|Storage Service
 +
|integer
 +
|-
 +
|How many files are in a specified AIP?
 +
|Storage Service
 
|integer
 
|integer
 
|-
 
|-
Line 56: Line 162:
 
|-
 
|-
 
|How many [video, image, plain text etc.] files are in archival storage?
 
|How many [video, image, plain text etc.] files are in archival storage?
|fits:mimetype, File:MIMEType, other mimetype sources?
+
|fits:mimetype, File:MIMEType, other mimetype sources? Alternatively, FPR groups?
 
|integer
 
|integer
 
|-
 
|-
Line 68: Line 174:
 
|-
 
|-
 
|What is the total volume of [video, image, plain text etc.] files in archival storage?
 
|What is the total volume of [video, image, plain text etc.] files in archival storage?
|fits:mimetype, File:MIMEType, other mimetype sources?; premis:size
+
|fits:mimetype, File:MIMEType, other mimetype sources? FPR groups?; premis:size
 
|integer and unit, eg 452 GB
 
|integer and unit, eg 452 GB
 
|-
 
|-
Line 95: Line 201:
 
|integer
 
|integer
 
|-
 
|-
|
+
|How many files were ingested between date X and date Y?
|
+
|<premis:eventType>ingestion</premis:eventType>; premis:eventDateTime
|
+
|integer
 
|-
 
|-
|
+
|How many files with PRONOM puid X are invalid/not well-formed?
|
+
|premis:formatRegistryKey; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome>
|
+
|integer
 
|-
 
|-
|
+
|How many files with format name X are invalid/not well-formed?
|
+
|premis:formatName; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome>
|
+
|integer
 
|-
 
|-
|
+
|How many AIPs have corresponding DIPs?
|
+
|Storage Service
|
+
|integer
 
|-
 
|-
|
+
|How many AIPs do not have corresponding DIPs?
|
+
|Storage Service
|
+
|integer
|-
 
|
 
|
 
|
 
|-
 
|
 
|
 
|
 
 
|-
 
|-
 
|}
 
|}
  
=Generic search REST API=
+
</br>
  
=Multiple checkum algorithms=
+
=Multiple checksum algorithms=
  
 
=Enhance PRONOM integration=
 
=Enhance PRONOM integration=
 +
 +
Priorities for this phase are (in order of priority):
 +
 +
1. Provide report of non-identified files in a SIP or AIP, with access to the file identification tool output
 +
 +
2. Provide direct access to the PRONOM submission form from within Archivematica
 +
 +
Alternatives being discussed:
 +
 +
- Post tool output and optionally a sample file to the FPR, making that available publically and to PRONOM
 +
 +
3. Allow a user to manually assign pronom IDs to non-identified files; record manual selection in the AIP METS file
  
 
=Automation tools documentation=
 
=Automation tools documentation=
 +
 +
Archivematica has features that enable quite a bit of automation and interaction from 3rd party applications, but most of that functionality is hidden and not well documented.  The goal of this part of the work on improving Research data management workflows is about adding documentation, in written form and as screencasts/videos, aimed at developers, and other technical users of Archivematica.  This will make it easier for developers working on other applications to integrate Archivematica into their organizations digital preservation workflows.
 +
 +
Potential topics
 +
 +
* introduction to the Archivematica
 +
* setting up an Archivematica development
 +
* automation tools introduction
 +
* implementing an automated workflow in
 +
* managing an Archivematica installation
 +
* configuration and troubleshooting
 +
* api documentation

Latest revision as of 08:36, 28 April 2020


About

This page describes requirements for enhancements to Archivematica to better handle research data management. The work is funded by Jisc, through the University of York and the University of Hull.

See also


Automated DIP generation

The tasks related to this phase of development are:

  • change workflow so that the ‘upload DIP’ choice can be preconfigured.
  • update AIP reingest workflow to allow uncompressed AIPs to be reingested.
  • enhance the callback functionality in the Storage Service, to notify third party apps when a DIP is ready to be used.


These developments are needed to support use cases in which research data is stored without the expectation that it will be re-used, but a need for re-use subsequently arises and/or is approved by the creator of the data.


Workflow

York DIP gen v1.png

  • An access request is initiated through a staff alert to RDMonitor.
  • A request for the DIP is sent to the Storage Service.
    • If the DIP already exists in storage, the DIP is returned.
    • If the DIP does not exist, a wait response is sent and an AIP re-ingest is initiated in the Archivematica pipeline.
  • When the DIP has been stored by Archivematica, a return-DIP message goes to RDMonitor.
  • If DIP creation or storage fails, a failure message is sent to RDMonitor.
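
The request and callback steps above can be sketched as follows. This is only an illustration under stated assumptions: `StorageServiceStub`, its method names, and the message dictionaries are hypothetical and are not the Storage Service API; the `notify` callable stands in for whatever mechanism delivers messages to RDMonitor.

```python
from typing import Callable, Dict, Optional


class StorageServiceStub:
    """Hypothetical in-memory stand-in for the Storage Service (not its real API)."""

    def __init__(self, notify: Callable[[dict], None]):
        self.dips: Dict[str, str] = {}  # AIP UUID -> DIP location
        self.notify = notify            # assumed callback that reaches RDMonitor

    def request_dip(self, aip_uuid: str) -> dict:
        """Return the DIP if one is stored, otherwise a wait response."""
        location = self.dips.get(aip_uuid)
        if location is not None:
            return {"status": "ready", "aip": aip_uuid, "dip": location}
        # No DIP yet: a real system would start the AIP re-ingest at this point.
        return {"status": "wait", "aip": aip_uuid}

    def reingest_finished(self, aip_uuid: str, dip_location: Optional[str]) -> None:
        """Record the outcome of a re-ingest and notify RDMonitor either way."""
        if dip_location is None:
            self.notify({"status": "failed", "aip": aip_uuid})
            return
        self.dips[aip_uuid] = dip_location
        self.notify({"status": "ready", "aip": aip_uuid, "dip": dip_location})


if __name__ == "__main__":
    received = []
    service = StorageServiceStub(notify=received.append)
    print(service.request_dip("aip-1"))        # wait response, no DIP stored yet
    service.reingest_finished("aip-1", "/dips/aip-1")
    print(service.request_dip("aip-1"))        # DIP is now returned directly
    print(received)                            # messages delivered to RDMonitor
```

Keeping the "DIP ready" and "DIP failed" notifications in one place mirrors the last two bullets: RDMonitor hears about the outcome whether or not the DIP could be created.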


METS parsing

METS questions

Question | Information source | Sample result
How many files are in this package? | | integer
How many original/preservation/metadata files are in this package? | mets:fileGrp USE="original" / mets:fileGrp USE="preservation" / mets:fileGrp USE="metadata" | integer
What is the total volume of files in this package? | premis:size | integer and unit, eg 17 GB
Is there a DIP for this package? | Storage Service | Location of DIP
How many files with PRONOM puid X are in this package? | premis:formatRegistryKey | integer
How many files with format name X are in this package? | premis:formatName | integer
How many files with PRONOM puid/format name X have been normalized? | premis:formatRegistryKey / premis:formatName; files with matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
How many files with PRONOM puid/format name X have not been normalized? | premis:formatRegistryKey / premis:formatName; files without matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
Does the package include descriptive metadata? | mets:dmdSec | Yes or No
Does the package include rights metadata? | mets:rightsMD | Yes or No
How many files are invalid/not well-formed? | premis:formatRegistryKey; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome> | integer
How many files with PRONOM puid/format name X are invalid/not well-formed? | premis:formatRegistryKey / premis:formatName; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome> | integer
What is the directory structure of the files in this package? | mets:structMap | ?
What is the size of file X? | premis:size | integer and unit, eg 2.2 GB
What is the checksum format for file X? | premis:messageDigestAlgorithm | md5, sha256
What is the checksum for file X? | premis:messageDigest | d32d41f7481afc1ab48779e2608v08d93b5d05cd217a4372e6a93957767ae651
Has file X been normalized for preservation? | matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | Yes or no
To what format has file X been normalized for preservation? | matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation"; premis:formatName for preservation copy | TIFF
When was file X normalized for preservation? | matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation"; premis:eventType:creation and premis:eventDateTime for preservation copy | date
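
A few of the questions above can be answered with a short parser. The sketch below is a minimal illustration, not Archivematica's own METS reader: the function names and the tiny `SAMPLE` document are invented for the example, and PREMIS elements are matched by checking that the namespace URI contains "premis" (an assumption made so the code does not depend on a particular PREMIS version). It uses only the Python standard library.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS = "{http://www.loc.gov/METS/}"

# Invented, heavily reduced sample; a real AIP METS file is far larger.
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:premis="http://www.loc.gov/premis/v3">
  <mets:dmdSec ID="dmdSec_1"/>
  <mets:amdSec ID="amdSec_1">
    <premis:size>100</premis:size>
    <premis:size>250</premis:size>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="f1" GROUPID="g1"/>
      <mets:file ID="f2" GROUPID="g2"/>
    </mets:fileGrp>
    <mets:fileGrp USE="preservation">
      <mets:file ID="f3" GROUPID="g1"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def is_premis(elem, name):
    """True if elem is <name> in a namespace whose URI mentions 'premis'."""
    if not elem.tag.startswith("{"):
        return False
    namespace, _, local_name = elem.tag[1:].partition("}")
    return local_name == name and "premis" in namespace.lower()


def summarize_mets(xml_text):
    """Answer package-level questions from one METS document."""
    root = ET.fromstring(xml_text)
    counts = Counter()   # USE value -> number of files
    group_ids = {}       # USE value -> set of GROUPID values
    for group in root.iter(METS + "fileGrp"):
        use = group.get("USE", "")
        for file_elem in group.findall(METS + "file"):
            counts[use] += 1
            if file_elem.get("GROUPID"):
                group_ids.setdefault(use, set()).add(file_elem.get("GROUPID"))
    sizes = [e.text.strip() for e in root.iter() if is_premis(e, "size") and e.text]
    # An original counts as normalized when a preservation file shares its GROUPID.
    normalized = group_ids.get("original", set()) & group_ids.get("preservation", set())
    return {
        "files_total": sum(counts.values()),
        "files_by_use": dict(counts),
        "total_bytes": sum(int(s) for s in sizes if s.isdigit()),
        "normalized_originals": len(normalized),
        "has_descriptive_metadata": root.find(".//" + METS + "dmdSec") is not None,
        "has_rights_metadata": root.find(".//" + METS + "rightsMD") is not None,
    }


if __name__ == "__main__":
    print(summarize_mets(SAMPLE))
```

Restricting the size lookup to PREMIS-namespaced elements matters because characterization tool output embedded in the same file may carry its own size fields, which would otherwise be counted twice.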


Generic search REST API

METS questions

Question | Information source | Sample result
How many files are in archival storage? | Storage Service | integer
How many AIPs are in archival storage? | Storage Service | integer
How many files are in a specified AIP? | Storage Service | integer
How many files with PRONOM puid X are in archival storage? | premis:formatRegistryKey | integer
How many files with format name X are in archival storage? | premis:formatName | integer
How many [video, image, plain text etc.] files are in archival storage? | fits:mimetype, File:MIMEType, other mimetype sources? Alternatively, FPR groups? | integer
What is the total volume of files with PRONOM puid X in archival storage? | premis:formatRegistryKey; premis:size | integer and unit, eg 452 GB
What is the total volume of files with format name X in archival storage? | premis:formatName; premis:size | integer and unit, eg 452 GB
What is the total volume of [video, image, plain text etc.] files in archival storage? | fits:mimetype, File:MIMEType, other mimetype sources? FPR groups?; premis:size | integer and unit, eg 452 GB
How many files with PRONOM puid X have been normalized? | premis:formatRegistryKey; files with matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
How many files with PRONOM puid X have not been normalized? | premis:formatRegistryKey; files without matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
How many files with format name X have been normalized? | premis:formatName; files with matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
How many files with format name X have not been normalized? | premis:formatName; files without matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
How many [video, image, plain text etc.] files have been normalized? | fits:mimetype, File:MIMEType, other mimetype sources?; files with matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
How many [video, image, plain text etc.] files have not been normalized? | fits:mimetype, File:MIMEType, other mimetype sources?; files without matching GROUPID attributes in mets:fileGrp USE="original" and mets:fileGrp USE="preservation" | integer
How many files were ingested between date X and date Y? | <premis:eventType>ingestion</premis:eventType>; premis:eventDateTime | integer
How many files with PRONOM puid X are invalid/not well-formed? | premis:formatRegistryKey; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome> | integer
How many files with format name X are invalid/not well-formed? | premis:formatName; <premis:eventType>validation</premis:eventType>; <premis:eventOutcome>fail</premis:eventOutcome> | integer
How many AIPs have corresponding DIPs? | Storage Service | integer
How many AIPs do not have corresponding DIPs? | Storage Service | integer
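
The storage-wide questions above amount to filtering and aggregating per-file records. The following sketch assumes such records have already been extracted from each AIP's METS file into a flat index; `FileRecord`, `select`, `total_volume`, and `human_size` are hypothetical names used only for illustration, and sizes are reported in decimal units (1 GB = 10^9 bytes), which is also an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileRecord:
    """One indexed file; the fields mirror the information sources in the table."""
    aip_uuid: str
    puid: Optional[str]   # premis:formatRegistryKey
    format_name: str      # premis:formatName
    size: int             # premis:size, in bytes
    ingested: date        # premis:eventDateTime of the ingestion event
    normalized: bool      # GROUPID shared by original and preservation copies
    valid: bool           # False if a validation event has outcome "fail"


def select(records: Iterable[FileRecord], puid=None, format_name=None,
           normalized=None, valid=None, start=None, end=None) -> List[FileRecord]:
    """Return the records matching every criterion that is not None."""
    matches = []
    for r in records:
        if puid is not None and r.puid != puid:
            continue
        if format_name is not None and r.format_name != format_name:
            continue
        if normalized is not None and r.normalized != normalized:
            continue
        if valid is not None and r.valid != valid:
            continue
        if start is not None and r.ingested < start:
            continue
        if end is not None and r.ingested > end:
            continue
        matches.append(r)
    return matches


def human_size(num_bytes: int) -> str:
    """Format a byte count as 'integer and unit' using decimal units."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1000 or unit == "TB":
            return f"{value:.1f} {unit}"
        value /= 1000


def total_volume(records: Iterable[FileRecord], **criteria) -> str:
    """Total size of the matching files, formatted for display."""
    return human_size(sum(r.size for r in select(records, **criteria)))
```

With an index like this, "How many files with PRONOM puid X have not been normalized?" becomes `len(select(records, puid="fmt/353", normalized=False))`, and the date-range question uses the `start` and `end` arguments.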


Multiple checksum algorithms

Enhance PRONOM integration

Priorities for this phase, in order:

1. Provide a report of non-identified files in a SIP or AIP, with access to the file identification tool output

2. Provide direct access to the PRONOM submission form from within Archivematica

Alternatives being discussed:

- Post tool output and, optionally, a sample file to the FPR, making it available publicly and to PRONOM

3. Allow a user to manually assign PRONOM IDs to non-identified files; record the manual selection in the AIP METS file
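
Priorities 1 and 3 can be illustrated with a small sketch. It assumes each file is represented as a dictionary with `path`, `puid`, and `tool_output` keys; that shape, the function names, and the `identification` note are hypothetical and do not describe how Archivematica stores this information or writes it to the METS file.

```python
import re

# PRONOM PUIDs look like "fmt/353" or "x-fmt/111".
PUID_PATTERN = re.compile(r"^(x-)?fmt/\d+$")


def unidentified_report(files):
    """List files with no PUID, together with the identification tool output."""
    return [
        {"path": f["path"], "tool_output": f.get("tool_output", "")}
        for f in files
        if not f.get("puid")
    ]


def assign_puid(entry, puid, user):
    """Return a copy of entry with a manually chosen PUID and a note of who chose it."""
    if not PUID_PATTERN.match(puid):
        raise ValueError(f"not a PRONOM PUID: {puid!r}")
    updated = dict(entry)
    updated["puid"] = puid
    # Hypothetical record of the manual selection, to be carried into the AIP METS.
    updated["identification"] = {"method": "manual", "user": user}
    return updated


if __name__ == "__main__":
    files = [
        {"path": "objects/scan.tif", "puid": "fmt/353"},
        {"path": "objects/results.dat", "puid": None, "tool_output": "no match"},
    ]
    print(unidentified_report(files))
    print(assign_puid(files[1], "x-fmt/111", "archivist"))
```

Returning a copy from `assign_puid` keeps the original tool result available, so the report can still show what the identification tool produced before the manual choice was made.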

Automation tools documentation

Archivematica has features that enable a good deal of automation and interaction from third-party applications, but most of that functionality is hidden and not well documented. The goal of this part of the work on improving research data management workflows is to add documentation, in written form and as screencasts/videos, aimed at developers and other technical users of Archivematica. This will make it easier for developers working on other applications to integrate Archivematica into their organization's digital preservation workflows.

Potential topics

  • introduction to the Archivematica
  • setting up an Archivematica development
  • automation tools introduction
  • implementing an automated workflow in
  • managing an Archivematica installation
  • configuration and troubleshooting
  • API documentation