<div style="padding: 10px 10px; border: 1px solid black; background-color: #F79086;">This page is no longer being maintained and may contain inaccurate information. Please see the [https://www.archivematica.org/docs/latest/ Archivematica documentation] for up-to-date information.</div><p>
  
This page sets out the requirements and designs for integration with [http://dataverse.org Dataverse]. As of Archivematica [https://wiki.archivematica.org/Archivematica_1.8_and_Storage_Service_0.13_release_notes v. 1.8], Dataverse is supported as a transfer source type, which allows Dataverse research datasets to be selected and processed. Archivematica [https://wiki.archivematica.org/Archivematica_1.9_and_Storage_Service_0.14_release_notes v. 1.9] introduced two fixes to the workflow.

This page was originally created as part of an early proof-of-concept integration project in 2015, which was only made available in a development branch of Archivematica. Phase 2 of this project improved on that original integration work and merged it into a public release of Archivematica (v1.8). This work was sponsored by [https://scholarsportal.info/ Scholars Portal], a service of the Ontario Council of University Libraries (OCUL).

==Current Status==

''April 8, 2020''

Outstanding issues relating to the representation of tabular derivatives in METS are documented in the Archivematica issues repository with the tag [https://github.com/archivematica/Issues/labels/OCUL%3A%20AM-Dataverse "OCUL: AM-Dataverse"]:
* One major outstanding issue relates to changes in Dataverse between versions 4.10 and 4.17 that altered the way files are named; this conflicts with the hard-coding of these names in Archivematica's scripts. [https://github.com/archivematica/Issues/issues/1057 Issue 1057]
* A second, related issue documents the treatment of RData derivatives, which still conflict with Archivematica's regular 'extract packages' workflows. [https://github.com/archivematica/Issues/issues/1058 Issue 1058]

''March 19, 2019''

The integration has been released as part of Archivematica version 1.8. Version 1.9 (released March 6, 2019) fixed the following issues:

* Multiple authors are not captured in the Dataverse METS; only the first author listed is. [https://github.com/archivematica/Issues/issues/278 Issue 278]
* It is not possible to delete packages after extraction using the Dataverse transfer type if the package contains derivatives. [https://github.com/archivematica/Issues/issues/269 Issue 269]

This [https://drive.google.com/open?id=1XlHZF2Sryg_79qzw7G-R4PeWmMcPgRug screencast] provides a demonstration of the current implementation.

OCUL/Scholars Portal is currently hosting a demonstration sandbox for interested users to test the integration. Please visit the [https://spotdocs.scholarsportal.info/display/DAT/Archivematica+Demo+Sandbox sandbox page on the OCUL Confluence site] for information on how to access it. You can read more about the project and its outcomes in Meghan Goodchild and Grant Hurley's 2019 iPRES paper and presentation slides here: https://osf.io/wqbvy/.

== Overview of Dataverse to Archivematica Integration ==

===Setting up the Integration===
To set up and use the integration, users should consult Archivematica’s documentation, particularly the [https://www.archivematica.org/en/docs/storage-service-0.14/administrators/#dataverse Dataverse section] of the Archivematica Storage Service documentation for setup and the section on [https://www.archivematica.org/en/docs/archivematica-1.9/user-manual/transfer/dataverse/#dataverse-transfers Dataverse transfers]. Users who are unfamiliar with Archivematica should consult Archivematica’s [https://www.archivematica.org/en/docs/archivematica-1.9/#overview overview] and [https://www.archivematica.org/en/docs/archivematica-1.9/getting-started/quick-start/quick-start/#quick-start quick start documentation]. Users who are unfamiliar with Dataverse should consult Dataverse's [http://guides.dataverse.org/en/latest/ guides]. Users will need access to both an appropriately provisioned Archivematica instance and an installation of Dataverse for which they have an account.

===Scope of the Integration===
As per the details on feature files below, the integration was designed with the following scope of use:
* The integration presumes a user with an account on a Dataverse instance who has generated an associated [http://guides.dataverse.org/en/latest/user/account.html?highlight=api%20token API key], and a user (the same person or another authorized person) with access to an Archivematica instance and Storage Service connected to that Dataverse via the API key. You can read more in the [http://guides.dataverse.org/en/latest/installation/config.html#id40 Dataverse documentation under "Root Dataverse Permissions"] about the user, admin and superuser categories that might affect access to Dataverse datasets via the API.
* It is assumed that the user has obtained the necessary rights to process and store dataset files in Dataverse for preservation, and has appropriate access to the dataset and/or associated files based on the rights related to their Dataverse API key (see above).
* It is assumed that the preserver is interested in selecting specific datasets in a Dataverse for preservation. SIPs and their resulting AIPs are created from current versions of Dataverse datasets with one or more associated files in that dataset. A dataset is therefore equivalent to a SIP. Individual files cannot be selected for preservation, nor can older versions of files. However, users may make use of Archivematica’s [https://www.archivematica.org/en/docs/archivematica-1.9/user-manual/appraisal/appraisal/#appraisal Appraisal] functions to select individual files in a particular dataset to create a final AIP.
* At present, a function to automate the ingest of all datasets in a Dataverse has not been developed.

=== Feature Files ===
On this project we are using [http://docs.behat.org/en/v2.5/guides/1.gherkin.html Gherkin] feature files to define the desired behaviour of preserving a dataset from a Dataverse. Feature files are also known as Acceptance Tests, because they specify the behaviour that we will test at the end of the project. The draft versions and comments are documented in this [https://docs.google.com/document/d/1KqhpTuiSY2_B5oAM1cgXHAA72hmiUa8SBh4laylTkGo/edit feature file].

'''Feature: Preserve a Dataverse dataset'''

  Alma is an Archivematica user
  And they want to preserve a dataset published in a Dataverse
    ''Definitions''
    Dataverse Dataset: A dataset that has been published in a Dataverse, including all
    original files uploaded to Dataverse, and any derivative files created by Dataverse.
    Dataverse METS: A metadata file using the METS standard that describes a dataset,
    including descriptive metadata, a list of all objects in the dataset, their structure
    and relationships to each other.
  ''Scenario: Manual Selection of Dataset''
    Given the Storage Service is configured to connect to a Dataverse Repository
      And the dataset has been published in Dataverse
    When the user selects the transfer type “Dataverse”
      And the user selects the dataset to be preserved
      And the user enters the <Transfer Name>
      And the user enters the (optional) <Accession number>
      And the user clicks the “Start Transfer” button
    Then Archivematica copies the files from Dataverse to a local processing directory
      And the Approve Transfer microservice asks the user to approve the transfer
      And the user selects yes
      And the Verify Transfer Compliance microservice creates the Dataverse METS
      And the Dataverse metadata files are generated and included in a metadata directory
      And the Verify Transfer Compliance microservice confirms this is a valid Dataverse Transfer
      And the Verify Transfer Checksums microservice confirms the checksums provided by Dataverse match those generated for each file in the dataset
      And the AIP METS file includes the Dataverse generated events
      And the completed AIP is stored in the specified Dataverse storage location

===Dataverse Workflow===

[[File:Dataverse_Workflow_overview.png|800px|thumb|center]]

'''1) User Selects Dataset'''

When the Storage Service is configured to connect to Dataverse, the Transfer Browser in the Dashboard will display a list of all Dataverse Transfer Source Locations. Transfer Source Locations can be configured to filter on search terms, or on a particular dataverse (TODO: add link to Storage Service documentation). Users can browse through the available datasets, select one and set the Transfer type to Dataverse.

'''2) Storage Service Retrieves Dataset'''

The Storage Service uses the Dataverse API to retrieve the selected dataset. API credentials are stored in the Storage Service Space.
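
As a rough illustration of the retrieval step, the sketch below builds the kind of requests a client could send to Dataverse. The endpoint paths and the <code>X-Dataverse-key</code> header follow the Dataverse API guides; the function name, its parameters and the example host are hypothetical, and the sketch makes no claim about which calls the Storage Service actually issues.

```python
from urllib.parse import quote


def dataverse_requests(base_url, api_key, dataset_id, tabular_file_ids=()):
    """Return (url, headers) pairs for one dataset and its tabular bundles.

    Hypothetical helper: nothing is sent over the network here.
    """
    base = base_url.rstrip("/")
    # The API token is passed in a request header rather than in the URL.
    headers = {"X-Dataverse-key": api_key}
    # Dataset-level metadata (the source of dataset.json).
    urls = [f"{base}/api/datasets/{quote(str(dataset_id))}"]
    # One "bundle" download per tabular file.
    for file_id in tabular_file_ids:
        urls.append(f"{base}/api/access/datafile/bundle/{quote(str(file_id))}")
    return [(url, headers) for url in urls]


if __name__ == "__main__":
    for url, _headers in dataverse_requests(
        "https://dataverse.example.org", "not-a-real-token", 42, [7]
    ):
        print(url)
```

Keeping request construction separate from the actual HTTP call makes it easy to check the URLs without touching a live Dataverse.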

'''3) Prepare Transfer'''

Archivematica creates a metadata file called agents.json that includes the agent information configured in the Storage Service. This information is used to populate the PREMIS agent details in the METS files. See [[Dataverse#agents.json]] for more details.

When a dataset includes a "bundle" of related files for tabular data, it is provided as a .zip file. Archivematica extracts all of the files in bundles at this stage. Other .zip files are not affected, and can be extracted or not using the standard processing configuration options (TODO: add link to dataset section).

'''4) Transfer & Ingest'''

Archivematica performs transfer and ingest processes using the standard processing configuration options. Additional processing for Dataverse datasets includes:
* creating a Dataverse METS that describes the dataset as provided by Dataverse
* checking the fixity of files using checksums provided by Dataverse
* including Dataverse metadata (from the Dataverse METS) in the final AIP METS
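
The fixity check in the list above can be sketched as follows. This is a minimal illustration, assuming the checksum supplied by Dataverse is an MD5 hex digest (as described for dataset.json on this page); the function name is hypothetical and this is not Archivematica's own implementation.

```python
import hashlib


def verify_checksum(path, expected_md5, chunk_size=65536):
    """Return True if the file at `path` matches the MD5 supplied by Dataverse."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # Compare case-insensitively and ignore stray whitespace in the metadata.
    return digest.hexdigest() == expected_md5.strip().lower()
```

A blank expected value never matches, which mirrors the known issue with blank checksums described under Known Issues Impacting Transfers.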

'''5) Store the AIP'''

The AIP is stored in whatever location has been configured. Scholars Portal intends to store its AIPs in an S3 location (which is a standard configuration option as of Storage Service version 0.12).

===Packages-related Workflows===
''User-submitted packages''

It is common for Dataverse users to “double-zip” files when uploading them to datasets, that is, to package files and then package the result a second time. Dataverse always unpacks submitted packages, so double-zipping preserves the inner package while saving users the labour of uploading many files one by one. Archivematica users may choose whether they wish to have these packages extracted and/or deleted afterward by setting the corresponding processing configuration options.

''Dataverse-created derivative bundles''

A second set of packages is created by Dataverse in the form of derivative bundles. Derivatives are copies of files in tabular format that Dataverse creates from user-submitted files. Dataverse delivers these bundles to Archivematica as zip packages. See [[Bundles for tabular data files]] below for more details, and Dataverse’s [http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html guide on tabular ingest] for additional documentation. Archivematica always extracts these packages; setting the processing configuration to not extract packages has no effect for this type of transfer.

==Known Issues Impacting Transfers==

The following table summarizes known issues that impact the success of individual transfers. For a full list of known issues, consult the [https://waffle.io/artefactual/archivematica?label=OCUL:%20AM-Dataverse Waffle board for this feature].

{| class="wikitable"
|-
! Issue
! Description
! Failure Step
! Message (last line)
|-
| Dataset has no files
| Datasets that do not contain files (i.e., metadata only) will result in a failed transfer.
| Verify transfer compliance: Convert Dataverse Structure
| ConvertDataverseError: Error adding Dataset files to METS
|-
| Dataset has files with blank checksum values
| The transfer will fail if the dataset has files with blank checksum values (a known issue for certain types of files that were deposited in Dataverse v3.6 or earlier). A user may work around this issue by selecting the “Standard” transfer type and processing the transfer as usual. However, the METS file will not contain descriptive metadata and the Dataverse checksums will not be validated. Administrators may wish to troubleshoot blank checksums in their Dataverse instances to fix this issue.
| Verify transfer compliance: Convert Dataverse Structure
| ValueError: Must provide both checksum and checksumtype, or neither. Provided values:  and MD5
|-
| Dataset has files that failed during the Dataverse tabular ingest upload
| The transfer will fail if any files failed the [http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html tabular ingest] upload process in Dataverse (e.g., resulting in missing .RData and derivative .tab files).
| Parse External Files: Parse Dataverse METS XML
| ParseDataverseError: Exiting. Returning the database objects for our Dataverse files has failed.
|-
| Dataset has derivative packages and “Delete packages after extraction” is set in the processing configuration
| In Archivematica 1.8, the transfer will fail if it contains derivative files (i.e., files that have been uploaded through the tabular ingest process in Dataverse) and the option “Delete packages after extraction” is selected in the Archivematica processing configuration. This is because the .RData files included among the derivative files are themselves packages and will be deleted. This issue was fixed in Archivematica 1.9. The workaround in Archivematica 1.8 is to select ‘no’ for this option in the processing configuration.
| Parse External Files: Parse Dataverse METS XML
| IntegrityError: (1048, "Column 'eventOutcomeDetailNote' cannot be null")
|-
| User attempts to process dataset with restricted files
| Permissions to process datasets through Archivematica correspond to the role permissions associated with a Dataverse via an API token. Therefore, datasets with restricted files must be processed using an administrator or superuser API token; otherwise, processing of these datasets will fail. Note: Dataset Terms of Use may exist for restricted files; in these cases it is expected that the person(s) processing the files in Archivematica respect them. License information for datasets with restricted files is not currently mapped to METS.
| Parse External Files: Parse Dataverse METS XML
| ParseDataverseError: Exiting. Returning the database objects for our Dataverse files has failed.
|-
| User does not select “Dataverse” transfer type
| When processing a Dataverse dataset, users must select the “Dataverse” transfer type from the drop-down menu when initiating the transfer. If the “Standard” transfer type is selected, the dataset may be processed without descriptive metadata. If another transfer type is selected, the transfer will fail.
| Various
| Various
|}
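
Two of the failures in the table (no files, blank checksums) can in principle be detected before a transfer is started. The sketch below is a hypothetical pre-flight check, not part of Archivematica; it assumes the file list has already been reduced to dictionaries with <code>filename</code> and <code>md5</code> keys, which is an illustrative shape rather than the exact Dataverse response.

```python
def preflight_issues(files):
    """Return human-readable problems likely to make a Dataverse transfer fail.

    `files` is a list of dicts with 'filename' and 'md5' keys (assumed shape).
    """
    if not files:
        # Metadata-only datasets fail during "Convert Dataverse Structure".
        return ["dataset has no files"]
    issues = []
    for entry in files:
        # A missing, empty or whitespace-only checksum counts as blank.
        if not (entry.get("md5") or "").strip():
            name = entry.get("filename", "<unnamed>")
            issues.append(f"blank checksum: {name}")
    return issues


if __name__ == "__main__":
    sample = [
        {"filename": "Study_info.pdf", "md5": "0123456789abcdef0123456789abcdef"},
        {"filename": "old_deposit.dat", "md5": ""},
    ]
    print(preflight_issues(sample))
```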

== Dataverse Datasets ==

Dataverse datasets as delivered to Archivematica contain the following:
* The original user-submitted files.
* The agents.json and dataset.json metadata files that describe the files.
* If the user submitted tabular data, a set of derivatives of the original tabular files in several formats, alongside metadata files describing the tabular files. See the Dataverse [http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html documentation] for more information on tabular ingest.

=== Dataset Metadata file - dataset.json ===
This file is provided by Dataverse. It contains citation and other study-level metadata, an entity_id field that is used to identify the study in Dataverse, version information, a list of data files with their own entity_id values, and MD5 checksums for each (original) data file. It does not currently provide checksums for derivatives or metadata files created by Dataverse.
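
To make the role of dataset.json more concrete, the sketch below reads per-file checksums out of a small sample document. The sample is hypothetical and heavily simplified: the <code>files</code>, <code>datafile</code>, <code>filename</code> and <code>md5</code> keys are assumptions chosen to echo the field names used on this page, and a real dataset.json contains much more and may be nested differently.

```python
import json

# Hypothetical, simplified stand-in for a dataset.json file.
SAMPLE = """
{"files": [
  {"datafile": {"filename": "Study_info.pdf",
                "md5": "0123456789abcdef0123456789abcdef"}},
  {"datafile": {"filename": "YVR_weather_data.tab", "md5": ""}}
]}
"""


def checksums_by_file(dataset_json_text):
    """Map each listed filename to its MD5, or None when the value is blank."""
    document = json.loads(dataset_json_text)
    result = {}
    for entry in document.get("files", []):
        datafile = entry.get("datafile", {})
        result[datafile.get("filename")] = datafile.get("md5") or None
    return result


if __name__ == "__main__":
    print(checksums_by_file(SAMPLE))
```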

=== Agents Metadata file - agents.json ===
This file is created by Archivematica. It includes the Agent information that is entered into the Storage Service when configuring a Dataverse Location. To do: add link to final docs once they are updated.

=== Bundles for tabular data files ===

When Dataverse [http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html ingests some forms of tabular data], it creates derivatives of the original data file and additional metadata files. All of these files are provided in a [http://guides.dataverse.org/en/latest/user/dataset-management.html?highlight=bundle bundle] as a zipped package, containing:

* The original file uploaded by the user;
* Different derivative (alternative) formats of the original file (e.g. tab-delimited file, R data file);
* Variable Metadata (as a DDI Codebook XML file);
* Data File Citation (currently in either RIS or EndNote XML format).

'''TO DO''' - update notes on how bundles are retrieved. The original version of this documentation included the following note, which needs to be updated or corrected:

[4] If the JSON file has a content_type of tab-separated values, Archivematica issues an API call for multiple file ("bundled") content download. This returns a zipped package for TSV files containing the .tab file, the original uploaded file, several other derivative formats, a DDI XML file and file citations in EndNote and RIS formats.

== Dataverse METS file ==
 +
 
 +
Archivematica generates a Dataverse METS file that describes the contents of the dataset as retrieved from Dataverse. The Dataverse METS includes:
 +
* descriptive metadata about the dataset, mapped to the [https://www.ddialliance.org/Specification/DDI-Codebook/2.5/ DDI standard]
 +
* a <mets:fileSec> section that lists all files provided, grouped by type (original, metadata or derivative)
 +
* a <mets:structMap> section that describes the structure of the files as provided by Dataverse (particularly helpful for understanding which files were provided in 'bundles')
 +
 
 +
The Dataverse METS is found in the final AIP in this location: <AIP Name>/data/objects/metadata/transfers/<transfer name>/METS.xml
 +
(This is also where you will find the dataset.json metadata file provided by Dataverse, and the agents.json metadata file created by Archivematica).

=== Sample Dataverse METS file ===

<b>Original Dataverse study retrieved through API call:</b>

*dataset.json (a JSON file generated by Dataverse consisting of study-level metadata and information about data files)
*Study_info.pdf (a non-tabular data file)
*A zipped bundle consisting of the following:
**YVR_weather_data.sav (an SPSS SAV file uploaded by the researcher)
**YVR_weather_data.tab (a TAB file generated from the SPSS SAV file by Dataverse)
**YVR_weather_data.RData (an R file generated from the SPSS SAV file by Dataverse)
**YVR_weather_data-ddi.xml, YVR_weather_datacitation-endnote.xml, and YVR_weather_datacitation-ris.ris (three metadata files generated for the TAB file by Dataverse)

<br />
<b>Resulting Dataverse METS file</b>

*The fileSec in the METS file consists of three file groups: USE="original" (the PDF and SAV files); USE="derivative" (the TAB and R files); and USE="metadata" (the JSON file and the three metadata files from the zipped bundle).
*All of the files unpacked from the Dataverse bundle have a GROUPID attribute to indicate the relationship between them. If the transfer had consisted of more than one bundle, each set of unpacked files would have its own GROUPID.
*Three dmdSecs have been generated:
**dmdSec_1, consisting of a small number of study-level DDI terms
**dmdSec_2, consisting of an mdRef to the JSON file
**dmdSec_3, consisting of an mdRef to the DDI XML file
*In the structMap, dmdSec_1 and dmdSec_2 are linked to the study as a whole, while dmdSec_3 is linked to the TAB file. The EndNote and RIS files have not been made into dmdSecs because they contain small subsets of metadata which are already captured in dmdSec_1 and the DDI XML file.

<br />

[[File:METS1G.png|900px|thumb|center]]
[[File:METS2G.png|900px|thumb|center]]
[[File:METS3G.png|900px|thumb|center]]

<br />

<b>Metadata sources for METS file</b>

The table below shows how elements in the METS files are populated from metadata or files provided with Dataverse datasets.

More metadata from Dataverse could be mapped into the METS files. Scholars Portal would like to see more metadata in the AIP to enable better indexing, search and discovery of datasets. To show which fields could be used, we took a version of the Dataverse metadata crosswalk and created our own version that includes Archivematica. The [https://docs.google.com/spreadsheets/d/18Xn4yR-nvbZV5lfrxVNQ8GHM18ilZ_IPocP9UeOtCY4/edit?usp=sharing Dataverse 4.0+ to Archivematica Metadata Crosswalk] provides the same details as the table below, and also highlights additional fields that should ultimately be mapped into METS.

Note that if a user enters descriptive metadata via the Archivematica interface during the transfer process (by going to the transfer report pane > Metadata > Add), this new metadata will overwrite any DDI metadata imported from Dataverse in the final Archivematica METS file.

<br />

{| border="1" cellpadding="10" cellspacing="0" width="100%"
|-
!style="width:15%"|'''METS element'''
!style="width:25%"|'''Information source'''
!style="width:40%"|'''Notes'''
|-
|ddi:titl
|json: citation/typeName: "title", value: [value]
|
|-
|ddi:IDNo
|json: authority, identifier
|json example: "authority": "10.5072/FK2/", "identifier": "0MOPJM"
|-
|ddi:IDNo agency attribute
|json: protocol
|json example: "protocol": "doi"
|-
|ddi:AuthEnty
|json: citation/typeName: "authorName"
|
|-
|ddi:distrbtr
|json: "publisher": "Root Dataverse"
|
|-
|ddi:version date attribute
|json: "releaseTime"
|
|-
|ddi:version type attribute
|json: "versionState"
|
|-
|ddi:version
|json: "versionNumber", "versionMinorNumber"
|
|-
|ddi:restrctn
|json: "termsOfUse"
|
|-
|fileGrp USE="original"
|json: datafile
|Each non-tabular data file is listed as a datafile in the files section. Each TAB file derived by Dataverse for uploaded tabular file formats is also listed as a datafile, with the original file uploaded by the researcher indicated by "originalFileFormat".
|-
|fileGrp USE="derivative"
|All files that are included in a bundle, except for the original file and the metadata files (see below).
|
|-
|fileGrp USE="metadata"
|Any files with a .json or .ris extension, any -ddi.xml files and any -endnote.xml files
|
|-
|CHECKSUM
|json: datafile/"md5": [value]
|
|-
|CHECKSUMTYPE
|json: datafile/"md5"
|
|-
|GROUPID
|Generated by ingest tool. Each file unpacked from a bundle is given the same group id.
|
|-
|}
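
The fileGrp rules in the table can be read as a small decision procedure. The sketch below is one possible reading of those rules and not the code Archivematica uses; the function name and its two boolean parameters are hypothetical.

```python
# Suffixes the table assigns to fileGrp USE="metadata".
METADATA_SUFFIXES = (".json", ".ris", "-ddi.xml", "-endnote.xml")


def file_group(filename, in_bundle=False, is_bundle_original=False):
    """Return the fileGrp USE value suggested by the table for one file."""
    if filename.lower().endswith(METADATA_SUFFIXES):
        return "metadata"
    # Everything else in a bundle, apart from the uploaded original, is a derivative.
    if in_bundle and not is_bundle_original:
        return "derivative"
    return "original"


if __name__ == "__main__":
    print(file_group("Study_info.pdf"))
    print(file_group("YVR_weather_data.tab", in_bundle=True))
    print(file_group("YVR_weather_data-ddi.xml", in_bundle=True))
```

Checking the metadata suffixes first matters: the bundle's citation and DDI files would otherwise be classified as derivatives.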

<br />

== Transfer METS file ==
During transfer processing, a Transfer METS file is created. This is found in the final AIP in this location: <AIP Name>/data/objects/submissionDocumentation/<transfer name>/METS.xml

This is an existing (standard) process that hasn't been changed in this project.

== AIP METS file ==

=== Basic METS file structure ===

The Archival Information Package (AIP) METS file will follow the basic structure for a standard Archivematica AIP METS file described at [[METS]]. A new fileGrp USE="derivative" will be added to indicate TAB, RData and other derivatives generated by Dataverse for uploaded tabular data format files.

=== dmdSecs in AIP METS file ===

The dmdSecs in the Dataverse METS file will be copied over to the AIP METS file.

=== Additions to PREMIS for derivative files ===

In the PREMIS Object entity, relationships between original and derivative tabular format files from Dataverse will be described using PREMIS relationship semantic units. A PREMIS derivation event will be added to indicate the derivative file was generated from the original file, and a Dataverse Agent will be added to indicate the Event was carried out by Dataverse prior to ingest, rather than by Archivematica.

'''Note''' We originally considered adding a creation event for the derivative files as well, but decided that it's not necessary as the event can be inferred from the derivation event and the PREMIS object relationships.

'''Note''' "Derivation" is not an event type on the Library of Congress controlled vocabulary list at http://id.loc.gov/vocabulary/preservation/eventType.html. However, we have submitted it as a proposed new term (November 2015) at http://premisimplementers.pbworks.com/w/page/102413902/Preservation%20Events%20Controlled%20Vocabulary - a list of new terms that is being considered by the PREMIS Editorial Committee.

'''Update''' ''April 2018'': The most recently available Event Type Controlled List (June 2017) does not yet have derivation as a controlled type; see https://www.loc.gov/standards/premis/v3/preservation-events.pdf

Example:

Original SPSS SAV file
<pre>
<premis:relationship>
  <premis:relationshipType>derivation</premis:relationshipType>
  <premis:relationshipSubType>is source of</premis:relationshipSubType>
  <premis:relatedObjectIdentification>
    <premis:relatedObjectIdentifierType>UUID</premis:relatedObjectIdentifierType>
    <premis:relatedObjectIdentifierValue>[TAB file UUID]</premis:relatedObjectIdentifierValue>
  </premis:relatedObjectIdentification>
</premis:relationship>
...
<premis:eventIdentifier>
  <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
  <premis:eventIdentifierValue>[Event UUID assigned by Archivematica]</premis:eventIdentifierValue>
</premis:eventIdentifier>
<premis:eventType>derivation</premis:eventType>
<premis:eventDateTime>2015-08-21</premis:eventDateTime>
<premis:linkingAgentIdentifier>
  <premis:linkingAgentIdentifierType>URI</premis:linkingAgentIdentifierType>
  <premis:linkingAgentIdentifierValue>http://dataverse.scholarsportal.info/dvn/</premis:linkingAgentIdentifierValue>
</premis:linkingAgentIdentifier>
...
<premis:agentIdentifier>
  <premis:agentIdentifierType>URI</premis:agentIdentifierType>
  <premis:agentIdentifierValue>http://dataverse.scholarsportal.info/dvn/</premis:agentIdentifierValue>
</premis:agentIdentifier>
<premis:agentName>SP Dataverse Network</premis:agentName>
<premis:agentType>organization</premis:agentType>
</pre>

Derivative TAB file
<pre>
<premis:relationship>
  <premis:relationshipType>derivation</premis:relationshipType>
  <premis:relationshipSubType>has source</premis:relationshipSubType>
  <premis:relatedObjectIdentification>
    <premis:relatedObjectIdentifierType>UUID</premis:relatedObjectIdentifierType>
    <premis:relatedObjectIdentifierValue>[SPSS SAV file UUID]</premis:relatedObjectIdentifierValue>
  </premis:relatedObjectIdentification>
</premis:relationship>
</pre>
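
The two relationship blocks mirror each other: the original points to the derivative with "is source of", and the derivative points back with "has source". The sketch below generates such a pair with Python's standard library. It is illustrative only; the helper names are hypothetical, the element names follow the examples on this page, and the PREMIS version 2 namespace URI is an assumption about which PREMIS version is in use.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: PREMIS version 2.
PREMIS_NS = "info:lc/xmlns/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def _el(parent, name, text=None):
    """Create a PREMIS-namespaced element, optionally under a parent."""
    tag = f"{{{PREMIS_NS}}}{name}"
    element = ET.Element(tag) if parent is None else ET.SubElement(parent, tag)
    element.text = text
    return element


def relationship(sub_type, related_uuid):
    """Build one premis:relationship of type 'derivation'."""
    rel = _el(None, "relationship")
    _el(rel, "relationshipType", "derivation")
    _el(rel, "relationshipSubType", sub_type)
    ident = _el(rel, "relatedObjectIdentification")
    _el(ident, "relatedObjectIdentifierType", "UUID")
    _el(ident, "relatedObjectIdentifierValue", related_uuid)
    return rel


def derivation_pair(original_uuid, derivative_uuid):
    """Return (relationship for the original, relationship for the derivative)."""
    return (
        relationship("is source of", derivative_uuid),
        relationship("has source", original_uuid),
    )


if __name__ == "__main__":
    for rel in derivation_pair("[SPSS SAV file UUID]", "[TAB file UUID]"):
        print(ET.tostring(rel, encoding="unicode"))
```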

=== Fixity check for checksums received from Dataverse ===

<pre>
<premis:eventIdentifier>
  <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
  <premis:eventIdentifierValue>[Event UUID assigned by Archivematica]</premis:eventIdentifierValue>
</premis:eventIdentifier>
<premis:eventType>fixity check</premis:eventType>
<premis:eventDateTime>2015-08-21</premis:eventDateTime>
<premis:eventDetail>program="python"; module="hashlib.sha256()"</premis:eventDetail>
<premis:eventOutcomeInformation>
  <premis:eventOutcome>Pass</premis:eventOutcome>
  <premis:eventOutcomeDetail>
    <premis:eventOutcomeDetailNote>Dataverse checksum 91b65277959ec273763d28ef002e83a6b3fba57c7a3[...] verified</premis:eventOutcomeDetailNote>
  </premis:eventOutcomeDetail>
</premis:eventOutcomeInformation>
<premis:linkingAgentIdentifier>
  <premis:linkingAgentIdentifierType>preservation system</premis:linkingAgentIdentifierType>
  <premis:linkingAgentIdentifierValue>Archivematica 1.4.1</premis:linkingAgentIdentifierValue>
</premis:linkingAgentIdentifier>
</pre>

== AIP structure ==

An Archival Information Package derived from a Dataverse ingest will have the same basic structure as a generic Archivematica AIP, described at [[AIP_structure]]. There are additional metadata files that are included in a Dataverse-derived AIP, and each zipped bundle that is included in the ingest will result in a separate directory in the AIP. The following is a sample structure.

'''Bag structure'''

The Archival Information Package (AIP) is packaged in the Library of Congress BagIt format, and may be stored compressed or uncompressed:

<pre>
Pacific_weather_patterns_study-dfb0b75d-6555-4e99-a8d8-95bed0f6303f.7z
├── bag-info.txt
├── bagit.txt
├── manifest-sha512.txt
├── tagmanifest-md5.txt
└── data [standard bag directory containing contents of the AIP]</pre>
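
Because the AIP is a bag, its payload can be checked against manifest-sha512.txt after the AIP has been extracted. The sketch below is a minimal, hypothetical verifier, assuming an uncompressed bag directory and manifest lines of the form "checksum, whitespace, relative path"; it is not a replacement for a full BagIt validator.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, manifest_name="manifest-sha512.txt"):
    """Return the relative paths whose SHA-512 differs from the manifest."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, rel_path = line.split(maxsplit=1)
        # read_bytes() loads the whole file; acceptable for a sketch,
        # chunked reading would be preferable for very large objects.
        actual = hashlib.sha512((bag / rel_path).read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every payload file listed in the manifest matched its recorded checksum.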

'''AIP structure'''

All of the contents of the AIP reside within the data directory:

<pre>
data
├── logs [log files generated during processing]
│   ├── fileFormatIdentification.log
│   └── transfers
│       └── Pacific_weather_patterns_study-1a0f309a-d3ec-43ee-bb48-a868cd5ca85c
│           └── logs
│               ├── extractContents.log
│               ├── fileFormatIdentification.log
│               └── filenameCleanup.log
├── METS.dfb0b75d-6555-4e99-a8d8-95bed0f6303f.xml [the AIP METS file]
└── objects [a directory containing the digital objects being preserved, plus their metadata]
    ├── chelan_052.jpg [an original file from Dataverse]
    ├── Weather_data.sav [an original file from Dataverse]
    ├── Weather_data [a bundle retrieved from Dataverse]
    │   ├── Weather_data.xml
    │   ├── Weather_data.ris
    │   ├── Weather_data-ddi.xml
    │   └── Weather_data.tab [a TAB derivative file generated by Dataverse]
    ├── metadata
    │   └── transfers
    │       └── Pacific_weather_patterns_study-1a0f309a-d3ec-43ee-bb48-a868cd5ca85c
    │           ├── agents.json [see Dataverse#agents.json]
    │           ├── dataset.json [see Dataverse#dataset.json]
    │           └── METS.xml [see Dataverse#Dataverse_METS_file]
    └── submissionDocumentation
        └── transfer-58-1a0f309a-d3ec-43ee-bb48-a868cd5ca85c
            └── METS.xml [the standard Transfer METS file described above]
</pre>
 +
 
 +
'''AIP METS file structure'''
 +
 
 +
The AIP METS file records information about the contents of the AIP and indicates the relationships between the various files in the AIP. A sample AIP METS file would be structured as follows:
 +
 
 +
<pre>
METS header
-Date METS file was created
METS dmdSec [descriptive metadata section]
-DDI XML metadata taken from the METS transfer file, as follows
--ddi:title
--ddi:IDno
--ddi:authEnty
--ddi:distrbtr
--ddi:version
--ddi:restrctn
METS dmdSec [descriptive metadata section]
-link to dataset.json
METS dmdSec [descriptive metadata section]
-link to DDI.XML file created for derivative file as part of bundle
METS amdSec [administrative metadata section, one for each original, derivative and normalized file in the AIP]
-techMD [technical metadata]
--PREMIS technical metadata about a digital object, including file format information and extracted metadata
-digiprovMD [digital provenance metadata]
--PREMIS event: derivation (for derived formats)
-digiprovMD [digital provenance metadata]
--PREMIS event: ingestion
-digiprovMD [digital provenance metadata]
--PREMIS event: unpacking (for bundled files)
-digiprovMD [digital provenance metadata]
--PREMIS event: message digest calculation
-digiprovMD [digital provenance metadata]
--PREMIS event: virus check
-digiprovMD [digital provenance metadata]
--PREMIS event: format identification
-digiprovMD [digital provenance metadata]
--PREMIS event: fixity check (if file comes from Dataverse with a checksum)
-digiprovMD [digital provenance metadata]
--PREMIS event: normalization (if file is normalized to a preservation format during Archivematica processing)
-digiprovMD [digital provenance metadata]
--PREMIS event: creation (if file is a normalized preservation master generated during Archivematica processing)
-digiprovMD
--PREMIS agent: organization
-digiprovMD
--PREMIS agent: software
-digiprovMD
--PREMIS agent: Archivematica user
METS fileSec [file section]
-fileGrp USE="original" [file group]
--original files uploaded to Dataverse
-fileGrp USE="derivative"
--derivative tabular files generated by Dataverse
-fileGrp USE="submissionDocumentation"
--METS.XML (standard Archivematica transfer METS file listing contents of transfer)
-fileGrp USE="preservation"
--normalized preservation masters generated during Archivematica processing
-fileGrp USE="metadata"
--dataset.json
--DDI.XML
--xcitation-endnote.xml
--xcitation-ris.ris
METS structMap [structural map]
-directory structure of the contents of the AIP</pre>
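To inspect how files are distributed across the file groups described in the outline, the fileSec of a METS document can be read with a standard XML parser. This is only an illustrative sketch: the function name <code>files_by_use</code> is hypothetical, and it assumes the usual METS and XLink namespaces with each file located by an <code>xlink:href</code> attribute on <code>mets:FLocat</code>.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}


def files_by_use(mets_xml):
    """Map each fileGrp USE value to the file paths listed under it."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for grp in root.iterfind(".//mets:fileSec/mets:fileGrp", NS):
        # Collect the xlink:href of every FLocat in this file group.
        paths = [
            flocat.get("{%s}href" % NS["xlink"])
            for flocat in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
        groups[grp.get("USE")] = paths
    return groups
```

Calling <code>files_by_use</code> on the text of an AIP METS file would be expected to return keys such as "original", "derivative", "preservation" and "metadata", matching the fileGrp USE values listed above.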
 +
 
 +
== Future Requirements & Considerations ==
 +
This section includes working notes for future phases, as interesting opportunities or questions arise. At the end of the current phase we will be documenting the integration as well as future opportunities.
  
==Overview==
+
===Improvements to Current Functionality===
This wiki captures requirements for ingesting studies (datasets) from Dataverse into Archivematica for long-term preservation.
 
  
==Workflow==
+
* That Dataverse-created zips for derivatives are extracted and the created packages deleted, but the directories maintained [https://github.com/archivematica/Issues/issues/79 issue 79, number 11]. This would preserve something closer to the original file arrangement users see in Dataverse in the final AIP.
*The proposed workflow consists of issuing API calls to Dataverse, receiving content (data files and metadata) for ingest into Archivematica, preparing standard Archivematica Archival Information Packages (AIPs) and placing them in archival storage, and updating the Dataverse study with the AIP UUIDs.  
+
* As above, that user-submitted zips are extracted and/or retained according to the selected processing configuration in Archivematica, and that these are maintained as directories.
*Analysis is based on Dataverse tests using [https://apitest.dataverse.org/ https://apitest.dataverse.org/] and [https://dataverse-demo.iq.harvard.edu/ https://dataverse-demo.iq.harvard.edu/], online documentation at http://guides.dataverse.org/en/latest/api/index.html and discussions with Dataverse developers and users.
+
* Warnings to users that would prevent transfer failures, such as when a user selects a metadata-only dataset, a restricted transfer, or transfer that does not verify compliance due to user not selecting ‘dataverse’ transfer type [https://github.com/archivematica/Issues/issues/79#issuecomment-415114349 issue 79, number 9] and [https://github.com/archivematica/Issues/issues/57 issue 57].
*Proposed integration is for Archivematica 1.5 and higher and Dataverse 4.x.
+
* Download performance improvement [https://github.com/archivematica/Issues/issues/61 issue 61]
 +
* Addition of three additional descriptive metadata fields, the first two of which are required by Dataverse: Description ("abstract" in DDI / "dsDescriptionValue" in json); Subject ("subject" in DDI / " 'typeName': 'subject' " in json); Publication date ("distDate" in DDI / "dateOfDeposit" in json)
  
===Workflow diagram===
+
===New Features===
 +
* A separate micro-service for verifying a Dataverse transfer before the transfer begins would make it easier to identify issues and ensure compliance for Dataverse transfer types.
 +
* Use of AICs or other method to relate datasets that are part of a Dataverse collection.
 +
* Automatic transfer title naming from DOI or other useful name.
 +
* Ability to select and transfer past versions of Datasets.
 +
* Enhancements to the transfer browser pane (search capability, showing versions, etc). [https://github.com/artefactual/archivematica-storage-service/issues/363 issue 363].
 +
* Enabling AIP reingest [https://github.com/archivematica/Issues/issues/107 issue 107]. [https://www.archivematica.org/en/docs/archivematica-1.8/user-manual/ingest/ingest/#reingest AIP reingest] functions are used for changing an aspect of already stored AIPs, such as the opportunity to re-normalize files if policies change, or files in an AIP are found to be at risk of obsolescence. AIP reingest would not apply to changes made externally to the source dataset in Dataverse. Any changes to the source dataset, such as the addition of files or metadata, would require the creation of a new AIP.
 +
* Improving rights information (i.e. mapping to PREMIS rights)
  
[[File:Dataverse-Archivematica_workflow.png|800px|thumb|center]]
+
=== Notes from Feature File review meeting on May 1 2018 (2pm EST) ===
 +
'''Choice & Versioning of Dataverse API:'''
 +
The dataverse Search and Access APIs are not currently versioned.
 +
The Native API is versioned: http://guides.dataverse.org/en/latest/api/native-api.html
 +
There is an OAI-PMH interface (although it is not mentioned in the dataverse API guide). Amber said there were idiosyncrasies in the way dataverse implemented PMH, and wasn’t sure it would be a ‘safe’ option.
 +
Amaz would like to see that we are either using a standard API (like OAI-PMH) or a versioned API.
 +
Amaz wondered whether we could use PMH for the polling part of the solution, but given what Amber said, it doesn’t seem like a good way to go.
 +
So as part of the project we need to see whether we could use the Native API (even if we don’t actually use it), or we need to raise it as an issue to discuss with the dataverse team.  
  
===Workflow diagram notes===
+
'''Relationships between Datasets'''
 +
Amber pointed out that it is not currently clear exactly which datasets should be preserved, and expects this will vary quite a bit by institution.
 +
We discussed the question of whether all datasets in a dataverse would be preserved (not currently known), which brought up the question of how to relate datasets.
 +
We talked about AICs as one possible solution. But agreed that it’s a new feature and needs to be thought through… there could be other solutions than AIC.
  
[1] A new or updated study is one that has been published, either for the first time or as a new version, since the last API call.
+
'''Improving agent info in event history in METS'''
 +
We pointed out that having an agent other than Archivematica in the METS is a new feature.
 +
We discussed the fact that we could make this even more specific by adding more agents, for instance by differentiating the researcher who uploaded files from the research data manager who published the dataset.
  
[2] The json file contains citation and other study-level metadata, an entity_id field that is used to identify the study in Dataverse, version information, a list of data files with their own entity_id values, and md5 checksums for each data file.
+
'''Notes from Dataverse Testing:'''
  
[3] If json file has content_type of tab separated values, Archivematica issues API call for multiple file ("bundled") content download. This returns a zipped package for tsv files containing the .tab file, the original uploaded file, several other derivative formats, a DDI XML file and file citations in Endnote and RIS formats.
+
Should a preserved dataset include an equivalent of fixity check on any UNFs created by Dataverse?
 +
https://dataverse.scholarsportal.info/guides/en/4.8.6/developers/unf/index.html#unf
 +
Universal Numerical Fingerprint (UNF) is a unique signature of the semantic content of a digital object. It is not simply a checksum of a binary data file. Instead, the UNF algorithm approximates and normalizes the data stored within. A cryptographic hash of that normalized (or canonicalized) representation is then computed.
  
[4] Standard and pre-configured micro-services to include: assign UUID, verify checksums, generate checksums, extract packages, scan for viruses, clean up filenames, identify formats, validate formats, extract metadata and normalize for preservation.
+
== See also ==
  
[5] DC metadata parsed for the study only, not for individual data files.
+
* [[Sword API]]
 +
* [[Dataset preservation]]

Latest revision as of 10:28, 8 April 2020


This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

This page sets out the requirements and designs for integration with Dataverse. As of Archivematica v. 1.8, integration with the Dataverse repository as a transfer source type supports the selection and processing of Dataverse research datasets. Archivematica v. 1.9 introduced two fixes to the workflow.

This page was originally created as part of an early proof-of-concept integration project in 2015, which was only made available in a development branch of Archivematica. Phase 2 of this project improved on that original integration work and merged it into a public release of Archivematica (v1.8). This work was sponsored by Scholars Portal, a service of the Ontario Council of University Libraries (OCUL).


Current Status

April 8, 2020

Outstanding issues relating to the representation of tabular derivatives in METS are documented in the Archivematica issues repository with the tag "OCUL: AM-Dataverse"

  • One major outstanding issue relates to changes in Dataverse between version 4.10 and 4.17 that altered the way files are named; this conflicts with the hard-coding of these names in Archivematica's scripts. Issue 1057
  • A second, related issue documents the treatment of RData derivatives, which are still causing conflicts with Archivematica's regular 'extract packages' workflows. Issue 1058

March 19, 2019

The integration has been released as part of Archivematica version 1.8. Version 1.9 (released March 6, 2019) integrated the following fixes:

  • Multiple authors are not captured in the Dataverse METS - only the first author listed is. Issue 278
  • It is not possible to delete packages after extraction using the Dataverse transfer type if the package contains derivatives. Issue 269

This screencast provides a demonstration of the current implementation.

OCUL/Scholars Portal is currently hosting a demonstration sandbox for interested users to test the integration. Please visit the sandbox page on the OCUL Confluence site for information on how to access it. You can read more about the project and its outcomes in Meghan Goodchild and Grant Hurley's 2019 iPres paper and presentation slides here: https://osf.io/wqbvy/.

Overview of Dataverse to Archivematica Integration

Setting up the Integration

In order to set up and use the integration, users should consult Archivematica’s documentation, particularly the Dataverse section of the Archivematica Storage Service documentation for setup and the section on Dataverse transfers. If users are unfamiliar with Archivematica, they should consult Archivematica’s overview and quick start documentation. If users are unfamiliar with Dataverse, they should consult Dataverse's guides. Users will need access to both an appropriately provisioned Archivematica instance, and to an installation of Dataverse for which they have an account.

Scope of the Integration

As per the details on feature files below, the integration was designed with the following scope of use:

  • The current integration presumes a user who has an account with a Dataverse instance and has generated an associated API key, and the same (or a different, authorized) user who has access to an Archivematica instance and storage service that is connected to that Dataverse via the API key. You can read more in the Dataverse documentation under "Root Dataverse Permissions" about users, admins and superuser categories that might impact access to Dataverse datasets via the API.
  • It is assumed the user has obtained the necessary rights to process and store dataset files in Dataverse for preservation and has appropriate access to the dataset and/or associated files based on the rights related to their Dataverse API key (see above).
  • It is assumed that the preserver is interested in selecting specific datasets in a Dataverse for preservation. SIPs and their resulting AIPs are created from current versions of Dataverse datasets with one or more associated files in that dataset. A dataset is therefore equivalent to a SIP. Individual files cannot be selected for preservation, nor can older versions of files. However, users may make use of Archivematica’s Appraisal functions to select individual files in a particular dataset to create a final AIP.
  • At present, a function to automate the ingest of all datasets in a Dataverse has not been developed.

Feature Files

On this project we are using Gherkin feature files to define the desired behaviour of preserving a dataset from a Dataverse. Feature files are also known as Acceptance Tests, because they specify the behaviour that we will test at the end of the project. The draft versions & comments are documented in this feature file.

Feature: Preserve a Dataverse dataset

 Alma is an Archivematica user 
 And they want to preserve a dataset published in a Dataverse
   Definitions  
   Dataverse Dataset: A dataset that has been published in a Dataverse, including all 
   original files uploaded to dataverse, and any derivative files created by Dataverse.  
   Dataverse METS: A metadata file using the METS standard that describes a dataset; 
   including descriptive metadata, list of all objects in the dataset, their structure 
   and relationships to each other. 
 Scenario: Manual Selection of Dataset
   Given the Storage Service is configured to connect to a Dataverse Repository 
     And the dataset has been published in Dataverse 
 When the user selects the transfer type “Dataverse” 
   And the user selects the dataset to be preserved  
   And the user enters the <Transfer Name>
   And the user enters the (optional) <Accession number> 
   And the user clicks the “Start Transfer” button
 Then Archivematica copies the files from Dataverse to a local processing directory   
   And the Approve Transfer microservice asks the user to approve the transfer
   And the user selects yes 
   And the Verify Transfer Compliance microservice creates the Dataverse METS
   And the Dataverse metadata files are generated and included in a metadata directory 
   And the Verify Transfer Compliance microservice confirms this is a valid Dataverse Transfer
   And the Verify Transfer Checksums microservice confirms the checksums provided by dataverse match those generated for each file in the dataset
   And the AIP METS file includes the Dataverse-generated events
   And the completed AIP is stored in the specified Dataverse storage location

Dataverse Workflow

Dataverse Workflow overview.png


1) User Selects Dataset

When the Storage Service is configured to connect to Dataverse, the Transfer Browser in the Dashboard will display a list of all Dataverse Transfer Source Locations. Transfer Source Locations can be configured to filter on search terms, or on a particular dataverse. See (TODO - add link to SS documentation). Users can browse through the available datasets, select one and set the Transfer type to Dataverse.

2) Storage Service Retrieves Dataset

The Storage Service uses the Dataverse API to retrieve the selected dataset. API credentials are stored in the Storage Service Space.

3) Prepare Transfer

Archivematica creates a metadata file called agents.json that includes the agent information configured in the storage service. This information is used to populate the PREMIS agent details in the METS files. See Dataverse#agents.json for more details.

When a dataset includes a "bundle" of related files for tabular data, it is provided as a .zip file. Archivematica extracts all of the files in bundles at this stage. Other .zip files are not affected, and can be extracted or not using the standard processing configuration options. See TO DO - ADD LINK TO dataset section

4) Transfer & Ingest

Archivematica performs transfer and ingest processes using the standard processing configuration options. Additional processing for Dataverse datasets includes:

  • creating a Dataverse METS that describes the dataset as provided by Dataverse
  • fixity check of files using checksums provided by Dataverse
  • including Dataverse metadata (from the Dataverse METS) in the final AIP METS

5) Store the AIP

The AIP is stored in whatever location has been configured. Scholars Portal intends to store its AIPs in an S3 location (which is a standard configuration option as of Storage Service version 0.12).

Packages-related Workflows

User-submitted packages

It is common for Dataverse users to “double-zip” files when uploading files to datasets. This is the practice of packaging files and then packaging them a second time. Dataverse always unpacks submitted packages, but if users double-zip, they can save the labour of uploading many files one-by-one. Archivematica users may choose whether they wish to have these packages extracted and/or deleted afterward by setting the corresponding processing configuration.

Dataverse-created derivative bundles

A second set of packages is created by Dataverse in the form of derivative bundles. Derivatives are copies of files in tabular format that Dataverse creates from user-submitted files. Dataverse delivers these packages to Archivematica as zip packages. See Bundles for tabular data files below for more details, and Dataverse’s guide on tabular ingest for additional documentation. These packages are always extracted by Archivematica. Setting the processing configuration to not extract packages will not function for this type of transfer.

Known Issues Impacting Transfers

The following table summarizes known issues that impact the success of individual transfers. For a full list of known issues, consult the Waffle board for this feature.

{| class="wikitable"
! Issue !! Description !! Failure Step !! Message (last line)
|-
| Dataset has no files
| Datasets that do not contain files (i.e., metadata only) will result in a failed transfer.
| Verify transfer compliance: Convert Dataverse Structure
| ConvertDataverseError: Error adding Dataset files to METS
|-
| Dataset has files with blank checksum values
| A failed transfer will result if the dataset has files with blank checksum values (a known issue for certain types of files that were deposited in Dataverse v3.6 or earlier). A user may work around this issue by selecting the “Standard” transfer type and processing the transfer as usual. However, the METS file will not contain descriptive metadata and the Dataverse checksums will not be validated. Administrators may wish to troubleshoot blank checksums in their Dataverse instances to fix this issue.
| Verify transfer compliance: Convert Dataverse Structure
| ValueError: Must provide both checksum and checksumtype, or neither. Provided values: and MD5
|-
| Dataset that has files which failed during the Dataverse tabular ingest upload
| A failed transfer will result if there are files which have failed the tabular ingest upload process in Dataverse (e.g., results in missing .RData and derivative .tab file).
| Parse External Files: Parse Dataverse METS XML
| ParseDataverseError: Exiting. Returning the database objects for our Dataverse files has failed.
|-
| Dataset has derivative packages and “Delete packages after extraction” is set in the processing configuration
| If the user is running Archivematica version 1.8, and the transfer contains derivative files (i.e., files that have been uploaded through the tabular ingest process in Dataverse) and the option “delete packages after the extraction” is selected in the Archivematica processing configuration, the transfer will fail. This is because the .RData files contained as part of derivative files are themselves packages and will be deleted. This issue was fixed in Archivematica 1.9. The work-around if running Archivematica 1.8 is to select ‘no’ as the option in the processing configuration.
| Parse External Files: Parse Dataverse METS XML
| IntegrityError: (1048, "Column 'eventOutcomeDetailNote' cannot be null")
|-
| User attempts to process dataset with restricted files
| Permissions to process datasets through Archivematica correspond to role permissions associated with a Dataverse via an API token. Therefore, restricted files must be processed using an administrator or superuser API token for any restricted datasets that are selected for transfer. Otherwise, processing of these datasets will fail.
| Parse External Files: Parse Dataverse METS XML
| ParseDataverseError: Exiting. Returning the database objects for our Dataverse files has failed.
|-
| User does not select “Dataverse” transfer type
| When processing a Dataverse dataset, users must select the “Dataverse” transfer type from the drop-down menu when initiating the transfer. If a “standard” transfer type is selected, the dataset may be processed without descriptive metadata. If another transfer type is selected, the transfer will fail. Note: Dataset Terms of Use may exist for restricted files; in these cases it is expected that Terms of Use are respected by the person(s) processing the files in Archivematica. License information for datasets with restricted files is not currently mapped to METS.
| Various
| Various
|}

Dataverse Datasets

Dataverse datasets as delivered to Archivematica contain the following:

  • The original user-submitted files.
  • The agents.json and dataset.json metadata files that describe the files.
  • If the user submitted tabular data, a set of derivatives of the original tabular files in several formats, alongside metadata files describing the tabular files. See the Dataverse documentation for more information on tabular ingest.

Dataset Metadata file - dataset.json

This file is provided by Dataverse. It contains citation and other study-level metadata, an entity_id field that is used to identify the study in Dataverse, version information, a list of data files with their own entity_id values, and md5 checksums for each (original) data file. It does not currently provide checksums for derivatives or metadata files created by Dataverse.
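The checksum comparison that these md5 values make possible can be sketched as follows. This is an illustration only and not the Archivematica "Verify Transfer Checksums" implementation: the function name <code>check_md5</code> and the flat list of <code>{"filename", "md5"}</code> records are hypothetical simplifications of the per-file entries in dataset.json. Blank checksum values are reported separately, since they cannot be verified.

```python
import hashlib
from pathlib import Path


def check_md5(entries, objects_dir):
    """Compare recorded md5 values with files on disk.

    `entries` is a list of {"filename": ..., "md5": ...} dicts, a simplified,
    hypothetical stand-in for the per-file records in dataset.json.
    Returns a dict of filename -> "ok", "mismatch" or "no checksum".
    """
    results = {}
    for entry in entries:
        name, recorded = entry["filename"], entry.get("md5")
        if not recorded:
            # Blank values cannot be verified; report rather than guess.
            results[name] = "no checksum"
            continue
        actual = hashlib.md5(Path(objects_dir, name).read_bytes()).hexdigest()
        results[name] = "ok" if actual == recorded.lower() else "mismatch"
    return results
```

Reporting "no checksum" instead of raising an error is a design choice for this sketch; it lets a reviewer see every affected file in one pass.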


Agents Metadata file - agents.json

This file is created by Archivematica. It includes the Agent information that is entered into the Storage Service when configuring a Dataverse Location. To do: add link to final docs once they are updated.


Bundles for tabular data files

When Dataverse ingests some forms of tabular data, it creates derivatives of the original data file and additional metadata files. All of these files are provided in a bundle as a zipped package, containing:

  • The original file uploaded by the user;
  • Different derivative (alternative) formats of the original file (e.g. tab-delimited file, R data file)
  • Variable Metadata (as a DDI Codebook XML file);
  • Data File Citation (currently in either RIS or EndNote XML format);

TO DO - update notes on how bundles are retrieved. the original version of this documentation included these notes which need to be updated / corrected:

[4] If json file has content_type of tab separated values, Archivematica issues API call for multiple file ("bundled") content download. This returns a zipped package for tsv files containing the .tab file, the original uploaded file, several other derivative formats, a DDI XML file and file citations in Endnote and RIS formats.


Dataverse METS file

Archivematica generates a Dataverse METS file that describes the contents of the dataset as retrieved from Dataverse. The Dataverse METS includes:

  • descriptive metadata about the dataset, mapped to the DDI standard
  • a <mets:fileSec> section that lists all files provided, grouped by type (original, metadata or derivative)
  • a <mets:structMap> section that describes the structure of the files as provided by Dataverse (particularly helpful for understanding which files were provided in 'bundles')

The Dataverse METS is found in the final AIP in this location: <AIP Name>/data/objects/metadata/transfers/<transfer name>/METS.xml (This is also where you will find the dataset.json metadata file provided by Dataverse, and the agents.json metadata file created by Archivematica).

Sample Dataverse METS file

Original Dataverse study retrieved through API call:

  • dataset.json (a JSON file generated by Dataverse consisting of study-level metadata and information about data files)
  • Study_info.pdf (a non-tabular data file)
  • A zipped bundle consisting of the following:
    • YVR_weather_data.sav (an SPSS SAV file uploaded by the researcher)
    • YVR_weather_data.tab (a TAB file generated from the SPSS SAV file by Dataverse)
    • YVR_weather_data.RData (an R file generated from the SPSS SAV file by Dataverse)
    • YVR_weather_data-ddi.xml, YVR_weather_datacitation-endnote.xml, and YVR_weather_datacitation-ris.ris (three metadata files generated for the TAB file by Dataverse)


Resulting Dataverse METS file

  • The fileSec in the METS file consists of three file groups, USE="original" (the PDF and SAV files); USE="derivative" (the TAB and R files); and USE="metadata" (the JSON file and the three metadata files from the zipped bundle).
  • All of the files unpacked from the Dataverse bundle have a GROUPID attribute to indicate the relationship between them. If the transfer had consisted of more than one bundle, each set of unpacked files would have its own GROUPID.
  • Three dmdSecs have been generated:
    • dmdSec_1, consisting of a small number of study-level DDI terms
    • dmdSec_2, consisting of an mdRef to the JSON file
    • dmdSec_3, consisting of an mdRef to the DDI XML file
  • In the structMap, dmdSec_1 and dmdSec_2 are linked to the study as a whole, while dmdSec_3 is linked to the TAB file. The endnote and ris files have not been made into dmdSecs because they contain small subsets of metadata which are already captured in dmdSec_1 and the DDI xml file.


METS1G.png
METS2G.png
METS3G.png


Metadata sources for METS file

The table below shows how elements in the METS files are populated from metadata or files provided with Dataverse datasets.

More metadata from Dataverse could be mapped into the METS files. Scholars Portal would like to see more metadata in the AIP to enable better indexing, search and discovery of datasets. To show which fields could be used, we took a version of the Dataverse metadata crosswalk and created our own version that includes Archivematica. The Dataverse 4.0+ to Archivematica Metadata Crosswalk provides the same details as the table below but also highlights additional fields that should ultimately be mapped into METS.

Note that if a user enters descriptive metadata via the Archivematica interface during the transfer process (by going to the transfer report pane > Metadata > Add), the addition of this new metadata will overwrite any imported DDI metadata from Dataverse in the final Archivematica METS file.


{| class="wikitable"
! METS element !! Information source !! Notes
|-
| ddi:titl || json: citation/typeName: "title", value: [value] ||
|-
| ddi:IDNo || json: authority, identifier || json example: "authority": "10.5072/FK2/", "identifier": "0MOPJM"
|-
| ddi:IDNo agency attribute || json: protocol || json example: "protocol": "doi"
|-
| ddi:AuthEntity || json: citation/typeName: "authorName" ||
|-
| ddi:distrbtr || json: "publisher": "Root Dataverse" ||
|-
| ddi:version date attribute || json: "releaseTime" ||
|-
| ddi:version type attribute || json: "versionState" ||
|-
| ddi:version || json: "versionNumber", "versionMinorNumber" ||
|-
| ddi:restrctn || json: "termsOfUse" ||
|-
| fileGrp USE="original" || json: datafile || Each non-tabular data file is listed as a datafile in the files section. Each TAB file derived by Dataverse for uploaded tabular file formats is also listed as a datafile, with the original file uploaded by the researcher indicated by "originalFileFormat".
|-
| fileGrp USE="derivative" || || All files that are included in a bundle, except for the original file and the metadata files (see below).
|-
| fileGrp USE="metadata" || || Any files with .json or .ris extension, any -ddi.xml files and -endnote.xml files
|-
| CHECKSUM || json: datafile/"md5": [value] ||
|-
| CHECKSUMTYPE || json: datafile/"md5" ||
|-
| GROUPID || Generated by ingest tool. || Each file unpacked from a bundle is given the same group id.
|}
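The fileGrp rules in the table can be expressed as a small classification function. This is a sketch of the stated rules only, not the code Archivematica runs; the function name <code>classify_bundle_file</code> and its arguments are hypothetical, and it assumes the name of the originally uploaded file is already known.

```python
def classify_bundle_file(filename, original_filename):
    """Return the fileGrp USE value for one file unpacked from a bundle.

    Hypothetical helper following the table: the uploaded file is
    "original"; .json, .ris, -ddi.xml and -endnote.xml files are
    "metadata"; everything else in the bundle is "derivative".
    """
    if filename == original_filename:
        return "original"
    lower = filename.lower()
    if lower.endswith((".json", ".ris", "-ddi.xml", "-endnote.xml")):
        return "metadata"
    return "derivative"
```

Applied to the sample bundle above, the SAV file would be classified as original, the TAB and RData files as derivative, and the DDI, EndNote and RIS files as metadata.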


Transfer METS file

During transfer processing, a Transfer METS file is created. This is found in the final AIP in this location: <AIP Name>/data/objects/submissionDocumentation/<transfer name>/METS.xml

This is an existing (standard) process that hasn't been changed in this project.

AIP METS file

Basic METS file structure

The Archival Information Package (AIP) METS file will follow the basic structure for a standard Archivematica AIP METS file described at METS. A new fileGrp USE="derivative" will be added to indicate TAB, RData and other derivatives generated by Dataverse for uploaded tabular data format files.

dmdSecs in AIP METS file

The dmdSecs in the Dataverse METS file will be copied over to the AIP METS file.

Additions to PREMIS for derivative files

In the PREMIS Object entity, relationships between original and derivative tabular format files from Dataverse will be described using PREMIS relationship semantic units. A PREMIS derivation event will be added to indicate the derivative file was generated from the original file, and a Dataverse Agent will be added to indicate the Event was carried out by Dataverse prior to ingest, rather than by Archivematica.

Note: We originally considered adding a creation event for the derivative files as well, but decided that it's not necessary as the event can be inferred from the derivation event and the PREMIS object relationships.

Note: "Derivation" is not an event type on the Library of Congress controlled vocabulary list at http://id.loc.gov/vocabulary/preservation/eventType.html. However, we have submitted it as a proposed new term (November 2015) at http://premisimplementers.pbworks.com/w/page/102413902/Preservation%20Events%20Controlled%20Vocabulary - a list of new terms that is being considered by the PREMIS Editorial Committee.

Update April 2018: The most recently available Event Type Controlled List (June 2017) does not yet have derivation as a controlled type, https://www.loc.gov/standards/premis/v3/preservation-events.pdf

Example:

Original SPSS SAV file

 
<premis:relationship>
  <premis:relationshipType>derivation</premis:relationshipType>
  <premis:relationshipSubType>is source of</premis:relationshipSubType>
  <premis:relatedObjectIdentification>
    <premis:relatedObjectIdentifierType>UUID</premis:relatedObjectIdentifierType>
    <premis:relatedObjectIdentifierValue>[TAB file UUID]</premis:relatedObjectIdentifierValue>
  </premis:relatedObjectIdentification>
</premis:relationship>
...
<premis:eventIdentifier>
  <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
  <premis:eventIdentifierValue>[Event UUID assigned by Archivematica]</premis:eventIdentifierValue>
</premis:eventIdentifier>
<premis:eventType>derivation</premis:eventType>
<premis:eventDateTime>2015-08-21</premis:eventDateTime>
<premis:linkingAgentIdentifier>
  <premis:linkingAgentIdentifierType>URI</premis:linkingAgentIdentifierType>
  <premis:linkingAgentIdentifierValue>http://dataverse.scholarsportal.info/dvn/</premis:linkingAgentIdentifierValue>
</premis:linkingAgentIdentifier>
...
<premis:agentIdentifier>
  <premis:agentIdentifierType>URI</premis:agentIdentifierType>
  <premis:agentIdentifierValue>http://dataverse.scholarsportal.info/dvn/</premis:agentIdentifierValue>
</premis:agentIdentifier>
<premis:agentName>SP Dataverse Network</premis:agentName>
<premis:agentType>organization</premis:agentType>

Derivative TAB file

 
<premis:relationship>
  <premis:relationshipType>derivation</premis:relationshipType>
  <premis:relationshipSubType>has source</premis:relationshipSubType>
  <premis:relatedObjectIdentification>
    <premis:relatedObjectIdentifierType>UUID</premis:relatedObjectIdentifierType>
    <premis:relatedObjectIdentifierValue>[SPSS SAV file UUID]</premis:relatedObjectIdentifierValue>
  </premis:relatedObjectIdentification>
</premis:relationship>
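
The paired relationships in the two examples above could be generated along the following lines. This is a minimal sketch rather than the code Archivematica uses: the function names are hypothetical, and the PREMIS 2 namespace URI is an assumption, since this page does not state which PREMIS version the examples follow.

```python
import xml.etree.ElementTree as ET

# Assumption: PREMIS 2.x namespace; the page does not state the PREMIS version.
PREMIS_NS = "info:lc/xmlschema/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def _tag(name):
    """Return the namespace-qualified tag name for a PREMIS element."""
    return f"{{{PREMIS_NS}}}{name}"


def derivation_relationship(sub_type, related_uuid):
    """Build one premis:relationship element of type "derivation" (hypothetical helper)."""
    relationship = ET.Element(_tag("relationship"))
    ET.SubElement(relationship, _tag("relationshipType")).text = "derivation"
    ET.SubElement(relationship, _tag("relationshipSubType")).text = sub_type
    identification = ET.SubElement(relationship, _tag("relatedObjectIdentification"))
    ET.SubElement(identification, _tag("relatedObjectIdentifierType")).text = "UUID"
    ET.SubElement(identification, _tag("relatedObjectIdentifierValue")).text = related_uuid
    return relationship


def link_original_and_derivative(original_uuid, derivative_uuid):
    """Return (relationship for the original file, relationship for the derivative file)."""
    return (
        derivation_relationship("is source of", derivative_uuid),
        derivation_relationship("has source", original_uuid),
    )
```

Each element points at the other file's UUID, so the original carries "is source of" and the derivative carries "has source". `ET.tostring(element, encoding="unicode")` serializes either element for inspection.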

Fixity check for checksums received from Dataverse

<premis:eventIdentifier>
  <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
  <premis:eventIdentifierValue>[Event UUID assigned by Archivematica]</premis:eventIdentifierValue>
</premis:eventIdentifier>
<premis:eventType>fixity check</premis:eventType>
<premis:eventDateTime>2015-08-21</premis:eventDateTime>
<premis:eventDetail>program="python"; module="hashlib.sha256()"</premis:eventDetail>
<premis:eventOutcomeInformation>
  <premis:eventOutcome>Pass</premis:eventOutcome>
  <premis:eventOutcomeDetail>
    <premis:eventOutcomeDetailNote>Dataverse checksum 91b65277959ec273763d28ef002e83a6b3fba57c7a3[...] verified</premis:eventOutcomeDetailNote>
  </premis:eventOutcomeDetail>
</premis:eventOutcomeInformation>
<premis:linkingAgentIdentifier>
  <premis:linkingAgentIdentifierType>preservation system</premis:linkingAgentIdentifierType>
  <premis:linkingAgentIdentifierValue>Archivematica 1.4.1</premis:linkingAgentIdentifierValue>
</premis:linkingAgentIdentifier>
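
The comparison recorded by the fixity check event above can be sketched as follows. The eventDetail names Python's hashlib, but the function name, its parameters and the chunked reading are illustrative assumptions, not Archivematica's actual implementation.

```python
import hashlib


def verify_checksum(path, expected_hex, algorithm="sha256", chunk_size=1024 * 1024):
    """Return True if the file's digest matches the checksum received from Dataverse.

    Hypothetical helper: `algorithm` is any name accepted by hashlib.new(),
    for example "md5" or "sha256".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so that large data files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

A result of True corresponds to the "Pass" eventOutcome shown above. The algorithm is left as a parameter because the checksum type depends on what Dataverse supplies for the file.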


AIP structure

An Archival Information Package derived from a Dataverse ingest will have the same basic structure as a generic Archivematica AIP, described at AIP_structure. There are additional metadata files that are included in a Dataverse-derived AIP, and each zipped bundle that is included in the ingest will result in a separate directory in the AIP. The following is a sample structure.

Bag structure

The Archival Information Package (AIP) is packaged in the Library of Congress BagIt format, and may be stored compressed or uncompressed:

Pacific_weather_patterns_study-dfb0b75d-6555-4e99-a8d8-95bed0f6303f.7z
├── bag-info.txt
├── bagit.txt 
├── manifest-sha512.txt
├── tagmanifest-md5.txt
└── data [standard bag directory containing contents of the AIP]
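
As an illustration of how the payload manifest in this bag can be used, the sketch below re-computes the digests listed in manifest-sha512.txt. It assumes the AIP has already been extracted to a directory, and that each manifest line has the usual BagIt form of a checksum followed by whitespace and a relative path. The function name is hypothetical, and percent-encoded paths are not handled.

```python
import hashlib
from pathlib import Path


def verify_bag_manifest(bag_dir, manifest_name="manifest-sha512.txt", algorithm="sha512"):
    """Return the payload paths whose digest does not match the manifest (hypothetical helper)."""
    bag_dir = Path(bag_dir)
    failures = []
    manifest_text = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each entry: "<hex digest><whitespace><path relative to the bag root>"
        expected, relative_path = line.split(maxsplit=1)
        # Reads each file fully into memory; acceptable for a sketch, not for very large files.
        payload = (bag_dir / relative_path).read_bytes()
        actual = hashlib.new(algorithm, payload).hexdigest()
        if actual != expected.lower():
            failures.append(relative_path)
    return failures
```

An empty list means every file listed in the manifest matched. A missing payload file raises FileNotFoundError rather than being reported as a mismatch.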

AIP structure

All of the contents of the AIP reside within the data directory:


├── data
│   ├── logs [log files generated during processing]
│   │   ├── fileFormatIdentification.log
│   │   └── transfers
│   │       └── Pacific_weather_patterns_study-1a0f309a-d3ec-43ee-bb48-a868cd5ca85c
│   │           └── logs
│   │               ├── extractContents.log
│   │               ├── fileFormatIdentification.log
│   │               └── filenameCleanup.log
│   ├── METS.dfb0b75d-6555-4e99-a8d8-95bed0f6303f.xml [the AIP METS file]
│   └── objects [a directory containing the digital objects being preserved, plus their metadata]
│       ├── chelan_052.jpg [an original file from Dataverse]
│       ├── Weather_data.sav [an original file from Dataverse]
│       ├── Weather_data [a bundle retrieved from Dataverse]
│       │   ├── Weather_data.xml
│       │   ├── Weather_data.ris
│       │   ├── Weather_data-ddi.xml
│       │   └── Weather_data.tab [a TAB derivative file generated by Dataverse]
│       ├── metadata
│       │   └── transfers
│       │       └── Pacific_weather_patterns_study-1a0f309a-d3ec-43ee-bb48-a868cd5ca85c
│       │           ├── agents.json [see Dataverse#agents.json] 
│       │           ├── dataset.json [see Dataverse#dataverse.json] 
│       │           └── METS.xml [see Dataverse#Dataverse_METS_file]
│       └── submissionDocumentation
│           └── transfer-58-1a0f309a-d3ec-43ee-bb48-a868cd5ca85c
│               └── METS.xml [the standard Transfer METS file described above]

AIP METS file structure

The AIP METS file records information about the contents of the AIP, and indicates the relationships between the various files in the AIP. A sample AIP METS file would be structured as follows:

METS header
-Date METS file was created
METS dmdSec [descriptive metadata section]
-DDI XML metadata taken from the METS transfer file, as follows
--ddi:title
--ddi:IDno
--ddi:authEnty
--ddi:distrbtr
--ddi:version
--ddi:restrctn
METS dmdSec [descriptive metadata section]
-link to dataset.json
METS dmdSec [descriptive metadata section]
-link to DDI.XML file created for derivative file as part of bundle
METS amdSec [administrative metadata section, one for each original, derivative and normalized file in the AIP]
-techMD [technical metadata]
--PREMIS technical metadata about a digital object, including file format information and extracted metadata
-digiprovMD [digital provenance metadata]
--PREMIS event: derivation (for derived formats)
-digiprovMD [digital provenance metadata]
--PREMIS event: ingestion
-digiprovMD [digital provenance metadata]
--PREMIS event: unpacking (for bundled files)
-digiprovMD [digital provenance metadata]
--PREMIS event: message digest calculation
-digiprovMD [digital provenance metadata]
--PREMIS event: virus check
-digiprovMD [digital provenance metadata]
--PREMIS event: format identification
-digiprovMD [digital provenance metadata]
--PREMIS event: fixity check (if file comes from Dataverse with a checksum)
-digiprovMD [digital provenance metadata]
--PREMIS event: normalization (if file is normalized to a preservation format during Archivematica processing)
-digiprovMD [digital provenance metadata]
--PREMIS event: creation (if file is a normalized preservation master generated during Archivematica processing)
-digiprovMD
--PREMIS agent: organization
-digiprovMD
--PREMIS agent: software
-digiprovMD
--PREMIS agent: Archivematica user
METS fileSec [file section]
-fileGrp USE="original" [file group]
--original files uploaded to Dataverse
-fileGrp USE="derivative"
--derivative tabular files generated by Dataverse
-fileGrp USE="submissionDocumentation"
--METS.XML (standard Archivematica transfer METS file listing contents of transfer)
-fileGrp USE="preservation"
--normalized preservation masters generated during Archivematica processing
-fileGrp USE="metadata"
--dataset.json
--DDI.XML
--xcitation-endnote.xml
--xcitation-ris.ris
METS structMap [structural map]
-directory structure of the contents of the AIP
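
To show how the fileSec groups in this outline might be read back from an AIP METS file, here is a minimal sketch using Python's standard library. The METS and XLink namespace URIs are the standard ones; the function name, and the assumption that each file element has an FLocat child with an xlink:href attribute, are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def files_by_use(mets_source):
    """Map each fileGrp USE value to the file paths it lists (hypothetical helper).

    `mets_source` is a file name or a file object, as accepted by ET.parse().
    """
    root = ET.parse(mets_source).getroot()
    grouped = defaultdict(list)
    for group in root.iterfind(".//mets:fileSec/mets:fileGrp", NS):
        use = group.get("USE", "")
        # Assumption: each mets:file has an FLocat child carrying xlink:href.
        for flocat in group.iterfind("mets:file/mets:FLocat", NS):
            grouped[use].append(flocat.get(XLINK_HREF))
    return dict(grouped)
```

For a Dataverse-derived AIP, the "derivative" key of the result would list the tabular derivatives generated by Dataverse, separately from the "original" uploads.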

Future Requirements & Considerations

This section includes working notes for future phases, as interesting opportunities or questions arise. At the end of the current phase we will be documenting the integration as well as future opportunities.

Improvements to Current Functionality

  • Extract Dataverse-created zips for derivatives and delete the zip packages, but maintain the extracted directories (issue 79, number 11). This would preserve, in the final AIP, something closer to the original file arrangement users see in Dataverse.
  • Similarly, extract and/or retain user-submitted zips according to the selected processing configuration in Archivematica, and maintain these as directories.
  • Warn users about conditions that would otherwise cause a transfer to fail, such as selecting a metadata-only dataset or a restricted transfer, or a transfer that does not verify compliance because the user did not select the ‘dataverse’ transfer type (issue 79, number 9, and issue 57).
  • Improve download performance (issue 61).
  • Add three descriptive metadata fields, the first two of which are required by Dataverse: Description ("abstract" in DDI / "dsDescriptionValue" in json); Subject ("subject" in DDI / " 'typeName': 'subject' " in json); Publication date ("distDate" in DDI / "dateOfDeposit" in json).

New Features

  • A separate micro-service for verifying a Dataverse transfer before the transfer begins would make it easier to identify issues and ensure compliance for Dataverse transfer types.
  • Use of AICs or other method to relate datasets that are part of a Dataverse collection.
  • Automatic transfer title naming from DOI or other useful name.
  • Ability to select and transfer past versions of Datasets.
  • Enhancements to the transfer browser pane, such as search capability and showing versions (issue 363).
  • Enabling AIP reingest (issue 107). AIP reingest functions are used for changing an aspect of already stored AIPs, such as re-normalizing files if policies change or if files in an AIP are found to be at risk of obsolescence. AIP reingest would not apply to changes made externally to the source dataset in Dataverse. Any changes to the source dataset, such as the addition of files or metadata, would require the creation of a new AIP.
  • Improving rights information (i.e. mapping to PREMIS rights)

Notes from Feature File review meeting on May 1 2018 (2pm EST)

Choice & Versioning of Dataverse API: The Dataverse Search and Access APIs are not currently versioned. The Native API is versioned (http://guides.dataverse.org/en/latest/api/native-api.html). There is an OAI-PMH interface, although it is not mentioned in the Dataverse API guide. Amber said there were idiosyncrasies in the way Dataverse implemented OAI-PMH, and wasn’t sure it would be a ‘safe’ option. Amaz would like to see that we are using either a standard API (like OAI-PMH) or a versioned API. Amaz wondered whether we could use OAI-PMH for the polling part of the solution, but given what Amber said, it doesn’t seem like a good way to go. So as part of the project we need to see whether we could use the Native API (even if we don’t actually use it), or we need to raise it as an issue to discuss with the Dataverse team.

Relationships between Datasets: Amber pointed out that they are not currently clear on exactly which datasets should be preserved, and expects this will vary quite a bit by institution. We discussed whether all datasets in a dataverse would be preserved (not currently known), which brought up the question of how to relate datasets. We talked about AICs as one possible solution, but agreed that it’s a new feature and needs to be thought through; there could be solutions other than AICs.

Improving agent info in event history in METS: We pointed out that having an agent other than Archivematica in the METS is a new feature. We discussed the fact that we could make this even more specific by adding more agents, for instance by differentiating between the researcher who uploaded files and the research data manager who published the dataset.

Notes from Dataverse Testing:

Should a preserved dataset include an equivalent of a fixity check on any UNFs created by Dataverse? See https://dataverse.scholarsportal.info/guides/en/4.8.6/developers/unf/index.html#unf. A Universal Numerical Fingerprint (UNF) is a unique signature of the semantic content of a digital object. It is not simply a checksum of a binary data file. Instead, the UNF algorithm approximates and normalizes the data stored within. A cryptographic hash of that normalized (or canonicalized) representation is then computed.
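
The difference between a file checksum and a fingerprint of normalized content can be illustrated with the toy sketch below. It is explicitly not the UNF algorithm: the rounding rule, the text rendering and the function name are all assumptions made for illustration, and its output will not match any UNF produced by Dataverse.

```python
import hashlib


def toy_fingerprint(values, digits=7):
    """Hash a normalized rendering of numeric values (illustration only, NOT real UNF)."""
    # Normalize: render every value with the same number of significant digits,
    # so that "1.50" and 1.5 produce the same text before hashing.
    normalized = "\n".join(f"{float(value):.{digits - 1}e}" for value in values)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The same numbers stored in two different serializations:
as_tab = b"1.50\t2"
as_csv = b"1.5,2.0"

# A plain checksum of the bytes differs between the two files...
checksums_differ = hashlib.sha256(as_tab).hexdigest() != hashlib.sha256(as_csv).hexdigest()
# ...while the fingerprint of the normalized values is identical.
fingerprints_match = toy_fingerprint(["1.50", "2"]) == toy_fingerprint([1.5, 2.0])
```

This is why a UNF check would complement, rather than duplicate, the file-level fixity check: it would confirm that the data values are unchanged even when the file format is not.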

See also