Difference between revisions of "DSpace exports"

From Archivematica
 
(26 intermediate revisions by one other user not shown)
Line 1: Line 1:
 
[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > DSpace exports
 
  
This page analyzes the structure of DSpace exports from an uncustomized (i.e. out of the box) DSpace installation.
+
<div style="padding: 10px 10px; border: 1px solid black; background-color: #F79086;">This page is no longer being maintained and may contain inaccurate information. Please see the [https://www.archivematica.org/docs/latest/ Archivematica documentation] for up-to-date information. </div> <p>
  
= Collection export =
+
=Analysis=
Used the following command (from DSpace [http://www.dspace.org/1_7_1Documentation/AIP%20Backup%20and%20Restore.html#AIPBackupandRestore-ExportingAIPHierarchy user documentation]) to export a two-item collection with the handle 123456789-6:
+
 
 +
This page analyzes the structure of a DSpace collection export from an uncustomized (i.e. out of the box) DSpace installation. See also [[Transfer and SIP creation#Workflow.28DSpace export.29| draft workflow]] for transferring and ingesting DSpace exports.
 +
 
 +
Used the following command (from DSpace [https://wiki.duraspace.org/display/DSDOC/AIP+Backup+and+Restore#AIPBackupandRestore-ExportingAIPs user documentation]) to export a two-item collection with the handle 123456789-6:
  
 
<pre>./dspace packager -d -a -t AIP -e <user name> -i 123456789-6 calamy.zip</pre>
 
<pre>./dspace packager -d -a -t AIP -e <user name> -i 123456789-6 calamy.zip</pre>
Line 13: Line 16:
 
*ITEM@123456789-8.zip
 
  
The extracted contents of each zipped file are shown in this screenshot. Note that the bitstream in the collection-level directory (calamy) is a logo added to the collection description in DSpace. The text file bitstreams in the other directories are licenses.
+
The extracted contents of each zipped file are shown in this screenshot:
  
 
[[File:export.png|680px|thumb|center|]]
 
  
== Collection-level mets.xml file ==
+
==Item-level METS files==
 +
 
 +
=== Link to object ===
 +
*The mets.xml file is linked to the object by the handle of the original zipped file:
 +
 
 +
[[File:metsID.png|680px|thumb|center|]]
 +
 
 +
=== Licenses ===
 +
 
 +
The text file bitstreams in the two item-level directories are licenses. Note that they are not identified by filename as license files - Archivematica will need to recognize license files from each object's METS file (i.e. from fileSec). Here is an example of the fileSec showing the object to be preserved (bitstream_12.png) and its license file (bitstream_13):
 +
 
 +
[[File:fileSec.png|680px|thumb|center|]]
 +
 
 +
Archivematica should move the license file to the metadata/submissionDocumentation directory; the text can be parsed to the rights entity in the PREMIS metadata. See [[PREMIS metadata: rights#License-based]].
 +
 
 +
=== OCR text ===
 +
 
 +
If the AIP contains a scanned PDF file, there will also be an accompanying OCR text file with a filename like ''bitstream_39476.txt''. This text is identified in the fileSec of the METS file as USE=TEXT:
 +
 
 +
[[File:metsOCR.png|680px|thumb|center|]]
 +
</br>
 +
The OCR text file should remain in the objects directory of the AIP.
 +
 
 +
=== RightsMD ===
 +
 
 +
Each object also has an amdSec containing rightsMD data (populated automatically according to DSpace configuration settings):
 +
 
 +
[[File:rights.png|680px|thumb|center|]]
 +
 
 +
This metadata can be added to the PREMIS rights entity in the rightsExtension field. See [[PREMIS metadata: rights#From_DSpace_METS]].
 +
 
 +
=== Descriptive metadata ===
  
The mets.xml file for the collection is structured as follows:
+
*Each object has two dmdSecs: MODS and [https://wiki.duraspace.org/display/DSPACE/DSpaceIntermediateMetadata DSpace Intermediate Metadata (DIM)].
*<mets ID="DSpace_COLLECTION_123456789-6" OBJID="hdl:123456789/6" TYPE="DSpace COLLECTION" PROFILE="http://www.dspace.org/schema/aip/mets_aip_1_0.xsd" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
+
**The DIM metadata is not intended for use outside of DSpace: according to the DSpace website, "[DIM] is used by XsltCrosswalk. It is called the Intermediate format because it is intended solely as an intermediate stage in XML-translation-based crosswalks. To reiterate, This is an INTERMEDIATE format, it is NOT for exporting or harvesting metadata!" However, in uncustomized DSpace all the metadata in the DIM fields are mapped to DC, so there may be no harm in referencing the DIM metadata in the Archivematica METS file.
*<metsHdr>
 
*<dmdSec> (contains MODS metadata for collection-level description)
 
*<dmdSec> (contains [https://wiki.duraspace.org/display/DSPACE/DSpaceIntermediateMetadata DSpace Intermediate Metadata (DIM)] for collection-level description; all mapped to dc; some overlap with MODS metadata)
 
*<amdSec> (contains information on DSpace users and groups associated with the collection)
 
*<fileSec> (references the collection's logo, if there is one)
 
*<structMap> (links the collection to its logo, if there is one, plus its two child items)
 
*<structMap> (links the collection to the DSpace Community)
 
  
== Item-level mets.xml file ==
+
*We should add dmdSecs to the Archivematica METS file to link each object to its descriptive metadata in the DSpace METS files (i.e. using mdRef).
*<metsHdr>
 
*<dmdSec_1> (contains MODS metadata for item)
 
*<dmdSec_2> (contains DIM metadata for item; all mapped to dc; some overlap with MODS metadata)
 
*<amdSec> (contains rights metadata)
 
*<amdSec> (contains rights metadata)
 
*<amdSec> (contains PREMIS object metadata; rights metadata; DIM metadata for the item)
 
*<amdSec> (contains rights metadata)
 
*<amdSec> (contains PREMIS object metadata; rights metadata; DIM metadata for the licence)
 
*<fileSec> (links the item to the license)
 
*<structMap> (links the bitstream to the logical object)
 
*<structMap> (links the item to the collection)
 
  
= Parsing a DSpace collection export in Archivematica =
+
=== Checksums ===
Requirements:
+
Each object and license has an MD5 checksum recorded in the fileSec.
*Map the elements of the DSpace AIPs to the Archivematica AIP
 
*Structure the Archivematica mets.xml file to point to the DSpace mets.xml files
 
*Index the metadata in all the xml files
 
  
== Map the elements of the DSpace AIPs to the Archivematica AIP ==
+
[[File:fileSec.png|680px|thumb|center|]]
*The digital objects get placed in the objects directory
 
*The license.txt files get placed in the metadata/submissionDocumentation directory; the text is parsed to the <rights> container in the PREMIS metadata. See [[PREMIS metadata: rights#License-based]]
 
*The mets.xml files get placed in the metadata/submissionDocumentation directory...hmm, why not put them in the metadata directory?
 
  
== Structure the Archivematica mets.xml file ==
+
Archivematica should verify these checksums after transfer.
{| border="1" cellpadding="10" cellspacing="0"
 
|-
 
|- style="background-color:#cccccc;"
 
!style="width:25%"|'''METS file section'''
 
!style="width:75%"|'''Description/notes'''
 
|-
 
|<dmdSec>
 
|DC metadata added during transfer/ingest; SIP-level only
 
|-
 
|<amdSec>
 
|PREMIS metadata
 
|-
 
|<fileSec>
 
|Lists all the files in the objects directory of the AIP
 
|-
 
|<structMap>
 
|Groups the contents in the objects directory of the AIP to reflect the folder structure of the AIP
 
|}
 
  
Question: how do we link the object to the DSpace METS file? Give the METS file a UUID and make the link in the PREMIS relationships field?
+
== Collection-level mets files ==
 +
The collection-level mets file contains MODS and DIM metadata for the collection; the descriptive metadata should be linked to the Archivematica mets file in the dmdSec using mdRef.
  
 
[[Category:Development documentation]]
 

Latest revision as of 16:44, 11 February 2020


This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

Analysis

This page analyzes the structure of a DSpace collection export from an uncustomized (i.e. out of the box) DSpace installation. See also draft workflow for transferring and ingesting DSpace exports.

The following command (from the DSpace user documentation) was used to export a two-item collection with the handle 123456789-6:

./dspace packager -d -a -t AIP -e <user name> -i 123456789-6 calamy.zip

This results in the export of three zipped packages: one for the collection and one for each of the items:

  • calamy.zip
  • ITEM@123456789-7.zip
  • ITEM@123456789-8.zip

The extracted contents of each zipped file are shown in this screenshot:

Export.png
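
As a rough illustration of the extraction step, the sketch below unpacks every zipped package in an export directory into a directory named after the package. The function name `extract_packages` and both directory arguments are hypothetical, and this is not a description of how Archivematica itself handles the transfer.

```python
import zipfile
from pathlib import Path


def extract_packages(export_dir, dest_dir):
    """Extract each .zip in export_dir into dest_dir/<package name without .zip>.

    Returns a dict mapping each package filename to the sorted list of
    member names it contained.
    """
    extracted = {}
    for package in sorted(Path(export_dir).glob("*.zip")):
        # e.g. ITEM@123456789-7.zip is unpacked into dest_dir/ITEM@123456789-7/
        target = Path(dest_dir) / package.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(package) as zf:
            zf.extractall(target)
            extracted[package.name] = sorted(zf.namelist())
    return extracted
```

Calling `extract_packages("export", "extracted")` on the three packages listed above would be expected to produce one directory per package, each holding a mets.xml file and its bitstreams.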

Item-level METS files

Link to object

  • The mets.xml file is linked to the object by the handle of the original zipped file:
MetsID.png
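
A minimal sketch of how this link could be checked, assuming the root element of the item-level mets.xml carries an OBJID attribute of the form `hdl:123456789/7` and that the package filename encodes the same handle with a hyphen in place of the slash. Both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def handle_from_package_name(filename):
    """Turn a package name such as 'ITEM@123456789-7.zip' into '123456789/7'."""
    stem = filename.rsplit(".", 1)[0]      # drop the .zip extension
    identifier = stem.split("@", 1)[-1]    # drop the 'ITEM@' type prefix
    prefix, _, suffix = identifier.rpartition("-")
    return f"{prefix}/{suffix}"


def handle_from_mets(mets_xml):
    """Read the handle from the OBJID attribute of the METS root element."""
    root = ET.fromstring(mets_xml)
    objid = root.get("OBJID", "")
    # Assumed form: 'hdl:<prefix>/<suffix>'; strip the scheme if present.
    return objid[4:] if objid.startswith("hdl:") else objid
```

If the two functions return the same handle for a package and the mets.xml inside it, the METS file can be taken to describe that package. Note that this naming rule does not apply to the collection package in this export (calamy.zip), whose filename was chosen on the command line.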

Licenses

The text file bitstreams in the two item-level directories are licenses. Note that they are not identified by filename as license files, so Archivematica will need to recognize license files from each object's METS file (i.e. from the fileSec). Here is an example of the fileSec showing the object to be preserved (bitstream_12.png) and its license file (bitstream_13):

FileSec.png

Archivematica should move the license file to the metadata/submissionDocumentation directory; the text can be parsed to the rights entity in the PREMIS metadata. See PREMIS metadata: rights#License-based.
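
The following sketch shows one way the fileSec could drive that decision. It assumes that license bitstreams sit in a fileGrp with USE="LICENSE" (an assumption about the DSpace METS, not something stated on this page) and that every other bitstream belongs in the objects directory. The function name and the mapping are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
HREF = "{http://www.w3.org/1999/xlink}href"

# Assumed mapping from fileGrp USE values to AIP directories.
DESTINATIONS = {"LICENSE": "metadata/submissionDocumentation"}
DEFAULT_DESTINATION = "objects"


def plan_file_moves(mets_xml):
    """Map each bitstream named in the fileSec to its target AIP directory."""
    root = ET.fromstring(mets_xml)
    plan = {}
    for group in root.iterfind(".//mets:fileSec/mets:fileGrp", NS):
        use = group.get("USE", "")
        for flocat in group.iterfind("mets:file/mets:FLocat", NS):
            href = flocat.get(HREF)
            if href:
                plan[href] = DESTINATIONS.get(use, DEFAULT_DESTINATION)
    return plan
```

For the example above, such a plan would be expected to send bitstream_13 to metadata/submissionDocumentation and leave bitstream_12.png in objects.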

OCR text

If the AIP contains a scanned PDF file, there will also be an accompanying OCR text file with a filename like bitstream_39476.txt. This text is identified in the fileSec of the METS file as USE=TEXT:

MetsOCR.png


The OCR text file should remain in the objects directory of the AIP.

RightsMD

Each object also has an amdSec containing rightsMD data (populated automatically according to DSpace configuration settings):

Rights.png

This metadata can be added to the PREMIS rights entity in the rightsExtension field. See PREMIS metadata: rights#From_DSpace_METS.

Descriptive metadata

  • Each object has two dmdSecs: MODS and DSpace Intermediate Metadata (DIM).
    • The DIM metadata is not intended for use outside of DSpace: according to the DSpace website, "[DIM] is used by XsltCrosswalk. It is called the Intermediate format because it is intended solely as an intermediate stage in XML-translation-based crosswalks. To reiterate, This is an INTERMEDIATE format, it is NOT for exporting or harvesting metadata!" However, in uncustomized DSpace all the metadata in the DIM fields are mapped to DC, so there may be no harm in referencing the DIM metadata in the Archivematica METS file.
  • We should add dmdSecs to the Archivematica METS file to link each object to its descriptive metadata in the DSpace METS files (i.e. using mdRef).
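
To make the mdRef idea concrete, here is a hedged sketch that builds such a dmdSec with Python's standard library. The element and attribute names (dmdSec, mdRef, LOCTYPE, OTHERLOCTYPE, MDTYPE, OTHERMDTYPE, LABEL, xlink:href) come from the METS schema; the ID values, labels, location type and file path are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_dmdsec(dmd_id, mets_path, md_type, label, other_md_type=None):
    """Build a dmdSec whose mdRef points at a DSpace mets.xml file."""
    dmdsec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    attrs = {
        "LOCTYPE": "OTHER",          # assumed: a path inside the AIP, not a URL
        "OTHERLOCTYPE": "SYSTEM",
        "MDTYPE": md_type,
        "LABEL": label,
        f"{{{XLINK_NS}}}href": mets_path,
    }
    if other_md_type:
        # METS requires OTHERMDTYPE when MDTYPE is "OTHER" (e.g. for DIM).
        attrs["OTHERMDTYPE"] = other_md_type
    ET.SubElement(dmdsec, f"{{{METS_NS}}}mdRef", attrs)
    return dmdsec
```

A MODS reference could then be built with `build_dmdsec("dmdSec_1", "objects/ITEM@123456789-7/mets.xml", "MODS", "mets.xml-MODS")`, and a DIM reference by passing `"OTHER"` as the type and `other_md_type="DIM"`. `ET.tostring(...)` serializes the result for inclusion in the Archivematica METS file.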

Checksums

Each object and license has an MD5 checksum recorded in the fileSec.

FileSec.png

Archivematica should verify these checksums after transfer.
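
A minimal sketch of such a check, assuming the checksums are stored in the standard METS CHECKSUM and CHECKSUMTYPE attributes of each file element and that each FLocat href is a path relative to the mets.xml file. The function names are hypothetical, and this is not Archivematica's actual implementation.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

NS = {"mets": "http://www.loc.gov/METS/"}
HREF = "{http://www.w3.org/1999/xlink}href"


def md5_of(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(mets_path):
    """Return {href: True/False} for every MD5-checksummed file in the fileSec."""
    mets_path = Path(mets_path)
    root = ET.parse(mets_path).getroot()
    results = {}
    for file_el in root.iterfind(".//mets:fileSec//mets:file", NS):
        if file_el.get("CHECKSUMTYPE") != "MD5":
            continue  # only MD5 is handled in this sketch
        flocat = file_el.find("mets:FLocat", NS)
        href = flocat.get(HREF) if flocat is not None else None
        if not href:
            continue
        expected = file_el.get("CHECKSUM", "").lower()
        target = mets_path.parent / href
        # A missing file counts as a failed verification.
        results[href] = target.is_file() and md5_of(target) == expected
    return results
```

Any href mapped to False would indicate a bitstream that is missing or was altered during transfer.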

Collection-level METS files

The collection-level METS file contains MODS and DIM metadata for the collection; the descriptive metadata should be linked to the Archivematica METS file in the dmdSec using mdRef.