Metadata elements

From Archivematica
Revision as of 17:08, 3 September 2010 by Evelyn McLellan (talk | contribs)

This page identifies a minimum set of metadata elements designed to ensure authenticity and interoperability of preserved objects and to facilitate their retrieval.

Identifying these elements involves:

  1. Using the InterPARES Chain of Preservation (CoP) model and the CoP/PREMIS crosswalk to identify required elements for objects preserved by Archivematica
  2. Analyzing existing metadata in the Archivematica AIP log files and METS.xml file in order to map them to METS and PREMIS elements
  3. Comparing the results of steps 1 and 2 in order to determine what gaps exist in Archivematica
  4. Filling in the gaps, e.g. by modifying the workflow to produce and/or capture missing elements
  5. Structuring the required elements into the Repository eXchange Package (RXP) specification
  6. Determining what metadata belongs in the DIP(s)

Map of Archivematica 0.6 metadata to PREMIS elements

Source: /data/logs/MD5checksum.txt
Process: Produced when quarantine period expires. Provides checksums for each object in the SIP. Note that if zipped files are present, a checksum is generated for the zipped file and not for each object within it.
Description | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
Checksum | Object | 1.5.2 fixity | 1.5.2.2 messageDigest | 326e0206ae83f815e4be5f28464f6ac6
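The checksum step described above can be illustrated with a minimal Python sketch. It is not Archivematica's own script: the function names `md5_of_file` and `checksum_manifest` are hypothetical, and the only behaviour taken from this page is that every file in the SIP gets an MD5 digest and that a zipped file is hashed as one object.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(sip_dir):
    """Map each file path (relative to sip_dir) to its MD5 digest.

    A zipped file is hashed as a single object; its contents are not opened.
    """
    root = Path(sip_dir)
    return {
        path.relative_to(root).as_posix(): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Each value in the returned mapping corresponds to one 1.5.2.2 messageDigest entry.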
Source: /data/logs/filenameCleanup.log
Process: Produced when quarantine period expires, prior to unpacking of any zipped files. If prohibited characters were present in filenames, provides crosswalk between original and "cleaned up" filenames.
Description | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
Original filename | Object | 1.6 originalName | none | Syllabus final.doc
Cleaned-up filename | Event | 2.5.2 eventOutcomeDetail | 2.5.2.1 eventOutcomeDetailNote | Syllabus_final.doc
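As an illustration of the original-to-cleaned crosswalk, here is a short Python sketch. The set of prohibited characters is an assumption: this page only shows a space becoming an underscore, so the pattern below (anything other than ASCII letters, digits, ".", "_" and "-") and the names `clean_filename` and `cleanup_crosswalk` are hypothetical.

```python
import re

# Assumption: every character outside this allow-list is prohibited.
# The actual Archivematica rule set may differ.
PROHIBITED = re.compile(r"[^A-Za-z0-9._-]")


def clean_filename(name):
    """Replace each prohibited character with an underscore."""
    return PROHIBITED.sub("_", name)


def cleanup_crosswalk(names):
    """Return (original, cleaned) pairs for the names that changed."""
    pairs = []
    for name in names:
        cleaned = clean_filename(name)
        if cleaned != name:
            pairs.append((name, cleaned))
    return pairs
```

In each pair, the first item would be recorded as 1.6 originalName and the second as 2.5.2.1 eventOutcomeDetailNote.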
Source: /data/logs/virusScan.log
Process: Produced when ingested files are scanned for viruses and malware
Description | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
Scan result | Event | 2.5 eventOutcomeInformation | 2.5.1 eventOutcome | OK
Source: /data/logs/fileUUIDs.log
Process: Produced after prohibited characters are removed from filenames and any zipped files have been unpacked. Provides a crosswalk between cleaned-up filenames and UUIDs.
Description | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
Universally unique identifier (UUID) | Object | 1.1 objectIdentifier | 1.1.2 objectIdentifierValue | 270bd067-0483-4c5f-bdec-f2cbd6e651aa
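The filename-to-UUID crosswalk could be produced along the following lines. This is a sketch only: the function name `assign_uuids` is hypothetical, and the use of random (version 4) UUIDs is an assumption that happens to match the form of the sample value above.

```python
import uuid


def assign_uuids(filenames):
    """Map each cleaned-up filename to a newly generated UUID string."""
    # Assumption: random (version 4) UUIDs are used.
    return {name: str(uuid.uuid4()) for name in filenames}
```

Each value would be stored as 1.1.2 objectIdentifierValue for the corresponding object.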
Source: /data/logs/FITS-[UUID]-[SIP].xml (FITS output reports)
Process: Produced when FITS tool identifies and validates formats and extracts technical metadata
FITS element | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
format | Object | 1.5.4.1 formatDesignation | 1.5.4.1.1 formatName |
  • Tagged Image File Format
  • Waveform Audio
  • Microsoft Powerpoint Presentation
version | Object | 1.5.4.1 formatDesignation | 1.5.4.1.2 formatVersion | 6.0
externalIdentifier | Object | 1.5.4.2 formatRegistry | 1.5.4.2.2 formatRegistryKey | fmt/10
Size | Object | 1.5 objectCharacteristics | 1.5.3 size | 125968
ImageWidth (image files and video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 2464
ImageHeight (image files and video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 3248
SamplesPerPixel (image files and video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 3
XResolution (image files and video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 300
YResolution (image files and video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 300
duration (audio files and video files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 0:2:26:16
bitDepth/bitsPerSample (image files, audio files, video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 16
sampleRate (audio files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 48000.0
channels (audio files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 2
aes:channelAssignment (audio files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue |
  • channelNum="0" mapLocation="LEFT"
  • channelNum="1" mapLocation="RIGHT"
VideoFrameRate (video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue |
  • 30.0
  • 29.97 fps
AspectRatio (video streams) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 1:1
AudioFormat (audio streams in video files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | raw
AudioChannels (audio streams in video files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 2
AudioBitsPerSample (audio streams in video files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 8
AudioSampleRate (audio streams in video files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 44100
PageCount (text files, office documents, pdf files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 16
WordCount (text files, office documents) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 876
Paragraphs (text files, office documents) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 19
Slides (presentation files) | Object | 1.4 significantProperties | 1.4.2 significantPropertiesValue | 27
Source: /data/logs/normalization.log
Process: Produced during normalization to preservation and access formats
Description | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
Name of normalization tool | Agent | 3.2 agentName | none | FFmpeg version SVN-r19352-4:0.5+svn20090706-2ubuntu2.2
Event description | Event | 2.2 eventType | none | Normalizing
Processing status | Event | 2.5 eventOutcomeInformation | 2.5.1 eventOutcome | Processing completed
Normalization result | Event | 2.5.2 eventOutcomeDetail | 2.5.2.1 eventOutcomeDetailNote |
  • Already in preservation format. No need to normalize.
  • No default normalization tool defined.
  • Output #0, wav, to '/tmp/MultimediaSIP-9ece5881-640e-4bdc-9863-4ff50046a0bd/objects/sample.wav': Stream #0.0: Audio: pcm_s16le, 8000 Hz, stereo, s16, 256 kb/s
Source: /data/logs/MD5checksum.txtprepareAIP_check.log
Process: Produced after file normalization process. Checks that checksums for files in the SIP have not changed during normalization.
Description | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
Pass/fail notification | Event | 2.5 eventOutcomeInformation | 2.5.1 eventOutcome |
  • PASSED
  • FAILED
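The post-normalization check amounts to comparing two checksum manifests. The sketch below assumes each manifest is a mapping of filename to MD5 digest; the function name `compare_manifests` and the "MISSING" note are hypothetical, while the PASSED/FAILED outcomes and the "filename FAILED" detail form follow the sample values on this page.

```python
def compare_manifests(before, after):
    """Compare checksums recorded before and after normalization.

    Returns (outcome, details): outcome is "PASSED" or "FAILED", and details
    lists one note per file whose checksum changed or that is missing.
    """
    details = []
    for name, digest in sorted(before.items()):
        if name not in after:
            # Assumption: a missing file is reported as its own kind of failure.
            details.append(f"{name} MISSING")
        elif after[name] != digest:
            details.append(f"{name} FAILED")
    outcome = "FAILED" if details else "PASSED"
    return outcome, details
```

The outcome maps to 2.5.1 eventOutcome, and each detail note is a candidate value for 2.5.2 eventOutcomeDetail.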
Source: /data/logs/AIP.MD5checksum.txt
Process: Produced during BagIt process. Provides checksums for the AIP and for each original and normalized file in the AIP.
Description | PREMIS entity | PREMIS semantic unit | PREMIS semantic component | Sample value(s)
AIP checksum | Object | 1.5.2 fixity | 1.5.2.2 messageDigest | 12b86e038bf0bddd5aba110c35f288b8
File checksum | Object | 1.5.2 fixity | 1.5.2.2 messageDigest | 326e0206ae83f815e4be5f28464f6ac6


Events requiring metadata

Receive SIP (SIP gets placed in 1-receiveSIP)

Metadata for the SIP as a whole

Semantic component | Sample value(s) | Automated? | Notes
2.1.1 eventIdentifierType | | Y |
2.1.2 eventIdentifierValue | | Y |
3.1.1 agentIdentifierType | user account | Y |
3.1.2 agentIdentifierValue | demo | Y |
3.1.1 agentIdentifierType | workstation id | Y |
3.1.2 agentIdentifierValue | archivematica-1 | Y |



Check checksums

Metadata for each file in the SIP

Semantic component | Sample value(s) | Automated? | Notes
2.1.1 eventIdentifierType | | Y |
2.1.2 eventIdentifierValue | | Y |
2.2 eventType | | Y |
2.3 eventDateTime | | Y |
3.1.1 agentIdentifierType | software | Y |
3.1.2 agentIdentifierValue | MD5sum | Y |
2.5.1 eventOutcome | Pass; fail | Y |
2.5.2 eventOutcomeDetail | j6059_02.wav FAILED | Y |
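To show how the components in this table might fit together, the following Python sketch assembles them into one XML record. It mirrors the table's component names only; it is not validated against the PREMIS schema and omits namespaces. The function name `build_event`, the "UUID" identifier type, and the ISO 8601 timestamp format are assumptions not specified on this page.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_event(event_type, agent_type, agent_value, outcome, detail=None):
    """Build an <event> element from the semantic components listed above."""
    event = ET.Element("event")

    identifier = ET.SubElement(event, "eventIdentifier")
    # Assumption: events are identified by generated UUIDs.
    ET.SubElement(identifier, "eventIdentifierType").text = "UUID"
    ET.SubElement(identifier, "eventIdentifierValue").text = str(uuid.uuid4())

    ET.SubElement(event, "eventType").text = event_type
    # Assumption: timestamps are recorded in UTC, ISO 8601 format.
    ET.SubElement(event, "eventDateTime").text = datetime.now(timezone.utc).isoformat()

    info = ET.SubElement(event, "eventOutcomeInformation")
    ET.SubElement(info, "eventOutcome").text = outcome
    if detail is not None:
        ET.SubElement(info, "eventOutcomeDetail").text = detail

    agent = ET.SubElement(event, "agentIdentifier")
    ET.SubElement(agent, "agentIdentifierType").text = agent_type
    ET.SubElement(agent, "agentIdentifierValue").text = agent_value
    return event


if __name__ == "__main__":
    # "fixity check" is an illustrative eventType value, not one taken from this page.
    record = build_event("fixity check", "software", "MD5sum", "fail", "j6059_02.wav FAILED")
    print(ET.tostring(record, encoding="unicode"))
```

The same helper could serve the other events listed in this section, since most of them share the identifier, type, date/time, agent and outcome components.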


Generate checksums

Metadata for each file in the SIP for which a checksum is generated by Archivematica

Semantic component | Sample value(s) | Automated? | Notes
2.1.1 eventIdentifierType | | Y |
2.1.2 eventIdentifierValue | | Y |
2.2 eventType | | Y |
2.3 eventDateTime | | Y |
3.1.1 agentIdentifierType | software | Y |
3.1.2 agentIdentifierValue | MD5sum | Y |
1.5.2.1 messageDigestAlgorithm | MD5 | Y |
1.5.2.2 messageDigest | fa10ee76a575bafe43335abf6cd60bae | Y |
1.5.2.3 messageDigestOriginator | City of Vancouver | Y |



Review SIP

Semantic component | Sample value(s) | Automated? | Notes
2.1.1 eventIdentifierType | | Y |
2.1.2 eventIdentifierValue | | Y |
2.2 eventType | | Y |
2.3 eventDateTime | | Y |
3.1.1 agentIdentifierType | user account | Y |
3.1.2 agentIdentifierValue | demo | Y |
2.5.1 eventOutcome | pass; conditional pass | N | If the review fails, the SIP does not move on to become an AIP, so "fail" is not a possible value
2.5.2 eventOutcomeDetail | Some files missing; appraisal required | N | This field is mandatory if eventOutcome = conditional pass


Quarantine SIP

-when it went in and when it came out

Unpack zipped files

-tool used, time unpacked, event outcome (successful?), map of zipped file to unzipped contents (map for each unzipped file + link to event)

Assign UUIDs

-the usual stuff, map from original name to UUID

Remove prohibited characters

-the usual stuff, map from original name to sanitized name

Virus scan

-the usual stuff, result for each file (include eventOutcomeDetail to describe type of fail such as the type of malware found)

File characterization

-identification: format name, format version, registry name, registry key
-validation: well formed? Valid?

Appraise SIP

-usual event stuff
-event outcome (no files removed; some files removed)
-list of files removed

Normalization to preservation formats

-everything already in the table plus identification information: format name, format version, registry name, registry key

Normalization to access formats


Mandatory PREMIS elements (mandatory semantic units + mandatory components)

Entity | Semantic unit | Semantic component | Present in Archivematica?
Object | 1.1 objectIdentifier | 1.1.1 objectIdentifierType | No
Object | 1.1 objectIdentifier | 1.1.2 objectIdentifierValue | Yes
Object | 1.2 objectCategory | none | No
Object | 1.5 objectCharacteristics | 1.5.1 compositionLevel | No
Object | 1.5.4 objectCharacteristics/format | Either 1.5.4.1 formatDesignation or 1.5.4.2 formatRegistry must be used |
  • 1.5.4.1.1 formatName | Yes
  • 1.5.4.2.1 formatRegistryName | No
  • 1.5.4.2.2 formatRegistryKey | Yes
Object | 1.7 storage | Either 1.7.1 contentLocation or 1.7.2 storageMedium must be used. However, "if the preservation repository uses the objectIdentifier as a handle for retrieving data, contentLocation is implicit and does not need to be recorded." | No, but retrieval may be managed through UUIDs.
Event | 2.1 eventIdentifier | 2.1.1 eventIdentifierType | No
Event | 2.1 eventIdentifier | 2.1.2 eventIdentifierValue | No
Event | 2.2 eventType | none | Partial
Event | 2.3 eventDateTime | none | Partial
Agent | 3.1 agentIdentifier | 3.1.1 agentIdentifierType | No
Agent | 3.1 agentIdentifier | 3.1.2 agentIdentifierValue | No
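The gap analysis for the Object entity can be expressed as a simple check. This is a sketch under stated assumptions: an object's metadata is held in a flat dictionary keyed by component name, the function name `missing_object_elements` is hypothetical, and the 1.7 storage rule is left out because it depends on how the repository retrieves data.

```python
# Mandatory Object components from the table above (storage is omitted here).
REQUIRED_OBJECT_FIELDS = (
    "objectIdentifierType",
    "objectIdentifierValue",
    "objectCategory",
    "compositionLevel",
)


def missing_object_elements(record):
    """List the mandatory Object components that are absent from record."""
    missing = [field for field in REQUIRED_OBJECT_FIELDS if not record.get(field)]

    # Either formatDesignation (formatName) or formatRegistry
    # (formatRegistryName plus formatRegistryKey) must be present.
    has_designation = bool(record.get("formatName"))
    has_registry = bool(record.get("formatRegistryName")) and bool(
        record.get("formatRegistryKey")
    )
    if not (has_designation or has_registry):
        missing.append("formatDesignation or formatRegistry")
    return missing
```

Run against a record holding only the components marked "Yes" above, the check reports objectIdentifierType, objectCategory and compositionLevel as missing.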


PREMIS elements relating to derived objects

Since AIPs are constructed from both original and normalized files, we need to determine what PREMIS elements should be used to describe the normalized files and their relationship to the originals.

Original file metadata

Entity | Semantic unit | Semantic component | Example
Object | 1.10 relationship | 1.10.1 relationshipType | derivation
Object | 1.10 relationship | 1.10.2 relationshipSubType | is source of
Object | 1.10.3 relatedObjectIdentification | 1.10.3.1 relatedObjectIdentifierType | UUID
Object | 1.10.3 relatedObjectIdentification | 1.10.3.2 relatedObjectIdentifierValue | (UUID of the normalized file)
Event | 2.1 eventIdentifier | 2.1.1 eventIdentifierType |
Event | 2.1 eventIdentifier | 2.1.2 eventIdentifierValue |
Event | 2.2 eventType | none | Normalization
Event | 2.3 eventDateTime | none | 2010:05:19 00:49:15+00:00
Event | 2.5 eventOutcomeInformation | 2.5.1 eventOutcome | Processing completed
Event | 2.5.2 eventOutcomeDetail | 2.5.2.1 eventOutcomeDetailNote | Output #0, wav, to '/tmp/MultimediaSIP-9ece5881-640e-4bdc-9863-4ff50046a0bd/objects/sample.wav': Stream #0.0: Audio: pcm_s16le, 8000 Hz, stereo, s16, 256 kb/s
Agent | 3.1 agentIdentifier | 3.1.1 agentIdentifierType |
Agent | 3.1 agentIdentifier | 3.1.2 agentIdentifierValue |
Agent | 3.2 agentName | none | FFmpeg version SVN-r19352-4:0.5+svn20090706-2ubuntu2.2


Normalized file metadata

Entity | Semantic unit | Semantic component | Example
Object | 1.1 objectIdentifier | 1.1.1 objectIdentifierType | UUID
Object | 1.1 objectIdentifier | 1.1.2 objectIdentifierValue | 270bd067-0483-4c5f-bdec-f2cbd6e651aa
Object | 1.10 relationship | 1.10.1 relationshipType | derivation
Object | 1.10 relationship | 1.10.2 relationshipSubType | has source
Object | 1.10.3 relatedObjectIdentification | 1.10.3.1 relatedObjectIdentifierType | UUID
Object | 1.10.3 relatedObjectIdentification | 1.10.3.2 relatedObjectIdentifierValue | (UUID of the original file)
Object | 1.10.4 relatedEventIdentification | 1.10.4.1 relatedEventIdentifierType |
Object | 1.10.4 relatedEventIdentification | 1.10.4.2 relatedEventIdentifierValue |
Object | 1.5.2 fixity | 1.5.2.1 messageDigestAlgorithm | MD5
Object | 1.5.2 fixity | 1.5.2.2 messageDigest | 537e0206ae83f815e4fg5f28464f6rt7
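The two tables describe a reciprocal relationship: the original "is source of" the normalized file, and the normalized file "has source" the original. A minimal Python sketch of that pairing follows. The function name `derivation_links` and the flat dictionary layout are hypothetical; the component names and values come from the tables, and the event link is attached to the normalized file only, as in the tables.

```python
def derivation_links(original_uuid, normalized_uuid, event_id=None):
    """Return relationship metadata for an original/normalized file pair.

    event_id is an optional (type, value) tuple identifying the normalization
    event; it is recorded on the normalized file only.
    """

    def link(sub_type, related_uuid):
        return {
            "relationshipType": "derivation",
            "relationshipSubType": sub_type,
            "relatedObjectIdentifierType": "UUID",
            "relatedObjectIdentifierValue": related_uuid,
        }

    original = link("is source of", normalized_uuid)
    normalized = link("has source", original_uuid)
    if event_id is not None:
        normalized["relatedEventIdentifierType"] = event_id[0]
        normalized["relatedEventIdentifierValue"] = event_id[1]
    return {original_uuid: original, normalized_uuid: normalized}
```

Generating both records in one step keeps the two sides of the relationship consistent with each other.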