Transcoder


Transcode: convert (language or information) from one form of coded representation to another. [source: Oxford English Dictionary]

Overview

The transcoder is developed by Artefactual to perform normalization and generate access copies in the Archivematica system. In earlier versions it was called the normalizer. It tries to identify each file's type from its file extension or other metadata, then looks up the actions configured for that type. It performs those actions and exits with a zero status if it believes they completed successfully.

Development

To manage the complexity of automating the link between file identification and actions, a database-based implementation of the transcoder is currently being built to replace the XML-based one.

Configuration

Configuration files are located in the /etc/transcoder/ directory.

transcoderConfig.conf

transcoderConfig.conf is the primary transcoder configuration file. It is a bash script which defines the variables used in the various file format policy XML files; it primarily contains paths to conversion tools and standard file names.

Variables are stored as standard bash shell script variables. Variables can be added or edited using any text editor; any new variables added become available for use in format policy XML files. They use the format:

variableName="variable contents"
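As an illustration, a short excerpt in the style of this file might look like the following. The first two assignments reuse default values documented on this page; soxPath is a hypothetical addition, not a shipped default, included only to suggest how a new variable could then be referenced as %soxPath% in a format policy XML file.

```shell
# Illustrative excerpt in the style of /etc/transcoder/transcoderConfig.conf.
# Tool paths keep the trailing space that the defaults call for.
convertPath="/usr/bin/convert "
ffmpegPath="/usr/bin/ffmpeg "

# Hypothetical new variable (not a shipped default).
soxPath="/usr/bin/sox "
```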

Default variables:

Variable | Description | Default value
formatPoliciesPath | Directory containing format policy XML files | /etc/transcoder/archivematicaFormatPolicies/
transcoderScriptsDir | Directory containing transcoder normalization scripts | /usr/lib/transcoder/transcoderScripts/
convertPath | Path to ImageMagick for image conversion. Requires a space at the end. | /usr/bin/convert
ffmpegPath | Path to ffmpeg for audio and video. Requires a space at the end. | /usr/bin/ffmpeg
theoraPath | Path to ffmpeg2theora script to create Ogg Theora and Vorbis files. Currently unused. Requires a space at the end. | /usr/bin/ffmpeg2theora
unoconvPath | Path to unoconv binary for converting document files. Currently unused. Requires a space at the end. | /usr/bin/unoconv
unoconvAlternatePath | Path to unoconv launcher script for converting document files. Requires a space at the end. | /usr/lib/transcoder/transcoderScripts/unoconvAlternative.sh
DublinCore | File name for Dublin Core metadata | dublincore.xml
MD5FileName | File name containing SIP MD5 checksum | MD5checksum.txt
fileUUIDHumanReadable | Log file containing unique IDs for items within a SIP | FileUUIDs.log

archivematicaFormatPolicies

The /etc/transcoder/archivematicaFormatPolicies directory contains XML files which control how Archivematica performs normalization. The transcoder reads a file's extension and selects the matching XML file to determine how to perform normalization. Because normalization is based on file extension, objects with an incorrect file extension or no extension will usually fail to normalize. This is scheduled to change in Archivematica 0.7.2.
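The extension-based lookup can be sketched in shell as follows. This is only an illustration, not the transcoder's actual code: the function name policy_file_for is hypothetical, and the naming scheme (upper-cased extension plus .xml) is an assumption. It also shows why a file without an extension finds no policy.

```shell
# Default policy directory, as documented for formatPoliciesPath.
formatPoliciesPath="/etc/transcoder/archivematicaFormatPolicies/"

# Hypothetical helper: print the policy file that would match a given file.
# Assumes policies are named after the upper-cased extension, e.g. MP3.xml.
policy_file_for() {
    pf_name=$(basename -- "$1")
    case "$pf_name" in
        ?*.*) pf_ext=${pf_name##*.} ;;   # text after the last dot
        *)    return 1 ;;                # no extension: nothing to match
    esac
    [ -n "$pf_ext" ] || return 1         # name ends in a dot
    pf_ext=$(printf '%s' "$pf_ext" | tr '[:lower:]' '[:upper:]')
    printf '%s%s.xml\n' "$formatPoliciesPath" "$pf_ext"
}
```

For example, policy_file_for /tmp/sip/objects/track.mp3 prints a path ending in MP3.xml, while policy_file_for /tmp/sip/objects/README returns a non-zero status because there is no extension to match.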

The following sample configuration file illustrates the syntax:

<source lang="xml">
<formatPolicy>
  <inherit></inherit>
  <accessFormat>MP3</accessFormat>
  <preservationFormat>WAV</preservationFormat>
  <accessConversionCommand>%ffmpegPath% -i %fileFullName% -ab 192000 %accessFileDirectory%%fileTitle%.%accessFormat%</accessConversionCommand>
  <preservationConversionCommand>%ffmpegPath% -i %fileFullName% %preservationFileDirectory%%fileTitle%.%preservationFormat%</preservationConversionCommand>
  <preservationConversionCommand>%xenaPath%</preservationConversionCommand>
</formatPolicy>
</source>

Every format policy XML file encloses its contents in the <formatPolicy></formatPolicy> tag. All content intended to be read by the transcoder for normalization should be between the opening and closing tags.

  • <inherit> should be used when a format shares the same normalization processes as another format. For instance, all audio files (MP3, WMA, etc.) share a common access format and preservation format and are converted using the same command, so the XML files for MP3 and WMA contain only an <inherit> tag pointing to the "parent" AUDIO document that contains the commands for all audio types. If <inherit> is used, all other tags should be empty.
  • <accessFormat> defines the access format for the specified object type. It should be expressed using the file format for that type, in capital letters. For instance, MP3 is the accessFormat for audio files.
  • <preservationFormat> defines the preservation format for the specified object type. It uses the same syntax as <accessFormat>.
  • <accessConversionCommand> contains the command line for launching the tool which creates an access copy from the specified object type. This can be customized or replaced. Any variables in use are defined by transcoderConfig.conf and should be customized there.
  • <preservationConversionCommand> contains the command line for launching the tool which creates a preservation copy from the specified object type. This can be customized or replaced. Any variables in use are defined by transcoderConfig.conf and should be customized there.
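To make the role of the %name% placeholders concrete, the following sketch expands the sample access conversion command using sed. It is an illustration under stated assumptions, not the transcoder's own substitution code: expand_command is a hypothetical helper, and the per-file values (fileFullName, accessFileDirectory, fileTitle) are invented sample values whose meanings are inferred from their names.

```shell
# Values that would come from transcoderConfig.conf and the format policy.
ffmpegPath="/usr/bin/ffmpeg "
accessFormat="MP3"

# Assumed per-file values (hypothetical sample data).
fileFullName="/tmp/sip/objects/track.wma"
accessFileDirectory="/tmp/sip/access/"
fileTitle="track"

# Hypothetical helper: replace each %name% placeholder with its value.
# Uses | as the sed delimiter so that slashes in paths need no escaping;
# values containing | or & would need escaping first.
expand_command() {
    printf '%s\n' "$1" | sed \
        -e "s|%ffmpegPath%|$ffmpegPath|g" \
        -e "s|%fileFullName%|$fileFullName|g" \
        -e "s|%accessFileDirectory%|$accessFileDirectory|g" \
        -e "s|%fileTitle%|$fileTitle|g" \
        -e "s|%accessFormat%|$accessFormat|g"
}

template='%ffmpegPath% -i %fileFullName% -ab 192000 %accessFileDirectory%%fileTitle%.%accessFormat%'
expand_command "$template"
```

The printed line is the ffmpeg command with every placeholder filled in, ending in the access copy path /tmp/sip/access/track.MP3.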

Note that all tags should be present, regardless of whether they are being used. For instance, if the format in question does not inherit from another type, the inherit opening and closing tags should still be present without enclosing any content, e.g.: <inherit></inherit>
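The way <inherit> points a format at its parent policy can be sketched as a small lookup loop. This is a hedged illustration rather than the transcoder's implementation: inherit_of and resolve_policy are hypothetical helpers, and the sketch assumes that <inherit> holds the parent policy's name without the .xml suffix (for example AUDIO) and that each tag sits on a single line.

```shell
# Print the text inside <inherit>...</inherit> (empty if the tag is empty).
inherit_of() {
    sed -n 's|.*<inherit>\(.*\)</inherit>.*|\1|p' "$1" | head -n 1
}

# Follow <inherit> links from a starting policy name until reaching a
# policy whose <inherit> tag is empty, then print that policy's name.
resolve_policy() {
    policy_dir=$1
    current=$2
    hops=0
    while [ "$hops" -le 10 ]; do          # guard against inheritance loops
        [ -f "$policy_dir/$current.xml" ] || return 1
        parent=$(inherit_of "$policy_dir/$current.xml")
        if [ -z "$parent" ]; then
            printf '%s\n' "$current"
            return 0
        fi
        current=$parent
        hops=$((hops + 1))
    done
    return 1
}
```

Under these assumptions, calling resolve_policy /etc/transcoder/archivematicaFormatPolicies MP3 would print AUDIO if MP3.xml inherits from AUDIO and AUDIO.xml has an empty <inherit> tag. A missing policy file or an inheritance loop results in a non-zero status.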

TODO: learn why there can be two <preservationConversionCommand>s and update documentation