Transcoder

From Archivematica

Latest revision as of 16:05, 11 February 2020

This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

Transcode: convert (language or information) from one form of coded representation to another. (Source: Oxford English Dictionary)

Overview

The transcoder is developed by Artefactual to normalize files and generate access copies in the Archivematica system. In earlier versions it was called the normalizer. It tries to identify the file type from the file extension or other metadata, then looks for actions configured for the identified type. It performs those actions and exits with a zero status if it believes they completed successfully.
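
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the transcoder's actual code: the `rules` dictionary, `find_actions` and `run_actions` are hypothetical names, and in Archivematica the rules are stored in the transcoder database rather than in a Python dictionary.

```python
import os
import subprocess


def find_actions(path, rules):
    """Return the command templates configured for the file's extension."""
    # Assumption: identification is by lower-cased extension only.
    extension = os.path.splitext(path)[1].lower()
    return rules.get(extension, [])


def run_actions(path, rules):
    """Run every matching command; return 0 only if all of them succeed."""
    for template in find_actions(path, rules):
        command = template.replace("%fileFullName%", path)
        # shell=True mirrors a bash-style command; the rules are assumed trusted.
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            return result.returncode
    return 0
```

A caller would pass the result of `run_actions` to `sys.exit`, so that a zero exit status means every configured action reported success.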

Transcoder Database

In Archivematica release 0.7.1 alpha, the normalization rules were moved to a database and can be viewed under the Preservation planning tab on the dashboard. In future releases, we plan to support modifying these rules through the dashboard interface.

Database Schema

[Figure: Transcoder database schema.png]

Configuration

Configuration files are located in the /etc/transcoder/ directory.

The transcoder database credentials and server can be set in the dbsettings.conf file.

Development

In the 0.9 release the transcoder was integrated with the MCP.

During transfer processing, micro-services identify the fileIDs for each file and store them against the file in the FilesIdentifiedIDs table.

For normalization processing, the MCP works down a chain for each file. The normalization job for a file checks for command relationships that match the file's identified file IDs and the requested command classification (normalize preservation or normalize access). For every unique command found in those relationships, the MCP creates a task to be executed by the client. If no commands are found, the MCP creates a task with the default command from the DefaultCommandsForClassifcations table, if one is defined.
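
The selection rule described above (matching commands first, then the default) might look like the following sketch. It uses an in-memory SQLite database so that it runs on its own; the column layout is a simplified assumption (for example, the classification is stored as text rather than as a key into CommandClassifications), the default table name is spelled as in the text above, and `commands_for_file` is a hypothetical helper, not part of the MCP.

```python
import sqlite3

# Simplified, assumed schema; the real tables have more columns.
SCHEMA = """
CREATE TABLE FilesIdentifiedIDs (fileUUID TEXT, fileID INTEGER);
CREATE TABLE CommandRelationships (
    pk INTEGER PRIMARY KEY,
    commandClassification TEXT,
    command INTEGER,
    fileID INTEGER
);
CREATE TABLE DefaultCommandsForClassifcations (
    commandClassification TEXT,
    command INTEGER
);
"""


def commands_for_file(db, file_uuid, classification):
    """Return the distinct command pks to run for one file and classification."""
    rows = db.execute(
        """SELECT DISTINCT cr.command
           FROM FilesIdentifiedIDs AS fi
           JOIN CommandRelationships AS cr ON cr.fileID = fi.fileID
           WHERE fi.fileUUID = ? AND cr.commandClassification = ?
           ORDER BY cr.command""",
        (file_uuid, classification),
    ).fetchall()
    if rows:
        return [row[0] for row in rows]
    # No relationship matched: fall back to the default command, if one is defined.
    default = db.execute(
        "SELECT command FROM DefaultCommandsForClassifcations"
        " WHERE commandClassification = ?",
        (classification,),
    ).fetchone()
    return [default[0]] if default else []
```

Each returned command would then become one task for a client. An empty list means that no command matched and no default is defined for that classification.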


Integration with the MCP was done by relating commands to micro-service chain links. The transcoder links (MicroServiceChainLinks of this type) have a one-to-one relationship with the TasksConfigs, which have a one-to-one relationship with the CommandRelationships. The protocol between the client and server is based on the CommandRelationships primary key (pk): the MCP assigns a task to the client to perform commandRelationship x on file y (identified by fileUUID). The client can then pull the information required to execute the command from the database.
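
As a rough illustration of that protocol, the sketch below models a task as just the two identifiers and lets the client resolve everything else from the database. The `Task` tuple, the `resolve_task` helper, and the `Files` table with its `currentLocation` column are assumptions made for the example; only the CommandRelationships pk and the fileUUID come from the description above.

```python
import sqlite3
from collections import namedtuple

# A task carries only the relationship to apply and the file to apply it to.
Task = namedtuple("Task", ["command_relationship_pk", "file_uuid"])

# Simplified, assumed schema for the example.
SCHEMA = """
CREATE TABLE Commands (pk INTEGER PRIMARY KEY, command TEXT);
CREATE TABLE CommandRelationships (pk INTEGER PRIMARY KEY, command INTEGER);
CREATE TABLE Files (fileUUID TEXT PRIMARY KEY, currentLocation TEXT);
"""


def resolve_task(db, task):
    """Look up the command text and file path a client needs to run a task."""
    row = db.execute(
        """SELECT c.command, f.currentLocation
           FROM CommandRelationships AS cr
           JOIN Commands AS c ON c.pk = cr.command
           JOIN Files AS f ON f.fileUUID = ?
           WHERE cr.pk = ?""",
        (task.file_uuid, task.command_relationship_pk),
    ).fetchone()
    if row is None:
        raise LookupError("unknown command relationship or file")
    return {"command": row[0], "path": row[1]}
```

Keeping the message this small means the server never has to ship command text to the client; the database remains the single source of the command definition.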

Why the change? So that not every client has to support every normalization tool, tasks needed to be assigned by tool availability. Currently the archivematica-client package depends on all the required tools, but there are situations where assignment by availability will be required. None of these are currently implemented, but an example would be normalizing on a Windows machine using Microsoft Office. The Windows machine could theoretically run a client, but it could not support the standard Archivematica tools, as they are Linux-based. To differentiate the two, use the supportedBy field in the Commands table.

Example

Normalization commands are created as part of the Archivematica install. They are kept in the database, which is populated on install by the /usr/share/archivematica/transcoder/mysql SQL script.

Create Commands

First, create the command(s) that need to run. A command can even be a complete script. The command type must be defined; the supported command types are listed in the CommandTypes table. You may also wish to create a separate command that returns the event detail text for the event.

See code:

-- Commands for handling Video files --
INSERT INTO Commands 
    (commandType, command, description) 
    -- VALUES SELECT pk FROM FileIDS WHERE description = 'Normalize Defaults'
    VALUES (
    (SELECT pk FROM CommandTypes WHERE type = 'bashScript'),
    ('echo program=\\"ffmpeg\\"\\; version=\\"`ffmpeg 2>&1 | grep \"FFmpeg version\"`\\"'),
    ('Get event detail text for ffmpeg extraction')
);

set @ffmpegEventDetailCommandID = LAST_INSERT_ID();

INSERT INTO Commands 
    (commandType, command, outputLocation, eventDetailCommand, verificationCommand, description) 
    VALUES (
    (SELECT pk FROM CommandTypes WHERE type = 'bashScript'),
    ('ffmpeg -i "%fileFullName%" -vcodec libx264 -preset medium -crf 18 "%outputDirectory%%prefix%%fileName%%postfix%.mp4"'),
    '%outputDirectory%%prefix%%fileName%%postfix%.mp4',
    @ffmpegEventDetailCommandID,
    @standardVerificationCommand,
    ('Transcoding to mp4 with ffmpeg')
);
set @ffmpegToMP4CommandID = LAST_INSERT_ID();

INSERT INTO Commands 
    (commandType, command, outputLocation, eventDetailCommand, verificationCommand, description) 
    VALUES (
    (SELECT pk FROM CommandTypes WHERE type = 'bashScript'),
    ('#!/bin/bash
# This file is part of Archivematica.
#
# Copyright 2010-2012 Artefactual Systems Inc. <http://artefactual.com>
#
# Archivematica is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Archivematica is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Archivematica.  If not, see <http://www.gnu.org/licenses/>.

# @package Archivematica
# @subpackage transcoder
# @author Joseph Perry <joseph@artefactual.com>
# @version svn: $Id$

inputFile="%fileFullName%"
outputFile="%outputDirectory%%prefix%%fileName%%postfix%.mkv"
audioCodec="pcm_s16le"
videoCodec="ffv1"
audioStreamCount=`ffprobe "${inputFile}" -show_streams 2>&1 | grep "codec_type=audio" -c`
videoStreamCount=`ffprobe "${inputFile}" -show_streams 2>&1 | grep "codec_type=video" -c`

command="ffmpeg -i \"${inputFile}\" "
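# Set the video codec only when the input has a video stream
if [ ${videoStreamCount} -ge 1 ] ; then
command="${command} -vcodec ${videoCodec} "
fi

# Set the audio codec only when the input has an audio stream
if [ ${audioStreamCount} -ge 1 ] ; then
command="${command} -acodec ${audioCodec}"
fi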

command="${command} ${outputFile}"

addAudioStream=" -acodec ${audioCodec}  -newaudio"
addVideoStream=" -vcodec ${videoCodec}  -newvideo"

#add additional audio channels
for (( c=1; c<${audioStreamCount}; c++ )); do 
command="${command} ${addAudioStream}"
#echo $command
done

for (( c=1; c<${videoStreamCount}; c++ )); do 
command="${command} ${addVideoStream}"
#echo $command
done

echo $command
eval $command
'),
    '%outputDirectory%%prefix%%fileName%%postfix%.mkv',
    @ffmpegEventDetailCommandID,
    @standardVerificationCommand,
    ('Transcoding to mkv with ffmpeg')
);
set @ffmpegToMKVCommandID = LAST_INSERT_ID();

-- End of Commands for handling Video files --
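
The commands above rely on placeholders such as %fileFullName% and %outputDirectory% being replaced before execution. The following sketch shows one way that substitution could work. `fill_placeholders` is a hypothetical helper, and the meaning given to %fileName% (the base name without directory or extension) is an assumption inferred from how the output paths are built, not a documented rule.

```python
import os


def fill_placeholders(template, file_full_name, output_directory,
                      prefix="", postfix=""):
    """Replace the %...% placeholders in a command template with real values."""
    # Assumption: %fileName% is the base name without directory or extension.
    file_name = os.path.splitext(os.path.basename(file_full_name))[0]
    values = {
        "%fileFullName%": file_full_name,
        "%outputDirectory%": output_directory,
        "%prefix%": prefix,
        "%fileName%": file_name,
        "%postfix%": postfix,
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template
```

Because the same template is stored in both the command and outputLocation columns, filling both with the same values gives the path where the command's output should appear.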

Create FileIDs

Second, create the file type. The FileIDs entry is a catch-all that allows future releases to support more than one type of file identification. Every file identification will have a unique corresponding entry in FileIDs. The validPreservationFormat and validAccessFormat fields determine what appears in the normalization report; they are used to identify files at risk of format obsolescence whose normalization failed or that have no normalization command. The FileIDsByExtension entry links a '.mpg' file to the fileID. Files are related to their extension fileIDs in the 'Identify Files ByExtension' micro-service, which creates an entry in the FilesIdentifiedIDs table.

See code:

INSERT INTO FileIDs 
    (description, validPreservationFormat, validAccessFormat)
    VALUES (
    'A .mpg file', FALSE, FALSE
);
set @fileID = LAST_INSERT_ID();

INSERT INTO FileIDsByExtension 
    (Extension, FileIDs)
    VALUES (
    'mpg',
    @fileID
);
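
A rough sketch of what identification by extension does with these two tables is shown below. It runs against an in-memory SQLite database with a simplified, assumed column layout; `identify_by_extension` is a hypothetical helper and not the actual micro-service.

```python
import os
import sqlite3

# Simplified, assumed schema; extensions are stored without the leading dot.
SCHEMA = """
CREATE TABLE FileIDsByExtension (Extension TEXT, FileIDs INTEGER);
CREATE TABLE FilesIdentifiedIDs (fileUUID TEXT, fileID INTEGER);
"""


def identify_by_extension(db, file_uuid, path):
    """Record the FileIDs entry matching the file's extension; return it or None."""
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    row = db.execute(
        "SELECT FileIDs FROM FileIDsByExtension WHERE Extension = ?",
        (extension,),
    ).fetchone()
    if row is None:
        return None
    # Store the identification against the file for the normalization step.
    db.execute(
        "INSERT INTO FilesIdentifiedIDs (fileUUID, fileID) VALUES (?, ?)",
        (file_uuid, row[0]),
    )
    return row[0]
```

Files with an unknown or missing extension return None and get no FilesIdentifiedIDs row, which is the case a default command would have to cover.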

Create relationship between command and fileID

Third, create the relationship between the command and the file identification format. The relationship plays a role (preservation or access) defined by the commandClassification. Note that fileID references the FileIDs table, not the FileIDsByExtension table. The GroupMember field was part of some testing of prioritizing normalization commands based on file identification types (using more than one file identification method); its default value is @fileIDByExtensionDefaultGroupMemberID (0), even if left undefined.

See code:

INSERT INTO CommandRelationships 
    (GroupMember, commandClassification, command, fileID)
    VALUES (
    @fileIDByExtensionDefaultGroupMemberID,
    (SELECT pk FROM CommandClassifications WHERE classification = 'preservation'),
    @ffmpegToMKVCommandID,
    @fileID
);
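
One way to picture a CommandRelationships row is as a lookup keyed on the fileID and the classification together, so the same fileID can carry one command for preservation and another for access (the complete example registers both for the same file type). The sketch below is only an illustration: command_for and the fileID value 102 are hypothetical, and the command descriptions are the ones used in this page's Commands inserts.

```shell
# Hypothetical stand-in for CommandRelationships: "fileID:classification" -> command.
# 102 is a made-up FileIDs pk.
command_for() {
    case "$1:$2" in
        102:preservation) echo "Transcoding to mkv with ffmpeg" ;;
        102:access)       echo "Transcoding to mp4 with ffmpeg" ;;
        *)                return 1 ;;   # no relationship defined
    esac
}

command_for 102 preservation
command_for 102 access
```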

Create processing link to execute

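Lastly, create the MicroServiceChainLink to be processed by the MCP, together with the TasksConfigs entry that ties the link to the CommandRelationship. Because taskTypePKReference is filled in with LAST_INSERT_ID(), the TasksConfigs insert presumably has to run directly after the CommandRelationships insert from the previous step.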

See code:

INSERT INTO TasksConfigs (taskType, taskTypePKReference, description)
    VALUES
    (8,      LAST_INSERT_ID(), 'Normalize preservation');
INSERT INTO MicroServiceChainLinks (microserviceGroup, currentTask, defaultNextChainLink)     
    VALUES (@microserviceGroup, LAST_INSERT_ID(), @defaultPreservationNormalizationFailedLink);
set @MicroServiceChainLink = LAST_INSERT_ID();
INSERT INTO MicroServiceChainLinksExitCodes (microServiceChainLink, exitCode, nextMicroServiceChainLink) 
    VALUES (@MicroServiceChainLink, 0, @defaultPreservationNormalizationSucceededLink);
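
Read together, the inserts above say that an exit code of 0 leads to the succeeded link and anything else falls through to defaultNextChainLink, which here is the failed link. A minimal sketch of that routing, assuming this reading of the tables; the function names and link names are placeholders rather than real pk values:

```shell
# Placeholder link names stand in for the pk values stored in the tables.
next_chain_link() {
    case "$1" in
        0) echo "normalizationSucceededLink" ;;   # MicroServiceChainLinksExitCodes row
        *) echo "normalizationFailedLink" ;;      # defaultNextChainLink
    esac
}

# Run a command, capture its exit status, and report which link would follow.
run_and_route() {
    if "$@" >/dev/null 2>&1; then
        status=0
    else
        status=$?
    fi
    next_chain_link "$status"
}

run_and_route true    # prints normalizationSucceededLink
run_and_route false   # prints normalizationFailedLink
```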

Complete Code

The entire example, as a single listing:

-- Commands for handling Video files --
INSERT INTO Commands 
    (commandType, command, description) 
    -- VALUES SELECT pk FROM FileIDS WHERE description = 'Normalize Defaults'
    VALUES (
    (SELECT pk FROM CommandTypes WHERE type = 'bashScript'),
    ('echo program=\\"ffmpeg\\"\\; version=\\"`ffmpeg 2>&1 | grep \"FFmpeg version\"`\\"'),
    ('Get event detail text for ffmpeg extraction')
);

set @ffmpegEventDetailCommandID = LAST_INSERT_ID();

INSERT INTO Commands 
    (commandType, command, outputLocation, eventDetailCommand, verificationCommand, description) 
    VALUES (
    (SELECT pk FROM CommandTypes WHERE type = 'bashScript'),
    ('ffmpeg -i "%fileFullName%" -vcodec libx264 -preset medium -crf 18 "%outputDirectory%%prefix%%fileName%%postfix%.mp4"'),
    '%outputDirectory%%prefix%%fileName%%postfix%.mp4',
    @ffmpegEventDetailCommandID,
    @standardVerificationCommand,
    ('Transcoding to mp4 with ffmpeg')
);
set @ffmpegToMP4CommandID = LAST_INSERT_ID();

INSERT INTO Commands 
    (commandType, command, outputLocation, eventDetailCommand, verificationCommand, description) 
    VALUES (
    (SELECT pk FROM CommandTypes WHERE type = 'bashScript'),
    ('#!/bin/bash
# This file is part of Archivematica.
#
# Copyright 2010-2012 Artefactual Systems Inc. <http://artefactual.com>
#
# Archivematica is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Archivematica is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Archivematica.  If not, see <http://www.gnu.org/licenses/>.

# @package Archivematica
# @subpackage transcoder
# @author Joseph Perry <joseph@artefactual.com>
# @version svn: $Id$

inputFile="%fileFullName%"
outputFile="%outputDirectory%%prefix%%fileName%%postfix%.mkv"
audioCodec="pcm_s16le"
videoCodec="ffv1"
audioStreamCount=`ffprobe "${inputFile}" -show_streams 2>&1 | grep "codec_type=audio" -c`
videoStreamCount=`ffprobe "${inputFile}" -show_streams 2>&1 | grep "codec_type=video" -c`

command="ffmpeg -i \"${inputFile}\" "
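# Set the video codec only when the input has a video stream
if [ ${videoStreamCount} -ge 1 ] ; then
command="${command} -vcodec ${videoCodec} "
fi

# Set the audio codec only when the input has an audio stream
if [ ${audioStreamCount} -ge 1 ] ; then
command="${command} -acodec ${audioCodec}"
fi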

command="${command} ${outputFile}"

addAudioStream=" -acodec ${audioCodec}  -newaudio"
addVideoStream=" -vcodec ${videoCodec}  -newvideo"

#add additional audio channels
for (( c=1; c<${audioStreamCount}; c++ )); do 
command="${command} ${addAudioStream}"
#echo $command
done

for (( c=1; c<${videoStreamCount}; c++ )); do 
command="${command} ${addVideoStream}"
#echo $command
done

echo $command
eval $command
'),
    '%outputDirectory%%prefix%%fileName%%postfix%.mkv',
    @ffmpegEventDetailCommandID,
    @standardVerificationCommand,
    ('Transcoding to mkv with ffmpeg')
);
set @ffmpegToMKVCommandID = LAST_INSERT_ID();

-- End of Commands for handling Video files --

-- ADD Normalization Path for .MPEG --
INSERT INTO FileIDs 
    (description, validPreservationFormat, validAccessFormat)
    VALUES (
    'A .mpeg file', FALSE, FALSE
);
set @fileID = LAST_INSERT_ID();

INSERT INTO FileIDsByExtension 
    (Extension, FileIDs)
    VALUES (
    'mpeg',
    @fileID
);

INSERT INTO FileIDGroupMembers 
    (fileID, groupID) 
    VALUES (@fileID, @videoGroup);

INSERT INTO CommandRelationships 
    (GroupMember, commandClassification, command, fileID)
    VALUES (
    @fileIDByExtensionDefaultGroupMemberID,
    (SELECT pk FROM CommandClassifications WHERE classification = 'preservation'),
    @ffmpegToMKVCommandID,
    @fileID
);

INSERT INTO TasksConfigs (taskType, taskTypePKReference, description)
    VALUES
    (8,      LAST_INSERT_ID(), 'Normalize preservation');
INSERT INTO MicroServiceChainLinks (microserviceGroup, currentTask, defaultNextChainLink)     
    VALUES (@microserviceGroup, LAST_INSERT_ID(), @defaultPreservationNormalizationFailedLink);
set @MicroServiceChainLink = LAST_INSERT_ID();
INSERT INTO MicroServiceChainLinksExitCodes (microServiceChainLink, exitCode, nextMicroServiceChainLink) 
    VALUES (@MicroServiceChainLink, 0, @defaultPreservationNormalizationSucceededLink);


INSERT INTO CommandRelationships 
    (GroupMember, commandClassification, command, fileID)
    VALUES (
    @fileIDByExtensionDefaultGroupMemberID,
    (SELECT pk FROM CommandClassifications WHERE classification = 'access'),
    @ffmpegToMP4CommandID,
    @fileID
);

INSERT INTO TasksConfigs (taskType, taskTypePKReference, description)
    VALUES
    (8,      LAST_INSERT_ID(), 'Normalize access');
INSERT INTO MicroServiceChainLinks (microserviceGroup, currentTask, defaultNextChainLink)     
    VALUES (@microserviceGroup, LAST_INSERT_ID(), @defaultAccessNormalizationFailedLink);
set @MicroServiceChainLink = LAST_INSERT_ID();
INSERT INTO MicroServiceChainLinksExitCodes (microServiceChainLink, exitCode, nextMicroServiceChainLink) 
    VALUES (@MicroServiceChainLink, 0, @defaultAccessNormalizationSucceededLink);


-- End Of ADD Normalization Path for .MPEG --

Future Development

We are considering building a Format_policy_registry.