Difference between revisions of "Adding Format Identification Tools"

From Archivematica

Revision as of 19:10, 31 May 2013

This page shows a developer how to add a new tool that identifies file formats (file IDs), and how to allow normalization based on the newly identified formats.

The workflow this page implements is:

  • Make the fileID tool a selectable choice
  • Specify that normalization will use that fileIDType
  • Identify files using that tool, with valid Archivematica fileIDs (format IDs)
  • Use those format IDs and their associated commands to normalize files

Since each link in the workflow depends on the next one, we need to start at the end and work our way back.

All pks (primary keys) in the database are UUIDs generated with 'uuid -v4' on the command line.
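Where the uuid command-line tool is not installed, version 4 UUIDs can also be produced with Python's standard library. The following is a minimal sketch; the helper name new_pk is made up for illustration and is not part of Archivematica.

```python
import uuid


def new_pk() -> str:
    """Return a random (version 4) UUID as a lowercase string.

    Hypothetical helper: the name is an assumption for this example only.
    """
    return str(uuid.uuid4())


# Generate a few fresh primary keys to paste into INSERT statements.
fresh_pks = [new_pk() for _ in range(3)]
for pk in fresh_pks:
    print(pk)
```

Each call returns a new random value, so the printed keys differ on every run.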

Add tool

To be included in the Archivematica packages, a new tool should itself be available as a package, so that it can be listed as a dependency.

FileIDTypes

First, the new tool needs an entry in the FileIDTypes table.

INSERT INTO `FileIDTypes` (`pk`, `description`, `replaces`, `lastModified`, `enabled`) VALUES ('afdbee13-eec5-4182-8c6c-f5638ee290f3','FileIDByFIDO',NULL,'2013-03-12 01:54:24',1);

Identify Files Script

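The script that identifies the file must be created, and an entry for it added to the FilesIdentifiedIDs table. For FIDO this script is archivematicaFido.py (https://github.com/artefactual/archivematica/blob/642a3df29707e014f6c9ee5b7ea64785454c76cc/src/MCPClient/lib/clientScripts/archivematicaFido.py). All the arguments the script accepts are listed as arguments in StandardTasksConfigs, below.

This script must also be registered with the client as a supported module by adding it to archivematicaClientModules (https://github.com/artefactual/archivematica/blob/642a3df29707e014f6c9ee5b7ea64785454c76cc/src/MCPClient/etc/archivematicaClientModules). Please keep the list in alphabetical order.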

archivematicaFido_v0.0 = %clientScriptsDirectory%archivematicaFido.py  
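As a rough illustration of what such a script's command-line interface could look like, here is a minimal skeleton that accepts the same options the FIDO StandardTasksConfigs row passes (--fileUUID, --SIPUUID, --filePath, --eventIdentifierUUID, --date, --fileGrpUse). It is a sketch only: the use of argparse, the function names and the placeholder body are assumptions, and it is not the actual archivematicaFido.py.

```python
import argparse

# Option names taken from the FIDO "arguments" column in StandardTasksConfigs.
OPTIONS = ("fileUUID", "SIPUUID", "filePath", "eventIdentifierUUID", "date", "fileGrpUse")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that requires every option the task configuration passes."""
    parser = argparse.ArgumentParser(
        description="Hypothetical skeleton of a file identification client script"
    )
    for option in OPTIONS:
        parser.add_argument("--" + option, required=True)
    return parser


def main(argv=None) -> int:
    """Parse the arguments and return the process exit code."""
    args = build_parser().parse_args(argv)
    # A real script would run the identification tool on args.filePath here
    # and record the result; this sketch only echoes what it received.
    print("would identify", args.filePath, "for file", args.fileUUID)
    return 0  # 0 is the exit code the workflow maps to success
```

A real script would end with something like `raise SystemExit(main())` so that the exit code reaches the workflow.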

Run tool

Now, we set up the chain link that will run the script.

FIDO workflow link example:

-- Set up the UUIDs to be used as pks
SET @TasksConfigPKReference = '46883944-8561-44d0-ac50-e1c3fd9aeb59';
SET @TasksConfig = '7f786b5c-c003-4ef1-97c2-c2269a04e89a';
SET @MicroServiceChainLink = '4c4281a1-43cd-4c6e-b1dc-573bd1a23c43';
SET @MicroServiceChainLinksExitCodes = 'd7653bbd-cd71-473d-b09e-fdd5b36a1d65';

-- Where to go once completed is the same for both
SET @YLink = '83484326-7be7-4f9f-b252-94553cd42370';
SET @defaultNextChainLink = @YLink;
SET @NextMicroServiceChainLink = @YLink;

-- Assumption: the original example never sets @microserviceGroup;
-- 'Normalize' is the group used by the Set Selection link below.
SET @microserviceGroup = 'Normalize';

INSERT INTO StandardTasksConfigs (pk, filterFileEnd, filterFileStart, filterSubDir, requiresOutputLock, standardOutputFile, standardErrorFile, execute, arguments)
    VALUES (@TasksConfigPKReference, NULL, NULL, 'objects/', FALSE, NULL, NULL, 'archivematicaFido_v0.0', '--fileUUID "%fileUUID%" --SIPUUID "%SIPUUID%" --filePath "%relativeLocation%" --eventIdentifierUUID "%taskUUID%" --date "%date%" --fileGrpUse "%fileGrpUse%"');

  • This creates the configuration for running the script
  • filterFileEnd, filterFileStart, filterSubDir: which files to run the identification script on. In this case, all the files in the objects/ subdirectory
  • execute: the script to execute, as it was defined in archivematicaClientModules
  • arguments: everything that should be passed to the script on the command line. Include both each argument and the value to pass to it.

INSERT INTO TasksConfigs (pk, taskType, taskTypePKReference, description)
    VALUES
    (@TasksConfig, 'a6b1c323-7d36-428e-846a-e7e819423577', @TasksConfigPKReference, 'Identify file formats with FIDO');

  • This maps between the task (inserted below) and its configuration
  • taskType: the task should run on every file, and a6b1c323-7d36-428e-846a-e7e819423577 is the pk of that task type
  • taskTypePKReference: semantically a foreign key to StandardTasksConfigs, where the configuration is stored

INSERT INTO MicroServiceChainLinks (pk, microserviceGroup, currentTask, defaultNextChainLink)
    VALUES (@MicroServiceChainLink, @microserviceGroup, @TasksConfig, @defaultNextChainLink);

  • This creates the actual link to run
  • microserviceGroup: the microservice group the link belongs to (the Set Selection link below uses 'Normalize')
  • currentTask: points to the TasksConfigs row just inserted
  • defaultNextChainLink: where to go when no exit code entry applies, usually on failure

INSERT INTO MicroServiceChainLinksExitCodes (pk, microServiceChainLink, exitCode, nextMicroServiceChainLink)
    VALUES (@MicroServiceChainLinksExitCodes, @MicroServiceChainLink, 0, @NextMicroServiceChainLink);
-- Since we are working backwards, a link added before this one continues to it
SET @NextMicroServiceChainLink = @MicroServiceChainLink;

  • This configures where to go once the MicroServiceChainLink is complete, based on the exit code
  • microServiceChainLink: the chain link just defined
  • exitCode: the exit code this entry is valid for. In this case, 0 = success. The exit codes listed must match what the script returns
  • nextMicroServiceChainLink: where to go next. The last link, which returns to the processing workflow, already exists and must be set as the next link to process. The link is 83484326-7be7-4f9f-b252-94553cd42370

The following query confirms that this link exists:

SELECT
    MicroServiceChainLinks.pk, TasksConfigs.description
FROM
    MicroServiceChainLinks
    JOIN TasksConfigs ON TasksConfigs.pk = MicroServiceChainLinks.currentTask
WHERE
    MicroServiceChainLinks.pk='83484326-7be7-4f9f-b252-94553cd42370';
+--------------------------------------+---------------------------------------------------------------+
| pk                                   | description                                                   |
+--------------------------------------+---------------------------------------------------------------+
| 83484326-7be7-4f9f-b252-94553cd42370 | Resume after normalization file identification tool selected. |
+--------------------------------------+---------------------------------------------------------------+
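The relationship between the exit code rows and defaultNextChainLink can be pictured as a lookup with a fallback. The sketch below is an assumption about the behaviour described above, not the MCP server's actual code; the function name and the tuple layout are hypothetical.

```python
RESUME_LINK = "83484326-7be7-4f9f-b252-94553cd42370"

# (exitCode, nextMicroServiceChainLink) pairs for one chain link,
# mirroring the single MicroServiceChainLinksExitCodes row inserted for FIDO.
EXIT_CODE_ROWS = [(0, RESUME_LINK)]


def next_chain_link(exit_code, exit_code_rows, default_next_link):
    """Return the link mapped to exit_code, or the default when no row matches."""
    for row_exit_code, next_link in exit_code_rows:
        if row_exit_code == exit_code:
            return next_link
    return default_next_link


# In the FIDO example both paths lead to the same link.
print(next_chain_link(0, EXIT_CODE_ROWS, RESUME_LINK))
print(next_chain_link(1, EXIT_CODE_ROWS, RESUME_LINK))
```

Because the example sets both @defaultNextChainLink and @NextMicroServiceChainLink to the same link, processing resumes at the same place whether or not the script succeeds.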

Add workflow

Set Selection

The first step in the workflow records the selected tool as the one to use during normalization. This is done by inserting a row into the unit's variables table for the variable normalizationFileIdentificationToolIdentifierTypes. The value is a fragment of a SQL query, used in linkTaskManagerSplitOnFileIdAndruleset.py (https://github.com/artefactual/archivematica/blob/master/src/MCPServer/lib/linkTaskManagerSplitOnFileIdAndruleset.py) to restrict the fileIDs used to the desired type.

For FIDO:

INSERT INTO `MicroServiceChainLinks` (`pk`, `currentTask`, `defaultNextChainLink`, `defaultPlaySound`, `microserviceGroup`, `reloadFileList`, `defaultExitMessage`, `replaces`, `lastModified`) 
    VALUES ('982229bd-73b8-432e-a1d9-2d9d15d7287d','1e516ea6-6814-4292-9ea9-552ebfaa0d23','4c4281a1-43cd-4c6e-b1dc-573bd1a23c43',NULL,'Normalize',1,'Failed',NULL,'2012-10-23 19:41:23');
INSERT INTO `TasksConfigs` (`pk`, `taskType`, `taskTypePKReference`, `description`, `replaces`, `lastModified`) 
    VALUES ('1e516ea6-6814-4292-9ea9-552ebfaa0d23','6f0b612c-867f-4dfd-8e43-5b35b7f882d7','f130c16d-d419-4063-8c8b-2e4c3ad138bb','Set SIP to normalize with FIDO file identification.',NULL,'2012-10-23 19:41:23');
INSERT INTO `TasksConfigsSetUnitVariable` (`pk`, `variable`, `variableValue`, `microServiceChainLink`, `createdTime`, `updatedTime`) 
    VALUES ('f130c16d-d419-4063-8c8b-2e4c3ad138bb','normalizationFileIdentificationToolIdentifierTypes','FileIDTypes.pk = \'afdbee13-eec5-4182-8c6c-f5638ee290f3\'',NULL,'2012-10-23 19:41:23','0000-00-00 00:00:00');
INSERT INTO `MicroServiceChainLinksExitCodes` (`pk`, `microServiceChainLink`, `exitCode`, `nextMicroServiceChainLink`, `playSound`, `exitMessage`, `replaces`, `lastModified`) 
    VALUES ('82c97f8d-087d-4636-9dd9-59bbc04e6520','982229bd-73b8-432e-a1d9-2d9d15d7287d',0,'4c4281a1-43cd-4c6e-b1dc-573bd1a23c43',NULL,'Completed successfully',NULL,'2012-10-23 21:39:43');
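To make the idea of a stored SQL fragment concrete, the sketch below splices the FIDO variableValue into a WHERE clause against a reduced, in-memory copy of FileIDTypes. The query, the function name and the second table row are assumptions for illustration; the real query in linkTaskManagerSplitOnFileIdAndruleset.py is not reproduced here.

```python
import sqlite3

# The value stored for normalizationFileIdentificationToolIdentifierTypes
# (the \' escapes in the INSERT above become plain quotes once stored).
FIDO_FRAGMENT = "FileIDTypes.pk = 'afdbee13-eec5-4182-8c6c-f5638ee290f3'"


def matching_types(conn, fragment):
    """Return the descriptions of the FileIDTypes rows selected by the fragment."""
    # The fragment comes from a trusted workflow table, so it is appended as-is.
    sql = "SELECT description FROM FileIDTypes WHERE " + fragment
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FileIDTypes (pk TEXT PRIMARY KEY, description TEXT, enabled INTEGER)")
conn.executemany(
    "INSERT INTO FileIDTypes VALUES (?, ?, ?)",
    [
        ("afdbee13-eec5-4182-8c6c-f5638ee290f3", "FileIDByFIDO", 1),
        # Hypothetical second row, present only to show the restriction working.
        ("00000000-0000-4000-8000-000000000000", "hypothetical other type", 1),
    ],
)
print(matching_types(conn, FIDO_FRAGMENT))  # ['FileIDByFIDO']
```

Only the row whose pk matches the fragment is returned, which is how the stored value narrows normalization to the fileIDs of the selected tool.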

Chain Choice

The choice of file identification tool is made at link f4dea20e-f3fe-4a37-b20f-0e70a7bc960e.

Additional choices can be added by inserting entries into MicroServiceChainChoice. The choices currently available at that link can be listed with:

SELECT 
    chainAvailable, startingLink, description 
FROM 
    MicroServiceChainChoice 
    JOIN MicroServiceChains ON chainAvailable = MicroServiceChains.pk 
WHERE 
    choiceAvailableAtLink = 'f4dea20e-f3fe-4a37-b20f-0e70a7bc960e';
+--------------------------------------+--------------------------------------+---------------------+
| chainAvailable                       | startingLink                         | description         |
+--------------------------------------+--------------------------------------+---------------------+
| 229e34d9-3768-4b78-97b7-6cd4a2f07868 | b549130c-943b-4791-b1f6-93b837990138 | extension (default) |
| c44e0251-1c69-482d-a679-669b70d09fb1 | 56b42318-3eb3-466c-8a0d-7ac272136a96 | FITS - DROID        |
| 1d8836cf-ac02-437c-9283-4ddb7b018810 | 37f2e794-6485-4524-a384-37b3209916ed | FITS - ffident      |
| d607f083-7c86-49a2-bc36-06a03db28a80 | 766b23ad-65ed-46a3-aa2e-b9bdaf3386d0 | FITS - JHOVE        |
| 586006d1-f3af-4b5f-9f1a-c893244fa7a9 | d7a0e33d-aa3c-435f-a6ef-8e39f2e7e3a0 | FITS - summary      |
| 50f47870-3932-4a88-879d-d021a24758ad | f87f13d2-8aae-45c9-bc8a-e5c32a37654e | FITS - file utility |
| c76624a8-6f85-43cf-8ea7-0663502c712f | 982229bd-73b8-432e-a1d9-2d9d15d7287d | FIDO                |
+--------------------------------------+--------------------------------------+---------------------+

This creates the chain that runs the FIDO scripts:

INSERT INTO MicroServiceChains (pk, startingLink, description)
    VALUES ('c76624a8-6f85-43cf-8ea7-0663502c712f', '982229bd-73b8-432e-a1d9-2d9d15d7287d', 'FIDO');

  • MicroServiceChains.startingLink: the Set Selection chain link defined above (982229bd-73b8-432e-a1d9-2d9d15d7287d)
  • MicroServiceChains.description: name of the tool

Finally, we add the actual choice:

INSERT INTO MicroServiceChainChoice (pk, choiceAvailableAtLink, chainAvailable)
    VALUES ('e95b8f27-ea52-4247-bdf0-615273bc5fca', 'f4dea20e-f3fe-4a37-b20f-0e70a7bc960e', 'c76624a8-6f85-43cf-8ea7-0663502c712f');

  • MicroServiceChainChoice.choiceAvailableAtLink: the MicroServiceChainLink where the choice is made available. As noted above, this is f4dea20e-f3fe-4a37-b20f-0e70a7bc960e
  • MicroServiceChainChoice.chainAvailable: the chain just added
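The two inserts and the listing query can be tried together without touching a real Archivematica database. The sketch below uses an in-memory SQLite database with reduced versions of the two tables (only the columns used on this page); the reduced schema and the function name choices_at are assumptions for illustration.

```python
import sqlite3

CHOICE_LINK = "f4dea20e-f3fe-4a37-b20f-0e70a7bc960e"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MicroServiceChains (pk TEXT PRIMARY KEY, startingLink TEXT, description TEXT);
CREATE TABLE MicroServiceChainChoice (pk TEXT PRIMARY KEY, choiceAvailableAtLink TEXT, chainAvailable TEXT);

-- The chain, then the choice that exposes it, as in the FIDO example.
INSERT INTO MicroServiceChains (pk, startingLink, description)
    VALUES ('c76624a8-6f85-43cf-8ea7-0663502c712f', '982229bd-73b8-432e-a1d9-2d9d15d7287d', 'FIDO');
INSERT INTO MicroServiceChainChoice (pk, choiceAvailableAtLink, chainAvailable)
    VALUES ('e95b8f27-ea52-4247-bdf0-615273bc5fca', 'f4dea20e-f3fe-4a37-b20f-0e70a7bc960e', 'c76624a8-6f85-43cf-8ea7-0663502c712f');
""")


def choices_at(conn, link):
    """List (chainAvailable, startingLink, description) for the choices at a link."""
    sql = """
        SELECT chainAvailable, startingLink, description
        FROM MicroServiceChainChoice
        JOIN MicroServiceChains ON chainAvailable = MicroServiceChains.pk
        WHERE choiceAvailableAtLink = ?
    """
    return conn.execute(sql, (link,)).fetchall()


for row in choices_at(conn, CHOICE_LINK):
    print(row)
```

If the chain row is missing, or chainAvailable does not match its pk, the join returns nothing, which is a quick way to spot a mistyped UUID before loading the inserts into the real database.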

Add FPR rules

The local FPR rules can be populated by creating a temporary script and running it against a set of sample files with extensions. Use the FileIDs for those extensions to populate the values for the "already in preservation/access format" status and the related commands.

FIDO example: https://github.com/artefactual/archivematica/commit/ea02f7d6ffc420d5675fcd3bf261c63870881a65