Difference between revisions of "Micro-services"

From Archivematica
(Undo revision 4292 by Evelyn McLellan (talk))
Line 18: Line 18:
 
!style="width:70%"|'''Description'''
 
!style="width:70%"|'''Description'''
 
|-
 
|-
|backupSIP
+
|Create SIP backup
|Create a backup of the entire SIP as soon as it is ingested.
+
|Creates a backup of the SIP. By default these are stored in /sharedDirectoryStructure/SIPbackups/. The backups are automatically removed at the end of SIP processing, when the AIP has been moved to archival storage.
 
|-
 
|-
|verifySIPcompliance
+
|Verify SIP compliance
|Verify that the SIP conforms to the folder structure required for processing in Archivematica.
+
|Verify that the SIP conforms to the folder structure required for processing in Archivematica. The structure is as follows: ''/logs/'', ''/logs/fileMeta/'', ''/metadata/'', ''/metadata/submissionDocumentation/'', ''/objects/''.
 
|-
 
|-
|assignIdentifier
+
|Assign file UUIDs and checksums
|Each file in the SIP is assigned a universal unique identifier and a sha-1 checksum for future integrity checks.
+
|Assigns file UUIDs and generates checksums for each file in the /objects/ directory. This step also creates the PREMIS files located in the /logs/fileMeta/ directory. The files in this directory are named based on the fileUUID of the file they represent.
 
|-
 
|-
|verifyChecksums
+
|Verify metadata directory checksums
|If the ingested SIP already contains a checksum file, this micro-service will check it to confirm that none of the files were deleted or altered upon transfer to Archivematica.
+
|Checks any checksum files that were placed in the /metadata/ folder of the SIP prior to ingest. Note that the filenames need to be named based on their algorithm: ''checksum.sha1'', ''checksum.sha256'', ''checksum.md5''.
 
|-
 
|-
|createDublinCore
+
|Remove thumbs.db files
|If the ingested SIP does not already contain one, a Dublin Core xml template is added to the metadata folder in the SIP. The user can fill in fields as desired. These values are uploaded to the access system as part of the DIP created by Archivematica.
+
|Removes any [http://en.wikipedia.org/wiki/Windows_thumbnail_cache Thumbs.db] files. May be expanded to others in future releases.
 
|-
 
|-
|appraiseForSubmission
+
|Create Dublin Core template
|The user may review the SIP to confirm that it complies with any submission agreements. The user can delete unwanted files at this point; a log of the deleted files will be added to the information package.
+
|If the ingested SIP does not already contain one, a Dublin Core xml template is added to the ''/metadata/'' folder in the SIP. The user can fill in fields as desired. These values are uploaded to the access system as part of the DIP created by Archivematica.
 
|-
 
|-
|quarantine
+
|Set file permissions
|The SIP is placed in quarantine for a pre-set period of time. The user can move the SIP out of quarantine before the pre-set time has expired, if desired.
+
|Changes file permissions on the SIP to allow the user to modify the SIP contents.
 
|-
 
|-
|extractPackages
+
|Appraise SIP for submission
|Files are extracted from any .zip, .tar or other file package formats; each extracted file is assigned a universal unique identifier and a sha-1 checksum.
+
|Manual approval step. Review the SIP to confirm that it conforms to any submission agreements and remove files and folders if desired. Do not move or rename files or folders as this will cause them to be excluded from the AIP.
 
|-
 
|-
|sanitizeNames
+
|Scan for removed files post appraise SIP for submission
|Prohibited characters which may cause processing errors on known operating systems (e.g. spaces or ampersands) are removed from file and directory names and replaced with underscores.
+
|Checks to see if any files were deleted and creates a list of them at ''/logs/removedFilesAppraiseSIPForSubmission.log''.
 
|-
 
|-
|virusScan
+
|Place in quarantine
|ClamAV scans all files in the SIP. In the event that a virus or other malware is found, the SIP is placed in a folder called SIPerrors and all processing on the SIP is stopped.
+
|Places SIP in quarantine for a pre-set period of time. The purpose of this is to allow time for new viruses to be identified, and antivirus groups to update their virus definitions. Note: for demonstration purposes, the quarantine period is set to a minute.
 
|-
 
|-
|validateFormatsAndExtractMetadata
+
|Remove from quarantine
|File formats are identified and the files validated against external format specifications. Technical metadata is extracted from the file.
+
|Archivematica uses a cron job to periodically check for SIPs that have met the configured quarantine time. Keeping in mind the purpose of the quarantine period, if you know the virus definitions are up to date for any virus possibly contained in the SIP (eg. The SIP source is a cd from 4 years ago) then you can remove it from quarantine manually.
 +
|-
 +
|Extract packages
 +
|Extracts objects from any zipped files or other packages.
 +
|-
 +
|Sanitize file and directory names
 +
|Some file systems do not support unicode or other special characters in filenames. This micro-service removes prohibited characters and replaces them with dashes. Original filenames are preserved in the PREMIS metadata.
 +
|-
 +
|Scan for viruses
 +
|Uses [http://www.clamav.net/lang/en/ ClamAV], parses the output and creates a PREMIS event. If a virus is found, the SIP is automatically placed in ''/sharedPath/watchedDirectories/failed/''.
 +
|-
 +
|Characterize and extract metadata
 +
|Identifies and validates formats and extracts object metadata using the [http://code.google.com/p/fits/ File Information Tool Set (FITS)]. Adds output to the PREMIS files.
 +
|-
 +
|Set file permissions
 +
|Changes file permissions on the SIP to allow the user to modify the SIP contents.
 
|-
 
|-
 
|appraiseForPreservation
 
|appraiseForPreservation

Revision as of 18:28, 8 August 2011



Micro-service.png

The Archivematica micro-services are granular system tasks which operate on a conceptual entity that is equivalent to an OAIS information package: a Submission Information Package (SIP), Archival Information Package (AIP) or Dissemination Information Package (DIP). The physical structure of an information package includes files, checksums, logs, XML metadata, etc.

These information packages are moved from one service to the next using the well-established Unix pipeline design pattern. Each micro-service is defined in a simple XML configuration file and associated with a watched directory. When an information package is moved to that directory it triggers the micro-service.
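As a rough illustration of the watched-directory trigger described above, the following Python sketch polls a directory and calls a handler once for each newly arrived package. It is only a sketch under stated assumptions: the function name `poll_watched_directory`, the `seen` set and the polling approach are hypothetical, and the XML configuration file that defines each micro-service is not modelled here.

```python
import os


def poll_watched_directory(watched_dir, seen, handler):
    """Trigger handler(path) once for every entry that is new since the last poll.

    Hypothetical helper: `seen` is a set of entry names already handled,
    and `handler` stands in for the micro-service tied to this directory.
    """
    new_entries = sorted(set(os.listdir(watched_dir)) - seen)
    for name in new_entries:
        seen.add(name)  # remember the package so it is not triggered twice
        handler(os.path.join(watched_dir, name))
    return new_entries
```

Calling the function repeatedly (for example from a loop with a short sleep) triggers the handler only for packages that were moved into the directory since the previous call.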

Each service is provided by a combination of Archivematica Python scripts and one or more of the free, open-source software tools bundled in the Archivematica system. Each micro-service results in a success or error state, and the information package is moved accordingly to a success or error directory. Each success or error directory is the watched directory for a subsequent micro-service. This allows for the chaining of directories into complex, custom workflows. Archivematica implements a default ingest-to-access workflow that is compliant with the ISO-OAIS functional model.
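The success/error routing described above can be sketched in a few lines of Python. This is a minimal illustration, not Archivematica's own code: `run_micro_service`, its parameters and the convention that a task signals failure by raising an exception are all assumptions made for the example.

```python
import os
import shutil


def run_micro_service(package_path, task, success_dir, error_dir):
    """Run task(package_path), then move the package to the matching directory.

    Hypothetical helper: a task that raises an exception is treated as the
    error state; a task that returns normally is treated as success.
    """
    try:
        task(package_path)
        destination = success_dir
    except Exception:
        destination = error_dir
    os.makedirs(destination, exist_ok=True)
    target = os.path.join(destination, os.path.basename(package_path))
    # shutil.move returns the final location of the package
    return shutil.move(package_path, target)
```

Chaining then amounts to using the success directory of one call as the watched directory of the next micro-service, while the error directory can feed a separate handling step.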

{|
|+ Archivematica Micro-services
!'''Micro-service'''
!style="width:70%"|'''Description'''
|-
|Create SIP backup
|Creates a backup of the SIP. By default these are stored in /sharedDirectoryStructure/SIPbackups/. The backups are automatically removed at the end of SIP processing, when the AIP has been moved to archival storage.
|-
|Verify SIP compliance
|Verifies that the SIP conforms to the folder structure required for processing in Archivematica. The structure is as follows: ''/logs/'', ''/logs/fileMeta/'', ''/metadata/'', ''/metadata/submissionDocumentation/'', ''/objects/''.
|-
|Assign file UUIDs and checksums
|Assigns file UUIDs and generates checksums for each file in the /objects/ directory. This step also creates the PREMIS files located in the /logs/fileMeta/ directory. The files in this directory are named after the fileUUID of the file they represent.
|-
|Verify metadata directory checksums
|Checks any checksum files that were placed in the /metadata/ folder of the SIP prior to ingest. Note that the checksum files must be named after their algorithm: ''checksum.sha1'', ''checksum.sha256'', ''checksum.md5''.
|-
|Remove thumbs.db files
|Removes any [http://en.wikipedia.org/wiki/Windows_thumbnail_cache Thumbs.db] files. May be expanded to other file types in future releases.
|-
|Create Dublin Core template
|If the ingested SIP does not already contain one, a Dublin Core XML template is added to the ''/metadata/'' folder in the SIP. The user can fill in fields as desired. These values are uploaded to the access system as part of the DIP created by Archivematica.
|-
|Set file permissions
|Changes file permissions on the SIP to allow the user to modify the SIP contents.
|-
|Appraise SIP for submission
|Manual approval step. Review the SIP to confirm that it conforms to any submission agreements and remove files and folders if desired. Do not move or rename files or folders, as this will cause them to be excluded from the AIP.
|-
|Scan for removed files post appraise SIP for submission
|Checks to see if any files were deleted and creates a list of them at ''/logs/removedFilesAppraiseSIPForSubmission.log''.
|-
|Place in quarantine
|Places the SIP in quarantine for a pre-set period of time. The purpose of this is to allow time for new viruses to be identified and for antivirus groups to update their virus definitions. Note: for demonstration purposes, the quarantine period is set to one minute.
|-
|Remove from quarantine
|Archivematica uses a cron job to periodically check for SIPs that have met the configured quarantine time. Keeping in mind the purpose of the quarantine period, if you know the virus definitions are already up to date for any virus the SIP could contain (e.g. the SIP source is a CD from four years ago), you can remove it from quarantine manually.
|-
|Extract packages
|Extracts objects from any zipped files or other packages.
|-
|Sanitize file and directory names
|Some file systems do not support Unicode or other special characters in filenames. This micro-service removes prohibited characters and replaces them with dashes. Original filenames are preserved in the PREMIS metadata.
|-
|Scan for viruses
|Uses [http://www.clamav.net/lang/en/ ClamAV], parses the output and creates a PREMIS event. If a virus is found, the SIP is automatically placed in ''/sharedPath/watchedDirectories/failed/''.
|-
|Characterize and extract metadata
|Identifies and validates formats and extracts object metadata using the [http://code.google.com/p/fits/ File Information Tool Set (FITS)]. Adds output to the PREMIS files.
|-
|Set file permissions
|Changes file permissions on the SIP to allow the user to modify the SIP contents.
|-
|appraiseForPreservation
|The user may appraise the contents of the SIP and delete unwanted files; a log of the deleted files is added to the information package.
|-
|normalize
|Normalizes each file in the SIP into a preservation format copy and an access format copy, according to the preservation plan for its media type. These copies are packaged along with the original file in the AIP.
|-
|compilePreservationMetadata
|Compiles a METS file with a complete set of PREMIS metadata for each ingested file. Technical metadata is placed in the PREMIS objectCharacteristicsExtension element.
|-
|createAIPchecksum
|Generates a SHA-1 checksum for all AIP contents.
|-
|prepareAIP
|Packages the AIP using the Library of Congress BagIt specification.
|-
|storeAIP
|The user may review the AIP and approve it for archival storage. The AIP is moved into the AIPsStore folder, which is synced to the storage system.
|-
|generateDIP
|The access copies that were created during the "normalize" micro-service are placed in a DIP folder and the METS file is added to the DIP.
|-
|uploadDIP
|The user may review the DIP and remove any access copies that cannot be sent to the public access system due to copyright, security or other issues. The user then approves the DIP for upload and the DIP is uploaded into the public access system (in Archivematica, the default access system is the open-source archival description tool ICA-AtoM). A backup copy of the DIP, including files that were deleted, is sent to the DIPbackups folder.
|}
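
The folder structure listed for the "Verify SIP compliance" micro-service lends itself to a short worked example. The Python sketch below reports which of the required directories are absent from a SIP. The names `REQUIRED_SIP_DIRECTORIES` and `missing_sip_directories` are hypothetical and chosen for illustration; the actual Archivematica script may perform the check differently.

```python
import os

# Directory layout taken from the "Verify SIP compliance" description above.
REQUIRED_SIP_DIRECTORIES = (
    "logs",
    "logs/fileMeta",
    "metadata",
    "metadata/submissionDocumentation",
    "objects",
)


def missing_sip_directories(sip_path):
    """Return the required directories that do not exist under sip_path."""
    missing = []
    for relative in REQUIRED_SIP_DIRECTORIES:
        # Split on "/" so the check also works where the OS separator differs
        candidate = os.path.join(sip_path, *relative.split("/"))
        if not os.path.isdir(candidate):
            missing.append(relative)
    return missing
```

An empty list means the SIP has the expected structure; a non-empty list could be written to a log before the package is moved to an error directory.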