Administrator manual 1.2

From Archivematica
Revision as of 17:10, 7 August 2014 by Mdemeo (talk | contribs) (→‎Storage service: Document package/space/location structure)
Jump to navigation Jump to search

Main Page > Documentation > Administrator manual 1.2

This manual covers administrator-specific instructions for Archivematica. It will also provide help for using forms in the Administration tab of the Archivematica dashboard and the administrator capabilities in the Format Policy Registry (FPR), which you will find in the Preservation planning tab of the dashboard.

For end-user instructions, please see the user manual.

Installation

Upgrading

Currently, Archivematica does not support upgrading from one version to the next. A re-install is required. After re-installing, you can restore Archivematica's knowledge of your AIPs, by rebuilding the AIP index and, if you have transfers stored in the backlog, rebuilding the transfer index.

Storage service

The Archivematica Storage Service allows the configuration of storage spaces associated with multiple Archivematica pipelines. It allows a storage administrator to configure what storage is available to each Archivematica installation, both locally and remote.

Home page of Storage Service

Storage Service entities and organization

Packages

The Storage Service is oriented storing packages. A "package" is a bundle of one or more files transferred from an external service; for example, a package may be an AIP, a backlogged transfer, or a DIP. Each package is stored in a location.

Spaces

A space models a specific storage device. That device might be a locally-accessible disk, a network share, or a remote system accessible via a protocol like FEDORA, SWIFT, or LOCKSS. The space provides the Storage Service with configuration to read and/or write data stored within itself.

Packages are not stored directly inside a space; instead, packages are stored within locations, which are organized subdivisions of a space.

Locations

A location is a subdivision of a space. Each location is assigned a specific purpose, such as AIP storage or transfer backlog, in order to provide an organized way to structure content within a space.

Archivematica Configuration

When installing Archivematica, options to configure it with the Storage Service will be presented.

Install3.png

If you have installed the Storage Service at a different URL, you may change that here.

The top button 'Use default transfer source & AIP storage locations' will attempt to automatically configure default Locations for Archivematica, register a new Pipeline, and generate an error if the Storage Service is not available. Use this option if you want the Storage Service to automatically set up the configured default values.

The bottom button 'Register this pipeline & set up transfer source and AIP storage locations' will only attempt to register a new Pipeline with the Storage Service, and will not error if not Storage Service can be found. It will also open a link to the provided Storage Service URL, so that Locations can be configured manually. Use this option if the default values not desired, or the Storage Service is not running yet. Locations will have to be configured manually before any Transfers can be processed, or AIPs stored.

If the Storage Service is running, the URL to it should be entered, and Archivematica will attempt to register its dashboard UUID as a new Pipeline. Otherwise, the dashboard UUID is displayed, and a Pipeline for this Archivematica instance can be manually created and configured. The dashboard UUID is also available in Archivematica under Administration -> General.

Change the port in the web server configuration

The storage services uses nginx by default, so you can edit /etc/nginx/sites-enabled/storage and change the line that says

listen 8000;

change 8000 to whatever port you prefer to use.

Keep in mind that in a default installation of Archivematica 1.0, the dashboard is running in Apache on port 80. So it is not possible to make nginx run on port 80 on the same machine. If you install the storage service on its own server, you can set it to use port 80.

Make sure to adjust the dashboard UUID in the Archivematica dashboard under Administration -> General.

Spaces

Spaces.png

A storage Space contains all the information necessary to connect to the physical storage. It is where protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored. All locations must be contained in a space.

A space is usually the immediate parent of the Location folders. For example, if you had transfer source locations at /home/artefactual/archivematica-sampledata-2013-10-10-09-17-20 and /home/artefactual/maildir_transfers, the Space's path would be /home/artefactual/

Currently supported protocols are local filesystem, NFS, and pipeline local filesystem.

Local Filesystem

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. Typically this is the hard drive, SSD or raid array attached to the machine, but it could also encompass remote storage that has already been mounted. For remote storage that has been locally mounted, we recommend using a more specific Space if one is available.

Fields

  • Path: Absolute path to the Space on the local filesystem
  • Size: (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.

NFS

NFS spaces are for NFS exports mounted on the Storage Service server, and the Archivematica pipeline.


Fields

  • Path: Absolute path the space is mounted at on the filesystem local to the storage service
  • Size: (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
  • Remote name: Hostname or IP address of the remote computer exporting the NFS mount.
  • Remote path: Export path on the NFS server
  • Version: nfs or nfs4 - as would be passed to the mount command.
  • Manually Mounted: Check this if it has been mounted already. Otherwise, the Storage Service will try to mount it. Note: this feature is not yet available.

Pipeline Local Filesystem

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host.

For example, the storage service is hosted on storage_service_host and Archivematica is running on archivematica1 . The transfer sources for Archivematica are stored locally on archivematica1, but the storage service needs access to them. The Space for that transfer source would be a Pipeline Local Filesystem.

Note: Passwordless SSH must be set up between the Storage Service host and the computer Archivematica is running on.

Fields

  • Path: Absolute path to the space on the remote machine.
  • Size: (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
  • Remote name: Hostname or IP address of the computer running Archivematica. Should be SSH accessible from the Storage Service computer.
  • Remote user: Username on the remote host


Locations

Locations.png

A storage Location is contained in a Space, and knows its purpose in the Archivematica system. A Location is also where Packages are stored. Each Location is associated with a pipeline and can only be accessed by that pipeline.

Currently, a Location can have one of three purposes: Transfer Source, Currently Processing, or AIP Storage. Transfer source locations display in Archivematica's Transfer tab, and any folder in a transfer source can be selected to become a Transfer. AIP storage locations are where the completed AIPs are put for long-term storage. During processing, Archivematica uses the currently processing location associated with that pipeline. Only one currently processing location should be associated with a given pipeline. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.

Fields

  • Purpose: What use the Location is for
  • Pipeline: Which pipelines this location is available to.
  • Relative Path: Path to this Location, relative to the space that contains it.
  • Description: Description of the Location to be displayed to the user.
  • Quota: (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
  • Enabled: If checked, this location is accessible to pipelines associated with it. If unchecked, it will not show up to any pipeline.

Pipeline

Pipelines.png

A pipeline is an Archivematica instance registered with the Storage Service, including the server and all associated clients. Each pipeline is uniquely identified by a UUID, which can be found in the dashboard under Administration -> General Configuration. When installing Archivematica, it will attempt to register its UUID with the Storage Service, with a description of "Archivematica on <hostname>".

Fields

  • UUID: Unique identifier of the Archivematica pipeline
  • Description: Description of the pipeline displayed to the user. e.g. Sankofa demo site
  • Enabled: If checked, this pipeline can access locations associate with it. If unchecked, all locations will be disabled, even if associated.
  • Default Locations: If checked, the default locations configured in Administration -> Configuration will be created or associated with the new pipeline.


Packages

Packages.png

A Package is a file that Archivematica has stored in the Storage Service, commonly an Archival Information Package (AIP). They cannot be created or deleted through the Storage Service interface, though a deletion request can be submitted through Archivematica that must be approved or rejected by the storage service administrator. To learn more about deleting an AIP, see Deleting an AIP.

Administration

StorageserviceAdmin1.png
StorageserviceAdmin2.png

The Administration section manages the users and settings for the Storage Service.

Users

Only registered users can long into the storage service, and the Users page is where users can be created or modified.

The storage service has two types of users: administrative users, and regular users. In the 0.4.0 release of the storage service, the only distinction between the two types is for email notifications; administrators will be notified by email when special events occur, while regular users will not.

Settings

Settings control the behavior of the Storage Service. Default Locations are the created or associated with pipelines when they are created.

Pipelines are disabled upon creation? sets whether a newly created Pipeline can access its Locations. If a Pipeline is disabled, it cannot access any of its locations. By disabling newly created Pipelines, it provides some security against unwanted perusal of the files in Locations, or use by unauthorized Archivematica instances. This can be configured individually when creating a Pipeline manually through the Storage Service website.

Default Locations set what existing locations should be associated with a newly created Pipeline, or what new Locations should be created for each new Pipeline. No matter what is configured here, a Currently Processing location is created for all Pipelines, since one is required. Multiple Transfer Source or AIP Storage Locations can be configured by holding down Ctrl when selecting them. New Locations in an existing Space can be created for Pipelines that use default locations by entering the relevant information.

How to Configure a Location

For Spaces of the type "Local Filesystem," Locations are basically directories (or more accurately, paths to directories). You can create Locations for Transfer Source, Currently Processing, or AIP Storage directories.

To create and configure a new Location:

  1. In the Storage Service, click on the "Spaces" tab.
  2. Under the Space that you want to add the Location to, click on the "Create Location here" link.
  3. Choose a purpose (e.g. AIP Storage) and pipeline, and enter a "Relative Path" (e.g. var/mylocation) and human-readable description. The Relative Path is relative to the Path defined in the Space you are adding the Location to, e.g. for the default Space, the Path is '/' so your Location path would be relative to that (in the example here, the complete path would end up being '/var/mylocation'). Note: if the path you are defining in your Location doesn't exist, you must create it manually and make sure it is writable by the archivematica user.
  4. Save the Location settings.
  5. The new location will now be available as an option under the appropriate options in the Dashboard, for example as a Transfer location (which must be enabled under the Dashboard "Administration" tab) or as a destination for AIP storage.

Store DIP

Dashboard administration tab

The Archivematica administration pages, under the Administration tab of the dashboard, allows you to configure application components and manage users.

Processing configuration

When processing a SIP or transfer, you may want to automate some of the workflow choices. Choices can be preconfigured by putting a 'processingMCP.xml' file into the root directory of a SIP/transfer.

If a SIP or transfer is submitted with a 'processingMCP.xml' file, processing decisions will be made with the included file.

The XML file format is:

<processingMCP>
  <preconfiguredChoices>
    <!-- Send to quarantine? -->
    <preconfiguredChoice>
      <appliesTo>755b4177-c587-41a7-8c52-015277568302</appliesTo>
      <goToChain>d4404ab1-dc7f-4e9e-b1f8-aa861e766b8e</goToChain>
    </preconfiguredChoice>
    <!-- Display metadata reminder -->
    <preconfiguredChoice>
      <appliesTo>eeb23509-57e2-4529-8857-9d62525db048</appliesTo>
      <goToChain>5727faac-88af-40e8-8c10-268644b0142d</goToChain>
    </preconfiguredChoice>
    <!-- Remove from quarantine -->
    <preconfiguredChoice>
      <appliesTo>19adb668-b19a-4fcb-8938-f49d7485eaf3</appliesTo>
      <goToChain>333643b7-122a-4019-8bef-996443f3ecc5</goToChain>
      <delay unitCtime="yes">2419200.0</delay>
    </preconfiguredChoice>
    <!-- Extract packages -->
    <preconfiguredChoice>
      <appliesTo>dec97e3c-5598-4b99-b26e-f87a435a6b7f</appliesTo>
      <goToChain>01d80b27-4ad1-4bd1-8f8d-f819f18bf685</goToChain>
    </preconfiguredChoice>
    <!-- Delete extracted packages -->
    <preconfiguredChoice>
      <appliesTo>f19926dd-8fb5-4c79-8ade-c83f61f55b40</appliesTo>
      <goToChain>85b1e45d-8f98-4cae-8336-72f40e12cbef</goToChain>
    </preconfiguredChoice>
    <!-- Select pre-normalize file format identification command -->
    <preconfiguredChoice>
      <appliesTo>7a024896-c4f7-4808-a240-44c87c762bc5</appliesTo>
      <goToChain>3c1faec7-7e1e-4cdd-b3bd-e2f05f4baa9b</goToChain>
    </preconfiguredChoice>
    <!-- Select compression algorithm -->
    <preconfiguredChoice>
      <appliesTo>01d64f58-8295-4b7b-9cab-8f1b153a504f</appliesTo>
      <goToChain>9475447c-9889-430c-9477-6287a9574c5b</goToChain>
    </preconfiguredChoice>
    <!-- Select compression level -->
    <preconfiguredChoice>
      <appliesTo>01c651cb-c174-4ba4-b985-1d87a44d6754</appliesTo>
      <goToChain>414da421-b83f-4648-895f-a34840e3c3f5</goToChain>
    </preconfiguredChoice>
  </preconfiguredChoices>
</processingMCP>

Where appliesTo is the UUID associated with the micro-service job presented in the dashboard, and goToChain is the UUID of the desired selection. The default processingMCP.xml file is located at '/var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml'.

The processing configuration administration page of the dashboard provides you with an easy form to configure the default 'processingMCP.xml' that's added to a SIP or transfer if it doesn't already contain one. When you change the options using the web interface the necessary XML will be written behind the scenes.

Processing configuration form in Administration tab of the dashboard


  • For the approval (yes/no) steps, the user ticks the box on the left-hand side to make a choice. If the box is not ticked, the approval step will appear in the dashboard.
  • For the other steps, if no actions are selected the choices appear in the dashboard
  • You can select whether or not to send transfers to quarantine (yes/no) and decide how long you'd like them to stay there.
  • You can select whether to extract packages as well as whether to keep and/or delete the extracted objects and/or the package itself.
  • You can approve normalization, sending the AIP to storage, and uploading the DIP without interrupting the workflow in the dashboard.
  • You can pre-select which format identification tool and command to run in both/either transfer and/or ingest to base your normalization upon.
  • You can choose to send a transfer to backlog or to create a SIP every time.
  • You can select to be reminded to add PREMIS event metadata about manual normalization should you choose to use that capability.
  • You can select between 7z using lzma and 7zip using bzip or parallel bzip2 algorithms for AIP compression.
  • For select compression level, the options are as follows:
    • 1 - fastest mode
    • 3 - fast compression mode
    • 5 - normal compression mode
    • 7 - maximum compression
    • 9 - ultra compression
  • You can select one archival storage location where you will consistently send your AIPs.

General

In the general configuration section, you can select interface options and set Storage Service options for your Archivematica client.

General configuration options in Administration tab of the dashboard

Interface options

Here, you can hide parts of the interface that you don't need to use. In particular, you can hide CONTENTdm DIP upload link, AtoM DIP upload link and DSpace transfer type.

Storage Service options

This is where you'll find the complete URL for the Storage Service. See Storage Service for more information about this feature.

Failures

Archivematica 1.2 includes dashboard failure reporting.

General configuration options in Administration tab of the dashboard

Transfer source location

Archivematica allows you to start transfers using the operating system's file browser or via a web interface. Source files for transfers, however, can't be uploaded using the web interface: they must exist on volumes accessible to the Archivematica MCP server and configured via the Storage Service.

When starting a transfer you're required to select one or more directories of files to add to the transfer.

You can view your transfer source directories in the Administrative tab of the dashboard under "Transfer source locations".


AIP storage locations

AIP storage directories are directories in which completed AIPs are stored. Storage directories can be specified in a manner similar to transfer source directories using the Storage Service.

You can view your transfer source directories in the Administrative tab of the dashboard under "AIP storage locations"

AtoM DIP upload

Archivematica can upload DIPs directly to an AtoM website so the contents can be accessed online. The AtoM DIP upload configuration page is where you specify the details of the AtoM installation you'd like the DIPs uploaded to (and, if using Rsync to transfer the DIP files, Rsync transfer details).

The parameters that you'll most likely want to set are url, email, and password. These parameters, respectively, specify the destination AtoM website's URL, the email address used to log in to the website, and the password used to log in to the website.

AtoM DIP upload can also use Rsync as a transfer mechanism. Rsync is an open source utility for efficiently transferring files. The rsync-target parameter is used to specify an Rsync-style target host/directory pairing, "foobar.com:~/dips/" for example. The rsync-command parameter is used to specify rsync connection options, "ssh -p 22222 -l user" for example. If you are using the rsync option, please see AtoM server configuration below.

To set any parameters for AtoM DIP upload change the values, preserving the existing format they're specified in, in the "Command arguments" field then click "Save".

Note that in AtoM, the sword plugin (Admin --> Plugins --> qtSwordPlugin) must be enabled in order for AtoM to receive uploaded DIPs. Enabling Job scheduling (Admin --> Settings --> Job scheduling) is also recommended.

AtoM server configuration

This server configuration step is necessary to allow Archivematica to log in to the AtoM server without passwords, and only when the user is deploying the rsync option described above in the AtoM DIP upload section.

To enable sending DIPs from Archivematica to the AtoM server:

Generate SSH keys for the Archivematica user. Leave the passphrase field blank.

 $ sudo -i -u archivematica
 $ cd ~
 $ ssh-keygen

Copy the contents of /var/lib/archivematica/.ssh/id_rsa.pub somewhere handy, you will need it later.

Now, it's time to configure the AtoM server so Archivematica can send the DIPs using SSH/rsync. For that purpose, you will create a user called archivematica and we are going to assign that user a restricted shell with access only to rsync:

 $ sudo apt-get install rssh
 $ sudo useradd -d /home/archivematica -m -s /usr/bin/rssh archivematica
 $ sudo passswd -l archivematica
 $ sudo vim /etc/rssh.conf // Make sure that allowrsync is uncommented!

Add the SSH key that we generated before:

 $ sudo mkdir /home/archivematica/.ssh
 $ chmod 700 /home/archivematica/.ssh/
 $ sudo vim /home/archivematica/.ssh/authorized_keys // Paste here the contents of id_dsa.pub
 $ chown -R archivematica:archivematica /home/archivematica

In Archivematica, make sure that you update the --rsync-target accordingly.
These are the parameters that we are passing to the upload-qubit microservice.
Go to the Administration > Upload DIP page in the dashboard.

Generic parameters:

--url="http://atom-hostname/index.php" \
--email="demo@example.com" \
--password="demo" \
--uuid="%SIPUUID%" \
--rsync-target="archivematica@atom-hostname:/tmp" \
--debug

CONTENTdm DIP upload

Archivematica can also upload DIPs to CONTENTdm instances. Multiple CONTENTdm destinations may be configured.

For each possible CONTENTdm DIP upload destination, you'll specify a brief description and configuration parameters appropriate for the destination. Paramters include %ContentdmServer% (full path to the CONTENTdm API, including the leading 'http://' or 'https://', for example http://example.com:81/dmwebservices/index.php), %ContentdmUser%, and %ContentdmGroup% (Linux user and group on the CONTENTdm server, not a CONTENTdm username). Note that only %ContentdmServer% is required is you are going to produce CONTENTdm Project Client packages; %ContentdmUser%, and %ContentdmGroup% are also required if you are going to use the "direct upload" option for uploading your DIPs into CONTENTdm.

When changing parameters for a CONTENTdm DIP upload destination simply change the values, preserving the existing format they're specified in. To add an upload destination fill in the form at the bottom of the page with the appropriate values. When you've completed your changes click the "Save" button.


PREMIS agent

The PREMIS agent name and code can be set via the administration interface.

thumbs

Rest API

In addition to automation using the processingMCP.xml file, Archivematica includes a REST API for automating transfer approval. Using this API, you can create a custom script that copies a transfer to the appropriate directory then uses the curl command, or some other means, to let Archivematica know that the copy is complete.

API keys

Use of the REST API requires the use of API keys. An API key is associated with a specific user. To generate an API key for a user:

  1. Browse to /administration/accounts/list/
  2. Click the "Edit" button for the user you'd like to generate an API key for
  3. Click the "Regenerate API key" checkbox
  4. Click "Save"

After generating an API key, you can click the "Edit" button for the user and you should see the API key.

IP whitelist

In addition to creating API keys, you'll need to add the IP of any computer making REST requests to the REST API whitelist. The IP whitelist can be edited in the administration interface at /administration/api/.

Approving a transfer

The REST API can be used to approve a transfer. The transfer must first be copied into the appropriate watch directory. To determine the location of the appropriate watch directory, first figure out where the shared directory is from the watchDirectoryPath value of /etc/archivematica/MCPServer/serverConfig.conf. Within that directory is a subdirectory activeTransfers. In this subdirectory are watch directories for the various transfer types.

When using the REST API to approve a transfer, if a transfer type isn't specified, the transfer will be deemed a standard transfer.

HTTP Method: POST

URL: /api/transfer/approve

Parameters:

directory: directory name of the transfer

type (optional): transfer type [standard|dspace|unzipped bag|zipped bag]

api_key: an API key

username: the username associated with the API key

Example curl command:

   curl --data "username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6&type=standard&directory=MyTransfer" http://127.0.0.1/api/transfer/approve

Example result:

   {"message": "Approval successful."}

Listing unapproved transfers

The REST API can be used to get a list of unapproved transfers. Each transfer's directory name and type is returned.

Method: GET

URL: /api/transfer/unapproved

Parameters:

api_key: an API key

username: the username associated with the API key

Example curl command:

   curl "http://127.0.0.1/api/transfer/unapproved?username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6"

Example result:

   {
       "message": "Fetched unapproved transfers successfully.",
       "results": [{
               "directory": "MyTransfer",
              "type": "standard"
           }
       ]
   }

Users

The dashboard provides a simple cookie-based user authentication system using the Django authentication framework. Access to the dashboard is limited only to logged-in users and a login page will be shown when the user is not recognized. If the application can't find any user in the database, the user creation page will be shown instead, allowing the creation of an administrator account.

Users can be also created, modified and deleted from the Administration tab. Only users who are administrators can create and edit user accounts.

You can add a new user to the system by clicking the "Add new" button on the user administration page. By adding a user you provide a way to access Archivematica using a username/password combination. Should you need to change a user's username or password, you can do so by clicking the "Edit" button, corresponding to the user, on the administration page. Should you need to revoke a user's access, you can click the corresponding "Delete" button.

CLI creation of administrative users

If you need an additional administrator user one can be created via the command-line, issue the following commands:

   cd /usr/share/archivematica/dashboard
   export PATH=$PATH:/usr/share/archivematica/dashboard
   export DJANGO_SETTINGS_MODULE=settings.common
   python manage.py createsuperuser

CLI password resetting

If you've forgotten the password for your administrator user, or any other user, you can change it via the command-line:

   cd /usr/share/archivematica/dashboard
   export PATH=$PATH:/usr/share/archivematica/dashboard
   export DJANGO_SETTINGS_MODULE=settings.common
   python manage.py changepassword <username>

Security

Archivematica uses PBKDF2 as the default algorithm to store passwords. This should be sufficient for most users: it's quite secure, requiring massive amounts of computing time to break. However, other algorithms could be used as the following document explains: How Django stores passwords.

Our plan is to extend this functionality in the future adding groups and granular permissions support.

Dashboard preservation planning tab

Format Policy Registry (FPR)

Introduction to the Format Policy Registry

The Format Policy Registry (FPR) is a database which allows Archivematica users to define format policies for handling file formats. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format). Format policies will change as community standards, practices and tools evolve. Format policies are maintained by Artefactual, who provides a freely-available FPR server hosted at fpr.archivematica.org. This server stores structured information about normalization format policies for preservation and access. You can update your local FPR from the FPR server using the UPDATE button in the preservation planning tab of the dashboard. In addition, you can maintain local rules to add new formats or customize the behaviour of Archivematica. The Archivematica dashboard communicates with the FPR server via a REST API.

First-time configuration

The first time a new Archivematica installation is set up, it will attempt to connect to the FPR server as part of the initial configuration process. As a part of the setup, it will register the Archivematica install with the server and pull down the current set of format policies. In order to register the server, Archivematica will send the following information to the FPR Server, over an encrypted connection:

  1. Agent Identifier (supplied by the user during registration while installing Archivematica)
  2. Agent Name (supplied by the user during registration while installing Archivematica)
  3. IP address of host
  4. UUID of Archivematica instance
  5. current time
  • The only information that will be passed back and forth between Archivematica and the FPR Server would be these format policies - what tool to run when normalizing for a given purpose (access, preservation) when a specific File Identification Tool identifies a specific File Format. No information about the content that has been run through Archivematica, or any details about the Archivematica installation or configuration would be sent to the FPR Server.
  • Because Archivematica is an open source project, it is possible for any organization to conduct a software audit/code review before running Archivematica in a production environment in order to independently verify the information being shared with the FPR Server. An organization could choose to run a private FPR Server, accessible only within their own network(s), to provide at least a limited version of the benefits of sharing format policies, while guaranteeing a completely self-contained preservation system. This is something that Artefactual is not intending to develop, but anyone is free to extend the software as they see fit, or to hire us or other developers to do so.

Updating format policies

FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica. Clicking the "update" button will initiate an FPR pull which will bring in any new or altered rules since the last time an update was performed.

Types of FPR entries

Format

In the FPR, a "format" is a record representing one or more related format versions, which are records representing a specific file format. For example, the format record for "Graphics Interchange Format" (GIF) is comprised of format versions for both GIF 1987a and 1989a.

When creating a new format version, the following fields are available:

  • Description (required) - Text describing the format. This will be saved in METS files.
  • Version (required) - The version number for this specific format version (not the FPR record). For example, for Adobe Illustrator 14 .ai files, you might choose "14".
  • Pronom id - The specific format version's unique identifier in PRONOM, the UK National Archives's format registry. This is optional, but highly recommended.
  • Access format and Preservation format - Indicates whether this format is suitable as an access format for end users, and for preservation.

Format Group

A format group is a convenient grouping of related file formats which share common properties. For instance, the FPR includes an "Image (raster)" group which contains format records for GIF, JPEG, and PNG. Each format can belong to one (and only one) format group.


Characterization

Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties and to extract technical metadata contained within the object.

Prior to Archivematica 1.2, the characterization micro-service always ran the FITS tool. As of Archivematica 1.2, characterization is fully customizable by the Archivematica administrator.

Characterization tools

Archivematica has four default characterization tools upon installation. Which tool will run on a given file depends on the type of file, as determined by the selected identification tool.

Default

The default characterization tool is FITS; it will be used if no specific characterization rule exists for the file being scanned.

It is possible to create new default characterization commands, which can either replace FITS or run alongside it on every file.

Multimedia

Archivematica 1.2 introduced three new multimedia characterization tools. These tools were selected for their rich metadata extraction, as well as for their speed. Depending on the type of the file being scanned, one or more of these tools may be called instead of FITS.

  • FFprobe, a characterization tool built on top of the same core as FFmpeg, the normalization software used by Archivematica
  • MediaInfo, a characterization tool oriented towards audio and video data
  • ExifTool, a characterization tool oriented towards still image data and extraction of embedded metadata
Writing a new characterization command

Information on writing new characterization commands can be found in the FPR administrator's manual.

Writing a characterization command is very similar to writing an identification command or a normalization command. Like an identification command, a characterization command is designed to run a tool and produce output to standard out. Output from characterization commands is expected to be valid XML, and will be included in the AIP's METS document within the file's <objectCharacteristicsExtension> element.

When creating a characterization command, the "output format" should be set to "XML 1.0".

Extraction

Archivematica supports extracting contents from files during the transfer phase.

Many transfers contain files which are packages of other files; examples of these include compressed archives, such as ZIP files, or disk images. Archivematica provides a transcription microservice which comes with several predefined rules to extract packages, and which is fully customizeable by Archivematica administrators. Administrators can write new commands, and assign existing formats to run for other file formats.

Writing a new extraction command

Writing an extraction command is very similar to writing an identification command or a normalization command.

An extraction command is passed two arguments: the file to extract, and the path to which the package should be extracted. Similar to normalization commands, these arguments will be interpolated directly into "bashScript" and "command" scripts, and passed as positional arguments to "pythonScript" and "asIs" scripts.

Name (bashScript and command) Commandline position (pythonScript and asIs) Description Sample value
%outputDirectory% First The full path to the directory in which the package's contents should be extracted /path/to/filename-uuid/
%inputFile% Second The full path to the package file /path/to/filename

Here's a simple example of how to call an existing tool (7-zip) without any extra logic:

7z x -bd -o"%outputDirectory%" "%inputFile%"

This Python script example is more complex, and attempts to determine whether any files were extracted in order to determine whether to exit 0 or 1 (and report success or failure):

from __future__ import print_function
import re
import subprocess
import sys

def extract(package, outdir):
    # -a extracts only allocated files; we're not capturing unallocated files
    try:
        process = subprocess.Popen(['tsk_recover', package, '-a', outdir],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
        stdout, stderr = process.communicate()

        match = re.match(r'Files Recovered: (\d+)', stdout.splitlines()[0])
        if match:
            if match.groups()[0] == '0':
                raise Exception('tsk_recover failed to extract any files with the message: {}'.format(stdout))
            else:
                print(stdout)
    except Exception as e:
        return e

    return 0

def main(package, outdir):
    return extract(package, outdir)

if __name__ == '__main__':
    package = sys.argv[1]
    outdir = sys.argv[2]
    sys.exit(main(package, outdir))

Transcription

Archivematica 1.2 introduces a new transcription microservice. This microservice provides tools to transcribe the contents of media objects. In Archivematica 1.2 it is used to perform OCR on images of textual material, but it can also be used to create commands which perform other kinds of transcription.

Writing transcription commands

Writing a transcription command is very similar to writing an identification command or a normalization command.

Transcription commands are expected to write their data to disk inside the SIP. For commands which perform OCR, metadata can be placed inside the "metadata/OCRfiles" directory inside the SIP; other kinds of transcription should produce files within "metadata".

For example, the following bash script is used by Archivematica to transcribe images using the Tesseract software:

ocrfiles="%SIPObjectsDirectory%metadata/OCRfiles"
test -d "$ocrfiles" || mkdir -p "$ocrfiles"

tesseract %fileFullName% "$ocrfiles/%fileName%"

Identification

Identification Tools

The identification tool properties in Archivematica control the ways in which Archivematica identifies files and associates them with the FPR's version records. The current version of the FPR server contains two tools: a script based on the Open Planets Foundation's FIDO tool, which identifies based on the IDs in PRONOM, and a simple script which identifies files by their file extension. You can use the identification tools portion of FPR to customize the behaviour of the existing tools, or to write your own.

Identification Commands

Identification commands contain the actual code that a tool will run when identifying a file. This command will be run on every file in a transfer.

When adding a new command, the following fields are available:

  • Identifier (mandatory) - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
  • Script type (mandatory) - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.

When coding a command, you should expect your script to take the path to the file to be identifed as the first commandline argument. When returning an identification, the tool should print a single line containing only the identifier, and should exit 0. Any informative, diagnostic, and error message can be printed to stderr, where it will be visible to Archivematica users monitoring tool results. On failure, the tool should exit non-zero.

Identification Rules

These identification rules allow you to define the relationship between the output created by an identification tool, and one of the formats which exists in the FPR. This must be done for the format to be tracked internally by Archivematica, and for it to be used by normalization later on. For instance, if you created a FIDO configuration which returns MIME types, you could create a rule which associates the output "image/jpeg" with the "Generic JPEG" format in the FPR.

Identification rules are necessary only when a tool is configured to return file extensions or MIME types. Because PUIDs are universal, Archivematica will always look these up for you without requiring any rules to be created, regardless of what tool is being used.

When creating an identification rule, the following mandatory fields must be filled out:

  • Format - Allows you to select one of the formats which already exists in the FPR.
  • Command - Indicates the command that produces this specific identification.
  • Output - The text which is written to standard output by the specified command, such as "image/jpeg"

Format Policy Tools

Format policy tools control how Archivematica processes files during ingest. The most common kind of these tools are normalization tools, which produce preservation and access copies from ingested files. Archivematica comes configured with a number of commands and scripts to normalize several file formats, and you can use this section of the FPR to customize them or to create your own. These are organized similarly to the #Identification Tools documented above.

Archivematica uses the following kinds of format policy rules:

  • Characterization
  • Extraction
  • Normalization - Access, preservation and thumbnails
  • Event detail - Extracts information about a given tool in order to be inserted into a generated METS file.
  • Transcription
  • Verification - Validates a file produced by another command. For instance, a tool could use Exiftool or JHOVE to determine whether a thumbnail produced by a normalization command was valid and well-formed.

Format Policy Commands

Like the #Identification Commands above, format policy commands are scripts or command line statements which control how a normalization tool runs. This command will be run once on every file being normalized using this tool in a transfer.

When creating a normalization command, the following mandatory fields must be filled out:

  • Tool - One or more tools to be associated with this command.
  • Description - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
  • Command - The script's source, or the commandline statement to execute.
  • Script type - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.
  • Output format (optional) - The format the command outputs. For example, a command to normalize audio to MP3 using ffmpeg would select the appropriate MP3 format from the dropdown.
  • Output location (optional) - The path the normalized file will be written to. See the #Writing a command section of the documentation for more information.
  • Command usage - The purpose of the command; this will be used by Archivematica to decide whether a command is appropriate to run in different circumstances. Values are "Normalization", "Event detail", and "Verification". See the #Writing a command section of the documentation for more information.
  • Event detail command - A command to provide information about the software running this command. This will be written to the METS file as the "event detail" property. For example, the normalization commands which use ffmpeg use an event detail command to extract ffmpeg's version number.

Format Policy Rules

Format policy rules allow commands to be associated with specific file types. For instance, this allows you to configure the command that uses ImageMagick to create thumbnails to be run on .gif and .jpeg files, while selecting a different command to be run on .png files.

When creating a format policy rule, the following mandatory fields must be filled out:

  • Purpose - Allows Archivematica to distinguish rules that should be used to normalize for preservation, normalize for access, to extract information, etc.
  • Format - The file format the associated command should be selected for.
  • Command - The specific command to call when this rule is used.

Writing a command

Identification command

Identification commands are very simple to write, though they require some familiarity with Unix scripting.

An identification command run once for every file in a transfer. It will be passed a single argument (the path to the file to identify), and no switches.

On success, a command should:

  • Print the identifier to stdout
  • Exit 0

On failure, a command should:

  • Print nothing to stdout
  • Exit non-zero (Archivematica does not assign special significance to non-zero exit codes)

A command can print anything to stderr on success or error, but this is purely informational - Archivematica won't do anything special with it. Anything printed to stderr by the command will be shown to the user in the Archivematica dashboard's detailed tool output page. You should print any useful error output to stderr if identification fails, but you can also print any useful extra information to stderr if identification succeeds.

Here's a very simple Python script that identifies files by their file extension:

import os.path, sys
(_, extension) = os.path.splitext(sys.argv[1])
if len(extension) == 0:
	exit(1)
else:
	print extension.lower()

Here's a more complex Python example, which uses Exiftool's XML output to return the MIME type of a file:

#!/usr/bin/env python

from lxml import etree
import subprocess
import sys

try:
    xml = subprocess.check_output(['exiftool', '-X', sys.argv[1]])
    doc = etree.fromstring(xml)
    print doc.find('.//{http://ns.exiftool.ca/File/1.0/}MIMEType').text
except Exception as e:
    print >> sys.stderr, e
    exit(1)

Once you've written an identification command, you can register it in the FPR using the following steps:

  1. Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
  2. Navigate to the "Identification Tools" page, and click "Create New Tool".
  3. Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
  4. Click "Create".

Next, create a record for the command itself:

  1. Click "Create New Command".
  2. Select your tool from the "Tool" dropdown box.
  3. Fill out the Identifier with text to describe to a user what this tool does. For instance, we might choose "Identify MIME-type using Exiftool".
  4. Select the appropriate script type - in this case, "Python Script".
  5. Enter the source code for your script in the "Command" box.
  6. Click "Create Command".

Finally, you must create rules which associate the possible outputs of your tool with the FPR's format records. This needs to be done once for every supported format; we'll show it with MP3, as an example.

  1. Navigate to the "Identification Rules" page, and click "Create New Rule".
  2. Choose the appropriate foramt from the Format dropdown - in our case, "Audio: MPEG Audio: MPEG 1/2 Audio Layer 3".
  3. Choose your command from the Command dropdown.
  4. Enter the text your command will output when it identifies this format. For example, when our Exiftool command identifies an MP3 file, it will output "audio/mpeg".
  5. Click "Create".

Once this is complete, any new transfers you create will be able to use your new tool in the identification step.

Normalization Command

Normalization commands are a bit more complex to write because they take a few extra parameters.

The goal of a normalization command is to take an input file and transform it into a new format. For instance, Archivematica provides commands to transform video content into FFV1 for preservation, and into H.264 for access.

Archivematica provides several parameters specifying input and output filenames and other useful information. Several of the most common are shown in the examples below; a more complete list is in a later section of the documentation: #Normalization command variables and arguments

When writing a bash script or a command line, you can reference the variables directly in your code, like this:

inkscape -z "%fileFullName%" --export-pdf="%outputDirectory%%prefix%%fileName%%postfix%.pdf"

When writing a script in Python or other languages, the values will be passed to your script as commandline options, which you will need to parse. The following script provides an example using the argparse module that comes with Python:

import argparse
import subprocess

parser = argparse.ArgumentParser()

parser.add_argument('--file-full-name', dest='filename')
parser.add_argument('--output-file-name', dest='output')
parsed, _ = parser.parse_known_args()
args = [
    'ffmpeg', '-vsync', 'passthrough',
    '-i', parsed.filename,
    '-map', '0:v', '-map', '0:a',
    '-vcodec', 'ffv1', '-g', '1',
    '-acodec', 'pcm_s16le',
    parsed.output+'.mkv'
]

subprocess.call(args)

Once you've created a command, the process of registering it is similar to creating a new identification tool. The folling examples will use the Python normalization script above.

First, create a new tool record:

  1. Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
  2. Navigate to the "Identification Tools" page, and click "Create New Tool".
  3. Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
  4. Click "Create".

Next, create a record for your new command:

  1. Click "Create New Tool Command".
  2. Fill out the Description with text to describe to a user what this tool does. For instance, we might choose "Normalize to mkv using ffmpeg".
  3. Enter the source for your command in the Command textbox.
  4. Select the appropriate script type - in this case, "Python Script".
  5. Select the appropriate output format from the dropdown. This indicates to Archivematica what kind of file this command will produce. In this case, choose "Video: Matroska: Generic MKV".
  6. Enter the location the video will be saved to, using the script variables. You can usually use the "%outputFileName%" variable, and add the file extension - in this case "%outputFileName%.mkv"
  7. Select a verification command. Archivematica will try to use this tool to ensure that the file your command created works. Archivematica ships with two simple tools, which test whether the file exists and whether it's larger than 0 bytes, but you can create new commands that perform more complicated verifications.
  8. Finally, choose a command to produce the "Event detail" text that will be written in the section of the METS file covering the normalization event. Archivematica already includes a suitable command for ffmpeg, but you can also create a custom command.
  9. Click "Create command".

Finally, you must create rules which will associate your command with the formats it should run on.

Normalization command variables and arguments

The following variables and arguments control the behaviour of format policy command scripts.

Name (bashScript and command) Commandline option (pythonScript and asIs) Description Sample value
%fileName% --input-file= The filename of the file to process. This variable holds the file's basename, not the whole path. video.mov
%fileDirectory% --file-directory= The directory containing the input file. /path/to
%inputFile% --file-name= The fully-qualified path to the file to process. /path/to/video.mov
%fileExtension% --file-extension= The file extension of the input file. mov
%fileExtensionWithDot% --file-extension-with-dot= As above, without stripping the period. .mov
%outputDirectory% --output-directory= The directory to which the output file should be saved. /path/to/access/copies
%outputFileUUID% --output-file-uuid= The unique identifier assigned by Archivematica to the output file. 1abedf3e-3a4b-46d7-97da-bd9ae13859f5
%outputDirectory% --output-directory= The fully-qualified path to the directory where the new file should be written. /var/archivematica/sharedDirectory/www/AIPsStore/uuid
%outputFileName% --output-file-name= The fully-qualified path to the output file, minus the file extension. /path/to/access/copies/video-uuid

Customization and automation

  • Workflow processing decisions can be made in the processingMCP.xml file. See here.
  • Workflows are currently created at the development level.
    Some resources avialable
  • Normalization commands can be viewed in the preservation planning tab.
  • Normalization paths and commands are currently editable under the preservation planning tab in the dashboard.

Elasticsearch

Archivematica has the capability of indexing data about files contained in AIPs and this data can be accessed programatically for various applications.

If, for whatever reason, you need to delete an ElasticSearch index please see ElasticSearch Administration.

If, for whatever reason, you need to delete an Elasticsearch index programmatically, this can be done with pyes using the following code.

import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
conn = ES('127.0.0.1:9200')

try:
    conn.delete_index('aips')
except:
    print "Error deleting index or index already deleted."

Rebuilding the AIP index

To rebuild the ElasticSearch AIP index enter the following to find the location of the rebuilding script:

   locate rebuild-elasticsearch-aip-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-aip-index-from-files" with the location of the script):

   /your/script/location/rebuild-elasticsearch-aip-index-from-files <location of your AIP store>

Rebuilding the transfer index

Similarly, to rebuild the ElasticSearch transfer data index enter the following to find the location of the rebuilding script:

   locate rebuild-elasticsearch-transfer-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-transfer-index-from-files" with the location of the script):

   /your/script/location/rebuild-elasticsearch-transfer-index-from-files <location of your AIP store>

Data backup

In Archivematica there are three types of data you'll likely want to back up:

  • Filesystem (particularly your storage directories)
  • MySQL
  • ElasticSearch

MySQL is used to store short-term processing data. You can back up the MySQL database by using the following command:

mysqldump -u <your username> -p<your password> -c MCP > <filename of backup>

ElasticSearch is used to store long-term data. Instructions and scripts for backing up and restoring ElasticSearch are available here.

Security

Once you've set up Archivematica it's a good practice, for the sake of security, to change the default passwords.

MySQL

You should create a new MySQL user or change the password of the default "archivematica" MySQL user. The change the password of the default user, enter the following into the command-line:

$ mysql -u root -p<your MyQL root password> -D mysql \
   -e "SET PASSWORD FOR 'archivematica'@'localhost' = PASSWORD('<new password>'); \
   FLUSH PRIVILEGES;"

Once you've done this you can change Archivematica's MySQL database access credentials by editing these two files:

  • /etc/archivematica/archivematicaCommon/dbsettings (change the user and password settings)
  • /usr/share/archivematica/dashboard/settings/common.py (change the USER and PASSWORD settings in the DATABASES section)

Archivematica does not presently support secured MySQL communication so MySQL should be run locally or on a secure, isolated network. See issue 1645.

AtoM

In addition to changing the MySQL credentials, if you've also installed AtoM you'll want to set the password for it as well. Note that after changing your AtoM credentials you should update the credentials on the AtoM DIP upload administration page as well.

Gearman

Archivematica relies on the German server for queuing work that needs to be done. Gearman currently doesn't support secured connections so Gearman should be run locally or on a secure, isolated network. See issue 1345.

Questions

If you run into any difficulties while administrating Archivematica, please check out our FAQ and, if that doesn't help you, contain us using the Archivematica discussion group.

Frequently asked questions

Discussion group