Archivematica - User contributions [en]

Getting started

2015-11-20T22:43:15Z

Mdemeo: /* Storage Service */ Update test instructions

[[Main Page]] > [[Development]] > Getting Started

This wiki page describes getting started with Archivematica as a developer. For user and administrative manuals, please see http://www.archivematica.org.

== Vital Stats ==

* Language: Python (primarily)
* License: [https://en.wikipedia.org/wiki/Affero_General_Public_License AGPL]
* VCS: git
* Major libraries: [https://www.djangoproject.com/ Django], [http://gearman.org/ gearman] ([https://pythonhosted.org/gearman/ Python API])
* [[Contribute_code|Contribution guidelines]]
** [[Contribute_code#Code_Style_Guide_For_Archivematica|Coding style]]

== Projects ==

Archivematica consists of several projects working together.

* [https://github.com/artefactual/archivematica Archivematica]: Main repository containing the user-facing dashboard, task manager MCPServer and clients scripts for the MCPClient
* [https://github.com/artefactual/archivematica-storage-service Storage Service]: Responsible for moving files to Archivematica for processing, and from Archivematica into storage
* [https://github.com/artefactual/archivematica-fpr-admin Format Policy Registry]: Submodule shared between Archivematica and the Format Policy Registry (FPR) server that displays and updates FPR rules and commands

There are also several smaller repositories that support Archivematica in various ways. In general, you will not need these to develop on Archivematica.

* [https://github.com/artefactual/archivematica-devtools Development tools]: Scripts to help with development. E.g. restarting services, workflow analysis
* [https://github.com/artefactual/archivematica-fpr-tools FPR tools]: All the tools, commands and rules used to populate the FPR database. Changes to the FPR should be submitted here.
* [https://github.com/artefactual/archivematica-docs Archivematica Documentation]: Documentation found at https://www.archivematica.org/en/docs/ Note that storage service documentation is found in the storage service repository.
* [https://github.com/artefactual/automation-tools Automation Tools]: Scripts used to automate processing material through Archivematica
* [https://github.com/artefactual/deploy-pub Deployment]: Ansible scripts for deploying and configuring Archivematica
* [https://github.com/artefactual-labs/ansible-archivematica Deployment-Archivematica]: Ansible playbook for Archivematica package install.
* [https://github.com/artefactual-labs/ansible-role-archivematica-src Deployment-Archivematica-dev]: Ansible playbook for Archivematica github install.
* [https://github.com/artefactual/fixity Fixity checker]: Commandline tool that assists in checking fixity for AIPs stored in Archivematica Storage Service instances.
* [https://github.com/artefactual/archivematica-sampledata Sample data]: Data to test and show off Archivematica's processing
* [https://github.com/artefactual/archivematica-history History]: Contains the pre-git history of Archivematica. Useful for checking the origins of code.

== Installation ==

There are two main ways to run Archivematica in development.

# Use ansible and vagrant to install Archivematica in a VM
# Install Archivematica on your development machine
# Alternate vagrant based installations exist as well

=== Ansible & Vagrant ===

To install and run Archivematica from source on a VM:
# Checkout deployment repo
#* <code>git clone https://github.com/artefactual/deploy-pub.git</code>
#* <code>cd deploy-pub/playbooks/archivematica</code>
# Install VirtualBox, Vagrant, Ansible
#* <code>sudo apt-get install virtualbox vagrant</code>
#* Vagrant must be at least 1.5 (it can also be downloaded from [https://www.vagrantup.com/downloads.html vagrantup.com])
#** <code>vagrant --version</code>
#* <code>sudo pip install -U ansible</code>
# Download Ansible roles
#* <code>ansible-galaxy install -f -r requirements.yml</code>
# (Optional) Change the branch by modifying <code>vars-singlenode.yml</code>
#* <code>amdev_version: "remotes/origin/branch-name"</code>
#* <code>ssdev_version: "remotes/origin/branch-name"</code>
# Create virtual machine and provision it
#* <code>vagrant up</code> (it takes a while)
# Login now available via:
#* <code>vagrant ssh</code>
# Services available:
#* Archivematica - http://192.168.168.192
#* Archivematica Storage Service: http://192.168.168.192:8000 (user: test, pass: test)
# Provisioning (via ansible) can be re-run
#* <code>vagrant provision</code>
#* Or with ansible directly <code>ansible-playbook -i .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory singlenode.yml -u vagrant --private-key .vagrant/machines/am-local/virtualbox/private_key [--extra-vars=archivematica_src_dir=/path/to/code]</code> This allows you to pass ansible-specific parameters, such as <code>--start-at="name of task"</code>

=== Development Machine ===

See [[Development_environment#Setup|development environment setup instructions]]

=== Alternative Vagrant projects ===

*https://github.com/emltech/eml-archivematica-vagrant
*https://github.com/statsbiblioteket/archivematica-vagrant

== Tests ==

Archivematica and the related projects have a small but growing test suite. We use [http://pytest.org/ py.test] to run our tests, which should be listed as a requirement in the development/local requirements file.

To run the tests, go to the repository root and run <code>py.test</code>

See below for project-specific setup or changes to running the tests.

=== Archivematica ===

Before running Archivematica tests, set the following environment variable.

<pre>
#!/usr/bin/fish
set -xg PYTHONPATH $PYTHONPATH:/usr/share/archivematica/dashboard/:/usr/lib/archivematica/archivematicaCommon/
</pre>

<pre>
#!/usr/bin/bash
export PYTHONPATH=$PYTHONPATH:/usr/share/archivematica/dashboard/:/usr/lib/archivematica/archivematicaCommon/
</pre>

=== Storage Service ===

Before running Storage Service tests, set the following environment variables

<pre>
#!/usr/bin/fish
set -xg PYTHONPATH (pwd)/storage_service # The project root
set -xg DJANGO_SETTINGS_MODULE storage_service.settings.test
set -xg DJANGO_SECRET_KEY 'ADDKEY'
</pre>
<pre>
#!/usr/bin/bash
export PYTHONPATH=$(pwd)/storage_service # The project root
export DJANGO_SETTINGS_MODULE=storage_service.settings.test
export DJANGO_SECRET_KEY='ADDKEY'
</pre>

Release 1.3.1

2015-03-12T17:15:18Z

Mdemeo: Add MediaInfo

[[Main Page]] > [[External_tools|External software tools]]> Release 1.3.1

Archivematica integrates a suite of free and open-source tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model. In addition to the core Archivematica which is released under AGPL v3 license, the following tools are bundled with Archivematica 1.3.1:

{| border="1" cellpadding="10" cellspacing="0" width=90%
|-
|- style="background-color:#cccccc;"
!style="width:20%"|'''Tool'''
!style="width:10%"|'''Version'''
!style="width:50%"|'''Description'''
!style="width:40%"|'''License'''
|-
|-
|[https://github.com/LibraryOfCongress/bagit-java/ BagIt]
|4.9.0
|Standard and script to package digital objects and metadata for archival storage
|BSD License
|-
|[https://github.com/simsong/bulk_extractor bulk_extractor]
|1.4.4
|Disk image and file contents analysis tool
|Public domain
|-
|[http://www.clamav.net/ Clam AV (anti-virus)]
|0.98.6
|Anti-virus toolkit for UNIX
|GNU General Public License (GPL)
|-
|[http://www.elasticsearch.org/ ElasticSearch]
|0.90.13
|Indexing and search
|Apache License 2.0
|-
|[http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool]
|9.76
|Multimedia metadata extraction
|GNU General Public License and Artistic License
|-
|[http://ffmpeg.org/ FFmpeg]
|2.5.3
|Converts a wide variety of audio and video formats
|GNU Lesser General Public License (LGPL)
|-
|[http://code.google.com/p/fits/ File Information Tool Set (FITS)]
|0.8.2
|File format identification and validation software integration tool
|GNU Lesser General Public License (LGPL)
|-
|[https://github.com/openplanets/fido fido]
|1.3.1-78
|Format Identifier for Digital Objects
|Licensed under the Apache License, Version 2.0 (the "License")
|-
|[http://jhove.sourceforge.net/ JHOVE]
|1.6+dfsg-1
|Object validation tool
|GNU Lesser General Public License (LGPL)
|-
|[https://mediaarea.net/en/MediaInfo MediaInfo]
|0.7.52
|Multimedia metadata extraction
|BSD (2-clause), Zlib
|-
|[http://www.accesstomemory.org AtoM]
|2.1.2
|Web-based archival description and access tool
|GNU General Public License (GPL)
|-
|[http://www.imagemagick.org/script/index.php Imagemagick]
|6.6.9.7
|Converts a wide variety of bitmap images
|GPL compatible [http://www.imagemagick.org/script/license.php Imagemagick license]
|-
|[http://www.inkscape.org/ Inkscape]
|0.48.3.1
|Converts vector images to Scalable Vector Graphic (SVG) format
|GNU General Public License (GPL) version 2
|-
|[http://linux.about.com/cs/linux101/g/nfscommon.htm NFS-common]
|1.2.5
|Network File System Access - allows access to files on network storage devices.
|GNU General Public License (GPL)
|-
|[http://packages.ubuntu.com/lucid/python-lxml Python-lxml]
|2.3.2
|Python binding for libxml2 and libxslt
|GNU General Public License (GPL)
|-
|[http://www.sleuthkit.org/ The Sleuthkit]
|4.1.3
|Disk image management and extraction toolkit
|Common Public License / IBM Public License
|-
|[http://md5deep.sourceforge.net/ md5deep]
|3.9.5
|Checksum generation and verification scripts
|GNU General Public License (GPL)
|-
|[http://www.ossp.org/pkg/lib/uuid/ UUID]
|1.6.2
|command line interface (CLI) for the generation of DCE 1.1, ISO/IEC 11578:1996 and IETF RFC-4122 compliant Universally Unique Identifier (UUID).
|GNU General Public License (GPL)
|-
|[http://www.ubuntu.org/ Ubuntu Linux]
|14.04.2
|Interface with computing hardware. Ubuntu Linux server edition.
|GNU General Public License (GPL)
|-
|[http://manpages.ubuntu.com/manpages/hardy/man1/zip.1.html Zip]
|3.0
|Utility called by Bagit to create AIP package
|Info-Zip license: "Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely"
|-
|[https://www.djangoproject.com/ Django]
|1.5.4
|Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.
|BSD License
|-
|[http://gearman.org/ Gearman]
|0.27
|Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work.
|BSD License
|-
|[http://p7zip.sourceforge.net/ p7zip]
|9.20.1
|7-Zip is a file archiver with a high compression ratio. (LZMA)
|GNU General Public License (GPL)
|-
|[http://unarchiver.c3.cx/commandline unar]
|1.8.1
|The Unarchiver is an archive unpacker program
|GNU General Public License (GPL)
|-
|}

Release 1.3.1

2015-03-12T17:10:26Z

Mdemeo: Add additional 1.3.1 tools, fix some versions

[[Main Page]] > [[External_tools|External software tools]]> Release 1.3.1

Archivematica integrates a suite of free and open-source tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model. In addition to the core Archivematica which is released under AGPL v3 license, the following tools are bundled with Archivematica 1.3.1:

{| border="1" cellpadding="10" cellspacing="0" width=90%
|-
|- style="background-color:#cccccc;"
!style="width:20%"|'''Tool'''
!style="width:10%"|'''Version'''
!style="width:50%"|'''Description'''
!style="width:40%"|'''License'''
|-
|-
|[https://github.com/LibraryOfCongress/bagit-java/ BagIt]
|4.9.0
|Standard and script to package digital objects and metadata for archival storage
|BSD License
|-
|[https://github.com/simsong/bulk_extractor bulk_extractor]
|1.4.4
|Disk image and file contents analysis tool
|Public domain
|-
|[http://www.clamav.net/ Clam AV (anti-virus)]
|0.98.6
|Anti-virus toolkit for UNIX
|GNU General Public License (GPL)
|-
|[http://www.elasticsearch.org/ ElasticSearch]
|0.90.13
|Indexing and search
|Apache License 2.0
|-
|[http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool]
|9.76
|Multimedia metadata extraction
|GNU General Public License and Artistic License
|-
|[http://ffmpeg.org/ FFmpeg]
|2.5.3
|Converts a wide variety of audio and video formats
|GNU Lesser General Public License (LGPL)
|-
|[http://code.google.com/p/fits/ File Information Tool Set (FITS)]
|0.8.2
|File format identification and validation software integration tool
|GNU Lesser General Public License (LGPL)
|-
|[https://github.com/openplanets/fido fido]
|1.3.1-78
|Format Identifier for Digital Objects
|Licensed under the Apache License, Version 2.0 (the "License")
|-
|[http://jhove.sourceforge.net/ JHOVE]
|1.6+dfsg-1
|Object validation tool
|GNU Lesser General Public License (LGPL)
|-
|[http://www.accesstomemory.org AtoM]
|2.1.2
|Web-based archival description and access tool
|GNU General Public License (GPL)
|-
|[http://www.imagemagick.org/script/index.php Imagemagick]
|6.6.9.7
|Converts a wide variety of bitmap images
|GPL compatible [http://www.imagemagick.org/script/license.php Imagemagick license]
|-
|[http://www.inkscape.org/ Inkscape]
|0.48.3.1
|Converts vector images to Scalable Vector Graphic (SVG) format
|GNU General Public License (GPL) version 2
|-
|[http://linux.about.com/cs/linux101/g/nfscommon.htm NFS-common]
|1.2.5
|Network File System Access - allows access to files on network storage devices.
|GNU General Public License (GPL)
|-
|[http://packages.ubuntu.com/lucid/python-lxml Python-lxml]
|2.3.2
|Python binding for libxml2 and libxslt
|GNU General Public License (GPL)
|-
|[http://www.sleuthkit.org/ The Sleuthkit]
|4.1.3
|Disk image management and extraction toolkit
|Common Public License / IBM Public License
|-
|[http://md5deep.sourceforge.net/ md5deep]
|3.9.5
|Checksum generation and verification scripts
|GNU General Public License (GPL)
|-
|[http://www.ossp.org/pkg/lib/uuid/ UUID]
|1.6.2
|command line interface (CLI) for the generation of DCE 1.1, ISO/IEC 11578:1996 and IETF RFC-4122 compliant Universally Unique Identifier (UUID).
|GNU General Public License (GPL)
|-
|[http://www.ubuntu.org/ Ubuntu Linux]
|14.04.2
|Interface with computing hardware. Ubuntu Linux server edition.
|GNU General Public License (GPL)
|-
|[http://manpages.ubuntu.com/manpages/hardy/man1/zip.1.html Zip]
|3.0
|Utility called by Bagit to create AIP package
|Info-Zip license: "Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely"
|-
|[https://www.djangoproject.com/ Django]
|1.5.4
|Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.
|BSD License
|-
|[http://gearman.org/ Gearman]
|0.27
|Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work.
|BSD License
|-
|[http://p7zip.sourceforge.net/ p7zip]
|9.20.1
|7-Zip is a file archiver with a high compression ratio. (LZMA)
|GNU General Public License (GPL)
|-
|[http://unarchiver.c3.cx/commandline unar]
|1.8.1
|The Unarchiver is an archive unpacker program
|GNU General Public License (GPL)
|-
|}

Archivematica Release Notes

2014-09-05T23:16:09Z

Mdemeo: /* New and Updated Tools */ Remove libbfio/libewf - these are pure deps of sleuthkit, not used on their own

[[Main Page]] > [[Documentation]] > [[Release Notes]]
==Archivematica 1.2.0==

Released September X, 2014

===New features===

* '''Sponsored''' (Council of Prairie and Pacific University Libraries) For COPPUL hosting functionality at Bronze level, ability to process through to Transfer backlog only
* '''Sponsored''' (SFU Archives) SIP Arrangement - Create one or more SIPs from one or more transfers in the Ingest tab Transfer and SIP creation - #1726, #1571, #1713, #1035, #6022
** does not support taking content out of a SIP once it's been moved to the SIP arrangement panel
* '''Sponsored''' (Harvard Business School Library) Directory printer - See requirements Directory printer for recording original order
* '''Sponsored''' (Harvard Business School Library) OCR - See requirements OCR text in DIP
* '''Sponsored''' (Harvard Business School Library) Store DIP - See requirements Store DIP
* '''Sponsored''' (Yale University Libraries) Forensic disk image ingest #5037, #5356, #5900
* '''Sponsored''' includes identification and flagging of personal information in transfers, as well as other bulk extractor reporting functions
* Add ability to configure Characterization commands via FPR https://github.com/artefactual/archivematica/pull/6
* Add verification command micro-service (verify frame-level fixity and lossless compression) #6501
* Improvements to transfer start #6220Scalability: Add nailgun (improve performance of java tools like FITS)
* View pointer files from Archival Storage and Storage Service
* Improvements to file identification metadata in METS #
* Include TIKA #5027 and DROID in packages so FPR can be configured to use them as identification tools
* Include MediaInfo, Exiftool and framemd5 (maybe ffprobe) for characterization and metadata extraction instead of FITS #5034
* Support Dublin Core metadata in JSON (as well as csv, which was already supported) https://github.com/artefactual/archivematica/pull/14
* Updated FIDO with the most recent PRONOM IDs ([[http://www.nationalarchives.gov.uk/aboutapps/pronom/release-notes.xml| Version 77]]) released July 18th, 2014

Archivematica 1.2.0 runs with a new version of the Storage Service, 0.4.0.

===New and Updated Tools===

new
*bulk_extractor 1.4.4 http://digitalcorpora.org/downloads/bulk_extractor/
*exiftool 9.70 http://www.sno.phy.queensu.ca/~phil/exiftool/index.html
*MediaInfo 0.7.52 http://mediaarea.net/en/MediaInfo
*nailgun 0.9.1 http://www.martiansoftware.com/nailgun/
*sleuthkit 4.1.3 http://www.sleuthkit.org/sleuthkit/download.php
*unar 1.8.1 http://unarchiver.c3.cx/commandline
*Tesseract 3.02 https://code.google.com/p/tesseract-ocr/

updated:

*bagit 4.9.0 https://github.com/LibraryOfCongress/bagit-java/releases
*ffmpeg 2.3 https://www.ffmpeg.org/download.html#releases
*fido 1.3.1.77 https://github.com/openplanets/fido/tree/1.3.1-77
*fits 0.8.0 http://projects.iq.harvard.edu/fits/downloads
*ImageMagick 6.6.9-7 http://www.imagemagick.org

==Storage Service 0.4.0==

This release allows integration with LOCKSS storage, adds a fixity checking app to the backend, and includes several developer features as well as features required for future releases of the Archivematica dashboard.

===New features===

*Sponsored (SFU Library) LOCKSS available as an AIP storage location using PLN Manager "LOCKSS-o-MATIC" (AIP storage / API plugin) #5425 PR15
*Sponsored (SFU) Ability to configure transfer backlog locations via the Storage Service #6131 PR#9
*Sponsored (Harvard Business School Library) Manage DIP storage PR11
*Sponsored (Museum of Modern Art) Fixity checking app #6597 , 1109 PR13
*View pointer files from Archival Storage and SS #5716 PR5

===Enhancements===

*Optimizations in moving files between Locations #6248 PR4
*Streamlined creation of new endpoints with decorators PR14
*New dependency added unar (and lsar) used to add support for AIP's with multiple Extensions (e.g., aip.tar.gz) #6764 PR15

===Bugfixes===

*Setting Location path from the user interface #5608 PR10
*Allow email address to be used as username #6674 PR12
*Ability to change internal processing space #6819
*Editing users no longer results in server error #6717

==Storage Service 0.3.0==

Released April 10th, 2014

Includes backend enhancements and API-level changes only, with no direct user facing changes
* '''Sponsored''' (University of Alberta) [[Dataset_preservation|Dataset preservation]]
** '''Sponsored''' Add support for AIC's https://github.com/artefactual/archivematica-storage-service/pull/2
* Improved unicode support https://github.com/artefactual/archivematica-storage-service/pull/3
* v2 of internal REST API (the API used by the Dashboard) and update documentation
* Storage Service now supports updating - no longer necessary to reinstall to upgrade https://github.com/artefactual/archivematica-storage-service/pull/6

==Archivematica 1.1==

Released May 2, 2014

===New features===

* '''Sponsored''' (University of Alberta) [[Dataset_preservation|Dataset preservation]]
** '''Sponsored''' creation and management of AICs #5802
** '''Sponsored''' AIP pointer file #5159
** '''Sponsored''' pointer file tracks multi-AIP relationships
** '''Sponsored''' pointer file includes compression information and other metadata required to find and process (e.g. open) AIP
* '''Sponsored''' (University of Alberta) Enhancements to [[UM_manual_normalization_1.0|manual normalization workflow]]
** '''Sponsored''' ability to add PREMIS event detail information for manually normalized files via the dashboard #5216 [[UM_manual_normalization_1.1#Adding_PREMIS_eventDetail_for_manual_normalization|User Manual - Adding PREMIS eventDetail for manual normalization]]
* Backend/Not user-facing:
**Improved unicode support https://github.com/artefactual/archivematica/pull/17
**Better handling of preconfigured choices (processingMCP.xml)
**More choices in processing archive file formats (extra preconfigured choices)
**Improved handling of unit variables (passing parameters between micro-services)
**Update to FITS 0.8.0 (or newer if available)
**Update to ElasticSearch 0.90.13
**Security fix (avoid invoking subshell when running micro-services) https://github.com/artefactual/archivematica/pull/16
**File identification in mets is now from file id tool and not FITS https://github.com/artefactual/archivematica/pull/15

===Bug fixes and enhancements===

*[http://bit.ly/1lPBqwV bug fixes]

==Archivematica 1.0==

Release for public testing: September 2013
Package release: January 2014

===New features===

* Format Policy Registry (FPR) improvements including
**Ability to add/change format policies in the dashboard
**Ability to update the local FPR from fpr.archivematica.org
**Upload and report performance stats to FPR
**For detailed information about the FPR, see [[Administrator_manual_1.0#Format_Policy_Registry_.28FPR.29|Administrator manual--FPR]]
*Generation of "fail" reports in the administrative tab of the dashboard
*Eliminate unused interface options (e.g. DSpace transfer, CONTENTdm upload, ICA-AtoM upload) via the administrative tab of the dashboard
*DIP upload to Archivist Toolkit [[Archivists Toolkit integration]] with a metadata entry gui in the dashboard and actionable PREMIS rights
*AIP pointer file
*[[Administrator_manual_1.0#Storage_service|Storage service]] with API
*Ability to request to delete an AIP via the dashboard
*Upgraded to [https://github.com/harvard-lts/fits FITS 0.62]
*Ability for multiple pipelines to write to a shared ElasticSearch index and to the same AIP store(s) (i.e. multiple dept's -> one institution)
* Further scalability testing/prototyping and improved documentation

===Bug fixes and enhancements===

*[https://projects.artefactual.com/versions/31 bug fixes]

Archivematica Release Notes

2014-09-05T23:08:08Z

Mdemeo: /* New and Updated Tools */ Add more tools and homepages, adjust new/updated list

[[Main Page]] > [[Documentation]] > [[Release Notes]]
==Archivematica 1.2.0==

Released September X, 2014

===New features===

* '''Sponsored''' (Council of Prairie and Pacific University Libraries) For COPPUL hosting functionality at Bronze level, ability to process through to Transfer backlog only
* '''Sponsored''' (SFU Archives) SIP Arrangement - Create one or more SIPs from one or more transfers in the Ingest tab Transfer and SIP creation - #1726, #1571, #1713, #1035, #6022
** does not support taking content out of a SIP once it's been moved to the SIP arrangement panel
* '''Sponsored''' (Harvard Business School Library) Directory printer - See requirements Directory printer for recording original order
* '''Sponsored''' (Harvard Business School Library) OCR - See requirements OCR text in DIP
* '''Sponsored''' (Harvard Business School Library) Store DIP - See requirements Store DIP
* '''Sponsored''' (Yale University Libraries) Forensic disk image ingest #5037, #5356, #5900
* '''Sponsored''' includes identification and flagging of personal information in transfers, as well as other bulk extractor reporting functions
* Add ability to configure Characterization commands via FPR https://github.com/artefactual/archivematica/pull/6
* Add verification command micro-service (verify frame-level fixity and lossless compression) #6501
* Improvements to transfer start #6220Scalability: Add nailgun (improve performance of java tools like FITS)
* View pointer files from Archival Storage and Storage Service
* Improvements to file identification metadata in METS #
* Include TIKA #5027 and DROID in packages so FPR can be configured to use them as identification tools
* Include MediaInfo, Exiftool and framemd5 (maybe ffprobe) for characterization and metadata extraction instead of FITS #5034
* Support Dublin Core metadata in JSON (as well as csv, which was already supported) https://github.com/artefactual/archivematica/pull/14
* Updated FIDO with the most recent PRONOM IDs ([[http://www.nationalarchives.gov.uk/aboutapps/pronom/release-notes.xml| Version 77]]) released July 18th, 2014

Archivematica 1.2.0 runs with a new version of the Storage Service, 0.4.0.

===New and Updated Tools===

new
*bulk_extractor 1.4.4 http://digitalcorpora.org/downloads/bulk_extractor/
*exiftool 9.70 http://www.sno.phy.queensu.ca/~phil/exiftool/index.html
*libbfio 20130507 https://code.google.com/p/libbfio/
*libewf 20130416 https://code.google.com/p/libewf/
*MediaInfo 0.7.52 http://mediaarea.net/en/MediaInfo
*nailgun 0.9.1 http://www.martiansoftware.com/nailgun/
*sleuthkit 4.1.3 http://www.sleuthkit.org/sleuthkit/download.php
*unar 1.8.1 http://unarchiver.c3.cx/commandline
*Tesseract 3.02 https://code.google.com/p/tesseract-ocr/

updated:

*bagit 4.9.0 https://github.com/LibraryOfCongress/bagit-java/releases
*ffmpeg 2.3 https://www.ffmpeg.org/download.html#releases
*fido 1.3.1.77 https://github.com/openplanets/fido/tree/1.3.1-77
*fits 0.8.0 http://projects.iq.harvard.edu/fits/downloads
*ImageMagick 6.6.9-7 http://www.imagemagick.org

==Storage Service 0.4.0==

This release allows integration with LOCKSS storage, adds a fixity checking app to the backend, and includes several developer features as well as features required for future releases of the Archivematica dashboard.

===New features===

*Sponsored (SFU Library) LOCKSS available as an AIP storage location using PLN Manager "LOCKSS-o-MATIC" (AIP storage / API plugin) #5425 PR15
*Sponsored (SFU) Ability to configure transfer backlog locations via the Storage Service #6131 PR#9
*Sponsored (Harvard Business School Library) Manage DIP storage PR11
*Sponsored (Museum of Modern Art) Fixity checking app #6597 , 1109 PR13
*View pointer files from Archival Storage and SS #5716 PR5

===Enhancements===

*Optimizations in moving files between Locations #6248 PR4
*Streamlined creation of new endpoints with decorators PR14
*New dependency added unar (and lsar) used to add support for AIP's with multiple Extensions (e.g., aip.tar.gz) #6764 PR15

===Bugfixes===

*Setting Location path from the user interface #5608 PR10
*Allow email address to be used as username #6674 PR12
*Ability to change internal processing space #6819
*Editing users no longer results in server error #6717

==Storage Service 0.3.0==

Released April 10th, 2014

Includes backend enhancements and API-level changes only, with no direct user facing changes
* '''Sponsored''' (University of Alberta) [[Dataset_preservation|Dataset preservation]]
** '''Sponsored''' Add support for AIC's https://github.com/artefactual/archivematica-storage-service/pull/2
* Improved unicode support https://github.com/artefactual/archivematica-storage-service/pull/3
* v2 of internal REST API (the API used by the Dashboard) and update documentation
* Storage Service now supports updating - no longer necessary to reinstall to upgrade https://github.com/artefactual/archivematica-storage-service/pull/6

==Archivematica 1.1==

Released May 2, 2014

===New features===

* '''Sponsored''' (University of Alberta) [[Dataset_preservation|Dataset preservation]]
** '''Sponsored''' creation and management of AICs #5802
** '''Sponsored''' AIP pointer file #5159
** '''Sponsored''' pointer file tracks multi-AIP relationships
** '''Sponsored''' pointer file includes compression information and other metadata required to find and process (e.g. open) AIP
* '''Sponsored''' (University of Alberta) Enhancements to [[UM_manual_normalization_1.0|manual normalization workflow]]
** '''Sponsored''' ability to add PREMIS event detail information for manually normalized files via the dashboard #5216 [[UM_manual_normalization_1.1#Adding_PREMIS_eventDetail_for_manual_normalization|User Manual - Adding PREMIS eventDetail for manual normalization]]
* Backend/Not user-facing:
**Improved unicode support https://github.com/artefactual/archivematica/pull/17
**Better handling of preconfigured choices (processingMCP.xml)
**More choices in processing archive file formats (extra preconfigured choices)
**Improved handling of unit variables (passing parameters between micro-services)
**Update to FITS 0.8.0 (or newer if available)
**Update to ElasticSearch 0.90.13
**Security fix (avoid invoking subshell when running micro-services) https://github.com/artefactual/archivematica/pull/16
**File identification in mets is now from file id tool and not FITS https://github.com/artefactual/archivematica/pull/15

===Bug fixes and enhancements===

*[http://bit.ly/1lPBqwV bug fixes]

==Archivematica 1.0==

Release for public testing: September 2013
Package release: January 2014

===New features===

* Format Policy Registry (FPR) improvements including
**Ability to add/change format policies in the dashboard
**Ability to update the local FPR from fpr.archivematica.org
**Upload and report performance stats to FPR
**For detailed information about the FPR, see [[Administrator_manual_1.0#Format_Policy_Registry_.28FPR.29|Administrator manual--FPR]]
*Generation of "fail" reports in the administrative tab of the dashboard
*Eliminate unused interface options (e.g. DSpace transfer, CONTENTdm upload, ICA-AtoM upload) via the administrative tab of the dashboard
*DIP upload to Archivist Toolkit [[Archivists Toolkit integration]] with a metadata entry gui in the dashboard and actionable PREMIS rights
*AIP pointer file
*[[Administrator_manual_1.0#Storage_service|Storage service]] with API
*Ability to request to delete an AIP via the dashboard
*Upgraded to [https://github.com/harvard-lts/fits FITS 0.62]
*Ability for multiple pipelines to write to a shared ElasticSearch index and to the same AIP store(s) (i.e. multiple dept's -> one institution)
* Further scalability testing/prototyping and improved documentation

===Bug fixes and enhancements===

*[https://projects.artefactual.com/versions/31 bug fixes]

Development environment

2014-08-15T19:53:52Z

Mdemeo: /* Update */ Document new dev-helper question

[[Main Page]] > [[Development]] > Development Environment

This page explains how you can configure and use a standard Linux system as an Archivematica development environment.
The Archivematica development environment is available for developers that want the ability to customize or enhance their own Archivematica installation and/or [[contribute code]] back to the Archivematica project.

=Setup=
*Install ubuntu 12.04
*create a non-root user (with sudo privileges)
*log in as your new non-root user
*install archivematica storage service
<pre>
sudo apt-get update
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:archivematica/release
sudo add-apt-repository ppa:archivematica/externals
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install archivematica-storage-service
sudo rm /etc/nginx/sites-enabled/default
sudo ln -s /etc/nginx/sites-available/storage /etc/nginx/sites-enabled/storage
sudo ln -s /etc/uwsgi/apps-available/storage.ini /etc/uwsgi/apps-enabled/storage.ini
sudo service uwsgi restart
sudo service nginx restart
</pre>

*install git
<pre>sudo apt-get install git</pre>

*Use git to checkout Archivematica code
<pre>git clone https://github.com/artefactual/archivematica.git</pre>

*Run the install and helper scripts.
** <pre>cd archivematica</pre>
** <pre>./dev-installer</pre>
** Answer Y to all prompts.
** Restart the machine. (this is to enable the upstart services).
** <pre>cd archivematica</pre>
** <pre>./dev-helper</pre>
** Answer Y to all prompts.
** Complete AtoM setup
*** The database should already be created.
*** http://localhost/atom
*** https://www.qubit-toolkit.org/wiki/Installation#Open_Qubit.2C_ICA-AtoM.2C_or_DCB_in_your_web_browser
** Open the [dashboard http://localhost]

=Configuring and Using Archivematica=
These manuals contain information about the use and configuration of Archivematica:
 https://www.archivematica.org/wiki/Administrator_manual
 https://www.archivematica.org/wiki/User_Manual

=Update=

To pull down the latest code commits from the repository, and reset your dev install, navigate to the directory where Archivematica has been cloned:
*Change Directory to the archivematica git directory.
<pre>cd ~/archivematica/</pre>
*Check to see if you have any local changes that need to be stashed
<pre>git diff</pre>
*If there are local changes please stash or commit them. (Or you won't be able to update).
<pre>git stash</pre>
Update...
*<pre>./dev-helper</pre>
*<pre>"Would you like to git pull?" (y/N) y</pre>
*<pre>"Would you like to update/install package requirements?" (y/N) </pre>
*<pre>"Would you like to recreate the databases?" (y/N) y</pre>
**Note: this command will remove any processing sips/transfers.
*<pre>"Would you like to erase the ElasticSearch indexes?" (y/N) </pre>
*<pre>"Would you like to clear transfer backlog and AIP storage?" (y/N) </pre>
**Note: this command will delete all stored AIPs and stored backlogged transfers.
*<pre>"Would you like to restart archivematica services?" (y/N) y</pre>
*<pre>"Would you like to update sample data in /home/user/archivematica-sampledata?" (y/N) </pre>
*<pre>"Would you like to export sample data from /home/user/archivematica-sampledata?" (y/N) </pre>
*<pre>"Would you like to update AtoM and restart its atom-worker service?" (y/N) y</pre>
If you stashed changes, re-apply them with
<pre>git stash pop</pre>

=Troubleshooting=
* If it stalls during update "Would you like to update/install package requirements?" (y/N)
** Stop the script with CTRL + C
** Try installing the item it failed on on the command line. Ie.
** <pre>sudo apt-get install postfix</pre>
** Restart the dev-helper "Would you like to update/install package requirements?"

* If you get this error error starting service:
** <pre>sudo start archivematica-mcp-server</pre>
** <pre>start: Unknown job: archivematica-mcp-server</pre>
** Then reboot the machine.

*If a SIP processing fails, it will move it to the 'failed' directory which is located:
**<pre>/var/archivematica/sharedDirectory/watchedDirectories/failed</pre>

*if ArchivematicaServer freezes
**<pre>sudo restart archivematica-mcp-server</pre>
**<pre>sudo restart archivematica-mcp-client</pre>

*if ArchivematicaClient freezes (in terminal kill command)
**<pre>sudo restart archivematica-mcp-client</pre>

*Updates to the Dashboard may require an Apache webserver restart:
**<pre>sudo /etc/init.d/apache2 restart</pre>

*If you find a problem running the Dashboard and you want to get a detailed error log to report us, please switch it to debug mode following [http://archivematica.org/wiki/index.php?title=Dashboard#Debug_mode these instructions].

*MCP is currently logging to the /tmp/directory
** /tmp/archivematicaMCPClient-HOST.log
** /tmp/archivematicaMCPServer-HOST-DATE.log
** /tmp/archivematicaMCPServerPID

[[Category:Development documentation]]

Install-1.2-packages

2014-08-08T18:50:41Z

Mdemeo: Provide instructions to start/restart FITS

[[Installation]] >> [[Install-1.2|Install 1.2]] >> Install 1.2 packages

== Updating from Archivematica 1.1 ==

If you have installed Archivematica 1.1.0 from packages, it is possible to update your installation without re-installing. The steps are:

=== Update Archivematica Storage Service ===
<pre>
sudo apt-get update
sudo apt-get install archivematica-storage-service
</pre>

=== Update Archivematica ===
During the update process you will be asked about updating configuration files. Choose to accept the maintainers versions. You will also be asked about updating the database, say 'ok' to each of those steps. If you have set a password for the root mysql database user, enter it when prompted.
<pre>
sudo apt-get install archivematica-mcp-server
sudo apt-get install archivematica-mcp-client
sudo apt-get install archivematica-dashboard
</pre>

=== Update ElasticSearch ===

Archivematica 1.2.0 has been tested most extensively against version 0.90.13 of ElasticSearch. It is possible to use an older version (e.g. 0.20.6, which is what was distributed with Archivematica 1.0.0). Do not attempt to use ElasticSearch 1.0 or greater.

*Add the ElasticSearch apt repository next (from http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-repositories.html):
<pre>
sudo wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
</pre>
Then add this line to the bottom of /etc/apt/sources.list
<pre>
deb http://packages.elasticsearch.org/elasticsearch/0.90/debian stable main
</pre>
Now refresh your list of available packages and upgrade elasticsearch
<pre>
sudo apt-get update
sudo apt-get install elasticsearch
</pre>

=== Restart Services ===
<pre>
sudo service uwsgi restart
sudo service nginx restart
sudo /etc/init.d/apache2 restart
sudo /etc/init.d/elasticsearch restart
sudo /etc/init.d/gearman-job-server restart
sudo restart archivematica-mcp-server
sudo restart archivematica-mcp-client
sudo restart fits
</pre>

== Installing Archivematica 1.2 packages (new install) ==

Archivematica packages are hosted on Launchpad, in an Ubuntu PPA (Personal Package Archive). In order to install software onto your Ubuntu 12.04.4 system from a PPA:

*Add the archivematica/release PPA to your list of trusted repositories (if add-apt-repositories is not available you must install python-software-properties first):
<pre>
sudo apt-get update
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:archivematica/release
sudo add-apt-repository ppa:archivematica/externals
</pre>

*Add the ElasticSearch apt repository next (from http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-repositories.html):
<pre>
sudo wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
</pre>
Then add this line to the bottom of /etc/apt/sources.list
<pre>
deb http://packages.elasticsearch.org/elasticsearch/0.90/debian stable main
</pre>

*Update your system to the most recent 12.04 release

*This step will also fetch a list of the software from the PPAs you just added to your system.

<pre>
sudo apt-get update
sudo apt-get upgrade
</pre>

* Install all packages (each of these packages can be installed seperately, if necessary). Say YES or OK to any prompts you get after entering the following into terminal:
<pre>
sudo apt-get install archivematica-storage-service
sudo apt-get install elasticsearch
sudo apt-get install archivematica-mcp-server
sudo apt-get install archivematica-mcp-client
sudo apt-get install archivematica-dashboard
</pre>

* Configure the dashboard and storage service
note:these steps are safe to do on a desktop, or a machine dedicated to Archivematica. They may not be advisable on an existing web server. Consult with your web server administrator if you are unsure.

<pre>
sudo wget -q https://raw.github.com/artefactual/archivematica/stable/1.1.x/src/vm-includes/share/apache.default -O /etc/apache2/sites-available/default
sudo rm /etc/nginx/sites-enabled/default
sudo ln -s /etc/nginx/sites-available/storage /etc/nginx/sites-enabled/storage
sudo ln -s /etc/uwsgi/apps-available/storage.ini /etc/uwsgi/apps-enabled/storage.ini
sudo service uwsgi restart
sudo service nginx restart
sudo /etc/init.d/apache2 restart
sudo freshclam
sudo /etc/init.d/clamav-daemon start
sudo /etc/init.d/elasticsearch restart
sudo /etc/init.d/gearman-job-server restart
sudo start archivematica-mcp-server
sudo start archivematica-mcp-client
sudo start fits
</pre>

* Test the [[Administrator_manual_1.2#Storage_service|storage service]]
The storage service runs as a separate web application from the Archivematica dashboard. Go to the following link in a web browser:

http://localhost:8000 (or use the IP address of the machine you have been installing on). 
log in as user: test password: test

* Create a new administrative user in the Storage service

The storage service has its own set of users. In the User menu in the Administrative tab of the storage service, add at least one administrative user, and delete or modify the test user.

* Test the dashboard
you can login to the archivematica dashboard and finish the installation in a web browser:

http://localhost

* Register your installation for full Format Policy Registry interoperability.

[[Register_1.2|Register Archivematica 1.2]]

== Using ICA-AtoM with Archivematica ==

If you are interested in using the [https://www.ica-atom.org/ ICA-Atom] web-based content management and access system you can use the package within our PPA. Archivematica can upload digital objects to their associated descriptions in ICA-AtoM.

It is installed by running the following commands in your Terminal (answer yes to all prompts, it may require your mysql adminitration password if one was entered above):
<pre>
sudo aptitude install ica-atom
sudo aptitude install upload-qubit
sudo /etc/init.d/apache2 restart
sudo cp /var/www/atom/init/atom-worker.conf /etc/init/atom-worker.conf
</pre>

You can finish your installation by visiting http://localhost/ica-atom/. For more information about ICA-AtoM installation see [https://www.ica-atom.org/doc/Installation ICA-AtoM 1.3.1 installation].

Occasionally when loading the ica-atom web installer for the first time an error appears, refreshing your web browser usually resolves this. We intend to fix this in future version of the AtoM package.

After the AtoM installation, configure it to use the Sword plugin (in the Plugins menu) as well as the job scheduler (in Settings), then start the atom-worker:

<pre>
sudo start atom-worker
</pre>

== Using AtoM 2.x with Archivematica ==

Archivematica has been successfully tested with Atom 2.x. The best known configuration is Archivematica 1.0.1 or greater with Atom 2.0.2-rc1 or greater.
Installation instructions for Atom 2 are available on the accesstomemory.org website https://www.accesstomemory.org/en/docs/2.0/ . When following those instructions, it is best to download Atom from the git repository (rather than use one of the supplied tarballs). When checking out Atom, use tag v2.0.2-rc1.

Once you have a working Atom installation, you can configure dip upload between Archivematica and Atom. The basic steps are:

* update atom dip upload configuration in the Archivematica dashboard
* confirm upload-qubit package is installed on the Archivematica server
* confirm atom-worker is configured on the Atom server
* enable the Sword Plugin in the Atom plugins page
* enable job scheduling in the Atom settings page
* confirm gearman is installed on the Atom server
* configure ssh keys to allow rsync to work for the archivematica user, from the Archivematica server to the Atom server
* start gearman on the Atom server
* start the atom worker on the Atom server

These steps are detailed on the [[Upload_DIP|DIP Upload page]].

Install-1.2-packages

2014-08-08T18:33:34Z

Mdemeo: Create page by copying 1.1 contents and tweaking

[[Installation]] >> [[Install-1.2|Install 1.2]] >> Install 1.2 packages

== Updating from Archivematica 1.1 ==

If you have installed Archivematica 1.1.0 from packages, it is possible to update your installation without re-installing. The steps are:

=== Update Archivematica Storage Service ===
<pre>
sudo apt-get update
sudo apt-get install archivematica-storage-service
</pre>

=== Update Archivematica ===
During the update process you will be asked about updating configuration files. Choose to accept the maintainers versions. You will also be asked about updating the database, say 'ok' to each of those steps. If you have set a password for the root mysql database user, enter it when prompted.
<pre>
sudo apt-get install archivematica-mcp-server
sudo apt-get install archivematica-mcp-client
sudo apt-get install archivematica-dashboard
</pre>

=== Update ElasticSearch ===

Archivematica 1.2.0 has been tested most extensively against version 0.90.13 of ElasticSearch. It is possible to use an older version (e.g. 0.20.6, which is what was distributed with Archivematica 1.0.0). Do not attempt to use ElasticSearch 1.0 or greater.

*Add the ElasticSearch apt repository next (from http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-repositories.html):
<pre>
sudo wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
</pre>
Then add this line to the bottom of /etc/apt/sources.list
<pre>
deb http://packages.elasticsearch.org/elasticsearch/0.90/debian stable main
</pre>
Now refresh your list of available packages and upgrade elasticsearch
<pre>
sudo apt-get update
sudo apt-get install elasticsearch
</pre>

=== Restart Services ===
<pre>
sudo service uwsgi restart
sudo service nginx restart
sudo /etc/init.d/apache2 restart
sudo /etc/init.d/elasticsearch restart
sudo /etc/init.d/gearman-job-server restart
sudo restart archivematica-mcp-server
sudo restart archivematica-mcp-client
</pre>

== Installing Archivematica 1.2 packages (new install) ==

Archivematica packages are hosted on Launchpad, in an Ubuntu PPA (Personal Package Archive). In order to install software onto your Ubuntu 12.04.4 system from a PPA:

*Add the archivematica/release PPA to your list of trusted repositories (if add-apt-repositories is not available you must install python-software-properties first):
<pre>
sudo apt-get update
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:archivematica/release
sudo add-apt-repository ppa:archivematica/externals
</pre>

*Add the ElasticSearch apt repository next (from http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-repositories.html):
<pre>
sudo wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
</pre>
Then add this line to the bottom of /etc/apt/sources.list
<pre>
deb http://packages.elasticsearch.org/elasticsearch/0.90/debian stable main
</pre>

*Update your system to the most recent 12.04 release

*This step will also fetch a list of the software from the PPAs you just added to your system.

<pre>
sudo apt-get update
sudo apt-get upgrade
</pre>

* Install all packages (each of these packages can be installed seperately, if necessary). Say YES or OK to any prompts you get after entering the following into terminal:
<pre>
sudo apt-get install archivematica-storage-service
sudo apt-get install elasticsearch
sudo apt-get install archivematica-mcp-server
sudo apt-get install archivematica-mcp-client
sudo apt-get install archivematica-dashboard
</pre>

* Configure the dashboard and storage service
note:these steps are safe to do on a desktop, or a machine dedicated to Archivematica. They may not be advisable on an existing web server. Consult with your web server administrator if you are unsure.

<pre>
sudo wget -q https://raw.github.com/artefactual/archivematica/stable/1.1.x/src/vm-includes/share/apache.default -O /etc/apache2/sites-available/default
sudo rm /etc/nginx/sites-enabled/default
sudo ln -s /etc/nginx/sites-available/storage /etc/nginx/sites-enabled/storage
sudo ln -s /etc/uwsgi/apps-available/storage.ini /etc/uwsgi/apps-enabled/storage.ini
sudo service uwsgi restart
sudo service nginx restart
sudo /etc/init.d/apache2 restart
sudo freshclam
sudo /etc/init.d/clamav-daemon start
sudo /etc/init.d/elasticsearch restart
sudo /etc/init.d/gearman-job-server restart
sudo start archivematica-mcp-server
sudo start archivematica-mcp-client
</pre>

* Test the [[Administrator_manual_1.2#Storage_service|storage service]]
The storage service runs as a separate web application from the Archivematica dashboard. Go to the following link in a web browser:

http://localhost:8000 (or use the IP address of the machine you have been installing on). 
log in as user: test password: test

* Create a new administrative user in the Storage service

The storage service has its own set of users. In the User menu in the Administrative tab of the storage service, add at least one administrative user, and delete or modify the test user.

* Test the dashboard
you can login to the archivematica dashboard and finish the installation in a web browser:

http://localhost

* Register your installation for full Format Policy Registry interoperability.

[[Register_1.2|Register Archivematica 1.2]]

== Using ICA-AtoM with Archivematica ==

If you are interested in using the [https://www.ica-atom.org/ ICA-Atom] web-based content management and access system you can use the package within our PPA. Archivematica can upload digital objects to their associated descriptions in ICA-AtoM.

It is installed by running the following commands in your Terminal (answer yes to all prompts, it may require your mysql adminitration password if one was entered above):
<pre>
sudo aptitude install ica-atom
sudo aptitude install upload-qubit
sudo /etc/init.d/apache2 restart
sudo cp /var/www/atom/init/atom-worker.conf /etc/init/atom-worker.conf
</pre>

You can finish your installation by visiting http://localhost/ica-atom/. For more information about ICA-AtoM installation see [https://www.ica-atom.org/doc/Installation ICA-AtoM 1.3.1 installation].

Occasionally when loading the ica-atom web installer for the first time an error appears, refreshing your web browser usually resolves this. We intend to fix this in future version of the AtoM package.

After the AtoM installation, configure it to use the Sword plugin (in the Plugins menu) as well as the job scheduler (in Settings), then start the atom-worker:

<pre>
sudo start atom-worker
</pre>

== Using AtoM 2.x with Archivematica ==

Archivematica has been successfully tested with Atom 2.x. The best known configuration is Archivematica 1.0.1 or greater with Atom 2.0.2-rc1 or greater.
Installation instructions for Atom 2 are available on the accesstomemory.org website https://www.accesstomemory.org/en/docs/2.0/ . When following those instructions, it is best to download Atom from the git repository (rather than use one of the supplied tarballs). When checking out Atom, use tag v2.0.2-rc1.

Once you have a working Atom installation, you can configure dip upload between Archivematica and Atom. The basic steps are:

* update atom dip upload configuration in the Archivematica dashboard
* confirm upload-qubit package is installed on the Archivematica server
* confirm atom-worker is configured on the Atom server
* enable the Sword Plugin in the Atom plugins page
* enable job scheduling in the Atom settings page
* confirm gearman is installed on the Atom server
* configure ssh keys to allow rsync to work for the archivematica user, from the Archivematica server to the Atom server
* start gearman on the Atom server
* start the atom worker on the Atom server

These steps are detailed on the [[Upload_DIP|DIP Upload page]].

Administrator manual 1.2

2014-08-07T23:10:56Z

Mdemeo: /* Storage service */ Document package/space/location structure

[[Main Page]] > [[Documentation]] > Administrator manual 1.2

This manual covers administrator-specific instructions for Archivematica. It will also provide help for using forms in the Administration tab of the Archivematica dashboard and the administrator capabilities in the Format Policy Registry (FPR), which you will find in the Preservation planning tab of the dashboard.

For end-user instructions, please see the [[User_manual_1.2|user manual]].

= Installation =
* [[Installation|Instructions for installing the latest build of Archivematica on your server]]

= Upgrading =

Currently, Archivematica does not support upgrading from one version to the next. A re-install is required. After re-installing, you can restore Archivematica's knowledge of your AIPs, by [[#Rebuilding_the_AIP_index|rebuilding the AIP index]] and, if you have transfers stored in the backlog, [[#Rebuilding_the_transfer_index|rebuilding the transfer index]].

= Storage service =
The Archivematica Storage Service allows the configuration of storage spaces associated with multiple Archivematica pipelines. It allows a storage administrator to configure what storage is available to each Archivematica installation, both locally and remote.

[[File:SS1-0.png|700px|center|thumb|Home page of Storage Service]]

== Storage Service entities and organization ==

=== Packages ===

The Storage Service is oriented storing ''packages''. A "package" is a bundle of one or more files transferred from an external service; for example, a package may be an AIP, a backlogged transfer, or a DIP. Each package is stored in a [[#location|location]].

=== Spaces ===

A ''space'' models a specific storage device. That device might be a locally-accessible disk, a network share, or a remote system accessible via a protocol like FEDORA, SWIFT, or LOCKSS. The space provides the Storage Service with configuration to read and/or write data stored within itself.

Packages are not stored directly inside a space; instead, packages are stored within [[#locations|locations]], which are organized subdivisions of a space.

=== Locations ===

A location is a subdivision of a [[#space|space]]. Each location is assigned a specific ''purpose'', such as AIP storage or transfer backlog, in order to provide an organized way to structure content within a space.

== Archivematica Configuration ==

When installing Archivematica, options to configure it with the Storage Service will be presented.

[[File:Install3.png|600px|center]]

If you have installed the Storage Service at a different URL, you may change that here.

The top button 'Use default transfer source & AIP storage locations' will attempt to automatically configure default Locations for Archivematica, register a new Pipeline, and generate an error if the Storage Service is not available. Use this option if you want the Storage Service to automatically set up the configured default values.

The bottom button 'Register this pipeline & set up transfer source and AIP storage locations' will only attempt to register a new Pipeline with the Storage Service, and will not error if not Storage Service can be found. It will also open a link to the provided Storage Service URL, so that Locations can be configured manually. Use this option if the default values not desired, or the Storage Service is not running yet. Locations will have to be configured manually before any Transfers can be processed, or AIPs stored.

If the Storage Service is running, the URL to it should be entered, and Archivematica will attempt to register its dashboard UUID as a new Pipeline. Otherwise, the dashboard UUID is displayed, and a Pipeline for this Archivematica instance can be manually created and configured. The dashboard UUID is also available in Archivematica under Administration -> General.

=== Change the port in the web server configuration ===

The storage services uses nginx by default, so you can edit /etc/nginx/sites-enabled/storage and change the line that says

listen 8000;

change 8000 to whatever port you prefer to use.

Keep in mind that in a default installation of Archivematica 1.0, the dashboard is running in Apache on port 80. So it is not possible to make nginx run on port 80 on the same machine. If you install the storage service on its own server, you can set it to use port 80.

Make sure to adjust the dashboard UUID in the Archivematica dashboard under Administration -> General.

== Spaces ==
[[File:Spaces.png|600px|center]]
A storage Space contains all the information necessary to connect to the physical storage. It is where protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored. All locations must be contained in a space.

A space is usually the immediate parent of the Location folders. For example, if you had transfer source locations at <tt>/home/artefactual/archivematica-sampledata-2013-10-10-09-17-20</tt> and <tt>/home/artefactual/maildir_transfers</tt>, the Space's path would be <tt>/home/artefactual/</tt>

Currently supported protocols are local filesystem, NFS, and pipeline local filesystem.

=== Local Filesystem ===

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. Typically this is the hard drive, SSD or raid array attached to the machine, but it could also encompass remote storage that has already been mounted. For remote storage that has been locally mounted, we recommend using a more specific Space if one is available.

==== Fields ====
* ''Path'': Absolute path to the Space on the local filesystem
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.

=== NFS ===

NFS spaces are for NFS exports mounted on the Storage Service server, and the Archivematica pipeline.

==== Fields ====
* ''Path'': Absolute path the space is mounted at on the filesystem local to the storage service
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the remote computer exporting the NFS mount.
* ''Remote path'': Export path on the NFS server
* ''Version'': nfs or nfs4 - as would be passed to the <tt>mount</tt> command.
* ''Manually Mounted'': Check this if it has been mounted already. Otherwise, the Storage Service will try to mount it. ''Note: this feature is not yet available.''

=== Pipeline Local Filesystem ===

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host.

For example, the storage service is hosted on <tt>storage_service_host</tt> and Archivematica is running on <tt>archivematica1</tt> . The transfer sources for Archivematica are stored locally on <tt>archivematica1</tt>, but the storage service needs access to them. The Space for that transfer source would be a Pipeline Local Filesystem.

'''Note: Passwordless SSH must be set up between the Storage Service host and the computer Archivematica is running on.'''

==== Fields ====
* ''Path'': Absolute path to the space on the remote machine.
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the computer running Archivematica. Should be SSH accessible from the Storage Service computer.
* ''Remote user'': Username on the remote host

== Locations ==
[[File:Locations.png|600px|center]]
A storage Location is contained in a Space, and knows its purpose in the Archivematica system. A Location is also where Packages are stored. Each Location is associated with a pipeline and can only be accessed by that pipeline.

Currently, a Location can have one of three purposes: Transfer Source, Currently Processing, or AIP Storage. Transfer source locations display in Archivematica's Transfer tab, and any folder in a transfer source can be selected to become a Transfer. AIP storage locations are where the completed AIPs are put for long-term storage. During processing, Archivematica uses the currently processing location associated with that pipeline. Only one currently processing location should be associated with a given pipeline. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.

==== Fields ====
* ''Purpose'': What use the Location is for
* ''Pipeline'': Which pipelines this location is available to.
* ''Relative Path'': Path to this Location, relative to the space that contains it.
* ''Description'': Description of the Location to be displayed to the user.
* ''Quota'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Enabled'': If checked, this location is accessible to pipelines associated with it. If unchecked, it will not show up to any pipeline.

== Pipeline ==
[[File:Pipelines.png|600px|center]]
A pipeline is an Archivematica instance registered with the Storage Service, including the server and all associated clients. Each pipeline is uniquely identified by a UUID, which can be found in the dashboard under Administration -> General Configuration. When installing Archivematica, it will attempt to register its UUID with the Storage Service, with a description of "Archivematica on <hostname>".

==== Fields ====
* ''UUID'': Unique identifier of the Archivematica pipeline
* ''Description'': Description of the pipeline displayed to the user. e.g. Sankofa demo site
* ''Enabled'': If checked, this pipeline can access locations associate with it. If unchecked, all locations will be disabled, even if associated.
* ''Default Locations'': If checked, the default locations configured in Administration -> Configuration will be created or associated with the new pipeline.

== Packages ==
[[File:Packages.png|600px|center]]
A Package is a file that Archivematica has stored in the Storage Service, commonly an Archival Information Package (AIP). They cannot be created or deleted through the Storage Service interface, though a deletion request can be submitted through Archivematica that must be approved or rejected by the storage service administrator. To learn more about deleting an AIP, see [[UM_archival_storage_1.2#Deleting_an_AIP|Deleting an AIP]].

== Administration ==
[[File:StorageserviceAdmin1.png|600px|center]]
[[File:StorageserviceAdmin2.png|600px|center]]
The Administration section manages the users and settings for the Storage Service.

=== Users ===

Only registered users can long into the storage service, and the Users page is where users can be created or modified.

The storage service has two types of users: administrative users, and regular users. In the 0.4.0 release of the storage service, the only distinction between the two types is for email notifications; administrators will be notified by email when special events occur, while regular users will not.

=== Settings ===

Settings control the behavior of the Storage Service. Default Locations are the created or associated with pipelines when they are created.

'''Pipelines are disabled upon creation?''' sets whether a newly created Pipeline can access its Locations. If a Pipeline is disabled, it cannot access any of its locations. By disabling newly created Pipelines, it provides some security against unwanted perusal of the files in Locations, or use by unauthorized Archivematica instances. This can be configured individually when creating a Pipeline manually through the Storage Service website.

'''Default Locations''' set what existing locations should be associated with a newly created Pipeline, or what new Locations should be created for each new Pipeline. No matter what is configured here, a Currently Processing location is created for all Pipelines, since one is required. Multiple Transfer Source or AIP Storage Locations can be configured by holding down Ctrl when selecting them. New Locations in an existing Space can be created for Pipelines that use default locations by entering the relevant information.

== How to Configure a Location ==

For Spaces of the type "Local Filesystem," Locations are basically directories (or more accurately, paths to directories). You can create Locations for Transfer Source, Currently Processing, or AIP Storage directories.

To create and configure a new Location:

# In the Storage Service, click on the "Spaces" tab.
# Under the Space that you want to add the Location to, click on the "Create Location here" link.
# Choose a purpose (e.g. AIP Storage) and pipeline, and enter a "Relative Path" (e.g. var/mylocation) and human-readable description. The Relative Path is relative to the Path defined in the Space you are adding the Location to, e.g. for the default Space, the Path is '/' so your Location path would be relative to that (in the example here, the complete path would end up being '/var/mylocation'). Note: if the path you are defining in your Location doesn't exist, you must create it manually and make sure it is writable by the archivematica user.
# Save the Location settings.
# The new location will now be available as an option under the appropriate options in the Dashboard, for example as a Transfer location (which must be enabled under the Dashboard "Administration" tab) or as a destination for AIP storage.

== Store DIP ==

= Dashboard administration tab =

The Archivematica administration pages, under the Administration tab of the dashboard, allows you to configure application components and manage users.

== Processing configuration ==

When processing a SIP or transfer, you may want to automate some of the workflow choices. Choices can be preconfigured by putting a 'processingMCP.xml' file into the root directory of a SIP/transfer.

If a SIP or transfer is submitted with a 'processingMCP.xml' file, processing decisions will be made with the included file.

The XML file format is:
<pre><processingMCP>
<preconfiguredChoices>

<preconfiguredChoice>
<appliesTo>755b4177-c587-41a7-8c52-015277568302</appliesTo>
<goToChain>d4404ab1-dc7f-4e9e-b1f8-aa861e766b8e</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>eeb23509-57e2-4529-8857-9d62525db048</appliesTo>
<goToChain>5727faac-88af-40e8-8c10-268644b0142d</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>19adb668-b19a-4fcb-8938-f49d7485eaf3</appliesTo>
<goToChain>333643b7-122a-4019-8bef-996443f3ecc5</goToChain>
<delay unitCtime="yes">2419200.0</delay>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>dec97e3c-5598-4b99-b26e-f87a435a6b7f</appliesTo>
<goToChain>01d80b27-4ad1-4bd1-8f8d-f819f18bf685</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>f19926dd-8fb5-4c79-8ade-c83f61f55b40</appliesTo>
<goToChain>85b1e45d-8f98-4cae-8336-72f40e12cbef</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>7a024896-c4f7-4808-a240-44c87c762bc5</appliesTo>
<goToChain>3c1faec7-7e1e-4cdd-b3bd-e2f05f4baa9b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01d64f58-8295-4b7b-9cab-8f1b153a504f</appliesTo>
<goToChain>9475447c-9889-430c-9477-6287a9574c5b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01c651cb-c174-4ba4-b985-1d87a44d6754</appliesTo>
<goToChain>414da421-b83f-4648-895f-a34840e3c3f5</goToChain>
</preconfiguredChoice>
</preconfiguredChoices>
</processingMCP>
</pre>

Where appliesTo is the UUID associated with the micro-service job presented in the dashboard, and goToChain is the UUID of the desired selection. The default processingMCP.xml file is located at '/var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml'.

The processing configuration administration page of the dashboard provides you with an easy form to configure the default 'processingMCP.xml' that's added to a SIP or transfer if it doesn't already contain one. When you change the options using the web interface the necessary XML will be written behind the scenes.

[[File:ProcessingConfig1-1.png|1000px|center|thumb|Processing configuration form in Administration tab of the dashboard]]

*For the approval (yes/no) steps, the user ticks the box on the left-hand side to make a choice. If the box is not ticked, the approval step will appear in the dashboard.
*For the other steps, if no actions are selected the choices appear in the dashboard
*You can select whether or not to send transfers to quarantine (yes/no) and decide how long you'd like them to stay there.
*You can select whether to extract packages as well as whether to keep and/or delete the extracted objects and/or the package itself.
*You can approve normalization, sending the AIP to storage, and uploading the DIP without interrupting the workflow in the dashboard.
*You can pre-select which format identification tool and command to run in both/either transfer and/or ingest to base your normalization upon.
*You can choose to send a transfer to backlog or to create a SIP every time.
*You can select to be reminded to add PREMIS event metadata about manual normalization should you choose to use that capability.
*You can select between 7z using lzma and 7zip using bzip or parallel bzip2 algorithms for AIP compression.
*For select compression level, the options are as follows:
**1 - fastest mode
**3 - fast compression mode
**5 - normal compression mode
**7 - maximum compression
**9 - ultra compression
*You can select one archival storage location where you will consistently send your AIPs.

== General ==

In the general configuration section, you can select interface options and set [[Administrator_manual_1.2#Storage_service|Storage Service]] options for your Archivematica client.

[[File:Generalconfig.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

=== Interface options ===

Here, you can hide parts of the interface that you don't need to use. In particular, you can hide CONTENTdm DIP upload link, AtoM DIP upload link and DSpace transfer type.

=== Storage Service options ===

This is where you'll find the complete URL for the Storage Service. See [[Administrator_manual_1.2#Storage_service|Storage Service]] for more information about this feature.

== Failures ==

Archivematica 1.2 includes dashboard failure reporting.
[[File:FailuresAdmin.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

== Transfer source location ==

Archivematica allows you to start transfers using the operating system's file browser or via a web interface. Source files for transfers, however, can't be uploaded using the web interface: they must exist on volumes accessible to the Archivematica MCP server and configured via the [[Administrator_manual_1.2#Storage_service|Storage Service]].

When starting a transfer you're required to select one or more directories of files to add to the transfer.

You can view your transfer source directories in the Administrative tab of the dashboard under "Transfer source locations".

== AIP storage locations ==

AIP storage directories are directories in which completed AIPs are stored. Storage directories can be specified in a manner similar to transfer source directories using the [[Administrator_manual_1.2#Storage_service|Storage Service]].

You can view your transfer source directories in the Administrative tab of the dashboard under "AIP storage locations"

== AtoM DIP upload ==

Archivematica can upload DIPs directly to an [https://www.ica-atom.org/ AtoM] website so the contents can be accessed online. The AtoM DIP upload configuration page is where you specify the details of the AtoM installation you'd like the DIPs uploaded to (and, if using Rsync to transfer the DIP files, Rsync transfer details).

The parameters that you'll most likely want to set are <code>url</code>, <code>email</code>, and <code>password</code>. These parameters, respectively, specify the destination AtoM website's URL, the email address used to log in to the website, and the password used to log in to the website.

AtoM DIP upload can also use [http://en.wikipedia.org/wiki/Rsync Rsync] as a transfer mechanism. Rsync is an open source utility for efficiently transferring files. The <code>rsync-target</code> parameter is used to specify an Rsync-style target host/directory pairing, "foobar.com:~/dips/" for example. The <code>rsync-command</code> parameter is used to specify rsync connection options, "ssh -p 22222 -l user" for example. If you are using the rsync option, please see AtoM server configuration below.

To set any parameters for AtoM DIP upload change the values, preserving the existing format they're specified in, in the "Command arguments" field then click "Save".

Note that in AtoM, the sword plugin (Admin --> Plugins --> qtSwordPlugin) must be enabled in order for AtoM to receive uploaded DIPs. Enabling Job scheduling (Admin --> Settings --> Job scheduling) is also recommended.

=== AtoM server configuration ===

This server configuration step is necessary to allow Archivematica to log in to the AtoM server without passwords, and only when the user is deploying the rsync option described above in the AtoM DIP upload section.

To enable sending DIPs from Archivematica to the AtoM server:

Generate SSH keys for the Archivematica user. Leave the passphrase field blank.
<pre>
$ sudo -i -u archivematica
$ cd ~
$ ssh-keygen
</pre>

Copy the contents of <code>/var/lib/archivematica/.ssh/id_rsa.pub</code> somewhere handy, you will need it later.

Now, it's time to configure the AtoM server so Archivematica can send the DIPs using SSH/rsync. For that purpose, you will create a user called <code>archivematica</code> and we are going to assign that user a restricted shell with access only to rsync:

<pre>
$ sudo apt-get install rssh
$ sudo useradd -d /home/archivematica -m -s /usr/bin/rssh archivematica
$ sudo passswd -l archivematica
$ sudo vim /etc/rssh.conf // Make sure that allowrsync is uncommented!
</pre>

Add the SSH key that we generated before:

<pre>
$ sudo mkdir /home/archivematica/.ssh
$ chmod 700 /home/archivematica/.ssh/
$ sudo vim /home/archivematica/.ssh/authorized_keys // Paste here the contents of id_dsa.pub
$ chown -R archivematica:archivematica /home/archivematica
</pre>

In Archivematica, make sure that you update the <code>--rsync-target</code> accordingly. 
These are the parameters that we are passing to the upload-qubit microservice. 
Go to the Administration > Upload DIP page in the dashboard.

Generic parameters:

<pre>
--url="http://atom-hostname/index.php" \
--email="demo@example.com" \
--password="demo" \
--uuid="%SIPUUID%" \
--rsync-target="archivematica@atom-hostname:/tmp" \
--debug
</pre>

== CONTENTdm DIP upload ==

Archivematica can also upload DIPs to [http://www.contentdm.org/ CONTENTdm] instances. Multiple CONTENTdm destinations may be configured.

For each possible CONTENTdm DIP upload destination, you'll specify a brief description and configuration parameters appropriate for the destination. Paramters include <code>%ContentdmServer%</code> (full path to the CONTENTdm API, including the leading 'http://' or 'https://', for example http://example.com:81/dmwebservices/index.php), <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> (Linux user and group on the CONTENTdm server, not a CONTENTdm username). Note that only <code>%ContentdmServer%</code> is required is you are going to produce CONTENTdm Project Client packages; <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> are also required if you are going to use the "direct upload" option for uploading your DIPs into CONTENTdm.

When changing parameters for a CONTENTdm DIP upload destination simply change the values, preserving the existing format they're specified in. To add an upload destination fill in the form at the bottom of the page with the appropriate values. When you've completed your changes click the "Save" button.

== PREMIS agent ==

The PREMIS agent name and code can be set via the administration interface.
[[File:Premisagent-10.png|center|900px|thumbs]]

== Rest API ==

In addition to automation using the processingMCP.xml file, Archivematica includes a REST API for automating transfer approval. Using this API, you can create a custom script that copies a transfer to the appropriate directory then uses the <code>curl</code> command, or some other means, to let Archivematica know that the copy is complete.

=== API keys ===

Use of the REST API requires the use of API keys. An API key is associated with a specific user. To generate an API key for a user:

# Browse to <code>/administration/accounts/list/</code>
# Click the "Edit" button for the user you'd like to generate an API key for
# Click the "Regenerate API key" checkbox
# Click "Save"

After generating an API key, you can click the "Edit" button for the user and you should see the API key.

=== IP whitelist ===

In addition to creating API keys, you'll need to add the IP of any computer making REST requests to the REST API whitelist. The IP whitelist can be edited in the administration interface at <code>/administration/api/</code>.

=== Approving a transfer ===

The REST API can be used to approve a transfer. The transfer must first be copied into the appropriate watch directory. To determine the location of the appropriate watch directory, first figure out where the shared directory is from the <code>watchDirectoryPath</code> value of <code>/etc/archivematica/MCPServer/serverConfig.conf</code>. Within that directory is a subdirectory <code>activeTransfers</code>. In this subdirectory are watch directories for the various transfer types.

When using the REST API to approve a transfer, if a transfer type isn't specified, the transfer will be deemed a standard transfer.

'''HTTP Method:''' POST

'''URL:''' <code>/api/transfer/approve</code>

'''Parameters:'''

<code>directory</code>: directory name of the transfer

<code>type</code> (optional): transfer type [standard|dspace|unzipped bag|zipped bag]

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl --data "username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6&type=standard&directory=MyTransfer" http://127.0.0.1/api/transfer/approve

Example result:

{"message": "Approval successful."}

=== Listing unapproved transfers ===

The REST API can be used to get a list of unapproved transfers. Each transfer's directory name and type is returned.

'''Method:''' <code>GET</code>

'''URL:''' <code>/api/transfer/unapproved</code>

'''Parameters:'''

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl "http://127.0.0.1/api/transfer/unapproved?username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6"

Example result:

{
"message": "Fetched unapproved transfers successfully.",
"results": [{
"directory": "MyTransfer",
"type": "standard"
}
]
}
== Users ==

The dashboard provides a simple cookie-based user authentication system using the [https://docs.djangoproject.com/en/1.4/topics/auth/ Django authentication framework]. Access to the dashboard is limited only to logged-in users and a login page will be shown when the user is not recognized. If the application can't find any user in the database, the user creation page will be shown instead, allowing the creation of an administrator account.

Users can be also created, modified and deleted from the Administration tab. Only users who are administrators can create and edit user accounts.

You can add a new user to the system by clicking the "Add new" button on the user administration page. By adding a user you provide a way to access Archivematica using a username/password combination. Should you need to change a user's username or password, you can do so by clicking the "Edit" button, corresponding to the user, on the administration page. Should you need to revoke a user's access, you can click the corresponding "Delete" button.

=== CLI creation of administrative users ===

If you need an additional administrator user one can be created via the command-line, issue the following commands:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py createsuperuser

=== CLI password resetting ===

If you've forgotten the password for your administrator user, or any other user, you can change it via the command-line:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py changepassword <username>

===Security===

Archivematica uses [http://en.wikipedia.org/wiki/PBKDF2 PBKDF2] as the default algorithm to store passwords. This should be sufficient for most users: it's quite secure, requiring massive amounts of computing time to break. However, other algorithms could be used as the following document explains: [https://docs.djangoproject.com/en/1.4/topics/auth/#how-django-stores-passwords How Django stores passwords].

Our plan is to extend this functionality in the future adding groups and granular permissions support.

= Dashboard preservation planning tab =

== Format Policy Registry (FPR) ==

=== Introduction to the Format Policy Registry ===

The Format Policy Registry (FPR) is a database which allows Archivematica users to define format policies for handling file formats. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format). Format policies will change as community standards, practices and tools evolve. Format policies are maintained by Artefactual, who provides a freely-available FPR server hosted at [http://fpr.archivematica.org fpr.archivematica.org]. This server stores structured information about normalization format policies for preservation and access. You can update your local FPR from the FPR server using the UPDATE button in the preservation planning tab of the dashboard. In addition, you can maintain local rules to add new formats or customize the behaviour of Archivematica. The Archivematica dashboard communicates with the FPR server via a REST API.

==== First-time configuration ====

The first time a new Archivematica installation is set up, it will attempt to connect to the FPR server as part of the initial configuration process. As a part of the setup, it will register the Archivematica install with the server and pull down the current set of format policies. In order to register the server, Archivematica will send the following information to the FPR Server, over an encrypted connection:

#Agent Identifier (supplied by the user during registration while installing Archivematica)
#Agent Name (supplied by the user during registration while installing Archivematica)
#IP address of host
#UUID of Archivematica instance
#current time

*The only information that will be passed back and forth between Archivematica and the FPR Server would be these format policies - what tool to run when normalizing for a given purpose (access, preservation) when a specific File Identification Tool identifies a specific File Format. No information about the content that has been run through Archivematica, or any details about the Archivematica installation or configuration would be sent to the FPR Server.

* Because Archivematica is an open source project, it is possible for any organization to conduct a software audit/code review before running Archivematica in a production environment in order to independently verify the information being shared with the FPR Server. An organization could choose to run a private FPR Server, accessible only within their own network(s), to provide at least a limited version of the benefits of sharing format policies, while guaranteeing a completely self-contained preservation system. This is something that Artefactual is not intending to develop, but anyone is free to extend the software as they see fit, or to hire us or other developers to do so.

=== Updating format policies ===

FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica. Clicking the "update" button will initiate an FPR pull which will bring in any new or altered rules since the last time an update was performed.

=== Types of FPR entries ===

==== Format ====

In the FPR, a "format" is a record representing one or more related ''format versions'', which are records representing a specific file format. For example, the format record for "Graphics Interchange Format" (GIF) is comprised of format versions for both GIF 1987a and 1989a.

When creating a new format version, the following fields are available:

* Description (required) - Text describing the format. This will be saved in METS files.
* Version (required) - The version number for this specific format version (not the FPR record). For example, for Adobe Illustrator 14 .ai files, you might choose "14".
* Pronom id - The specific format version's unique identifier in [http://www.nationalarchives.gov.uk/PRONOM/Default.aspx PRONOM], the UK National Archives's format registry. This is optional, but highly recommended.
* Access format and Preservation format - Indicates whether this format is suitable as an access format for end users, and for preservation.

==== Format Group ====

A format group is a convenient grouping of related file formats which share common properties. For instance, the FPR includes an "Image (raster)" group which contains format records for GIF, JPEG, and PNG. Each format can belong to one (and only one) format group.

==== Characterization ====
Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties and to extract technical metadata contained within the object.

Prior to Archivematica 1.2, the characterization micro-service always ran the [http://projects.iq.harvard.edu/fits FITS] tool. As of Archivematica 1.2, characterization is fully customizable by the Archivematica administrator.

===== Characterization tools =====

Archivematica has four default characterization tools upon installation. Which tool will run on a given file depends on the type of file, as determined by the selected identification tool.

====== Default ======

The default characterization tool is FITS; it will be used if no specific characterization rule exists for the file being scanned.

It is possible to create new default characterization commands, which can either replace FITS or run alongside it on every file.

====== Multimedia ======

Archivematica 1.2 introduced three new multimedia characterization tools. These tools were selected for their rich metadata extraction, as well as for their speed. Depending on the type of the file being scanned, one or more of these tools may be called instead of FITS.

* [http://ffmpeg.org/ FFprobe], a characterization tool built on top of the same core as FFmpeg, the normalization software used by Archivematica
* [http://mediaarea.net/en/MediaInfo MediaInfo], a characterization tool oriented towards audio and video data
* [http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool], a characterization tool oriented towards still image data and extraction of embedded metadata

===== Writing a new characterization command =====

Information on writing new characterization commands can be found in the [[Administrator_manual_1.2#Format_Policy_Rules|FPR administrator's manual]].

Writing a characterization command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]]. Like an identification command, a characterization command is designed to run a tool and produce output to standard out. Output from characterization commands is expected to be valid XML, and will be included in the AIP's METS document within the file's <objectCharacteristicsExtension> element.

When creating a characterization command, the "output format" should be set to "XML 1.0".

==== Extraction ====

Archivematica supports extracting contents from files during the transfer phase.

Many transfers contain files which are packages of other files; examples of these include compressed archives, such as ZIP files, or disk images. Archivematica provides a transcription microservice which comes with several predefined rules to extract packages, and which is fully customizeable by Archivematica administrators. Administrators can write new commands, and assign existing formats to run for other file formats.

===== Writing a new extraction command =====

Writing an extraction command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

An extraction command is passed two arguments: the ''file to extract'', and the ''path to which the package should be extracted''. Similar to [[Administrator_manual_1.2#Normalization Command|normalization commands]], these arguments will be interpolated directly into "bashScript" and "command" scripts, and passed as positional arguments to "pythonScript" and "asIs" scripts.

{|
|Name (bashScript and command)||Commandline position (pythonScript and asIs)||Description||Sample value
|-
|%outputDirectory%||First||The full path to the directory in which the package's contents should be extracted||/path/to/filename-uuid/
|-
|%inputFile%||Second||The full path to the package file||/path/to/filename
|}

Here's a simple example of how to call an existing tool (7-zip) without any extra logic:

<pre>7z x -bd -o"%outputDirectory%" "%inputFile%"</pre>

This Python script example is more complex, and attempts to determine whether any files were extracted in order to determine whether to exit 0 or 1 (and report success or failure):

<pre>
from __future__ import print_function
import re
import subprocess
import sys

def extract(package, outdir):
# -a extracts only allocated files; we're not capturing unallocated files
try:
process = subprocess.Popen(['tsk_recover', package, '-a', outdir],
stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
stdout, stderr = process.communicate()

match = re.match(r'Files Recovered: (\d+)', stdout.splitlines()[0])
if match:
if match.groups()[0] == '0':
raise Exception('tsk_recover failed to extract any files with the message: {}'.format(stdout))
else:
print(stdout)
except Exception as e:
return e

return 0

def main(package, outdir):
return extract(package, outdir)

if __name__ == '__main__':
package = sys.argv[1]
outdir = sys.argv[2]
sys.exit(main(package, outdir))
</pre>

==== Transcription ====

Archivematica 1.2 introduces a new transcription microservice. This microservice provides tools to transcribe the contents of media objects. In Archivematica 1.2 it is used to perform OCR on images of textual material, but it can also be used to create commands which perform other kinds of transcription.

===== Writing transcription commands =====

Writing a transcription command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

Transcription commands are expected to write their data to disk inside the SIP. For commands which perform OCR, metadata can be placed inside the "metadata/OCRfiles" directory inside the SIP; other kinds of transcription should produce files within "metadata".

For example, the following bash script is used by Archivematica to transcribe images using the [https://code.google.com/p/tesseract-ocr/ Tesseract] software:

<pre>
ocrfiles="%SIPObjectsDirectory%metadata/OCRfiles"
test -d "$ocrfiles" || mkdir -p "$ocrfiles"

tesseract %fileFullName% "$ocrfiles/%fileName%"
</pre>

==== Identification ====

===== Identification Tools =====

The identification tool properties in Archivematica control the ways in which Archivematica identifies files and associates them with the FPR's version records. The current version of the FPR server contains two tools: a script based on the [http://www.openplanetsfoundation.org/ Open Planets Foundation's] [https://github.com/openplanets/fido/ FIDO] tool, which identifies based on the IDs in PRONOM, and a simple script which identifies files by their file extension. You can use the identification tools portion of FPR to customize the behaviour of the existing tools, or to write your own.

===== Identification Commands =====

Identification commands contain the actual code that a tool will run when identifying a file. This command will be run on every file in a transfer.

When adding a new command, the following fields are available:

* Identifier (mandatory) - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Script type (mandatory) - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.

When coding a command, you should expect your script to take the path to the file to be identifed as the first commandline argument. When returning an identification, the tool should print a single line containing ''only'' the identifier, and should exit 0. Any informative, diagnostic, and error message can be printed to stderr, where it will be visible to Archivematica users monitoring tool results. On failure, the tool should exit non-zero.

===== Identification Rules =====

These identification rules allow you to define the relationship between the output created by an identification tool, and one of the formats which exists in the FPR. This must be done for the format to be tracked internally by Archivematica, and for it to be used by normalization later on. For instance, if you created a FIDO configuration which returns MIME types, you could create a rule which associates the output "image/jpeg" with the "Generic JPEG" format in the FPR.

Identification rules are necessary only when a tool is configured to return file extensions or MIME types. Because PUIDs are universal, Archivematica will always look these up for you without requiring any rules to be created, regardless of what tool is being used.

When creating an identification rule, the following mandatory fields must be filled out:

* Format - Allows you to select one of the formats which already exists in the FPR.
* Command - Indicates the command that produces this specific identification.
* Output - The text which is written to standard output by the specified command, such as "image/jpeg"

==== Format Policy Tools ====

Format policy tools control how Archivematica processes files during ingest. The most common kind of these tools are normalization tools, which produce preservation and access copies from ingested files. Archivematica comes configured with a number of commands and scripts to normalize several file formats, and you can use this section of the FPR to customize them or to create your own. These are organized similarly to the [[#Identification Tools]] documented above.

Archivematica uses the following kinds of format policy rules:

* Characterization
* Extraction
* Normalization - Access, preservation and thumbnails
* Event detail - Extracts information about a given tool in order to be inserted into a generated METS file.
* Transcription
* Verification - Validates a file produced by another command. For instance, a tool could use Exiftool or JHOVE to determine whether a thumbnail produced by a normalization command was valid and well-formed.

=== Format Policy Commands ===

Like the [[#Identification Commands]] above, format policy commands are scripts or command line statements which control how a normalization tool runs. This command will be run once on every file being normalized using this tool in a transfer.

When creating a normalization command, the following mandatory fields must be filled out:

* Tool - One or more tools to be associated with this command.
* Description - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Command - The script's source, or the commandline statement to execute.
* Script type - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.
* Output format (optional) - The format the command outputs. For example, a command to normalize audio to MP3 using ffmpeg would select the appropriate MP3 format from the dropdown.
* Output location (optional) - The path the normalized file will be written to. See the [[#Writing a command]] section of the documentation for more information.
* Command usage - The purpose of the command; this will be used by Archivematica to decide whether a command is appropriate to run in different circumstances. Values are "Normalization", "Event detail", and "Verification". See the [[#Writing a command]] section of the documentation for more information.
* Event detail command - A command to provide information about the software running this command. This will be written to the METS file as the "event detail" property. For example, the normalization commands which use ffmpeg use an event detail command to extract ffmpeg's version number.

=== Format Policy Rules ===

Format policy rules allow commands to be associated with specific file types. For instance, this allows you to configure the command that uses ImageMagick to create thumbnails to be run on .gif and .jpeg files, while selecting a different command to be run on .png files.

When creating a format policy rule, the following mandatory fields must be filled out:

* Purpose - Allows Archivematica to distinguish rules that should be used to normalize for preservation, normalize for access, to extract information, etc.
* Format - The file format the associated command should be selected for.
* Command - The specific command to call when this rule is used.

=== Writing a command ===

==== Identification command ====

Identification commands are very simple to write, though they require some familiarity with Unix scripting.

An identification command run once for every file in a transfer. It will be passed a single argument (the path to the file to identify), and no switches.

On success, a command should:

* Print the identifier to stdout
* Exit 0

On failure, a command should:

* Print nothing to stdout
* Exit non-zero (Archivematica does not assign special significance to non-zero exit codes)

A command can print anything to stderr on success or error, but this is purely informational - Archivematica won't do anything special with it. Anything printed to stderr by the command will be shown to the user in the Archivematica dashboard's detailed tool output page. You should print any useful error output to stderr if identification fails, but you can also print any useful extra information to stderr if identification succeeds.

Here's a very simple Python script that identifies files by their file extension:

<pre>import os.path, sys
(_, extension) = os.path.splitext(sys.argv[1])
if len(extension) == 0:
exit(1)
else:
print extension.lower()</pre>

Here's a more complex Python example, which uses [http://www.sno.phy.queensu.ca/~phil/exiftool/ Exiftool]'s XML output to return the MIME type of a file:

<pre>#!/usr/bin/env python

from lxml import etree
import subprocess
import sys

try:
xml = subprocess.check_output(['exiftool', '-X', sys.argv[1]])
doc = etree.fromstring(xml)
print doc.find('.//{http://ns.exiftool.ca/File/1.0/}MIMEType').text
except Exception as e:
print >> sys.stderr, e
exit(1)</pre>

Once you've written an identification command, you can register it in the FPR using the following steps:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for the command itself:

# Click "Create New Command".
# Select your tool from the "Tool" dropdown box.
# Fill out the Identifier with text to describe to a user what this tool does. For instance, we might choose "Identify MIME-type using Exiftool".
# Select the appropriate script type - in this case, "Python Script".
# Enter the source code for your script in the "Command" box.
# Click "Create Command".

Finally, you must create rules which associate the possible outputs of your tool with the FPR's format records. This needs to be done once for every supported format; we'll show it with MP3, as an example.

# Navigate to the "Identification Rules" page, and click "Create New Rule".
# Choose the appropriate foramt from the Format dropdown - in our case, "Audio: MPEG Audio: MPEG 1/2 Audio Layer 3".
# Choose your command from the Command dropdown.
# Enter the text your command will output when it identifies this format. For example, when our Exiftool command identifies an MP3 file, it will output "audio/mpeg".
# Click "Create".

Once this is complete, any new transfers you create will be able to use your new tool in the identification step.

==== Normalization Command ====

Normalization commands are a bit more complex to write because they take a few extra parameters.

The goal of a normalization command is to take an input file and transform it into a new format. For instance, Archivematica provides commands to transform video content into FFV1 for preservation, and into H.264 for access.

Archivematica provides several parameters specifying input and output filenames and other useful information. Several of the most common are shown in the examples below; a more complete list is in a later section of the documentation: [[#Normalization command variables and arguments]]

When writing a bash script or a command line, you can reference the variables directly in your code, like this:

<pre>inkscape -z "%fileFullName%" --export-pdf="%outputDirectory%%prefix%%fileName%%postfix%.pdf"</pre>

When writing a script in Python or other languages, the values will be passed to your script as commandline options, which you will need to parse. The following script provides an example using the argparse module that comes with Python:

<pre>import argparse
import subprocess

parser = argparse.ArgumentParser()

parser.add_argument('--file-full-name', dest='filename')
parser.add_argument('--output-file-name', dest='output')
parsed, _ = parser.parse_known_args()
args = [
'ffmpeg', '-vsync', 'passthrough',
'-i', parsed.filename,
'-map', '0:v', '-map', '0:a',
'-vcodec', 'ffv1', '-g', '1',
'-acodec', 'pcm_s16le',
parsed.output+'.mkv'
]

subprocess.call(args)</pre>

Once you've created a command, the process of registering it is similar to creating a new identification tool. The folling examples will use the Python normalization script above.

First, create a new tool record:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for your new command:

# Click "Create New Tool Command".
# Fill out the Description with text to describe to a user what this tool does. For instance, we might choose "Normalize to mkv using ffmpeg".
# Enter the source for your command in the Command textbox.
# Select the appropriate script type - in this case, "Python Script".
# Select the appropriate output format from the dropdown. This indicates to Archivematica what kind of file this command will produce. In this case, choose "Video: Matroska: Generic MKV".
# Enter the location the video will be saved to, using the script variables. You can usually use the "%outputFileName%" variable, and add the file extension - in this case "%outputFileName%.mkv"
# Select a verification command. Archivematica will try to use this tool to ensure that the file your command created works. Archivematica ships with two simple tools, which test whether the file exists and whether it's larger than 0 bytes, but you can create new commands that perform more complicated verifications.
# Finally, choose a command to produce the "Event detail" text that will be written in the section of the METS file covering the normalization event. Archivematica already includes a suitable command for ffmpeg, but you can also create a custom command.
# Click "Create command".

Finally, you must create rules which will associate your command with the formats it should run on.

==== Normalization command variables and arguments ====

The following variables and arguments control the behaviour of format policy command scripts.

{|
|Name (bashScript and command)||Commandline option (pythonScript and asIs)||Description||Sample value
|-
|%fileName%||--input-file=||The filename of the file to process. This variable holds the file's basename, not the whole path.||video.mov
|-
|%fileDirectory%||--file-directory=||The directory containing the input file.||/path/to
|-
|%inputFile%||--file-name=||The fully-qualified path to the file to process.||/path/to/video.mov
|-
|%fileExtension%||--file-extension=||The file extension of the input file.||mov
|-
|%fileExtensionWithDot%||--file-extension-with-dot=||As above, without stripping the period.||.mov
|-
|%outputDirectory%||--output-directory=||The directory to which the output file should be saved.||/path/to/access/copies
|-
|%outputFileUUID%||--output-file-uuid=||The unique identifier assigned by Archivematica to the output file.||1abedf3e-3a4b-46d7-97da-bd9ae13859f5
|-
|%outputDirectory%||--output-directory=||The fully-qualified path to the directory where the new file should be written.||/var/archivematica/sharedDirectory/www/AIPsStore/uuid
|-
|%outputFileName%||--output-file-name=||The fully-qualified path to the output file, minus the file extension.||/path/to/access/copies/video-uuid
|}

= Customization and automation =
* Workflow processing decisions can be made in the processingMCP.xml file. [https://www.archivematica.org/wiki/Administrator_manual_0.10#Processing_configuration See here.]
* Workflows are currently created at the development level.
*: Some resources avialable
*:* [[MCP_Basic_Configuration]]
*:* [[MCP]]
*:* [[Creating_Custom_Workflows]]
*:* [[Development]]
* Normalization commands can be viewed in the preservation planning tab.
* Normalization paths and commands are currently editable under the preservation planning tab in the dashboard.

= Elasticsearch =

Archivematica has the capability of indexing data about files contained in AIPs and this data can be [[Elasticsearch Development|accessed programatically]] for various applications.

If, for whatever reason, you need to delete an ElasticSearch index please see [[ElasticSearch Administration]].

If, for whatever reason, you need to delete an Elasticsearch index programmatically, this can be done with pyes using the following code.

<pre>
import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
conn = ES('127.0.0.1:9200')

try:
conn.delete_index('aips')
except:
print "Error deleting index or index already deleted."
</pre>

=== Rebuilding the AIP index ===

To rebuild the ElasticSearch AIP index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-aip-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-aip-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-aip-index-from-files <location of your AIP store>

=== Rebuilding the transfer index ===

Similarly, to rebuild the ElasticSearch transfer data index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-transfer-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-transfer-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-transfer-index-from-files <location of your AIP store>

= Data backup =

In Archivematica there are three types of data you'll likely want to back up:
* Filesystem (particularly your storage directories)
* MySQL
* ElasticSearch

MySQL is used to store short-term processing data. You can back up the MySQL database by using the following command:

<pre>mysqldump -u <your username> -p<your password> -c MCP > <filename of backup></pre>

ElasticSearch is used to store long-term data. Instructions and scripts for backing up and restoring ElasticSearch are available [http://tech.superhappykittymeow.com/?p=296 here].

= Security =

Once you've set up Archivematica it's a good practice, for the sake of security, to change the default passwords.

== MySQL ==

You should create a new MySQL user or change the password of the default "archivematica" MySQL user. The change the password of the default user, enter the following into the command-line:

$ mysql -u root -p<your MyQL root password> -D mysql \
-e "SET PASSWORD FOR 'archivematica'@'localhost' = PASSWORD('<new password>'); \
FLUSH PRIVILEGES;"

Once you've done this you can change Archivematica's MySQL database access credentials by editing these two files:
* <code>/etc/archivematica/archivematicaCommon/dbsettings</code> (change the <code>user</code> and <code>password</code> settings)
* <code>/usr/share/archivematica/dashboard/settings/common.py</code> (change the <code>USER</code> and <code>PASSWORD</code> settings in the <code>DATABASES</code> section)

Archivematica does not presently support secured MySQL communication so MySQL should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1645 1645].

== AtoM ==

In addition to changing the MySQL credentials, if you've also installed AtoM you'll want to set the password for it as well. Note that after changing your AtoM credentials you should update the credentials on the AtoM DIP upload administration page as well.

== Gearman ==

Archivematica relies on the German server for queuing work that needs to be done. Gearman currently doesn't support secured connections so Gearman should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1345 1345].

= Questions =

If you run into any difficulties while administrating Archivematica, please check out our FAQ and, if that doesn't help you, contain us using the Archivematica discussion group.

== Frequently asked questions ==
* [[AM_FAQ|Solutions to common questions]]

== Discussion group ==
* [http://groups.google.com/group/archivematica?hl=en Discussion group] for questions not covered by the FAQ

Administrator manual 1.2

2014-08-07T22:54:46Z

Mdemeo: /* Users */ Clarify difference between user types

[[Main Page]] > [[Documentation]] > Administrator manual 1.2

This manual covers administrator-specific instructions for Archivematica. It will also provide help for using forms in the Administration tab of the Archivematica dashboard and the administrator capabilities in the Format Policy Registry (FPR), which you will find in the Preservation planning tab of the dashboard.

For end-user instructions, please see the [[User_manual_1.2|user manual]].

= Installation =
* [[Installation|Instructions for installing the latest build of Archivematica on your server]]

= Upgrading =

Currently, Archivematica does not support upgrading from one version to the next. A re-install is required. After re-installing, you can restore Archivematica's knowledge of your AIPs, by [[#Rebuilding_the_AIP_index|rebuilding the AIP index]] and, if you have transfers stored in the backlog, [[#Rebuilding_the_transfer_index|rebuilding the transfer index]].

= Storage service =
The Archivematica Storage Service allows the configuration of storage spaces associated with multiple Archivematica pipelines. It allows a storage administrator to configure what storage is available to each Archivematica installation, both locally and remote.

[[File:SS1-0.png|700px|center|thumb|Home page of Storage Service]]

TODO Discuss how spaces and locations fit into each other, pipelines fit to locations, spaces=config, locations=purpose, packages in locations

== Archivematica Configuration ==

When installing Archivematica, options to configure it with the Storage Service will be presented.

[[File:Install3.png|600px|center]]

If you have installed the Storage Service at a different URL, you may change that here.

The top button 'Use default transfer source & AIP storage locations' will attempt to automatically configure default Locations for Archivematica, register a new Pipeline, and generate an error if the Storage Service is not available. Use this option if you want the Storage Service to automatically set up the configured default values.

The bottom button 'Register this pipeline & set up transfer source and AIP storage locations' will only attempt to register a new Pipeline with the Storage Service, and will not error if not Storage Service can be found. It will also open a link to the provided Storage Service URL, so that Locations can be configured manually. Use this option if the default values not desired, or the Storage Service is not running yet. Locations will have to be configured manually before any Transfers can be processed, or AIPs stored.

If the Storage Service is running, the URL to it should be entered, and Archivematica will attempt to register its dashboard UUID as a new Pipeline. Otherwise, the dashboard UUID is displayed, and a Pipeline for this Archivematica instance can be manually created and configured. The dashboard UUID is also available in Archivematica under Administration -> General.

=== Change the port in the web server configuration ===

The storage services uses nginx by default, so you can edit /etc/nginx/sites-enabled/storage and change the line that says

listen 8000;

change 8000 to whatever port you prefer to use.

Keep in mind that in a default installation of Archivematica 1.0, the dashboard is running in Apache on port 80. So it is not possible to make nginx run on port 80 on the same machine. If you install the storage service on its own server, you can set it to use port 80.

Make sure to adjust the dashboard UUID in the Archivematica dashboard under Administration -> General.

== Spaces ==
[[File:Spaces.png|600px|center]]
A storage Space contains all the information necessary to connect to the physical storage. It is where protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored. All locations must be contained in a space.

A space is usually the immediate parent of the Location folders. For example, if you had transfer source locations at <tt>/home/artefactual/archivematica-sampledata-2013-10-10-09-17-20</tt> and <tt>/home/artefactual/maildir_transfers</tt>, the Space's path would be <tt>/home/artefactual/</tt>

Currently supported protocols are local filesystem, NFS, and pipeline local filesystem.

=== Local Filesystem ===

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. Typically this is the hard drive, SSD or raid array attached to the machine, but it could also encompass remote storage that has already been mounted. For remote storage that has been locally mounted, we recommend using a more specific Space if one is available.

==== Fields ====
* ''Path'': Absolute path to the Space on the local filesystem
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.

=== NFS ===

NFS spaces are for NFS exports mounted on the Storage Service server, and the Archivematica pipeline.

==== Fields ====
* ''Path'': Absolute path the space is mounted at on the filesystem local to the storage service
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the remote computer exporting the NFS mount.
* ''Remote path'': Export path on the NFS server
* ''Version'': nfs or nfs4 - as would be passed to the <tt>mount</tt> command.
* ''Manually Mounted'': Check this if it has been mounted already. Otherwise, the Storage Service will try to mount it. ''Note: this feature is not yet available.''

=== Pipeline Local Filesystem ===

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host.

For example, the storage service is hosted on <tt>storage_service_host</tt> and Archivematica is running on <tt>archivematica1</tt> . The transfer sources for Archivematica are stored locally on <tt>archivematica1</tt>, but the storage service needs access to them. The Space for that transfer source would be a Pipeline Local Filesystem.

'''Note: Passwordless SSH must be set up between the Storage Service host and the computer Archivematica is running on.'''

==== Fields ====
* ''Path'': Absolute path to the space on the remote machine.
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the computer running Archivematica. Should be SSH accessible from the Storage Service computer.
* ''Remote user'': Username on the remote host

== Locations ==
[[File:Locations.png|600px|center]]
A storage Location is contained in a Space, and knows its purpose in the Archivematica system. A Location is also where Packages are stored. Each Location is associated with a pipeline and can only be accessed by that pipeline.

Currently, a Location can have one of three purposes: Transfer Source, Currently Processing, or AIP Storage. Transfer source locations display in Archivematica's Transfer tab, and any folder in a transfer source can be selected to become a Transfer. AIP storage locations are where the completed AIPs are put for long-term storage. During processing, Archivematica uses the currently processing location associated with that pipeline. Only one currently processing location should be associated with a given pipeline. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.

==== Fields ====
* ''Purpose'': What use the Location is for
* ''Pipeline'': Which pipelines this location is available to.
* ''Relative Path'': Path to this Location, relative to the space that contains it.
* ''Description'': Description of the Location to be displayed to the user.
* ''Quota'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Enabled'': If checked, this location is accessible to pipelines associated with it. If unchecked, it will not show up to any pipeline.

== Pipeline ==
[[File:Pipelines.png|600px|center]]
A pipeline is an Archivematica instance registered with the Storage Service, including the server and all associated clients. Each pipeline is uniquely identified by a UUID, which can be found in the dashboard under Administration -> General Configuration. When installing Archivematica, it will attempt to register its UUID with the Storage Service, with a description of "Archivematica on <hostname>".

==== Fields ====
* ''UUID'': Unique identifier of the Archivematica pipeline
* ''Description'': Description of the pipeline displayed to the user. e.g. Sankofa demo site
* ''Enabled'': If checked, this pipeline can access locations associate with it. If unchecked, all locations will be disabled, even if associated.
* ''Default Locations'': If checked, the default locations configured in Administration -> Configuration will be created or associated with the new pipeline.

== Packages ==
[[File:Packages.png|600px|center]]
A Package is a file that Archivematica has stored in the Storage Service, commonly an Archival Information Package (AIP). They cannot be created or deleted through the Storage Service interface, though a deletion request can be submitted through Archivematica that must be approved or rejected by the storage service administrator. To learn more about deleting an AIP, see [[UM_archival_storage_1.2#Deleting_an_AIP|Deleting an AIP]].

== Administration ==
[[File:StorageserviceAdmin1.png|600px|center]]
[[File:StorageserviceAdmin2.png|600px|center]]
The Administration section manages the users and settings for the Storage Service.

=== Users ===

Only registered users can long into the storage service, and the Users page is where users can be created or modified.

The storage service has two types of users: administrative users, and regular users. In the 0.4.0 release of the storage service, the only distinction between the two types is for email notifications; administrators will be notified by email when special events occur, while regular users will not.

=== Settings ===

Settings control the behavior of the Storage Service. Default Locations are the created or associated with pipelines when they are created.

'''Pipelines are disabled upon creation?''' sets whether a newly created Pipeline can access its Locations. If a Pipeline is disabled, it cannot access any of its locations. By disabling newly created Pipelines, it provides some security against unwanted perusal of the files in Locations, or use by unauthorized Archivematica instances. This can be configured individually when creating a Pipeline manually through the Storage Service website.

'''Default Locations''' set what existing locations should be associated with a newly created Pipeline, or what new Locations should be created for each new Pipeline. No matter what is configured here, a Currently Processing location is created for all Pipelines, since one is required. Multiple Transfer Source or AIP Storage Locations can be configured by holding down Ctrl when selecting them. New Locations in an existing Space can be created for Pipelines that use default locations by entering the relevant information.

== How to Configure a Location ==

For Spaces of the type "Local Filesystem," Locations are basically directories (or more accurately, paths to directories). You can create Locations for Transfer Source, Currently Processing, or AIP Storage directories.

To create and configure a new Location:

# In the Storage Service, click on the "Spaces" tab.
# Under the Space that you want to add the Location to, click on the "Create Location here" link.
# Choose a purpose (e.g. AIP Storage) and pipeline, and enter a "Relative Path" (e.g. var/mylocation) and human-readable description. The Relative Path is relative to the Path defined in the Space you are adding the Location to, e.g. for the default Space, the Path is '/' so your Location path would be relative to that (in the example here, the complete path would end up being '/var/mylocation'). Note: if the path you are defining in your Location doesn't exist, you must create it manually and make sure it is writable by the archivematica user.
# Save the Location settings.
# The new location will now be available as an option under the appropriate options in the Dashboard, for example as a Transfer location (which must be enabled under the Dashboard "Administration" tab) or as a destination for AIP storage.

== Store DIP ==

= Dashboard administration tab =

The Archivematica administration pages, under the Administration tab of the dashboard, allows you to configure application components and manage users.

== Processing configuration ==

When processing a SIP or transfer, you may want to automate some of the workflow choices. Choices can be preconfigured by putting a 'processingMCP.xml' file into the root directory of a SIP/transfer.

If a SIP or transfer is submitted with a 'processingMCP.xml' file, processing decisions will be made with the included file.

The XML file format is:
<pre><processingMCP>
<preconfiguredChoices>

<preconfiguredChoice>
<appliesTo>755b4177-c587-41a7-8c52-015277568302</appliesTo>
<goToChain>d4404ab1-dc7f-4e9e-b1f8-aa861e766b8e</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>eeb23509-57e2-4529-8857-9d62525db048</appliesTo>
<goToChain>5727faac-88af-40e8-8c10-268644b0142d</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>19adb668-b19a-4fcb-8938-f49d7485eaf3</appliesTo>
<goToChain>333643b7-122a-4019-8bef-996443f3ecc5</goToChain>
<delay unitCtime="yes">2419200.0</delay>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>dec97e3c-5598-4b99-b26e-f87a435a6b7f</appliesTo>
<goToChain>01d80b27-4ad1-4bd1-8f8d-f819f18bf685</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>f19926dd-8fb5-4c79-8ade-c83f61f55b40</appliesTo>
<goToChain>85b1e45d-8f98-4cae-8336-72f40e12cbef</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>7a024896-c4f7-4808-a240-44c87c762bc5</appliesTo>
<goToChain>3c1faec7-7e1e-4cdd-b3bd-e2f05f4baa9b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01d64f58-8295-4b7b-9cab-8f1b153a504f</appliesTo>
<goToChain>9475447c-9889-430c-9477-6287a9574c5b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01c651cb-c174-4ba4-b985-1d87a44d6754</appliesTo>
<goToChain>414da421-b83f-4648-895f-a34840e3c3f5</goToChain>
</preconfiguredChoice>
</preconfiguredChoices>
</processingMCP>
</pre>

Where appliesTo is the UUID associated with the micro-service job presented in the dashboard, and goToChain is the UUID of the desired selection. The default processingMCP.xml file is located at '/var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml'.

The processing configuration administration page of the dashboard provides you with an easy form to configure the default 'processingMCP.xml' that's added to a SIP or transfer if it doesn't already contain one. When you change the options using the web interface the necessary XML will be written behind the scenes.

[[File:ProcessingConfig1-1.png|1000px|center|thumb|Processing configuration form in Administration tab of the dashboard]]

*For the approval (yes/no) steps, the user ticks the box on the left-hand side to make a choice. If the box is not ticked, the approval step will appear in the dashboard.
*For the other steps, if no actions are selected the choices appear in the dashboard
*You can select whether or not to send transfers to quarantine (yes/no) and decide how long you'd like them to stay there.
*You can select whether to extract packages as well as whether to keep and/or delete the extracted objects and/or the package itself.
*You can approve normalization, sending the AIP to storage, and uploading the DIP without interrupting the workflow in the dashboard.
*You can pre-select which format identification tool and command to run in both/either transfer and/or ingest to base your normalization upon.
*You can choose to send a transfer to backlog or to create a SIP every time.
*You can select to be reminded to add PREMIS event metadata about manual normalization should you choose to use that capability.
*You can select between 7z using lzma and 7zip using bzip or parallel bzip2 algorithms for AIP compression.
*For select compression level, the options are as follows:
**1 - fastest mode
**3 - fast compression mode
**5 - normal compression mode
**7 - maximum compression
**9 - ultra compression
*You can select one archival storage location where you will consistently send your AIPs.

== General ==

In the general configuration section, you can select interface options and set [[Administrator_manual_1.2#Storage_service|Storage Service]] options for your Archivematica client.

[[File:Generalconfig.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

=== Interface options ===

Here, you can hide parts of the interface that you don't need to use. In particular, you can hide CONTENTdm DIP upload link, AtoM DIP upload link and DSpace transfer type.

=== Storage Service options ===

This is where you'll find the complete URL for the Storage Service. See [[Administrator_manual_1.2#Storage_service|Storage Service]] for more information about this feature.

== Failures ==

Archivematica 1.2 includes dashboard failure reporting.
[[File:FailuresAdmin.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

== Transfer source location ==

Archivematica allows you to start transfers using the operating system's file browser or via a web interface. Source files for transfers, however, can't be uploaded using the web interface: they must exist on volumes accessible to the Archivematica MCP server and configured via the [[Administrator_manual_1.2#Storage_service|Storage Service]].

When starting a transfer you're required to select one or more directories of files to add to the transfer.

You can view your transfer source directories in the Administrative tab of the dashboard under "Transfer source locations".

== AIP storage locations ==

AIP storage directories are directories in which completed AIPs are stored. Storage directories can be specified in a manner similar to transfer source directories using the [[Administrator_manual_1.2#Storage_service|Storage Service]].

You can view your transfer source directories in the Administrative tab of the dashboard under "AIP storage locations"

== AtoM DIP upload ==

Archivematica can upload DIPs directly to an [https://www.ica-atom.org/ AtoM] website so the contents can be accessed online. The AtoM DIP upload configuration page is where you specify the details of the AtoM installation you'd like the DIPs uploaded to (and, if using Rsync to transfer the DIP files, Rsync transfer details).

The parameters that you'll most likely want to set are <code>url</code>, <code>email</code>, and <code>password</code>. These parameters, respectively, specify the destination AtoM website's URL, the email address used to log in to the website, and the password used to log in to the website.

AtoM DIP upload can also use [http://en.wikipedia.org/wiki/Rsync Rsync] as a transfer mechanism. Rsync is an open source utility for efficiently transferring files. The <code>rsync-target</code> parameter is used to specify an Rsync-style target host/directory pairing, "foobar.com:~/dips/" for example. The <code>rsync-command</code> parameter is used to specify rsync connection options, "ssh -p 22222 -l user" for example. If you are using the rsync option, please see AtoM server configuration below.

To set any parameters for AtoM DIP upload change the values, preserving the existing format they're specified in, in the "Command arguments" field then click "Save".

Note that in AtoM, the sword plugin (Admin --> Plugins --> qtSwordPlugin) must be enabled in order for AtoM to receive uploaded DIPs. Enabling Job scheduling (Admin --> Settings --> Job scheduling) is also recommended.

=== AtoM server configuration ===

This server configuration step is necessary to allow Archivematica to log in to the AtoM server without passwords, and only when the user is deploying the rsync option described above in the AtoM DIP upload section.

To enable sending DIPs from Archivematica to the AtoM server:

Generate SSH keys for the Archivematica user. Leave the passphrase field blank.
<pre>
$ sudo -i -u archivematica
$ cd ~
$ ssh-keygen
</pre>

Copy the contents of <code>/var/lib/archivematica/.ssh/id_rsa.pub</code> somewhere handy, you will need it later.

Now, it's time to configure the AtoM server so Archivematica can send the DIPs using SSH/rsync. For that purpose, you will create a user called <code>archivematica</code> and we are going to assign that user a restricted shell with access only to rsync:

<pre>
$ sudo apt-get install rssh
$ sudo useradd -d /home/archivematica -m -s /usr/bin/rssh archivematica
$ sudo passswd -l archivematica
$ sudo vim /etc/rssh.conf // Make sure that allowrsync is uncommented!
</pre>

Add the SSH key that we generated before:

<pre>
$ sudo mkdir /home/archivematica/.ssh
$ chmod 700 /home/archivematica/.ssh/
$ sudo vim /home/archivematica/.ssh/authorized_keys // Paste here the contents of id_dsa.pub
$ chown -R archivematica:archivematica /home/archivematica
</pre>

In Archivematica, make sure that you update the <code>--rsync-target</code> accordingly. 
These are the parameters that we are passing to the upload-qubit microservice. 
Go to the Administration > Upload DIP page in the dashboard.

Generic parameters:

<pre>
--url="http://atom-hostname/index.php" \
--email="demo@example.com" \
--password="demo" \
--uuid="%SIPUUID%" \
--rsync-target="archivematica@atom-hostname:/tmp" \
--debug
</pre>

== CONTENTdm DIP upload ==

Archivematica can also upload DIPs to [http://www.contentdm.org/ CONTENTdm] instances. Multiple CONTENTdm destinations may be configured.

For each possible CONTENTdm DIP upload destination, you'll specify a brief description and configuration parameters appropriate for the destination. Paramters include <code>%ContentdmServer%</code> (full path to the CONTENTdm API, including the leading 'http://' or 'https://', for example http://example.com:81/dmwebservices/index.php), <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> (Linux user and group on the CONTENTdm server, not a CONTENTdm username). Note that only <code>%ContentdmServer%</code> is required is you are going to produce CONTENTdm Project Client packages; <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> are also required if you are going to use the "direct upload" option for uploading your DIPs into CONTENTdm.

When changing parameters for a CONTENTdm DIP upload destination simply change the values, preserving the existing format they're specified in. To add an upload destination fill in the form at the bottom of the page with the appropriate values. When you've completed your changes click the "Save" button.

== PREMIS agent ==

The PREMIS agent name and code can be set via the administration interface.
[[File:Premisagent-10.png|center|900px|thumbs]]

== Rest API ==

In addition to automation using the processingMCP.xml file, Archivematica includes a REST API for automating transfer approval. Using this API, you can create a custom script that copies a transfer to the appropriate directory then uses the <code>curl</code> command, or some other means, to let Archivematica know that the copy is complete.

=== API keys ===

Use of the REST API requires the use of API keys. An API key is associated with a specific user. To generate an API key for a user:

# Browse to <code>/administration/accounts/list/</code>
# Click the "Edit" button for the user you'd like to generate an API key for
# Click the "Regenerate API key" checkbox
# Click "Save"

After generating an API key, you can click the "Edit" button for the user and you should see the API key.

=== IP whitelist ===

In addition to creating API keys, you'll need to add the IP of any computer making REST requests to the REST API whitelist. The IP whitelist can be edited in the administration interface at <code>/administration/api/</code>.

=== Approving a transfer ===

The REST API can be used to approve a transfer. The transfer must first be copied into the appropriate watch directory. To determine the location of the appropriate watch directory, first figure out where the shared directory is from the <code>watchDirectoryPath</code> value of <code>/etc/archivematica/MCPServer/serverConfig.conf</code>. Within that directory is a subdirectory <code>activeTransfers</code>. In this subdirectory are watch directories for the various transfer types.

When using the REST API to approve a transfer, if a transfer type isn't specified, the transfer will be deemed a standard transfer.

'''HTTP Method:''' POST

'''URL:''' <code>/api/transfer/approve</code>

'''Parameters:'''

<code>directory</code>: directory name of the transfer

<code>type</code> (optional): transfer type [standard|dspace|unzipped bag|zipped bag]

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl --data "username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6&type=standard&directory=MyTransfer" http://127.0.0.1/api/transfer/approve

Example result:

{"message": "Approval successful."}

=== Listing unapproved transfers ===

The REST API can be used to get a list of unapproved transfers. Each transfer's directory name and type is returned.

'''Method:''' <code>GET</code>

'''URL:''' <code>/api/transfer/unapproved</code>

'''Parameters:'''

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl "http://127.0.0.1/api/transfer/unapproved?username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6"

Example result:

{
"message": "Fetched unapproved transfers successfully.",
"results": [{
"directory": "MyTransfer",
"type": "standard"
}
]
}
== Users ==

The dashboard provides a simple cookie-based user authentication system using the [https://docs.djangoproject.com/en/1.4/topics/auth/ Django authentication framework]. Access to the dashboard is limited only to logged-in users and a login page will be shown when the user is not recognized. If the application can't find any user in the database, the user creation page will be shown instead, allowing the creation of an administrator account.

Users can be also created, modified and deleted from the Administration tab. Only users who are administrators can create and edit user accounts.

You can add a new user to the system by clicking the "Add new" button on the user administration page. By adding a user you provide a way to access Archivematica using a username/password combination. Should you need to change a user's username or password, you can do so by clicking the "Edit" button, corresponding to the user, on the administration page. Should you need to revoke a user's access, you can click the corresponding "Delete" button.

=== CLI creation of administrative users ===

If you need an additional administrator user one can be created via the command-line, issue the following commands:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py createsuperuser

=== CLI password resetting ===

If you've forgotten the password for your administrator user, or any other user, you can change it via the command-line:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py changepassword <username>

===Security===

Archivematica uses [http://en.wikipedia.org/wiki/PBKDF2 PBKDF2] as the default algorithm to store passwords. This should be sufficient for most users: it's quite secure, requiring massive amounts of computing time to break. However, other algorithms could be used as the following document explains: [https://docs.djangoproject.com/en/1.4/topics/auth/#how-django-stores-passwords How Django stores passwords].

Our plan is to extend this functionality in the future adding groups and granular permissions support.

= Dashboard preservation planning tab =

== Format Policy Registry (FPR) ==

=== Introduction to the Format Policy Registry ===

The Format Policy Registry (FPR) is a database which allows Archivematica users to define format policies for handling file formats. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format). Format policies will change as community standards, practices and tools evolve. Format policies are maintained by Artefactual, who provides a freely-available FPR server hosted at [http://fpr.archivematica.org fpr.archivematica.org]. This server stores structured information about normalization format policies for preservation and access. You can update your local FPR from the FPR server using the UPDATE button in the preservation planning tab of the dashboard. In addition, you can maintain local rules to add new formats or customize the behaviour of Archivematica. The Archivematica dashboard communicates with the FPR server via a REST API.

==== First-time configuration ====

The first time a new Archivematica installation is set up, it will attempt to connect to the FPR server as part of the initial configuration process. As a part of the setup, it will register the Archivematica install with the server and pull down the current set of format policies. In order to register the server, Archivematica will send the following information to the FPR Server, over an encrypted connection:

#Agent Identifier (supplied by the user during registration while installing Archivematica)
#Agent Name (supplied by the user during registration while installing Archivematica)
#IP address of host
#UUID of Archivematica instance
#current time

*The only information that will be passed back and forth between Archivematica and the FPR Server would be these format policies - what tool to run when normalizing for a given purpose (access, preservation) when a specific File Identification Tool identifies a specific File Format. No information about the content that has been run through Archivematica, or any details about the Archivematica installation or configuration would be sent to the FPR Server.

* Because Archivematica is an open source project, it is possible for any organization to conduct a software audit/code review before running Archivematica in a production environment in order to independently verify the information being shared with the FPR Server. An organization could choose to run a private FPR Server, accessible only within their own network(s), to provide at least a limited version of the benefits of sharing format policies, while guaranteeing a completely self-contained preservation system. This is something that Artefactual is not intending to develop, but anyone is free to extend the software as they see fit, or to hire us or other developers to do so.

=== Updating format policies ===

FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica. Clicking the "update" button will initiate an FPR pull which will bring in any new or altered rules since the last time an update was performed.

=== Types of FPR entries ===

==== Format ====

In the FPR, a "format" is a record representing one or more related ''format versions'', which are records representing a specific file format. For example, the format record for "Graphics Interchange Format" (GIF) is comprised of format versions for both GIF 1987a and 1989a.

When creating a new format version, the following fields are available:

* Description (required) - Text describing the format. This will be saved in METS files.
* Version (required) - The version number for this specific format version (not the FPR record). For example, for Adobe Illustrator 14 .ai files, you might choose "14".
* Pronom id - The specific format version's unique identifier in [http://www.nationalarchives.gov.uk/PRONOM/Default.aspx PRONOM], the UK National Archives's format registry. This is optional, but highly recommended.
* Access format and Preservation format - Indicates whether this format is suitable as an access format for end users, and for preservation.

==== Format Group ====

A format group is a convenient grouping of related file formats which share common properties. For instance, the FPR includes an "Image (raster)" group which contains format records for GIF, JPEG, and PNG. Each format can belong to one (and only one) format group.

==== Characterization ====
Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties and to extract technical metadata contained within the object.

Prior to Archivematica 1.2, the characterization micro-service always ran the [http://projects.iq.harvard.edu/fits FITS] tool. As of Archivematica 1.2, characterization is fully customizable by the Archivematica administrator.

===== Characterization tools =====

Archivematica has four default characterization tools upon installation. Which tool will run on a given file depends on the type of file, as determined by the selected identification tool.

====== Default ======

The default characterization tool is FITS; it will be used if no specific characterization rule exists for the file being scanned.

It is possible to create new default characterization commands, which can either replace FITS or run alongside it on every file.

====== Multimedia ======

Archivematica 1.2 introduced three new multimedia characterization tools. These tools were selected for their rich metadata extraction, as well as for their speed. Depending on the type of the file being scanned, one or more of these tools may be called instead of FITS.

* [http://ffmpeg.org/ FFprobe], a characterization tool built on top of the same core as FFmpeg, the normalization software used by Archivematica
* [http://mediaarea.net/en/MediaInfo MediaInfo], a characterization tool oriented towards audio and video data
* [http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool], a characterization tool oriented towards still image data and extraction of embedded metadata

===== Writing a new characterization command =====

Information on writing new characterization commands can be found in the [[Administrator_manual_1.2#Format_Policy_Rules|FPR administrator's manual]].

Writing a characterization command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]]. Like an identification command, a characterization command is designed to run a tool and produce output to standard out. Output from characterization commands is expected to be valid XML, and will be included in the AIP's METS document within the file's <objectCharacteristicsExtension> element.

When creating a characterization command, the "output format" should be set to "XML 1.0".

==== Extraction ====

Archivematica supports extracting contents from files during the transfer phase.

Many transfers contain files which are packages of other files; examples of these include compressed archives, such as ZIP files, or disk images. Archivematica provides a transcription microservice which comes with several predefined rules to extract packages, and which is fully customizeable by Archivematica administrators. Administrators can write new commands, and assign existing formats to run for other file formats.

===== Writing a new extraction command =====

Writing an extraction command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

An extraction command is passed two arguments: the ''file to extract'', and the ''path to which the package should be extracted''. Similar to [[Administrator_manual_1.2#Normalization Command|normalization commands]], these arguments will be interpolated directly into "bashScript" and "command" scripts, and passed as positional arguments to "pythonScript" and "asIs" scripts.

{|
|Name (bashScript and command)||Commandline position (pythonScript and asIs)||Description||Sample value
|-
|%outputDirectory%||First||The full path to the directory in which the package's contents should be extracted||/path/to/filename-uuid/
|-
|%inputFile%||Second||The full path to the package file||/path/to/filename
|}

Here's a simple example of how to call an existing tool (7-zip) without any extra logic:

<pre>7z x -bd -o"%outputDirectory%" "%inputFile%"</pre>

This Python script example is more complex, and attempts to determine whether any files were extracted in order to determine whether to exit 0 or 1 (and report success or failure):

<pre>
from __future__ import print_function
import re
import subprocess
import sys

def extract(package, outdir):
# -a extracts only allocated files; we're not capturing unallocated files
try:
process = subprocess.Popen(['tsk_recover', package, '-a', outdir],
stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
stdout, stderr = process.communicate()

match = re.match(r'Files Recovered: (\d+)', stdout.splitlines()[0])
if match:
if match.groups()[0] == '0':
raise Exception('tsk_recover failed to extract any files with the message: {}'.format(stdout))
else:
print(stdout)
except Exception as e:
return e

return 0

def main(package, outdir):
return extract(package, outdir)

if __name__ == '__main__':
package = sys.argv[1]
outdir = sys.argv[2]
sys.exit(main(package, outdir))
</pre>

==== Transcription ====

Archivematica 1.2 introduces a new transcription microservice. This microservice provides tools to transcribe the contents of media objects. In Archivematica 1.2 it is used to perform OCR on images of textual material, but it can also be used to create commands which perform other kinds of transcription.

===== Writing transcription commands =====

Writing a transcription command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

Transcription commands are expected to write their data to disk inside the SIP. For commands which perform OCR, metadata can be placed inside the "metadata/OCRfiles" directory inside the SIP; other kinds of transcription should produce files within "metadata".

For example, the following bash script is used by Archivematica to transcribe images using the [https://code.google.com/p/tesseract-ocr/ Tesseract] software:

<pre>
ocrfiles="%SIPObjectsDirectory%metadata/OCRfiles"
test -d "$ocrfiles" || mkdir -p "$ocrfiles"

tesseract %fileFullName% "$ocrfiles/%fileName%"
</pre>

==== Identification ====

===== Identification Tools =====

The identification tool properties in Archivematica control the ways in which Archivematica identifies files and associates them with the FPR's version records. The current version of the FPR server contains two tools: a script based on the [http://www.openplanetsfoundation.org/ Open Planets Foundation's] [https://github.com/openplanets/fido/ FIDO] tool, which identifies based on the IDs in PRONOM, and a simple script which identifies files by their file extension. You can use the identification tools portion of FPR to customize the behaviour of the existing tools, or to write your own.

===== Identification Commands =====

Identification commands contain the actual code that a tool will run when identifying a file. This command will be run on every file in a transfer.

When adding a new command, the following fields are available:

* Identifier (mandatory) - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Script type (mandatory) - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.

When coding a command, you should expect your script to take the path to the file to be identifed as the first commandline argument. When returning an identification, the tool should print a single line containing ''only'' the identifier, and should exit 0. Any informative, diagnostic, and error message can be printed to stderr, where it will be visible to Archivematica users monitoring tool results. On failure, the tool should exit non-zero.

===== Identification Rules =====

These identification rules allow you to define the relationship between the output created by an identification tool, and one of the formats which exists in the FPR. This must be done for the format to be tracked internally by Archivematica, and for it to be used by normalization later on. For instance, if you created a FIDO configuration which returns MIME types, you could create a rule which associates the output "image/jpeg" with the "Generic JPEG" format in the FPR.

Identification rules are necessary only when a tool is configured to return file extensions or MIME types. Because PUIDs are universal, Archivematica will always look these up for you without requiring any rules to be created, regardless of what tool is being used.

When creating an identification rule, the following mandatory fields must be filled out:

* Format - Allows you to select one of the formats which already exists in the FPR.
* Command - Indicates the command that produces this specific identification.
* Output - The text which is written to standard output by the specified command, such as "image/jpeg"

==== Format Policy Tools ====

Format policy tools control how Archivematica processes files during ingest. The most common kind of these tools are normalization tools, which produce preservation and access copies from ingested files. Archivematica comes configured with a number of commands and scripts to normalize several file formats, and you can use this section of the FPR to customize them or to create your own. These are organized similarly to the [[#Identification Tools]] documented above.

Archivematica uses the following kinds of format policy rules:

* Characterization
* Extraction
* Normalization - Access, preservation and thumbnails
* Event detail - Extracts information about a given tool in order to be inserted into a generated METS file.
* Transcription
* Verification - Validates a file produced by another command. For instance, a tool could use Exiftool or JHOVE to determine whether a thumbnail produced by a normalization command was valid and well-formed.

=== Format Policy Commands ===

Like the [[#Identification Commands]] above, format policy commands are scripts or command line statements which control how a normalization tool runs. This command will be run once on every file being normalized using this tool in a transfer.

When creating a normalization command, the following mandatory fields must be filled out:

* Tool - One or more tools to be associated with this command.
* Description - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Command - The script's source, or the commandline statement to execute.
* Script type - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.
* Output format (optional) - The format the command outputs. For example, a command to normalize audio to MP3 using ffmpeg would select the appropriate MP3 format from the dropdown.
* Output location (optional) - The path the normalized file will be written to. See the [[#Writing a command]] section of the documentation for more information.
* Command usage - The purpose of the command; this will be used by Archivematica to decide whether a command is appropriate to run in different circumstances. Values are "Normalization", "Event detail", and "Verification". See the [[#Writing a command]] section of the documentation for more information.
* Event detail command - A command to provide information about the software running this command. This will be written to the METS file as the "event detail" property. For example, the normalization commands which use ffmpeg use an event detail command to extract ffmpeg's version number.

=== Format Policy Rules ===

Format policy rules allow commands to be associated with specific file types. For instance, this allows you to configure the command that uses ImageMagick to create thumbnails to be run on .gif and .jpeg files, while selecting a different command to be run on .png files.

When creating a format policy rule, the following mandatory fields must be filled out:

* Purpose - Allows Archivematica to distinguish rules that should be used to normalize for preservation, normalize for access, to extract information, etc.
* Format - The file format the associated command should be selected for.
* Command - The specific command to call when this rule is used.

=== Writing a command ===

==== Identification command ====

Identification commands are very simple to write, though they require some familiarity with Unix scripting.

An identification command run once for every file in a transfer. It will be passed a single argument (the path to the file to identify), and no switches.

On success, a command should:

* Print the identifier to stdout
* Exit 0

On failure, a command should:

* Print nothing to stdout
* Exit non-zero (Archivematica does not assign special significance to non-zero exit codes)

A command can print anything to stderr on success or error, but this is purely informational - Archivematica won't do anything special with it. Anything printed to stderr by the command will be shown to the user in the Archivematica dashboard's detailed tool output page. You should print any useful error output to stderr if identification fails, but you can also print any useful extra information to stderr if identification succeeds.

Here's a very simple Python script that identifies files by their file extension:

<pre>import os.path, sys
(_, extension) = os.path.splitext(sys.argv[1])
if len(extension) == 0:
exit(1)
else:
print extension.lower()</pre>

Here's a more complex Python example, which uses [http://www.sno.phy.queensu.ca/~phil/exiftool/ Exiftool]'s XML output to return the MIME type of a file:

<pre>#!/usr/bin/env python

from lxml import etree
import subprocess
import sys

try:
xml = subprocess.check_output(['exiftool', '-X', sys.argv[1]])
doc = etree.fromstring(xml)
print doc.find('.//{http://ns.exiftool.ca/File/1.0/}MIMEType').text
except Exception as e:
print >> sys.stderr, e
exit(1)</pre>

Once you've written an identification command, you can register it in the FPR using the following steps:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for the command itself:

# Click "Create New Command".
# Select your tool from the "Tool" dropdown box.
# Fill out the Identifier with text to describe to a user what this tool does. For instance, we might choose "Identify MIME-type using Exiftool".
# Select the appropriate script type - in this case, "Python Script".
# Enter the source code for your script in the "Command" box.
# Click "Create Command".

Finally, you must create rules which associate the possible outputs of your tool with the FPR's format records. This needs to be done once for every supported format; we'll show it with MP3, as an example.

# Navigate to the "Identification Rules" page, and click "Create New Rule".
# Choose the appropriate foramt from the Format dropdown - in our case, "Audio: MPEG Audio: MPEG 1/2 Audio Layer 3".
# Choose your command from the Command dropdown.
# Enter the text your command will output when it identifies this format. For example, when our Exiftool command identifies an MP3 file, it will output "audio/mpeg".
# Click "Create".

Once this is complete, any new transfers you create will be able to use your new tool in the identification step.

==== Normalization Command ====

Normalization commands are a bit more complex to write because they take a few extra parameters.

The goal of a normalization command is to take an input file and transform it into a new format. For instance, Archivematica provides commands to transform video content into FFV1 for preservation, and into H.264 for access.

Archivematica provides several parameters specifying input and output filenames and other useful information. Several of the most common are shown in the examples below; a more complete list is in a later section of the documentation: [[#Normalization command variables and arguments]]

When writing a bash script or a command line, you can reference the variables directly in your code, like this:

<pre>inkscape -z "%fileFullName%" --export-pdf="%outputDirectory%%prefix%%fileName%%postfix%.pdf"</pre>

When writing a script in Python or other languages, the values will be passed to your script as commandline options, which you will need to parse. The following script provides an example using the argparse module that comes with Python:

<pre>import argparse
import subprocess

parser = argparse.ArgumentParser()

parser.add_argument('--file-full-name', dest='filename')
parser.add_argument('--output-file-name', dest='output')
parsed, _ = parser.parse_known_args()
args = [
'ffmpeg', '-vsync', 'passthrough',
'-i', parsed.filename,
'-map', '0:v', '-map', '0:a',
'-vcodec', 'ffv1', '-g', '1',
'-acodec', 'pcm_s16le',
parsed.output+'.mkv'
]

subprocess.call(args)</pre>

Once you've created a command, the process of registering it is similar to creating a new identification tool. The folling examples will use the Python normalization script above.

First, create a new tool record:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for your new command:

# Click "Create New Tool Command".
# Fill out the Description with text to describe to a user what this tool does. For instance, we might choose "Normalize to mkv using ffmpeg".
# Enter the source for your command in the Command textbox.
# Select the appropriate script type - in this case, "Python Script".
# Select the appropriate output format from the dropdown. This indicates to Archivematica what kind of file this command will produce. In this case, choose "Video: Matroska: Generic MKV".
# Enter the location the video will be saved to, using the script variables. You can usually use the "%outputFileName%" variable, and add the file extension - in this case "%outputFileName%.mkv"
# Select a verification command. Archivematica will try to use this tool to ensure that the file your command created works. Archivematica ships with two simple tools, which test whether the file exists and whether it's larger than 0 bytes, but you can create new commands that perform more complicated verifications.
# Finally, choose a command to produce the "Event detail" text that will be written in the section of the METS file covering the normalization event. Archivematica already includes a suitable command for ffmpeg, but you can also create a custom command.
# Click "Create command".

Finally, you must create rules which will associate your command with the formats it should run on.

==== Normalization command variables and arguments ====

The following variables and arguments control the behaviour of format policy command scripts.

{|
|Name (bashScript and command)||Commandline option (pythonScript and asIs)||Description||Sample value
|-
|%fileName%||--input-file=||The filename of the file to process. This variable holds the file's basename, not the whole path.||video.mov
|-
|%fileDirectory%||--file-directory=||The directory containing the input file.||/path/to
|-
|%inputFile%||--file-name=||The fully-qualified path to the file to process.||/path/to/video.mov
|-
|%fileExtension%||--file-extension=||The file extension of the input file.||mov
|-
|%fileExtensionWithDot%||--file-extension-with-dot=||As above, without stripping the period.||.mov
|-
|%outputDirectory%||--output-directory=||The directory to which the output file should be saved.||/path/to/access/copies
|-
|%outputFileUUID%||--output-file-uuid=||The unique identifier assigned by Archivematica to the output file.||1abedf3e-3a4b-46d7-97da-bd9ae13859f5
|-
|%outputDirectory%||--output-directory=||The fully-qualified path to the directory where the new file should be written.||/var/archivematica/sharedDirectory/www/AIPsStore/uuid
|-
|%outputFileName%||--output-file-name=||The fully-qualified path to the output file, minus the file extension.||/path/to/access/copies/video-uuid
|}

= Customization and automation =
* Workflow processing decisions can be made in the processingMCP.xml file. [https://www.archivematica.org/wiki/Administrator_manual_0.10#Processing_configuration See here.]
* Workflows are currently created at the development level.
*: Some resources avialable
*:* [[MCP_Basic_Configuration]]
*:* [[MCP]]
*:* [[Creating_Custom_Workflows]]
*:* [[Development]]
* Normalization commands can be viewed in the preservation planning tab.
* Normalization paths and commands are currently editable under the preservation planning tab in the dashboard.

= Elasticsearch =

Archivematica has the capability of indexing data about files contained in AIPs and this data can be [[Elasticsearch Development|accessed programatically]] for various applications.

If, for whatever reason, you need to delete an ElasticSearch index please see [[ElasticSearch Administration]].

If, for whatever reason, you need to delete an Elasticsearch index programmatically, this can be done with pyes using the following code.

<pre>
import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
conn = ES('127.0.0.1:9200')

try:
conn.delete_index('aips')
except:
print "Error deleting index or index already deleted."
</pre>

=== Rebuilding the AIP index ===

To rebuild the ElasticSearch AIP index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-aip-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-aip-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-aip-index-from-files <location of your AIP store>

=== Rebuilding the transfer index ===

Similarly, to rebuild the ElasticSearch transfer data index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-transfer-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-transfer-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-transfer-index-from-files <location of your AIP store>

= Data backup =

In Archivematica there are three types of data you'll likely want to back up:
* Filesystem (particularly your storage directories)
* MySQL
* ElasticSearch

MySQL is used to store short-term processing data. You can back up the MySQL database by using the following command:

<pre>mysqldump -u <your username> -p<your password> -c MCP > <filename of backup></pre>

ElasticSearch is used to store long-term data. Instructions and scripts for backing up and restoring ElasticSearch are available [http://tech.superhappykittymeow.com/?p=296 here].

= Security =

Once you've set up Archivematica it's a good practice, for the sake of security, to change the default passwords.

== MySQL ==

You should create a new MySQL user or change the password of the default "archivematica" MySQL user. The change the password of the default user, enter the following into the command-line:

$ mysql -u root -p<your MyQL root password> -D mysql \
-e "SET PASSWORD FOR 'archivematica'@'localhost' = PASSWORD('<new password>'); \
FLUSH PRIVILEGES;"

Once you've done this you can change Archivematica's MySQL database access credentials by editing these two files:
* <code>/etc/archivematica/archivematicaCommon/dbsettings</code> (change the <code>user</code> and <code>password</code> settings)
* <code>/usr/share/archivematica/dashboard/settings/common.py</code> (change the <code>USER</code> and <code>PASSWORD</code> settings in the <code>DATABASES</code> section)

Archivematica does not presently support secured MySQL communication so MySQL should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1645 1645].

== AtoM ==

In addition to changing the MySQL credentials, if you've also installed AtoM you'll want to set the password for it as well. Note that after changing your AtoM credentials you should update the credentials on the AtoM DIP upload administration page as well.

== Gearman ==

Archivematica relies on the German server for queuing work that needs to be done. Gearman currently doesn't support secured connections so Gearman should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1345 1345].

= Questions =

If you run into any difficulties while administrating Archivematica, please check out our FAQ and, if that doesn't help you, contain us using the Archivematica discussion group.

== Frequently asked questions ==
* [[AM_FAQ|Solutions to common questions]]

== Discussion group ==
* [http://groups.google.com/group/archivematica?hl=en Discussion group] for questions not covered by the FAQ

Administrator manual 1.2

2014-08-07T22:31:21Z

Mdemeo: /* Types of FPR entries */ Add Transcription section

[[Main Page]] > [[Documentation]] > Administrator manual 1.2

This manual covers administrator-specific instructions for Archivematica. It will also provide help for using forms in the Administration tab of the Archivematica dashboard and the administrator capabilities in the Format Policy Registry (FPR), which you will find in the Preservation planning tab of the dashboard.

For end-user instructions, please see the [[User_manual_1.2|user manual]].

= Installation =
* [[Installation|Instructions for installing the latest build of Archivematica on your server]]

= Upgrading =

Currently, Archivematica does not support upgrading from one version to the next. A re-install is required. After re-installing, you can restore Archivematica's knowledge of your AIPs, by [[#Rebuilding_the_AIP_index|rebuilding the AIP index]] and, if you have transfers stored in the backlog, [[#Rebuilding_the_transfer_index|rebuilding the transfer index]].

= Storage service =
The Archivematica Storage Service allows the configuration of storage spaces associated with multiple Archivematica pipelines. It allows a storage administrator to configure what storage is available to each Archivematica installation, both locally and remote.

[[File:SS1-0.png|700px|center|thumb|Home page of Storage Service]]

TODO Discuss how spaces and locations fit into each other, pipelines fit to locations, spaces=config, locations=purpose, packages in locations

== Archivematica Configuration ==

When installing Archivematica, options to configure it with the Storage Service will be presented.

[[File:Install3.png|600px|center]]

If you have installed the Storage Service at a different URL, you may change that here.

The top button 'Use default transfer source & AIP storage locations' will attempt to automatically configure default Locations for Archivematica, register a new Pipeline, and generate an error if the Storage Service is not available. Use this option if you want the Storage Service to automatically set up the configured default values.

The bottom button 'Register this pipeline & set up transfer source and AIP storage locations' will only attempt to register a new Pipeline with the Storage Service, and will not error if not Storage Service can be found. It will also open a link to the provided Storage Service URL, so that Locations can be configured manually. Use this option if the default values not desired, or the Storage Service is not running yet. Locations will have to be configured manually before any Transfers can be processed, or AIPs stored.

If the Storage Service is running, the URL to it should be entered, and Archivematica will attempt to register its dashboard UUID as a new Pipeline. Otherwise, the dashboard UUID is displayed, and a Pipeline for this Archivematica instance can be manually created and configured. The dashboard UUID is also available in Archivematica under Administration -> General.

=== Change the port in the web server configuration ===

The storage services uses nginx by default, so you can edit /etc/nginx/sites-enabled/storage and change the line that says

listen 8000;

change 8000 to whatever port you prefer to use.

Keep in mind that in a default installation of Archivematica 1.0, the dashboard is running in Apache on port 80. So it is not possible to make nginx run on port 80 on the same machine. If you install the storage service on its own server, you can set it to use port 80.

Make sure to adjust the dashboard UUID in the Archivematica dashboard under Administration -> General.

== Spaces ==
[[File:Spaces.png|600px|center]]
A storage Space contains all the information necessary to connect to the physical storage. It is where protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored. All locations must be contained in a space.

A space is usually the immediate parent of the Location folders. For example, if you had transfer source locations at <tt>/home/artefactual/archivematica-sampledata-2013-10-10-09-17-20</tt> and <tt>/home/artefactual/maildir_transfers</tt>, the Space's path would be <tt>/home/artefactual/</tt>

Currently supported protocols are local filesystem, NFS, and pipeline local filesystem.

=== Local Filesystem ===

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. Typically this is the hard drive, SSD or raid array attached to the machine, but it could also encompass remote storage that has already been mounted. For remote storage that has been locally mounted, we recommend using a more specific Space if one is available.

==== Fields ====
* ''Path'': Absolute path to the Space on the local filesystem
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.

=== NFS ===

NFS spaces are for NFS exports mounted on the Storage Service server, and the Archivematica pipeline.

==== Fields ====
* ''Path'': Absolute path the space is mounted at on the filesystem local to the storage service
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the remote computer exporting the NFS mount.
* ''Remote path'': Export path on the NFS server
* ''Version'': nfs or nfs4 - as would be passed to the <tt>mount</tt> command.
* ''Manually Mounted'': Check this if it has been mounted already. Otherwise, the Storage Service will try to mount it. ''Note: this feature is not yet available.''

=== Pipeline Local Filesystem ===

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host.

For example, the storage service is hosted on <tt>storage_service_host</tt> and Archivematica is running on <tt>archivematica1</tt> . The transfer sources for Archivematica are stored locally on <tt>archivematica1</tt>, but the storage service needs access to them. The Space for that transfer source would be a Pipeline Local Filesystem.

'''Note: Passwordless SSH must be set up between the Storage Service host and the computer Archivematica is running on.'''

==== Fields ====
* ''Path'': Absolute path to the space on the remote machine.
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the computer running Archivematica. Should be SSH accessible from the Storage Service computer.
* ''Remote user'': Username on the remote host

== Locations ==
[[File:Locations.png|600px|center]]
A storage Location is contained in a Space, and knows its purpose in the Archivematica system. A Location is also where Packages are stored. Each Location is associated with a pipeline and can only be accessed by that pipeline.

Currently, a Location can have one of three purposes: Transfer Source, Currently Processing, or AIP Storage. Transfer source locations display in Archivematica's Transfer tab, and any folder in a transfer source can be selected to become a Transfer. AIP storage locations are where the completed AIPs are put for long-term storage. During processing, Archivematica uses the currently processing location associated with that pipeline. Only one currently processing location should be associated with a given pipeline. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.

==== Fields ====
* ''Purpose'': What use the Location is for
* ''Pipeline'': Which pipelines this location is available to.
* ''Relative Path'': Path to this Location, relative to the space that contains it.
* ''Description'': Description of the Location to be displayed to the user.
* ''Quota'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Enabled'': If checked, this location is accessible to pipelines associated with it. If unchecked, it will not show up to any pipeline.

== Pipeline ==
[[File:Pipelines.png|600px|center]]
A pipeline is an Archivematica instance registered with the Storage Service, including the server and all associated clients. Each pipeline is uniquely identified by a UUID, which can be found in the dashboard under Administration -> General Configuration. When installing Archivematica, it will attempt to register its UUID with the Storage Service, with a description of "Archivematica on <hostname>".

==== Fields ====
* ''UUID'': Unique identifier of the Archivematica pipeline
* ''Description'': Description of the pipeline displayed to the user. e.g. Sankofa demo site
* ''Enabled'': If checked, this pipeline can access locations associate with it. If unchecked, all locations will be disabled, even if associated.
* ''Default Locations'': If checked, the default locations configured in Administration -> Configuration will be created or associated with the new pipeline.

== Packages ==
[[File:Packages.png|600px|center]]
A Package is a file that Archivematica has stored in the Storage Service, commonly an Archival Information Package (AIP). They cannot be created or deleted through the Storage Service interface, though a deletion request can be submitted through Archivematica that must be approved or rejected by the storage service administrator. To learn more about deleting an AIP, see [[UM_archival_storage_1.2#Deleting_an_AIP|Deleting an AIP]].

== Administration ==
[[File:StorageserviceAdmin1.png|600px|center]]
[[File:StorageserviceAdmin2.png|600px|center]]
The Administration section manages the users and settings for the Storage Service.

=== Users ===

Only registered users can long into the storage service, and the Users page is where users can be created or modified.

TODO what info means, what admin/active mean, who can edit what

=== Settings ===

Settings control the behavior of the Storage Service. Default Locations are the created or associated with pipelines when they are created.

'''Pipelines are disabled upon creation?''' sets whether a newly created Pipeline can access its Locations. If a Pipeline is disabled, it cannot access any of its locations. By disabling newly created Pipelines, it provides some security against unwanted perusal of the files in Locations, or use by unauthorized Archivematica instances. This can be configured individually when creating a Pipeline manually through the Storage Service website.

'''Default Locations''' set what existing locations should be associated with a newly created Pipeline, or what new Locations should be created for each new Pipeline. No matter what is configured here, a Currently Processing location is created for all Pipelines, since one is required. Multiple Transfer Source or AIP Storage Locations can be configured by holding down Ctrl when selecting them. New Locations in an existing Space can be created for Pipelines that use default locations by entering the relevant information.

== How to Configure a Location ==

For Spaces of the type "Local Filesystem," Locations are basically directories (or more accurately, paths to directories). You can create Locations for Transfer Source, Currently Processing, or AIP Storage directories.

To create and configure a new Location:

# In the Storage Service, click on the "Spaces" tab.
# Under the Space that you want to add the Location to, click on the "Create Location here" link.
# Choose a purpose (e.g. AIP Storage) and pipeline, and enter a "Relative Path" (e.g. var/mylocation) and human-readable description. The Relative Path is relative to the Path defined in the Space you are adding the Location to, e.g. for the default Space, the Path is '/' so your Location path would be relative to that (in the example here, the complete path would end up being '/var/mylocation'). Note: if the path you are defining in your Location doesn't exist, you must create it manually and make sure it is writable by the archivematica user.
# Save the Location settings.
# The new location will now be available as an option under the appropriate options in the Dashboard, for example as a Transfer location (which must be enabled under the Dashboard "Administration" tab) or as a destination for AIP storage.

== Store DIP ==

= Dashboard administration tab =

The Archivematica administration pages, under the Administration tab of the dashboard, allows you to configure application components and manage users.

== Processing configuration ==

When processing a SIP or transfer, you may want to automate some of the workflow choices. Choices can be preconfigured by putting a 'processingMCP.xml' file into the root directory of a SIP/transfer.

If a SIP or transfer is submitted with a 'processingMCP.xml' file, processing decisions will be made with the included file.

The XML file format is:
<pre><processingMCP>
<preconfiguredChoices>

<preconfiguredChoice>
<appliesTo>755b4177-c587-41a7-8c52-015277568302</appliesTo>
<goToChain>d4404ab1-dc7f-4e9e-b1f8-aa861e766b8e</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>eeb23509-57e2-4529-8857-9d62525db048</appliesTo>
<goToChain>5727faac-88af-40e8-8c10-268644b0142d</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>19adb668-b19a-4fcb-8938-f49d7485eaf3</appliesTo>
<goToChain>333643b7-122a-4019-8bef-996443f3ecc5</goToChain>
<delay unitCtime="yes">2419200.0</delay>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>dec97e3c-5598-4b99-b26e-f87a435a6b7f</appliesTo>
<goToChain>01d80b27-4ad1-4bd1-8f8d-f819f18bf685</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>f19926dd-8fb5-4c79-8ade-c83f61f55b40</appliesTo>
<goToChain>85b1e45d-8f98-4cae-8336-72f40e12cbef</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>7a024896-c4f7-4808-a240-44c87c762bc5</appliesTo>
<goToChain>3c1faec7-7e1e-4cdd-b3bd-e2f05f4baa9b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01d64f58-8295-4b7b-9cab-8f1b153a504f</appliesTo>
<goToChain>9475447c-9889-430c-9477-6287a9574c5b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01c651cb-c174-4ba4-b985-1d87a44d6754</appliesTo>
<goToChain>414da421-b83f-4648-895f-a34840e3c3f5</goToChain>
</preconfiguredChoice>
</preconfiguredChoices>
</processingMCP>
</pre>

Where appliesTo is the UUID associated with the micro-service job presented in the dashboard, and goToChain is the UUID of the desired selection. The default processingMCP.xml file is located at '/var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml'.

The processing configuration administration page of the dashboard provides you with an easy form to configure the default 'processingMCP.xml' that's added to a SIP or transfer if it doesn't already contain one. When you change the options using the web interface the necessary XML will be written behind the scenes.

[[File:ProcessingConfig1-1.png|1000px|center|thumb|Processing configuration form in Administration tab of the dashboard]]

*For the approval (yes/no) steps, the user ticks the box on the left-hand side to make a choice. If the box is not ticked, the approval step will appear in the dashboard.
*For the other steps, if no actions are selected the choices appear in the dashboard
*You can select whether or not to send transfers to quarantine (yes/no) and decide how long you'd like them to stay there.
*You can select whether to extract packages as well as whether to keep and/or delete the extracted objects and/or the package itself.
*You can approve normalization, sending the AIP to storage, and uploading the DIP without interrupting the workflow in the dashboard.
*You can pre-select which format identification tool and command to run in both/either transfer and/or ingest to base your normalization upon.
*You can choose to send a transfer to backlog or to create a SIP every time.
*You can select to be reminded to add PREMIS event metadata about manual normalization should you choose to use that capability.
*You can select between 7z using lzma and 7zip using bzip or parallel bzip2 algorithms for AIP compression.
*For select compression level, the options are as follows:
**1 - fastest mode
**3 - fast compression mode
**5 - normal compression mode
**7 - maximum compression
**9 - ultra compression
*You can select one archival storage location where you will consistently send your AIPs.

== General ==

In the general configuration section, you can select interface options and set [[Administrator_manual_1.2#Storage_service|Storage Service]] options for your Archivematica client.

[[File:Generalconfig.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

=== Interface options ===

Here, you can hide parts of the interface that you don't need to use. In particular, you can hide CONTENTdm DIP upload link, AtoM DIP upload link and DSpace transfer type.

=== Storage Service options ===

This is where you'll find the complete URL for the Storage Service. See [[Administrator_manual_1.2#Storage_service|Storage Service]] for more information about this feature.

== Failures ==

Archivematica 1.2 includes dashboard failure reporting.
[[File:FailuresAdmin.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

== Transfer source location ==

Archivematica allows you to start transfers using the operating system's file browser or via a web interface. Source files for transfers, however, can't be uploaded using the web interface: they must exist on volumes accessible to the Archivematica MCP server and configured via the [[Administrator_manual_1.2#Storage_service|Storage Service]].

When starting a transfer you're required to select one or more directories of files to add to the transfer.

You can view your transfer source directories in the Administrative tab of the dashboard under "Transfer source locations".

== AIP storage locations ==

AIP storage directories are directories in which completed AIPs are stored. Storage directories can be specified in a manner similar to transfer source directories using the [[Administrator_manual_1.2#Storage_service|Storage Service]].

You can view your transfer source directories in the Administrative tab of the dashboard under "AIP storage locations"

== AtoM DIP upload ==

Archivematica can upload DIPs directly to an [https://www.ica-atom.org/ AtoM] website so the contents can be accessed online. The AtoM DIP upload configuration page is where you specify the details of the AtoM installation you'd like the DIPs uploaded to (and, if using Rsync to transfer the DIP files, Rsync transfer details).

The parameters that you'll most likely want to set are <code>url</code>, <code>email</code>, and <code>password</code>. These parameters, respectively, specify the destination AtoM website's URL, the email address used to log in to the website, and the password used to log in to the website.

AtoM DIP upload can also use [http://en.wikipedia.org/wiki/Rsync Rsync] as a transfer mechanism. Rsync is an open source utility for efficiently transferring files. The <code>rsync-target</code> parameter is used to specify an Rsync-style target host/directory pairing, "foobar.com:~/dips/" for example. The <code>rsync-command</code> parameter is used to specify rsync connection options, "ssh -p 22222 -l user" for example. If you are using the rsync option, please see AtoM server configuration below.

To set any parameters for AtoM DIP upload change the values, preserving the existing format they're specified in, in the "Command arguments" field then click "Save".

Note that in AtoM, the sword plugin (Admin --> Plugins --> qtSwordPlugin) must be enabled in order for AtoM to receive uploaded DIPs. Enabling Job scheduling (Admin --> Settings --> Job scheduling) is also recommended.

=== AtoM server configuration ===

This server configuration step is necessary to allow Archivematica to log in to the AtoM server without passwords, and only when the user is deploying the rsync option described above in the AtoM DIP upload section.

To enable sending DIPs from Archivematica to the AtoM server:

Generate SSH keys for the Archivematica user. Leave the passphrase field blank.
<pre>
$ sudo -i -u archivematica
$ cd ~
$ ssh-keygen
</pre>

Copy the contents of <code>/var/lib/archivematica/.ssh/id_rsa.pub</code> somewhere handy, you will need it later.

Now, it's time to configure the AtoM server so Archivematica can send the DIPs using SSH/rsync. For that purpose, you will create a user called <code>archivematica</code> and we are going to assign that user a restricted shell with access only to rsync:

<pre>
$ sudo apt-get install rssh
$ sudo useradd -d /home/archivematica -m -s /usr/bin/rssh archivematica
$ sudo passswd -l archivematica
$ sudo vim /etc/rssh.conf // Make sure that allowrsync is uncommented!
</pre>

Add the SSH key that we generated before:

<pre>
$ sudo mkdir /home/archivematica/.ssh
$ chmod 700 /home/archivematica/.ssh/
$ sudo vim /home/archivematica/.ssh/authorized_keys // Paste here the contents of id_dsa.pub
$ chown -R archivematica:archivematica /home/archivematica
</pre>

In Archivematica, make sure that you update the <code>--rsync-target</code> accordingly. 
These are the parameters that we are passing to the upload-qubit microservice. 
Go to the Administration > Upload DIP page in the dashboard.

Generic parameters:

<pre>
--url="http://atom-hostname/index.php" \
--email="demo@example.com" \
--password="demo" \
--uuid="%SIPUUID%" \
--rsync-target="archivematica@atom-hostname:/tmp" \
--debug
</pre>

== CONTENTdm DIP upload ==

Archivematica can also upload DIPs to [http://www.contentdm.org/ CONTENTdm] instances. Multiple CONTENTdm destinations may be configured.

For each possible CONTENTdm DIP upload destination, you'll specify a brief description and configuration parameters appropriate for the destination. Paramters include <code>%ContentdmServer%</code> (full path to the CONTENTdm API, including the leading 'http://' or 'https://', for example http://example.com:81/dmwebservices/index.php), <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> (Linux user and group on the CONTENTdm server, not a CONTENTdm username). Note that only <code>%ContentdmServer%</code> is required is you are going to produce CONTENTdm Project Client packages; <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> are also required if you are going to use the "direct upload" option for uploading your DIPs into CONTENTdm.

When changing parameters for a CONTENTdm DIP upload destination simply change the values, preserving the existing format they're specified in. To add an upload destination fill in the form at the bottom of the page with the appropriate values. When you've completed your changes click the "Save" button.

== PREMIS agent ==

The PREMIS agent name and code can be set via the administration interface.
[[File:Premisagent-10.png|center|900px|thumbs]]

== Rest API ==

In addition to automation using the processingMCP.xml file, Archivematica includes a REST API for automating transfer approval. Using this API, you can create a custom script that copies a transfer to the appropriate directory then uses the <code>curl</code> command, or some other means, to let Archivematica know that the copy is complete.

=== API keys ===

Use of the REST API requires the use of API keys. An API key is associated with a specific user. To generate an API key for a user:

# Browse to <code>/administration/accounts/list/</code>
# Click the "Edit" button for the user you'd like to generate an API key for
# Click the "Regenerate API key" checkbox
# Click "Save"

After generating an API key, you can click the "Edit" button for the user and you should see the API key.

=== IP whitelist ===

In addition to creating API keys, you'll need to add the IP of any computer making REST requests to the REST API whitelist. The IP whitelist can be edited in the administration interface at <code>/administration/api/</code>.

=== Approving a transfer ===

The REST API can be used to approve a transfer. The transfer must first be copied into the appropriate watch directory. To determine the location of the appropriate watch directory, first figure out where the shared directory is from the <code>watchDirectoryPath</code> value of <code>/etc/archivematica/MCPServer/serverConfig.conf</code>. Within that directory is a subdirectory <code>activeTransfers</code>. In this subdirectory are watch directories for the various transfer types.

When using the REST API to approve a transfer, if a transfer type isn't specified, the transfer will be deemed a standard transfer.

'''HTTP Method:''' POST

'''URL:''' <code>/api/transfer/approve</code>

'''Parameters:'''

<code>directory</code>: directory name of the transfer

<code>type</code> (optional): transfer type [standard|dspace|unzipped bag|zipped bag]

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl --data "username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6&type=standard&directory=MyTransfer" http://127.0.0.1/api/transfer/approve

Example result:

{"message": "Approval successful."}

=== Listing unapproved transfers ===

The REST API can be used to get a list of unapproved transfers. Each transfer's directory name and type is returned.

'''Method:''' <code>GET</code>

'''URL:''' <code>/api/transfer/unapproved</code>

'''Parameters:'''

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl "http://127.0.0.1/api/transfer/unapproved?username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6"

Example result:

{
"message": "Fetched unapproved transfers successfully.",
"results": [{
"directory": "MyTransfer",
"type": "standard"
}
]
}
== Users ==

The dashboard provides a simple cookie-based user authentication system using the [https://docs.djangoproject.com/en/1.4/topics/auth/ Django authentication framework]. Access to the dashboard is limited only to logged-in users and a login page will be shown when the user is not recognized. If the application can't find any user in the database, the user creation page will be shown instead, allowing the creation of an administrator account.

Users can be also created, modified and deleted from the Administration tab. Only users who are administrators can create and edit user accounts.

You can add a new user to the system by clicking the "Add new" button on the user administration page. By adding a user you provide a way to access Archivematica using a username/password combination. Should you need to change a user's username or password, you can do so by clicking the "Edit" button, corresponding to the user, on the administration page. Should you need to revoke a user's access, you can click the corresponding "Delete" button.

=== CLI creation of administrative users ===

If you need an additional administrator user one can be created via the command-line, issue the following commands:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py createsuperuser

=== CLI password resetting ===

If you've forgotten the password for your administrator user, or any other user, you can change it via the command-line:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py changepassword <username>

===Security===

Archivematica uses [http://en.wikipedia.org/wiki/PBKDF2 PBKDF2] as the default algorithm to store passwords. This should be sufficient for most users: it's quite secure, requiring massive amounts of computing time to break. However, other algorithms could be used as the following document explains: [https://docs.djangoproject.com/en/1.4/topics/auth/#how-django-stores-passwords How Django stores passwords].

Our plan is to extend this functionality in the future adding groups and granular permissions support.

= Dashboard preservation planning tab =

== Format Policy Registry (FPR) ==

=== Introduction to the Format Policy Registry ===

The Format Policy Registry (FPR) is a database which allows Archivematica users to define format policies for handling file formats. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format). Format policies will change as community standards, practices and tools evolve. Format policies are maintained by Artefactual, who provides a freely-available FPR server hosted at [http://fpr.archivematica.org fpr.archivematica.org]. This server stores structured information about normalization format policies for preservation and access. You can update your local FPR from the FPR server using the UPDATE button in the preservation planning tab of the dashboard. In addition, you can maintain local rules to add new formats or customize the behaviour of Archivematica. The Archivematica dashboard communicates with the FPR server via a REST API.

==== First-time configuration ====

The first time a new Archivematica installation is set up, it will attempt to connect to the FPR server as part of the initial configuration process. As a part of the setup, it will register the Archivematica install with the server and pull down the current set of format policies. In order to register the server, Archivematica will send the following information to the FPR Server, over an encrypted connection:

#Agent Identifier (supplied by the user during registration while installing Archivematica)
#Agent Name (supplied by the user during registration while installing Archivematica)
#IP address of host
#UUID of Archivematica instance
#current time

*The only information that will be passed back and forth between Archivematica and the FPR Server would be these format policies - what tool to run when normalizing for a given purpose (access, preservation) when a specific File Identification Tool identifies a specific File Format. No information about the content that has been run through Archivematica, or any details about the Archivematica installation or configuration would be sent to the FPR Server.

* Because Archivematica is an open source project, it is possible for any organization to conduct a software audit/code review before running Archivematica in a production environment in order to independently verify the information being shared with the FPR Server. An organization could choose to run a private FPR Server, accessible only within their own network(s), to provide at least a limited version of the benefits of sharing format policies, while guaranteeing a completely self-contained preservation system. This is something that Artefactual is not intending to develop, but anyone is free to extend the software as they see fit, or to hire us or other developers to do so.

=== Updating format policies ===

FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica. Clicking the "update" button will initiate an FPR pull which will bring in any new or altered rules since the last time an update was performed.

=== Types of FPR entries ===

==== Format ====

In the FPR, a "format" is a record representing one or more related ''format versions'', which are records representing a specific file format. For example, the format record for "Graphics Interchange Format" (GIF) is comprised of format versions for both GIF 1987a and 1989a.

When creating a new format version, the following fields are available:

* Description (required) - Text describing the format. This will be saved in METS files.
* Version (required) - The version number for this specific format version (not the FPR record). For example, for Adobe Illustrator 14 .ai files, you might choose "14".
* Pronom id - The specific format version's unique identifier in [http://www.nationalarchives.gov.uk/PRONOM/Default.aspx PRONOM], the UK National Archives's format registry. This is optional, but highly recommended.
* Access format and Preservation format - Indicates whether this format is suitable as an access format for end users, and for preservation.

==== Format Group ====

A format group is a convenient grouping of related file formats which share common properties. For instance, the FPR includes an "Image (raster)" group which contains format records for GIF, JPEG, and PNG. Each format can belong to one (and only one) format group.

==== Characterization ====
Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties and to extract technical metadata contained within the object.

Prior to Archivematica 1.2, the characterization micro-service always ran the [http://projects.iq.harvard.edu/fits FITS] tool. As of Archivematica 1.2, characterization is fully customizable by the Archivematica administrator.

===== Characterization tools =====

Archivematica has four default characterization tools upon installation. Which tool will run on a given file depends on the type of file, as determined by the selected identification tool.

====== Default ======

The default characterization tool is FITS; it will be used if no specific characterization rule exists for the file being scanned.

It is possible to create new default characterization commands, which can either replace FITS or run alongside it on every file.

====== Multimedia ======

Archivematica 1.2 introduced three new multimedia characterization tools. These tools were selected for their rich metadata extraction, as well as for their speed. Depending on the type of the file being scanned, one or more of these tools may be called instead of FITS.

* [http://ffmpeg.org/ FFprobe], a characterization tool built on top of the same core as FFmpeg, the normalization software used by Archivematica
* [http://mediaarea.net/en/MediaInfo MediaInfo], a characterization tool oriented towards audio and video data
* [http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool], a characterization tool oriented towards still image data and extraction of embedded metadata

===== Writing a new characterization command =====

Information on writing new characterization commands can be found in the [[Administrator_manual_1.2#Format_Policy_Rules|FPR administrator's manual]].

Writing a characterization command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]]. Like an identification command, a characterization command is designed to run a tool and produce output to standard out. Output from characterization commands is expected to be valid XML, and will be included in the AIP's METS document within the file's <objectCharacteristicsExtension> element.

When creating a characterization command, the "output format" should be set to "XML 1.0".

==== Extraction ====

Archivematica supports extracting contents from files during the transfer phase.

Many transfers contain files which are packages of other files; examples of these include compressed archives, such as ZIP files, or disk images. Archivematica provides a transcription microservice which comes with several predefined rules to extract packages, and which is fully customizeable by Archivematica administrators. Administrators can write new commands, and assign existing formats to run for other file formats.

===== Writing a new extraction command =====

Writing an extraction command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

An extraction command is passed two arguments: the ''file to extract'', and the ''path to which the package should be extracted''. Similar to [[Administrator_manual_1.2#Normalization Command|normalization commands]], these arguments will be interpolated directly into "bashScript" and "command" scripts, and passed as positional arguments to "pythonScript" and "asIs" scripts.

{|
|Name (bashScript and command)||Commandline position (pythonScript and asIs)||Description||Sample value
|-
|%outputDirectory%||First||The full path to the directory in which the package's contents should be extracted||/path/to/filename-uuid/
|-
|%inputFile%||Second||The full path to the package file||/path/to/filename
|}

Here's a simple example of how to call an existing tool (7-zip) without any extra logic:

<pre>7z x -bd -o"%outputDirectory%" "%inputFile%"</pre>

This Python script example is more complex, and attempts to determine whether any files were extracted in order to determine whether to exit 0 or 1 (and report success or failure):

<pre>
from __future__ import print_function
import re
import subprocess
import sys

def extract(package, outdir):
# -a extracts only allocated files; we're not capturing unallocated files
try:
process = subprocess.Popen(['tsk_recover', package, '-a', outdir],
stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
stdout, stderr = process.communicate()

match = re.match(r'Files Recovered: (\d+)', stdout.splitlines()[0])
if match:
if match.groups()[0] == '0':
raise Exception('tsk_recover failed to extract any files with the message: {}'.format(stdout))
else:
print(stdout)
except Exception as e:
return e

return 0

def main(package, outdir):
return extract(package, outdir)

if __name__ == '__main__':
package = sys.argv[1]
outdir = sys.argv[2]
sys.exit(main(package, outdir))
</pre>

==== Transcription ====

Archivematica 1.2 introduces a new transcription microservice. This microservice provides tools to transcribe the contents of media objects. In Archivematica 1.2 it is used to perform OCR on images of textual material, but it can also be used to create commands which perform other kinds of transcription.

===== Writing transcription commands =====

Writing a transcription command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

Transcription commands are expected to write their data to disk inside the SIP. For commands which perform OCR, metadata can be placed inside the "metadata/OCRfiles" directory inside the SIP; other kinds of transcription should produce files within "metadata".

For example, the following bash script is used by Archivematica to transcribe images using the [https://code.google.com/p/tesseract-ocr/ Tesseract] software:

<pre>
ocrfiles="%SIPObjectsDirectory%metadata/OCRfiles"
test -d "$ocrfiles" || mkdir -p "$ocrfiles"

tesseract %fileFullName% "$ocrfiles/%fileName%"
</pre>

==== Identification ====

===== Identification Tools =====

The identification tool properties in Archivematica control the ways in which Archivematica identifies files and associates them with the FPR's version records. The current version of the FPR server contains two tools: a script based on the [http://www.openplanetsfoundation.org/ Open Planets Foundation's] [https://github.com/openplanets/fido/ FIDO] tool, which identifies based on the IDs in PRONOM, and a simple script which identifies files by their file extension. You can use the identification tools portion of FPR to customize the behaviour of the existing tools, or to write your own.

===== Identification Commands =====

Identification commands contain the actual code that a tool will run when identifying a file. This command will be run on every file in a transfer.

When adding a new command, the following fields are available:

* Identifier (mandatory) - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Script type (mandatory) - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.

When coding a command, you should expect your script to take the path to the file to be identifed as the first commandline argument. When returning an identification, the tool should print a single line containing ''only'' the identifier, and should exit 0. Any informative, diagnostic, and error message can be printed to stderr, where it will be visible to Archivematica users monitoring tool results. On failure, the tool should exit non-zero.

===== Identification Rules =====

These identification rules allow you to define the relationship between the output created by an identification tool, and one of the formats which exists in the FPR. This must be done for the format to be tracked internally by Archivematica, and for it to be used by normalization later on. For instance, if you created a FIDO configuration which returns MIME types, you could create a rule which associates the output "image/jpeg" with the "Generic JPEG" format in the FPR.

Identification rules are necessary only when a tool is configured to return file extensions or MIME types. Because PUIDs are universal, Archivematica will always look these up for you without requiring any rules to be created, regardless of what tool is being used.

When creating an identification rule, the following mandatory fields must be filled out:

* Format - Allows you to select one of the formats which already exists in the FPR.
* Command - Indicates the command that produces this specific identification.
* Output - The text which is written to standard output by the specified command, such as "image/jpeg"

==== Format Policy Tools ====

Format policy tools control how Archivematica processes files during ingest. The most common kind of these tools are normalization tools, which produce preservation and access copies from ingested files. Archivematica comes configured with a number of commands and scripts to normalize several file formats, and you can use this section of the FPR to customize them or to create your own. These are organized similarly to the [[#Identification Tools]] documented above.

Archivematica uses the following kinds of format policy rules:

* Characterization
* Extraction
* Normalization - Access, preservation and thumbnails
* Event detail - Extracts information about a given tool in order to be inserted into a generated METS file.
* Transcription
* Verification - Validates a file produced by another command. For instance, a tool could use Exiftool or JHOVE to determine whether a thumbnail produced by a normalization command was valid and well-formed.

=== Format Policy Commands ===

Like the [[#Identification Commands]] above, format policy commands are scripts or command line statements which control how a normalization tool runs. This command will be run once on every file being normalized using this tool in a transfer.

When creating a normalization command, the following mandatory fields must be filled out:

* Tool - One or more tools to be associated with this command.
* Description - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Command - The script's source, or the commandline statement to execute.
* Script type - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.
* Output format (optional) - The format the command outputs. For example, a command to normalize audio to MP3 using ffmpeg would select the appropriate MP3 format from the dropdown.
* Output location (optional) - The path the normalized file will be written to. See the [[#Writing a command]] section of the documentation for more information.
* Command usage - The purpose of the command; this will be used by Archivematica to decide whether a command is appropriate to run in different circumstances. Values are "Normalization", "Event detail", and "Verification". See the [[#Writing a command]] section of the documentation for more information.
* Event detail command - A command to provide information about the software running this command. This will be written to the METS file as the "event detail" property. For example, the normalization commands which use ffmpeg use an event detail command to extract ffmpeg's version number.

=== Format Policy Rules ===

Format policy rules allow commands to be associated with specific file types. For instance, this allows you to configure the command that uses ImageMagick to create thumbnails to be run on .gif and .jpeg files, while selecting a different command to be run on .png files.

When creating a format policy rule, the following mandatory fields must be filled out:

* Purpose - Allows Archivematica to distinguish rules that should be used to normalize for preservation, normalize for access, to extract information, etc.
* Format - The file format the associated command should be selected for.
* Command - The specific command to call when this rule is used.

=== Writing a command ===

==== Identification command ====

Identification commands are very simple to write, though they require some familiarity with Unix scripting.

An identification command run once for every file in a transfer. It will be passed a single argument (the path to the file to identify), and no switches.

On success, a command should:

* Print the identifier to stdout
* Exit 0

On failure, a command should:

* Print nothing to stdout
* Exit non-zero (Archivematica does not assign special significance to non-zero exit codes)

A command can print anything to stderr on success or error, but this is purely informational - Archivematica won't do anything special with it. Anything printed to stderr by the command will be shown to the user in the Archivematica dashboard's detailed tool output page. You should print any useful error output to stderr if identification fails, but you can also print any useful extra information to stderr if identification succeeds.

Here's a very simple Python script that identifies files by their file extension:

<pre>import os.path, sys
(_, extension) = os.path.splitext(sys.argv[1])
if len(extension) == 0:
exit(1)
else:
print extension.lower()</pre>

Here's a more complex Python example, which uses [http://www.sno.phy.queensu.ca/~phil/exiftool/ Exiftool]'s XML output to return the MIME type of a file:

<pre>#!/usr/bin/env python

from lxml import etree
import subprocess
import sys

try:
xml = subprocess.check_output(['exiftool', '-X', sys.argv[1]])
doc = etree.fromstring(xml)
print doc.find('.//{http://ns.exiftool.ca/File/1.0/}MIMEType').text
except Exception as e:
print >> sys.stderr, e
exit(1)</pre>

Once you've written an identification command, you can register it in the FPR using the following steps:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for the command itself:

# Click "Create New Command".
# Select your tool from the "Tool" dropdown box.
# Fill out the Identifier with text to describe to a user what this tool does. For instance, we might choose "Identify MIME-type using Exiftool".
# Select the appropriate script type - in this case, "Python Script".
# Enter the source code for your script in the "Command" box.
# Click "Create Command".

Finally, you must create rules which associate the possible outputs of your tool with the FPR's format records. This needs to be done once for every supported format; we'll show it with MP3, as an example.

# Navigate to the "Identification Rules" page, and click "Create New Rule".
# Choose the appropriate foramt from the Format dropdown - in our case, "Audio: MPEG Audio: MPEG 1/2 Audio Layer 3".
# Choose your command from the Command dropdown.
# Enter the text your command will output when it identifies this format. For example, when our Exiftool command identifies an MP3 file, it will output "audio/mpeg".
# Click "Create".

Once this is complete, any new transfers you create will be able to use your new tool in the identification step.

==== Normalization Command ====

Normalization commands are a bit more complex to write because they take a few extra parameters.

The goal of a normalization command is to take an input file and transform it into a new format. For instance, Archivematica provides commands to transform video content into FFV1 for preservation, and into H.264 for access.

Archivematica provides several parameters specifying input and output filenames and other useful information. Several of the most common are shown in the examples below; a more complete list is in a later section of the documentation: [[#Normalization command variables and arguments]]

When writing a bash script or a command line, you can reference the variables directly in your code, like this:

<pre>inkscape -z "%fileFullName%" --export-pdf="%outputDirectory%%prefix%%fileName%%postfix%.pdf"</pre>

When writing a script in Python or other languages, the values will be passed to your script as commandline options, which you will need to parse. The following script provides an example using the argparse module that comes with Python:

<pre>import argparse
import subprocess

parser = argparse.ArgumentParser()

parser.add_argument('--file-full-name', dest='filename')
parser.add_argument('--output-file-name', dest='output')
parsed, _ = parser.parse_known_args()
args = [
'ffmpeg', '-vsync', 'passthrough',
'-i', parsed.filename,
'-map', '0:v', '-map', '0:a',
'-vcodec', 'ffv1', '-g', '1',
'-acodec', 'pcm_s16le',
parsed.output+'.mkv'
]

subprocess.call(args)</pre>

Once you've created a command, the process of registering it is similar to creating a new identification tool. The folling examples will use the Python normalization script above.

First, create a new tool record:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for your new command:

# Click "Create New Tool Command".
# Fill out the Description with text to describe to a user what this tool does. For instance, we might choose "Normalize to mkv using ffmpeg".
# Enter the source for your command in the Command textbox.
# Select the appropriate script type - in this case, "Python Script".
# Select the appropriate output format from the dropdown. This indicates to Archivematica what kind of file this command will produce. In this case, choose "Video: Matroska: Generic MKV".
# Enter the location the video will be saved to, using the script variables. You can usually use the "%outputFileName%" variable, and add the file extension - in this case "%outputFileName%.mkv"
# Select a verification command. Archivematica will try to use this tool to ensure that the file your command created works. Archivematica ships with two simple tools, which test whether the file exists and whether it's larger than 0 bytes, but you can create new commands that perform more complicated verifications.
# Finally, choose a command to produce the "Event detail" text that will be written in the section of the METS file covering the normalization event. Archivematica already includes a suitable command for ffmpeg, but you can also create a custom command.
# Click "Create command".

Finally, you must create rules which will associate your command with the formats it should run on.

==== Normalization command variables and arguments ====

The following variables and arguments control the behaviour of format policy command scripts.

{|
|Name (bashScript and command)||Commandline option (pythonScript and asIs)||Description||Sample value
|-
|%fileName%||--input-file=||The filename of the file to process. This variable holds the file's basename, not the whole path.||video.mov
|-
|%fileDirectory%||--file-directory=||The directory containing the input file.||/path/to
|-
|%inputFile%||--file-name=||The fully-qualified path to the file to process.||/path/to/video.mov
|-
|%fileExtension%||--file-extension=||The file extension of the input file.||mov
|-
|%fileExtensionWithDot%||--file-extension-with-dot=||As above, without stripping the period.||.mov
|-
|%outputDirectory%||--output-directory=||The directory to which the output file should be saved.||/path/to/access/copies
|-
|%outputFileUUID%||--output-file-uuid=||The unique identifier assigned by Archivematica to the output file.||1abedf3e-3a4b-46d7-97da-bd9ae13859f5
|-
|%outputDirectory%||--output-directory=||The fully-qualified path to the directory where the new file should be written.||/var/archivematica/sharedDirectory/www/AIPsStore/uuid
|-
|%outputFileName%||--output-file-name=||The fully-qualified path to the output file, minus the file extension.||/path/to/access/copies/video-uuid
|}

= Customization and automation =
* Workflow processing decisions can be made in the processingMCP.xml file. [https://www.archivematica.org/wiki/Administrator_manual_0.10#Processing_configuration See here.]
* Workflows are currently created at the development level.
*: Some resources avialable
*:* [[MCP_Basic_Configuration]]
*:* [[MCP]]
*:* [[Creating_Custom_Workflows]]
*:* [[Development]]
* Normalization commands can be viewed in the preservation planning tab.
* Normalization paths and commands are currently editable under the preservation planning tab in the dashboard.

= Elasticsearch =

Archivematica has the capability of indexing data about files contained in AIPs and this data can be [[Elasticsearch Development|accessed programatically]] for various applications.

If, for whatever reason, you need to delete an ElasticSearch index please see [[ElasticSearch Administration]].

If, for whatever reason, you need to delete an Elasticsearch index programmatically, this can be done with pyes using the following code.

<pre>
import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
conn = ES('127.0.0.1:9200')

try:
conn.delete_index('aips')
except:
print "Error deleting index or index already deleted."
</pre>

=== Rebuilding the AIP index ===

To rebuild the ElasticSearch AIP index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-aip-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-aip-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-aip-index-from-files <location of your AIP store>

=== Rebuilding the transfer index ===

Similarly, to rebuild the ElasticSearch transfer data index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-transfer-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-transfer-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-transfer-index-from-files <location of your AIP store>

= Data backup =

In Archivematica there are three types of data you'll likely want to back up:
* Filesystem (particularly your storage directories)
* MySQL
* ElasticSearch

MySQL is used to store short-term processing data. You can back up the MySQL database by using the following command:

<pre>mysqldump -u <your username> -p<your password> -c MCP > <filename of backup></pre>

ElasticSearch is used to store long-term data. Instructions and scripts for backing up and restoring ElasticSearch are available [http://tech.superhappykittymeow.com/?p=296 here].

= Security =

Once you've set up Archivematica it's a good practice, for the sake of security, to change the default passwords.

== MySQL ==

You should create a new MySQL user or change the password of the default "archivematica" MySQL user. The change the password of the default user, enter the following into the command-line:

$ mysql -u root -p<your MyQL root password> -D mysql \
-e "SET PASSWORD FOR 'archivematica'@'localhost' = PASSWORD('<new password>'); \
FLUSH PRIVILEGES;"

Once you've done this you can change Archivematica's MySQL database access credentials by editing these two files:
* <code>/etc/archivematica/archivematicaCommon/dbsettings</code> (change the <code>user</code> and <code>password</code> settings)
* <code>/usr/share/archivematica/dashboard/settings/common.py</code> (change the <code>USER</code> and <code>PASSWORD</code> settings in the <code>DATABASES</code> section)

Archivematica does not presently support secured MySQL communication so MySQL should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1645 1645].

== AtoM ==

In addition to changing the MySQL credentials, if you've also installed AtoM you'll want to set the password for it as well. Note that after changing your AtoM credentials you should update the credentials on the AtoM DIP upload administration page as well.

== Gearman ==

Archivematica relies on the German server for queuing work that needs to be done. Gearman currently doesn't support secured connections so Gearman should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1345 1345].

= Questions =

If you run into any difficulties while administrating Archivematica, please check out our FAQ and, if that doesn't help you, contain us using the Archivematica discussion group.

== Frequently asked questions ==
* [[AM_FAQ|Solutions to common questions]]

== Discussion group ==
* [http://groups.google.com/group/archivematica?hl=en Discussion group] for questions not covered by the FAQ

Administrator manual 1.2

2014-08-07T22:10:25Z

Mdemeo: /* Types of FPR entries */ Indent Identification by an extra header level

[[Main Page]] > [[Documentation]] > Administrator manual 1.2

This manual covers administrator-specific instructions for Archivematica. It will also provide help for using forms in the Administration tab of the Archivematica dashboard and the administrator capabilities in the Format Policy Registry (FPR), which you will find in the Preservation planning tab of the dashboard.

For end-user instructions, please see the [[User_manual_1.2|user manual]].

= Installation =
* [[Installation|Instructions for installing the latest build of Archivematica on your server]]

= Upgrading =

Currently, Archivematica does not support upgrading from one version to the next. A re-install is required. After re-installing, you can restore Archivematica's knowledge of your AIPs, by [[#Rebuilding_the_AIP_index|rebuilding the AIP index]] and, if you have transfers stored in the backlog, [[#Rebuilding_the_transfer_index|rebuilding the transfer index]].

= Storage service =
The Archivematica Storage Service allows the configuration of storage spaces associated with multiple Archivematica pipelines. It allows a storage administrator to configure what storage is available to each Archivematica installation, both locally and remote.

[[File:SS1-0.png|700px|center|thumb|Home page of Storage Service]]

TODO Discuss how spaces and locations fit into each other, pipelines fit to locations, spaces=config, locations=purpose, packages in locations

== Archivematica Configuration ==

When installing Archivematica, options to configure it with the Storage Service will be presented.

[[File:Install3.png|600px|center]]

If you have installed the Storage Service at a different URL, you may change that here.

The top button 'Use default transfer source & AIP storage locations' will attempt to automatically configure default Locations for Archivematica, register a new Pipeline, and generate an error if the Storage Service is not available. Use this option if you want the Storage Service to automatically set up the configured default values.

The bottom button 'Register this pipeline & set up transfer source and AIP storage locations' will only attempt to register a new Pipeline with the Storage Service, and will not error if not Storage Service can be found. It will also open a link to the provided Storage Service URL, so that Locations can be configured manually. Use this option if the default values not desired, or the Storage Service is not running yet. Locations will have to be configured manually before any Transfers can be processed, or AIPs stored.

If the Storage Service is running, the URL to it should be entered, and Archivematica will attempt to register its dashboard UUID as a new Pipeline. Otherwise, the dashboard UUID is displayed, and a Pipeline for this Archivematica instance can be manually created and configured. The dashboard UUID is also available in Archivematica under Administration -> General.

=== Change the port in the web server configuration ===

The storage services uses nginx by default, so you can edit /etc/nginx/sites-enabled/storage and change the line that says

listen 8000;

change 8000 to whatever port you prefer to use.

Keep in mind that in a default installation of Archivematica 1.0, the dashboard is running in Apache on port 80. So it is not possible to make nginx run on port 80 on the same machine. If you install the storage service on its own server, you can set it to use port 80.

Make sure to adjust the dashboard UUID in the Archivematica dashboard under Administration -> General.

== Spaces ==
[[File:Spaces.png|600px|center]]
A storage Space contains all the information necessary to connect to the physical storage. It is where protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored. All locations must be contained in a space.

A space is usually the immediate parent of the Location folders. For example, if you had transfer source locations at <tt>/home/artefactual/archivematica-sampledata-2013-10-10-09-17-20</tt> and <tt>/home/artefactual/maildir_transfers</tt>, the Space's path would be <tt>/home/artefactual/</tt>

Currently supported protocols are local filesystem, NFS, and pipeline local filesystem.

=== Local Filesystem ===

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. Typically this is the hard drive, SSD or raid array attached to the machine, but it could also encompass remote storage that has already been mounted. For remote storage that has been locally mounted, we recommend using a more specific Space if one is available.

==== Fields ====
* ''Path'': Absolute path to the Space on the local filesystem
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.

=== NFS ===

NFS spaces are for NFS exports mounted on the Storage Service server, and the Archivematica pipeline.

==== Fields ====
* ''Path'': Absolute path the space is mounted at on the filesystem local to the storage service
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the remote computer exporting the NFS mount.
* ''Remote path'': Export path on the NFS server
* ''Version'': nfs or nfs4 - as would be passed to the <tt>mount</tt> command.
* ''Manually Mounted'': Check this if it has been mounted already. Otherwise, the Storage Service will try to mount it. ''Note: this feature is not yet available.''

=== Pipeline Local Filesystem ===

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host.

For example, the storage service is hosted on <tt>storage_service_host</tt> and Archivematica is running on <tt>archivematica1</tt> . The transfer sources for Archivematica are stored locally on <tt>archivematica1</tt>, but the storage service needs access to them. The Space for that transfer source would be a Pipeline Local Filesystem.

'''Note: Passwordless SSH must be set up between the Storage Service host and the computer Archivematica is running on.'''

==== Fields ====
* ''Path'': Absolute path to the space on the remote machine.
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the computer running Archivematica. Should be SSH accessible from the Storage Service computer.
* ''Remote user'': Username on the remote host

== Locations ==
[[File:Locations.png|600px|center]]
A storage Location is contained in a Space, and knows its purpose in the Archivematica system. A Location is also where Packages are stored. Each Location is associated with a pipeline and can only be accessed by that pipeline.

Currently, a Location can have one of three purposes: Transfer Source, Currently Processing, or AIP Storage. Transfer source locations display in Archivematica's Transfer tab, and any folder in a transfer source can be selected to become a Transfer. AIP storage locations are where the completed AIPs are put for long-term storage. During processing, Archivematica uses the currently processing location associated with that pipeline. Only one currently processing location should be associated with a given pipeline. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.

==== Fields ====
* ''Purpose'': What use the Location is for
* ''Pipeline'': Which pipelines this location is available to.
* ''Relative Path'': Path to this Location, relative to the space that contains it.
* ''Description'': Description of the Location to be displayed to the user.
* ''Quota'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Enabled'': If checked, this location is accessible to pipelines associated with it. If unchecked, it will not show up to any pipeline.

== Pipeline ==
[[File:Pipelines.png|600px|center]]
A pipeline is an Archivematica instance registered with the Storage Service, including the server and all associated clients. Each pipeline is uniquely identified by a UUID, which can be found in the dashboard under Administration -> General Configuration. When installing Archivematica, it will attempt to register its UUID with the Storage Service, with a description of "Archivematica on <hostname>".

==== Fields ====
* ''UUID'': Unique identifier of the Archivematica pipeline
* ''Description'': Description of the pipeline displayed to the user. e.g. Sankofa demo site
* ''Enabled'': If checked, this pipeline can access locations associate with it. If unchecked, all locations will be disabled, even if associated.
* ''Default Locations'': If checked, the default locations configured in Administration -> Configuration will be created or associated with the new pipeline.

== Packages ==
[[File:Packages.png|600px|center]]
A Package is a file that Archivematica has stored in the Storage Service, commonly an Archival Information Package (AIP). They cannot be created or deleted through the Storage Service interface, though a deletion request can be submitted through Archivematica that must be approved or rejected by the storage service administrator. To learn more about deleting an AIP, see [[UM_archival_storage_1.2#Deleting_an_AIP|Deleting an AIP]].

== Administration ==
[[File:StorageserviceAdmin1.png|600px|center]]
[[File:StorageserviceAdmin2.png|600px|center]]
The Administration section manages the users and settings for the Storage Service.

=== Users ===

Only registered users can long into the storage service, and the Users page is where users can be created or modified.

TODO what info means, what admin/active mean, who can edit what

=== Settings ===

Settings control the behavior of the Storage Service. Default Locations are the created or associated with pipelines when they are created.

'''Pipelines are disabled upon creation?''' sets whether a newly created Pipeline can access its Locations. If a Pipeline is disabled, it cannot access any of its locations. By disabling newly created Pipelines, it provides some security against unwanted perusal of the files in Locations, or use by unauthorized Archivematica instances. This can be configured individually when creating a Pipeline manually through the Storage Service website.

'''Default Locations''' set what existing locations should be associated with a newly created Pipeline, or what new Locations should be created for each new Pipeline. No matter what is configured here, a Currently Processing location is created for all Pipelines, since one is required. Multiple Transfer Source or AIP Storage Locations can be configured by holding down Ctrl when selecting them. New Locations in an existing Space can be created for Pipelines that use default locations by entering the relevant information.

== How to Configure a Location ==

For Spaces of the type "Local Filesystem," Locations are basically directories (or more accurately, paths to directories). You can create Locations for Transfer Source, Currently Processing, or AIP Storage directories.

To create and configure a new Location:

# In the Storage Service, click on the "Spaces" tab.
# Under the Space that you want to add the Location to, click on the "Create Location here" link.
# Choose a purpose (e.g. AIP Storage) and pipeline, and enter a "Relative Path" (e.g. var/mylocation) and human-readable description. The Relative Path is relative to the Path defined in the Space you are adding the Location to, e.g. for the default Space, the Path is '/' so your Location path would be relative to that (in the example here, the complete path would end up being '/var/mylocation'). Note: if the path you are defining in your Location doesn't exist, you must create it manually and make sure it is writable by the archivematica user.
# Save the Location settings.
# The new location will now be available as an option under the appropriate options in the Dashboard, for example as a Transfer location (which must be enabled under the Dashboard "Administration" tab) or as a destination for AIP storage.

== Store DIP ==

= Dashboard administration tab =

The Archivematica administration pages, under the Administration tab of the dashboard, allows you to configure application components and manage users.

== Processing configuration ==

When processing a SIP or transfer, you may want to automate some of the workflow choices. Choices can be preconfigured by putting a 'processingMCP.xml' file into the root directory of a SIP/transfer.

If a SIP or transfer is submitted with a 'processingMCP.xml' file, processing decisions will be made with the included file.

The XML file format is:
<pre><processingMCP>
<preconfiguredChoices>

<preconfiguredChoice>
<appliesTo>755b4177-c587-41a7-8c52-015277568302</appliesTo>
<goToChain>d4404ab1-dc7f-4e9e-b1f8-aa861e766b8e</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>eeb23509-57e2-4529-8857-9d62525db048</appliesTo>
<goToChain>5727faac-88af-40e8-8c10-268644b0142d</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>19adb668-b19a-4fcb-8938-f49d7485eaf3</appliesTo>
<goToChain>333643b7-122a-4019-8bef-996443f3ecc5</goToChain>
<delay unitCtime="yes">2419200.0</delay>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>dec97e3c-5598-4b99-b26e-f87a435a6b7f</appliesTo>
<goToChain>01d80b27-4ad1-4bd1-8f8d-f819f18bf685</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>f19926dd-8fb5-4c79-8ade-c83f61f55b40</appliesTo>
<goToChain>85b1e45d-8f98-4cae-8336-72f40e12cbef</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>7a024896-c4f7-4808-a240-44c87c762bc5</appliesTo>
<goToChain>3c1faec7-7e1e-4cdd-b3bd-e2f05f4baa9b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01d64f58-8295-4b7b-9cab-8f1b153a504f</appliesTo>
<goToChain>9475447c-9889-430c-9477-6287a9574c5b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01c651cb-c174-4ba4-b985-1d87a44d6754</appliesTo>
<goToChain>414da421-b83f-4648-895f-a34840e3c3f5</goToChain>
</preconfiguredChoice>
</preconfiguredChoices>
</processingMCP>
</pre>

Where appliesTo is the UUID associated with the micro-service job presented in the dashboard, and goToChain is the UUID of the desired selection. The default processingMCP.xml file is located at '/var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml'.

The processing configuration administration page of the dashboard provides you with an easy form to configure the default 'processingMCP.xml' that's added to a SIP or transfer if it doesn't already contain one. When you change the options using the web interface the necessary XML will be written behind the scenes.

[[File:ProcessingConfig1-1.png|1000px|center|thumb|Processing configuration form in Administration tab of the dashboard]]

*For the approval (yes/no) steps, the user ticks the box on the left-hand side to make a choice. If the box is not ticked, the approval step will appear in the dashboard.
*For the other steps, if no actions are selected the choices appear in the dashboard
*You can select whether or not to send transfers to quarantine (yes/no) and decide how long you'd like them to stay there.
*You can select whether to extract packages as well as whether to keep and/or delete the extracted objects and/or the package itself.
*You can approve normalization, sending the AIP to storage, and uploading the DIP without interrupting the workflow in the dashboard.
*You can pre-select which format identification tool and command to run in both/either transfer and/or ingest to base your normalization upon.
*You can choose to send a transfer to backlog or to create a SIP every time.
*You can select to be reminded to add PREMIS event metadata about manual normalization should you choose to use that capability.
*You can select between 7z using lzma and 7zip using bzip or parallel bzip2 algorithms for AIP compression.
*For select compression level, the options are as follows:
**1 - fastest mode
**3 - fast compression mode
**5 - normal compression mode
**7 - maximum compression
**9 - ultra compression
*You can select one archival storage location where you will consistently send your AIPs.

== General ==

In the general configuration section, you can select interface options and set [[Administrator_manual_1.2#Storage_service|Storage Service]] options for your Archivematica client.

[[File:Generalconfig.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

=== Interface options ===

Here, you can hide parts of the interface that you don't need to use. In particular, you can hide CONTENTdm DIP upload link, AtoM DIP upload link and DSpace transfer type.

=== Storage Service options ===

This is where you'll find the complete URL for the Storage Service. See [[Administrator_manual_1.2#Storage_service|Storage Service]] for more information about this feature.

== Failures ==

Archivematica 1.2 includes dashboard failure reporting.
[[File:FailuresAdmin.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

== Transfer source location ==

Archivematica allows you to start transfers using the operating system's file browser or via a web interface. Source files for transfers, however, can't be uploaded using the web interface: they must exist on volumes accessible to the Archivematica MCP server and configured via the [[Administrator_manual_1.2#Storage_service|Storage Service]].

When starting a transfer you're required to select one or more directories of files to add to the transfer.

You can view your transfer source directories in the Administrative tab of the dashboard under "Transfer source locations".

== AIP storage locations ==

AIP storage directories are directories in which completed AIPs are stored. Storage directories can be specified in a manner similar to transfer source directories using the [[Administrator_manual_1.2#Storage_service|Storage Service]].

You can view your transfer source directories in the Administrative tab of the dashboard under "AIP storage locations"

== AtoM DIP upload ==

Archivematica can upload DIPs directly to an [https://www.ica-atom.org/ AtoM] website so the contents can be accessed online. The AtoM DIP upload configuration page is where you specify the details of the AtoM installation you'd like the DIPs uploaded to (and, if using Rsync to transfer the DIP files, Rsync transfer details).

The parameters that you'll most likely want to set are <code>url</code>, <code>email</code>, and <code>password</code>. These parameters, respectively, specify the destination AtoM website's URL, the email address used to log in to the website, and the password used to log in to the website.

AtoM DIP upload can also use [http://en.wikipedia.org/wiki/Rsync Rsync] as a transfer mechanism. Rsync is an open source utility for efficiently transferring files. The <code>rsync-target</code> parameter is used to specify an Rsync-style target host/directory pairing, "foobar.com:~/dips/" for example. The <code>rsync-command</code> parameter is used to specify rsync connection options, "ssh -p 22222 -l user" for example. If you are using the rsync option, please see AtoM server configuration below.

To set any parameters for AtoM DIP upload change the values, preserving the existing format they're specified in, in the "Command arguments" field then click "Save".

Note that in AtoM, the sword plugin (Admin --> Plugins --> qtSwordPlugin) must be enabled in order for AtoM to receive uploaded DIPs. Enabling Job scheduling (Admin --> Settings --> Job scheduling) is also recommended.

=== AtoM server configuration ===

This server configuration step is necessary to allow Archivematica to log in to the AtoM server without passwords, and only when the user is deploying the rsync option described above in the AtoM DIP upload section.

To enable sending DIPs from Archivematica to the AtoM server:

Generate SSH keys for the Archivematica user. Leave the passphrase field blank.
<pre>
$ sudo -i -u archivematica
$ cd ~
$ ssh-keygen
</pre>

Copy the contents of <code>/var/lib/archivematica/.ssh/id_rsa.pub</code> somewhere handy, you will need it later.

Now, it's time to configure the AtoM server so Archivematica can send the DIPs using SSH/rsync. For that purpose, you will create a user called <code>archivematica</code> and we are going to assign that user a restricted shell with access only to rsync:

<pre>
$ sudo apt-get install rssh
$ sudo useradd -d /home/archivematica -m -s /usr/bin/rssh archivematica
$ sudo passswd -l archivematica
$ sudo vim /etc/rssh.conf // Make sure that allowrsync is uncommented!
</pre>

Add the SSH key that we generated before:

<pre>
$ sudo mkdir /home/archivematica/.ssh
$ chmod 700 /home/archivematica/.ssh/
$ sudo vim /home/archivematica/.ssh/authorized_keys // Paste here the contents of id_dsa.pub
$ chown -R archivematica:archivematica /home/archivematica
</pre>

In Archivematica, make sure that you update the <code>--rsync-target</code> accordingly. 
These are the parameters that we are passing to the upload-qubit microservice. 
Go to the Administration > Upload DIP page in the dashboard.

Generic parameters:

<pre>
--url="http://atom-hostname/index.php" \
--email="demo@example.com" \
--password="demo" \
--uuid="%SIPUUID%" \
--rsync-target="archivematica@atom-hostname:/tmp" \
--debug
</pre>

== CONTENTdm DIP upload ==

Archivematica can also upload DIPs to [http://www.contentdm.org/ CONTENTdm] instances. Multiple CONTENTdm destinations may be configured.

For each possible CONTENTdm DIP upload destination, you'll specify a brief description and configuration parameters appropriate for the destination. Paramters include <code>%ContentdmServer%</code> (full path to the CONTENTdm API, including the leading 'http://' or 'https://', for example http://example.com:81/dmwebservices/index.php), <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> (Linux user and group on the CONTENTdm server, not a CONTENTdm username). Note that only <code>%ContentdmServer%</code> is required is you are going to produce CONTENTdm Project Client packages; <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> are also required if you are going to use the "direct upload" option for uploading your DIPs into CONTENTdm.

When changing parameters for a CONTENTdm DIP upload destination simply change the values, preserving the existing format they're specified in. To add an upload destination fill in the form at the bottom of the page with the appropriate values. When you've completed your changes click the "Save" button.

== PREMIS agent ==

The PREMIS agent name and code can be set via the administration interface.
[[File:Premisagent-10.png|center|900px|thumbs]]

== Rest API ==

In addition to automation using the processingMCP.xml file, Archivematica includes a REST API for automating transfer approval. Using this API, you can create a custom script that copies a transfer to the appropriate directory then uses the <code>curl</code> command, or some other means, to let Archivematica know that the copy is complete.

=== API keys ===

Use of the REST API requires the use of API keys. An API key is associated with a specific user. To generate an API key for a user:

# Browse to <code>/administration/accounts/list/</code>
# Click the "Edit" button for the user you'd like to generate an API key for
# Click the "Regenerate API key" checkbox
# Click "Save"

After generating an API key, you can click the "Edit" button for the user and you should see the API key.

=== IP whitelist ===

In addition to creating API keys, you'll need to add the IP of any computer making REST requests to the REST API whitelist. The IP whitelist can be edited in the administration interface at <code>/administration/api/</code>.

=== Approving a transfer ===

The REST API can be used to approve a transfer. The transfer must first be copied into the appropriate watch directory. To determine the location of the appropriate watch directory, first figure out where the shared directory is from the <code>watchDirectoryPath</code> value of <code>/etc/archivematica/MCPServer/serverConfig.conf</code>. Within that directory is a subdirectory <code>activeTransfers</code>. In this subdirectory are watch directories for the various transfer types.

When using the REST API to approve a transfer, if a transfer type isn't specified, the transfer will be deemed a standard transfer.

'''HTTP Method:''' POST

'''URL:''' <code>/api/transfer/approve</code>

'''Parameters:'''

<code>directory</code>: directory name of the transfer

<code>type</code> (optional): transfer type [standard|dspace|unzipped bag|zipped bag]

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl --data "username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6&type=standard&directory=MyTransfer" http://127.0.0.1/api/transfer/approve

Example result:

{"message": "Approval successful."}

=== Listing unapproved transfers ===

The REST API can be used to get a list of unapproved transfers. Each transfer's directory name and type is returned.

'''Method:''' <code>GET</code>

'''URL:''' <code>/api/transfer/unapproved</code>

'''Parameters:'''

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl "http://127.0.0.1/api/transfer/unapproved?username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6"

Example result:

{
"message": "Fetched unapproved transfers successfully.",
"results": [{
"directory": "MyTransfer",
"type": "standard"
}
]
}
== Users ==

The dashboard provides a simple cookie-based user authentication system using the [https://docs.djangoproject.com/en/1.4/topics/auth/ Django authentication framework]. Access to the dashboard is limited only to logged-in users and a login page will be shown when the user is not recognized. If the application can't find any user in the database, the user creation page will be shown instead, allowing the creation of an administrator account.

Users can be also created, modified and deleted from the Administration tab. Only users who are administrators can create and edit user accounts.

You can add a new user to the system by clicking the "Add new" button on the user administration page. By adding a user you provide a way to access Archivematica using a username/password combination. Should you need to change a user's username or password, you can do so by clicking the "Edit" button, corresponding to the user, on the administration page. Should you need to revoke a user's access, you can click the corresponding "Delete" button.

=== CLI creation of administrative users ===

If you need an additional administrator user one can be created via the command-line, issue the following commands:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py createsuperuser

=== CLI password resetting ===

If you've forgotten the password for your administrator user, or any other user, you can change it via the command-line:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py changepassword <username>

===Security===

Archivematica uses [http://en.wikipedia.org/wiki/PBKDF2 PBKDF2] as the default algorithm to store passwords. This should be sufficient for most users: it's quite secure, requiring massive amounts of computing time to break. However, other algorithms could be used as the following document explains: [https://docs.djangoproject.com/en/1.4/topics/auth/#how-django-stores-passwords How Django stores passwords].

Our plan is to extend this functionality in the future adding groups and granular permissions support.

= Dashboard preservation planning tab =

== Format Policy Registry (FPR) ==

=== Introduction to the Format Policy Registry ===

The Format Policy Registry (FPR) is a database which allows Archivematica users to define format policies for handling file formats. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format). Format policies will change as community standards, practices and tools evolve. Format policies are maintained by Artefactual, who provides a freely-available FPR server hosted at [http://fpr.archivematica.org fpr.archivematica.org]. This server stores structured information about normalization format policies for preservation and access. You can update your local FPR from the FPR server using the UPDATE button in the preservation planning tab of the dashboard. In addition, you can maintain local rules to add new formats or customize the behaviour of Archivematica. The Archivematica dashboard communicates with the FPR server via a REST API.

==== First-time configuration ====

The first time a new Archivematica installation is set up, it will attempt to connect to the FPR server as part of the initial configuration process. As a part of the setup, it will register the Archivematica install with the server and pull down the current set of format policies. In order to register the server, Archivematica will send the following information to the FPR Server, over an encrypted connection:

#Agent Identifier (supplied by the user during registration while installing Archivematica)
#Agent Name (supplied by the user during registration while installing Archivematica)
#IP address of host
#UUID of Archivematica instance
#current time

*The only information that will be passed back and forth between Archivematica and the FPR Server would be these format policies - what tool to run when normalizing for a given purpose (access, preservation) when a specific File Identification Tool identifies a specific File Format. No information about the content that has been run through Archivematica, or any details about the Archivematica installation or configuration would be sent to the FPR Server.

* Because Archivematica is an open source project, it is possible for any organization to conduct a software audit/code review before running Archivematica in a production environment in order to independently verify the information being shared with the FPR Server. An organization could choose to run a private FPR Server, accessible only within their own network(s), to provide at least a limited version of the benefits of sharing format policies, while guaranteeing a completely self-contained preservation system. This is something that Artefactual is not intending to develop, but anyone is free to extend the software as they see fit, or to hire us or other developers to do so.

=== Updating format policies ===

FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica. Clicking the "update" button will initiate an FPR pull which will bring in any new or altered rules since the last time an update was performed.

=== Types of FPR entries ===

==== Format ====

In the FPR, a "format" is a record representing one or more related ''format versions'', which are records representing a specific file format. For example, the format record for "Graphics Interchange Format" (GIF) is comprised of format versions for both GIF 1987a and 1989a.

When creating a new format version, the following fields are available:

* Description (required) - Text describing the format. This will be saved in METS files.
* Version (required) - The version number for this specific format version (not the FPR record). For example, for Adobe Illustrator 14 .ai files, you might choose "14".
* Pronom id - The specific format version's unique identifier in [http://www.nationalarchives.gov.uk/PRONOM/Default.aspx PRONOM], the UK National Archives's format registry. This is optional, but highly recommended.
* Access format and Preservation format - Indicates whether this format is suitable as an access format for end users, and for preservation.

==== Format Group ====

A format group is a convenient grouping of related file formats which share common properties. For instance, the FPR includes an "Image (raster)" group which contains format records for GIF, JPEG, and PNG. Each format can belong to one (and only one) format group.

==== Characterization ====
Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties and to extract technical metadata contained within the object.

Prior to Archivematica 1.2, the characterization micro-service always ran the [http://projects.iq.harvard.edu/fits FITS] tool. As of Archivematica 1.2, characterization is fully customizable by the Archivematica administrator.

===== Characterization tools =====

Archivematica has four default characterization tools upon installation. Which tool will run on a given file depends on the type of file, as determined by the selected identification tool.

====== Default ======

The default characterization tool is FITS; it will be used if no specific characterization rule exists for the file being scanned.

It is possible to create new default characterization commands, which can either replace FITS or run alongside it on every file.

====== Multimedia ======

Archivematica 1.2 introduced three new multimedia characterization tools. These tools were selected for their rich metadata extraction, as well as for their speed. Depending on the type of the file being scanned, one or more of these tools may be called instead of FITS.

* [http://ffmpeg.org/ FFprobe], a characterization tool built on top of the same core as FFmpeg, the normalization software used by Archivematica
* [http://mediaarea.net/en/MediaInfo MediaInfo], a characterization tool oriented towards audio and video data
* [http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool], a characterization tool oriented towards still image data and extraction of embedded metadata

===== Writing a new characterization command =====

Information on writing new characterization commands can be found in the [[Administrator_manual_1.2#Format_Policy_Rules|FPR administrator's manual]].

Writing a characterization command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]]. Like an identification command, a characterization command is designed to run a tool and produce output to standard out. Output from characterization commands is expected to be valid XML, and will be included in the AIP's METS document within the file's <objectCharacteristicsExtension> element.

When creating a characterization command, the "output format" should be set to "XML 1.0".

==== Extraction ====

Archivematica supports extracting contents from files during the transfer phase.

Many transfers contain files which are packages of other files; examples of these include compressed archives, such as ZIP files, or disk images. Archivematica provides a transcription microservice which comes with several predefined rules to extract packages, and which is fully customizeable by Archivematica administrators. Administrators can write new commands, and assign existing formats to run for other file formats.

===== Writing a new extraction command =====

Writing an extraction command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

An extraction command is passed two arguments: the ''file to extract'', and the ''path to which the package should be extracted''. Similar to [[Administrator_manual_1.2#Normalization Command|normalization commands]], these arguments will be interpolated directly into "bashScript" and "command" scripts, and passed as positional arguments to "pythonScript" and "asIs" scripts.

{|
|Name (bashScript and command)||Commandline position (pythonScript and asIs)||Description||Sample value
|-
|%outputDirectory%||First||The full path to the directory in which the package's contents should be extracted||/path/to/filename-uuid/
|-
|%inputFile%||Second||The full path to the package file||/path/to/filename
|}

Here's a simple example of how to call an existing tool (7-zip) without any extra logic:

<pre>7z x -bd -o"%outputDirectory%" "%inputFile%"</pre>

This Python script example is more complex, and attempts to determine whether any files were extracted in order to determine whether to exit 0 or 1 (and report success or failure):

<pre>
from __future__ import print_function
import re
import subprocess
import sys

def extract(package, outdir):
# -a extracts only allocated files; we're not capturing unallocated files
try:
process = subprocess.Popen(['tsk_recover', package, '-a', outdir],
stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
stdout, stderr = process.communicate()

match = re.match(r'Files Recovered: (\d+)', stdout.splitlines()[0])
if match:
if match.groups()[0] == '0':
raise Exception('tsk_recover failed to extract any files with the message: {}'.format(stdout))
else:
print(stdout)
except Exception as e:
return e

return 0

def main(package, outdir):
return extract(package, outdir)

if __name__ == '__main__':
package = sys.argv[1]
outdir = sys.argv[2]
sys.exit(main(package, outdir))
</pre>

==== Identification ====

===== Identification Tools =====

The identification tool properties in Archivematica control the ways in which Archivematica identifies files and associates them with the FPR's version records. The current version of the FPR server contains two tools: a script based on the [http://www.openplanetsfoundation.org/ Open Planets Foundation's] [https://github.com/openplanets/fido/ FIDO] tool, which identifies based on the IDs in PRONOM, and a simple script which identifies files by their file extension. You can use the identification tools portion of FPR to customize the behaviour of the existing tools, or to write your own.

===== Identification Commands =====

Identification commands contain the actual code that a tool will run when identifying a file. This command will be run on every file in a transfer.

When adding a new command, the following fields are available:

* Identifier (mandatory) - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Script type (mandatory) - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.

When coding a command, you should expect your script to take the path to the file to be identifed as the first commandline argument. When returning an identification, the tool should print a single line containing ''only'' the identifier, and should exit 0. Any informative, diagnostic, and error message can be printed to stderr, where it will be visible to Archivematica users monitoring tool results. On failure, the tool should exit non-zero.

===== Identification Rules =====

These identification rules allow you to define the relationship between the output created by an identification tool, and one of the formats which exists in the FPR. This must be done for the format to be tracked internally by Archivematica, and for it to be used by normalization later on. For instance, if you created a FIDO configuration which returns MIME types, you could create a rule which associates the output "image/jpeg" with the "Generic JPEG" format in the FPR.

Identification rules are necessary only when a tool is configured to return file extensions or MIME types. Because PUIDs are universal, Archivematica will always look these up for you without requiring any rules to be created, regardless of what tool is being used.

When creating an identification rule, the following mandatory fields must be filled out:

* Format - Allows you to select one of the formats which already exists in the FPR.
* Command - Indicates the command that produces this specific identification.
* Output - The text which is written to standard output by the specified command, such as "image/jpeg"

==== Format Policy Tools ====

Format policy tools control how Archivematica processes files during ingest. The most common kind of these tools are normalization tools, which produce preservation and access copies from ingested files. Archivematica comes configured with a number of commands and scripts to normalize several file formats, and you can use this section of the FPR to customize them or to create your own. These are organized similarly to the [[#Identification Tools]] documented above.

Archivematica uses the following kinds of format policy rules:

* Characterization
* Extraction
* Normalization - Access, preservation and thumbnails
* Event detail - Extracts information about a given tool in order to be inserted into a generated METS file.
* Transcription
* Verification - Validates a file produced by another command. For instance, a tool could use Exiftool or JHOVE to determine whether a thumbnail produced by a normalization command was valid and well-formed.

=== Format Policy Commands ===

Like the [[#Identification Commands]] above, format policy commands are scripts or command line statements which control how a normalization tool runs. This command will be run once on every file being normalized using this tool in a transfer.

When creating a normalization command, the following mandatory fields must be filled out:

* Tool - One or more tools to be associated with this command.
* Description - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Command - The script's source, or the commandline statement to execute.
* Script type - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.
* Output format (optional) - The format the command outputs. For example, a command to normalize audio to MP3 using ffmpeg would select the appropriate MP3 format from the dropdown.
* Output location (optional) - The path the normalized file will be written to. See the [[#Writing a command]] section of the documentation for more information.
* Command usage - The purpose of the command; this will be used by Archivematica to decide whether a command is appropriate to run in different circumstances. Values are "Normalization", "Event detail", and "Verification". See the [[#Writing a command]] section of the documentation for more information.
* Event detail command - A command to provide information about the software running this command. This will be written to the METS file as the "event detail" property. For example, the normalization commands which use ffmpeg use an event detail command to extract ffmpeg's version number.

=== Format Policy Rules ===

Format policy rules allow commands to be associated with specific file types. For instance, this allows you to configure the command that uses ImageMagick to create thumbnails to be run on .gif and .jpeg files, while selecting a different command to be run on .png files.

When creating a format policy rule, the following mandatory fields must be filled out:

* Purpose - Allows Archivematica to distinguish rules that should be used to normalize for preservation, normalize for access, to extract information, etc.
* Format - The file format the associated command should be selected for.
* Command - The specific command to call when this rule is used.

=== Writing a command ===

==== Identification command ====

Identification commands are very simple to write, though they require some familiarity with Unix scripting.

An identification command run once for every file in a transfer. It will be passed a single argument (the path to the file to identify), and no switches.

On success, a command should:

* Print the identifier to stdout
* Exit 0

On failure, a command should:

* Print nothing to stdout
* Exit non-zero (Archivematica does not assign special significance to non-zero exit codes)

A command can print anything to stderr on success or error, but this is purely informational - Archivematica won't do anything special with it. Anything printed to stderr by the command will be shown to the user in the Archivematica dashboard's detailed tool output page. You should print any useful error output to stderr if identification fails, but you can also print any useful extra information to stderr if identification succeeds.

Here's a very simple Python script that identifies files by their file extension:

<pre>import os.path, sys
(_, extension) = os.path.splitext(sys.argv[1])
if len(extension) == 0:
exit(1)
else:
print extension.lower()</pre>

Here's a more complex Python example, which uses [http://www.sno.phy.queensu.ca/~phil/exiftool/ Exiftool]'s XML output to return the MIME type of a file:

<pre>#!/usr/bin/env python

from lxml import etree
import subprocess
import sys

try:
xml = subprocess.check_output(['exiftool', '-X', sys.argv[1]])
doc = etree.fromstring(xml)
print doc.find('.//{http://ns.exiftool.ca/File/1.0/}MIMEType').text
except Exception as e:
print >> sys.stderr, e
exit(1)</pre>

Once you've written an identification command, you can register it in the FPR using the following steps:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for the command itself:

# Click "Create New Command".
# Select your tool from the "Tool" dropdown box.
# Fill out the Identifier with text to describe to a user what this tool does. For instance, we might choose "Identify MIME-type using Exiftool".
# Select the appropriate script type - in this case, "Python Script".
# Enter the source code for your script in the "Command" box.
# Click "Create Command".

Finally, you must create rules which associate the possible outputs of your tool with the FPR's format records. This needs to be done once for every supported format; we'll show it with MP3, as an example.

# Navigate to the "Identification Rules" page, and click "Create New Rule".
# Choose the appropriate foramt from the Format dropdown - in our case, "Audio: MPEG Audio: MPEG 1/2 Audio Layer 3".
# Choose your command from the Command dropdown.
# Enter the text your command will output when it identifies this format. For example, when our Exiftool command identifies an MP3 file, it will output "audio/mpeg".
# Click "Create".

Once this is complete, any new transfers you create will be able to use your new tool in the identification step.

==== Normalization Command ====

Normalization commands are a bit more complex to write because they take a few extra parameters.

The goal of a normalization command is to take an input file and transform it into a new format. For instance, Archivematica provides commands to transform video content into FFV1 for preservation, and into H.264 for access.

Archivematica provides several parameters specifying input and output filenames and other useful information. Several of the most common are shown in the examples below; a more complete list is in a later section of the documentation: [[#Normalization command variables and arguments]]

When writing a bash script or a command line, you can reference the variables directly in your code, like this:

<pre>inkscape -z "%fileFullName%" --export-pdf="%outputDirectory%%prefix%%fileName%%postfix%.pdf"</pre>

When writing a script in Python or other languages, the values will be passed to your script as commandline options, which you will need to parse. The following script provides an example using the argparse module that comes with Python:

<pre>import argparse
import subprocess

parser = argparse.ArgumentParser()

parser.add_argument('--file-full-name', dest='filename')
parser.add_argument('--output-file-name', dest='output')
parsed, _ = parser.parse_known_args()
args = [
'ffmpeg', '-vsync', 'passthrough',
'-i', parsed.filename,
'-map', '0:v', '-map', '0:a',
'-vcodec', 'ffv1', '-g', '1',
'-acodec', 'pcm_s16le',
parsed.output+'.mkv'
]

subprocess.call(args)</pre>

Once you've created a command, the process of registering it is similar to creating a new identification tool. The folling examples will use the Python normalization script above.

First, create a new tool record:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for your new command:

# Click "Create New Tool Command".
# Fill out the Description with text to describe to a user what this tool does. For instance, we might choose "Normalize to mkv using ffmpeg".
# Enter the source for your command in the Command textbox.
# Select the appropriate script type - in this case, "Python Script".
# Select the appropriate output format from the dropdown. This indicates to Archivematica what kind of file this command will produce. In this case, choose "Video: Matroska: Generic MKV".
# Enter the location the video will be saved to, using the script variables. You can usually use the "%outputFileName%" variable, and add the file extension - in this case "%outputFileName%.mkv"
# Select a verification command. Archivematica will try to use this tool to ensure that the file your command created works. Archivematica ships with two simple tools, which test whether the file exists and whether it's larger than 0 bytes, but you can create new commands that perform more complicated verifications.
# Finally, choose a command to produce the "Event detail" text that will be written in the section of the METS file covering the normalization event. Archivematica already includes a suitable command for ffmpeg, but you can also create a custom command.
# Click "Create command".

Finally, you must create rules which will associate your command with the formats it should run on.

==== Normalization command variables and arguments ====

The following variables and arguments control the behaviour of format policy command scripts.

{|
|Name (bashScript and command)||Commandline option (pythonScript and asIs)||Description||Sample value
|-
|%fileName%||--input-file=||The filename of the file to process. This variable holds the file's basename, not the whole path.||video.mov
|-
|%fileDirectory%||--file-directory=||The directory containing the input file.||/path/to
|-
|%inputFile%||--file-name=||The fully-qualified path to the file to process.||/path/to/video.mov
|-
|%fileExtension%||--file-extension=||The file extension of the input file.||mov
|-
|%fileExtensionWithDot%||--file-extension-with-dot=||As above, without stripping the period.||.mov
|-
|%outputDirectory%||--output-directory=||The directory to which the output file should be saved.||/path/to/access/copies
|-
|%outputFileUUID%||--output-file-uuid=||The unique identifier assigned by Archivematica to the output file.||1abedf3e-3a4b-46d7-97da-bd9ae13859f5
|-
|%outputDirectory%||--output-directory=||The fully-qualified path to the directory where the new file should be written.||/var/archivematica/sharedDirectory/www/AIPsStore/uuid
|-
|%outputFileName%||--output-file-name=||The fully-qualified path to the output file, minus the file extension.||/path/to/access/copies/video-uuid
|}

= Customization and automation =
* Workflow processing decisions can be made in the processingMCP.xml file. [https://www.archivematica.org/wiki/Administrator_manual_0.10#Processing_configuration See here.]
* Workflows are currently created at the development level.
*: Some resources avialable
*:* [[MCP_Basic_Configuration]]
*:* [[MCP]]
*:* [[Creating_Custom_Workflows]]
*:* [[Development]]
* Normalization commands can be viewed in the preservation planning tab.
* Normalization paths and commands are currently editable under the preservation planning tab in the dashboard.

= Elasticsearch =

Archivematica has the capability of indexing data about files contained in AIPs and this data can be [[Elasticsearch Development|accessed programatically]] for various applications.

If, for whatever reason, you need to delete an ElasticSearch index please see [[ElasticSearch Administration]].

If, for whatever reason, you need to delete an Elasticsearch index programmatically, this can be done with pyes using the following code.

<pre>
import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
conn = ES('127.0.0.1:9200')

try:
conn.delete_index('aips')
except:
print "Error deleting index or index already deleted."
</pre>

=== Rebuilding the AIP index ===

To rebuild the ElasticSearch AIP index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-aip-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-aip-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-aip-index-from-files <location of your AIP store>

=== Rebuilding the transfer index ===

Similarly, to rebuild the ElasticSearch transfer data index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-transfer-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-transfer-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-transfer-index-from-files <location of your AIP store>

= Data backup =

In Archivematica there are three types of data you'll likely want to back up:
* Filesystem (particularly your storage directories)
* MySQL
* ElasticSearch

MySQL is used to store short-term processing data. You can back up the MySQL database by using the following command:

<pre>mysqldump -u <your username> -p<your password> -c MCP > <filename of backup></pre>

ElasticSearch is used to store long-term data. Instructions and scripts for backing up and restoring ElasticSearch are available [http://tech.superhappykittymeow.com/?p=296 here].

= Security =

Once you've set up Archivematica it's a good practice, for the sake of security, to change the default passwords.

== MySQL ==

You should create a new MySQL user or change the password of the default "archivematica" MySQL user. The change the password of the default user, enter the following into the command-line:

$ mysql -u root -p<your MyQL root password> -D mysql \
-e "SET PASSWORD FOR 'archivematica'@'localhost' = PASSWORD('<new password>'); \
FLUSH PRIVILEGES;"

Once you've done this you can change Archivematica's MySQL database access credentials by editing these two files:
* <code>/etc/archivematica/archivematicaCommon/dbsettings</code> (change the <code>user</code> and <code>password</code> settings)
* <code>/usr/share/archivematica/dashboard/settings/common.py</code> (change the <code>USER</code> and <code>PASSWORD</code> settings in the <code>DATABASES</code> section)

Archivematica does not presently support secured MySQL communication so MySQL should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1645 1645].

== AtoM ==

In addition to changing the MySQL credentials, if you've also installed AtoM you'll want to set the password for it as well. Note that after changing your AtoM credentials you should update the credentials on the AtoM DIP upload administration page as well.

== Gearman ==

Archivematica relies on the German server for queuing work that needs to be done. Gearman currently doesn't support secured connections so Gearman should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1345 1345].

= Questions =

If you run into any difficulties while administrating Archivematica, please check out our FAQ and, if that doesn't help you, contain us using the Archivematica discussion group.

== Frequently asked questions ==
* [[AM_FAQ|Solutions to common questions]]

== Discussion group ==
* [http://groups.google.com/group/archivematica?hl=en Discussion group] for questions not covered by the FAQ

Administrator manual 1.2

2014-08-07T21:47:35Z

Mdemeo: Document extraction commands

[[Main Page]] > [[Documentation]] > Administrator manual 1.2

This manual covers administrator-specific instructions for Archivematica. It will also provide help for using forms in the Administration tab of the Archivematica dashboard and the administrator capabilities in the Format Policy Registry (FPR), which you will find in the Preservation planning tab of the dashboard.

For end-user instructions, please see the [[User_manual_1.2|user manual]].

= Installation =
* [[Installation|Instructions for installing the latest build of Archivematica on your server]]

= Upgrading =

Currently, Archivematica does not support upgrading from one version to the next. A re-install is required. After re-installing, you can restore Archivematica's knowledge of your AIPs, by [[#Rebuilding_the_AIP_index|rebuilding the AIP index]] and, if you have transfers stored in the backlog, [[#Rebuilding_the_transfer_index|rebuilding the transfer index]].

= Storage service =
The Archivematica Storage Service allows the configuration of storage spaces associated with multiple Archivematica pipelines. It allows a storage administrator to configure what storage is available to each Archivematica installation, both locally and remote.

[[File:SS1-0.png|700px|center|thumb|Home page of Storage Service]]

TODO Discuss how spaces and locations fit into each other, pipelines fit to locations, spaces=config, locations=purpose, packages in locations

== Archivematica Configuration ==

When installing Archivematica, options to configure it with the Storage Service will be presented.

[[File:Install3.png|600px|center]]

If you have installed the Storage Service at a different URL, you may change that here.

The top button 'Use default transfer source & AIP storage locations' will attempt to automatically configure default Locations for Archivematica, register a new Pipeline, and generate an error if the Storage Service is not available. Use this option if you want the Storage Service to automatically set up the configured default values.

The bottom button 'Register this pipeline & set up transfer source and AIP storage locations' will only attempt to register a new Pipeline with the Storage Service, and will not error if not Storage Service can be found. It will also open a link to the provided Storage Service URL, so that Locations can be configured manually. Use this option if the default values not desired, or the Storage Service is not running yet. Locations will have to be configured manually before any Transfers can be processed, or AIPs stored.

If the Storage Service is running, the URL to it should be entered, and Archivematica will attempt to register its dashboard UUID as a new Pipeline. Otherwise, the dashboard UUID is displayed, and a Pipeline for this Archivematica instance can be manually created and configured. The dashboard UUID is also available in Archivematica under Administration -> General.

=== Change the port in the web server configuration ===

The storage services uses nginx by default, so you can edit /etc/nginx/sites-enabled/storage and change the line that says

listen 8000;

change 8000 to whatever port you prefer to use.

Keep in mind that in a default installation of Archivematica 1.0, the dashboard is running in Apache on port 80. So it is not possible to make nginx run on port 80 on the same machine. If you install the storage service on its own server, you can set it to use port 80.

Make sure to adjust the dashboard UUID in the Archivematica dashboard under Administration -> General.

== Spaces ==
[[File:Spaces.png|600px|center]]
A storage Space contains all the information necessary to connect to the physical storage. It is where protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored. All locations must be contained in a space.

A space is usually the immediate parent of the Location folders. For example, if you had transfer source locations at <tt>/home/artefactual/archivematica-sampledata-2013-10-10-09-17-20</tt> and <tt>/home/artefactual/maildir_transfers</tt>, the Space's path would be <tt>/home/artefactual/</tt>

Currently supported protocols are local filesystem, NFS, and pipeline local filesystem.

=== Local Filesystem ===

Local Filesystem spaces handle storage that is available locally on the machine running the storage service. Typically this is the hard drive, SSD or raid array attached to the machine, but it could also encompass remote storage that has already been mounted. For remote storage that has been locally mounted, we recommend using a more specific Space if one is available.

==== Fields ====
* ''Path'': Absolute path to the Space on the local filesystem
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.

=== NFS ===

NFS spaces are for NFS exports mounted on the Storage Service server, and the Archivematica pipeline.

==== Fields ====
* ''Path'': Absolute path the space is mounted at on the filesystem local to the storage service
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the remote computer exporting the NFS mount.
* ''Remote path'': Export path on the NFS server
* ''Version'': nfs or nfs4 - as would be passed to the <tt>mount</tt> command.
* ''Manually Mounted'': Check this if it has been mounted already. Otherwise, the Storage Service will try to mount it. ''Note: this feature is not yet available.''

=== Pipeline Local Filesystem ===

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the storage service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host.

For example, the storage service is hosted on <tt>storage_service_host</tt> and Archivematica is running on <tt>archivematica1</tt> . The transfer sources for Archivematica are stored locally on <tt>archivematica1</tt>, but the storage service needs access to them. The Space for that transfer source would be a Pipeline Local Filesystem.

'''Note: Passwordless SSH must be set up between the Storage Service host and the computer Archivematica is running on.'''

==== Fields ====
* ''Path'': Absolute path to the space on the remote machine.
* ''Size'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Remote name'': Hostname or IP address of the computer running Archivematica. Should be SSH accessible from the Storage Service computer.
* ''Remote user'': Username on the remote host

== Locations ==
[[File:Locations.png|600px|center]]
A storage Location is contained in a Space, and knows its purpose in the Archivematica system. A Location is also where Packages are stored. Each Location is associated with a pipeline and can only be accessed by that pipeline.

Currently, a Location can have one of three purposes: Transfer Source, Currently Processing, or AIP Storage. Transfer source locations display in Archivematica's Transfer tab, and any folder in a transfer source can be selected to become a Transfer. AIP storage locations are where the completed AIPs are put for long-term storage. During processing, Archivematica uses the currently processing location associated with that pipeline. Only one currently processing location should be associated with a given pipeline. If you want the same directory on disk to have multiple purposes, multiple Locations with different purposes can be created.

==== Fields ====
* ''Purpose'': What use the Location is for
* ''Pipeline'': Which pipelines this location is available to.
* ''Relative Path'': Path to this Location, relative to the space that contains it.
* ''Description'': Description of the Location to be displayed to the user.
* ''Quota'': (Optional) Maximum size allowed for this space. Set to 0 or leave blank for unlimited.
* ''Enabled'': If checked, this location is accessible to pipelines associated with it. If unchecked, it will not show up to any pipeline.

== Pipeline ==
[[File:Pipelines.png|600px|center]]
A pipeline is an Archivematica instance registered with the Storage Service, including the server and all associated clients. Each pipeline is uniquely identified by a UUID, which can be found in the dashboard under Administration -> General Configuration. When installing Archivematica, it will attempt to register its UUID with the Storage Service, with a description of "Archivematica on <hostname>".

==== Fields ====
* ''UUID'': Unique identifier of the Archivematica pipeline
* ''Description'': Description of the pipeline displayed to the user. e.g. Sankofa demo site
* ''Enabled'': If checked, this pipeline can access locations associate with it. If unchecked, all locations will be disabled, even if associated.
* ''Default Locations'': If checked, the default locations configured in Administration -> Configuration will be created or associated with the new pipeline.

== Packages ==
[[File:Packages.png|600px|center]]
A Package is a file that Archivematica has stored in the Storage Service, commonly an Archival Information Package (AIP). They cannot be created or deleted through the Storage Service interface, though a deletion request can be submitted through Archivematica that must be approved or rejected by the storage service administrator. To learn more about deleting an AIP, see [[UM_archival_storage_1.2#Deleting_an_AIP|Deleting an AIP]].

== Administration ==
[[File:StorageserviceAdmin1.png|600px|center]]
[[File:StorageserviceAdmin2.png|600px|center]]
The Administration section manages the users and settings for the Storage Service.

=== Users ===

Only registered users can long into the storage service, and the Users page is where users can be created or modified.

TODO what info means, what admin/active mean, who can edit what

=== Settings ===

Settings control the behavior of the Storage Service. Default Locations are the created or associated with pipelines when they are created.

'''Pipelines are disabled upon creation?''' sets whether a newly created Pipeline can access its Locations. If a Pipeline is disabled, it cannot access any of its locations. By disabling newly created Pipelines, it provides some security against unwanted perusal of the files in Locations, or use by unauthorized Archivematica instances. This can be configured individually when creating a Pipeline manually through the Storage Service website.

'''Default Locations''' set what existing locations should be associated with a newly created Pipeline, or what new Locations should be created for each new Pipeline. No matter what is configured here, a Currently Processing location is created for all Pipelines, since one is required. Multiple Transfer Source or AIP Storage Locations can be configured by holding down Ctrl when selecting them. New Locations in an existing Space can be created for Pipelines that use default locations by entering the relevant information.

== How to Configure a Location ==

For Spaces of the type "Local Filesystem," Locations are basically directories (or more accurately, paths to directories). You can create Locations for Transfer Source, Currently Processing, or AIP Storage directories.

To create and configure a new Location:

# In the Storage Service, click on the "Spaces" tab.
# Under the Space that you want to add the Location to, click on the "Create Location here" link.
# Choose a purpose (e.g. AIP Storage) and pipeline, and enter a "Relative Path" (e.g. var/mylocation) and human-readable description. The Relative Path is relative to the Path defined in the Space you are adding the Location to, e.g. for the default Space, the Path is '/' so your Location path would be relative to that (in the example here, the complete path would end up being '/var/mylocation'). Note: if the path you are defining in your Location doesn't exist, you must create it manually and make sure it is writable by the archivematica user.
# Save the Location settings.
# The new location will now be available as an option under the appropriate options in the Dashboard, for example as a Transfer location (which must be enabled under the Dashboard "Administration" tab) or as a destination for AIP storage.

== Store DIP ==

= Dashboard administration tab =

The Archivematica administration pages, under the Administration tab of the dashboard, allows you to configure application components and manage users.

== Processing configuration ==

When processing a SIP or transfer, you may want to automate some of the workflow choices. Choices can be preconfigured by putting a 'processingMCP.xml' file into the root directory of a SIP/transfer.

If a SIP or transfer is submitted with a 'processingMCP.xml' file, processing decisions will be made with the included file.

The XML file format is:
<pre><processingMCP>
<preconfiguredChoices>

<preconfiguredChoice>
<appliesTo>755b4177-c587-41a7-8c52-015277568302</appliesTo>
<goToChain>d4404ab1-dc7f-4e9e-b1f8-aa861e766b8e</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>eeb23509-57e2-4529-8857-9d62525db048</appliesTo>
<goToChain>5727faac-88af-40e8-8c10-268644b0142d</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>19adb668-b19a-4fcb-8938-f49d7485eaf3</appliesTo>
<goToChain>333643b7-122a-4019-8bef-996443f3ecc5</goToChain>
<delay unitCtime="yes">2419200.0</delay>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>dec97e3c-5598-4b99-b26e-f87a435a6b7f</appliesTo>
<goToChain>01d80b27-4ad1-4bd1-8f8d-f819f18bf685</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>f19926dd-8fb5-4c79-8ade-c83f61f55b40</appliesTo>
<goToChain>85b1e45d-8f98-4cae-8336-72f40e12cbef</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>7a024896-c4f7-4808-a240-44c87c762bc5</appliesTo>
<goToChain>3c1faec7-7e1e-4cdd-b3bd-e2f05f4baa9b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01d64f58-8295-4b7b-9cab-8f1b153a504f</appliesTo>
<goToChain>9475447c-9889-430c-9477-6287a9574c5b</goToChain>
</preconfiguredChoice>

<preconfiguredChoice>
<appliesTo>01c651cb-c174-4ba4-b985-1d87a44d6754</appliesTo>
<goToChain>414da421-b83f-4648-895f-a34840e3c3f5</goToChain>
</preconfiguredChoice>
</preconfiguredChoices>
</processingMCP>
</pre>

Where appliesTo is the UUID associated with the micro-service job presented in the dashboard, and goToChain is the UUID of the desired selection. The default processingMCP.xml file is located at '/var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml'.

The processing configuration administration page of the dashboard provides you with an easy form to configure the default 'processingMCP.xml' that's added to a SIP or transfer if it doesn't already contain one. When you change the options using the web interface the necessary XML will be written behind the scenes.

[[File:ProcessingConfig1-1.png|1000px|center|thumb|Processing configuration form in Administration tab of the dashboard]]

*For the approval (yes/no) steps, the user ticks the box on the left-hand side to make a choice. If the box is not ticked, the approval step will appear in the dashboard.
*For the other steps, if no actions are selected the choices appear in the dashboard
*You can select whether or not to send transfers to quarantine (yes/no) and decide how long you'd like them to stay there.
*You can select whether to extract packages as well as whether to keep and/or delete the extracted objects and/or the package itself.
*You can approve normalization, sending the AIP to storage, and uploading the DIP without interrupting the workflow in the dashboard.
*You can pre-select which format identification tool and command to run in both/either transfer and/or ingest to base your normalization upon.
*You can choose to send a transfer to backlog or to create a SIP every time.
*You can select to be reminded to add PREMIS event metadata about manual normalization should you choose to use that capability.
*You can select between 7z using lzma and 7zip using bzip or parallel bzip2 algorithms for AIP compression.
*For select compression level, the options are as follows:
**1 - fastest mode
**3 - fast compression mode
**5 - normal compression mode
**7 - maximum compression
**9 - ultra compression
*You can select one archival storage location where you will consistently send your AIPs.

== General ==

In the general configuration section, you can select interface options and set [[Administrator_manual_1.2#Storage_service|Storage Service]] options for your Archivematica client.

[[File:Generalconfig.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

=== Interface options ===

Here, you can hide parts of the interface that you don't need to use. In particular, you can hide CONTENTdm DIP upload link, AtoM DIP upload link and DSpace transfer type.

=== Storage Service options ===

This is where you'll find the complete URL for the Storage Service. See [[Administrator_manual_1.2#Storage_service|Storage Service]] for more information about this feature.

== Failures ==

Archivematica 1.2 includes dashboard failure reporting.
[[File:FailuresAdmin.png|1000px|center|thumb|General configuration options in Administration tab of the dashboard]]

== Transfer source location ==

Archivematica allows you to start transfers using the operating system's file browser or via a web interface. Source files for transfers, however, can't be uploaded using the web interface: they must exist on volumes accessible to the Archivematica MCP server and configured via the [[Administrator_manual_1.2#Storage_service|Storage Service]].

When starting a transfer you're required to select one or more directories of files to add to the transfer.

You can view your transfer source directories in the Administrative tab of the dashboard under "Transfer source locations".

== AIP storage locations ==

AIP storage directories are directories in which completed AIPs are stored. Storage directories can be specified in a manner similar to transfer source directories using the [[Administrator_manual_1.2#Storage_service|Storage Service]].

You can view your transfer source directories in the Administrative tab of the dashboard under "AIP storage locations"

== AtoM DIP upload ==

Archivematica can upload DIPs directly to an [https://www.ica-atom.org/ AtoM] website so the contents can be accessed online. The AtoM DIP upload configuration page is where you specify the details of the AtoM installation you'd like the DIPs uploaded to (and, if using Rsync to transfer the DIP files, Rsync transfer details).

The parameters that you'll most likely want to set are <code>url</code>, <code>email</code>, and <code>password</code>. These parameters, respectively, specify the destination AtoM website's URL, the email address used to log in to the website, and the password used to log in to the website.

AtoM DIP upload can also use [http://en.wikipedia.org/wiki/Rsync Rsync] as a transfer mechanism. Rsync is an open source utility for efficiently transferring files. The <code>rsync-target</code> parameter is used to specify an Rsync-style target host/directory pairing, "foobar.com:~/dips/" for example. The <code>rsync-command</code> parameter is used to specify rsync connection options, "ssh -p 22222 -l user" for example. If you are using the rsync option, please see AtoM server configuration below.

To set any parameters for AtoM DIP upload change the values, preserving the existing format they're specified in, in the "Command arguments" field then click "Save".

Note that in AtoM, the sword plugin (Admin --> Plugins --> qtSwordPlugin) must be enabled in order for AtoM to receive uploaded DIPs. Enabling Job scheduling (Admin --> Settings --> Job scheduling) is also recommended.

=== AtoM server configuration ===

This server configuration step is necessary to allow Archivematica to log in to the AtoM server without passwords, and only when the user is deploying the rsync option described above in the AtoM DIP upload section.

To enable sending DIPs from Archivematica to the AtoM server:

Generate SSH keys for the Archivematica user. Leave the passphrase field blank.
<pre>
$ sudo -i -u archivematica
$ cd ~
$ ssh-keygen
</pre>

Copy the contents of <code>/var/lib/archivematica/.ssh/id_rsa.pub</code> somewhere handy, you will need it later.

Now, it's time to configure the AtoM server so Archivematica can send the DIPs using SSH/rsync. For that purpose, you will create a user called <code>archivematica</code> and we are going to assign that user a restricted shell with access only to rsync:

<pre>
$ sudo apt-get install rssh
$ sudo useradd -d /home/archivematica -m -s /usr/bin/rssh archivematica
$ sudo passswd -l archivematica
$ sudo vim /etc/rssh.conf // Make sure that allowrsync is uncommented!
</pre>

Add the SSH key that we generated before:

<pre>
$ sudo mkdir /home/archivematica/.ssh
$ chmod 700 /home/archivematica/.ssh/
$ sudo vim /home/archivematica/.ssh/authorized_keys // Paste here the contents of id_dsa.pub
$ chown -R archivematica:archivematica /home/archivematica
</pre>

In Archivematica, make sure that you update the <code>--rsync-target</code> accordingly. 
These are the parameters that we are passing to the upload-qubit microservice. 
Go to the Administration > Upload DIP page in the dashboard.

Generic parameters:

<pre>
--url="http://atom-hostname/index.php" \
--email="demo@example.com" \
--password="demo" \
--uuid="%SIPUUID%" \
--rsync-target="archivematica@atom-hostname:/tmp" \
--debug
</pre>

== CONTENTdm DIP upload ==

Archivematica can also upload DIPs to [http://www.contentdm.org/ CONTENTdm] instances. Multiple CONTENTdm destinations may be configured.

For each possible CONTENTdm DIP upload destination, you'll specify a brief description and configuration parameters appropriate for the destination. Paramters include <code>%ContentdmServer%</code> (full path to the CONTENTdm API, including the leading 'http://' or 'https://', for example http://example.com:81/dmwebservices/index.php), <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> (Linux user and group on the CONTENTdm server, not a CONTENTdm username). Note that only <code>%ContentdmServer%</code> is required is you are going to produce CONTENTdm Project Client packages; <code>%ContentdmUser%</code>, and <code>%ContentdmGroup%</code> are also required if you are going to use the "direct upload" option for uploading your DIPs into CONTENTdm.

When changing parameters for a CONTENTdm DIP upload destination simply change the values, preserving the existing format they're specified in. To add an upload destination fill in the form at the bottom of the page with the appropriate values. When you've completed your changes click the "Save" button.

== PREMIS agent ==

The PREMIS agent name and code can be set via the administration interface.
[[File:Premisagent-10.png|center|900px|thumbs]]

== Rest API ==

In addition to automation using the processingMCP.xml file, Archivematica includes a REST API for automating transfer approval. Using this API, you can create a custom script that copies a transfer to the appropriate directory then uses the <code>curl</code> command, or some other means, to let Archivematica know that the copy is complete.

=== API keys ===

Use of the REST API requires the use of API keys. An API key is associated with a specific user. To generate an API key for a user:

# Browse to <code>/administration/accounts/list/</code>
# Click the "Edit" button for the user you'd like to generate an API key for
# Click the "Regenerate API key" checkbox
# Click "Save"

After generating an API key, you can click the "Edit" button for the user and you should see the API key.

=== IP whitelist ===

In addition to creating API keys, you'll need to add the IP of any computer making REST requests to the REST API whitelist. The IP whitelist can be edited in the administration interface at <code>/administration/api/</code>.

=== Approving a transfer ===

The REST API can be used to approve a transfer. The transfer must first be copied into the appropriate watch directory. To determine the location of the appropriate watch directory, first figure out where the shared directory is from the <code>watchDirectoryPath</code> value of <code>/etc/archivematica/MCPServer/serverConfig.conf</code>. Within that directory is a subdirectory <code>activeTransfers</code>. In this subdirectory are watch directories for the various transfer types.

When using the REST API to approve a transfer, if a transfer type isn't specified, the transfer will be deemed a standard transfer.

'''HTTP Method:''' POST

'''URL:''' <code>/api/transfer/approve</code>

'''Parameters:'''

<code>directory</code>: directory name of the transfer

<code>type</code> (optional): transfer type [standard|dspace|unzipped bag|zipped bag]

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl --data "username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6&type=standard&directory=MyTransfer" http://127.0.0.1/api/transfer/approve

Example result:

{"message": "Approval successful."}

=== Listing unapproved transfers ===

The REST API can be used to get a list of unapproved transfers. Each transfer's directory name and type is returned.

'''Method:''' <code>GET</code>

'''URL:''' <code>/api/transfer/unapproved</code>

'''Parameters:'''

<code>api_key</code>: an API key

<code>username</code>: the username associated with the API key

Example curl command:

curl "http://127.0.0.1/api/transfer/unapproved?username=rick&api_key=f12d6b323872b3cef0b71be64eddd52f87b851a6"

Example result:

{
"message": "Fetched unapproved transfers successfully.",
"results": [{
"directory": "MyTransfer",
"type": "standard"
}
]
}
== Users ==

The dashboard provides a simple cookie-based user authentication system using the [https://docs.djangoproject.com/en/1.4/topics/auth/ Django authentication framework]. Access to the dashboard is limited only to logged-in users and a login page will be shown when the user is not recognized. If the application can't find any user in the database, the user creation page will be shown instead, allowing the creation of an administrator account.

Users can be also created, modified and deleted from the Administration tab. Only users who are administrators can create and edit user accounts.

You can add a new user to the system by clicking the "Add new" button on the user administration page. By adding a user you provide a way to access Archivematica using a username/password combination. Should you need to change a user's username or password, you can do so by clicking the "Edit" button, corresponding to the user, on the administration page. Should you need to revoke a user's access, you can click the corresponding "Delete" button.

=== CLI creation of administrative users ===

If you need an additional administrator user one can be created via the command-line, issue the following commands:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py createsuperuser

=== CLI password resetting ===

If you've forgotten the password for your administrator user, or any other user, you can change it via the command-line:

cd /usr/share/archivematica/dashboard
export PATH=$PATH:/usr/share/archivematica/dashboard
export DJANGO_SETTINGS_MODULE=settings.common
python manage.py changepassword <username>

===Security===

Archivematica uses [http://en.wikipedia.org/wiki/PBKDF2 PBKDF2] as the default algorithm to store passwords. This should be sufficient for most users: it's quite secure, requiring massive amounts of computing time to break. However, other algorithms could be used as the following document explains: [https://docs.djangoproject.com/en/1.4/topics/auth/#how-django-stores-passwords How Django stores passwords].

Our plan is to extend this functionality in the future adding groups and granular permissions support.

= Dashboard preservation planning tab =

== Format Policy Registry (FPR) ==

=== Introduction to the Format Policy Registry ===

The Format Policy Registry (FPR) is a database which allows Archivematica users to define format policies for handling file formats. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format). Format policies will change as community standards, practices and tools evolve. Format policies are maintained by Artefactual, who provides a freely-available FPR server hosted at [http://fpr.archivematica.org fpr.archivematica.org]. This server stores structured information about normalization format policies for preservation and access. You can update your local FPR from the FPR server using the UPDATE button in the preservation planning tab of the dashboard. In addition, you can maintain local rules to add new formats or customize the behaviour of Archivematica. The Archivematica dashboard communicates with the FPR server via a REST API.

==== First-time configuration ====

The first time a new Archivematica installation is set up, it will attempt to connect to the FPR server as part of the initial configuration process. As a part of the setup, it will register the Archivematica install with the server and pull down the current set of format policies. In order to register the server, Archivematica will send the following information to the FPR Server, over an encrypted connection:

#Agent Identifier (supplied by the user during registration while installing Archivematica)
#Agent Name (supplied by the user during registration while installing Archivematica)
#IP address of host
#UUID of Archivematica instance
#current time

*The only information that will be passed back and forth between Archivematica and the FPR Server would be these format policies - what tool to run when normalizing for a given purpose (access, preservation) when a specific File Identification Tool identifies a specific File Format. No information about the content that has been run through Archivematica, or any details about the Archivematica installation or configuration would be sent to the FPR Server.

* Because Archivematica is an open source project, it is possible for any organization to conduct a software audit/code review before running Archivematica in a production environment in order to independently verify the information being shared with the FPR Server. An organization could choose to run a private FPR Server, accessible only within their own network(s), to provide at least a limited version of the benefits of sharing format policies, while guaranteeing a completely self-contained preservation system. This is something that Artefactual is not intending to develop, but anyone is free to extend the software as they see fit, or to hire us or other developers to do so.

=== Updating format policies ===

FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica. Clicking the "update" button will initiate an FPR pull which will bring in any new or altered rules since the last time an update was performed.

=== Types of FPR entries ===

==== Format ====

In the FPR, a "format" is a record representing one or more related ''format versions'', which are records representing a specific file format. For example, the format record for "Graphics Interchange Format" (GIF) is comprised of format versions for both GIF 1987a and 1989a.

When creating a new format version, the following fields are available:

* Description (required) - Text describing the format. This will be saved in METS files.
* Version (required) - The version number for this specific format version (not the FPR record). For example, for Adobe Illustrator 14 .ai files, you might choose "14".
* Pronom id - The specific format version's unique identifier in [http://www.nationalarchives.gov.uk/PRONOM/Default.aspx PRONOM], the UK National Archives's format registry. This is optional, but highly recommended.
* Access format and Preservation format - Indicates whether this format is suitable as an access format for end users, and for preservation.

==== Format Group ====

A format group is a convenient grouping of related file formats which share common properties. For instance, the FPR includes an "Image (raster)" group which contains format records for GIF, JPEG, and PNG. Each format can belong to one (and only one) format group.

==== Characterization ====
Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties and to extract technical metadata contained within the object.

Prior to Archivematica 1.2, the characterization micro-service always ran the [http://projects.iq.harvard.edu/fits FITS] tool. As of Archivematica 1.2, characterization is fully customizable by the Archivematica administrator.

===== Characterization tools =====

Archivematica has four default characterization tools upon installation. Which tool will run on a given file depends on the type of file, as determined by the selected identification tool.

====== Default ======

The default characterization tool is FITS; it will be used if no specific characterization rule exists for the file being scanned.

It is possible to create new default characterization commands, which can either replace FITS or run alongside it on every file.

====== Multimedia ======

Archivematica 1.2 introduced three new multimedia characterization tools. These tools were selected for their rich metadata extraction, as well as for their speed. Depending on the type of the file being scanned, one or more of these tools may be called instead of FITS.

* [http://ffmpeg.org/ FFprobe], a characterization tool built on top of the same core as FFmpeg, the normalization software used by Archivematica
* [http://mediaarea.net/en/MediaInfo MediaInfo], a characterization tool oriented towards audio and video data
* [http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool], a characterization tool oriented towards still image data and extraction of embedded metadata

===== Writing a new characterization command =====

Information on writing new characterization commands can be found in the [[Administrator_manual_1.2#Format_Policy_Rules|FPR administrator's manual]].

Writing a characterization command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]]. Like an identification command, a characterization command is designed to run a tool and produce output to standard out. Output from characterization commands is expected to be valid XML, and will be included in the AIP's METS document within the file's <objectCharacteristicsExtension> element.

When creating a characterization command, the "output format" should be set to "XML 1.0".

==== Extraction ====

Archivematica supports extracting contents from files during the transfer phase.

Many transfers contain files which are packages of other files; examples of these include compressed archives, such as ZIP files, or disk images. Archivematica provides a transcription microservice which comes with several predefined rules to extract packages, and which is fully customizeable by Archivematica administrators. Administrators can write new commands, and assign existing formats to run for other file formats.

===== Writing a new extraction command =====

Writing an extraction command is very similar to writing an [[Administrator_manual_1.2#Identificaton Command|identification command]] or a [[Administrator_manual_1.2#Normalization Command|normalization command]].

An extraction command is passed two arguments: the ''file to extract'', and the ''path to which the package should be extracted''. Similar to [[Administrator_manual_1.2#Normalization Command|normalization commands]], these arguments will be interpolated directly into "bashScript" and "command" scripts, and passed as positional arguments to "pythonScript" and "asIs" scripts.

{|
|Name (bashScript and command)||Commandline position (pythonScript and asIs)||Description||Sample value
|-
|%outputDirectory%||First||The full path to the directory in which the package's contents should be extracted||/path/to/filename-uuid/
|-
|%inputFile%||Second||The full path to the package file||/path/to/filename
|}

Here's a simple example of how to call an existing tool (7-zip) without any extra logic:

<pre>7z x -bd -o"%outputDirectory%" "%inputFile%"</pre>

This Python script example is more complex, and attempts to determine whether any files were extracted in order to determine whether to exit 0 or 1 (and report success or failure):

<pre>
from __future__ import print_function
import re
import subprocess
import sys

def extract(package, outdir):
# -a extracts only allocated files; we're not capturing unallocated files
try:
process = subprocess.Popen(['tsk_recover', package, '-a', outdir],
stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
stdout, stderr = process.communicate()

match = re.match(r'Files Recovered: (\d+)', stdout.splitlines()[0])
if match:
if match.groups()[0] == '0':
raise Exception('tsk_recover failed to extract any files with the message: {}'.format(stdout))
else:
print(stdout)
except Exception as e:
return e

return 0

def main(package, outdir):
return extract(package, outdir)

if __name__ == '__main__':
package = sys.argv[1]
outdir = sys.argv[2]
sys.exit(main(package, outdir))
</pre>

==== Identification Tools ====

The identification tool properties in Archivematica control the ways in which Archivematica identifies files and associates them with the FPR's version records. The current version of the FPR server contains two tools: a script based on the [http://www.openplanetsfoundation.org/ Open Planets Foundation's] [https://github.com/openplanets/fido/ FIDO] tool, which identifies based on the IDs in PRONOM, and a simple script which identifies files by their file extension. You can use the identification tools portion of FPR to customize the behaviour of the existing tools, or to write your own.

==== Identification Commands ====

Identification commands contain the actual code that a tool will run when identifying a file. This command will be run on every file in a transfer.

When adding a new command, the following fields are available:

* Identifier (mandatory) - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Script type (mandatory) - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.

When coding a command, you should expect your script to take the path to the file to be identifed as the first commandline argument. When returning an identification, the tool should print a single line containing ''only'' the identifier, and should exit 0. Any informative, diagnostic, and error message can be printed to stderr, where it will be visible to Archivematica users monitoring tool results. On failure, the tool should exit non-zero.

==== Identification Rules ====

These identification rules allow you to define the relationship between the output created by an identification tool, and one of the formats which exists in the FPR. This must be done for the format to be tracked internally by Archivematica, and for it to be used by normalization later on. For instance, if you created a FIDO configuration which returns MIME types, you could create a rule which associates the output "image/jpeg" with the "Generic JPEG" format in the FPR.

Identification rules are necessary only when a tool is configured to return file extensions or MIME types. Because PUIDs are universal, Archivematica will always look these up for you without requiring any rules to be created, regardless of what tool is being used.

When creating an identification rule, the following mandatory fields must be filled out:

* Format - Allows you to select one of the formats which already exists in the FPR.
* Command - Indicates the command that produces this specific identification.
* Output - The text which is written to standard output by the specified command, such as "image/jpeg"

==== Format Policy Tools ====

Format policy tools control how Archivematica processes files during ingest. The most common kind of these tools are normalization tools, which produce preservation and access copies from ingested files. Archivematica comes configured with a number of commands and scripts to normalize several file formats, and you can use this section of the FPR to customize them or to create your own. These are organized similarly to the [[#Identification Tools]] documented above.

Archivematica uses the following kinds of format policy rules:

* Characterization
* Extraction
* Normalization - Access, preservation and thumbnails
* Event detail - Extracts information about a given tool in order to be inserted into a generated METS file.
* Transcription
* Verification - Validates a file produced by another command. For instance, a tool could use Exiftool or JHOVE to determine whether a thumbnail produced by a normalization command was valid and well-formed.

=== Format Policy Commands ===

Like the [[#Identification Commands]] above, format policy commands are scripts or command line statements which control how a normalization tool runs. This command will be run once on every file being normalized using this tool in a transfer.

When creating a normalization command, the following mandatory fields must be filled out:

* Tool - One or more tools to be associated with this command.
* Description - Human-readable identifier for the command. This will be displayed to the user when choosing an identification tool, so choose carefully.
* Command - The script's source, or the commandline statement to execute.
* Script type - Options are "Bash Script", "Python Script", "Command Line", and "No shebang". The first two options will have the appropriate shebang added as the first line before being executed directly. "No shebang" allows you to write a script in any language as long as the shebang is included as the first line.
* Output format (optional) - The format the command outputs. For example, a command to normalize audio to MP3 using ffmpeg would select the appropriate MP3 format from the dropdown.
* Output location (optional) - The path the normalized file will be written to. See the [[#Writing a command]] section of the documentation for more information.
* Command usage - The purpose of the command; this will be used by Archivematica to decide whether a command is appropriate to run in different circumstances. Values are "Normalization", "Event detail", and "Verification". See the [[#Writing a command]] section of the documentation for more information.
* Event detail command - A command to provide information about the software running this command. This will be written to the METS file as the "event detail" property. For example, the normalization commands which use ffmpeg use an event detail command to extract ffmpeg's version number.

=== Format Policy Rules ===

Format policy rules allow commands to be associated with specific file types. For instance, this allows you to configure the command that uses ImageMagick to create thumbnails to be run on .gif and .jpeg files, while selecting a different command to be run on .png files.

When creating a format policy rule, the following mandatory fields must be filled out:

* Purpose - Allows Archivematica to distinguish rules that should be used to normalize for preservation, normalize for access, to extract information, etc.
* Format - The file format the associated command should be selected for.
* Command - The specific command to call when this rule is used.

=== Writing a command ===

==== Identification command ====

Identification commands are very simple to write, though they require some familiarity with Unix scripting.

An identification command run once for every file in a transfer. It will be passed a single argument (the path to the file to identify), and no switches.

On success, a command should:

* Print the identifier to stdout
* Exit 0

On failure, a command should:

* Print nothing to stdout
* Exit non-zero (Archivematica does not assign special significance to non-zero exit codes)

A command can print anything to stderr on success or error, but this is purely informational - Archivematica won't do anything special with it. Anything printed to stderr by the command will be shown to the user in the Archivematica dashboard's detailed tool output page. You should print any useful error output to stderr if identification fails, but you can also print any useful extra information to stderr if identification succeeds.

Here's a very simple Python script that identifies files by their file extension:

<pre>import os.path, sys
(_, extension) = os.path.splitext(sys.argv[1])
if len(extension) == 0:
exit(1)
else:
print extension.lower()</pre>

Here's a more complex Python example, which uses [http://www.sno.phy.queensu.ca/~phil/exiftool/ Exiftool]'s XML output to return the MIME type of a file:

<pre>#!/usr/bin/env python

from lxml import etree
import subprocess
import sys

try:
xml = subprocess.check_output(['exiftool', '-X', sys.argv[1]])
doc = etree.fromstring(xml)
print doc.find('.//{http://ns.exiftool.ca/File/1.0/}MIMEType').text
except Exception as e:
print >> sys.stderr, e
exit(1)</pre>

Once you've written an identification command, you can register it in the FPR using the following steps:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for the command itself:

# Click "Create New Command".
# Select your tool from the "Tool" dropdown box.
# Fill out the Identifier with text to describe to a user what this tool does. For instance, we might choose "Identify MIME-type using Exiftool".
# Select the appropriate script type - in this case, "Python Script".
# Enter the source code for your script in the "Command" box.
# Click "Create Command".

Finally, you must create rules which associate the possible outputs of your tool with the FPR's format records. This needs to be done once for every supported format; we'll show it with MP3, as an example.

# Navigate to the "Identification Rules" page, and click "Create New Rule".
# Choose the appropriate foramt from the Format dropdown - in our case, "Audio: MPEG Audio: MPEG 1/2 Audio Layer 3".
# Choose your command from the Command dropdown.
# Enter the text your command will output when it identifies this format. For example, when our Exiftool command identifies an MP3 file, it will output "audio/mpeg".
# Click "Create".

Once this is complete, any new transfers you create will be able to use your new tool in the identification step.

==== Normalization Command ====

Normalization commands are a bit more complex to write because they take a few extra parameters.

The goal of a normalization command is to take an input file and transform it into a new format. For instance, Archivematica provides commands to transform video content into FFV1 for preservation, and into H.264 for access.

Archivematica provides several parameters specifying input and output filenames and other useful information. Several of the most common are shown in the examples below; a more complete list is in a later section of the documentation: [[#Normalization command variables and arguments]]

When writing a bash script or a command line, you can reference the variables directly in your code, like this:

<pre>inkscape -z "%fileFullName%" --export-pdf="%outputDirectory%%prefix%%fileName%%postfix%.pdf"</pre>

When writing a script in Python or other languages, the values will be passed to your script as commandline options, which you will need to parse. The following script provides an example using the argparse module that comes with Python:

<pre>import argparse
import subprocess

parser = argparse.ArgumentParser()

parser.add_argument('--file-full-name', dest='filename')
parser.add_argument('--output-file-name', dest='output')
parsed, _ = parser.parse_known_args()
args = [
'ffmpeg', '-vsync', 'passthrough',
'-i', parsed.filename,
'-map', '0:v', '-map', '0:a',
'-vcodec', 'ffv1', '-g', '1',
'-acodec', 'pcm_s16le',
parsed.output+'.mkv'
]

subprocess.call(args)</pre>

Once you've created a command, the process of registering it is similar to creating a new identification tool. The folling examples will use the Python normalization script above.

First, create a new tool record:

# Navigate to the "Preservation Planning" tab in the Archivematica dashboard.
# Navigate to the "Identification Tools" page, and click "Create New Tool".
# Fill out the name of the tool and the version number of the tool in use. In our example, this would be "exiftool" and "9.37".
# Click "Create".

Next, create a record for your new command:

# Click "Create New Tool Command".
# Fill out the Description with text to describe to a user what this tool does. For instance, we might choose "Normalize to mkv using ffmpeg".
# Enter the source for your command in the Command textbox.
# Select the appropriate script type - in this case, "Python Script".
# Select the appropriate output format from the dropdown. This indicates to Archivematica what kind of file this command will produce. In this case, choose "Video: Matroska: Generic MKV".
# Enter the location the video will be saved to, using the script variables. You can usually use the "%outputFileName%" variable, and add the file extension - in this case "%outputFileName%.mkv"
# Select a verification command. Archivematica will try to use this tool to ensure that the file your command created works. Archivematica ships with two simple tools, which test whether the file exists and whether it's larger than 0 bytes, but you can create new commands that perform more complicated verifications.
# Finally, choose a command to produce the "Event detail" text that will be written in the section of the METS file covering the normalization event. Archivematica already includes a suitable command for ffmpeg, but you can also create a custom command.
# Click "Create command".

Finally, you must create rules which will associate your command with the formats it should run on.

==== Normalization command variables and arguments ====

The following variables and arguments control the behaviour of format policy command scripts.

{|
|Name (bashScript and command)||Commandline option (pythonScript and asIs)||Description||Sample value
|-
|%fileName%||--input-file=||The filename of the file to process. This variable holds the file's basename, not the whole path.||video.mov
|-
|%fileDirectory%||--file-directory=||The directory containing the input file.||/path/to
|-
|%inputFile%||--file-name=||The fully-qualified path to the file to process.||/path/to/video.mov
|-
|%fileExtension%||--file-extension=||The file extension of the input file.||mov
|-
|%fileExtensionWithDot%||--file-extension-with-dot=||As above, without stripping the period.||.mov
|-
|%outputDirectory%||--output-directory=||The directory to which the output file should be saved.||/path/to/access/copies
|-
|%outputFileUUID%||--output-file-uuid=||The unique identifier assigned by Archivematica to the output file.||1abedf3e-3a4b-46d7-97da-bd9ae13859f5
|-
|%outputDirectory%||--output-directory=||The fully-qualified path to the directory where the new file should be written.||/var/archivematica/sharedDirectory/www/AIPsStore/uuid
|-
|%outputFileName%||--output-file-name=||The fully-qualified path to the output file, minus the file extension.||/path/to/access/copies/video-uuid
|}

= Customization and automation =
* Workflow processing decisions can be made in the processingMCP.xml file. [https://www.archivematica.org/wiki/Administrator_manual_0.10#Processing_configuration See here.]
* Workflows are currently created at the development level.
*: Some resources avialable
*:* [[MCP_Basic_Configuration]]
*:* [[MCP]]
*:* [[Creating_Custom_Workflows]]
*:* [[Development]]
* Normalization commands can be viewed in the preservation planning tab.
* Normalization paths and commands are currently editable under the preservation planning tab in the dashboard.

= Elasticsearch =

Archivematica has the capability of indexing data about files contained in AIPs and this data can be [[Elasticsearch Development|accessed programatically]] for various applications.

If, for whatever reason, you need to delete an ElasticSearch index please see [[ElasticSearch Administration]].

If, for whatever reason, you need to delete an Elasticsearch index programmatically, this can be done with pyes using the following code.

<pre>
import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
conn = ES('127.0.0.1:9200')

try:
conn.delete_index('aips')
except:
print "Error deleting index or index already deleted."
</pre>

=== Rebuilding the AIP index ===

To rebuild the ElasticSearch AIP index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-aip-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-aip-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-aip-index-from-files <location of your AIP store>

=== Rebuilding the transfer index ===

Similarly, to rebuild the ElasticSearch transfer data index enter the following to find the location of the rebuilding script:

locate rebuild-elasticsearch-transfer-index-from-files

Copy the location of the script then enter the following to perform the rebuild (substituting "/your/script/location/rebuild-elasticsearch-transfer-index-from-files" with the location of the script):

/your/script/location/rebuild-elasticsearch-transfer-index-from-files <location of your AIP store>

= Data backup =

In Archivematica there are three types of data you'll likely want to back up:
* Filesystem (particularly your storage directories)
* MySQL
* ElasticSearch

MySQL is used to store short-term processing data. You can back up the MySQL database by using the following command:

<pre>mysqldump -u <your username> -p<your password> -c MCP > <filename of backup></pre>

ElasticSearch is used to store long-term data. Instructions and scripts for backing up and restoring ElasticSearch are available [http://tech.superhappykittymeow.com/?p=296 here].

= Security =

Once you've set up Archivematica it's a good practice, for the sake of security, to change the default passwords.

== MySQL ==

You should create a new MySQL user or change the password of the default "archivematica" MySQL user. The change the password of the default user, enter the following into the command-line:

$ mysql -u root -p<your MyQL root password> -D mysql \
-e "SET PASSWORD FOR 'archivematica'@'localhost' = PASSWORD('<new password>'); \
FLUSH PRIVILEGES;"

Once you've done this you can change Archivematica's MySQL database access credentials by editing these two files:
* <code>/etc/archivematica/archivematicaCommon/dbsettings</code> (change the <code>user</code> and <code>password</code> settings)
* <code>/usr/share/archivematica/dashboard/settings/common.py</code> (change the <code>USER</code> and <code>PASSWORD</code> settings in the <code>DATABASES</code> section)

Archivematica does not presently support secured MySQL communication so MySQL should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1645 1645].

== AtoM ==

In addition to changing the MySQL credentials, if you've also installed AtoM you'll want to set the password for it as well. Note that after changing your AtoM credentials you should update the credentials on the AtoM DIP upload administration page as well.

== Gearman ==

Archivematica relies on the German server for queuing work that needs to be done. Gearman currently doesn't support secured connections so Gearman should be run locally or on a secure, isolated network. See issue [https://projects.artefactual.com/issues/1345 1345].

= Questions =

If you run into any difficulties while administrating Archivematica, please check out our FAQ and, if that doesn't help you, contain us using the Archivematica discussion group.

== Frequently asked questions ==
* [[AM_FAQ|Solutions to common questions]]

== Discussion group ==
* [http://groups.google.com/group/archivematica?hl=en Discussion group] for questions not covered by the FAQ

User:Mdemeo/Characterization

2014-08-07T00:35:06Z

Mdemeo: Created page with "Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties, and ..."

Characterization is the process of producing technical metadata for an object. Archivematica's characterization aims both to document the object's significant properties, and to extract technical metadata contained within the object.

Prior to Archivematica 1.2, the characterization microservice always ran the [http://projects.iq.harvard.edu/fits FITS] tool. As of Archivematica 1.2, characterization is now fully customizeable by the Archivematica administrator.

== Characterization tools ==

Archivematica 1.2 ships with four characterization tools. Which tool will run on a given file depends on the type of file, as determined by Archivematica's identification tool.

=== Default ===

The default characterization tool is FITS; it will be used if no specific characterization rule exists for the file being scanned.

It is possible to create new default characterization commands, which can either replace FITS or run alongside it on every file.

=== Multimedia ===

Archivematica 1.2 introduces three new multimedia characterization tools. These tools were selected for their rich metadata extraction, as well as for their speed. Depending on the type of the file being scanned, one or more of these tools may be called instead of FITS.

* [http://ffmpeg.org/ FFprobe], a characterization tool built on top of the same core as FFmpeg, the normalization software used by Archivematica
* [http://mediaarea.net/en/MediaInfo MediaInfo], a characterization tool oriented towards audio and video data
* [http://www.sno.phy.queensu.ca/~phil/exiftool/index.html ExifTool], a characterization tool oriented towards still image data and extraction of embedded metadata

== Writing a new characterization command ==

Information on writing new characterization commands can be found in the [[Administrator_manual_1.1#Format_Policy_Rules|FPR administrator's manual]].

Writing a characterization command is very similar to writing an [[Administrator_manual_1.1#Identificaton Command|identification command]] or a [[Administrator_manual_1.1#Normalization Command|normalization command]]. Like an identification command, a characterization command is designed to run a tool and produce output to standard out. Output from characterization commands is expected to be valid XML, and will be included in the AIP's METS document within the file's <objectCharacteristicsExtension> element.

When creating a characterization command, the "output format" should be set to "XML 1.0".

Install-1.0-packages

2014-01-10T19:37:31Z

Mdemeo: Fix uwsgi instructions

== Deploying Archivematica 1.0 packages ==

Archivematica packages are hosted on Launchpad, in an Ubuntu PPA (Personal Package Archive). There are a number of Archivematica PPA's, the test versions of Archivematica 1.0 packages are hosted in the archivematica/daily ppa. In order to install software onto your Ubuntu 12.04 system from this PPA:

*Update your system to the most recent 12.04 release (12.04.3)
<pre>
sudo apt-get update
sudo apt-get upgrade
</pre>
*Add the archivematica/daily PPA to your list of trusted repositories:
<pre>
sudo add-apt-repository ppa:archivematica/daily
</pre>
Note: The daily PPA is used for daily snapshots, it is not suitable for use in a production environment. When testing of the packages in the daily PPA is complete, they will be copied to the archivematica/release ppa for production use.

*Fetch a list of the software from the archivematica/daily ppa:
<pre>
sudo apt-get update
</pre>
* Install the storage service
<pre>
sudo apt-get install archivematica-storage-service
</pre>

* Configure the storage service
note:these steps are safe to do on a desktop, or a machine dedicated to Archivematica. They may not be advisable on an existing web server. Consult with your web server administrator if you are unsure.

<pre>
sudo rm /etc/nginx/sites-enabled/default
sudo ln -s /etc/nginx/sites-available/storage /etc/nginx/sites-enabled/storage
sudo ln -s /etc/uwsgi/apps-available/storage.ini /etc/uwsgi/apps-enabled/storage.ini
sudo service uwsgi restart
sudo service nginx restart
</pre>

* Test the storage service
go to the following link in a web browser:

http://localhost:8000 (or use the ip address of the machine you have been installing on). 
log in as user: test pass: test

* Delete the test user and create a new user

The storage service runs as a separate web app from the Archivematica dashboard, and has its own set of users. You should add at least one user, and delete or modify the test user.

Forensic imaging steps for 1.1

2013-11-01T16:51:38Z

Mdemeo: Note more completed tasks

[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > [[Digital forensics image ingest]] > Forensic imaging steps for 1.1
[[Category:Development documentation]]

Archivematica 1.0 has changed the way that several processes work, which has changed the scope of what's necessary to implement for forensic disk imaging. This document outlines the mandatory steps that still need to be completed for forensic imaging in 1.1, and some additional steps that would let Archivematica generalize the functionality into the standard transfer.

== Necessary improvements ==

=== Disk image imaging metadata must be able to be added ===

Forensic image transfers need to provide the ability to include some [[Digital forensics image ingest#Metadata_requirements|metadata at the beginning of the transfer]].

This is partially implemented and needs to be rebased onto the 1.0 branch.

=== File identification commands must recognize disk images ===

Since the new extraction model is based on the FPR, and hence requires file identification, it will be necessary to ensure the identification microservices can identify disk images in order to allow them to be extracted.

=== Forensic tools must be packaged ===

=== <strike> Disk image extraction commands must be added to the FPR </strike> ===

Currently an extraction command using tsk_recover exists; this will allow sleuthkit-based images to be extracted. Other formats may be needed as well.

Being tracked in https://projects.artefactual.com/issues/5843

=== <strike> Extracted package deletion must be optional </strike> ===

Currently, whether or not extracted packages will be retained after decompression is hardcoded in the package extraction script, as it was in the old extraction code. (The current behaviour is to always delete the package after decompression.) This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

Being tracked in https://projects.artefactual.com/issues/5894

=== <strike> Users should be offered the choice of whether to extract packages </strike> ===

This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

Being tracked in https://projects.artefactual.com/issues/5894

=== <strike> New "Examine Contents" microservice must be added </strike> ===

Being tracked in https://projects.artefactual.com/issues/5880

This step [[Digital forensics image ingest#Detail|runs the bulk_extractor tool]] and indexes the output to allow for later visualization and examination.

=== <strike> New characterization scripts must be written for fiwalk </strike> ===

Being tracked in https://projects.artefactual.com/issues/5866

It's previously been suggested that Archivematica use [[Digital forensics image ingest#fiwalk|Mark Matienzo's fiwalk configuration that uses FIDO]] but this may no longer be necessary now that FIDO is implemented as a general identification tool - extracted contents will always be identifiable using FIDO if the user selects that as their identification tool.

== Potential improvements ==

=== <strike> Alternate characterization tools should be implemented </strike> ===

Implemented in https://projects.artefactual.com/issues/5866

Disk image characterization should be [[Digital forensics image ingest#Detail|done with fiwalk]].

Currently the "characterize and extract metadata" step always uses FITS, but in 1.0 the groundwork was laid for allowing this to be controllable using the FPR instead. If this is completed, then we can simply write FPR rules to control characterization of disk images.

=== Provide robust identification fallbacks using additional microservice(s) ===

Currently identification happens using a single tool; if identification fails, the file will not be identified. We provide a single case fallback in the scripts that handle file identification and FIDO. Providing a more robust fallback would be desirable - e.g., by allowing individual files to fall back to other IDTools if identification fails. This would allow alternate tools to provide identification results for things that FIDO currently can't identify, such as disk images, without needing to clutter the existing scripts.

=== Recursive package extraction ===

The current package extraction code extracts in one pass. If a package contains additional packages that Archivematica can extract, they currently won't be extracted. The code should be updated in order to allow extraction of nested packages - for instance ZIP files containing other ZIPs; tarballs in uncommon compression formats (such as .tar.xz); and disk images containing compressed archives.

Forensic imaging steps for 1.1

2013-10-30T21:58:00Z

Mdemeo: Document completed tasks

[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > [[Digital forensics image ingest]] > Forensic imaging steps for 1.1
[[Category:Development documentation]]

Archivematica 1.0 has changed the way that several processes work, which has changed the scope of what's necessary to implement for forensic disk imaging. This document outlines the mandatory steps that still need to be completed for forensic imaging in 1.1, and some additional steps that would let Archivematica generalize the functionality into the standard transfer.

== Necessary improvements ==

=== Disk image imaging metadata must be able to be added ===

Forensic image transfers need to provide the ability to include some [[Digital forensics image ingest#Metadata_requirements|metadata at the beginning of the transfer]].

This is partially implemented and needs to be rebased onto the 1.0 branch.

=== File identification commands must recognize disk images ===

Since the new extraction model is based on the FPR, and hence requires file identification, it will be necessary to ensure the identification microservices can identify disk images in order to allow them to be extracted.

=== Forensic tools must be packaged ===

=== <strike> Disk image extraction commands must be added to the FPR </strike> ===

Currently an extraction command using tsk_recover exists; this will allow sleuthkit-based images to be extracted. Other formats may be needed as well.

Being tracked in https://projects.artefactual.com/issues/5843

=== Extracted package deletion must be optional ===

Currently, whether or not extracted packages will be retained after decompression is hardcoded in the package extraction script, as it was in the old extraction code. (The current behaviour is to always delete the package after decompression.) This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== Users should be offered the choice of whether to extract packages ===

This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== <strike> New "Examine Contents" microservice must be added </strike> ===

Being tracked in https://projects.artefactual.com/issues/5880

This step [[Digital forensics image ingest#Detail|runs the bulk_extractor tool]] and indexes the output to allow for later visualization and examination.

=== <strike> New characterization scripts must be written for fiwalk </strike> ===

Being tracked in https://projects.artefactual.com/issues/5866

It's previously been suggested that Archivematica use [[Digital forensics image ingest#fiwalk|Mark Matienzo's fiwalk configuration that uses FIDO]] but this may no longer be necessary now that FIDO is implemented as a general identification tool - extracted contents will always be identifiable using FIDO if the user selects that as their identification tool.

== Potential improvements ==

=== <strike> Alternate characterization tools should be implemented </strike> ===

Implemented in https://projects.artefactual.com/issues/5866

Disk image characterization should be [[Digital forensics image ingest#Detail|done with fiwalk]].

Currently the "characterize and extract metadata" step always uses FITS, but in 1.0 the groundwork was laid for allowing this to be controllable using the FPR instead. If this is completed, then we can simply write FPR rules to control characterization of disk images.

=== Provide robust identification fallbacks using additional microservice(s) ===

Currently identification happens using a single tool; if identification fails, the file will not be identified. We provide a single case fallback in the scripts that handle file identification and FIDO. Providing a more robust fallback would be desirable - e.g., by allowing individual files to fall back to other IDTools if identification fails. This would allow alternate tools to provide identification results for things that FIDO currently can't identify, such as disk images, without needing to clutter the existing scripts.

=== Recursive package extraction ===

The current package extraction code extracts in one pass. If a package contains additional packages that Archivematica can extract, they currently won't be extracted. The code should be updated in order to allow extraction of nested packages - for instance ZIP files containing other ZIPs; tarballs in uncommon compression formats (such as .tar.xz); and disk images containing compressed archives.

Forensic imaging steps for 1.1

2013-10-21T22:20:56Z

Mdemeo: Add category

[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > [[Digital forensics image ingest]] > Forensic imaging steps for 1.1
[[Category:Development documentation]]

Archivematica 1.0 has changed the way that several processes work, which has changed the scope of what's necessary to implement for forensic disk imaging. This document outlines the mandatory steps that still need to be completed for forensic imaging in 1.1, and some additional steps that would let Archivematica generalize the functionality into the standard transfer.

== Necessary improvements ==

=== Disk image imaging metadata must be able to be added ===

Forensic image transfers need to provide the ability to include some [[Digital forensics image ingest#Metadata_requirements|metadata at the beginning of the transfer]].

This is partially implemented and needs to be rebased onto the 1.0 branch.

=== File identification commands must recognize disk images ===

Since the new extraction model is based on the FPR, and hence requires file identification, it will be necessary to ensure the identification microservices can identify disk images in order to allow them to be extracted.

=== Forensic tools must be packaged ===

=== Disk image extraction commands must be added to the FPR ===

=== Extracted package deletion must be optional ===

Currently, whether or not extracted packages will be retained after decompression is hardcoded in the package extraction script, as it was in the old extraction code. (The current behaviour is to always delete the package after decompression.) This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== Users should be offered the choice of whether to extract packages ===

This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== New "Examine Contents" microservice must be added ===

This step [[Digital forensics image ingest#Detail|runs the bulk_extractor tool]] and indexes the output to allow for later visualization and examination.

=== New characterization scripts must be written for fiwalk ===

It's previously been suggested that Archivematica use [[Digital forensics image ingest#fiwalk|Mark Matienzo's fiwalk configuration that uses FIDO]] but this may no longer be necessary now that FIDO is implemented as a general identification tool - extracted contents will always be identifiable using FIDO if the user selects that as their identification tool.

== Potential improvements ==

=== Alternate characterization tools should be implemented ===

Disk image characterization should be [[Digital forensics image ingest#Detail|done with fiwalk]].

Currently the "characterize and extract metadata" step always uses FITS, but in 1.0 the groundwork was laid for allowing this to be controllable using the FPR instead. If this is completed, then we can simply write FPR rules to control characterization of disk images.

=== Provide robust identification fallbacks using additional microservice(s) ===

Currently identification happens using a single tool; if identification fails, the file will not be identified. We provide a single case fallback in the scripts that handle file identification and FIDO. Providing a more robust fallback would be desirable - e.g., by allowing individual files to fall back to other IDTools if identification fails. This would allow alternate tools to provide identification results for things that FIDO currently can't identify, such as disk images, without needing to clutter the existing scripts.

=== Recursive package extraction ===

The current package extraction code extracts in one pass. If a package contains additional packages that Archivematica can extract, they currently won't be extracted. The code should be updated in order to allow extraction of nested packages - for instance ZIP files containing other ZIPs; tarballs in uncommon compression formats (such as .tar.xz); and disk images containing compressed archives.

Forensic imaging steps for 1.1

2013-10-21T22:17:47Z

Mdemeo: Add version header

[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > [[Digital forensics image ingest]] > Forensic imaging steps for 1.1

Archivematica 1.0 has changed the way that several processes work, which has changed the scope of what's necessary to implement for forensic disk imaging. This document outlines the mandatory steps that still need to be completed for forensic imaging in 1.1, and some additional steps that would let Archivematica generalize the functionality into the standard transfer.

== Necessary improvements ==

=== Disk image imaging metadata must be able to be added ===

Forensic image transfers need to provide the ability to include some [[Digital forensics image ingest#Metadata_requirements|metadata at the beginning of the transfer]].

This is partially implemented and needs to be rebased onto the 1.0 branch.

=== File identification commands must recognize disk images ===

Since the new extraction model is based on the FPR, and hence requires file identification, it will be necessary to ensure the identification microservices can identify disk images in order to allow them to be extracted.

=== Forensic tools must be packaged ===

=== Disk image extraction commands must be added to the FPR ===

=== Extracted package deletion must be optional ===

Currently, whether or not extracted packages will be retained after decompression is hardcoded in the package extraction script, as it was in the old extraction code. (The current behaviour is to always delete the package after decompression.) This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== Users should be offered the choice of whether to extract packages ===

This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== New "Examine Contents" microservice must be added ===

This step [[Digital forensics image ingest#Detail|runs the bulk_extractor tool]] and indexes the output to allow for later visualization and examination.

=== New characterization scripts must be written for fiwalk ===

It's previously been suggested that Archivematica use [[Digital forensics image ingest#fiwalk|Mark Matienzo's fiwalk configuration that uses FIDO]] but this may no longer be necessary now that FIDO is implemented as a general identification tool - extracted contents will always be identifiable using FIDO if the user selects that as their identification tool.

== Potential improvements ==

=== Alternate characterization tools should be implemented ===

Disk image characterization should be [[Digital forensics image ingest#Detail|done with fiwalk]].

Currently the "characterize and extract metadata" step always uses FITS, but in 1.0 the groundwork was laid for allowing this to be controllable using the FPR instead. If this is completed, then we can simply write FPR rules to control characterization of disk images.

=== Provide robust identification fallbacks using additional microservice(s) ===

Currently identification happens using a single tool; if identification fails, the file will not be identified. We provide a single case fallback in the scripts that handle file identification and FIDO. Providing a more robust fallback would be desirable - e.g., by allowing individual files to fall back to other IDTools if identification fails. This would allow alternate tools to provide identification results for things that FIDO currently can't identify, such as disk images, without needing to clutter the existing scripts.

=== Recursive package extraction ===

The current package extraction code extracts in one pass. If a package contains additional packages that Archivematica can extract, they currently won't be extracted. The code should be updated in order to allow extraction of nested packages - for instance ZIP files containing other ZIPs; tarballs in uncommon compression formats (such as .tar.xz); and disk images containing compressed archives.

Forensic imaging steps for 1.1

2013-10-21T22:16:40Z

Mdemeo: Created page with "Archivematica 1.0 has changed the way that several processes work, which has changed the scope of what's necessary to implement for forensic disk imaging. This document outlin..."

Archivematica 1.0 has changed the way that several processes work, which has changed the scope of what's necessary to implement for forensic disk imaging. This document outlines the mandatory steps that still need to be completed for forensic imaging in 1.1, and some additional steps that would let Archivematica generalize the functionality into the standard transfer.

== Necessary improvements ==

=== Disk image imaging metadata must be able to be added ===

Forensic image transfers need to provide the ability to include some [[Digital forensics image ingest#Metadata_requirements|metadata at the beginning of the transfer]].

This is partially implemented and needs to be rebased onto the 1.0 branch.

=== File identification commands must recognize disk images ===

Since the new extraction model is based on the FPR, and hence requires file identification, it will be necessary to ensure the identification microservices can identify disk images in order to allow them to be extracted.

=== Forensic tools must be packaged ===

=== Disk image extraction commands must be added to the FPR ===

=== Extracted package deletion must be optional ===

Currently, whether or not extracted packages will be retained after decompression is hardcoded in the package extraction script, as it was in the old extraction code. (The current behaviour is to always delete the package after decompression.) This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== Users should be offered the choice of whether to extract packages ===

This must be made optional via user choice in the UI, and should be exposed as a persistent option in the processing configuration.

=== New "Examine Contents" microservice must be added ===

This step [[Digital forensics image ingest#Detail|runs the bulk_extractor tool]] and indexes the output to allow for later visualization and examination.

=== New characterization scripts must be written for fiwalk ===

It's previously been suggested that Archivematica use [[Digital forensics image ingest#fiwalk|Mark Matienzo's fiwalk configuration that uses FIDO]] but this may no longer be necessary now that FIDO is implemented as a general identification tool - extracted contents will always be identifiable using FIDO if the user selects that as their identification tool.

== Potential improvements ==

=== Alternate characterization tools should be implemented ===

Disk image characterization should be [[Digital forensics image ingest#Detail|done with fiwalk]].

Currently the "characterize and extract metadata" step always uses FITS, but in 1.0 the groundwork was laid for allowing this to be controllable using the FPR instead. If this is completed, then we can simply write FPR rules to control characterization of disk images.

=== Provide robust identification fallbacks using additional microservice(s) ===

Currently identification happens using a single tool; if identification fails, the file will not be identified. We provide a single case fallback in the scripts that handle file identification and FIDO. Providing a more robust fallback would be desirable - e.g., by allowing individual files to fall back to other IDTools if identification fails. This would allow alternate tools to provide identification results for things that FIDO currently can't identify, such as disk images, without needing to clutter the existing scripts.

=== Recursive package extraction ===

The current package extraction code extracts in one pass. If a package contains additional packages that Archivematica can extract, they currently won't be extracted. The code should be updated in order to allow extraction of nested packages - for instance ZIP files containing other ZIPs; tarballs in uncommon compression formats (such as .tar.xz); and disk images containing compressed archives.

Digital forensics image ingest

2013-10-21T21:59:35Z

Mdemeo: Add link to forensic images development steps document

[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > Digital forensics image ingest
[[Category:Development documentation]]

Related issues: #5265
NOTE: Wherever possible, use BitCurator packages for forensics tools.

The current status of implementation can be found at: [[Forensic imaging steps for 1.1]]

== Forensics image transfer type ==

* Archivematica transfer type: forensic image
** One or more images make up a transfer
** Repository makes image using outside imaging software prior to ingest
** Some metadata from ingest process will be included, first from FTK Imager, but later from other tools like Guymager (see metadata requirements below)
* Image types to base development on (more analysis needed): raw sector images =dd, bin, ISO; E01, AFF, AD1; ISO images with CUE files that contain track information; STREAM images (Kryoflux STREAM: This is a representation of non-decoded raw magnetic flux transitions acquired using a Kryoflux.) (these formats are sponsored, support for the formats listed [http://www.forensicswiki.org/wiki/Forensic_file_formats here] is desirable in future releases)

== Forensics image transfer workflow ==
[[File:ArchivematicaForensicImageIngest.png|900px|thumb|center|]]
[[File:ArchivematicaForensicImageIngest(2).png|700px|thumb|center|]]
====Detail====

* User images external media outside the Archivematica workflow
* User uploads image(s) into the Archivematica transfer tab of the dashboard by browsing to the appropriate transfer source directory and selecting a directory containing their image(s)
* User enters transfer name and accession number
* User selects MD entry template for entering MD about the imaging process
** User enters MD (see MD requirements below)
** User saves MD and starts transfer processes
* User selects Start transfer to begin Archivematica transfer processing
* Fiwalk with Fido or BitCurator fiwalk package completes the Characterize and extract metadata micro-service
* Archivematica runs [http://www.forensicswiki.org/wiki/Bulk_extractor Bulk Extractor] tool('''Examine contents micro-service''') and indexes output (this is to allow for reporting and visualization in the transfer backlog search for SIP creation and/or the AIP advanced search to allow for minimal description)
* Transfer micro-services complete
* At Create SIP from Transfer micro-service, user selects one of two options:
** If the user is an archivist/curator ready to process the image through to storage and/or access, choose Create single SIP and continue processing
** If the user is uploading multiple images as part of one accession, for processing by an archivist/curator later, choose Send to backlog
*** In the second scenario, once all images from an accession are in the backlog, user alerts archivist/curator that the accession is ready for further processing
** Archivist searches for the accession in the transfer backlog, selects the appropriate transfers, and selects Create SIP
* In ingest tab, user approves SIP creation
* In ingest tab, prior to normalization, there is a decision point at '''Extract packages micro-service''' - User selects from drop-down: Extract objects from image, Do not extract objects from image, Reject
** If user chooses not to extract objects, then skip micro-service decision about tool output to base normalization on, choose normalization for preservation only or no normalization, and continue standard micro-services to store AIP.
** If user chooses to extract objects, Archivematica runs FITS on the extracted contents. The user continues standard workflow, choosing any of the normalization options (including manual normalization) and continues processing to storage and/or access.

* EXCEPTIONS:
** In the case of AD1 images, user should be able to choose to extract the objects from the AD1 image before transfer. Archivematica should recognize an AD1 image and issue an alert/warning. If the user chooses to proceed with the transfer/ingest then the AD1 file just gets stored without any normalization or metadata extraction.

==Metadata requirements==
When the user selects Forensic image transfer type, each image uploaded as part of the transfer will include a metadata form icon that, if selected, will open a form in another browser tab. There, the user will enter some or all of the MD indicated below in the Template for manual data entry list.

* Template for manual data entry
**accession number - recorded in transfer upload in dashboard
**media number - manual
**label text - manual (long text field)
**media manufacture - manual
**serial number - manual
**media format - manual, could be controlled value list
**media density - manual, could be controlled value list
**source filesystem
**notes about the imaging process
**imaging interface - manual, could be controlled value list
**examiner - AUTOPOPULATED based on Archivematica user (PREMIS agent)
**image format - manual, could be controlled value list
**imaging software - manual, could be controlled value list
**notes about the imaging process - manual (long text field)

* Import from imaging tool FTK or fiwalk/sleuthkit
**imaging date (FTK or other imaging tool output)
**imaging success - Yes, Yes with errors (FTK or other imaging tool output)
**image fixity (FTK or other imaging tool output
**source filesystem (fiwalk)
**accession data about extent (fiwalk)

{| border="1" cellpadding="10" cellspacing="0" width="100%"
|-
!'''element'''
!'''description'''
!'''DACS (2013)'''
!'''ISAD(G)'''
!'''EAD'''
!'''PREMIS 2.2'''
|-
|media number
|repository specific alphanumeric designation assigned to individual physical media/carrier
|2.1.3 local identifier - At the highest level of a multilevel description or in a single level description,
provide a unique identifier for the materials being described in accordance with the
institution’s administrative control system. Optionally, devise unique identifiers at lower
levels of a multilevel description.
|3.1.1 - Reference codes
|<unitid>
|
|-
|label text
|textual transcription
|7.1.2 Record, as needed, information not accommodated by any of the defined elements of description.
|3.6.1 Note
|<odd>, <note>
|
|-
|media manufacturer
|
|7.1.4 - If the materials being described are in electronic form, give details of any migration or logical reformatting since its transfer to archival custody. Indicate the location of any relevant documentation. Information regarding digitization is provided in the Existence and Location of Copies Element (6.2).
|3.6.1 Note
|<odd>, <note>
|
|-
|serial number
|when applicable to external media
|7.1.4 or 7.1.6 If appropriate at the file or item level of description, make a note of any important numbers borne by the unit being described.
|3.6.1 Note
|<odd>, <note>
|
|-
|media format
|a controlled value list (e.g. 3.5" floppy, 5.25" floppy, CD-R, etc)
|7.1.4
|3.6.1 Note
|<odd>, <note>
|
|-
|media density
|a controlled value list (e.g. single density, double density, quad density, high density)
|7.1.4
|3.6.1 Note
|<odd>, <note>
|
|-
|source filesystem
|a controlled file list(e.g. HFS, FAT, etc.) with the ability to add terms
|7.1.4
|3.6.1 Note
|<odd>, <note>
|
|-
|notes about the imaging process
|textual field to describe more detail about the imaging process
|7.1.4
|3.6.1 Note
|<odd>, <note>
|
|-
|imaging interface
|a controlled value list (e.g. Catweasel, Firewire, USB, IDE, etc.)
|
|
|
|2.2 - Event: Image capture
|-
|examiner
|the person doing the imaging
|
|
|
|2.2 - Agent
|-
|imaging date
|date of imaging
|
|
|
|2.2 - Event: Image capture
|-
|imaging success
|ex Yes/Yes, with errors
|
|
|
|2.2 - Event: Image capture
|-
|image format
|a controlled value list (e.g. AFF3, dd/secort image, AD1, etc.) with the ability to add terms
|
|
|
|2.2 - Object
|-
|imaging software
|a controlled value list (e.g. FTK imager 3.1.0.1514, Kryoflux, DTC 2.00 beta 9, etc.) with the ability to add terms
|
|
|
|2.2 - Agent
|-
|image fixity
|type(s) and value(s) from FTK csv output
|
|
|
|2.2 - Object
|-
|}



==Forensic image transfer tools ==

===fiwalk===

* Characterize and extract metadata micro-service
* Use Mark Matienzo's github version which includes FIDO for format identification since fiwalk's format identification is libmagic (unsatisfactory for our purposes)

Sample fiwalk xml output:

<pre>

<?xml version='1.0' encoding='ISO-8859-1'?>
<fiwalk xmloutputversion='0.2'>
<metadata
xmlns='http://example.org/myapp/'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:type>Disk Image</dc:type>
</metadata>
<creator>
<program>fiwalk</program>
<version>0.5.7</version>
<os>Darwin</os>
<library name="tsk" version="3.0.1"></library>
<library name="afflib" version="3.5.2"></library>
<command_line>fiwalk -x /dev/disk2</command_line>
</creator>
<source>
<imagefile>/dev/disk2</imagefile>
</source>

<volume offset='512'>
<Partition_Offset>512</Partition_Offset>
<block_size>512</block_size>
<ftype>2</ftype>
<ftype_str>fat12</ftype_str>
<block_count>5062</block_count>
<first_block>0</first_block>
<last_block>5061</last_block>
<fileobject>
<filename>README.txt</filename>
<id>2</id>
<filesize>43</filesize>
<partition>1</partition>
<alloc>1</alloc>
<used>1</used>
<inode>6</inode>
<type>1</type>
<mode>511</mode>
<nlink>1</nlink>
<uid>0</uid>
<gid>0</gid>
<mtime>1258916904</mtime>
<atime>1258876800</atime>
<crtime>1258916900</crtime>
<byte_runs>
<run file_offset='0' fs_offset='37376' img_offset='37888' len='43'/>
</byte_runs>
<hashdigest type='md5'>2bbe5c3b554b14ff710a0a2e77ce8c4d</hashdigest>
<hashdigest type='sha1'>b3ccdbe2db1c568e817c25bf516e3bf976a1dea6</hashdigest>
</fileobject>
</volume>


<runstats>
<user_seconds>0</user_seconds>
<system_seconds>0</system_seconds>
<maxrss>1814528</maxrss>
<reclaims>546</reclaims>
<faults>1</faults>
<swaps>0</swaps>
<inputs>56</inputs>
<outputs>0</outputs>
<stop_time>Sun Nov 22 11:08:36 2009</stop_time>
</runstats>
</fiwalk>
</pre>
===Bulk Extractor===

External tools/FITS performance

2013-08-24T00:16:02Z

Mdemeo: Comparing FITS against individual tools

= JVM startup lag =

A significant part of FITS's startup is spent instantiating the JVM, which is relatively slow. To speed this up, it may be possible to use the [http://martiansoftware.com/nailgun/ nailgun] tool, which creates a single shared JVM instance that a tool can connect to, thus bypassing the overhead of needing to start up a JVM every time FITS is run.

A simple test of FITS's startup time was performed using the `fits -h` help command; since the help command performs no actual characterization, most of the time of running this command is spent creating a JVM. Results:

== Benchmarks ==

* Average time without a persistent JVM, averaged over 10 iterations: 4.2852s
* Average time with a persistent Nailgun JVM, averaged over 10 iterations: 1.7022s

= DROID startup =

The other largest component of FITS startup is the DROID component. The official FITS 0.6.2 uses an older version of DROID, which takes approximately a second to load its signature data at startup. [https://github.com/gmcgath/fits-mcgath Gary McGath's fork of FITS] has been updated to use a more modern version of DROID, substantially improving startup time

== Benchmarks ==

* Average startup time for FITS 0.6.2 (with Nailgun), averaged over 10 iterations: 1.7022s
* Average time for FITS-McGath 0.7.5 (with Nailgun), over 10 iterations: 0.7526s

= Running specialized tools for specific formats =

For certain formats which are relatively well-known, such as video formats, audio formats, and images, it may be possible to run individual tools to extract useful characterization metadata. For instance, when profiling video files, it would be useful to run DROID and exiftool (both also run by FITS) as well as mediainfo. Experiments indicate that FITS introduces a major overhead, and significant speed improvements can be had by using running tools directly.

== Benchmarks ==

These benchmarks come from a single run of each tool, though later experiments with the tool indicate that they are broadly representative of the tool's average performance.

=== MXF ===

The scanned MXF video was approximately 4GB.

* DROID alone: 0m4.739s
* exiftool alone: 0m0.167s
* mediainfo alone: 0m3.105s
* FITS 0.6.2: 3m54.016s

=== MOV ===

The scanned MOV video was approximately 1GB.

* DROID alone: 0m4.920s
* exiftool alone: 0m0.096s
* mediainfo alone: 0m0.276s
* FITS 0.6.2: 3m53.458s

External tools/FITS performance

2013-08-22T21:41:06Z

Mdemeo:

External tools/FITS performance

2013-08-22T19:13:55Z

Mdemeo: Began work on JVM startup lag