Optimization

From Archivematica
Revision as of 16:01, 11 February 2020 by Sallain (talk | contribs)


This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

Introduction

Archivematica is a complex application with many moving parts that can be used in a wide variety of configurations to process a wide variety of workflows. The topic of optimization therefore has many aspects and many possible approaches. This page attempts to document some common use cases and different techniques for improving performance.

Definition

One measure of performance in an Archivematica installation is throughput: the raw number of files, or gigabytes of data, that can be processed from initial transfer to safe storage as an AIP per unit of time. Depending on the composition of the original materials, the configuration of Archivematica and its components, and the hardware resources available, throughput, measured in GB processed per hour, can vary dramatically. It is difficult to accommodate all possible permutations, so this discussion starts with specific use cases and examines options for improving performance in each.
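As a minimal sketch of the metric, throughput in GB/hour can be computed from a byte count and the elapsed wall-clock time. The figures in the example are hypothetical, not measurements from any particular installation:

```python
def throughput_gb_per_hour(bytes_processed: int, elapsed_seconds: float) -> float:
    """Convert a byte count and a wall-clock duration into GB per hour."""
    gigabytes = bytes_processed / 1e9
    hours = elapsed_seconds / 3600
    return gigabytes / hours

# Hypothetical example: a 110 GB transfer that took 4 hours end to end.
print(throughput_gb_per_hour(110e9, 4 * 3600))  # 27.5 GB/hour
```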

Measuring Performance

There are a number of ways to measure performance. Counting the number of files stored in AIPs each day is one, although there can be large delays, for example when Archivematica is waiting on user input and the system is idle. Rough counts of GB in storage per week or month are still useful, as they measure actual results.
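One rough way to take such a measurement is to sum the sizes of files recently written to the file system backing the AIP Storage Location. The sketch below is not an Archivematica tool; it assumes you can point it at that directory and that file modification times are a reasonable proxy for when AIPs landed in storage:

```python
import os
import time

def stored_gb_since(aip_store: str, days: float) -> float:
    """Sum the sizes (in GB) of files under aip_store modified in the last
    `days` days, as a rough measure of recent storage throughput."""
    cutoff = time.time() - days * 86400
    total_bytes = 0
    for root, _dirs, files in os.walk(aip_store):
        for name in files:
            st = os.stat(os.path.join(root, name))
            if st.st_mtime >= cutoff:
                total_bytes += st.st_size
    return total_bytes / 1e9
```

For a weekly count, call `stored_gb_since(path, 7)` against the AIP storage mount.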


Use Cases

Single Large File per Transfer

Materials

The user has large files, such as video files ranging from 10 GB to several hundred GB each. Each Transfer consists of a single file.

Hardware Configuration

Assume separate file systems for the Transfer Source Location, the Processing Location, and the AIP Storage Location. These could be mounted via NFS or CIFS; the main point is that separate hardware, somewhere, provides each file system.

The Archivematica server has 21 CPU cores and 192 GB of RAM. The file systems are served from a high-capacity Isilon system.

Example Performance

Start with a sample transfer consisting of a single .mov file, 110 GB in size. The Transfer Source Location is configured in the Storage Service in Space A, which corresponds to a CIFS mount. The Processing Location is configured in Space B, which corresponds to an NFS mount. The AIP Storage Location is configured in Space A.

Based on timing sample transfers, there are two main bottlenecks in this scenario:

  1. checksumming
  2. moving/copying files
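Both bottlenecks can be timed independently of Archivematica to see what the underlying storage sustains on its own. The helper names below are hypothetical, not part of Archivematica: one times a chunked SHA-256 checksum, the other times a plain file copy. Dividing the file size by each duration gives the effective throughput of that step:

```python
import hashlib
import shutil
import time

def time_checksum(path: str, algo: str = "sha256",
                  chunk: int = 8 * 1024 * 1024) -> tuple[str, float]:
    """Checksum `path` in fixed-size chunks; return (digest, seconds)."""
    start = time.monotonic()
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest(), time.monotonic() - start

def time_copy(src: str, dst: str) -> float:
    """Copy src to dst; return the elapsed seconds."""
    start = time.monotonic()
    shutil.copyfile(src, dst)
    return time.monotonic() - start
```

Running these against a large file on the CIFS mount versus the NFS mount shows which file system, not Archivematica itself, is limiting each step.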