Preservation storage planning
Benchmarking methodology for performing preservation actions on stored AIPs
We propose to create a Gherkin feature file in the Archivematica Automated User Acceptance Tests (AMAUAT) repository that automates an experiment measuring the time and resources needed to perform a fixity check on a specified set of AIPs. The independent variables are a) the type of storage used for the AIPs at rest and b) the method of performing the fixity check. There is a precedent for using the AMAUAT to run performance experiments in the Output Capturing Performance feature.
(See also Experiments with the AMAUAT and the Archivematica Performance Experiments slides.)
Feature file that defines the experiment
Here is a rough draft of the proposed experiment-defining feature file:
Feature: Performance measurement: fixity checks and storage types

  Users, administrators and developers of Archivematica want to know how much
  money, compute resources, and time will be required to perform preservation
  actions on Archival Information Packages (AIPs), given various cloud-based
  storage options. Focusing on the "fixity check" preservation action
  initially, this feature automates the process of performing a fixity check
  on a specified set of AIPs stored in various preservation storage locations
  using various algorithms.

  Scenario Outline: Justin uses an algorithm to perform a preservation action on a set of AIPs in various types of storage locations and measures the time and compute resources consumed to accomplish that task.
    Given a set of AIPs <aip_set_id> stored in a <storage_type> preservation storage location
    When the user records a start time
    And the user launches a daemon process to monitor compute resource consumption
    And the user copies the AIP from the preservation storage location to the local compute environment
    And the user runs <algorithm> to perform <preservation_action>
    And the time required to copy the AIPs and perform <preservation_action> is measured and stored
    And the compute resources required to copy the AIPs and perform <preservation_action> are measured and stored
    # Then the least expensive strategy is to use <storage_type> storage location and the <algorithm> algorithm to perform <preservation_action>

    Examples: experimental variables (assume aip_set_id is constantly "nri_test_aips" and preservation_action is constantly "fixity check")
    | storage_type     | algorithm                                                         |
    | local filesystem | Archivematica Storage Service fixity check API endpoint           |
    | Cloud location 1 | Archivematica Storage Service fixity check API endpoint           |
    | Cloud location 2 | Archivematica Storage Service fixity check API endpoint           |
    | local filesystem | bespoke fixity check script using bagit CLI                       |
    | Cloud location 1 | bespoke fixity check script using bagit CLI                       |
    | Cloud location 2 | bespoke fixity check script using bagit CLI                       |
    | Cloud location 2 | script cross-referencing S3 ETag hashes and SS db records ...     |
The steps listed under the Scenario Outline in the draft feature file above are the clauses beginning with Given, When, Then, and And. These steps must be implemented as Python functions which use the supplied variable values (in angle brackets) to perform the appropriate steps. This section discusses two scenario variants and their salient steps.
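As a rough illustration of what such step implementations might look like, here is a minimal sketch using behave's step decorators. The function names, the SS_ALGORITHM constant and the context.experiment dictionary are hypothetical and chosen only for this example. The sketch assumes that the current Examples row can be read from context.active_outline, which is one possible way to let the copy step know which algorithm is in play. The fallback decorator exists only so the functions can be exercised where behave is not installed.

```python
try:
    from behave import given, when
except ImportError:
    # Fallback so the sketch can be exercised without behave installed:
    # the decorators simply return the function unchanged.
    def _passthrough(_pattern):
        return lambda func: func
    given = when = _passthrough

# Hypothetical label; it mirrors the wording used in the Examples table.
SS_ALGORITHM = "Archivematica Storage Service fixity check API endpoint"


@given("a set of AIPs {aip_set_id} stored in a {storage_type} "
       "preservation storage location")
def step_given_aips(context, aip_set_id, storage_type):
    # Assume the given and record the experimental variables for the log.
    context.experiment = {"aip_set_id": aip_set_id,
                          "storage_type": storage_type}


@when("the user copies the AIP from the preservation storage location "
      "to the local compute environment")
def step_copy_aip(context):
    # Assumption: the active Examples row is available on the context.
    algorithm = context.active_outline["algorithm"]
    # Vacuous when the Storage Service fetches the AIP itself; a real
    # implementation would copy the AIP in the other cases.
    context.experiment["copied_locally"] = algorithm != SS_ALGORITHM


@when("the user runs {algorithm} to perform {preservation_action}")
def step_run_algorithm(context, algorithm, preservation_action):
    # A real implementation would dispatch on the algorithm here.
    context.experiment["algorithm"] = algorithm
    context.experiment["preservation_action"] = preservation_action
```

behave substitutes the Examples values for the angle-bracket placeholders before matching, so the curly-brace parameters above receive the concrete storage type, algorithm and preservation action for each row.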
Fixity check via Storage Service
Let the default strategy be to use a local filesystem storage location and to perform fixity checking using the Archivematica Storage Service (SS). To be specific, we instantiate the step of type When the user runs <algorithm> to perform <preservation_action> to produce this instance: When the user runs <AM SS fixity check API endpoint> to perform <fixity check>. We will use Archivematica’s fixity checker client application to make requests to the Archivematica Storage Service’s fixity check API endpoint.
Notes:
- The step And the user copies the AIP from the preservation storage location to the local compute environment must be conditionally vacuous (a no-op) in this case because the SS does this for us.
- The SS uses the bagit CLI to perform the fixity check (verify this).
- The SS cannot perform multiple fixity checks in parallel, although capabilities recently added to the SS could make this possible (verify this).
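To make the timed request concrete, here is a minimal sketch that calls the fixity check endpoint directly with the standard library rather than through the fixity checker client application named above. The endpoint path (/api/v2/file/<uuid>/check_fixity/) and the ApiKey authorization header are assumptions based on a reading of the Storage Service API documentation and should be verified against the deployed version. The function names and the injectable fetch parameter are hypothetical.

```python
import json
import time
import urllib.request


def build_fixity_request(ss_url, aip_uuid, user, api_key):
    """Build a GET request for one AIP's fixity check (assumed endpoint)."""
    url = f"{ss_url.rstrip('/')}/api/v2/file/{aip_uuid}/check_fixity/"
    headers = {"Authorization": f"ApiKey {user}:{api_key}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def _http_fetch(request):
    # Send the request and decode the JSON report returned by the SS.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def timed_fixity_check(ss_url, aip_uuid, user, api_key, fetch=_http_fetch):
    """Run one fixity check and return (report, elapsed_seconds)."""
    request = build_fixity_request(ss_url, aip_uuid, user, api_key)
    start = time.perf_counter()
    report = fetch(request)
    elapsed = time.perf_counter() - start
    return report, elapsed
```

Passing a stub as fetch lets the timing logic be tested without a running Storage Service. The elapsed time measured this way covers the whole round trip, including any retrieval of the AIP that the SS performs internally.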
Fixity check via S3 interfaces & introspection
The next scenario to test may involve using a cloud object storage interface, in particular the ETag header (in the case of S3), which may provide the hash of a compressed AIP file. To be specific, we instantiate the step of type When the user runs <algorithm> to perform <preservation_action> to produce this instance: When the user runs <script cross-referencing S3 ETag hashes and SS db records> to perform <fixity check>.
Notes:
- See the AWS S3 API docs.
- For other cloud object storage interfaces (e.g. OVH and Azure), determine the process for obtaining the hash.
Questions:
- Can we issue a request to retrieve the ETags for a set of AIPs from S3 without actually copying over the AIPs?
- Assuming yes, is this information alone sufficient for performing a fixity check?
- Does the SS store the hash of the finalized AIP prior to storage? Where? In the database?
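The cross-referencing idea can be sketched as follows, under two assumptions that the questions above leave open: that an MD5 digest of each stored AIP file is available from some record (for example the SS database), and that the ETags have already been collected, for instance with S3 HEAD requests, which return headers without the object body. One known limitation is worth building in from the start: the ETag of an object uploaded in multiple parts, or encrypted with SSE-KMS or SSE-C, is not the MD5 of the object, so such ETags cannot be compared directly. All function names here are hypothetical.

```python
def normalize_etag(etag):
    """Strip the double quotes S3 places around ETag header values."""
    return etag.strip().strip('"').lower()


def etag_matches_md5(etag, expected_md5):
    """Return True/False for a plain-MD5 ETag, or None if undecidable.

    ETags of multipart uploads end in "-<part count>" and are not the MD5
    of the whole object, so they cannot be compared directly.
    """
    value = normalize_etag(etag)
    if "-" in value:
        return None
    return value == expected_md5.lower()


def cross_reference(etags_by_key, md5_by_key):
    """Compare collected ETags with stored MD5 digests, key by key.

    Keys with a stored digest but no collected ETag are left out of the
    result so that the caller can report them separately.
    """
    return {key: etag_matches_md5(etags_by_key[key], md5)
            for key, md5 in md5_by_key.items() if key in etags_by_key}
```

A result of None signals that the AIP would need a different check, such as downloading it and hashing it locally, which matters for the cost comparison because large AIPs are the ones most likely to be uploaded in multiple parts.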
Proposal: Iteration 1: Implement the default scenario
We propose to implement an executable Gherkin/behave feature file that automates the performance of the scenario characterized as the default and described in the Fixity check via Storage Service subsection above. Implementing this will give us one row in our Examples: table above. It will also build up the infrastructure to facilitate the speedier implementation of subsequent scenario variants.
Estimate & breakdown:
- Compose the storage-preservation-action-experiment.feature feature file. Much of this work has already been accomplished in the course of creating this document. The file should do the following:
  - Define a scenario describing the experiment.
  - Supply values for the default scenario.
- Implement the steps of the scenario.
  - Given a set of AIPs <nri-sample-aips> stored in a <local filesystem> preservation storage location. Assume the given. Document the variables in the experiment log.
  - Add another Given here which retrieves or assumes, and then records, details of the compute environment.
  - When the user records a start time.
  - And the user launches a daemon process to monitor compute resource consumption. Research and determine which tool or service to use for this purpose. Write code to implement it.
  - And the user copies the AIP from the preservation storage location to the local compute environment. Implement a vacuous step. Retrieve the algorithm state from the context and do nothing here.
  - And the user runs <AM SS fixity check API endpoint> to perform <fixity check>.
  - And the time required to copy the AIPs and perform <fixity check> is measured and stored.
  - And the compute resources required to copy the AIPs and perform <fixity check> are measured and stored.
- Document the feature file, the step implementations and the experiment as a whole.
- Test the experiment. Run it several times and analyze its measurements for plausibility.
- Given the results, decide the variable values for the next scenario and provide an analysis and estimate for the subsequent iteration.
- Draft report of findings and guidance on next experiments for future preservation action benchmarking (hopefully this report could be published for the wider digital preservation community as a way to start documenting costs to conduct preservation actions in the cloud and in hybrid environments).
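The measurement steps in the breakdown above (record the compute environment, record a start time, monitor resources in the background, then store the results) could be sketched with the standard library alone. This is only a placeholder for whichever monitoring tool the research step selects. The names ResourceMonitor, describe_environment and run_measured are hypothetical, and the JSON-lines experiment log is an assumed format. The default sampler, os.times, reports CPU time for this process and its waited-for children only, so it would not capture work done inside a separate Storage Service process.

```python
import json
import os
import platform
import threading
import time


class ResourceMonitor(threading.Thread):
    """Daemon thread that samples a metric at a fixed interval."""

    def __init__(self, sample, interval=1.0):
        super().__init__(daemon=True)
        self._sample = sample
        self._interval = interval
        self._stop_event = threading.Event()
        self.samples = []

    def run(self):
        # Sample first, then wait, so at least one sample is always taken.
        while True:
            self.samples.append((time.time(), self._sample()))
            if self._stop_event.wait(self._interval):
                break

    def stop(self):
        self._stop_event.set()
        self.join()


def describe_environment():
    """Collect basic details of the compute environment."""
    return {"platform": platform.platform(),
            "python": platform.python_version(),
            "cpu_count": os.cpu_count()}


def run_measured(action, variables, log_path, sample=os.times, interval=1.0):
    """Run action() under monitoring and append one record to the log."""
    monitor = ResourceMonitor(sample, interval)
    started_at = time.time()
    start = time.perf_counter()
    monitor.start()
    try:
        result = action()
    finally:
        monitor.stop()
    record = {"variables": variables,
              "environment": describe_environment(),
              "started_at": started_at,
              "elapsed_seconds": time.perf_counter() - start,
              "samples": monitor.samples}
    with open(log_path, "a", encoding="utf-8") as log:
        # default=str keeps the log writable if a sampler returns odd types.
        log.write(json.dumps(record, default=str) + "\n")
    return result, record
```

Appending one JSON record per run keeps repeated runs of the same scenario side by side in a single file, which should make the plausibility analysis of several runs straightforward.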