Ingest (0.2)

From Archivematica
Revision as of 16:23, 2 August 2012 by Sevein



AD1 Receive SIP

Archivematica UML Activity diagram AD1 Receive SIP

Expected Procedures

Case 1: Internal Producer

  • [1.1] --> [1.3] --> [1.4] --> [1.6] --> [1.8] --> [end]
    • Assumption - Submission agreements with internal producers will mandate that some type of checksum for validating the integrity of the SIP must be included in the SIP; therefore, this case should never have to invoke step 1.5 - Generate Checksum


Case 2: External Producer (may or may not include a checksum as part of the SIP)

  • [1.1] --> [1.2] --> [1.4] --> [1.6] --> [1.8] --> [end] (SIP includes checksum); or,
  • [1.1] --> [1.2] --> [1.5] --> [1.6] --> [1.8] --> [end] (SIP does not include checksum)

Exceptions

  • (for both cases) Submitted SIP includes integrity checksums and fails integrity check at [1.6]


Step Implementation Notes
1.1 Receive notification of intent to submit SIP from Producer (UC-1.1)

Receive e-mail notification from internal producer that scheduled records are ready for transfer; receive signed donor form or other agreement from external producer. Create an accession documentation folder (e.g., /home/demo/accessions/2009_0001/). Save the notification and any related records to this folder.

  • If producer is external, go to 1.2 - Assign offline storage location
  • If producer is internal, go to 1.3 - Assign network storage location
1.2 [External producer] Assign offline storage location for SIP (UC-1.1)

Select a secure physical location in which to store the media (e.g., disks, hard drives or other).

  • Go to 1.4 - Receive SIP from Producer
1.3 [Internal producer] Assign network storage location for SIP

Assign a network location where internal producers can transfer submissions

  • Create a new folder: /home/demo/ingestdropoff/2009_0001/.
  • Go to 1.4 - Receive SIP from Producer
There seems to be a problem with the diagram, because this step needs to be carried out for external producers too.
1.4 Receive SIP from Producer (UC-1.1)

Internal producer copies the SIP to /home/demo/ingest/2009_0001/; for external producer, archivist copies the SIP to that folder.

  • If SIP includes checksum, go to 1.6 - Check integrity of transferred SIP
  • If SIP does not include checksum, go to 1.5 - Generate Checksum
  • Does 1 accession = 1 SIP or multiple SIPs?
  • Does 1 SIP = 1 AIP or multiple AIPs?
  • OAIS does not dictate how SIP(s) are divided or combined into AIP(s); in fact, it explicitly allows for one-to-one, one-to-many and many-to-one relationships. See OAIS, sec. 4.3.2, Data Transformations in the Ingest Functional Area.
  • What is the purpose of the checksum at this stage? If the purpose is only to identify random file corruption that may have occurred during the transfer, this may already be handled by CRC routines in the O/S and transfer protocols.
1.5 Generate checksum
  • Select objects, right-click on objects, select scripts > makeMD5 and save report to /home/demo/ingest/2009_0001/.
  • If the SIP contains subfolders, open each subfolder and select the objects, then run the script and save the reports to the relevant folders.
  • Copy the reports to /home/demo/accessions/2009_0001/. Note that each report is automatically named checksum.md5; add identifying information to the titles to differentiate the reports.
  • Go to 1.8 - Send Confirmation of receipt to Producer
The checksum reports need to be copied to the accessions folder because the SIP will later be broken up into AIPs and we need to save copies of the original reports (I guess).
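As an illustration of what step 1.5 produces, here is a minimal Python sketch that writes an MD5 report for the files directly inside one folder. It is not the makeMD5 script itself; the function names are hypothetical, and the line format (digest, two spaces, filename) is assumed to mirror what md5sum writes.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_report(folder, report_name="checksum.md5"):
    """Write one '<digest>  <filename>' line per file directly inside folder.

    Subfolders are skipped, matching the manual procedure of running the
    script once per subfolder. The report name is the one noted in step 1.5.
    """
    folder = Path(folder)
    lines = []
    for entry in sorted(folder.iterdir()):
        # Skip subfolders and any report left over from an earlier run
        if entry.is_file() and entry.name != report_name:
            lines.append(f"{md5_of_file(entry)}  {entry.name}")
    report = folder / report_name
    report.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return report
```

Calling write_md5_report("/home/demo/ingest/2009_0001/") would write checksum.md5 into that folder; copying and renaming the report for the accession folder remains a separate step.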
1.6 Check integrity of transferred SIP (UC-1.2)
  • Select checksum report, right-click on report, select scripts > checkMD5.
  • Review report to ensure all checksums match (go to the bottom of the report; it should say "All files are OK!").
  • Copy checksum report(s) to /home/demo/accessions/2009_0001/.
  • If the integrity check fails, go to 1.7 - Request resubmission of SIP. (Note: failure of the integrity check is regarded as an exception to the expected procedure.)
  • If the integrity check passes, go to 1.8 - Send confirmation of receipt to Producer
  • Problem: we may not be able to use md5sum to do this if the producer generated checksums using a different program.
  • The checksum reports need to be copied to the accessions folder because the SIP will later be broken up into AIPs and we need to save copies of the original reports (I guess).
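The check in step 1.6 can be sketched in Python as follows. This is only an illustration, not the checkMD5 script: it assumes the report uses md5sum-style lines (digest, whitespace, filename) and that the listed files sit in the same folder as the report. A report produced by a different program, as raised in the note above, would need its own parser.

```python
import hashlib
from pathlib import Path


def check_md5_report(report_path):
    """Return (filename, status) pairs; status is 'OK', 'FAILED' or 'MISSING'."""
    report_path = Path(report_path)
    results = []
    for line in report_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum prefixes the name with '*' in binary mode
        target = report_path.parent / name
        if not target.is_file():
            results.append((name, "MISSING"))
            continue
        # Reads the whole file into memory; fine for a sketch, not for huge files
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        results.append((name, "OK" if actual == expected.lower() else "FAILED"))
    return results
```

The integrity check passes only if every status is 'OK'; any 'FAILED' or 'MISSING' entry corresponds to the exception path to step 1.7.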
1.7 Request resubmission of SIP
  • Request resubmission of SIP from Producer
  • Go to 1.4 - Receive SIP from Producer
1.8 Send confirmation of receipt to Producer (UC-1.2)
  • Send e-mail to internal producer; send e-mail or other confirmation to external producer.
  • Save copy of confirmation to the accession documentation folder
  • Go to 2.1 - Copy SIP to quarantine

AD2 Audit SIP

Archivematica UML Activity diagram AD2 Audit SIP

Expected Procedure

  • [2.1] --> [2.2] --> [2.4] --> [2.5] --> [2.8] --> [end]


Exceptions

  • malware detected at step 2.3
  • SIP not compliant at step 2.5


Step Implementation Notes
General procedures
2.1 Copy SIP to quarantine
  • Copy SIP to quarantine space
    • Create a quarantine folder (e.g., /home/demo/quarantine/).
    • Copy the SIP to this folder.
    • Use md5sum to check that all files were copied without error.
  • Wait for quarantine period to expire
  • Copy SIP to working space
    • Create a new folder: /home/demo/ingestprocessing/.
    • Copy SIP from quarantine to this folder.
    • Use md5sum to check that all files were copied without error.
    • Delete SIP from the quarantine folder (/home/demo/quarantine/).
  • Go to: 2.2 - Check SIP for malware
  • In future, quarantine will be entirely separate from the digital archives. SIPs will be copied to removable media where they will remain for a specified period (e.g., 28 days), then copied to the separate system, checked for malware, and then copied back via removable media to the digital archives.
  • A background CRC may be acceptable in lieu of an MD5 or other cryptographic hash to verify post-transfer integrity
  • This step was renamed from "Copy SIP to quarantine" to "Quarantine SIP," to make it more explicit that the quarantine period was the primary action, rather than the copying to the quarantine space. The Implementation section was modified to include the related action of moving the SIP into and out of the quarantine space.
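The copy-then-verify pattern used twice in step 2.1 could be scripted along these lines. This is a sketch under assumptions: the function names are hypothetical, MD5 is used only because the procedure names md5sum, and the destination folder is assumed not to exist yet (shutil.copytree refuses to overwrite an existing folder by default).

```python
import hashlib
import shutil
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def copy_and_verify(source, destination):
    """Copy a folder tree, then confirm that both trees hash identically."""
    shutil.copytree(source, destination)
    return tree_digests(source) == tree_digests(destination)
```

The same function would serve both moves: from the ingest folder into /home/demo/quarantine/, and later from quarantine into /home/demo/ingestprocessing/. Deleting the quarantine copy should happen only after the second call returns True.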
2.2 Check SIP for malware

[No malware-checking software is currently included in the Archivematica 0.2 release.]

  • Check SIP for the presence of malware
  • Create malware check report, copy report to the Accession Documentation folder
  • If malware is detected, go to: 2.3 - Remove malware
  • If malware is not detected, go to: 2.4 - Audit SIP for compliance

Documentation should include:

  • list of software used to detect malware
    • related virus/malware definition used to perform the check
  • date and time of check
  • reports generated by the software identifying infected files, nature of the infection
2.3 Remove malware

Attempt to remove malware from the SIP

  • Document the tools and procedures used to remove the malware
  • Document success or failure of the malware removal for each infected file
  • Copy all malware removal reports to the Accession Documentation folder

Go to: 2.4 - Audit SIP for compliance

Case: malware not removed

  • Create a plain text report (click on applications > accessories > text editor) describing the type(s) of malware, the efforts made to remove the malware and the reasons for failing to remove it. Save the report to /home/demo/accessions/2009_0001.
  • Presumably any malware removal tools generate reports on success/failure. Depending on the software used, the detection occurring in step 2.2 and the removal occurring in this step may be documented in the same report
2.4 Audit SIP for compliance (UC-1.3; UC-4.6)

Manually verify that the SIP conforms to the archives' data formatting and documentation standards and meets the specifications of the Submission Agreement. Do this by skimming the filenames and extensions to make sure that what was supposed to be in the SIP according to the Submission Agreement is actually there.

Create audit documentation

  • Create an audit documentation report as plain text file (click on applications > accessories > text editor)
    • If the SIP is wholly compliant, note this in the audit report; else
    • Document deficiencies in the SIP, identifying the nature of the deficiency (missing file extensions, unacceptable formats, unacceptable packaging, presence of unremoved malware, etc.) and the object(s) the deficiency pertains to (file or group of files, SIP packaging, etc.)
  • Save the SIP audit documentation to the Accession documentation folder (e.g., /home/demo/accessions/2009_0001/)

Go to: 2.5 - Assess SIP deficiencies

  • What else should we check for?
  • Maybe develop a checklist approach to reporting on deficiencies - for example: "contains unacceptable formats YES; records inadequately identified YES..."
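One way the checklist idea above might look in practice is sketched below in Python. Everything here is an assumption for illustration: the function name, the four checklist items, and the idea that the Submission Agreement can be reduced to a list of expected filenames and a set of accepted extensions. The audit itself remains a manual judgement; this only automates the filename and extension skim.

```python
from pathlib import Path


def audit_checklist(sip_folder, expected_names, accepted_extensions):
    """Build a plain-text YES/NO checklist from filenames and extensions."""
    sip_folder = Path(sip_folder)
    present = {p.name for p in sip_folder.rglob("*") if p.is_file()}
    expected = set(expected_names)

    missing = sorted(expected - present)
    unexpected = sorted(present - expected)
    no_extension = sorted(n for n in present if not Path(n).suffix)
    bad_format = sorted(
        n for n in present
        if Path(n).suffix and Path(n).suffix.lower() not in accepted_extensions
    )

    findings = [
        ("expected files missing", missing),
        ("files not listed in agreement", unexpected),
        ("missing file extensions", no_extension),
        ("contains unacceptable formats", bad_format),
    ]
    # One line per item: label, YES/NO, and the affected files if any
    return "\n".join(
        f"{label}: {'YES' if names else 'NO'}"
        + (f" ({', '.join(names)})" if names else "")
        for label, names in findings
    )
```

The returned text could be pasted into the plain-text audit report and saved to the accession documentation folder.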
2.5 - Assess SIP deficiencies

Based on the results of the audit performed in step 2.4 - Audit SIP for compliance, determine whether the deficiencies identified, if any, warrant rejection of the SIP, or whether the SIP can be accepted despite them. If the majority of objects conform to the standards and the Submission Agreement, the SIP may be considered acceptable. Note that the non-conforming objects can be deleted after appraisal (see AD3 step 3.9). Document the decision to accept or reject the SIP.

If SIP can be accepted for ingest,

  • Document that SIP has been accepted for ingest
  • go to: 2.8 - Notify producer of SIP acceptance

If SIP cannot be accepted for ingest, go to: 2.6 - Notify Producer of SIP rejection

2.6 Notify Producer of SIP rejection
  • Document that the SIP has been rejected, copy the report to the Accession Documentation folder
  • Send an e-mail notification, attaching a copy of the reports created in steps 2.4 or 2.5.
  • If the Producer appeals the SIP rejection:
    • Document receipt of the appeal
    • go to: 2.7 - Evaluate appeals
  • If the Producer does not appeal the SIP rejection: go to: 2.9 - Destroy SIP copies
2.7 Evaluate appeals

Evaluate the Producer's appeal of the SIP rejection


  • If the appeal is accepted:
    • Document acceptance of the appeal
    • Go to: 2.8 - Notify producer of SIP acceptance
  • If the appeal is rejected:
    • Document the rejection of the appeal
    • Notify the Producer of the appeal rejection
    • Go to 2.9 - Destroy SIP copies


2.8 Notify producer of SIP acceptance

Send Producer notification that the SIP has been accepted for ingest. Document the notification.

Go to: 3.1 - Extract content information from SIP

2.9 Destroy SIP copies

Delete SIP from /home/demo/ingest and /home/demo/quarantine/.

End ingest

AD3 Accept SIP for Ingest

Archivematica UML Activity diagram AD3 Accept SIP for Ingest

Expected Procedure

[3.1] --> [3.2] --> [3.3] --> [3.4] --> [3.5] --> [3.6] --> [3.9] --> [3.10]

Exceptions

  • Producer notification required following [3.6]
  • Appraisal decision appealed following [3.7]


Step Implementation Notes
General procedures
3.1 Unpack SIP (UC-1.3)

Create folders for the SIP contents

  • Create the following 3 directories within the working space (e.g., /home/demo/ingestprocessing):
    • /SIP_ID where ID is a unique identifier assigned to the SIP
    • /SIP_ID/content
    • /SIP_ID/PDI

Unpack the SIP

  • If the SIP uses an archive file format such as .zip, .tar, etc., extract the contents using the appropriate unpacking software.
  • Identify the content objects in the SIP and sort them into the /content directory
  • Identify the PDI and sort them into the /PDI directory

Go to: 3.2 - Modify/Provide additional PDI

Is this still needed?: Use md5sum to generate a new checksum report for the extracted files.
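The folder setup and extraction in step 3.1 might be scripted as in the following Python sketch. The function names are hypothetical, and the folder name SIP_ followed by the identifier is an assumption about how "/SIP_ID" is meant to be read. Sorting objects into content and PDI is still left to the archivist, since the procedure gives no rule for telling them apart.

```python
import shutil
from pathlib import Path


def prepare_sip_workspace(working_space, sip_id):
    """Create SIP_<id>, SIP_<id>/content and SIP_<id>/PDI in the working space."""
    sip_dir = Path(working_space) / f"SIP_{sip_id}"
    paths = {
        "sip": sip_dir,
        "content": sip_dir / "content",
        "pdi": sip_dir / "PDI",
    }
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    return paths


def extract_archive(archive_path, target_folder):
    """Extract a .zip, .tar or similar archive into target_folder.

    shutil.unpack_archive picks the format from the file extension. Only use
    it on SIPs that have already cleared the quarantine and malware steps.
    """
    shutil.unpack_archive(str(archive_path), str(target_folder))
```

For example, prepare_sip_workspace("/home/demo/ingestprocessing", "2009_0001") returns the three folder paths, and the archive can then be extracted into the content folder before the PDI is moved across by hand.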
3.2 Modify / provide additional PDI (UC-1.3)

Go to: 3.3 - Identify Formats

3.3 Identify format

Use DROID and NLNZ Metadata extractor to identify objects in the SIP. Save the reports to /home/demo/accessions/2009_0001/.

Go to: 3.4 - Validate formats

3.4 Validate format

Use JHOVE to validate the objects in the SIP. Save the report to /home/demo/accessions/2009_0001/.

Go to: 3.5 - Extract metadata

3.5 Extract metadata

Extract preservation metadata from content objects in the SIP

Go to: 3.6 - Audit submission and select for preservation

  • This step seems to have been accomplished in steps 3.4 and 3.6.

In Archivematica, the metadata extraction activity takes place at the same time as format identification, as the NLNZ Metadata extractor tool also performs format identification.

  • What is the necessary metadata that must be extracted at this point?
3.6 Audit submission and select for preservation (UC-4.6)

Based on the results of steps 3.2, 3.3, and 3.4, apply Archives policies and determine which (if any) content objects in the SIP should not be included in the AIP. Document which content objects will not be included and why.


If submission agreement requires notifying Producer of appraisal decision, go to: 3.7 - Notify Producer of appraisal decision


Else, go to: 3.9 - Destroy unselected SIP components

Possible reasons for exclusion:

  • technical
    • insufficient preservation metadata
    • unrecognized format
    • unsupported format
    • invalid format
  • appraisal
    • duplicate content
    • does file level selection occur at this point (e.g. methodological sampling of records at the file level for selective retention?) or before the SIPs get to this point?
    • further to the above, this diagram only deals with one sip at a time. How do we manage appraisal decisions that must take into consideration the relationships among many SIPs?
3.7 Notify Producer about appraisal decision(s)

Provide the Producer with copies of the appraisal report or other documentation, as required by the submission agreement, identifying SIP components to be destroyed.

If the Producer appeals the appraisal decision, go to: 3.8 - Evaluate appeals

Else, go to: 3.9 - Destroy unselected SIP components

3.8 Evaluate appeals

  • Receive the Producer's appeals
  • Evaluate the appeals
  • Based on the evaluation, make any necessary amendments to the appraisal decision
  • Notify the Producer of the outcome of the appeal process

Go to: 3.9 - Destroy unselected SIP components

3.9 - Destroy unselected SIP components

Destroy all SIP content objects identified for destruction in the appraisal decision (or amended appraisal decision, as appropriate). Document the destruction.

  • Note - it is possible that no components have been identified for destruction. In this case, move on to step 3.10
  • Do we need to destroy accompanying metadata as well?
3.10 Accept selected SIP components for ingest

AD4 Generate AIP

Archivematica UML Activity diagram AD4 Generate AIP


Step Implementation Notes
4.1 Create AIP containers

In /home/demo/ingestprocessing/ create new folders entitled 2009_0001_01, 2009_0001_02, etc.
  • With the current test set I created 3 folders, one for the accelerando website files, one for the artefactual website files and one for the rest of the files.
  • In Archivematica 0.3 we will use something like BagIt (see http://www.cdlib.org/inside/diglib/bagit/bagitspec.html) to package the AIP(s).
  • It's pretty problematic to divide the SIP at this point because we have generated so many reports that relate to the entire SIP that should be added as PDI to the AIPs. Maybe we need to divide the SIP earlier.
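Creating the numbered containers from step 4.1 is simple enough to script. The sketch below assumes the naming pattern shown above (accession number plus a two-digit suffix); the function name is hypothetical, and how many AIPs to create remains an archival decision made beforehand.

```python
from pathlib import Path


def create_aip_containers(working_space, accession_id, count):
    """Create <accession_id>_01 ... <accession_id>_<count> in the working space."""
    created = []
    for number in range(1, count + 1):
        folder = Path(working_space) / f"{accession_id}_{number:02d}"
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

For the test set described above, create_aip_containers("/home/demo/ingestprocessing", "2009_0001", 3) would create 2009_0001_01 through 2009_0001_03.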
4.2 Add Content Information to AIP

Copy the relevant contents of /home/demo/ingestprocessing/2009_0001/ to the folders created in step 4.1. In addition:
  • Copy the md5sum reports to all the AIPs.
  • Run CheckMD5 against all the reports.
  • Several of the reports will indicate that some objects have failed the check. This is because objects will be missing from some folders, having been redistributed when the SIP was divided.
  • Review the reports manually to ensure that the objects in that folder passed the check.
  • For each AIP with failed checksum results, delete the old checksum reports and create new ones.
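The false failures described above arise because each AIP holds only part of what the original report lists. As an alternative to deleting and regenerating the reports, one could trim a copy of the original report down to the files actually present in each AIP, which keeps the originally recorded digests. The Python sketch below illustrates that idea; it is not part of the documented procedure, the function name is hypothetical, and it assumes md5sum-style lines and files sitting directly in the AIP folder.

```python
from pathlib import Path


def filter_report_to_folder(report_text, aip_folder):
    """Keep only the report lines whose file exists directly in aip_folder."""
    aip_folder = Path(aip_folder)
    kept = []
    for line in report_text.splitlines():
        parts = line.split(maxsplit=1)
        # A valid line has a digest and a filename ('*' marks binary mode)
        if len(parts) == 2 and (aip_folder / parts[1].lstrip("*")).is_file():
            kept.append(line)
    return "".join(line + "\n" for line in kept)
```

Running the integrity check against the trimmed report should then fail only when a file in that AIP has actually changed.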
4.3 Transform Content Information
  • In each AIP, create a folder entitled 2009_0001_XX_normalized. Using Xena, normalize the contents of each AIP (excluding the checksum reports). Ensure that the destination folder for each normalization is this folder.
  • Save Xena log file to /home/demo/accessions/2009_0001/.
The Xena log file gets saved to the accession folder rather than to individual AIPs because Xena creates one log file for the entire process of normalizing multiple AIPs.
4.4 Add Transformed Content Information to AIP

If the destination folder is correctly set, Xena automatically saves the normalized content to the AIP.
4.5 Add PDI to AIP

Create plain text reports containing provenance and other PDI elements (including arrangement information) and save them to the AIPs.
We haven't added all the other PDI (for example, the DROID, NLNZ and JHOVE reports) to the AIPs because we created the reports for the entire SIP, not for the AIPs. This is part of the problem of processing the SIP through validation software etc. before dividing it into AIPs.
4.6 Generate Descriptive Information (UC-1.4)
  • Obtain available descriptive information from PDI added to AIP, Submission Agreement and/or records schedule, communications with donor, etc.
  • In Qubit, enter descriptive information at aggregate levels of description (fonds, series, file).
  • Upload digital objects from /home/demo/ingestprocessing/. Item-level descriptions inheriting some higher-level descriptive information will automatically be created.
  • Add additional descriptive information to item-level descriptions.
  • For instructions on using Qubit, go to the on-line user manual.
The current version of Qubit (1.0.8-dev) does not recognize certain formats and does not allow their selection in the upload screen. Hopefully this will be corrected in Archivematica 0.3.

AD5 Transfer AIP to Archival Storage

Archivematica UML Activity diagram AD5 Transfer AIP to Archival Storage


Step Implementation Notes
5.1 Request storage of AIP (UC-1.5)
5.2 Transfer AIP to Archival storage (UC-1.5)

Copy AIPs to /home/ingest/archivalstorage/. Use md5sum to ensure that the copying was done without errors.
5.3 Confirm receipt and storage of AIP (UC-1.5)
5.4 Add AIP storage location to descriptive information (UC-1.6)

In the physical storage area in Qubit, add the storage location.
5.5 Add Descriptive Information to Data Management (UC-1.6)

This was done in AD4, step 4.6. In OAIS, generating descriptive information and adding it to data management are two different steps; however, in Archivematica this is done in one step in Qubit, which is used to upload images to a web interface, generate derivatives for searching and browsing, and record descriptive information.
5.6 Confirm update of Data Management
5.7 Destroy SIP and AIP copies

Destroy copies in /home/demo/ingest/ and /home/demo/ingestprocessing/.