SIP Creation



SIP Creation Technical requirements (as opposed to all requirements, as below) for Archivematica 0.7.1

2011-05-30 Draft - diagram workflow to come.

1 Attach physically write-protected media (external hdd) containing the transfer to the processing station, e.g. an external hdd with an eSATA or SATA connection and a physical write blocker. The media could contain images and/or directories from rsync processing, as well as logs from those actions. CVA is currently limiting analysis to rsync copies since we have yet to identify a tool for analysing images. Image analysis is the ideal, and the goal as soon as a tool or tools are identified.

2 Scan for viruses. (Note - need to have a plan for exceptions if/when malware is discovered; does this remove Quarantine and Malware checking from the later workflow in Archivematica?)

3 Option 1: Make [AFF] image (e.g. transferImage.aff) and [AFF] image log (e.g. transferImage.info), and set the target directory for both on, or attached to, the processing station – ideally HDD1, which is currently unused. (We had to create a folder in the home directory and mount the drive.) Image making should include creating a checksum and verifying hashes once the imaging is complete (guymager does this, fyi). Keep log.

3 Option 2: Make rsync copy of transfer. Checksum directory copy. Store log.

4 View [physical?] write protected image content or directory copy content for appraisal.

  • must be able to view directory structure of imaged media
  • must be able to open individual files from imaged media (using file viewer tool or suite of tools)


5 Apply arrangement – name fonds, series and folders. Arrangement metadata should determine placement in ICA-AtoM (so should be in the Dublin Core metadata either now or later)

  • Identify files and/or directory branches that have been selected for inclusion in a particular SIP. Multiple SIPs may be created from a single image, but SIPs may only be generated from a single image, SIPs contents can come from multiple images/devices/media.

(Note - we envisioned using the project creation capabilities of forensic tools for this)

  • must be able to retain original order for description later

visualization tools could be useful during analysis

6 Log changes/retain original order.

7 Record high level archivist statement of selection methodology – could be external or in accession record (note in module?) CVA agrees there is no need to document all deleted files.

  • Identify commonly restricted information (for example – CC info, names, phone numbers, email addresses, bank account numbers, SSN or SIN)

(Note - how do we identify restricted files and the types of restricted info in them (or specific restricted info in them)? What do we do with the metadata once restrictions are identified?)

  • Keyword searches and bulk identification of KNOWN restrictions. For instance, a donor may tell us that anything related to their chocolate tart recipe is super secret, so we’d want to identify and segregate (if it’s to be accessible after a period of time) or delete (if it’s off limits forever) those files/folders related to chocolate tarts.

8 Create [empty] SIP on hdd2 on processing station

9 Export/copy selected files from image to content sub-directory in SIP (note: must maintain original directory names (*?) and hierarchy when copying to SIP; depth limit is being considered by project currently)

10 Package the resulting SIP(s). Package should include checksum and manifest, as well as documentation of original order of selected SIP contents

11 Assign SIP identifier (UUID) – should carry forward to AIP/DIP (VANOC specific question: do we have to assign a UUID, or can we just ID the SIP using the DC metadata file to link it to the arrangement unit; assumes use of SIP structure as exists currently in Archivematica 0.7)

12 Update existing accession record in accession module. (this could be in tandem – see number 7, eg)

13 When it comes time to process: Upload SIP to Receive SIP.
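Option 2 of step 3 asks for a checksum of the directory copy, and step 10 asks for a checksum and manifest in the package. The following is a minimal sketch of one way to do both, not a description of what Archivematica itself does. SHA-256, the manifest line layout and all function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root, POSIX style) to its digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_manifest(manifest: dict[str, str], target: Path) -> None:
    """Write one '<digest>  <relative path>' line per file (assumed layout)."""
    lines = [f"{digest}  {name}" for name, digest in sorted(manifest.items())]
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List paths that were added, removed or altered between two manifests."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Building one manifest from the rsync copy right after it is made and another before packaging, then passing both to changed_files, gives a simple check that nothing changed during appraisal. The manifest file should be written outside the directory being checksummed so that it does not end up listing itself.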


2011-05-06 Draft - diagram workflow to come.

1 Attach physically write-protected media (external hdd, optical disc, etc.) containing the transfer to the processing station, e.g. an external hdd with an eSATA or SATA connection and a physical write blocker.

2 Scan for viruses. (Note - need to have a plan for exceptions if/when malware is discovered.)

3 Option 1: Make [AFF] image (e.g. transferImage.aff) and [AFF] image log (e.g. transferImage.info), and set the target directory for both on, or attached to, the processing station – ideally HDD1, which is currently unused. (We had to create a folder in the home directory and mount the drive.) Image making should include creating a checksum and verifying hashes once the imaging is complete (guymager does this, fyi). Keep log.

3 Option 2: Make rsync copy of transfer. Checksum directory copy. Store log.

4 View image content for appraisal.

  • must be able to view directory structure of imaged media
  • must be able to open individual files from imaged media

5 Apply arrangement – name fonds, series and folders. Arrangement metadata should determine placement in ICA-AtoM (so should be in the Dublin Core metadata either now or later)

  • Identify files and/or directory branches that have been selected for inclusion in a particular SIP. Multiple SIPs may be created from a single image, but SIPs may only be generated from a single image, SIPs contents can come from multiple images/devices/media.

(Note - we envisioned using the project creation capabilities of forensic tools for this)

6 Log changes/retain original order.

7 Record high level archivist statement of selection methodology – could be external or in accession record (note in module?) CVA agrees there is no need to document all deleted files.

  • Identify commonly restricted information (for example – CC info, names, phone numbers, email addresses, bank account numbers, SSN or SIN)

(Note - how do we identify restricted files and the types of restricted info in them (or specific restricted info in them)? What do we do with the metadata once restrictions are identified?)

  • Keyword searches and bulk identification of KNOWN restrictions. For instance, a donor may tell us that anything related to their chocolate tart recipe is super secret, so we’d want to identify and segregate (if it’s to be accessible after a period of time) or delete (if it’s off limits forever) those files/folders related to chocolate tarts.

8 Create [empty] SIP on hdd2 on processing station

9 Export/copy selected files from image to content sub-directory in SIP (note: must maintain original directory names (*?) and hierarchy when copying to SIP)

10 Package the resulting SIP(s). Package should include checksum and manifest, as well as documentation of original order of selected SIP contents.

11 Assign SIP identifier (UUID) – should carry forward to AIP/DIP (VANOC specific question: do we have to assign a UUID, or can we just ID the SIP using the DC metadata file to link it to the arrangement unit; assumes use of SIP structure as exists currently in Archivematica 0.7)

12 Update existing accession record in accession module. (this could be in tandem – see number 7, eg)

13 When it comes time to process: Upload SIP to Receive SIP.
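Step 9 requires that original directory names and hierarchy survive the copy into the SIP. A minimal sketch of that copy follows, assuming the selection is available as a list of paths relative to the rsync copy (or to a mounted image). The function name, its parameters and the idea of returning the copied paths as a record of original placement are illustrative assumptions, not an existing tool.

```python
import shutil
from pathlib import Path


def copy_selection(source_root: Path, selected: list[str], objects_dir: Path) -> list[str]:
    """Copy selected relative paths into objects_dir, keeping the hierarchy.

    Returns the relative paths that were copied, in the order given, as a
    simple record of where each file sat in the source.
    """
    copied = []
    for relative in selected:
        source = source_root / relative
        if not source.is_file():
            raise FileNotFoundError(f"not a file in the source copy: {relative}")
        destination = objects_dir / relative
        # Recreate the original parent directories under the SIP content directory.
        destination.parent.mkdir(parents=True, exist_ok=True)
        # copy2 copies the content and also tries to preserve timestamps.
        shutil.copy2(source, destination)
        copied.append(relative)
    return copied
```

The returned list could be written to the SIP as part of the documentation of original order called for in step 10. A depth limit, if the project adopts one, would need an extra check before the copy.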


Requirements for forming the SIP

2011-03-25


These requirements occur at the stage between acquiring the transfer/donation from a donor and beginning the tasks that are currently part of the Ingest process in Archivematica 0.7.

  • Label and number transfer media (with VANOC, CM used the accession number and added decimal places, e.g. 2010-004.125) and record the numbering and information about the media somewhere (right now it’s a spreadsheet in TRIM). After discussion, we ultimately decided that this portion is “whatever works for our purposes” since we aren’t keeping the original media once the SIP has been processed.
  • Photograph transferred media and include the photo in the SIP – We think this might be valuable for materials with lots of metadata written on them, so we’re considering doing this in special cases only. We think we’d keep the photo in the spreadsheet with the metadata transcription and other data about the image, but we’re not sure about how we’ll parse it later.
    • Where should the photo go? The photo is metadata, so it should be included in the /SIP/metadata, not in /SIP/objects.
    • METS file should contain a pointer to the photo
    • Photo must be in a preservation-ready format when included, since it will not be normalized on ingest
    • More general problem of how do we deal with all forms of non-standard representation metadata that accompanies the SIP?


  • Attach write-protected transfer media to processing station (physically write-protect either on the media itself when possible or attach SATA write blocker. We may purchase usb write-blocker at some point in the future.)
  • Virus scan
  • Image transferred media (disk to file and disk to disk) – Currently using Clonezilla for disk-to-disk, and dd (or ddrescue, if there are errors) for disk-to-file (.iso). We would prefer to do some/all of the imaging using open source forensics tools and formats (AFF format, fiwalk?, guymager?). We are considering compressing the image; does AFF do this?
  • Bag the disk image file
  • Extract log/metadata from image – Most forensics tools output a report about the image, including the tool used to create the image, information about the source and the destination, etc. (Should this log be attached to any resulting SIPs, or to the accession record in ICA-AtoM? We’re also unclear about how/where this should be parsed.)
  • Detach and store original transfer media. In most cases, we plan to store the original until the AIP is stored and the DIP is uploaded, then wipe.
  • Attach write-protected image transfer media to processing station
  • Form SIP(s) from the transfer(s):
    • Identify password-protected files – segregate? tag? break? if break, how?
    • Identify, tag or mark in some way, confidential information (cc info, name, phone number, email address, bank acct, ssn) – (bulk extractor and/or fiwalk?)
    • Keyword searches – Helpful in finding other known files for deletion, segregation or tagging that are not found with the tool used to find confidential info and password-protected files.
    • Arrange files into logical archival fonds, series, and/or files while maintaining a record of original file structure (at Stanford, they "bookmark" using EnCase, maybe we could use fiwalk / sleuthkit?)
    • Destroy hidden files, deleted files (unless instructed otherwise by donor), obvious junk
    • Log all SIP formation processes, including arrangement decisions (fiwalk / sleuthkit?)
    • Assign UUID to each SIP
    • Create a submission agreement (TAPER?) for the SIP (use Curator’s Workbench for all/part of this task?); include instructions in submission agreement for special treatment of some formats/file types (eg digital video may use mediainfo to characterize, validate and normalize)
    • (Ultimately, the SIP that comes out of this phase is fed into "Receive SIP" as three folders: Metadata, Logs, and Objects.)


  • Create an accession record (Archivist’s Toolkit) for the SIP
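The list above ends with a SIP made of three folders (metadata, logs and objects) that carries its own UUID. One possible way to create that empty structure is sketched below. The "<label>-<UUID>" directory naming and the function name are assumptions for illustration; the notes above do not settle how the UUID is recorded.

```python
import uuid
from pathlib import Path


def create_empty_sip(parent: Path, label: str) -> Path:
    """Create an empty SIP directory with metadata, logs and objects folders.

    The directory name '<label>-<UUID>' is an assumed convention, used here
    so that the identifier travels with the SIP.
    """
    sip_id = uuid.uuid4()  # random (version 4) UUID
    sip_dir = parent / f"{label}-{sip_id}"
    for name in ("metadata", "logs", "objects"):
        # parents=True creates sip_dir on the first pass; a name clash raises an error.
        (sip_dir / name).mkdir(parents=True)
    return sip_dir
```

A photo of the transfer media, where one is taken, would then go under the metadata folder rather than under objects, as noted earlier in the list.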


SIP Creation Scenarios

Assumptions:

  • Archives will receive transfers of records from donors/creators in many ways.
  • Donors/creators will not be responsible for creating SIPs.
    • Archive will consult with individual donors/creators to find out what metadata exists about the records, and to what extent the donor is capable of providing metadata and a consistent structure for the donated records, and linking the metadata to the structure and the records

Network transfer from VanDocs (TRIM)

  1. VanDocs administrator identifies records due for transfer to archives
  2. RIM gets approval for transfers from OPR and Archives
  3. Archives assigns a network drop directory location and permissions
  4. VanDocs administrator exports records and metadata
  5. VanDocs administrator creates transfer package (1 transfer package = 1:* container packages)
  6. VanDocs administrator saves transfer package to network drop
  7. Archives registers receipt of transfer package
  8. Archives transforms container packages into SIPs based on VanDocs submission template
    1. Copy transfer package from network drop to transfer drive
    2. Connect transfer drive to Digital Archives local area network (DA-LAN)
    3. Isolate individual container packages
    4. For each container package, run software to map VanDocs export metadata to Archivematica ingest metadata
    5. Package metadata + records from corresponding container package into SIP (use Bagit?)
  9. Archives submits SIPs to Archivematica
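Step 8.4 calls for software that maps VanDocs export metadata to Archivematica ingest metadata. The sketch below shows the general shape of such a mapping under loud assumptions: that the export can be read as CSV, and that the ingest side uses Dublin Core element names. The VanDocs column names in FIELD_MAP are hypothetical placeholders, not the real export schema.

```python
import csv
import io

# Hypothetical VanDocs export columns mapped to Dublin Core element names.
# The real column names depend on the VanDocs submission template.
FIELD_MAP = {
    "Record Title": "title",
    "Author": "creator",
    "Date Created": "date",
    "Record Number": "identifier",
}


def map_record(export_row: dict) -> dict:
    """Translate one export row into Dublin Core elements, skipping blanks."""
    mapped = {}
    for source_field, dc_element in FIELD_MAP.items():
        value = (export_row.get(source_field) or "").strip()
        if value:
            mapped[dc_element] = value
    return mapped


def map_export(csv_text: str) -> list:
    """Map every row of a CSV export (assumed format) to Dublin Core."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [map_record(row) for row in reader]
```

Columns that are not listed in FIELD_MAP are dropped silently here. A production version would more likely log them, so that unmapped export metadata is noticed rather than lost.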

Network transfer from other COV system

Physical transfer of original media from donor (hdd, cd, dvd, etc.)

  1. Archives identifies media to be transferred
  2. Archives

Copy files from active donor system to archives transfer media

Internet transfer of files (ftp, e-mail attachment, etc.)

Direct capture from internet