Difference between revisions of "Improvements/Disk Image Preservation"

Latest revision as of 16:55, 11 February 2020

This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

Synopsis

This project is being sponsored by UCLA Library and NYPL Special Collections but more collaborators are welcome! Please get in touch on the community user forum.

Different tools are needed to extract the contents of a disk image, depending on the file system the image was created from. Archivematica's standard tool for disk image extraction is tsk_recover, which supports only 18 file systems. The difficulty in invoking the right tool for the job is that Archivematica selects tools by file format rather than by other characteristics (in this case, the file system). For example, if a disk image is identified as an ISO image, Archivematica currently invokes tsk_recover regardless of the image's file system, whereas tsk_recover can extract contents only from those 18 file systems.

In this project we are particularly focused on HFS disk images, which tsk_recover does not currently support, but the development will be generic enough to be useful for other types of disk images as well.

User story

As an archivist, I would like to process disk images of an unknown file system through Archivematica, and have Archivematica and/or its associated tools recognize and record the file system and choose appropriate tools for disk image extraction and characterization. I would also like to be able to pull statistics about the size and file type of disk images from the system.

Development tasks

We have identified the following development tasks which would need to be addressed in the order described:

Recommendation: upgrade Sleuthkit [1 support ticket]

Upgrade from version 4.1.3 (three years old) to 4.4 (most recent release).

Pre-ingest script [1 support ticket]

Develop a pre-ingest script which would identify the file system and store this metadata in such a way that it can be passed to Archivematica. This could be made part of an automation tool script.

Update: after a first iteration we have decided to take a different approach. See Second iteration, below.

File identification script [1 support ticket]

Currently, the file identification scripts run in Archivematica will identify the type of disk image but not the file system of the disk image. This script will use the data from the pre-ingest script to identify both the disk image type and the file system.

Results

As a first iteration, we used identify by extension and created an FPR entry for raw disk image with HFS:

New FormatVersion: Format: Raw Disk Image; Description: Raw Disk Image (HFS filesystem)

Modify Identify by File Extension like [2]

Command:


from __future__ import print_function
import os.path
import subprocess
import sys

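def file_tool(path):
    # Describe the file with file(1); universal_newlines returns text, not bytes.
    return subprocess.check_output(['file', path], universal_newlines=True).strip()

def blkid(path):
    # Probe the image with blkid; return '' if blkid is missing or finds nothing.
    try:
        return subprocess.check_output(['blkid', '-o', 'full', path], universal_newlines=True)
    except Exception:
        return ''

(_, extension) = os.path.splitext(sys.argv[1])

if extension:
    print(extension, end='')
    # A one-element tuple needs its trailing comma outside the string.
    if extension in ('.img',):
        output = blkid(sys.argv[1])
        if 'TYPE="hfs"' in output:
            # Together with the extension this prints ".img (hfs)".
            print(' (hfs)')
else:
    # Plaintext files frequently have no extension, but are common to identify.
    # file is pretty smart at figuring these out.
    file_output = file_tool(sys.argv[1])
    if 'text' in file_output:
        print('.txt')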

New ID Rule: Format: Raw Disk Image (HFS filesystem); Command: Identify by File Extension; Output: ".img (hfs)"

Disable Rule with output ".img", or modify to identify as Raw Disk Image

[1] https://github.com/cul-it/hfs2dfxml

[2] https://github.com/artefactual/archivematica-fpr-tools/blob/dev/issue-10818-hfs-disk-image/id/file-by-extension.py

Characterize [1 support ticket]

Write meaningful characterization about the size and file type of the disk images so that statistics can be gathered from the AIPs. Currently, when fiwalk is run as the characterization tool for a disk image, dfxml is written in premis:objectCharacteristicsExtension.

Results

For characterization, hfs2dfxml [4] provides a nice XML output with metadata about the image. However, it is not packaged for Ubuntu either. To install it, follow the instructions in the README [5]: install hfsutils and python-magic, clone the repository, and clone the dfxml dependency into the correct location inside the repository.

Without a patch, however, it can only be run from inside the repository. Either clone my fork [6] and switch to the patch-1 branch, or apply the patch [7] yourself.

FPR changes

Set up file identification FPR changes
Install & patch hfs2dfxml somewhere Archivematica can run it from
Disable "Delete packages after extraction"
New FPR Tool: Description: hfs2dfxml; Version: git commit hash
New Characterization Command:
- Tool: hfs2dfxml
- Description: hfs2dfxml characterization
- Script Type: bash
- Command:


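# Write the DFXML to a uniquely named temporary file.
output="/tmp/temp_$(uuid -v4)"
# Debug aid: report the user the command runs as, on stderr so stdout stays XML.
id >&2
python /home/users/hbecker/bin/hfs2dfxml/hfs2dfxml/hfs2dfxml.py "%fileFullName%" "$output"
cat "$output"
rm "$output"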

- Output Format: Text (Markup): XML: XML
- Command Usage: Characterization

New Characterization Rule: Purpose: Characterization; Format: Raw Disk Image (HFS filesystem); Command: hfs2dfxml

However, that setup currently generates an error when run through Archivematica: "_call_hmount error: Failed to initialize HFS working directories: Permission denied". hfs2dfxml itself runs, but fails when it tries to call hfsutils. This requires further investigation.

[4] https://github.com/cul-it/hfs2dfxml
[5] https://github.com/cul-it/hfs2dfxml/blob/master/README.md
[6] https://github.com/Hwesta/hfs2dfxml/tree/patch-1
[7] https://github.com/cul-it/hfs2dfxml/pull/7/files

File extraction [1 support ticket]

Implement tools such as HFS Utilities that will allow files to be extracted from HFS disk images.

Results

fiwalk does not recognize the filesystem, and cannot extract from it.
hfsutils provides the hmount and hcopy commands, but hcopy is not recursive.
tsk_recover cannot recognize the filesystem, outputting "Cannot determine file system type (Sector offset: 0)Files Recovered: 0".
hfsexplorer [3] provides a command line extraction tool for HFS filesystems. However, hfsexplorer is not packaged for Ubuntu, and must be installed manually.

To install hfsexplorer, download and extract it. By default it uses a GUI, but a command line interface is accessible from the hfsx.sh script. The script we want is unhfs.sh, which extracts files from the image.

FPR changes:

To handle extraction, use hfsexplorer's unhfs command to extract all files from the hfs partition.

Set up file identification FPR changes
Install hfsexplorer somewhere Archivematica can run it from
New FPR Tool: Description: hfsexplorer; Version: 0.23.1
New Extraction Command:
- Tool: hfsexplorer
- Description: unhfs
- Script Type: bash
- Command:


mkdir "%outputDirectory%" 
/home/users/hbecker/bin/hfsexplorer/bin/unhfs.sh -v -o "%outputDirectory%" "%inputFile%"

- Output location: outputDirectory
- Command Usage: Extraction
New Extraction Rule: Purpose: Extract; Format: Raw Disk Image (HFS filesystem); Command: unhfs

[3] http://www.catacombae.org/hfsexplorer/

Deployment of development above for testing [1 support ticket]

Reporting [?? support tickets, depends on scope]

Explore reporting possibilities either in Archivematica or using other reporting tools.

Second iteration

The goal of the second iteration is to get around the need to identify by extension. This will allow the disk image itself and its contents to be properly identified using a normal identification tool. For the purposes of this project we'll do it using Siegfried.

Steps to the second iteration:

test analyzer script from CCA Disk Image Processor[1] on disk image samples
verify that for our samples, the output csv file has identified the file system
integrate the script into the Siegfried file identification script running within Archivematica. This could be developed to run any time Siegfried is run as the file identification tool, or as a special configuration using the Disk Image transfer type.
this will produce a special format (e.g., "img-hfs" or similar) which will need to be added to the FPR in order to trigger the desired tools in later micro-services.

[1] Script from CCA Disk Image Processor can be run outside the GUI: https://github.com/timothyryanwalsh/cca-diskimageprocessor/blob/master/diskimageanalyzer.py
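
The last two steps can be pictured as a thin wrapper around the normal identification result. The sketch below is only an illustration under stated assumptions: it uses blkid as the file system probe (the same call as the first-iteration script, standing in for the CCA analyzer script), it treats "UNKNOWN" as the identifier's "no match" value, and the "img-hfs" label and all function names are hypothetical.

```python
import re
import subprocess

# Hypothetical special-format labels, keyed by file system type.
SPECIAL_FORMATS = {"hfs": "img-hfs"}


def parse_blkid_type(blkid_output):
    """Return the value of the standalone TYPE="..." field, or None."""
    # \b keeps SEC_TYPE= and PTTYPE= from matching.
    match = re.search(r'\bTYPE="([^"]+)"', blkid_output)
    return match.group(1) if match else None


def probe_file_system(path):
    """Ask blkid for the file system type of a disk image; None if unknown."""
    try:
        output = subprocess.check_output(
            ["blkid", "-o", "full", path], universal_newlines=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_blkid_type(output)


def refine_identification(primary_id, path, probe=probe_file_system):
    """Keep the primary result unless it is UNKNOWN and the probe
    recognises a file system that has a special format."""
    if primary_id != "UNKNOWN":
        return primary_id
    return SPECIAL_FORMATS.get(probe(path), primary_id)
```

The probe is passed in as a parameter so that a different analyzer could replace blkid without touching the decision logic.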

Results of work as of May 9, 2017

The HFS disk image extraction and characterization functionality should work on http://am16hfs.archivematica.org/transfer/ and it should also work on a new system that is running Archivematica at branch dev/issue-10818-hfs-disk-images.

To install such a system using Archivematica's deploy-pub Vagrant repo, modify the vars-singlenode-1.6.yml file so that ``archivematica_src_am_version`` is set to ``"dev/issue-10818-hfs-disk-images"`` and provision::

   $ vagrant provision

In order to get the new HFS disk image functionality to work in Archivematica, the following tools and their dependencies must be manually installed:

- hfs2dfxml
- hfsexplorer

The following instructions were used on Ubuntu 14.04.

To install hfsutils::

   $ sudo apt-get update
   $ sudo apt-get install hfsutils

Install python-magic. WARNING: it's important that you do NOT install this python-magic: https://github.com/ahupp/python-magic. Instead, you must install the python-magic (see https://github.com/threatstack/libmagic) that Ubuntu installs when you call the following::

   $ sudo apt-get install python-magic

If your install has separate virtual environments for each Archivematica component, then the MCPClient virtualenv needs to have magic installed also::

   $ sudo ln -s /usr/lib/python2.7/dist-packages/magic.py /usr/share/python/archivematica-mcp-client/lib/python2.7/site-packages/magic.py

Install (Holly Becker's patch of) the hfs2dfxml source in your home directory on the machine where Archivematica is installed::

   $ pwd
   /home/vagrant
   $ mkdir bin
   $ cd bin
   $ git clone https://github.com/Hwesta/hfs2dfxml
   $ cd hfs2dfxml/hfs2dfxml
   $ git checkout -t origin/patch-1
   $ git clone https://github.com/simsong/dfxml/

Install hfsexplorer:

- Go to http://www.catacombae.org/hfsexplorer/
- Download the ZIP at link: Download application as ZIP file (cross-platform)
- Extract ZIP into directory accessible by Archivematica (suggested: extract into /home/vagrant/hfsexplorer)
- Make sure that the hfsexplorer directory is in the directory that extraction command expects it to be, i.e., /usr/local/hfsexplorer/bin/unhfs.sh

Also make sure to modify the default processing config so that Siegfried is the file identification command during transfer and packages are not deleted after extraction: this allows the disk image to be characterized after its contents have been extracted.

Results of work as of May 3, 2017

1. Added a new characterization command called "Fiwalk fallback to hfs2dfxml" which attempts to use fiwalk for characterizing disk images and, if that fails, attempts to use hfs2dfxml. Also set the "Output File Format" to "XML".

2. Created a new extraction command for disk images called "tsk_recover fallback unhfs" which attempts to extract a disk image using tsk_recover and then falls back to using unhfs if that fails. Set output file format to JSON so that the output can alter the default tool from "Sleuthkit" to "hfsexplorer" when appropriate.

3. Modified the Siegfried fpr_idcommand so that it attempts to run blkid when it would otherwise return 'UNKNOWN'; if blkid indicates that the file is an HFS disk image, then the fpr_idcommand returns ".img (hfs)", cf. the new fpr_idrule below.

4. Created a new fpr_idrule that connects the Siegfried id command to the UCLA HFS format version, via the special ".img (hfs)" PUID/extension:

- Format: Disk Image: HFS Disk Image (HFS filesystem): UCLA HFS Disk Image
- Command: Siegfried version 1.6.7 PUID runs identify using Siegfried
- Command output: .img (hfs)

5. Modified the source of MCPClient/clientScripts/identifyFileFormat.py so that it can handle Siegfried-based identification returning a pronom id that is actually more akin to a file extension, in this case the special ".img (hfs)" output.

6. Modified MCPClient/clientScripts/extractContents.py so that the tool used in extraction will be listed in the eventDetail. In the HFS extraction case, this will be "hfsexplorer".

7. Ran a transfer on artefactual/227_026/ and it was identified as an HFS disk image (by hfs2dfxml) and it was extracted using unhfs.
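
The fallback behaviour in item 2 could be sketched roughly as follows. This is not the command that was added to the FPR; it is a minimal illustration assuming that success means "exit code 0 and a non-empty output directory", that tsk_recover is called with -a (allocated files only), and that unhfs.sh lives at the path given in the install notes. The function names and the JSON keys are hypothetical.

```python
import json
import os
import subprocess

# Assumed install location, taken from the install notes on this page.
UNHFS = "/usr/local/hfsexplorer/bin/unhfs.sh"


def extraction_attempts(image, output_dir):
    """Ordered (tool name, argv) pairs: tsk_recover first, unhfs second."""
    return [
        ("Sleuthkit", ["tsk_recover", "-a", image, output_dir]),
        ("hfsexplorer", [UNHFS, "-v", "-o", output_dir, image]),
    ]


def extract(image, output_dir, run=subprocess.call):
    """Try each tool in turn and report which one produced files."""
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    for tool, argv in extraction_attempts(image, output_dir):
        # Assumption: a tool succeeded only if it exited with 0 and
        # left at least one entry in the output directory.
        if run(argv) == 0 and os.listdir(output_dir):
            return {"tool": tool, "success": True}
    return {"tool": None, "success": False}


def report(result):
    """Serialise the result as JSON so a caller can record the tool name."""
    return json.dumps(result, sort_keys=True)
```

Calling report(extract(image, output_dir)) prints a JSON object whose "tool" value names the tool that actually did the work, which is the information item 6 records in the eventDetail.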

Things to Note:

a. The METS file will list "No Matching Format" for the format identification event for the disk image. This is because the FormatVersion that Siegfried/blkid is assigning to the HFS disk image has no pronom id. This could be "fixed" by assigning a fake pronom id to that FormatVersion, as seems to have been done in the past in Archivematica::

mysql> select description, pronom_id from fpr_formatversion where pronom_id like 'archivematica%';

  +-------------------------------------+---------------------+
  | description                         | pronom_id           |
  +-------------------------------------+---------------------+
  |  Raw Disk Image (HFS filesystem)    | archivematica-fmt/5 |
  +-------------------------------------+---------------------+

b. Still left to do is the creation of Ansible tasks to install the tools and their dependencies: unhfs, hfsutils, hfs2dfxml.

Third iteration

The goal of a third iteration would be to remove the need for a special format, and instead have extraction and characterization tools chosen based on file system information rather than format. There are two possibilities for triggering FPR rules based on file system type rather than file format:

as part of Disk Image transfer type workflow
based on the existence of analysis.csv file in the transfer
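
The second possibility might look something like the sketch below, which reads an analysis.csv and picks an extraction tool per disk image from its file system. The column names ("Disk image", "File system"), the tool mapping and the function names are all assumptions made for illustration; the real analysis.csv layout should be checked before relying on them.

```python
import csv
import io

# Hypothetical mapping from file system name to extraction tool.
EXTRACTORS = {"hfs": "unhfs"}
DEFAULT_EXTRACTOR = "tsk_recover"


def file_systems_from_csv(csv_text, image_column="Disk image",
                          fs_column="File system"):
    """Map each disk image name to its lower-cased file system.

    The default column names are assumptions, not the confirmed
    analysis.csv headers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[image_column]: row[fs_column].strip().lower() for row in reader}


def choose_extractor(file_system):
    """Pick a tool by file system, falling back to the default tool."""
    return EXTRACTORS.get(file_system, DEFAULT_EXTRACTOR)


def plan_extraction(csv_text):
    """Return {disk image: extraction tool} for every row in the CSV."""
    return {
        image: choose_extractor(fs)
        for image, fs in file_systems_from_csv(csv_text).items()
    }
```

Keeping the mapping in a plain dictionary means supporting another file system is a one-line change rather than a new format in the FPR.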

As part of a third iteration we could also look at converting the contents of analysis.csv to dfxml and adding it to the METS file.

Tools/resources

HFS Utilities
hfs2dfxml - Utility to parse hfsutils output and produce DFXML for HFS-formatted disk images
CCA Disk Image Processor - Creates ready-to-ingest SIPs from a directory of disk images and related files.

Example dfxml

This is how dfxml is currently written by fiwalk in the Archivematica METS file:


<premis:objectCharacteristicsExtension>
  <dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML" xmlns:dc="http://purl.org/dc/elements/1.1/" version="1.0">
    <metadata>
      <dc:type>Disk Image</dc:type>
    </metadata>
    <creator version="1.0">
      <program>fiwalk</program>
      <version>4.1.3</version>
      <build_environment>
        <compiler>GCC 4.8</compiler>
        <library name="afflib" version="3.6.6"/>
        <library name="libewf" version="20130416"/>
      </build_environment>
      <execution_environment>
        <command_line>fiwalk -x /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso -c /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt</command_line>
        <start_time>2017-01-23T21:13:53Z</start_time>
      </execution_environment>
    </creator>
    <!-- Reading configuration file /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt -->
    <!-- pattern: *  method: dgi  path: python /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/pronom_ident.py -->
    <source>
      <image_filename>/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso</image_filename>
    </source>
    <!-- fs start: 0 -->
    <volume offset="0">
      <partition_offset>0</partition_offset>
      <block_size>2048</block_size>
      <ftype>2048</ftype>
      <ftype_str>iso9660</ftype_str>
      <block_count>6047</block_count>
      <first_block>0</first_block>
      <last_block>6046</last_block>
      <fileobject>
        <filename>.</filename>
        <partition>1</partition>
        <id>1</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <filename>..</filename>
        <partition>1</partition>
        <id>2</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <filename>799PX_EU.BMP</filename>
        <partition>1</partition>
        <id>3</id>
        <name_type>r</name_type>
        <filesize>1437654</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>1</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="51200" img_offset="51200" len="1437654"/>
        </byte_runs>
        <hashdigest type="md5">4829f38a294d156345922db8abd5e91c</hashdigest>
        <hashdigest type="sha1">1bde6c981776d81c13fd657621b2d3c8359d1761</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>BBHELMET.AI</filename>
        <partition>1</partition>
        <id>4</id>
        <name_type>r</name_type>
        <filesize>1080282</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>2</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="1488896" img_offset="1488896" len="1080282"/>
        </byte_runs>
        <hashdigest type="md5">c14bda842e2889a732e0f5f9d8c0ae73</hashdigest>
        <hashdigest type="sha1">98ce1ae12ee18893e8e1bd738855b6e20cd7b5ef</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>G31DS.TIF</filename>
        <partition>1</partition>
        <id>5</id>
        <name_type>r</name_type>
        <filesize>125968</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>3</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2570240" img_offset="2570240" len="125968"/>
        </byte_runs>
        <hashdigest type="md5">1ea4939968f117de97b15437c6348847</hashdigest>
        <hashdigest type="sha1">d4c23ce4fecf17c8b952f98ed1cadc22a3d7399f</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>LION.SVG</filename>
        <partition>1</partition>
        <id>6</id>
        <name_type>r</name_type>
        <filesize>18324</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>4</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2697216" img_offset="2697216" len="18324"/>
        </byte_runs>
        <hashdigest type="md5">e5913bebe296eb433fdade7400860e73</hashdigest>
        <hashdigest type="sha1">efe2c396a4ad46bab873f58eef4dbe6607be030c</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>NEMASTYL.PNG</filename>
        <partition>1</partition>
        <id>7</id>
        <name_type>r</name_type>
        <filesize>2050617</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>5</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2715648" img_offset="2715648" len="2050617"/>
        </byte_runs>
        <hashdigest type="md5">0b0f9676ead317f643e9a58f0177d1e6</hashdigest>
        <hashdigest type="sha1">5d588800a5d5bd1ebe76ff2cbce0568a7f2dd386</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>OAKLAND0.JP2</filename>
        <partition>1</partition>
        <id>8</id>
        <name_type>r</name_type>
        <filesize>527345</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>6</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="4767744" img_offset="4767744" len="527345"/>
        </byte_runs>
        <hashdigest type="md5">04f7802b45838fed393d45afadaa9dcc</hashdigest>
        <hashdigest type="sha1">5a7eb88804b0783e3e3fe208cd10085954173c0a</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>PICTURES</filename>
        <partition>1</partition>
        <id>9</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>7</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest>
        <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest>
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/.</filename>
        <partition>1</partition>
        <id>10</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>7</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest>
        <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest>
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/..</filename>
        <partition>1</partition>
        <id>11</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/LANDING_.JPG</filename>
        <partition>1</partition>
        <id>12</id>
        <name_type>r</name_type>
        <filesize>1361321</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>10</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="6453248" img_offset="6453248" len="1361321"/>
        </byte_runs>
        <hashdigest type="md5">0ff111013ad2f8ded1171cee683e718a</hashdigest>
        <hashdigest type="sha1">6aa382a1f8fdb23b7e9f3823ae655ce405b68f9e</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/MARBLES.TGA</filename>
        <partition>1</partition>
        <id>13</id>
        <name_type>r</name_type>
        <filesize>4261301</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>11</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="7815168" img_offset="7815168" len="4261301"/>
        </byte_runs>
        <hashdigest type="md5">d5e100eb19481b8b7f05ac8cc3fd4e26</hashdigest>
        <hashdigest type="sha1">262f203a14d193e199a024e1e567579b5c22f110</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>VECTOR_N.EPS</filename>
        <partition>1</partition>
        <id>14</id>
        <name_type>r</name_type>
        <filesize>1041114</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>8</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="5296128" img_offset="5296128" len="1041114"/>
        </byte_runs>
        <hashdigest type="md5">8dd3a652970aa7f130414305b92ab8a8</hashdigest>
        <hashdigest type="sha1">66280f092b775f132d8fbff84b2226fcaf5d3dce</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>WFPC01.GIF</filename>
        <partition>1</partition>
        <id>15</id>
        <name_type>r</name_type>
        <filesize>113318</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>9</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="6338560" img_offset="6338560" len="113318"/>
        </byte_runs>
        <hashdigest type="md5">2eb15cb1834214b05d0083c691f9545f</hashdigest>
        <hashdigest type="sha1">bf8addf8b2fc09a9bf1ecc9e2c6c5a3b4453b24a</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>$OrphanFiles</filename>
        <partition>1</partition>
        <id>16</id>
        <name_type>d</name_type>
        <filesize>0</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>12</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
      </fileobject>
    </volume>
    <!-- end of volume -->
    <!-- clock: 1.908907 -->
    <rusage>
      <utime>0.112000</utime>
      <stime>0.040000</stime>
      <maxrss>30588</maxrss>
      <minflt>1553</minflt>
      <majflt>2</majflt>
      <nswap>0</nswap>
      <inblock>416</inblock>
      <oublock>23544</oublock>
      <clocktime>1.908907</clocktime>
      <!-- stop_time: Mon Jan 23 21:13:55 2017 -->
    </rusage>
  </dfxml>
</premis:objectCharacteristicsExtension>
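
For the reporting goal in the user story, DFXML like the above can be summarised with a few lines of Python. The sketch below is one possible approach, not part of Archivematica: it assumes the dfxml element has already been pulled out of the METS file, counts only fileobject entries whose name_type is "r" (regular files), and uses a trimmed sample built from three of the entries above. The function name and the shape of the returned dictionary are hypothetical.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by the dfxml element in the example above.
DFXML_NS = "{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}"

# Trimmed sample based on three fileobject entries from the example.
SAMPLE = """<dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML" version="1.0">
  <volume offset="0">
    <fileobject><filename>.</filename><name_type>d</name_type><filesize>2048</filesize></fileobject>
    <fileobject><filename>799PX_EU.BMP</filename><name_type>r</name_type><filesize>1437654</filesize></fileobject>
    <fileobject><filename>PICTURES/LANDING_.JPG</filename><name_type>r</name_type><filesize>1361321</filesize></fileobject>
  </volume>
</dfxml>"""


def summarize_dfxml(xml_text):
    """Count regular files, total their size, and tally extensions."""
    root = ET.fromstring(xml_text)
    total_bytes = 0
    extensions = Counter()
    for fobj in root.iter(DFXML_NS + "fileobject"):
        # Skip directories and anything else that is not a regular file.
        if fobj.findtext(DFXML_NS + "name_type") != "r":
            continue
        total_bytes += int(fobj.findtext(DFXML_NS + "filesize", default="0"))
        name = fobj.findtext(DFXML_NS + "filename", default="")
        extensions[os.path.splitext(name)[1].lower() or "(none)"] += 1
    return {
        "files": sum(extensions.values()),
        "bytes": total_bytes,
        "extensions": dict(extensions),
    }


summary = summarize_dfxml(SAMPLE)
```

For the sample, summary reports 2 files totalling 2798975 bytes, one ".bmp" and one ".jpg". File extensions are only a rough proxy for file type; a fuller report would use the format identification recorded elsewhere in the METS.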

Difference between revisions of "Improvements/Disk Image Preservation"

Latest revision as of 16:55, 11 February 2020

Contents

Synopsis[edit]

User story[edit]

Development tasks[edit]

Recommendation: upgrade Sleuthkit [1 support ticket][edit]

Pre-ingest script [1 support ticket][edit]

File identification script [1 support tickets][edit]

Characterize [1 support ticket][edit]

File extraction [1 support ticket][edit]

Deployment of development above for testing [1 support ticket][edit]

Reporting [?? support tickets, depends on scope][edit]

Second iteration[edit]

Third iteration[edit]

Tools/resources[edit]

Example dfxml[edit]

See also[edit]

Navigation menu

Search

@@ Line 1: / Line 1: @@
+<div style="padding: 10px 10px; border: 1px solid black; background-color: #F79086;">This page is no longer being maintained and may contain inaccurate information. Please see the [https://www.archivematica.org/docs/latest/ Archivematica documentation] for up-to-date information. </div> <p>
 == Synopsis ==
@@ Line 171: / Line 173: @@
 == Second iteration ==
+The goal of the second iteration is to get around the need to identify by extension. This will allow the disk image itself and its contents to be properly identified using a normal identification tool. For the purposes of this project we'll do it using Siegfried.
+Steps to the second iteration:
+* test analyzer script from [https://github.com/timothyryanwalsh/cca-diskimageprocessor CCA Disk Image Processor][1] on disk image samples
+* verify that for our samples, the output csv file has identified the file system
+* integrate the script in to the Siegfried File identification script running within Archivematica. This could be developed to run anytime Siegfried is run as the file identification tool or as a special configuration using the Disk Image transfer type.
+* this will produce a special format (e.g., "img-hfs" or similar) which will need to be added to the FPR in order to prompt the desired tools for further mirco-services.
+[1] Script from CCA Disk Image Processor can be run outside the GUI: https://github.com/timothyryanwalsh/cca-diskimageprocessor/blob/master/diskimageanalyzer.py
+'''Results of work as of May 9, 2017'''
+The HFS disk image extraction and characterization functionality should work on http://am16hfs.archivematica.org/transfer/ and it should also work on a new system that is running Archivematica at branch dev/issue-10818-hfs-disk-images.
+To install such a system using Archivematica's deploy-pub Vagrant repo, modify the vars-singlenode-1.6.yml file so that ``archivematica_src_am_version`` is set to ``"dev/issue-10818-hfs-disk-images."`` and provision::
+    $ vagrant provision
+In order to get the new HFS disk image functionality to work in Archivematica, the following tools and their dependencies must be manually installed:
+- hfs2dfxml
+- hfsexplorer
+The following instructions were used on Ubuntu 14.04.
+To install hfsutils::
+    $ sudo apt-get update
+    $ sudo apt-get install hfsutils
+Install python-magic. WARNING: it's important that you do NOT install this python-magic: https://github.com/ahupp/python-magic. Instead, you must install the python-magic (see https://github.com/threatstack/libmagic) that Ubuntu installs when you call the following::
+    $ sudo apt-get install python-magic
+If your install has separate virtual environments for each Archivematica component, then the MCPClient virtualenv needs to have magic installed also::
+    $ sudo ln -s /usr/lib/python2.7/dist-packages/magic.py /usr/share/python/archivematica-mcp-client/lib/python2.7/site-packages/magic.py
+Install (Holly Becker's patch of) the hfs2dfxml source in your home directory
+on the machine where Archivematica is installed::
+    $ pwd
+    /home/vagrant
+    $ mkdir bin
+    $ cd bin
+    $ git clone https://github.com/Hwesta/hfs2dfxml
+    $ cd hfs2dfxml/hfs2dfxml
+    $ git checkout -t origin/patch-1
+    $ git clone https://github.com/simsong/dfxml/
+Install hfsexplorer:
+- Go to http://www.catacombae.org/hfsexplorer/
+- Download the ZIP at link: Download application as ZIP file (cross-platform)
+- Extract ZIP into directory accessible by Archivematica (suggested: extract
+  into /home/vagrant/hfsexplorer)
+- Make sure that the hfsexplorer directory is where the extraction command
+  expects it to be, i.e., so that unhfs.sh is at /usr/local/hfsexplorer/bin/unhfs.sh
+Also make sure to modify the default processing config so that Siegfried is the
+file identification command during transfer and packages are not deleted after
+extraction: this allows the disk image to be characterized after its contents
+have been extracted.
+'''Results of work as of May 3, 2017'''
+. Added a new characterization command called "Fiwalk fallback to hfs2dfxml" which attempts to use fiwalk for characterizing disk images and, if that fails, attempts to use hfs2dfxml. Also set the "Output File Format" to "XML".
+. Created a new extraction command for disk images called "tsk_recover fallback unhfs" which attempts to extract a disk image using tsk_recover and then falls back to using unhfs if that fails. Set output file format to JSON so that the output can alter the default tool from "Sleuthkit" to "hfsexplorer" when appropriate.
+. Modified the Siegfried fpr_idcommand so that it attempts to run blkid when it would otherwise return 'UNKNOWN'; if blkid indicates that the file is an HFS disk image, then the fpr_idcommand returns ".img (hfs)", cf. the new fpr_idrule below.
+. Created a new fpr_idrule that connects the Siegfried id command to the UCLA HFS format version, via the special ".img (hfs)" PUID/extension:
+ - Format: Disk Image: HFS Disk Image (HFS filesystem): UCLA HFS Disk Image
+ - Command: Siegfried version 1.6.7 PUID runs identify using Siegfried
+ - Command output: .img (hfs)
+. Modified the source of MCPClient/clientScripts/identifyFileFormat.py so that it can handle Siegfried-based identification returning a pronom id that is actually more akin to a file extension, in this case the special ".img (hfs)" output.
+. Modified MCPClient/clientScripts/extractContents.py so that the tool used in extraction will be listed in the eventDetail. In the HFS extraction case, this will be "hfsexplorer".
+. Ran a transfer on artefactual/227_026/ and it was identified as an HFS disk image (by hfs2dfxml) and it was extracted using unhfs.
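+The two fallbacks described above can be sketched roughly as follows. This is a minimal illustration written for Python 3, not the code on the development branch: the function names, the ``{"tool": ...}`` JSON shape, and the exact blkid, tsk_recover, and unhfs arguments are all assumptions made for the example.

```python
import json
import subprocess

HFS_FORMAT_ID = ".img (hfs)"  # special identifier named in the notes above
UNHFS = "/usr/local/hfsexplorer/bin/unhfs.sh"  # path the extraction command expects


def blkid_fs_type(path):
    """Return the file system type blkid reports for path, or '' if none.

    The flags used here are an assumption, not taken from the branch.
    """
    result = subprocess.run(
        ["blkid", "-o", "value", "-s", "TYPE", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.stdout.strip()


def identify(path, siegfried_id, probe=blkid_fs_type):
    """Keep Siegfried's answer unless it is UNKNOWN and blkid reports HFS."""
    if siegfried_id != "UNKNOWN":
        return siegfried_id
    if probe(path).lower() == "hfs":
        return HFS_FORMAT_ID
    return "UNKNOWN"


def extract(image, out_dir, run=subprocess.call):
    """Try tsk_recover first, then unhfs; report which tool succeeded.

    The returned JSON key is hypothetical; it stands in for the output
    that switches the recorded tool from Sleuthkit to hfsexplorer.
    """
    if run(["tsk_recover", "-a", image, out_dir]) == 0:
        tool = "Sleuthkit"
    elif run([UNHFS, "-o", out_dir, image]) == 0:
        tool = "hfsexplorer"
    else:
        raise RuntimeError("no extraction tool could read " + image)
    return json.dumps({"tool": tool})
```

+Passing ``probe`` and ``run`` as parameters keeps the decision logic separate from the external tools, so it can be exercised without blkid, Sleuthkit, or hfsexplorer installed.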
+'''Things to Note:'''
+a. The METS file will list "No Matching Format" for the format identification event for the disk image. This is because the FormatVersion that Siegfried/blkid is assigning to the HFS disk image has no pronom id. This could be "fixed" by assigning a fake pronom id to that FormatVersion, as seems to have been done in the past in Archivematica::
+   mysql> select description, pronom_id from fpr_formatversion where pronom_id like 'archivematica%';
+   +-------------------------------------+---------------------+
+   | description                         | pronom_id           |
+   +-------------------------------------+---------------------+
+   |  Raw Disk Image (HFS filesystem)    | archivematica-fmt/5 |
+   +-------------------------------------+---------------------+
+b. Still left to do is the creation of Ansible tasks to install the tools and their dependencies: unhfs, hfsutils, hfs2dfxml
+== Third iteration ==
+The goal of a third iteration would be to remove the need for a special format, and instead have extract packages and characterization tools chosen appropriately based on file system information rather than format. Two possibilities for enacting FPR rules based on file system type rather than on file format:
+* as part of Disk Image transfer type workflow
+* based on the existence of analysis.csv file in the transfer
+As part of a third iteration we could also look at converting the contents of analysis.csv to dfxml and adding it to the METS file.
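+One way to picture tool selection by file system rather than by format is a simple lookup table. The sketch below is purely illustrative: the table, the function name, and the idea that a file system type string is already available (for example from analysis.csv) are assumptions, and the tool names are the ones mentioned on this page.

```python
# Hypothetical mapping from file system type to the tools named on this page.
TOOLS_BY_FILESYSTEM = {
    "hfs": {"extract": "unhfs", "characterize": "hfs2dfxml"},
}

# Tools used when no file-system-specific entry exists.
DEFAULT_TOOLS = {"extract": "tsk_recover", "characterize": "fiwalk"}


def choose_tools(fs_type):
    """Return the extraction and characterization tools for a file system type."""
    key = (fs_type or "").strip().lower()
    return TOOLS_BY_FILESYSTEM.get(key, DEFAULT_TOOLS)
```

+With a table like this, supporting another file system would mean adding an entry rather than defining a new special format in the FPR.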
 == Tools/resources ==