Difference between revisions of "Improvements/Disk Image Preservation"
+ | <div style="padding: 10px 10px; border: 1px solid black; background-color: #F79086;">This page is no longer being maintained and may contain inaccurate information. Please see the [https://www.archivematica.org/docs/latest/ Archivematica documentation] for up-to-date information. </div> <p> | ||
+ | |||
== Synopsis == | == Synopsis == | ||
Develop a pre-ingest script which would identify the file system and store this metadata in such a way that it can be passed to Archivematica. This could be made part of an [https://github.com/artefactual/automation-tools automation tool] script. | Develop a pre-ingest script which would identify the file system and store this metadata in such a way that it can be passed to Archivematica. This could be made part of an [https://github.com/artefactual/automation-tools automation tool] script. | ||
+ | |||
+ | '''Update''': after a first iteration we have decided to take a different approach. See Second iteration, below. | ||
===File identification script [1 support ticket]=== | ===File identification script [1 support ticket]=== | ||
Currently, the file identification scripts run in Archivematica will identify the type of disk image but not the file system of the disk image. This script will use the data from the pre-ingest script to identify both the disk image type and the file system. | Currently, the file identification scripts run in Archivematica will identify the type of disk image but not the file system of the disk image. This script will use the data from the pre-ingest script to identify both the disk image type and the file system. | ||
+ | |||
+ | '''Results''' | ||
+ | |||
+ | As a first iteration, we used identify by extension and created an FPR entry for raw disk image with HFS: | ||
+ | |||
+ | New FormatVersion: Format: Raw Disk Image; Description: Raw Disk Image (HFS filesystem) | ||
+ | |||
+ | Modify Identify by File Extension like [2] | ||
+ | |||
+ | Command: | ||
+ | |||
+ | <pre> | ||
+ | |||
+ | from __future__ import print_function | ||
+ | import os.path | ||
+ | import subprocess | ||
+ | import sys | ||
+ | |||
+ | def file_tool(path): | ||
+ | return subprocess.check_output(['file', path]).strip() | ||
+ | |||
+ | def blkid(path): | ||
+ | try: | ||
+ | return subprocess.check_output(['blkid', '-o', 'full', path]) | ||
+ | except Exception: | ||
+ | return '' | ||
+ | |||
+ | (_, extension) = os.path.splitext(sys.argv[1]) | ||
+ | |||
+ | if extension: | ||
+ | print(extension, end='') | ||
+ | if extension in ('.img',): | ||
+ | output = blkid(sys.argv[1]) | ||
+ | if 'TYPE="hfs"' in output: | ||
+ | print(' (hfs)') | ||
+ | else: | ||
+ | # Plaintext files frequently have no extension, but are common to identify. | ||
+ | # file is pretty smart at figuring these out. | ||
+ | file_output = file_tool(sys.argv[1]) | ||
+ | if 'text' in file_output: | ||
+ | print('.txt') | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | New ID Rule: Format: Raw Disk Image (HFS filesystem); Command: Identify by File Extension; Output: ".img (hfs)" | ||
+ | |||
+ | Disable Rule with output ".img", or modify to identify as Raw Disk Image | ||
+ | |||
+ | [1] https://github.com/cul-it/hfs2dfxml | ||
+ | |||
+ | [2] https://github.com/artefactual/archivematica-fpr-tools/blob/dev/issue-10818-hfs-disk-image/id/file-by-extension.py | ||
+ | |||
===Characterize [1 support ticket]=== | ===Characterize [1 support ticket]=== | ||
Write meaningful characterization about the size and file type of the disk images so that statistics can be gathered from the AIPs. Currently, when fiwalk is run as the characterization tool for a disk image, dfxml is written in premis:objectCharacteristicsExtension. | Write meaningful characterization about the size and file type of the disk images so that statistics can be gathered from the AIPs. Currently, when fiwalk is run as the characterization tool for a disk image, dfxml is written in premis:objectCharacteristicsExtension. | ||
+ | |||
+ | '''Results''' | ||
+ | |||
+ | For characterization, the hfs2dfxml [4] provides a nice XML output with metadata about the image. However, it is also not packaged for Ubuntu. To install it, follow the instructions in the README [5] by installing hfsutils & python-magic, cloning the repository and cloning the dependency dfxml in the correct location inside the repository. | ||
+ | |||
+ | However, it can only be run from inside the repository without a patch. Either clone my fork [6] and change branches, or apply the patch [7] yourself. | ||
+ | |||
+ | FPR changes | ||
+ | |||
+ | *Set up file identification FPR changes | ||
+ | *Install & patch hfs2dfxml somewhere Archivematica can run it from | ||
+ | *Disable "Delete packages after extraction" | ||
+ | *New FPR Tool: Description: hfs2dfxml; Version: git commit hash | ||
+ | *New Characterization Command: | ||
+ | **Tool: hfs2dfxml | ||
+ | **Description: hfs2dfxml characterization | ||
+ | **Script Type: bash | ||
+ | **Command: | ||
+ | |||
+ | <pre> | ||
+ | |||
+ | output=/tmp/temp_`uuid -v4` | ||
+ | echo $(id) | ||
+ | python /home/users/hbecker/bin/hfs2dfxml/hfs2dfxml/hfs2dfxml.py "%fileFullName%" $output | ||
+ | cat $output | ||
+ | rm $output | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | **Output Format: Text (Markup): XML: XML | ||
+ | **Command Usage: Characterization | ||
+ | |||
+ | *New Characterization Rule: Purpose: Characterization; Format: Raw Disk Image (HFS filesystem); Command: hfs2dfxml | ||
+ | |||
+ | However, that setup currently generates an error when run through Archivematica. "_call_hmount error: Failed to initialize HFS working directories: Permission denied" hfs2dfxml is being run, but generates an error when trying to call hfsutils. This requires further investigation. | ||
+ | |||
+ | [4] https://github.com/cul-it/hfs2dfxml | ||
+ | [5] https://github.com/cul-it/hfs2dfxml/blob/master/README.md | ||
+ | [6] https://github.com/Hwesta/hfs2dfxml/tree/patch-1 | ||
+ | [7] https://github.com/cul-it/hfs2dfxml/pull/7/files | ||
===File extraction [1 support ticket]=== | ===File extraction [1 support ticket]=== | ||
Implement tools such as [https://www.mars.org/home/rob/proj/hfs/ HFS Utilities] that will allow files from hfs disk images to be extracted. | Implement tools such as [https://www.mars.org/home/rob/proj/hfs/ HFS Utilities] that will allow files from hfs disk images to be extracted. | ||
+ | |||
+ | '''Results''' | ||
+ | |||
+ | * Fiwalk does not recognize the filesystem, and cannot extract from it. | ||
+ | * hfsutils provides the hmount and hcopy commands, but hcopy is not recursive | ||
+ | * tsk_recover cannot recognize the filesystem, outputting "Cannot determine file system type (Sector offset: 0)Files Recovered: 0" | ||
+ | * hfsexplorer [3] provides a command line extraction tool for HFS filesystems. However, hfsexplorer is not packaged for Ubuntu, and must be installed manually. | ||
+ | |||
+ | To install hfsexplorer, download and extract it. By default it uses a GUI, but a command line interface is accessible from the hfsx.sh script. The script we want is unhfs.sh, which extracts files from the image. | ||
+ | |||
+ | FPR changes: | ||
+ | |||
+ | To handle extraction, use hfsexplorer's unhfs command to extract all files from the hfs partition. | ||
+ | |||
+ | *Set up file identification FPR changes | ||
+ | *Install hfsexplorer somewhere Archivematica can run it from | ||
+ | *New FPR Tool: Description: hfsexplorer; Version: 0.23.1 | ||
+ | *New Extraction Command: | ||
+ | **Tool: hfsexplorer | ||
+ | **Description: unhfs | ||
+ | **Script Type: bash | ||
+ | **Command: | ||
+ | |||
+ | <pre> | ||
+ | |||
+ | mkdir "%outputDirectory%" | ||
+ | /home/users/hbecker/bin/hfsexplorer/bin/unhfs.sh -v -o "%outputDirectory%" "%inputFile%" | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | **Output location: outputDirectory | ||
+ | **Command Usage: Extraction | ||
+ | *New Extraction Rule: Purpose: Extract; Format: Raw Disk Image (HFS filesystem); Command: unhfs | ||
+ | |||
+ | [3] http://www.catacombae.org/hfsexplorer/ | ||
===Deployment of development above for testing [1 support ticket]=== | ===Deployment of development above for testing [1 support ticket]=== | ||
Explore reporting possibilities either in Archivematica or using other reporting tools. | Explore reporting possibilities either in Archivematica or using other reporting tools. | ||
+ | |||
+ | == Second iteration == | ||
+ | |||
+ | The goal of the second iteration is to get around the need to identify by extension. This will allow the disk image itself and its contents to be properly identified using a normal identification tool. For the purposes of this project we'll do it using Siegfried. | ||
+ | |||
+ | Steps to the second iteration: | ||
+ | |||
+ | * test analyzer script from [https://github.com/timothyryanwalsh/cca-diskimageprocessor CCA Disk Image Processor][1] on disk image samples | ||
+ | * verify that for our samples, the output csv file has identified the file system | ||
+ | * integrate the script into the Siegfried file identification script running within Archivematica. This could be developed to run any time Siegfried is run as the file identification tool, or as a special configuration using the Disk Image transfer type. | ||
+ | * this will produce a special format (e.g., "img-hfs" or similar) which will need to be added to the FPR in order to trigger the desired tools in later micro-services. | ||
+ | |||
+ | [1] Script from CCA Disk Image Processor can be run outside the GUI: https://github.com/timothyryanwalsh/cca-diskimageprocessor/blob/master/diskimageanalyzer.py | ||
+ | |||
+ | '''Results of work as of May 9, 2017''' | ||
+ | |||
+ | The HFS disk image extraction and characterization functionality should work on http://am16hfs.archivematica.org/transfer/ and it should also work on a new system that is running Archivematica at branch dev/issue-10818-hfs-disk-images. | ||
+ | |||
+ | To install such a system using Archivematica's deploy-pub Vagrant repo, modify the vars-singlenode-1.6.yml file so that ``archivematica_src_am_version`` is set to ``"dev/issue-10818-hfs-disk-images"`` and provision:: | ||
+ | |||
+ | $ vagrant provision | ||
+ | |||
+ | In order to get the new HFS disk image functionality to work in Archivematica, the following tools and their dependencies must be manually installed: | ||
+ | |||
+ | - hfs2dfxml | ||
+ | - hfsexplorer | ||
+ | |||
+ | The following instructions were used on Ubuntu 14.04. | ||
+ | |||
+ | To install hfsutils:: | ||
+ | |||
+ | $ sudo apt-get update | ||
+ | $ sudo apt-get install hfsutils | ||
+ | |||
+ | Install python-magic. WARNING: it's important that you do NOT install this python-magic: https://github.com/ahupp/python-magic. Instead, you must install the python-magic (see https://github.com/threatstack/libmagic) that Ubuntu installs when you call the following:: | ||
+ | |||
+ | $ sudo apt-get install python-magic | ||
+ | |||
+ | If your install has separate virtual environments for each Archivematica component, then the MCPClient virtualenv needs to have magic installed also:: | ||
+ | |||
+ | $ sudo ln -s /usr/lib/python2.7/dist-packages/magic.py /usr/share/python/archivematica-mcp-client/lib/python2.7/site-packages/magic.py | ||
+ | |||
+ | Install (Holly Becker's patch of) the hfs2dfxml source in your home directory | ||
+ | on the machine where Archivematica is installed:: | ||
+ | |||
+ | $ pwd | ||
+ | /home/vagrant | ||
+ | $ mkdir bin | ||
+ | $ cd bin | ||
+ | $ git clone https://github.com/Hwesta/hfs2dfxml | ||
+ | $ cd hfs2dfxml/hfs2dfxml | ||
+ | $ git checkout -t origin/patch-1 | ||
+ | $ git clone https://github.com/simsong/dfxml/ | ||
+ | |||
+ | Install hfsexplorer: | ||
+ | |||
+ | - Go to http://www.catacombae.org/hfsexplorer/ | ||
+ | - Download the ZIP at link: Download application as ZIP file (cross-platform) | ||
+ | - Extract ZIP into directory accessible by Archivematica (suggested: extract | ||
+ | into /home/vagrant/hfsexplorer) | ||
+ | - Make sure that the hfsexplorer directory is in the directory that extraction | ||
+ | command expects it to be, i.e., /usr/local/hfsexplorer/bin/unhfs.sh | ||
+ | |||
+ | Also make sure to modify the default processing config so that Siegfried is the | ||
+ | file identification command during transfer and packages are not deleted after | ||
+ | extraction: this allows the disk image to be characterized after its contents | ||
+ | have been extracted. | ||
+ | |||
+ | '''Results of work as of May 3, 2017''' | ||
+ | |||
+ | 1. Added a new characterization command called "Fiwalk fallback to hfs2dfxml" which attempts to use fiwalk for characterizing disk images and, if that fails, attempts to use hfs2dfxml. Also set the "Output File Format" to "XML". | ||
+ | |||
+ | 2. Created a new extraction command for disk images called "tsk_recover fallback unhfs" which attempts to extract a disk image using tsk_recover and then falls back to using unhfs if that fails. Set output file format to JSON so that the output can alter the default tool from "Sleuthkit" to "hfsexplorer" when appropriate. | ||
+ | |||
+ | 3. Modified the Siegfried fpr_idcommand so that it attempts to run blkid when it would otherwise return 'UNKNOWN'; if blkid indicates that the file is an HFS disk image, then the fpr_idcommand returns ".img (hfs)", cf. the new fpr_idrule below. | ||
+ | |||
+ | 4. Created a new fpr_idrule that connects the Siegfried id command to the UCLA HFS format version, via the special ".img (hfs)" PUID/extension: | ||
+ | - Format: Disk Image: HFS Disk Image (HFS filesystem): UCLA HFS Disk Image | ||
+ | - Command: Siegfried version 1.6.7 PUID runs identify using Siegfried | ||
+ | - Command output: .img (hfs) | ||
+ | |||
+ | 5. Modified the source of MCPClient/clientScripts/identifyFileFormat.py so that it can handle Siegfried-based identification returning a pronom id that is actually more akin to a file extension, in this case the special ".img (hfs)" output. | ||
+ | |||
+ | 6. Modified MCPClient/clientScripts/extractContents.py so that the tool used in extraction will be listed in the eventDetail. In the HFS extraction case, this will be "hfsexplorer". | ||
+ | |||
+ | 7. Ran a transfer on artefactual/227_026/ and it was identified as an HFS disk image (by hfs2dfxml) and it was extracted using unhfs. | ||
+ | |||
+ | '''Things to Note:''' | ||
+ | |||
+ | a. The METS file will list "No Matching Format" for the format identification event for the disk image. This is because the FormatVersion that Siegfried/blkid is assigning to the HFS disk image has no pronom id. This could be "fixed" by assigning a fake pronom id to that FormatVersion, as seems to have been done in the past in Archivematica:: | ||
+ | |||
+ | mysql> select description, pronom_id from fpr_formatversion where pronom_id like 'archivematica%'; | ||
+ | +-------------------------------------+---------------------+ | ||
+ | | description | pronom_id | | ||
+ | +-------------------------------------+---------------------+ | ||
+ | | Raw Disk Image (HFS filesystem) | archivematica-fmt/5 | | ||
+ | +-------------------------------------+---------------------+ | ||
+ | |||
+ | b. Still left to do is the creation of ansible tasks to install the tools and their dependencies: unhfs, HFSutils, hfs2dfxml | ||
+ | |||
+ | == Third iteration == | ||
+ | The goal of a third iteration would be to prevent the need to use a special format, and instead have extract packages and characterization tools chosen appropriately based on file system information rather than format. Two possibilities for enacting FPR rules based on file system type rather than on file format: | ||
+ | |||
+ | * as part of Disk Image transfer type workflow | ||
+ | * based on the existence of analysis.csv file in the transfer | ||
+ | |||
+ | As part of a third iteration we could also look at converting the contents of analysis.csv to dfxml and adding it to the METS file. | ||
== Tools/resources == | == Tools/resources == | ||
[[Digital_forensics_image_ingest|Original requirements for forensic image ingest]] | [[Digital_forensics_image_ingest|Original requirements for forensic image ingest]] | ||
+ | |||
+ | [[Category:Development documentation]] |
Latest revision as of 15:55, 11 February 2020
Synopsis
This project is being sponsored by UCLA Library and NYPL Special Collections but more collaborators are welcome! Please get in touch on the community user forum.
Different tools are needed to extract different disk images, depending on the file system the disk image was created from. Archivematica's standard tool for disk image extraction is tsk_recover, which is limited to 18 file systems. A challenge to invoking the right tool for the job is that Archivematica selects tools by file format rather than by other characteristics (in this case, the file system). For example, if a disk image is identified as an ISO image, Archivematica will currently invoke tsk_recover regardless of the file system of the image, even though tsk_recover can only extract the contents of the 18 file systems it supports.
In this project we are particularly focused on HFS disk images, which are not currently supported by tsk_recover, but the development will be generic enough to be useful for other types of disk images as well.
User story
As an archivist, I would like to process disk images of an unknown file-system through Archivematica, and have Archivematica and/or its associated tools recognize and record the file system and choose appropriate tools for disk image extraction and characterization. Further, I would want to be able to pull statistics about the size and file-type of disk images from the system.
Development tasks
We have identified the following development tasks which would need to be addressed in the order described:
Recommendation: upgrade Sleuthkit [1 support ticket]
Upgrade from version 4.1.3 (three years old) to 4.4 (most recent release).
Pre-ingest script [1 support ticket]
Develop a pre-ingest script which would identify the file system and store this metadata in such a way that it can be passed to Archivematica. This could be made part of an automation tool script.
Update: after a first iteration we have decided to take a different approach. See Second iteration, below.
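The pre-ingest idea described above could be sketched roughly as follows. This is only an illustration, not the script that was written for the project: it assumes blkid is available, it parses the same `-o full` output that the identification command further down relies on, and the record layout returned by `build_record` is hypothetical.

```python
import re
import subprocess


def parse_blkid(output):
    """Turn blkid's KEY="value" pairs into a dict (empty if nothing matched)."""
    return dict(re.findall(r'(\w+)="([^"]*)"', output))


def probe_filesystem(image_path):
    """Ask blkid about a disk image; return '' if blkid fails or finds nothing."""
    try:
        output = subprocess.check_output(
            ["blkid", "-o", "full", image_path], universal_newlines=True
        )
    except (OSError, subprocess.CalledProcessError):
        return ""
    return parse_blkid(output).get("TYPE", "")


def build_record(image_path, filesystem):
    """Hypothetical sidecar record; the field names are illustrative only."""
    return {"filename": image_path, "filesystem": filesystem or "unknown"}
```

A wrapper could call `build_record(path, probe_filesystem(path))` for each image and write the records to a file that travels with the transfer. How Archivematica would consume that file is left open here, as it is in the task description.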
File identification script [1 support ticket]
Currently, the file identification scripts run in Archivematica will identify the type of disk image but not the file system of the disk image. This script will use the data from the pre-ingest script to identify both the disk image type and the file system.
Results
As a first iteration, we used identify by extension and created an FPR entry for raw disk image with HFS:
New FormatVersion: Format: Raw Disk Image; Description: Raw Disk Image (HFS filesystem)
Modify Identify by File Extension like [2]
Command:
    from __future__ import print_function
    import os.path
    import subprocess
    import sys


    def file_tool(path):
        # universal_newlines=True returns text on both Python 2 and Python 3.
        return subprocess.check_output(['file', path], universal_newlines=True).strip()


    def blkid(path):
        # Treat any blkid failure (including "nothing found") as an empty result.
        try:
            return subprocess.check_output(['blkid', '-o', 'full', path], universal_newlines=True)
        except Exception:
            return ''


    (_, extension) = os.path.splitext(sys.argv[1])

    if extension:
        print(extension, end='')
        # Note the trailing comma: a one-element tuple, so only '.img' matches.
        if extension in ('.img',):
            output = blkid(sys.argv[1])
            if 'TYPE="hfs"' in output:
                print(' (hfs)')
    else:
        # Plaintext files frequently have no extension, but are common to identify.
        # file is pretty smart at figuring these out.
        file_output = file_tool(sys.argv[1])
        if 'text' in file_output:
            print('.txt')
New ID Rule: Format: Raw Disk Image (HFS filesystem); Command: Identify by File Extension; Output: ".img (hfs)"
Disable Rule with output ".img", or modify to identify as Raw Disk Image
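[1] https://github.com/cul-it/hfs2dfxml
[2] https://github.com/artefactual/archivematica-fpr-tools/blob/dev/issue-10818-hfs-disk-image/id/file-by-extension.py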
Characterize [1 support ticket]
Write meaningful characterization about the size and file type of the disk images so that statistics can be gathered from the AIPs. Currently, when fiwalk is run as the characterization tool for a disk image, dfxml is written in premis:objectCharacteristicsExtension.
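As an illustration of the kind of statistics this task has in mind, the sketch below totals file sizes and counts file extensions in a DFXML document. It assumes the namespace and element names seen in fiwalk's output (`fileobject`, `name_type`, `filesize`, `filename`); the function name and the shape of the returned summary are made up for this example, and counting by extension is only a stand-in for real format identification.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter

DFXML_NS = "{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}"

SAMPLE = """<dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML">
  <volume offset="0">
    <fileobject><filename>.</filename><name_type>d</name_type><filesize>2048</filesize></fileobject>
    <fileobject><filename>799PX_EU.BMP</filename><name_type>r</name_type><filesize>1437654</filesize></fileobject>
    <fileobject><filename>LION.SVG</filename><name_type>r</name_type><filesize>18324</filesize></fileobject>
  </volume>
</dfxml>"""


def summarize_dfxml(xml_text):
    """Count regular files, total their sizes, and tally their extensions."""
    root = ET.fromstring(xml_text)
    total_bytes = 0
    extensions = Counter()
    for fileobject in root.iter(DFXML_NS + "fileobject"):
        # name_type "r" marks regular files; directories ("d") are skipped.
        if fileobject.findtext(DFXML_NS + "name_type") != "r":
            continue
        total_bytes += int(fileobject.findtext(DFXML_NS + "filesize", default="0"))
        name = fileobject.findtext(DFXML_NS + "filename", default="")
        extensions[os.path.splitext(name)[1].lower() or "(none)"] += 1
    return {
        "file_count": sum(extensions.values()),
        "total_bytes": total_bytes,
        "extensions": dict(extensions),
    }


summary = summarize_dfxml(SAMPLE)
print(summary["file_count"], summary["total_bytes"])  # prints: 2 1455978
```

In an AIP the DFXML sits inside premis:objectCharacteristicsExtension in the METS file, so a real report would first have to locate that element.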
Results
For characterization, hfs2dfxml [4] provides XML output with metadata about the image. However, it is not packaged for Ubuntu. To install it, follow the instructions in the README [5]: install hfsutils and python-magic, clone the repository, and clone the dfxml dependency into the correct location inside the repository.
Without a patch, however, it can only be run from inside the repository. Either clone my fork [6] and switch to the patch-1 branch, or apply the patch [7] yourself.
FPR changes
- Set up file identification FPR changes
- Install & patch hfs2dfxml somewhere Archivematica can run it from
- Disable "Delete packages after extraction"
- New FPR Tool: Description: hfs2dfxml; Version: git commit hash
- New Characterization Command:
- Tool: hfs2dfxml
- Description: hfs2dfxml characterization
- Script Type: bash
- Command:
    output=/tmp/temp_`uuid -v4`
    echo $(id)
    python /home/users/hbecker/bin/hfs2dfxml/hfs2dfxml/hfs2dfxml.py "%fileFullName%" $output
    cat $output
    rm $output
- Output Format: Text (Markup): XML: XML
- Command Usage: Characterization
- New Characterization Rule: Purpose: Characterization; Format: Raw Disk Image (HFS filesystem); Command: hfs2dfxml
However, that setup currently generates an error when run through Archivematica: "_call_hmount error: Failed to initialize HFS working directories: Permission denied". hfs2dfxml is being run, but it fails when it tries to call hfsutils. This requires further investigation.
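One possible cause, which the notes above do not confirm, is that hfsutils keeps its mount state in a file under the calling user's home directory, and the user Archivematica runs as may not have a writable one. A way to test that hypothesis is to run the command with HOME pointed at a temporary directory. The helper below is a sketch for that experiment only; its name and the temporary-directory prefix are made up here.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_with_private_home(argv):
    """Run argv with HOME pointing at a fresh, writable temporary directory."""
    home = tempfile.mkdtemp(prefix="hfs_home_")
    env = dict(os.environ, HOME=home)
    try:
        return subprocess.check_output(argv, env=env, universal_newlines=True)
    finally:
        # Remove the directory and anything the tool left behind in it.
        shutil.rmtree(home, ignore_errors=True)


# Example: show the HOME value a child process would see.
print(run_with_private_home(
    [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
))
```

If hfs2dfxml succeeds when launched this way, the permission error is likely about the home directory rather than about the disk image itself.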
[4] https://github.com/cul-it/hfs2dfxml
[5] https://github.com/cul-it/hfs2dfxml/blob/master/README.md
[6] https://github.com/Hwesta/hfs2dfxml/tree/patch-1
[7] https://github.com/cul-it/hfs2dfxml/pull/7/files
File extraction [1 support ticket]
Implement tools such as HFS Utilities that will allow files from hfs disk images to be extracted.
Results
- Fiwalk does not recognize the filesystem, and cannot extract from it.
- hfsutils provides the hmount and hcopy commands, but hcopy is not recursive
- tsk_recover cannot recognize the filesystem, outputting "Cannot determine file system type (Sector offset: 0)Files Recovered: 0"
- hfsexplorer [3] provides a command line extraction tool for HFS filesystems. However, hfsexplorer is not packaged for Ubuntu, and must be installed manually.
To install hfsexplorer, download and extract it. By default it uses a GUI, but a command line interface is accessible from the hfsx.sh script. The script we want is unhfs.sh, which extracts files from the image.
FPR changes:
To handle extraction, use hfsexplorer's unhfs command to extract all files from the hfs partition.
- Set up file identification FPR changes
- Install hfsexplorer somewhere Archivematica can run it from
- New FPR Tool: Description: hfsexplorer; Version: 0.23.1
- New Extraction Command:
- Tool: hfsexplorer
- Description: unhfs
- Script Type: bash
- Command:
    mkdir "%outputDirectory%"
    /home/users/hbecker/bin/hfsexplorer/bin/unhfs.sh -v -o "%outputDirectory%" "%inputFile%"
- Output location: outputDirectory
- Command Usage: Extraction
- New Extraction Rule: Purpose: Extract; Format: Raw Disk Image (HFS filesystem); Command: unhfs
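For experimenting with the extraction command outside the FPR, the same invocation can be sketched in Python. The `-v` and `-o` options are the ones used in the command above. The function names are hypothetical, and the path to unhfs.sh is passed in as a parameter because it depends on where hfsexplorer was installed.

```python
import os
import subprocess


def build_unhfs_command(unhfs_path, image_path, output_dir):
    """Build the argument list for unhfs.sh; a list avoids shell quoting issues."""
    return [unhfs_path, "-v", "-o", output_dir, image_path]


def extract_hfs(unhfs_path, image_path, output_dir):
    """Create the output directory if needed, then run unhfs and return its exit code."""
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    return subprocess.call(build_unhfs_command(unhfs_path, image_path, output_dir))
```

Unlike the bare `mkdir` in the bash command, this version does not fail when the output directory already exists.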
[3] http://www.catacombae.org/hfsexplorer/
Deployment of development above for testing [1 support ticket]
Reporting [?? support tickets, depends on scope]
Explore reporting possibilities either in Archivematica or using other reporting tools.
Second iteration
The goal of the second iteration is to get around the need to identify by extension. This will allow the disk image itself and its contents to be properly identified using a normal identification tool. For the purposes of this project we'll do it using Siegfried.
Steps to the second iteration:
- test analyzer script from CCA Disk Image Processor[1] on disk image samples
- verify that for our samples, the output csv file has identified the file system
- integrate the script into the Siegfried file identification script running within Archivematica. This could be developed to run any time Siegfried is run as the file identification tool, or as a special configuration using the Disk Image transfer type.
- this will produce a special format (e.g., "img-hfs" or similar) which will need to be added to the FPR in order to trigger the desired tools in later micro-services.
[1] Script from CCA Disk Image Processor can be run outside the GUI: https://github.com/timothyryanwalsh/cca-diskimageprocessor/blob/master/diskimageanalyzer.py
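The verification step in the list above (checking that the output CSV has identified the file system for every sample) could be automated along these lines. The column names `disk_image` and `file_system` are placeholders, since the analyzer's actual CSV layout is not described here; they would need to be adjusted to match the real header row.

```python
import csv
import io


def filesystems_from_csv(csv_text, name_column="disk_image", fs_column="file_system"):
    """Map each image name to the file system recorded for it ('' if blank)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[name_column]: (row.get(fs_column) or "").strip() for row in reader}


def missing_filesystem(csv_text, **columns):
    """Return the names of images for which no file system was recorded."""
    found = filesystems_from_csv(csv_text, **columns)
    return sorted(name for name, filesystem in found.items() if not filesystem)
```

An empty list from `missing_filesystem` would mean every sample in the report had a file system recorded.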
Results of work as of May 9, 2017
The HFS disk image extraction and characterization functionality should work on http://am16hfs.archivematica.org/transfer/ and it should also work on a new system that is running Archivematica at branch dev/issue-10818-hfs-disk-images.
To install such a system using Archivematica's deploy-pub Vagrant repo, modify the vars-singlenode-1.6.yml file so that ``archivematica_src_am_version`` is set to ``"dev/issue-10818-hfs-disk-images"`` and provision::
$ vagrant provision
In order to get the new HFS disk image functionality to work in Archivematica, the following tools and their dependencies must be manually installed:
- hfs2dfxml
- hfsexplorer
The following instructions were used on Ubuntu 14.04.
To install hfsutils::
    $ sudo apt-get update
    $ sudo apt-get install hfsutils
Install python-magic. WARNING: it's important that you do NOT install this python-magic: https://github.com/ahupp/python-magic. Instead, you must install the python-magic (see https://github.com/threatstack/libmagic) that Ubuntu installs when you call the following::
$ sudo apt-get install python-magic
If your install has separate virtual environments for each Archivematica component, then the MCPClient virtualenv needs to have magic installed also::
$ sudo ln -s /usr/lib/python2.7/dist-packages/magic.py /usr/share/python/archivematica-mcp-client/lib/python2.7/site-packages/magic.py
Install (Holly Becker's patch of) the hfs2dfxml source in your home directory on the machine where Archivematica is installed::
    $ pwd
    /home/vagrant
    $ mkdir bin
    $ cd bin
    $ git clone https://github.com/Hwesta/hfs2dfxml
    $ cd hfs2dfxml/hfs2dfxml
    $ git checkout -t origin/patch-1
    $ git clone https://github.com/simsong/dfxml/
Install hfsexplorer:
- Go to http://www.catacombae.org/hfsexplorer/
- Download the ZIP at link: Download application as ZIP file (cross-platform)
- Extract ZIP into directory accessible by Archivematica (suggested: extract into /home/vagrant/hfsexplorer)
- Make sure that the hfsexplorer directory is in the directory that extraction command expects it to be, i.e., /usr/local/hfsexplorer/bin/unhfs.sh
Also make sure to modify the default processing config so that Siegfried is the file identification command during transfer and packages are not deleted after extraction: this allows the disk image to be characterized after its contents have been extracted.
Results of work as of May 3, 2017
1. Added a new characterization command called "Fiwalk fallback to hfs2dfxml" which attempts to use fiwalk for characterizing disk images and, if that fails, attempts to use hfs2dfxml. Also set the "Output File Format" to "XML".
2. Created a new extraction command for disk images called "tsk_recover fallback unhfs" which attempts to extract a disk image using tsk_recover and then falls back to using unhfs if that fails. Set output file format to JSON so that the output can alter the default tool from "Sleuthkit" to "hfsexplorer" when appropriate.
3. Modified the Siegfried fpr_idcommand so that it attempts to run blkid when it would otherwise return 'UNKNOWN'; if blkid indicates that the file is an HFS disk image, then the fpr_idcommand returns ".img (hfs)", cf. the new fpr_idrule below.
4. Created a new fpr_idrule that connects the Siegfried id command to the UCLA HFS format version, via the special ".img (hfs)" PUID/extension:
- Format: Disk Image: HFS Disk Image (HFS filesystem): UCLA HFS Disk Image
- Command: Siegfried version 1.6.7 PUID runs identify using Siegfried
- Command output: .img (hfs)
5. Modified the source of MCPClient/clientScripts/identifyFileFormat.py so that it can handle Siegfried-based identification returning a pronom id that is actually more akin to a file extension, in this case the special ".img (hfs)" output.
6. Modified MCPClient/clientScripts/extractContents.py so that the tool used in extraction will be listed in the eventDetail. In the HFS extraction case, this will be "hfsexplorer".
7. Ran a transfer on artefactual/227_026/ and it was identified as an HFS disk image (by hfs2dfxml) and it was extracted using unhfs.
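The fallback logic in items 2 and 3 can be illustrated with a short sketch. It is not the code of the FPR commands themselves: the function names are invented for this example, and the `{"tool": ...}` result is only a guess at the kind of JSON the extraction command emits. The stand-in commands at the end exist purely to exercise the fallback without needing tsk_recover or unhfs installed.

```python
import json
import subprocess
import sys


def identify(siegfried_puid, blkid_output):
    """Item 3: consult blkid's output only when Siegfried reports UNKNOWN."""
    if siegfried_puid != "UNKNOWN":
        return siegfried_puid
    if 'TYPE="hfs"' in blkid_output:
        return ".img (hfs)"
    return "UNKNOWN"


def run_with_fallback(commands):
    """Item 2: try (tool_name, argv) pairs in order; report the first that exits 0."""
    for tool, argv in commands:
        try:
            exit_code = subprocess.call(argv)
        except OSError:
            # The executable could not be started; try the next candidate.
            continue
        if exit_code == 0:
            return {"tool": tool}
    return None


# Stand-in commands: the first always fails, the second always succeeds.
failing = [sys.executable, "-c", "raise SystemExit(1)"]
passing = [sys.executable, "-c", "pass"]
result = run_with_fallback([("Sleuthkit", failing), ("hfsexplorer", passing)])
print(json.dumps(result))  # prints: {"tool": "hfsexplorer"}
```

Reporting the name of the tool that actually succeeded is what lets the extraction event record "hfsexplorer" instead of the default "Sleuthkit", as described in item 6.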
Things to Note:
a. The METS file will list "No Matching Format" for the format identification event for the disk image. This is because the FormatVersion that Siegfried/blkid is assigning to the HFS disk image has no pronom id. This could be "fixed" by assigning a fake pronom id to that FormatVersion, as seems to have been done in the past in Archivematica::
mysql> select description, pronom_id from fpr_formatversion where pronom_id like 'archivematica%';
 +-------------------------------------+---------------------+
 | description                         | pronom_id           |
 +-------------------------------------+---------------------+
 | Raw Disk Image (HFS filesystem)     | archivematica-fmt/5 |
 +-------------------------------------+---------------------+
b. Still left to do is the creation of Ansible tasks to install the tools and their dependencies: unhfs, hfsutils, hfs2dfxml.
Third iteration
The goal of a third iteration would be to remove the need for a special format, and instead have package extraction and characterization tools chosen based on file system information rather than format. There are two possibilities for triggering FPR rules based on file system type rather than file format:
- as part of Disk Image transfer type workflow
- based on the existence of analysis.csv file in the transfer
As part of a third iteration we could also look at converting the contents of analysis.csv to dfxml and adding it to the METS file.
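Choosing tools by file system rather than by format amounts to a lookup keyed on the file system name. The sketch below shows the idea only; the table and function names are hypothetical and do not correspond to existing FPR structures. The tool names are the ones used elsewhere on this page.

```python
# Tools Archivematica uses for disk images by default.
DEFAULT_TOOLS = {"extract": "tsk_recover", "characterize": "fiwalk"}

# Overrides for file systems the default tools cannot handle.
TOOLS_BY_FILESYSTEM = {
    "hfs": {"extract": "unhfs", "characterize": "hfs2dfxml"},
}


def choose_tools(filesystem):
    """Pick tools for a file system name, falling back to the defaults."""
    return TOOLS_BY_FILESYSTEM.get((filesystem or "").lower(), DEFAULT_TOOLS)
```

The file system name could come from either source listed above: the Disk Image transfer type workflow or an analysis.csv file in the transfer.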
Tools/resources
- HFS Utilities
- hfs2dfxml - Utility to parse hfsutils output and produce DFXML for HFS-formatted disk images
- CCA Disk Image Processor - Creates ready-to-ingest SIPs from a directory of disk images and related files.
Example dfxml
This is how dfxml is currently written by fiwalk in the Archivematica METS file:
<premis:objectCharacteristicsExtension>
  <dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML" xmlns:dc="http://purl.org/dc/elements/1.1/" version="1.0">
    <metadata>
      <dc:type>Disk Image</dc:type>
    </metadata>
    <creator version="1.0">
      <program>fiwalk</program>
      <version>4.1.3</version>
      <build_environment>
        <compiler>GCC 4.8</compiler>
        <library name="afflib" version="3.6.6"/>
        <library name="libewf" version="20130416"/>
      </build_environment>
      <execution_environment>
        <command_line>fiwalk -x /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso -c /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt</command_line>
        <start_time>2017-01-23T21:13:53Z</start_time>
      </execution_environment>
    </creator>
    <!-- Reading configuration file /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt -->
    <!-- pattern: * method: dgi path: python /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/pronom_ident.py -->
    <source>
      <image_filename>/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso</image_filename>
    </source>
    <!-- fs start: 0 -->
    <volume offset="0">
      <partition_offset>0</partition_offset>
      <block_size>2048</block_size>
      <ftype>2048</ftype>
      <ftype_str>iso9660</ftype_str>
      <block_count>6047</block_count>
      <first_block>0</first_block>
      <last_block>6046</last_block>
      <fileobject>
        <filename>.</filename>
        <partition>1</partition>
        <id>1</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <filename>..</filename>
        <partition>1</partition>
        <id>2</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <filename>799PX_EU.BMP</filename>
        <partition>1</partition>
        <id>3</id>
        <name_type>r</name_type>
        <filesize>1437654</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>1</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="51200" img_offset="51200" len="1437654"/>
        </byte_runs>
        <hashdigest type="md5">4829f38a294d156345922db8abd5e91c</hashdigest>
        <hashdigest type="sha1">1bde6c981776d81c13fd657621b2d3c8359d1761</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>BBHELMET.AI</filename>
        <partition>1</partition>
        <id>4</id>
        <name_type>r</name_type>
        <filesize>1080282</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>2</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="1488896" img_offset="1488896" len="1080282"/>
        </byte_runs>
        <hashdigest type="md5">c14bda842e2889a732e0f5f9d8c0ae73</hashdigest>
        <hashdigest type="sha1">98ce1ae12ee18893e8e1bd738855b6e20cd7b5ef</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>G31DS.TIF</filename>
        <partition>1</partition>
        <id>5</id>
        <name_type>r</name_type>
        <filesize>125968</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>3</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2570240" img_offset="2570240" len="125968"/>
        </byte_runs>
        <hashdigest type="md5">1ea4939968f117de97b15437c6348847</hashdigest>
        <hashdigest type="sha1">d4c23ce4fecf17c8b952f98ed1cadc22a3d7399f</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>LION.SVG</filename>
        <partition>1</partition>
        <id>6</id>
        <name_type>r</name_type>
        <filesize>18324</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>4</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2697216" img_offset="2697216" len="18324"/>
        </byte_runs>
        <hashdigest type="md5">e5913bebe296eb433fdade7400860e73</hashdigest>
        <hashdigest type="sha1">efe2c396a4ad46bab873f58eef4dbe6607be030c</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>NEMASTYL.PNG</filename>
        <partition>1</partition>
        <id>7</id>
        <name_type>r</name_type>
        <filesize>2050617</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>5</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2715648" img_offset="2715648" len="2050617"/>
        </byte_runs>
        <hashdigest type="md5">0b0f9676ead317f643e9a58f0177d1e6</hashdigest>
        <hashdigest type="sha1">5d588800a5d5bd1ebe76ff2cbce0568a7f2dd386</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>OAKLAND0.JP2</filename>
        <partition>1</partition>
        <id>8</id>
        <name_type>r</name_type>
        <filesize>527345</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>6</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
<crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="4767744" img_offset="4767744" len="527345"/> </byte_runs> <hashdigest type="md5">04f7802b45838fed393d45afadaa9dcc</hashdigest> <hashdigest type="sha1">5a7eb88804b0783e3e3fe208cd10085954173c0a</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>PICTURES</filename> <partition>1</partition> <id>9</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>7</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/> </byte_runs> <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest> <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/.</filename> <partition>1</partition> <id>10</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>7</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/> </byte_runs> <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest> <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/..</filename> <partition>1</partition> <id>11</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>0</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/> </byte_runs> <hashdigest 
type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest> <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/LANDING_.JPG</filename> <partition>1</partition> <id>12</id> <name_type>r</name_type> <filesize>1361321</filesize> <alloc>1</alloc> <used>1</used> <inode>10</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="6453248" img_offset="6453248" len="1361321"/> </byte_runs> <hashdigest type="md5">0ff111013ad2f8ded1171cee683e718a</hashdigest> <hashdigest type="sha1">6aa382a1f8fdb23b7e9f3823ae655ce405b68f9e</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/MARBLES.TGA</filename> <partition>1</partition> <id>13</id> <name_type>r</name_type> <filesize>4261301</filesize> <alloc>1</alloc> <used>1</used> <inode>11</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="7815168" img_offset="7815168" len="4261301"/> </byte_runs> <hashdigest type="md5">d5e100eb19481b8b7f05ac8cc3fd4e26</hashdigest> <hashdigest type="sha1">262f203a14d193e199a024e1e567579b5c22f110</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>VECTOR_N.EPS</filename> <partition>1</partition> <id>14</id> <name_type>r</name_type> <filesize>1041114</filesize> <alloc>1</alloc> <used>1</used> <inode>8</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="5296128" img_offset="5296128" len="1041114"/> </byte_runs> <hashdigest type="md5">8dd3a652970aa7f130414305b92ab8a8</hashdigest> <hashdigest 
type="sha1">66280f092b775f132d8fbff84b2226fcaf5d3dce</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>WFPC01.GIF</filename> <partition>1</partition> <id>15</id> <name_type>r</name_type> <filesize>113318</filesize> <alloc>1</alloc> <used>1</used> <inode>9</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="6338560" img_offset="6338560" len="113318"/> </byte_runs> <hashdigest type="md5">2eb15cb1834214b05d0083c691f9545f</hashdigest> <hashdigest type="sha1">bf8addf8b2fc09a9bf1ecc9e2c6c5a3b4453b24a</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>$OrphanFiles</filename> <partition>1</partition> <id>16</id> <name_type>d</name_type> <filesize>0</filesize> <alloc>1</alloc> <used>1</used> <inode>12</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> </fileobject> </volume> <!-- end of volume --> <!-- clock: 1.908907 --> <rusage> <utime>0.112000</utime> <stime>0.040000</stime> <maxrss>30588</maxrss> <minflt>1553</minflt> <majflt>2</majflt> <nswap>0</nswap> <inblock>416</inblock> <oublock>23544</oublock> <clocktime>1.908907</clocktime> <!-- stop_time: Mon Jan 23 21:13:55 2017 --> </rusage> </dfxml> </premis:objectCharacteristicsExtension>
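To show how this metadata could be consumed downstream, here is a minimal Python sketch that reads a dfxml block and reports the file system type and the regular files in each volume. It is only an illustration under stated assumptions, not part of Archivematica: the function name `summarize_dfxml` and the abbreviated `SAMPLE` string are hypothetical, and the element names and namespace URI are copied from the output above. Treating `name_type` `r` as a regular file and `d` as a directory follows what the sample shows.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the dfxml element in the METS excerpt above.
DFXML_NS = "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"
NS = {"d": DFXML_NS}

# Hypothetical, abbreviated sample based on the excerpt above
# (one volume, one directory entry, one regular file).
SAMPLE = """<dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML" version="1.0">
  <volume offset="0">
    <ftype_str>iso9660</ftype_str>
    <fileobject>
      <filename>.</filename>
      <name_type>d</name_type>
      <filesize>2048</filesize>
      <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
    </fileobject>
    <fileobject>
      <filename>LION.SVG</filename>
      <name_type>r</name_type>
      <filesize>18324</filesize>
      <hashdigest type="md5">e5913bebe296eb433fdade7400860e73</hashdigest>
      <hashdigest type="sha1">efe2c396a4ad46bab873f58eef4dbe6607be030c</hashdigest>
    </fileobject>
  </volume>
</dfxml>"""


def summarize_dfxml(xml_text):
    """Return one dict per <volume>: file system type plus its regular files."""
    root = ET.fromstring(xml_text)
    volumes = []
    # iter() also matches the root element itself, so this works for a bare
    # <dfxml> element as well as for a larger document that contains one.
    for dfxml in root.iter("{%s}dfxml" % DFXML_NS):
        for volume in dfxml.findall("d:volume", NS):
            files = []
            for fobj in volume.findall("d:fileobject", NS):
                # Skip anything that is not marked as a regular file ("r").
                if fobj.findtext("d:name_type", namespaces=NS) != "r":
                    continue
                hashes = {
                    h.get("type"): h.text
                    for h in fobj.findall("d:hashdigest", NS)
                }
                files.append({
                    "filename": fobj.findtext("d:filename", namespaces=NS),
                    "filesize": int(fobj.findtext("d:filesize", namespaces=NS)),
                    "md5": hashes.get("md5"),
                })
            volumes.append({
                "file_system": volume.findtext("d:ftype_str", namespaces=NS),
                "files": files,
            })
    return volumes


for vol in summarize_dfxml(SAMPLE):
    print(vol["file_system"], [f["filename"] for f in vol["files"]])
```

Because the lookup starts from any `dfxml` element found in the parsed tree, the same function should also work on a full METS document, provided the METS declares its own namespaces (such as the `premis` prefix) so that it parses as well-formed XML.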