Difference between revisions of "Improvements/Disk Image Preservation"
Revision as of 10:57, 24 April 2017
Synopsis
This project is being sponsored by UCLA Library and NYPL Special Collections but more collaborators are welcome! Please get in touch on the community user forum.
Different tools are needed to extract different disk images, depending on the file system the disk image was created from. Archivematica's standard tool for disk image extraction is tsk_recover, which is limited to 18 file systems. A challenge to invoking the right tool for the job is that Archivematica selects tools by file format rather than by other characteristics (in this case, the file system). For example, if a disk image is identified as an ISO image, Archivematica will currently invoke tsk_recover regardless of the file system of the image, whereas tsk_recover can only extract the contents of the 18 file systems it supports.
In this project we are particularly focused on HFS disk images, which are not currently supported by tsk_recover, but the development will be generic enough to be useful for other types of disk images as well.
User story
As an archivist, I would like to process disk images with an unknown file system through Archivematica, and have Archivematica and/or its associated tools recognize and record the file system and choose appropriate tools for disk image extraction and characterization. Further, I would like to be able to pull statistics about the size and file type of disk images from the system.
Development tasks
We have identified the following development tasks which would need to be addressed in the order described:
Recommendation: upgrade Sleuthkit [1 support ticket]
Upgrade from version 4.1.3 (three years old) to 4.4 (most recent release).
Pre-ingest script [1 support ticket]
Develop a pre-ingest script which would identify the file system and store this metadata in such a way that it can be passed to Archivematica. This could be made part of an automation tool script.
Update: after a first iteration we have decided to take a different approach. See Second/third iterations, below.
File identification script [1 support ticket]
Currently, the file identification scripts run in Archivematica will identify the type of disk image but not the file system of the disk image. This script will use the data from the pre-ingest script to identify both the disk image type and the file system.
Results
As a first iteration, we used identify by extension and created an FPR entry for raw disk image with HFS:
New FormatVersion: Format: Raw Disk Image; Description: Raw Disk Image (HFS filesystem)
Modify Identify by File Extension like [2]
Command:
from __future__ import print_function

import os.path
import subprocess
import sys


def file_tool(path):
    # universal_newlines=True returns text rather than bytes on Python 3.
    return subprocess.check_output(['file', path], universal_newlines=True).strip()


def blkid(path):
    # Returns an empty string if blkid fails or reports nothing.
    try:
        return subprocess.check_output(['blkid', '-o', 'full', path], universal_newlines=True)
    except Exception:
        return ''


(_, extension) = os.path.splitext(sys.argv[1])
if extension:
    print(extension, end='')
    # Only raw .img files are probed for an HFS file system.
    if extension in ('.img',):
        output = blkid(sys.argv[1])
        if 'TYPE="hfs"' in output:
            print(' (hfs)')
else:
    # Plaintext files frequently have no extension, but are common to identify.
    # file is pretty smart at figuring these out.
    file_output = file_tool(sys.argv[1])
    if 'text' in file_output:
        print('.txt')
New ID Rule: Format: Raw Disk Image (HFS filesystem); Command: Identify by File Extension; Output: ".img (hfs)"
Disable Rule with output ".img", or modify to identify as Raw Disk Image
[1] https://github.com/cul-it/hfs2dfxml
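The substring test on the blkid output could be made slightly more explicit by parsing the TYPE field. The following is a minimal sketch, assuming blkid's `-o full` output uses the usual KEY="value" form; the helper names `parse_blkid_type` and `identification_output` are illustrative and are not part of the Archivematica command.

```python
import re

# Match TYPE="..." but not PTTYPE="..." or SEC_TYPE="...".
_TYPE_RE = re.compile(r'(?<![A-Z_])TYPE="([^"]*)"')


def parse_blkid_type(blkid_output):
    """Return the file system named in blkid output, or None if absent."""
    match = _TYPE_RE.search(blkid_output)
    return match.group(1) if match else None


def identification_output(extension, blkid_output):
    """Build the string an identification command like the one above would print."""
    fs_type = parse_blkid_type(blkid_output) if extension == '.img' else None
    if fs_type == 'hfs':
        return extension + ' (hfs)'
    return extension
```

Keeping the parsing in its own function would make it easier to add other file systems later without touching the extension logic.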
Characterize [1 support ticket]
Write meaningful characterization about the size and file type of the disk images so that statistics can be gathered from the AIPs. Currently, when fiwalk is run as the characterization tool for a disk image, dfxml is written in premis:objectCharacteristicsExtension.
Results
For characterization, hfs2dfxml [4] provides a nice XML output with metadata about the image. However, it is not packaged for Ubuntu. To install it, follow the instructions in the README [5]: install hfsutils and python-magic, clone the repository, and clone the dfxml dependency into the correct location inside the repository.
Without a patch, hfs2dfxml can only be run from inside the repository. Either clone my fork [6] and change branches, or apply the patch [7] yourself.
FPR changes
- Set up file identification FPR changes
- Install & patch hfs2dfxml somewhere Archivematica can run it from
- Disable "Delete packages after extraction"
- New FPR Tool: Description: hfs2dfxml; Version: git commit hash
- New Characterization Command:
- Tool: hfs2dfxml
- Description: hfs2dfxml characterization
- Script Type: bash
- Command:
output=/tmp/temp_`uuid -v4`
echo $(id)
python /home/users/hbecker/bin/hfs2dfxml/hfs2dfxml/hfs2dfxml.py "%fileFullName%" $output
cat $output
rm $output
- Output Format: Text (Markup): XML: XML
- Command Usage: Characterization
- New Characterization Rule: Purpose: Characterization; Format: Raw Disk Image (HFS filesystem); Command: hfs2dfxml
However, that setup currently generates an error when run through Archivematica: "_call_hmount error: Failed to initialize HFS working directories: Permission denied". hfs2dfxml itself runs, but fails when it tries to call hfsutils. This requires further investigation.
[4] https://github.com/cul-it/hfs2dfxml
[5] https://github.com/cul-it/hfs2dfxml/blob/master/README.md
[6] https://github.com/Hwesta/hfs2dfxml/tree/patch-1
[7] https://github.com/cul-it/hfs2dfxml/pull/7/files
File extraction [1 support ticket]
Implement tools such as HFS Utilities that will allow files to be extracted from HFS disk images.
Results
- Fiwalk does not recognize the filesystem, and cannot extract from it.
- hfsutils provides the hmount and hcopy commands, but hcopy is not recursive
- tsk_recover cannot recognize the filesystem, outputting "Cannot determine file system type (Sector offset: 0)Files Recovered: 0"
- hfsexplorer [3] provides a command line extraction tool for HFS filesystems. However, hfsexplorer is not packaged for Ubuntu, and must be installed manually.
To install hfsexplorer, download and extract it. By default it uses a GUI, but a command line interface is accessible from the hfsx.sh script. The script we want is unhfs.sh, which extracts files from the image.
FPR changes:
To handle extraction, use hfsexplorer's unhfs command to extract all files from the hfs partition.
- Set up file identification FPR changes
- Install hfsexplorer somewhere Archivematica can run it from
- New FPR Tool: Description: hfsexplorer; Version: 0.23.1
- New Extraction Command:
- Tool: hfsexplorer
- Description: unhfs
- Script Type: bash
- Command:
mkdir "%outputDirectory%"
/home/users/hbecker/bin/hfsexplorer/bin/unhfs.sh -v -o "%outputDirectory%" "%inputFile%"
- Output location: outputDirectory
- Command Usage: Extraction
- New Extraction Rule: Purpose: Extract; Format: Raw Disk Image (HFS filesystem); Command: unhfs
[3] http://www.catacombae.org/hfsexplorer/
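For testing the extraction outside Archivematica, the unhfs call above can be wrapped in a few lines of Python. This is a sketch only: the function names are illustrative, the -v and -o flags are taken from the FPR command above, and the location of unhfs.sh is whatever path hfsexplorer was installed to.

```python
import os
import subprocess


def build_unhfs_command(unhfs_path, image_path, output_dir):
    """Assemble the same unhfs invocation used in the FPR command."""
    return [unhfs_path, '-v', '-o', output_dir, image_path]


def extract_hfs_image(unhfs_path, image_path, output_dir):
    """Create the output directory if needed, then run unhfs."""
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    # check_call raises CalledProcessError if unhfs exits with a non-zero status.
    subprocess.check_call(build_unhfs_command(unhfs_path, image_path, output_dir))
```

Passing the command as a list avoids shell quoting problems with image names that contain spaces.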
Deployment of development above for testing [1 support ticket]
Reporting [?? support tickets, depends on scope]
Explore reporting possibilities either in Archivematica or using other reporting tools.
Second/third iterations
The goal of the second iteration is to get around the need to identify by extension. This will allow the disk image itself and its contents to be properly identified using a normal identification tool. For the purposes of this project we'll do it using Siegfried.
Steps to the second iteration:
- test analyzer script from CCA Disk Image Processor[1] on disk image samples
- verify that for our samples, the output csv file has identified the file system
- integrate the script into the Siegfried file identification script running within Archivematica. This could be developed to run any time Siegfried is run as the file identification tool, or as a special configuration using the Disk Image transfer type.
- this will produce a special format (e.g., "img-hfs" or similar) which will need to be added to the FPR in order to prompt the desired tools for further micro-services.
[1] Script from CCA Disk Image Processor can be run outside the GUI: https://github.com/timothyryanwalsh/cca-diskimageprocessor/blob/master/diskimageanalyzer.py
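The verification step and the "img-hfs" style label could be prototyped roughly as follows. This is a sketch under stated assumptions: the column names 'Disk image' and 'File system' are hypothetical placeholders and must be checked against the header of the csv file the analyzer actually writes.

```python
import csv
import io

# Hypothetical column names; check them against the real csv header.
IMAGE_COLUMN = 'Disk image'
FS_COLUMN = 'File system'


def formats_from_analysis(csv_text):
    """Map each disk image to a format label such as 'img-hfs'.

    Images whose file system was not identified map to None.
    """
    labels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        fs = (row.get(FS_COLUMN) or '').strip().lower()
        labels[row[IMAGE_COLUMN]] = 'img-' + fs if fs else None
    return labels
```

Any image that maps to None is a sample for which the analyzer did not identify the file system and would need a closer look.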
The goal of a third iteration would be to avoid the need for a special format, and instead have the extract packages and characterization tools chosen based on file system information rather than format. There are two possibilities for enacting FPR rules based on file system type rather than on file format:
- as part of Disk Image transfer type workflow
- based on the existence of analysis.csv file in the transfer
As part of a third iteration we could also look at converting the contents of analysis.csv to dfxml and adding it to the METS file.
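The idea of choosing tools by file system rather than by format can be illustrated with a small dispatch table. This is not how the FPR works today; the mapping, the function names and the assumption that analysis.csv sits at the top level of the transfer are all illustrative.

```python
import os

# Illustrative mapping only; tool names follow the FPR entries on this page.
EXTRACTORS = {'hfs': 'unhfs'}
DEFAULT_EXTRACTOR = 'tsk_recover'


def has_analysis(transfer_dir):
    """True if the transfer carries an analysis.csv at its top level (assumed location)."""
    return os.path.isfile(os.path.join(transfer_dir, 'analysis.csv'))


def choose_extractor(file_system):
    """Pick an extraction command from the file system, not the format."""
    return EXTRACTORS.get((file_system or '').lower(), DEFAULT_EXTRACTOR)
```

Falling back to tsk_recover keeps the current behaviour for any file system that has no dedicated tool.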
Tools/resources
- HFS Utilities
- hfs2dfxml - Utility to parse hfsutils output and produce DFXML for HFS-formatted disk images
- CCA Disk Image Processor - Creates ready-to-ingest SIPs from a directory of disk images and related files.
Example dfxml
This is how dfxml is currently written by fiwalk in the Archivematica METS file:
<premis:objectCharacteristicsExtension> <dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML" xmlns:dc="http://purl.org/dc/elements/1.1/" version="1.0"> <metadata> <dc:type>Disk Image</dc:type> </metadata> <creator version="1.0"> <program>fiwalk</program> <version>4.1.3</version> <build_environment> <compiler>GCC 4.8</compiler> <library name="afflib" version="3.6.6"/> <library name="libewf" version="20130416"/> </build_environment> <execution_environment> <command_line>fiwalk -x /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso -c /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt</command_line> <start_time>2017-01-23T21:13:53Z</start_time> </execution_environment> </creator> <!-- Reading configuration file /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt --> <!-- pattern: * method: dgi path: python /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/pronom_ident.py --> <source> <image_filename>/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso</image_filename> </source> <!-- fs start: 0 --> <volume offset="0"> <partition_offset>0</partition_offset> <block_size>2048</block_size> <ftype>2048</ftype> <ftype_str>iso9660</ftype_str> <block_count>6047</block_count> <first_block>0</first_block> <last_block>6046</last_block> <fileobject> <filename>.</filename> <partition>1</partition> <id>1</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>0</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/> </byte_runs> <hashdigest 
type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest> <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest> </fileobject> <fileobject> <filename>..</filename> <partition>1</partition> <id>2</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>0</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/> </byte_runs> <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest> <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest> </fileobject> <fileobject> <filename>799PX_EU.BMP</filename> <partition>1</partition> <id>3</id> <name_type>r</name_type> <filesize>1437654</filesize> <alloc>1</alloc> <used>1</used> <inode>1</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="51200" img_offset="51200" len="1437654"/> </byte_runs> <hashdigest type="md5">4829f38a294d156345922db8abd5e91c</hashdigest> <hashdigest type="sha1">1bde6c981776d81c13fd657621b2d3c8359d1761</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>BBHELMET.AI</filename> <partition>1</partition> <id>4</id> <name_type>r</name_type> <filesize>1080282</filesize> <alloc>1</alloc> <used>1</used> <inode>2</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="1488896" img_offset="1488896" len="1080282"/> </byte_runs> <hashdigest type="md5">c14bda842e2889a732e0f5f9d8c0ae73</hashdigest> <hashdigest type="sha1">98ce1ae12ee18893e8e1bd738855b6e20cd7b5ef</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>G31DS.TIF</filename> <partition>1</partition> <id>5</id> 
<name_type>r</name_type> <filesize>125968</filesize> <alloc>1</alloc> <used>1</used> <inode>3</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="2570240" img_offset="2570240" len="125968"/> </byte_runs> <hashdigest type="md5">1ea4939968f117de97b15437c6348847</hashdigest> <hashdigest type="sha1">d4c23ce4fecf17c8b952f98ed1cadc22a3d7399f</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>LION.SVG</filename> <partition>1</partition> <id>6</id> <name_type>r</name_type> <filesize>18324</filesize> <alloc>1</alloc> <used>1</used> <inode>4</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="2697216" img_offset="2697216" len="18324"/> </byte_runs> <hashdigest type="md5">e5913bebe296eb433fdade7400860e73</hashdigest> <hashdigest type="sha1">efe2c396a4ad46bab873f58eef4dbe6607be030c</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>NEMASTYL.PNG</filename> <partition>1</partition> <id>7</id> <name_type>r</name_type> <filesize>2050617</filesize> <alloc>1</alloc> <used>1</used> <inode>5</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="2715648" img_offset="2715648" len="2050617"/> </byte_runs> <hashdigest type="md5">0b0f9676ead317f643e9a58f0177d1e6</hashdigest> <hashdigest type="sha1">5d588800a5d5bd1ebe76ff2cbce0568a7f2dd386</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>OAKLAND0.JP2</filename> <partition>1</partition> <id>8</id> <name_type>r</name_type> <filesize>527345</filesize> <alloc>1</alloc> <used>1</used> <inode>6</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> 
<crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="4767744" img_offset="4767744" len="527345"/> </byte_runs> <hashdigest type="md5">04f7802b45838fed393d45afadaa9dcc</hashdigest> <hashdigest type="sha1">5a7eb88804b0783e3e3fe208cd10085954173c0a</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>PICTURES</filename> <partition>1</partition> <id>9</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>7</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/> </byte_runs> <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest> <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/.</filename> <partition>1</partition> <id>10</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>7</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/> </byte_runs> <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest> <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/..</filename> <partition>1</partition> <id>11</id> <name_type>d</name_type> <filesize>2048</filesize> <alloc>1</alloc> <used>1</used> <inode>0</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/> </byte_runs> <hashdigest 
type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest> <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/LANDING_.JPG</filename> <partition>1</partition> <id>12</id> <name_type>r</name_type> <filesize>1361321</filesize> <alloc>1</alloc> <used>1</used> <inode>10</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="6453248" img_offset="6453248" len="1361321"/> </byte_runs> <hashdigest type="md5">0ff111013ad2f8ded1171cee683e718a</hashdigest> <hashdigest type="sha1">6aa382a1f8fdb23b7e9f3823ae655ce405b68f9e</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <parent_object> <inode>7</inode> </parent_object> <filename>PICTURES/MARBLES.TGA</filename> <partition>1</partition> <id>13</id> <name_type>r</name_type> <filesize>4261301</filesize> <alloc>1</alloc> <used>1</used> <inode>11</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="7815168" img_offset="7815168" len="4261301"/> </byte_runs> <hashdigest type="md5">d5e100eb19481b8b7f05ac8cc3fd4e26</hashdigest> <hashdigest type="sha1">262f203a14d193e199a024e1e567579b5c22f110</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>VECTOR_N.EPS</filename> <partition>1</partition> <id>14</id> <name_type>r</name_type> <filesize>1041114</filesize> <alloc>1</alloc> <used>1</used> <inode>8</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="5296128" img_offset="5296128" len="1041114"/> </byte_runs> <hashdigest type="md5">8dd3a652970aa7f130414305b92ab8a8</hashdigest> <hashdigest 
type="sha1">66280f092b775f132d8fbff84b2226fcaf5d3dce</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>WFPC01.GIF</filename> <partition>1</partition> <id>15</id> <name_type>r</name_type> <filesize>113318</filesize> <alloc>1</alloc> <used>1</used> <inode>9</inode> <meta_type>1</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> <crtime>2013-09-05T11:30:00Z</crtime> <byte_runs> <byte_run file_offset="0" fs_offset="6338560" img_offset="6338560" len="113318"/> </byte_runs> <hashdigest type="md5">2eb15cb1834214b05d0083c691f9545f</hashdigest> <hashdigest type="sha1">bf8addf8b2fc09a9bf1ecc9e2c6c5a3b4453b24a</hashdigest> <!-- plugin_process --> </fileobject> <fileobject> <filename>$OrphanFiles</filename> <partition>1</partition> <id>16</id> <name_type>d</name_type> <filesize>0</filesize> <alloc>1</alloc> <used>1</used> <inode>12</inode> <meta_type>2</meta_type> <mode>0</mode> <nlink>1</nlink> <uid>0</uid> <gid>0</gid> </fileobject> </volume> <!-- end of volume --> <!-- clock: 1.908907 --> <rusage> <utime>0.112000</utime> <stime>0.040000</stime> <maxrss>30588</maxrss> <minflt>1553</minflt> <majflt>2</majflt> <nswap>0</nswap> <inblock>416</inblock> <oublock>23544</oublock> <clocktime>1.908907</clocktime> <!-- stop_time: Mon Jan 23 21:13:55 2017 --> </rusage> </dfxml> </premis:objectCharacteristicsExtension>
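Statistics about size and file type could be pulled from DFXML like the example above with a short script. The sketch below assumes it is given the dfxml element on its own (not the surrounding premis wrapper) and uses the namespace shown in the example; the function name and the shape of the summary are illustrative.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter

DFXML_NS = '{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}'


def summarize_dfxml(dfxml_text):
    """Return file systems, regular-file count, total bytes and extension counts."""
    root = ET.fromstring(dfxml_text)
    summary = {'file_systems': [], 'files': 0, 'bytes': 0, 'extensions': Counter()}
    for volume in root.iter(DFXML_NS + 'volume'):
        summary['file_systems'].append(volume.findtext(DFXML_NS + 'ftype_str'))
        for obj in volume.iter(DFXML_NS + 'fileobject'):
            # In the example, name_type 'r' marks files and 'd' marks directories.
            if obj.findtext(DFXML_NS + 'name_type') != 'r':
                continue
            summary['files'] += 1
            summary['bytes'] += int(obj.findtext(DFXML_NS + 'filesize') or 0)
            name = obj.findtext(DFXML_NS + 'filename') or ''
            summary['extensions'][os.path.splitext(name)[1].lower()] += 1
    return summary
```

Counting by extension is only a rough proxy for file type; a fuller report would use the format identification results recorded elsewhere in the METS file.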