Difference between revisions of "Improvements/Disk Image Preservation"

* [https://www.mars.org/home/rob/proj/hfs/ HFS Utilities]
* [https://github.com/cul-it/hfs2dfxml hfs2dfxml] - Utility to parse hfsutils output and produce DFXML for HFS-formatted disk images
* [https://github.com/timothyryanwalsh/cca-diskimageprocessor CCA Disk Image Processor] - Creates ready-to-ingest SIPs from a directory of disk images and related files.

Revision as of 15:00, 27 January 2017

Synopsis

This project is being sponsored by UCLA Library and NYPL Special Collections but more collaborators are welcome! Please get in touch on the community user forum.

Different tools are needed to extract disk images, depending on the file system from which the image was created. Archivematica's standard tool for disk image extraction is tsk_recover, which is limited to 18 file systems. A challenge to invoking the right tool for the job is that Archivematica selects tools by file format rather than by other characteristics (in this case, the file system). For example, if a disk image is identified as an ISO image, Archivematica will currently invoke tsk_recover regardless of the image's file system, whereas tsk_recover can only extract the contents of the 18 file systems it supports.

In this project we are particularly focused on HFS disk images, which are not currently supported by tsk_recover, but the development will be generic enough to be useful for other types of disk images as well.

User story

As an archivist, I would like to process disk images of an unknown file system through Archivematica, and have Archivematica and/or its associated tools recognize and record the file system and choose appropriate tools for disk image extraction and characterization. Further, I would like to be able to pull statistics about the size and file type of disk images from the system.

Development tasks

We have identified the following development tasks, which would need to be addressed in the order listed:

Recommendation: upgrade The Sleuth Kit [1 support ticket]

Upgrade from version 4.1.3 (three years old) to 4.4 (most recent release).

Pre-ingest script [1 support ticket]

Develop a pre-ingest script which would identify the file system and store this metadata in such a way that it can be passed to Archivematica. This could be made part of an automation tool script.
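
As a rough illustration of what such a pre-ingest check could look like, the sketch below inspects a raw image for well-known volume signatures: `CD001` at byte 32769 for ISO 9660, and `BD`, `H+` or `HX` at byte 1024 for HFS, HFS+ and HFSX. The function name and the labels it returns are hypothetical, and the sketch assumes an unpartitioned image whose volume starts at offset 0. It is not an existing Archivematica script.

```python
# Hypothetical pre-ingest helper: guess the file system(s) of a raw disk image
# from on-disk signatures. Assumes the volume starts at byte 0 of the image
# (no partition map), which will not hold for every image.
SIGNATURES = [
    ("iso9660", 32769, b"CD001"),  # ISO 9660 volume descriptor identifier
    ("hfs", 1024, b"BD"),          # classic HFS master directory block
    ("hfsplus", 1024, b"H+"),      # HFS+ volume header
    ("hfsx", 1024, b"HX"),         # HFSX volume header
]


def identify_file_systems(image_path):
    """Return a list of labels for every signature found in the image."""
    found = []
    with open(image_path, "rb") as image:
        for label, offset, magic in SIGNATURES:
            image.seek(offset)
            # Reading past the end of a short file simply returns fewer bytes.
            if image.read(len(magic)) == magic:
                found.append(label)
    return found
```

A list is returned rather than a single value because a hybrid CD-ROM image can carry more than one file system. An empty list means no known signature was found, in which case the image would need another identification method.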

File identification script [1 support ticket]

Currently, the file identification scripts run in Archivematica will identify the type of disk image but not the file system of the disk image. This script will use the data from the pre-ingest script to identify both the disk image type and the file system.

Characterize [1 support ticket]

Write meaningful characterization metadata about the size and file type of the disk images so that statistics can be gathered from the AIPs. Currently, when fiwalk is run as the characterization tool for a disk image, DFXML is written to premis:objectCharacteristicsExtension.

File extraction [1 support ticket]

Integrate tools such as HFS Utilities so that files can be extracted from HFS disk images.
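
The idea of choosing the extraction tool by file system rather than by file format could be sketched roughly as follows. The mapping is illustrative and deliberately incomplete, the function name is hypothetical, and the labels are assumptions: `iso9660` matches the value fiwalk records in `<ftype_str>`, while `hfs` stands for a classic HFS volume that would be handed to HFS Utilities.

```python
# Illustrative, incomplete mapping from a file system label to the tool that
# would extract it. These entries are assumptions for the sketch, not the
# full list of file systems that either tool supports.
EXTRACTION_TOOLS = {
    "iso9660": "tsk_recover",
    "ntfs": "tsk_recover",
    "fat16": "tsk_recover",
    "fat32": "tsk_recover",
    "ext3": "tsk_recover",
    "hfs": "hfsutils",  # classic HFS, the case this project focuses on
}


def choose_extraction_tool(file_system):
    """Return the name of the tool configured for the given file system."""
    label = file_system.strip().lower()
    try:
        return EXTRACTION_TOOLS[label]
    except KeyError:
        # Fail loudly rather than silently falling back to a default tool.
        raise ValueError("no extraction tool configured for %r" % file_system)
```

Raising an error for an unknown file system, instead of defaulting to tsk_recover, is a design choice in this sketch: it makes unsupported images visible rather than letting extraction fail later.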

Deploy the development above for testing [1 support ticket]

Reporting [?? support tickets, depends on scope]

Explore reporting possibilities either in Archivematica or using other reporting tools.

Tools/resources

Example dfxml

This is how DFXML is currently written by fiwalk in the Archivematica METS file:


<premis:objectCharacteristicsExtension>
  <dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML" xmlns:dc="http://purl.org/dc/elements/1.1/" version="1.0">
    <metadata>
      <dc:type>Disk Image</dc:type>
    </metadata>
    <creator version="1.0">
      <program>fiwalk</program>
      <version>4.1.3</version>
      <build_environment>
        <compiler>GCC 4.8</compiler>
        <library name="afflib" version="3.6.6"/>
        <library name="libewf" version="20130416"/>
      </build_environment>
      <execution_environment>
        <command_line>fiwalk -x /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso -c /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt</command_line>
        <start_time>2017-01-23T21:13:53Z</start_time>
      </execution_environment>
    </creator>
    <!-- Reading configuration file /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/ficonfig.txt -->
    <!-- pattern: *  method: dgi  path: python /usr/lib/archivematica/archivematicaCommon/externals/fiwalk_plugins/pronom_ident.py -->
    <source>
      <image_filename>/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/iso_image_2-0d13c428-985f-4ab3-a6f8-4c6d81ecd5b8/objects/images.iso</image_filename>
    </source>
    <!-- fs start: 0 -->
    <volume offset="0">
      <partition_offset>0</partition_offset>
      <block_size>2048</block_size>
      <ftype>2048</ftype>
      <ftype_str>iso9660</ftype_str>
      <block_count>6047</block_count>
      <first_block>0</first_block>
      <last_block>6046</last_block>
      <fileobject>
        <filename>.</filename>
        <partition>1</partition>
        <id>1</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <filename>..</filename>
        <partition>1</partition>
        <id>2</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <filename>799PX_EU.BMP</filename>
        <partition>1</partition>
        <id>3</id>
        <name_type>r</name_type>
        <filesize>1437654</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>1</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="51200" img_offset="51200" len="1437654"/>
        </byte_runs>
        <hashdigest type="md5">4829f38a294d156345922db8abd5e91c</hashdigest>
        <hashdigest type="sha1">1bde6c981776d81c13fd657621b2d3c8359d1761</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>BBHELMET.AI</filename>
        <partition>1</partition>
        <id>4</id>
        <name_type>r</name_type>
        <filesize>1080282</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>2</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="1488896" img_offset="1488896" len="1080282"/>
        </byte_runs>
        <hashdigest type="md5">c14bda842e2889a732e0f5f9d8c0ae73</hashdigest>
        <hashdigest type="sha1">98ce1ae12ee18893e8e1bd738855b6e20cd7b5ef</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>G31DS.TIF</filename>
        <partition>1</partition>
        <id>5</id>
        <name_type>r</name_type>
        <filesize>125968</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>3</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2570240" img_offset="2570240" len="125968"/>
        </byte_runs>
        <hashdigest type="md5">1ea4939968f117de97b15437c6348847</hashdigest>
        <hashdigest type="sha1">d4c23ce4fecf17c8b952f98ed1cadc22a3d7399f</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>LION.SVG</filename>
        <partition>1</partition>
        <id>6</id>
        <name_type>r</name_type>
        <filesize>18324</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>4</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2697216" img_offset="2697216" len="18324"/>
        </byte_runs>
        <hashdigest type="md5">e5913bebe296eb433fdade7400860e73</hashdigest>
        <hashdigest type="sha1">efe2c396a4ad46bab873f58eef4dbe6607be030c</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>NEMASTYL.PNG</filename>
        <partition>1</partition>
        <id>7</id>
        <name_type>r</name_type>
        <filesize>2050617</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>5</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="2715648" img_offset="2715648" len="2050617"/>
        </byte_runs>
        <hashdigest type="md5">0b0f9676ead317f643e9a58f0177d1e6</hashdigest>
        <hashdigest type="sha1">5d588800a5d5bd1ebe76ff2cbce0568a7f2dd386</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>OAKLAND0.JP2</filename>
        <partition>1</partition>
        <id>8</id>
        <name_type>r</name_type>
        <filesize>527345</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>6</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="4767744" img_offset="4767744" len="527345"/>
        </byte_runs>
        <hashdigest type="md5">04f7802b45838fed393d45afadaa9dcc</hashdigest>
        <hashdigest type="sha1">5a7eb88804b0783e3e3fe208cd10085954173c0a</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>PICTURES</filename>
        <partition>1</partition>
        <id>9</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>7</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest>
        <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest>
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/.</filename>
        <partition>1</partition>
        <id>10</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>7</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="49152" img_offset="49152" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">f235bde51205efe86b5499455f2c4a50</hashdigest>
        <hashdigest type="sha1">61d303072d4c3b619d622d24b15984bc4e000795</hashdigest>
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/..</filename>
        <partition>1</partition>
        <id>11</id>
        <name_type>d</name_type>
        <filesize>2048</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>0</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="47104" img_offset="47104" len="2048"/>
        </byte_runs>
        <hashdigest type="md5">ba20004c2745ecf912cb3d720bcd1c10</hashdigest>
        <hashdigest type="sha1">34a57ea447a8b2d53555ac8b773437362c0c7c3d</hashdigest>
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/LANDING_.JPG</filename>
        <partition>1</partition>
        <id>12</id>
        <name_type>r</name_type>
        <filesize>1361321</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>10</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="6453248" img_offset="6453248" len="1361321"/>
        </byte_runs>
        <hashdigest type="md5">0ff111013ad2f8ded1171cee683e718a</hashdigest>
        <hashdigest type="sha1">6aa382a1f8fdb23b7e9f3823ae655ce405b68f9e</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <parent_object>
          <inode>7</inode>
        </parent_object>
        <filename>PICTURES/MARBLES.TGA</filename>
        <partition>1</partition>
        <id>13</id>
        <name_type>r</name_type>
        <filesize>4261301</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>11</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="7815168" img_offset="7815168" len="4261301"/>
        </byte_runs>
        <hashdigest type="md5">d5e100eb19481b8b7f05ac8cc3fd4e26</hashdigest>
        <hashdigest type="sha1">262f203a14d193e199a024e1e567579b5c22f110</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>VECTOR_N.EPS</filename>
        <partition>1</partition>
        <id>14</id>
        <name_type>r</name_type>
        <filesize>1041114</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>8</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="5296128" img_offset="5296128" len="1041114"/>
        </byte_runs>
        <hashdigest type="md5">8dd3a652970aa7f130414305b92ab8a8</hashdigest>
        <hashdigest type="sha1">66280f092b775f132d8fbff84b2226fcaf5d3dce</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>WFPC01.GIF</filename>
        <partition>1</partition>
        <id>15</id>
        <name_type>r</name_type>
        <filesize>113318</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>9</inode>
        <meta_type>1</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <crtime>2013-09-05T11:30:00Z</crtime>
        <byte_runs>
          <byte_run file_offset="0" fs_offset="6338560" img_offset="6338560" len="113318"/>
        </byte_runs>
        <hashdigest type="md5">2eb15cb1834214b05d0083c691f9545f</hashdigest>
        <hashdigest type="sha1">bf8addf8b2fc09a9bf1ecc9e2c6c5a3b4453b24a</hashdigest>
        <!-- plugin_process -->
      </fileobject>
      <fileobject>
        <filename>$OrphanFiles</filename>
        <partition>1</partition>
        <id>16</id>
        <name_type>d</name_type>
        <filesize>0</filesize>
        <alloc>1</alloc>
        <used>1</used>
        <inode>12</inode>
        <meta_type>2</meta_type>
        <mode>0</mode>
        <nlink>1</nlink>
        <uid>0</uid>
        <gid>0</gid>
      </fileobject>
    </volume>
    <!-- end of volume -->
    <!-- clock: 1.908907 -->
    <rusage>
      <utime>0.112000</utime>
      <stime>0.040000</stime>
      <maxrss>30588</maxrss>
      <minflt>1553</minflt>
      <majflt>2</majflt>
      <nswap>0</nswap>
      <inblock>416</inblock>
      <oublock>23544</oublock>
      <clocktime>1.908907</clocktime>
      <!-- stop_time: Mon Jan 23 21:13:55 2017 -->
    </rusage>
  </dfxml>
</premis:objectCharacteristicsExtension>
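
To show how statistics might be pulled from output like the above, here is a minimal sketch that walks the `fileobject` elements, keeps only regular files (`name_type` of `r`), and tallies sizes and file extensions. The function name and the shape of the returned dictionary are hypothetical, and using the filename extension as a stand-in for file type is an assumption made for brevity.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the xmlns attribute of the <dfxml> element above.
DFXML_NS = {"dfxml": "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"}


def summarize_dfxml(root):
    """Summarize regular files found under a parsed DFXML (or METS) element."""
    extensions = Counter()
    total_bytes = 0
    file_count = 0
    for fileobject in root.iter("{%s}fileobject" % DFXML_NS["dfxml"]):
        # Skip directories and other non-regular entries.
        if fileobject.findtext("dfxml:name_type", namespaces=DFXML_NS) != "r":
            continue
        name = fileobject.findtext("dfxml:filename", default="", namespaces=DFXML_NS)
        size = fileobject.findtext("dfxml:filesize", default="0", namespaces=DFXML_NS)
        file_count += 1
        total_bytes += int(size)
        extensions[os.path.splitext(name)[1].lower() or "(none)"] += 1
    file_systems = [
        volume.findtext("dfxml:ftype_str", default="unknown", namespaces=DFXML_NS)
        for volume in root.iter("{%s}volume" % DFXML_NS["dfxml"])
    ]
    return {
        "file_systems": file_systems,
        "file_count": file_count,
        "total_bytes": total_bytes,
        "extensions": dict(extensions),
    }
```

Applied to the example above, this would report one `iso9660` volume containing 10 regular files totalling 12,017,244 bytes. Because the function searches with `iter`, it can be given either the `dfxml` element itself or a larger parsed document that contains it.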

See also

Original requirements for forensic image ingest