DROID, JHOVE, NLNZ Metadata Extractor

From Archivematica


Purpose

The purpose of these three tools is to identify and validate formats and extract technical metadata.

Comments and evaluations by external sources

Performance Study of Digital Object Format Identification & Validation Tools, Quyen Nguyen, ERA Systems Engineering National Archives & Records Administration, DLF Fall 2008 Forum, Nov. 11-14, 2008. www.diglib.org/forums/fall2008/presentations/Nguyen.pdf

  • "Statistically, JHOVE and DROID perform equally well for format types that JHOVE can identify. Qualitatively, JHOVE generated metadata is richer. For types that JHOVE cannot validate, the performance decreases drastically compared to DROID. Easy case: if JHOVE finds that a record is binary, it just responds with a general identification, e.g. ByteStream. But some ASCII cases such as VRML may throw it off." (p. 25)
  • "Integrated Approach: Two-phase approach for File Identification and Validation: – Pass a file through DROID to quickly identify its type. – If the type is found to be on the known list of JHOVE, then pass through JHOVE to extract technical metadata. These extracted technical metadata are useful for automatic verification purposes. Examples include image resolution, format version numbers, creation dates, font information, etc." (p. 26)
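The two-phase approach quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the interface of either tool: `identify` and `characterize` are hypothetical callables standing in for DROID and JHOVE, and `JHOVE_KNOWN_FORMATS` is an assumed set of labels taken from the JHOVE formats listed further down this page.

```python
from typing import Callable, Optional

# Assumption: labels drawn from the JHOVE formats listed on this page.
# A real deployment would use whatever identifiers its tools actually emit.
JHOVE_KNOWN_FORMATS = frozenset({
    "AIFF", "GIF", "HTML", "JPEG", "JPEG2000", "PDF",
    "TIFF", "TXT", "WAV", "XHTML", "XML",
})


def two_phase(path: str,
              identify: Callable[[str], Optional[str]],
              characterize: Callable[[str], dict]) -> dict:
    """Identify first; extract metadata only for formats the second tool knows."""
    fmt = identify(path)  # phase 1: quick identification (DROID's role)
    result = {"path": path, "format": fmt, "metadata": None}
    if fmt in JHOVE_KNOWN_FORMATS:
        # phase 2: technical metadata extraction (JHOVE's role)
        result["metadata"] = characterize(path)
    return result
```

Passing the two tools in as callables keeps the sketch independent of how each tool is actually invoked, which this page does not describe.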

Digital Asset Assessment Tool – File Format Testing Tools -Version 1.2, University of London Computer Centre Digital Asset Assessment Tool (DAAT) Project, Dec. 13, 2006. www.jisc.ac.uk/media/documents/programmes/preservation/daat_file_format_tools_report.pdf.

  • "The results of our tests indicate that even in a small sample of objects, [JHOVE's] rate is very high. The three ‘Not well-formed’ results indicate that JHOVE is doing its job, i.e. identifying format weaknesses in particular files, thus highlighting areas which may require preservation action. It may be instructive to compare this with DROID, where a ‘Not identified’ result doesn’t indicate there’s anything wrong with the actual file format; rather, it indicates that DROID has failed to identify it. Clearly JHOVE is going to be essential in a digital preservation context (particularly one which manages to implement the PREMIS model), as it can be used continually to check and recheck each digital object stored in the repository, and by a process of ongoing validation will give you some clues as to whether you're doing something in your preservation actions which might affect the validity of the asset." (p. 8)
  • "In terms of adding an automated 'crawl and assess' feature...we think DROID leaves a lot to be desired. DROID will do an automated crawl of all file formats in a drive, but it will only provide 'static' information on its format, based on whatever information is currently stored in PRONOM. Importantly, DROID isn't really looking 'inside' a file, just reporting on the extension. To put it bluntly, for all your files which end in .TXT, DROID will tell you exactly the same thing for all of them." (p. 11)
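The distinction the report draws between reporting on the extension and looking "inside" a file can be illustrated with a small sketch. It does not reproduce how DROID or any of the three tools is implemented; the function names are hypothetical and the signature list is a small hand-picked sample of well-known leading byte sequences.

```python
import os

# Leading byte signatures ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]


def by_extension(filename: str) -> str:
    """Guess the format from the file name alone."""
    return os.path.splitext(filename)[1].lstrip(".").upper() or "UNKNOWN"


def by_signature(data: bytes) -> str:
    """Guess the format from the first bytes of the content."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "UNKNOWN"
```

A PDF saved under the name `report.txt` would be labelled TXT by the first function and PDF by the second, which is the kind of mismatch an extension-only check cannot detect.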

Digital Images Archiving Study, Arts and Humanities Data Service, March 2006. http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file986.pdf

  • "The [output of NLNZ Metadata Extractor] is more consistent but less detailed than the output produced by JHOVE. It extracts a very limited element set. There are some implementation problems, for example the tool fails to report the compression of jpeg images correctly. It can also be fooled because it identifies the file format on the basis of the file extension. Output from the program is in National Library Preservation Metadata Data Dictionary XML although the documentation suggests this is user configurable. The National Library of New Zealand Metadata Extract tool is open source like JHOVE but there has been less take-up and the National Library has not committed to institutional support for it. However, unlike JHOVE, it can handle complex relationships – for instance defining website files and relationships between them or spreadsheets." (p. 81)

File formats blog, Gary McGath, Nov. 9, 2006, http://fileformats.blogspot.com/2006_11_01_archive.html

  • JHOVE is being used quite a bit among digital libraries to identify and validate file formats. But having gotten this use, it's showing where it needs improvement. Once the project for the next version starts up -- it's still waiting for money -- there will be a lot of added features to make users happier. Here is a sampling:
    • Making modules more consistent in design, and easier to subclass. This should make it much easier to create third-party modules.
    • Plug-in support of non-JHOVE modules, and in particular, of the NLNZ Metadata Extraction Tool.
    • Separation of identification from validation. Currently, if a TIFF file is defective, JHOVE throws up its hands and says it doesn't know what kind of file it is, even if the defect is minor.
    • More persistent analysis. Most existing JHOVE modules give up at the first error.
    • Greater customizability. It will be possible to make JHOVE disregard particular types of defects, and to control what gets reported as metadata.
    • The ability to handle multiple formats within a file, and multiple files which form a single representation.
    • A mechanism for open-source collaboration. JHOVE is currently open source, but there's no mechanism, other than sending us email, to submit code changes.
    • Greater ease of integration into applications.
    • Integration with the upcoming Global Digital Format Registry
    • Use of Java 5, hopefully gaining efficiency by using the nio package.

JHOVE2: A Next-Generation Architecture for Format-Aware Digital Object Preservation Processing, California Digital Library (as part of Expression of Interest for Announcement RFEI-2006-TA01 Basic Technical Infrastructure, Tools, and Services to Strengthen the NDIIPP Preservation Partners Network), Dec. 19, 2007, http://confluence.ucop.edu/display/JHOVE2Info/Background.

  • The current JHOVE architecture is based on an implicit assumption that a digital object is manifest in a single file encapsulating a single formatted bit stream:

1 digital object = 1 file = 1 format

In practice, however, there are many common usages that fall well outside the boundaries of this assumption:

  • TIFF with embedded ICC color profile and XMP metadata: 1 object = 1 file = 3 formats (TIFF; ICC; XMP)
  • JPEG 2000 JPX profile with file fragmentation: 1 object = n files = 1 format
  • ESRI Shapefile: 1 object = 3 files = 3 formats (Shapefile data; Shapefile index; dBASE)

  • In the JHOVE2 data model a digital object is equivalent to a PREMIS representation, “a set of files [each containing one or more formatted bit streams] . . . needed for a complete and reasonable rendition of an Intellectual Entity.” Thus JHOVE2 will support the general case:

1 object = n files = m formats

The JHOVE2 API will define format-specific parsers that can be recursively invoked so that complex digital objects, existing in either nested containers or spread over multiple files, can be validated and characterized (or otherwise processed) as an aggregate unit. This new modeling paradigm will facilitate the support by JHOVE2 of a significantly wider range of digital formats in current or future use.
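The "1 object = n files = m formats" model and the idea of recursively invoked parsers can be made concrete with a small data-structure sketch. The class and method names below are hypothetical and are not taken from the JHOVE2 API; the sketch only shows how nested bit streams and multi-file objects might be counted as one aggregate unit.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Bitstream:
    """A formatted bit stream that may contain nested bit streams."""
    format: str
    embedded: List["Bitstream"] = field(default_factory=list)

    def formats(self) -> Set[str]:
        found = {self.format}
        for child in self.embedded:
            found |= child.formats()  # recurse into nested containers
        return found


@dataclass
class SourceFile:
    name: str
    streams: List[Bitstream]


@dataclass
class DigitalObject:
    """One object made of n files carrying m formats."""
    files: List[SourceFile]

    def summary(self) -> Tuple[int, int]:
        formats: Set[str] = set()
        for f in self.files:
            for stream in f.streams:
                formats |= stream.formats()
        return (len(self.files), len(formats))


# The TIFF and Shapefile cases from the quoted text (file names are made up).
tiff = DigitalObject([SourceFile("image.tif", [
    Bitstream("TIFF", [Bitstream("ICC"), Bitstream("XMP")]),
])])
shapefile = DigitalObject([
    SourceFile("roads.shp", [Bitstream("Shapefile data")]),
    SourceFile("roads.shx", [Bitstream("Shapefile index")]),
    SourceFile("roads.dbf", [Bitstream("dBASE")]),
])
print(tiff.summary())       # (1, 3)
print(shapefile.summary())  # (3, 3)
```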

Feature comparison


Feature                        | DROID                                 | JHOVE                | NLNZ Metadata Extractor
Identifies formats             | Yes (on basis of file extension only) | Yes                  | Yes (on basis of file extension only)
Identifies versions            | Yes                                   | Yes                  | Yes
Confirms well-formed and valid | No                                    | Yes                  | No
Extracts technical metadata    | No                                    | Yes                  | Yes (limited)
Outputs XML reports            | Yes                                   | Yes                  | Yes
Other                          | Links to format registry (PRONOM)     | Calculates checksums | Handles complex relationships

Formats handled (according to documentation) (simplified - i.e. excludes versions, text encodings etc.):

Office documents
  • DROID: DBF, DOC, Lotus formats, MS Works formats, OpenOffice formats, MDB, MPP, PDF, PPT, PST, PUB, RTF, StarOffice formats, TXT, VSD, WPD, WS and other WordStar formats, XLS (BIFF) and others
  • JHOVE: PDF, TXT. NOTE: rudimentary processing of proprietary formats as generic bytestream objects
  • NLNZ Metadata Extractor: DOC, MS Works formats, OpenOffice formats, PDF, PPT, WPD, XLS

Images
  • DROID: BMP, CDR and other Corel formats, DWG and other AutoCad formats, DXF, EPS, GeoTIFF, GIF, JPEG, JPEG2000, PageMaker documents, PCX, PNG, PS, PSD, PSP, SWF and other Macromedia formats, SVG, TIFF and others
  • JHOVE: GIF, JPEG, JPEG2000, TIFF
  • NLNZ Metadata Extractor: BMP, GIF, JPEG, TIFF

Sound / moving image
  • DROID: AIFF, ASF, AVI, MIDI, MOV, MP3, MPG, Real Audio (RM/A), WAV and others
  • JHOVE: AIFF, WAV
  • NLNZ Metadata Extractor: WAV, MP3

Markup languages
  • DROID: GML, HTML, ODF, XML, XHTML and others
  • JHOVE: HTML, XHTML, XML
  • NLNZ Metadata Extractor: HTML, XML

Other
  • DROID: JS, TAR (Tape Archive Format), ZIP and others

Test file results

Moved to System Testing > Test File Results
