Format identification requirements

From Archivematica


Requirements

  • FITS runs for characterization and metadata extraction in the transfer tab
    • Output indexed and stored in transfer backlog
  • Select format identification tool job in Normalization micro-service, after SIP creation
    • user selects from dropdown menu of identification tool options:
      • Identify file by extension - this is already the default in Archivematica 0.9
      • file utility - file utility identification metadata from FITS output
      • ffident - ffident identification metadata from FITS output
      • DROID - DROID identification metadata from FITS output
      • JHOVE - JHOVE identification metadata from FITS output
      • FIDO - run FIDO
      • Tika - run modified Tika
      • mediainfo - run mediainfo
  • Duplicate results need to be cleared from the DB - one command for each ID
  • Multiple IDs - do all normalization tasks that apply
    • Indicate duplicates in Norm report and viewer
    • Add delete functionality to the Review normalization report and viewer in the Dashboard browser; the user chooses one or the other
  • If there is no rule, normalization should be based on extension
  • Distinguish identification tool versions in DB
  • Output for each tool must be restructured to comply with format policy
  • Format policies should have globally unique IDs
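
The rule-selection requirements above (run every normalization task that applies when there are multiple IDs, flag those cases for the report, and fall back to the extension when no rule matches) could be sketched roughly as follows. This is only an illustration: the `Rule` class, the `select_rules` function, and the reading of "one command for each ID" as "run each command once" are assumptions, not part of Archivematica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical normalization rule keyed on a format ID or an extension."""
    key: str      # e.g. a PUID such as "fmt/43", or an extension such as ".jpg"
    command: str  # name of the normalization command to run


def select_rules(format_ids, extension, rules):
    """Return (rules_to_run, has_multiple_ids).

    Every rule matching any reported ID is kept, but each command only once.
    If no rule matches an ID, fall back to rules keyed on the file extension.
    """
    unique_ids = list(dict.fromkeys(format_ids))  # drop repeats, keep order
    matched, seen_commands = [], set()
    for format_id in unique_ids:
        for rule in rules:
            if rule.key == format_id and rule.command not in seen_commands:
                matched.append(rule)
                seen_commands.add(rule.command)
    if not matched:
        # No rule for any reported ID: normalize based on extension instead.
        matched = [r for r in rules if r.key == extension.lower()]
    return matched, len(unique_ids) > 1
```

The second return value is what a normalization report could use to indicate that a file received more than one identification.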

FITS output for format identification

  • in file utility output: <format> <mimetype>
  • in ffident output: <mimetype>
  • in DROID output: <PUID> <MimeType>
  • in JHOVE output: <format> <mimeType>
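
As a rough sketch of how the fields listed above might be used, the snippet below keeps only the identification fields for the selected tool. It assumes the FITS output has already been parsed into a plain dictionary per tool; the `TOOL_FIELDS` table and `identification_for` function are hypothetical names, and no real FITS XML parsing is shown.

```python
# Identification fields each tool exposes in FITS output, as listed above.
TOOL_FIELDS = {
    "file utility": ("format", "mimetype"),
    "ffident": ("mimetype",),
    "DROID": ("PUID", "MimeType"),
    "JHOVE": ("format", "mimeType"),
}


def identification_for(tool, tool_output):
    """Return the identification fields for `tool` with lower-cased keys.

    `tool_output` is assumed to be a dict of already-parsed values for that
    tool; fields that are missing or empty are skipped.
    """
    if tool not in TOOL_FIELDS:
        raise ValueError(f"no FITS identification fields known for {tool!r}")
    result = {}
    for field_name in TOOL_FIELDS[tool]:
        value = tool_output.get(field_name)
        if value:
            result[field_name.lower()] = value
    return result
```

Lower-casing the keys is one possible way to give the tools a common shape before the results are matched against a format policy.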

Use Cases

Workflows

  • Insert File identification micro-service after Clean up names micro-service and before Normalize micro-service
  • User selects from dropdown menu which FITS tool to trust for format identification (see tools listed above)
  • Archivematica bases Normalization path / application of format policy on the tool selected
  • Post 1.0 work: add tools outside of FITS that will run when selected by the user at the File identification micro-service.

Whiteboard

FileID.jpg


Mockups

Note that the Normalize micro-service is included here for illustrative purposes only. In action, the user would not see the Normalize micro-service until a format identification mechanism was selected from the dropdown menu.

FileIdentificationMS.png

Format policy models and table

The Transcoder page has a lot of information on it.
This diagram is from there.

Transcoder database schema.png

I think for the Format Policy Registry (FPR) almost every table will need its INT primary key (pk) replaced with a UUID pk, along with the corresponding foreign keys. This is to aid in the creation and submission of new rules.
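
A minimal sketch of what UUID keys could look like, using in-memory records rather than the actual Transcoder tables. The `Command` and `FormatPolicyRule` classes and their fields are hypothetical and only illustrate that a row created on any installation gets a key that will not collide when submitted to the FPR.

```python
import uuid
from dataclasses import dataclass, field


def new_pk() -> str:
    """Generate a random (version 4) UUID to use as a primary key."""
    return str(uuid.uuid4())


@dataclass
class Command:
    """Hypothetical command row with a UUID primary key."""
    description: str
    pk: str = field(default_factory=new_pk)


@dataclass
class FormatPolicyRule:
    """Hypothetical rule row; the foreign key holds the command's UUID."""
    format_id: str
    command_id: str  # foreign key: pk of a Command row
    pk: str = field(default_factory=new_pk)


command = Command("convert to TIFF")
rule = FormatPolicyRule(format_id="fmt/43", command_id=command.pk)
```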

We're also going to need to know what format each rule outputs.

Speaking to future iterations: we'll need to be able to identify what formats the original file and the preservation file are in, and determine the format risk of the data with both of those considered. We may choose to normalize to a new preservation format from either the original file or the preservation file. The reason for this may be that both formats risk obsolescence, but there are only tools available to handle the preservation format.

To track changes, updates, and new items, I would suggest the FPR keep track of the last modified date of all rows. Clients keep track of their last updated date. Both dates come from the FPR. In addition, add a "marked for removal" boolean to each table in the FPR, to tell clients to remove that row from their database.
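
The update scheme suggested above might look something like this on the client side. It is a sketch under assumptions: rows are plain dictionaries, and the field names `pk`, `last_modified`, and `marked_for_removal` as well as the `sync_table` function are made up for illustration.

```python
from datetime import datetime


def sync_table(local_rows, fpr_rows, last_updated):
    """Apply FPR changes newer than `last_updated` to `local_rows`.

    `local_rows` is a dict keyed by pk and is modified in place. Returns the
    new last-updated date, taken from the FPR's own timestamps.
    """
    newest = last_updated
    for row in fpr_rows:
        modified = row["last_modified"]
        if modified <= last_updated:
            continue  # the client already has this version
        if row["marked_for_removal"]:
            local_rows.pop(row["pk"], None)  # FPR asks clients to drop the row
        else:
            local_rows[row["pk"]] = row      # new or updated row
        newest = max(newest, modified)
    return newest
```

Because the returned date comes from the FPR rows rather than the client's clock, clock differences between client and registry should not cause missed updates in this scheme.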