Revision as of 11:48, 16 October 2012

Format identification requirements

Requirements

FITS runs for characterization and metadata extraction in the Transfer tab
- Output stored in transfer backlog

Format identification micro-service in Ingest tab, after SIP creation
- user selects from dropdown menu of identification tool options:
  - Identify file by extension - this is already the default in Archivematica 0.9
  - file utility - file utility identification metadata from FITS output
  - ffident - ffident identification metadata from FITS output
  - DROID - DROID identification metadata from FITS output
  - JHOVE - JHOVE identification metadata from FITS output
  - FIDO - run FIDO
  - Tika - run modified Tika
  - mediainfo - run mediainfo

Output for each tool must be restructured to comply with format policy

Format policies should have globally unique IDs

Format policies result in rules that are stored in the Format Policy Registry (FPR)
Do not split major/minor mimetypes

FITS output for format identification

in file utility output: <format> <mimetype>
in ffident output: <mimetype>
in DROID output: <PUID> <MimeType>
in JHOVE output: <format> <mimeType>
in FITS output: <identify format> split format and mimetype
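
As a rough illustration of the list above, the fields each tool contributes could be captured in a small lookup table and used to restructure raw tool output into one common record. The dictionary keys, the field names, and the `restructure` helper are assumptions made for this sketch; they are not Archivematica or FITS identifiers. The mimetype is kept as a single string, in line with the requirement not to split major/minor mimetypes.

```python
from dataclasses import dataclass
from typing import Optional

# Fields each tool reports, taken from the list above.
# The dictionary keys and field names are hypothetical labels.
TOOL_FIELDS = {
    "file utility": ("format", "mimetype"),
    "ffident": ("mimetype",),
    "DROID": ("puid", "mimetype"),
    "JHOVE": ("format", "mimetype"),
}


@dataclass
class Identification:
    tool: str
    format: Optional[str] = None
    puid: Optional[str] = None
    mimetype: Optional[str] = None


def restructure(tool: str, raw: dict) -> Identification:
    """Keep only the fields the named tool is expected to report."""
    wanted = TOOL_FIELDS[tool]
    # Match keys case-insensitively, since the tools spell them differently
    # (for example "MimeType" for DROID and "mimeType" for JHOVE).
    lowered = {key.lower(): value for key, value in raw.items()}
    values = {name: lowered.get(name) for name in wanted}
    return Identification(tool=tool, **values)
```

Fields a tool does not report stay `None`, so downstream code can tell "not reported" apart from an empty value.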

Use Cases

Workflows

Insert File identification micro-service after Clean up names micro-service and before Normalize micro-service
User selects from dropdown menu which FITS tool to trust for format identification (see tools listed above)
Archivematica bases Normalization path / application of format policy on the tool selected
Post 1.0 work: add tools outside of FITS that will run when selected by the user at the File identification micro-service.
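
The workflow above could be sketched as a lookup keyed on the tool the user chose to trust. This is a minimal sketch under stated assumptions: the `choose_rule` function, the shape of the two dictionaries, and the sample values (including the rule text) are made up for illustration and do not reflect the actual FPR schema.

```python
# Hypothetical identification results for one file, keyed by tool name.
identifications = {
    "DROID": "fmt/18",
    "file utility": "PDF document",
}

# Hypothetical format policy rules, keyed by (tool, identification).
rules = {
    ("DROID", "fmt/18"): "normalize to PDF/A",
}


def choose_rule(selected_tool, identifications, rules):
    """Return the rule for the trusted tool's result, or None if none applies."""
    result = identifications.get(selected_tool)
    if result is None:
        # The selected tool produced no identification for this file.
        return None
    return rules.get((selected_tool, result))
```

Only the selected tool's result is consulted; results from the other tools are ignored even when they are present.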

Whiteboard

Mockups

Note that the Normalize micro-service is included here for illustrative purposes only. In action, the user would not see the Normalize micro-service until a format identification mechanism was selected from the dropdown menu.

Format policy models and table

The Transcoder page has a lot of information on it.
This diagram is from there.

I think almost every table in the Format Policy Registry (FPR) will need its INT primary key replaced with a UUID primary key, and the corresponding foreign keys changed to match. This is to aid in the creation and submission of new rules.
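
A minimal sketch of the idea, assuming two hypothetical tables (`FormatRecord` and `Rule`) that stand in for whatever the FPR actually contains. Random UUIDs let rules be created on separate installations without coordinating an integer sequence.

```python
import uuid
from dataclasses import dataclass, field


def new_pk() -> str:
    """Generate a random (version 4) UUID string for use as a primary key."""
    return str(uuid.uuid4())


@dataclass
class FormatRecord:  # hypothetical table
    description: str
    pk: str = field(default_factory=new_pk)


@dataclass
class Rule:  # hypothetical table
    format_pk: str  # foreign key: holds the UUID of a FormatRecord
    command: str
    pk: str = field(default_factory=new_pk)


fmt = FormatRecord(description="example format")
rule = Rule(format_pk=fmt.pk, command="example command")
```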

We're also going to need to know what format the rules output to.

Speaking to future iterations: we'll need to be able to identify what formats the original file and the preservation file are in, and determine the format risk of the data with both of those considered. We may choose to normalize to a new preservation format from either the original file or the preservation file. The reason for this may be that both formats risk obsolescence, but there are only tools available to handle the preservation format.

To track changes, updates, and new items, I would suggest the FPR keep track of the last modified date of all rows. Clients keep track of their last updated date. These dates come from the FPR. Additionally, add a "marked for removal" boolean to each table in the FPR, to tell clients to remove that row from their database.
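
The suggestion above could look roughly like this on the client side. It is a sketch only: the `sync` function and the row keys (`pk`, `last_modified`, `marked_for_removal`) are assumed names, and rows are plain dictionaries rather than real database records.

```python
from datetime import datetime


def sync(local_rows, fpr_rows, last_updated):
    """Apply FPR changes made after last_updated; return the new last_updated."""
    newest = last_updated
    for row in fpr_rows:
        if row["last_modified"] <= last_updated:
            continue  # this client has already seen the row
        if row["marked_for_removal"]:
            # The FPR flagged the row, so drop the local copy if there is one.
            local_rows.pop(row["pk"], None)
        else:
            # New or updated row: insert or overwrite the local copy.
            local_rows[row["pk"]] = row
        # The stored date always comes from the FPR, never the client clock.
        newest = max(newest, row["last_modified"])
    return newest
```

Taking the new last updated date from the FPR rows themselves avoids problems with clock differences between the client and the registry.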
