Format identification requirements

Latest revision as of 16:25, 11 February 2020

This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation (https://www.archivematica.org/docs/latest/) for up-to-date information.

Requirements

  • FITS runs for characterization and metadata extraction in transfer tab
    • Output indexed and stored in transfer backlog
  • Select format identification tool job in Normalization micro-service, after SIP creation
    • user selects from dropdown menu of identification tool options:
      • Identify file by extension - this is already the default in Archivematica 0.9
      • file utility - file utility identification metadata from FITS output
      • ffident - ffident identification metadata from FITS output
      • DROID - DROID identification metadata from FITS output
      • JHOVE - JHOVE identification metadata from FITS output
      • FIDO - run FIDO
      • Tika - run modified Tika
      • mediainfo - run mediainfo
  • Duplicate results need to be cleared from the DB - one command for each ID
  • Multiple IDs - do all normalization tasks that apply
    • Indicate duplicates in Normalization report and viewer
    • Add delete functionality to Review normalization report and viewer in Dashboard browser; the user chooses one or the other
  • If there is no rule, normalization should be based on extension
  • Distinguish identification tool versions in DB
  • Output for each tool must be restructured to comply with format policy
  • Format policies should have globally unique IDs
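
The multiple-ID and extension-fallback requirements above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Archivematica's implementation: the `Rule` type, the `rules_for` function, and any keys or command names passed to them are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical normalization rule keyed by a format ID or an extension."""
    key: str       # a format ID reported by a tool, or an extension such as ".jpg"
    command: str   # name of the normalization command to run


def rules_for(format_ids, extension, rules):
    """Return every rule that applies to a file.

    All rules matching any reported format ID are returned, without
    duplicates (multiple IDs: do all normalization tasks that apply).
    If no rule matches, fall back to rules keyed by file extension.
    """
    matched = []
    for rule in rules:
        if rule.key in format_ids and rule not in matched:
            matched.append(rule)
    if matched:
        return matched
    return [rule for rule in rules if rule.key == extension.lower()]
```

Returning a list rather than a single rule leaves room for the report and viewer to show duplicates and let the user delete the ones they do not want.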

FITS output for format identification

  • in file utility output: <format> <mimetype>
  • in ffident output: <mimetype>
  • in DROID output: <PUID> <MimeType>
  • in JHOVE output: <format> <mimeType>
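
One way to read the fields listed above is to map each tool to the elements it reports and pull those out of that tool's XML output. The sketch below assumes a simplified fragment in which each field appears as an element with exactly that tag name; real FITS output nests and namespaces tool output differently, so `IDENTIFICATION_FIELDS` and `extract_identification` are illustrative names only.

```python
import xml.etree.ElementTree as ET

# Fields each tool reports, as listed above. How these elements are
# nested in real FITS output is not modelled here.
IDENTIFICATION_FIELDS = {
    "file utility": ("format", "mimetype"),
    "ffident": ("mimetype",),
    "DROID": ("PUID", "MimeType"),
    "JHOVE": ("format", "mimeType"),
}


def extract_identification(tool, xml_text):
    """Return {field: text} for the fields the given tool reports.

    Assumes (hypothetically) that each field is a descendant element of
    the fragment's root with that exact, un-namespaced tag name.
    Fields that are missing or empty are left out of the result.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for field in IDENTIFICATION_FIELDS[tool]:
        element = root.find(f".//{field}")
        if element is not None and element.text:
            result[field] = element.text.strip()
    return result
```

Keeping the per-tool field names in one table makes the difference in capitalization between tools (`mimetype`, `MimeType`, `mimeType`) explicit instead of scattering it through the parsing code.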

Use Cases

Workflows

  • Insert File identification micro-service after Clean up names micro-service and before Normalize micro-service
  • User selects from dropdown menu which FITS tool to trust for format identification (see tools listed above)
  • Archivematica bases Normalization path / application of format policy on the tool selected
  • Post 1.0 work: add tools outside of FITS that will run when selected by the user at the File identification micro-service.

Whiteboard

FileID.jpg


Mockups

Note that the Normalize micro-service is included here for illustrative purposes only. In practice, the user would not see the Normalize micro-service until a format identification mechanism had been selected from the dropdown menu.

FileIdentificationMS.png

Format policy models and table

The Transcoder page covers the format policy tables in more detail. The diagram below is taken from that page.

Transcoder database schema.png

I think for the Format Policy Registry (FPR) almost every table will need its INT primary key replaced with a UUID primary key, along with the corresponding foreign keys. This is to aid the creation and submission of new rules.
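
A minimal sketch of what UUID primary and foreign keys could look like, using SQLite purely for illustration. The table and column names (`commands`, `rules`, `command_pk`, `purpose`) are hypothetical and are not taken from the Transcoder schema.

```python
import sqlite3
import uuid


def create_schema(conn):
    """Create two hypothetical tables keyed by UUID strings instead of INTs."""
    conn.executescript("""
        CREATE TABLE commands (
            pk TEXT PRIMARY KEY,
            description TEXT NOT NULL
        );
        CREATE TABLE rules (
            pk TEXT PRIMARY KEY,
            command_pk TEXT NOT NULL REFERENCES commands(pk),
            purpose TEXT NOT NULL
        );
    """)


def add_rule(conn, description, purpose):
    """Insert a command and a rule pointing at it; return the rule's UUID."""
    command_pk = str(uuid.uuid4())
    rule_pk = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO commands (pk, description) VALUES (?, ?)",
        (command_pk, description),
    )
    conn.execute(
        "INSERT INTO rules (pk, command_pk, purpose) VALUES (?, ?, ?)",
        (rule_pk, command_pk, purpose),
    )
    return rule_pk
```

Because each key is generated without consulting a central counter, a rule created on one installation can be submitted elsewhere without renumbering its primary or foreign keys.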

We're also going to need to know what format the rules output to.

Speaking to future iterations: we'll need to be able to identify what formats the original file and the preservation file are in, and determine the format risk of the data with both of those considered. We may choose to normalize to a new preservation format from either the original file or the preservation file. The reason for this may be that both formats risk obsolescence, but tools are only available to handle the preservation format.

To track changes, updates, and new items, I would suggest the FPR keep track of the last modified date of all rows. Clients keep track of their last updated date. These dates come from the FPR. Additionally, add a "marked for removal" boolean to each table in the FPR, to tell clients to remove that row from their database.
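
The update scheme suggested above might look something like this on the client side. This is a sketch only: the `Row` type, the `sync` function, and the use of fixed-width UTC ISO 8601 strings for dates are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical FPR row. Dates are fixed-width UTC ISO 8601 strings,
    which sort chronologically when compared as plain strings."""
    pk: str
    last_modified: str
    marked_for_removal: bool = False


def sync(client_rows, client_last_updated, fpr_rows):
    """Apply FPR rows modified after the client's last update.

    client_rows is a dict of pk -> Row and is updated in place.
    Returns the new last-updated date, taken from the FPR rows
    rather than from the client's own clock.
    """
    newest = client_last_updated
    for row in fpr_rows:
        if row.last_modified <= client_last_updated:
            continue  # the client has already seen this version
        if row.marked_for_removal:
            client_rows.pop(row.pk, None)
        else:
            client_rows[row.pk] = row
        newest = max(newest, row.last_modified)
    return newest
```

Taking the new last-updated date from the FPR rows themselves avoids depending on the client and registry clocks agreeing.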