Difference between revisions of "Format policy registry requirements"

From Archivematica
 
(47 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
[[Documentation]] > [[Requirements]] > Format policy registry requirements
 
  
 +
<div style="padding: 10px 10px; border: 1px solid black; background-color: #F79086;">This page is no longer being maintained and may contain inaccurate information. Please see the [https://www.archivematica.org/docs/latest/ Archivematica documentation] for up-to-date information. </div> <p>
 +
 +
The Format Policy Registry (FPR) contains user-configurable scripts for file identification, characterization, extraction, normalization and other preservation actions that differ depending on file format.  It also contains a list of formats recognized by Archivematica, and the relationship between those formats and the scripts.
 +
 +
== Overview ==
 +
 +
* The Archivematica project team is working on a better way to manage format policies for preservation events such as normalization, transcription, extraction, characterization and format identification.
 +
* A format policy consists of the business rules and tool commands for preservation events based on format.
 +
* The Format Policy Registry lists all of Archivematica's default format policy rules.
 +
* Currently, users can download updates from the FPR server and replace default rules with their own local policies.
 +
* Future funding is sought for further FPR enhancements to include statistical information about the default and custom format policies adopted and implemented by Archivematica users and the ability to upload local policies to the FPR server.
 +
* One of the primary goals of the FPR is to eventually aggregate empirical information about institutional format policies to better identify community best practices. The FPR could provide a practical, community-based approach to OAIS preservation and access planning, allowing the Archivematica community of users to monitor and evaluate format policies as they are adopted, adapted and supplemented by real-world practitioners. The FPR APIs would be designed to share this information with the Archivematica user base as well as with other interested communities and projects.
 +
* The FPR server is hosted at fpr.archivematica.org. Sponsorship is actively being sought to develop a front-facing website at this server location. Currently, the local copy of the FPR server in the Archivematica dashboard's preservation planning tab is the only interface for the user. Artefactual manages the server from the back end. With further sponsorship, the server site would be a place to compare institutional format policies and their success rates over time.
  
 
== Description ==
 
  
* The Archivematica project team has recognized the need for a better way to manage preservation plans, i.e. business rules and tool commands for format transcoding. Since these are either implemented or altered by the institution running an Archivematica instance, these rules are referred to as policies. Format policies will change as community standards, practices and tools evolve. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format).  
+
* The Archivematica project team created the FPR after having recognized the need for a better way to manage preservation plans, i.e. business rules and tool commands for format-based preservation events. Since these are either implemented or altered by the institution running an Archivematica instance, these rules are referred to as policies. Format policies will change as community standards, practices and tools evolve. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format).  
 +
 
 +
* Prior to the FPR, the Archivematica project managed this information on the [[Media_type_preservation_plans|archivematica.org/preservation]] wiki page. These format policies were all researched as a result of related development partnerships with samples provided by funding partners. Since the FPR was first released in late 2013, all new format policies have been developed in partnership with funding institutions and any new rules have been included in the public FPR server, hosted by Artefactual Systems, Inc.
 +
 
 +
* The Format Policy Registry (FPR) manages this information in a structured format (SQL/JSON).
 +
** With additional sponsored development, APIs with other serializations could be added (e.g. XML, RDF)
  
* Until now, the Archivematica project has managed this information on the [[Media_type_preservation_plans|archivematica.org/preservation]] wiki page.  
+
* The FPR includes updates from PRONOM, which are manually applied by Artefactual Systems. With funding, we could enhance and/or automate this PRONOM interface and add interfaces with other registries like UDFR and linked data registries. A web of interfaces is the best way to monitor and evaluate community-wide best practices.
  
* The Format Policy Registry (FPR) will manage this information in a structured format (SQL/JSON).
+
* The FPR stores structured information about:
** APIs with other serializations may be added (e.g. XML, RDF)
+
**Format identification (FIDO based on PRONOM IDs or file extension, with more tools to be added in the future)
 +
**Normalization format policies for preservation and access. These policies identify preferred preservation and access formats by media type. The default choice of access formats is based on the ubiquity of viewers for the file format. Archivematica's default preservation formats are all open standards; additionally, the choice of preservation format is based on community best practices, availability of open-source normalization tools, and an analysis of the significant characteristics for each media type.
 +
**Characterization (default is FITS for most formats and MediaInfo for some audiovisual formats)
 +
**Transcription (default for OCR using Tesseract)
 +
**Extraction - tools and commands for extracting packages and forensic disk images
  
* It will be hosted at archivematica.org/fpr/
+
* Archivematica default format policies can all be changed or enhanced by individual Archivematica implementers.  
  
* The FPR will also provide valuable online statistics about default format policy adoption as well as customizations amongst Archivematica users and will interface with other online registries (such as PRONOM and UDFR) to monitor and evaluate community-wide best practices.
+
=FPR development=
  
* The FPR stores structured information about normalization format policies for preservation and access. These policies identify preferred preservation and access formats by media type. The choice of access formats is based on the ubiquity of viewers for the file format. Archivematica's preservation formats are all open standards; additionally, the choice of preservation format is based on community best practices, availability of open-source normalization tools, and an analysis of the significant characteristics for each media type.
+
== Use Cases ==
 +
* Use a preservation event tool other than the Archivematica default
 +
* Use a default preservation event output format other than the Archivematica default
 +
* Alternate specification event result
 +
* Disable default in Archivematica  
 +
* Add new format policy
 +
* Add new tool
 +
* Add new format
 +
* Run format policy tools and commands on local digital acquisitions
 +
* Proposed: Add format policies to other open source system workflows
 +
* Proposed: Allow other open source system users to submit proposed format policies
  
* These default format policies can all be changed or enhanced by individual Archivematica implementers.  
+
==1.2==
  
* Subscription to the FPR will allow the Archivematica project to notify users when new or updated preservation and access plans become available, allowing them to make better decisions about normalization and migration strategies for specific format types within their collections. It will also allow them to trigger migration processes as new tools and knowledge become available.
+
*Updated PRONOM data for FIDO
 +
*Made one Tools section rather than having tools in each event space, except for Identification tools
 +
*Added Transcription, Extraction, Characterization and Verification event sections
 +
*Ability to update from server in local dashboard's Preservation Planning tab
  
*One of the other primary goals of the FPR is to aggregate empirical information about institutional format policies to better identify community best practices. The FPR will provide a practical, community-based approach to OAIS preservation and access planning, allowing the Archivematica community of users to monitor and evaluate format policies as they are adopted, adapted and supplemented by real-world practitioners. The FPR APIs will be designed to share this information with the Archivematica user base as well as with other interested communities and projects.
+
==1.0==
  
==Early prototype==
+
* Ability to add/edit tools, rules, commands and formats
  
*An early FPR prototype (called "Formatica") was developed by Heather Bowden, then Carolina Digital Curation Doctoral Fellow at the School of Information and Library Science in the University of North Carolina at Chapel Hill.  
+
==0.10-beta==
  
[[File:Formatica.png|border|450px|Early FPR prototype originally called Formatica]]
+
*Ability to view, add/edit local format policies
 +
** FPR local is for superusers (preservation planning tab)
 +
* Dashboard FPR captures usage statistics
 +
* Central server at fpr.archivematica.org
 +
* Ability to download most current Archivematica default format policies from fpr.archivematica.org on first installation
  
 +
==Wishlist==
  
 +
* Front-facing server website
 +
**authenticated, web-based access for Artefactual and super-user updates and maintenance
 +
* Ability to submit local changes to the central FPR server
 +
** fpr.archivematica.org will mirror local implementation, except user won't be able to apply changes, only submit them for review and website could include the institution that submitted the policy or indicate whether it is a default Archivematica policy
 +
** users would be able to apply changes locally and submit them for review to fpr
 +
** nice to have: ability to add links to documentation about reasoning behind local institutional policy selection
 +
** accepted changes would result in a new FPR server entry
 +
* Ability to download select new Archivematica default FPR policies as well as those from other institutions from the central FPR server to local Archivematica installations
 +
** ideally a single click download to allow for local implementation from the server or from a list viewable after selecting Update from the Preservation Planning tab of the local Dashboard
 +
* Ability to disable a format policy in the dashboard
 +
* Update Archivematica without losing local FPR configuration and added policies
 +
* Integrate with PRONOM to automate PUID updates
 +
* Integrate with other registries
 +
* Provide a read-only RESTful Web API for accessing policies in JSON format. This is only partially implemented in 1.2, intended only to be used internally (between dashboard and FPR server) and only provides access to the full set of policies.  Future development would ideally include the ability to access a subset of policies via the FPR REST API
  
= Requirements =
+
=Legacy FPR foundation docs=
  
 
[[File:FPR overview Oct 2012.png|border|900px|FPR overview Oct 2012]]
 
  
* provide an authenticated Web based interface for creation and maintenance of policies
+
 
* provide a read-only RESTful Web API for accessing policies in JSON format
 
* provide an API for monitoring new and updated policies
 
* integrate with PRONOM to retrieve PUIDs
 
* model format policies so that they can be stored in a SQL (MySQL?, PostgreSQL?, SQLite?) database on both client & server
 
* develop iteratively with an emphasis on getting working code in front of users as quickly as possible to make them part of the design process (see #fileidhack)
 
 
* developer [[Format_policy_registry|notes]]
 
  
== Use Cases ==
+
== Mockups ==
 +
 
 +
'''Preservation Planning tab in Dashboard - Local FPR'''
 +
 
 +
[[File:1.0_PreservationPlanningFPR.png|Preservation Planning tab in dashboard]]
 +
 
 +
------------------------------------------------------------------------------------------------------
 +
 
 +
* Archivematica format ID column is populated by the "description" from the File ID table in the Archivematica database; it links to the file ID table in the Archivematica database, which is composed of the tool and tool version that identified the format.
 +
* Command description is the "description" from the Command table in the Archivematica database
 +
* Purpose is the same as "classification" in the Archivematica database, but is clearer to the user
 +
* Add new - links to a new (blank) form for the Archivematica format ID or to a new (blank) form for the Command
 +
* Copy - links to a new form based on this one (populated with current data) for either the Archivematica format ID or the Command
 +
* Edit - links to the edit page for the Archivematica format ID or to the edit page for the Command
 +
* Make default - makes the selected Archivematica format ID format policy or the selected Command format policy default for normalization. See Feature #4503
 +
* Show performance and show command are modeled on the old preservation planning tab in the dashboard, available in the dev version of Archivematica at /preservationplanning/old/
 +
'''Add new Archivematica format ID'''
 +
 
 +
[[File:1.0_FormatPolicyFormatID.png]]
 +
--------------------------------------------------------------------------------------------------------------
 +
'''Add new Command'''
 +
 
 +
[[File:1.0_FormatPolicyCommand.png]]
 +
 
 +
* Added https://projects.artefactual.com/issues/4568 (#4568) to add a Output file format field to this form from the outputFileFormat field in the MCP Commands table
 +
-------------------------------------------------------------------------------------------------------------------
 +
'''Copy Archivematica format ID'''
 +
 
 +
[[File:1.0_FormatPolicyFormatIDCOPY.png]]
 +
------------------------------------------------------------------------------------------------------------------
 +
'''Copy Command'''
 +
 
 +
[[File:1.0_FormatPolicyCommandCOPY.png]]
  
 
== Data Model ==
 
 +
Design from summer 2013 of the data structures in version 2 of the API.
 +
 +
[[File:Fpr design-2013-07-25.png|900px]]
 +
 +
Updated design as of March 2017
 +
 +
[[File:Fpr-design-2017-03.png|900px]]
  
 
== Workflow ==  
 
Line 53: Line 141:
 
== API ==
 
  
 +
== Proposed Changes ==
 +
 +
=== Replace VersionedModel with set of associated rules ===
 +
 +
January 2014
 +
 +
'''Summary:''' Replace the VersionedModel structure of a linear history of revisions with a collection/set of rules all associated with some 'meta-rule'.
 +
 +
'''Bug:'''#6192
 +
 +
'''Problem:'''
 +
Currently, we track a linear history of rule modifications.  A newly modified rule is always added to the end of the chain, even if it modified an earlier rule.  All rules in the chain must be kept. Example: the history is A <- B (enabled).  A is modified, resulting in C, with a history of A <- B <- C (enabled).  B cannot be disabled, even if it was never used (e.g. it had a typo).
 +
 +
We need to track variations on a rule that is fundamentally the same, but not necessarily as a linear version history. With a linear history, the true relationship between a rule and the rule it originated from is unclear, and a version history of any sort for the rules is probably unnecessary: all we care about is the currently active rule (and any newly downloaded rules that are candidates for replacing it).  The linear history also complicates things when modifications are coming from several sources (e.g. Artefactual, various institutions, local modifications).
 +
 +
'''Proposed fix:'''
 +
Each rule stores an Agent (the institution/person that created it) and a 'meta-rule' ID, and no longer stores what rule it replaced.  All rules that are variations on each other have the same 'meta-rule' ID and are 'rule versions'.  Potentially the meta-rule ID is a foreign key to an actual meta-rule table, but that may not be necessary.
 +
 +
In a set of rule versions, only one may be enabled at a time.  Rule versions can come from several sources - originally installed and FPR updates from Artefactual, downloads from a particular institution that is reputable, a local FPR server - and are conceptually linked by having a common meta-rule ID.  Agent IDs will be used to distinguish the sources, and created time used to track the most recent variation.
 +
 +
'''Pros:'''
 +
* Easier to download updates to FPR rules, especially from more than one source
 +
* Can provide the ability to delete truly unwanted rules
 +
* ???
 +
 +
'''Cons:'''
 +
* What information to display to the user? What to display if all the rule versions for a meta-rule are disabled?
 +
* Probably backwards incompatible, or non-trivial to make backwards compatible
 +
* ???
 +
 +
==Early prototype==
 +
 +
*An early FPR prototype (called "Formatica") was developed by Heather Bowden, then Carolina Digital Curation Doctoral Fellow at the School of Information and Library Science in the University of North Carolina at Chapel Hill.
 +
 +
[[File:Formatica.png|border|450px|Early FPR prototype originally called Formatica]]
  
  
  
  
[[Category:Development documentation]]
+
[[Category:Development documentation]]
 +
[[Category:Feature requirements]]

Latest revision as of 15:51, 11 February 2020


This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

The Format Policy Registry (FPR) contains user-configurable scripts for file identification, characterization, extraction, normalization and other preservation actions that differ depending on file format. It also contains a list of formats recognized by Archivematica, and the relationship between those formats and the scripts.

Overview

  • The Archivematica project team is working on a better way to manage format policies for preservation events such as normalization, transcription, extraction, characterization and format identification.
  • A format policy consists of the business rules and tool commands for preservation events based on format.
  • The Format Policy Registry lists all of Archivematica's default format policy rules.
  • Currently, users can download updates from the FPR server and replace default rules with their own local policies.
  • Future funding is sought for further FPR enhancements to include statistical information about the default and custom format policies adopted and implemented by Archivematica users and the ability to upload local policies to the FPR server.
  • One of the primary goals of the FPR is to eventually aggregate empirical information about institutional format policies to better identify community best practices. The FPR could provide a practical, community-based approach to OAIS preservation and access planning, allowing the Archivematica community of users to monitor and evaluate format policies as they are adopted, adapted and supplemented by real-world practitioners. The FPR APIs would be designed to share this information with the Archivematica user base as well as with other interested communities and projects.
  • The FPR server is hosted at fpr.archivematica.org. Sponsorship is actively being sought to develop a front-facing website at this server location. Currently, the local copy of the FPR server in the Archivematica dashboard's preservation planning tab is the only interface for the user. Artefactual manages the server from the back end. With further sponsorship, the server site would be a place to compare institutional format policies and their success rates over time.

Description

  • The Archivematica project team created the FPR after having recognized the need for a better way to manage preservation plans, i.e. business rules and tool commands for format-based preservation events. Since these are either implemented or altered by the institution running an Archivematica instance, these rules are referred to as policies. Format policies will change as community standards, practices and tools evolve. A format policy indicates the actions, tools and settings to apply to a file of a particular file format (e.g. conversion to preservation format, conversion to access format).
  • Prior to the FPR, the Archivematica project managed this information on the archivematica.org/preservation wiki page. These format policies were all researched as a result of related development partnerships with samples provided by funding partners. Since the FPR was first released in late 2013, all new format policies have been developed in partnership with funding institutions and any new rules have been included in the public FPR server, hosted by Artefactual Systems, Inc.
  • The Format Policy Registry (FPR) manages this information in a structured format (SQL/JSON).
    • With additional sponsored development, APIs with other serializations could be added (e.g. XML, RDF)
  • The FPR includes updates from PRONOM, which are manually applied by Artefactual Systems. With funding, we could enhance and/or automate this PRONOM interface and add interfaces with other registries like UDFR and linked data registries. A web of interfaces is the best way to monitor and evaluate community-wide best practices.
  • The FPR stores structured information about:
    • Format identification (FIDO based on PRONOM IDs or file extension, with more tools to be added in the future)
    • Normalization format policies for preservation and access. These policies identify preferred preservation and access formats by media type. The default choice of access formats is based on the ubiquity of viewers for the file format. Archivematica's default preservation formats are all open standards; additionally, the choice of preservation format is based on community best practices, availability of open-source normalization tools, and an analysis of the significant characteristics for each media type.
    • Characterization (default is FITS for most formats and MediaInfo for some audiovisual formats)
    • Transcription (default for OCR using Tesseract)
    • Extraction - tools and commands for extracting packages and forensic disk images
  • Archivematica default format policies can all be changed or enhanced by individual Archivematica implementers.
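
As a minimal sketch of what such structured information could look like, the Python below models a format policy as a rule that links a format and a purpose to a tool command. The class name `FormatPolicy`, its fields, the `example/...` format keys and the `example-convert` commands are all illustrative assumptions and do not reflect the FPR's actual schema or Archivematica's default commands.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FormatPolicy:
    """One rule: for files of `format_id`, run `command` for `purpose`."""
    format_id: str  # hypothetical format key (a PRONOM-style ID could go here)
    purpose: str    # e.g. "preservation", "access", "characterization"
    tool: str       # hypothetical tool name
    command: str    # hypothetical command template


# Hypothetical sample rules; none of these are real Archivematica defaults.
POLICIES: List[FormatPolicy] = [
    FormatPolicy("example/jpeg", "preservation", "example-convert",
                 "example-convert {input} {output}.tif"),
    FormatPolicy("example/jpeg", "access", "example-convert",
                 "example-convert {input} {output}.jpg"),
    FormatPolicy("example/wav", "access", "example-transcode",
                 "example-transcode {input} {output}.mp3"),
]


def find_policy(policies: List[FormatPolicy], format_id: str,
                purpose: str) -> Optional[FormatPolicy]:
    """Return the first rule matching the format and purpose, or None."""
    for policy in policies:
        if policy.format_id == format_id and policy.purpose == purpose:
            return policy
    return None
```

Keying each rule on both format and purpose mirrors the description above, where one format can have separate preservation and access policies.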

FPR development

Use Cases

  • Use a preservation event tool other than the Archivematica default
  • Use a default preservation event output format other than the Archivematica default
  • Alternate specification event result
  • Disable default in Archivematica
  • Add new format policy
  • Add new tool
  • Add new format
  • Run format policy tools and commands on local digital acquisitions
  • Proposed: Add format policies to other open source system workflows
  • Proposed: Allow other open source system users to submit proposed format policies
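
Several of these use cases (replacing a default tool, disabling a default, adding a new policy) come down to resolving a local configuration against the defaults. The sketch below shows one possible resolution order, assuming local rules take precedence and a disabled default yields no command. The function name, the `(format, purpose)` key shape and the sample commands are hypothetical and are not taken from Archivematica.

```python
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, str]  # (format_id, purpose); hypothetical key shape

# Hypothetical defaults; not real Archivematica rules.
DEFAULT_RULES: Dict[Key, str] = {
    ("example/jpeg", "access"): "default-tool --to jpg",
    ("example/wav", "access"): "default-tool --to mp3",
}


def effective_command(key: Key,
                      defaults: Dict[Key, str],
                      local: Dict[Key, str],
                      disabled: Set[Key]) -> Optional[str]:
    """Resolve the command to run for a (format, purpose) pair.

    Assumed order: a local rule wins; otherwise a disabled default
    yields None; otherwise the default (if any) applies.
    """
    if key in local:
        return local[key]
    if key in disabled:
        return None
    return defaults.get(key)


# Local configuration: one alternate tool, one new policy, one disabled default.
LOCAL_RULES: Dict[Key, str] = {
    ("example/jpeg", "access"): "local-tool --to png",
    ("example/newformat", "preservation"): "local-tool --to tif",
}
DISABLED: Set[Key] = {("example/wav", "access")}
```

Keeping local rules and the disabled set apart from the defaults is one way to let defaults be updated without touching local choices.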

1.2

  • Updated PRONOM data for FIDO
  • Made one Tools section rather than having tools in each event space, except for Identification tools
  • Added Transcription, Extraction, Characterization and Verification event sections
  • Ability to update from server in local dashboard's Preservation Planning tab

1.0

  • Ability to add/edit tools, rules, commands and formats

0.10-beta

  • Ability to view, add/edit local format policies
    • FPR local is for superusers (preservation planning tab)
  • Dashboard FPR captures usage statistics
  • Central server at fpr.archivematica.org
  • Ability to download most current Archivematica default format policies from fpr.archivematica.org on first installation

Wishlist

  • Front-facing server website
    • authenticated, web-based access for Artefactual and super-user updates and maintenance
  • Ability to submit local changes to the central FPR server
    • fpr.archivematica.org will mirror local implementation, except user won't be able to apply changes, only submit them for review and website could include the institution that submitted the policy or indicate whether it is a default Archivematica policy
    • users would be able to apply changes locally and submit them for review to fpr
    • nice to have: ability to add links to documentation about reasoning behind local institutional policy selection
    • accepted changes would result in a new FPR server entry
  • Ability to download select new Archivematica default FPR policies as well as those from other institutions from the central FPR server to local Archivematica installations
    • ideally a single click download to allow for local implementation from the server or from a list viewable after selecting Update from the Preservation Planning tab of the local Dashboard
  • Ability to disable a format policy in the dashboard
  • Update Archivematica without losing local FPR configuration and added policies
  • Integrate with PRONOM to automate PUID updates
  • Integrate with other registries
  • Provide a read-only RESTful Web API for accessing policies in JSON format. This is only partially implemented in 1.2, intended only to be used internally (between dashboard and FPR server) and only provides access to the full set of policies. Future development would ideally include the ability to access a subset of policies via the FPR REST API
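
Because the API described in the last item returns only the full set of policies, a client that wants a subset would have to filter after download. The sketch below illustrates that idea on an inline JSON string. The payload shape and the field names (`format`, `purpose`, `command`) are assumptions made for illustration; the actual FPR response format is not documented on this page, and no real endpoint is called.

```python
import json
from typing import Any, Dict, List

# Hypothetical payload standing in for a "full set of policies" response.
FULL_SET_JSON = """
[
  {"format": "example/jpeg", "purpose": "access", "command": "example-convert"},
  {"format": "example/jpeg", "purpose": "preservation", "command": "example-convert"},
  {"format": "example/wav", "purpose": "access", "command": "example-transcode"}
]
"""


def select_policies(payload: str, **criteria: Any) -> List[Dict[str, Any]]:
    """Parse the full policy set and keep entries matching every criterion."""
    policies = json.loads(payload)
    return [
        policy for policy in policies
        if all(policy.get(field) == value for field, value in criteria.items())
    ]


access_only = select_policies(FULL_SET_JSON, purpose="access")
jpeg_access = select_policies(FULL_SET_JSON, format="example/jpeg", purpose="access")
```

A server-side subset query, as the wishlist suggests, would avoid transferring the full set only to discard most of it.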

Legacy FPR foundation docs

FPR overview Oct 2012


Mockups

Preservation Planning tab in Dashboard - Local FPR

Preservation Planning tab in dashboard


  • Archivematica format ID column is populated by the "description" from the File ID table in the Archivematica database; it links to the file ID table in the Archivematica database, which is composed of the tool and tool version that identified the format.
  • Command description is the "description" from the Command table in the Archivematica database
  • Purpose is the same as "classification" in the Archivematica database, but is clearer to the user
  • Add new - links to a new (blank) form for the Archivematica format ID or to a new (blank) form for the Command
  • Copy - links to a new form based on this one (populated with current data) for either the Archivematica format ID or the Command
  • Edit - links to the edit page for the Archivematica format ID or to the edit page for the Command
  • Make default - makes the selected Archivematica format ID format policy or the selected Command format policy default for normalization. See Feature #4503
  • Show performance and show command are modeled on the old preservation planning tab in the dashboard, available in the dev version of Archivematica at /preservationplanning/old/

Add new Archivematica format ID

1.0 FormatPolicyFormatID.png


Add new Command

1.0 FormatPolicyCommand.png


Copy Archivematica format ID

1.0 FormatPolicyFormatIDCOPY.png


Copy Command

1.0 FormatPolicyCommandCOPY.png

Data Model

Design from summer 2013 of the data structures in version 2 of the API.

Fpr design-2013-07-25.png

Updated design as of March 2017

Fpr-design-2017-03.png

Workflow

GUI

API

Proposed Changes

Replace VersionedModel with set of associated rules

January 2014

Summary: Replace the VersionedModel structure of a linear history of revisions with a collection/set of rules all associated with some 'meta-rule'.

Bug:#6192

Problem: Currently, we track a linear history of rule modifications. A newly modified rule is always added to the end of the chain, even if it modified an earlier rule. All rules in the chain must be kept. Example: the history is A <- B (enabled). A is modified, resulting in C, with a history of A <- B <- C (enabled). B cannot be disabled, even if it was never used (e.g. it had a typo).

We need to track variations on a rule that is fundamentally the same, but not necessarily as a linear version history. With a linear history, the true relationship between a rule and the rule it originated from is unclear, and a version history of any sort for the rules is probably unnecessary: all we care about is the currently active rule (and any newly downloaded rules that are candidates for replacing it). The linear history also complicates things when modifications are coming from several sources (e.g. Artefactual, various institutions, local modifications).
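
The linear behaviour described in this problem statement can be illustrated with a small sketch. It is not the actual VersionedModel code: the `Rule` class, its `replaces` field and the `modify` helper are hypothetical names chosen to reproduce the A <- B <- C example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    name: str
    replaces: Optional["Rule"] = None  # previous link in the linear chain
    enabled: bool = False


def modify(chain_tail: Rule, new_name: str) -> Rule:
    """Record a modification as a new rule at the END of the chain.

    The new rule always replaces the current tail, regardless of which
    earlier rule was actually edited.
    """
    chain_tail.enabled = False
    return Rule(new_name, replaces=chain_tail, enabled=True)


def history(tail: Rule) -> List[str]:
    """Walk the chain from the tail back to the first rule, oldest first."""
    names: List[str] = []
    node: Optional[Rule] = tail
    while node is not None:
        names.append(node.name)
        node = node.replaces
    return list(reversed(names))


a = Rule("A")
b = modify(a, "B")  # history: A <- B (enabled)
c = modify(b, "C")  # C came from editing A, but is chained after B
```

In this sketch the chain records that C replaces B, so the fact that C was derived from A is lost, and B cannot be dropped without breaking the walk from C back to A.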

Proposed fix: Each rule stores an Agent (the institution/person that created it) and a 'meta-rule' ID, and no longer stores what rule it replaced. All rules that are variations on each other have the same 'meta-rule' ID and are 'rule versions'. Potentially the meta-rule ID is a foreign key to an actual meta-rule table, but that may not be necessary.

In a set of rule versions, only one may be enabled at a time. Rule versions can come from several sources - originally installed and FPR updates from Artefactual, downloads from a particular institution that is reputable, a local FPR server - and are conceptually linked by having a common meta-rule ID. Agent IDs will be used to distinguish the sources, and created time used to track the most recent variation.
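
A minimal sketch of the proposed structure follows, assuming each rule version carries a meta-rule ID, an agent and a creation time as described above. The `RuleVersion` class, the `enable` and `most_recent` helpers and the sample agents are hypothetical; the proposal itself does not specify an implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class RuleVersion:
    meta_rule_id: int  # shared by all variations of the same rule
    agent: str         # institution or person that created this version
    created: datetime
    command: str
    enabled: bool = False


def enable(versions: List[RuleVersion], chosen: RuleVersion) -> None:
    """Enable `chosen` and disable every other version of its meta-rule."""
    for version in versions:
        if version.meta_rule_id == chosen.meta_rule_id:
            version.enabled = version is chosen


def most_recent(versions: List[RuleVersion],
                meta_rule_id: int) -> Optional[RuleVersion]:
    """Return the newest version of a meta-rule by creation time, or None."""
    candidates = [v for v in versions if v.meta_rule_id == meta_rule_id]
    return max(candidates, key=lambda v: v.created, default=None)


# Hypothetical versions of meta-rule 1 from three sources, plus one unrelated rule.
versions = [
    RuleVersion(1, "Artefactual", datetime(2013, 12, 1), "tool-a", enabled=True),
    RuleVersion(1, "Example Institution", datetime(2014, 1, 10), "tool-b"),
    RuleVersion(1, "local", datetime(2014, 1, 20), "tool-c"),
    RuleVersion(2, "Artefactual", datetime(2013, 12, 1), "tool-d", enabled=True),
]
enable(versions, versions[2])  # switch meta-rule 1 to the local version
```

Because versions are an unordered set rather than a chain, an unwanted version could be deleted without affecting the others, which is one of the advantages listed below.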

Pros:

  • Easier to download updates to FPR rules, especially from more than one source
  • Can provide the ability to delete truly unwanted rules
  • ???

Cons:

  • What information to display to the user? What to display if all the rule versions for a meta-rule are disabled?
  • Probably backwards incompatible, or non-trivial to make backwards compatible
  • ???

Early prototype

  • An early FPR prototype (called "Formatica") was developed by Heather Bowden, then Carolina Digital Curation Doctoral Fellow at the School of Information and Library Science in the University of North Carolina at Chapel Hill.

Early FPR prototype originally called Formatica