Difference between revisions of "Format policies"

From Archivematica
 
(114 intermediate revisions by 6 users not shown)
Line 1: Line 1:
[[Main Page]] > [[Documentation]] > Media type preservation plans
+
[[Main Page]] > [[Documentation]] > Format policies
  
 +
<div style="padding: 10px 10px; border: 1px solid black; background-color: #F79086;">This page is no longer being maintained and may contain inaccurate information. Please see the [https://www.archivematica.org/docs/latest/user-manual/preservation/preservation-planning/ Archivematica Preservation Planning documentation] for information about format policies. </div> <p>
 +
 +
==Format Policy Registry (FPR)==
 +
Archivematica manages format policies locally and externally via a Format Policy Registry (FPR). The registry is on a server that Artefactual hosts which includes our default policies for normalization, extraction and format identification. The local FPR offered in the user dashboard preservation planning tab is customizable for the local user. To learn about the FPR, please see [[Administrator_manual_1.0#Format_Policy_Registry_.28FPR.29|Format Policy Registry]]. To read about some of the comprehensive goals of the FPR, see [[Format_policy_registry_requirements|FPR Requirements]].
  
 
==Migration and emulation==
 
==Migration and emulation==
Line 6: Line 10:
  
 
==Normalization==
 
==Normalization==
Archivematica's primary preservation strategy is to normalize files to preservation and access formats upon ingest. The choice of access formats is based on the ubiquity of viewers for the file format. Archivematica's preservation formats are all [http://en.wikipedia.org/wiki/Open_standard open standards]. Additionally, the choice of preservation format is  based on community best practices, availability of open-source normalization tools, and an analysis of the [[significant properties]] for each media type.  
+
Archivematica's primary preservation strategy is to normalize files to preservation and access formats upon ingest. Archivematica's preservation formats are all [http://en.wikipedia.org/wiki/Open_standard open standards]. Additionally, the choice of preservation format is  based on community best practices, availability of open-source normalization tools, and an analysis of the [[significant characteristics]] for each media type. The choice of access formats is based on the ubiquity of viewers for the file format.
  
==Media type preservation plans==
+
Follow the link for each file format for further information about the open-source normalization tools and settings that have been tested and integrated into Archivematica to make the format conversions.
  
{| border="1" cellpadding="10" cellspacing="0" width=90%
+
==Format policies==
 +
* Format Policies indicate what tool to run when normalizing for a given purpose (access, preservation) when a specific File Identification Tool identifies a specific File Format. They can be thought of as analogous to Virus Definitions, which need to be updated periodically in an Archivematica installation in order to ensure the efficacy of the virus scanning micro-service.  Similarly, software security updates are downloaded at the operating system level, to keep the host machine secure.
 +
 
 +
{| border="1" cellpadding="10" cellspacing="0"  
 
|-
 
|-
 
|- style="background-color:#cccccc;"
 
|- style="background-color:#cccccc;"
 
!style="width:20%"|'''Media type'''
 
!style="width:20%"|'''Media type'''
 
!style="width:30%"|'''File formats'''
 
!style="width:30%"|'''File formats'''
!style="width:20%"|'''Preservation format(s)'''
+
!style="width:15%"|'''Preservation format(s)'''
!style="width:10%"|'''Access format(s)'''
+
!style="width:15%"|'''Access format(s)'''
!style="width:10%"|'''Normalization tool'''
+
!style="width:15%"|'''Normalization tool'''
!style="width:10%"|'''Comments'''
 
 
|-
 
|-
 
|[[Audio]]
 
|[[Audio]]
|
+
|[[AC-3 Compressed Audio (Dolby Digital)|AC3]], [[Audio Interchange File Format|AIFF]], [[MPEG-1_Audio,_Layer_3|MP3]], [[Waveform Audio|WAV]], [[Windows Media Audio|WMA]]
|LPCM/WAVE
+
|WAVE (LPCM)
 
|MP3
 
|MP3
 
|FFmpeg
 
|FFmpeg
|
+
|-
 +
|[[Email]]
 +
|[[PST]]
 +
|MBOX
 +
|MBOX
 +
|readpst
 +
|-
 +
|[[Email]]
 +
|[[Maildir]]**
 +
|Original format
 +
|MBOX
 +
|md2mb.py
 +
|-
 +
|[[Office Open XML]]
 +
|DOCX, PPTX, XLSX
 +
|Original format
 +
|Original format
 +
|Tool search in progress
 +
|-
 +
|[[Plain text]]
 +
|[[Plain text file|TXT]]
 +
|Original format
 +
|Original format
 +
|None
 +
|-
 +
|[[Portable Document Format]]
 +
|[[Portable Document Format|PDF]]
 +
|PDF/A
 +
|Original format
 +
|Ghostscript
 
|-
 
|-
 
|[[Presentation files]]
 
|[[Presentation files]]
|
+
|[[Microsoft Powerpoint Presentation|PPT]]
|Open Document Format; PDF/A
+
|Original format
 
|PDF
 
|PDF
|Xena or OpenOffice Impress
+
|Tool search in progress
|
 
 
|-
 
|-
 
|[[Raster images]]
 
|[[Raster images]]
|
+
|[[Microsoft Windows Bitmap Image file|BMP]], [[Graphics Interchange Format|GIF]], [[Joint Photographic Experts Group|JPG]], [[JPEG2000|JP2]]*, [[Macintosh PICT Image|PCT]], [[Portable Network Graphics|PNG]]*, [[Adobe Photoshop|PSD]], [[Tagged Image File Format|TIFF]], [[Truevision TARGA file|TGA]]
|TIFF, JPEG2000 or PNG
+
|Uncompressed TIFF
|PNG
+
|JPEG
 
|ImageMagick
 
|ImageMagick
|
 
 
|-
 
|-
|[[Raw camera files]]
+
|[[Raw camera files]]/[[Digital Negative format]]**
|
+
|3FR, ARW, CR2, CRW, DCR, DNG, ERF, KDC, MRW, NEF, ORF, PEF, RAF, RAW, X3F
|DNG
+
|Original format
|TIFF or PNG
+
|JPEG
|DigiKam DNG Converter
+
|ImageMagick/UFRaw
|
 
 
|-
 
|-
 
|[[Spreadsheets]]
 
|[[Spreadsheets]]
|
+
|[[Microsoft Excel Workbook|XLS]]
|Open Document Format
+
|Original format
|
+
|Original format
|OpenOffice Calc
+
|None
|
 
 
|-
 
|-
 
|[[Vector images]]
 
|[[Vector images]]
|
+
|[[Adobe Illustrator drawing|AI]], [[Encapsulated PostScript|EPS]], [[Scalable Vector Graphics|SVG]]
 
|SVG
 
|SVG
|
+
|PDF
|
+
|Inkscape
|
 
 
|-
 
|-
 
|[[Video]]
 
|[[Video]]
|
+
|[[Audio/Video_Interleaved_Format|AVI]], [[Macromedia_FLV|FLV]], [[Quicktime (video)|MOV]], [[MPEG-1 and MPEG-2|MPEG-1]], [[MPEG-1 and MPEG-2|MPEG-2]], [[MPEG-4|MPEG-4]], [[Shockwave Flash file|SWF]], [[Windows Media Player file|WMV]]
|Motion JPEG2000/MXF or MPEG-2/MXF
+
|FFV1/LPCM in MKV
|OGG,FLV
+
|MP4
 
|FFmpeg
 
|FFmpeg
|Motion JPEG2000 is the emerging preferred standard for video files but it is hard to find a tool for Linux that converts to that codec. MPEG-2 is an accepted standard, however, which is in use by a number of institutions.
 
 
|-
 
|-
 
|[[Word processing files]]
 
|[[Word processing files]]
|
+
|[[Microsoft Word for Windows|DOC]], [[Corel WordPerfect|WPD]], [[Rich Text Format|RTF]]
|Open Document Format; PDF/A
+
|Original format
|PDF or PDF/A
+
|Original format
|OpenOffice Writer
+
|Tool search in progress***
|PDF/A normalization of MS Word files is somewhat problematic because best results are achieved from within the native application - i.e. MS Office running in MS Windows. Archivematica does not support either Windows or MS Office since these are proprietary software packages.
 
 
|-
 
|-
|
+
|}
|
+
 
|
+
*(*) PNG and JPEG2000 are not normalized to a preservation format
|
+
*(**) in development
|
+
*(***) See Word processing formats, below
|-
+
 
|
+
==Word processing formats==
|
+
 
|
+
In early versions of Archivematica, word processing formats (Microsoft Word, WordPerfect, etc.) were normalized to PDF or open office formats using LibreOffice. However, testing showed that the results were too inconsistent, with significant losses of formatting information, to continue using this normalization path. Currently, the FPR does not have any normalization paths for word processing formats.
|
+
 
|
+
We have recently begun investigating LibreOffice again for this purpose. We have identified these issues:
|-
+
 
|
+
* LibreOffice sometimes hangs, causing any future LibreOffice jobs to fail until an administrator manually kills the service.
|
+
* LibreOffice sometimes reports that it succeeded despite not creating a PDF, making it difficult to determine whether or not the job really succeeded.
|
 
|
 
|
 
|-
 
|-}</br>
 
  
 +
Alternatives we have been investigating include:
  
 +
* [https://github.com/dagwieers/unoconv unoconv], a script which wraps LibreOffice's headless mode.
  
 +
* [http://abiword.com/ AbiWord] is another word processor which can convert to PDF at the command line. Its quality of conversion to PDF still needs to be investigated, and in initial tests we found that it could not open some of our sample files.
  
 +
* [http://www.documentliberation.org/projects/ Document Liberation Project]. This project has built a number of open source libraries that are able to open many proprietary document formats and convert them to either ODF (the open office format) or EPUB.
  
 +
Some of the above tools may be used in combination. For example, a Document Liberation Project library could be used to create an ODF before using AbiWord or LibreOffice to convert to a PDF.
  
{| border="1" cellpadding="10" cellspacing="0" width=90%
+
==Web archive formats==
|-
 
|- style="background-color:#cccccc;"
 
!style="width:20%"|'''Format'''
 
!style="width:20%"|'''Also known as'''
 
!style="width:20%"|'''File extension(s)'''
 
!style="width:20%"|'''Normalization tool'''
 
|-
 
|[[Adobe Extensible Metadata Platform]]
 
|Adobe XMP
 
|.xmp
 
|
 
|-
 
|[[Adobe Illustrator drawing]]
 
|Adobe AI
 
|.ai
 
|
 
|-
 
|[[Advanced Stream Redirector file]]
 
|Microsoft ASX Playlist
 
|.asx
 
|
 
|-
 
|[[Audio/Video Interleaved Format]]
 
|AVI
 
|.avi
 
|
 
|-
 
|[[Batch file]]
 
|
 
*MS-DOS batch file
 
*MS-Windows batch file
 
|.bat
 
|
 
|-
 
|[[Cascading Stylesheet]]
 
|
 
|.css
 
|Xena
 
|-
 
|[[Comma Separated Values file]]
 
|
 
|.csv
 
|Xena
 
|-
 
|[[Command Source Code]]
 
|
 
|.cmd
 
|
 
|-
 
|[[Document type definition]]
 
|
 
|.dtd
 
|
 
|-
 
|[[Encapsulated PostScript]]
 
|
 
|.eps
 
|
 
|-
 
|[[Extensible Markup Language]]
 
|
 
|.xml
 
|Xena
 
|-
 
|[[Extensible Stylesheet Language]]
 
|
 
|.xsl
 
|
 
|-
 
|[[Graphics Interchange Format]]
 
|GIF
 
|.gif
 
|Xena
 
|-
 
|[[Hypertext Markup Language ]]
 
|
 
|
 
*.html
 
*.htm
 
|Xena
 
|-
 
|[[JavaScript file]]
 
|
 
*JavaScript programming files
 
*JScript files
 
|.js
 
|Xena
 
|-
 
|[[Java archive file]]
 
|
 
|.jar
 
|Xena
 
|-
 
|[[Joint Photographic Experts Group]]
 
|JPEG
 
|.jpg
 
|Xena
 
|-
 
|[[Macromedia FLV]]
 
|Flash Video
 
|.flv
 
|
 
|-
 
|[[Microsoft Excel Workbook]]
 
|
 
*Microsoft Excel
 
*Binary Interchange File Format (BIFF)
 
|.xls
 
|Xena
 
|-
 
|[[Microsoft Icon file]]
 
|
 
|.ico
 
|
 
|-
 
|[[Microsoft Powerpoint Presentation]]
 
|
 
|.ppt
 
|Xena
 
|-
 
|[[Microsoft Word for Windows]]
 
|MS-WORD
 
|.doc
 
|Xena
 
|-
 
|[[Moving Picture Experts Group file]]
 
|MPEG
 
|.mpg
 
|
 
|-
 
|[[MPEG-1 Audio, Layer 3]]
 
|MP3
 
|.mp3
 
|Xena
 
|-
 
|[[Nikon Electronic Format]]
 
|
 
*Nikon Image Format
 
*Nikon Digital SLR Camera Raw Image File
 
|.nef
 
|
 
|-
 
|[[Perl script]]
 
|
 
|.pl
 
|Xena
 
|-
 
|[[Portable Document Format]]
 
|
 
|.pdf
 
|Xena
 
|-
 
|[[Portable Network Graphics]]
 
|
 
|.png
 
|Xena
 
|-
 
|[[Python script file]]
 
|
 
|.py
 
|Xena
 
|-
 
|[[Quicktime (video)]]
 
|
 
*MOV
 
*QT
 
|.mov
 
|
 
|-
 
|[[Real Media Metafile]]
 
|
 
|.ram
 
|
 
|-
 
|[[Rich Text Format]]
 
|
 
|.rtf
 
|Xena
 
|-
 
|[[Scalable Vector Graphics]]
 
|
 
|.svg
 
|Xena
 
|-
 
|[[Shockwave Flash file]]
 
|
 
|.swf
 
|
 
|-
 
|[[SQL code]]
 
|
 
|.sql
 
|Xena
 
|-
 
|[[Truevision TARGA file]]
 
|
 
*Truevision Advanced Raster Graphics Adapter
 
*Truevision Graphics Adapter
 
|.tga, .tpic
 
|
 
|-
 
|[[Tagged Image File Format]]
 
|
 
|.tiff, .tif
 
|Xena
 
|-
 
|[[Visual Basic Scripting file]]
 
|
 
*C Source Code
 
*Microsoft Active Scripting language
 
*VBScript
 
|.vbs
 
|Xena
 
|-
 
|[[Waveform Audio]]
 
|Wave Audio File Format
 
|.wav
 
|Xena
 
|-
 
|[[Microsoft Windows Bitmap Image file|Windows Bitmap Image file]]
 
|
 
|.bmp
 
|Xena
 
|-
 
|[[Windows Media Audio]]
 
|
 
|.wma
 
|
 
|-
 
|[[Windows Media Player file]]
 
|
 
|.wmv
 
|
 
|-
 
|}<br />
 
  
 +
While there is not currently a default format policy for [[Websites]], we have done some research and assessment work with our clients that may be of interest towards developing one.
  
 +
==See also==
  
__NOTOC__
+
* Formats that are considered [[Access formats]]
 +
* Formats that are considered [[Preservation formats]]

Latest revision as of 14:43, 12 November 2019


This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica Preservation Planning documentation for information about format policies.

Format Policy Registry (FPR)

Archivematica manages format policies locally and externally via a Format Policy Registry (FPR). The registry runs on a server hosted by Artefactual and holds our default policies for normalization, extraction and format identification. The local FPR, available in the preservation planning tab of the user dashboard, can be customized by the local user. To learn more about the FPR, see Format Policy Registry. To read about some of the broader goals of the FPR, see FPR Requirements.

Migration and emulation

Archivematica maintains the original format of all ingested files to support migration and emulation preservation strategies.

Normalization

Archivematica's primary preservation strategy is to normalize files to preservation and access formats upon ingest. Archivematica's preservation formats are all open standards. Additionally, the choice of preservation format is based on community best practices, availability of open-source normalization tools, and an analysis of the significant characteristics for each media type. The choice of access formats is based on the ubiquity of viewers for the file format.

Follow the link for each file format for further information about the open-source normalization tools and settings that have been tested and integrated into Archivematica to make the format conversions.

Format policies

  • Format policies specify which tool to run when normalizing for a given purpose (access or preservation) once a specific file identification tool has identified a specific file format. They are analogous to virus definitions, which must be updated periodically in an Archivematica installation to keep the virus-scanning micro-service effective, and to software security updates, which are downloaded at the operating-system level to keep the host machine secure.
{| border="1" cellpadding="10" cellspacing="0"
|-
! Media type !! File formats !! Preservation format(s) !! Access format(s) !! Normalization tool
|-
| Audio || AC3, AIFF, MP3, WAV, WMA || WAVE (LPCM) || MP3 || FFmpeg
|-
| Email || PST || MBOX || MBOX || readpst
|-
| Email || Maildir** || Original format || MBOX || md2mb.py
|-
| Office Open XML || DOCX, PPTX, XLSX || Original format || Original format || Tool search in progress
|-
| Plain text || TXT || Original format || Original format || None
|-
| Portable Document Format || PDF || PDF/A || Original format || Ghostscript
|-
| Presentation files || PPT || Original format || PDF || Tool search in progress
|-
| Raster images || BMP, GIF, JPG, JP2*, PCT, PNG*, PSD, TIFF, TGA || Uncompressed TIFF || JPEG || ImageMagick
|-
| Raw camera files/Digital Negative format** || 3FR, ARW, CR2, CRW, DCR, DNG, ERF, KDC, MRW, NEF, ORF, PEF, RAF, RAW, X3F || Original format || JPEG || ImageMagick/UFRaw
|-
| Spreadsheets || XLS || Original format || Original format || None
|-
| Vector images || AI, EPS, SVG || SVG || PDF || Inkscape
|-
| Video || AVI, FLV, MOV, MPEG-1, MPEG-2, MPEG-4, SWF, WMV || FFV1/LPCM in MKV || MP4 || FFmpeg
|-
| Word processing files || DOC, WPD, RTF || Original format || Original format || Tool search in progress***
|}
  • (*) PNG and JPEG2000 are not normalized to a preservation format
  • (**) in development
  • (***) See Word processing formats, below
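
As a rough illustration of how a format policy maps an identified format and a purpose to a target format and a tool, the following Python sketch encodes a few rows of the table above. It is not the FPR's actual data model: the names PolicyRow, ROWS, NO_PRESERVATION_COPY and lookup_policy are hypothetical, and the rule that a target of "Original format" means no tool is run is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ORIGINAL = "Original format"


@dataclass(frozen=True)
class PolicyRow:
    """One row of the format policy table (hypothetical structure)."""
    media_type: str
    formats: Tuple[str, ...]
    preservation: str
    access: str
    tool: Optional[str]


# A subset of the rows listed in the table above.
ROWS = (
    PolicyRow("Audio", ("AC3", "AIFF", "MP3", "WAV", "WMA"),
              "WAVE (LPCM)", "MP3", "FFmpeg"),
    PolicyRow("Plain text", ("TXT",), ORIGINAL, ORIGINAL, None),
    PolicyRow("Portable Document Format", ("PDF",),
              "PDF/A", ORIGINAL, "Ghostscript"),
    PolicyRow("Raster images",
              ("BMP", "GIF", "JPG", "JP2", "PCT", "PNG", "PSD", "TIFF", "TGA"),
              "Uncompressed TIFF", "JPEG", "ImageMagick"),
    PolicyRow("Vector images", ("AI", "EPS", "SVG"), "SVG", "PDF", "Inkscape"),
    PolicyRow("Video",
              ("AVI", "FLV", "MOV", "MPEG-1", "MPEG-2", "MPEG-4", "SWF", "WMV"),
              "FFV1/LPCM in MKV", "MP4", "FFmpeg"),
)

# Footnote (*): PNG and JPEG2000 are not normalized to a preservation format.
NO_PRESERVATION_COPY = {"PNG", "JP2"}


def lookup_policy(file_format: str, purpose: str):
    """Return (target format, tool) for a format and purpose, or None if no row matches."""
    if purpose not in ("preservation", "access"):
        raise ValueError("purpose must be 'preservation' or 'access'")
    fmt = file_format.upper()
    for row in ROWS:
        if fmt in row.formats:
            target = row.preservation if purpose == "preservation" else row.access
            if purpose == "preservation" and fmt in NO_PRESERVATION_COPY:
                target = ORIGINAL
            # Assumption: keeping the original format means no tool is run.
            tool = None if target == ORIGINAL else row.tool
            return target, tool
    return None
```

For example, lookup_policy("wav", "preservation") yields the WAVE (LPCM) target with FFmpeg, while an unlisted format returns None so that the caller can decide how to handle files without a policy.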

Word processing formats

In early versions of Archivematica, word processing formats (Microsoft Word, WordPerfect, etc.) were normalized to PDF or open office formats using LibreOffice. However, testing showed that the results were too inconsistent, with significant losses of formatting information, to continue using this normalization path. Currently, the FPR does not have any normalization paths for word processing formats.

We have recently begun investigating LibreOffice again for this purpose. We have identified these issues:

  • LibreOffice sometimes hangs, causing any future LibreOffice jobs to fail until an administrator manually kills the service.
  • LibreOffice sometimes reports that it succeeded despite not creating a PDF, making it difficult to determine whether or not the job really succeeded.
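
Both issues can be guarded against from the calling side. The Python sketch below is one possible approach, not how Archivematica invokes LibreOffice: the function name run_conversion and its status strings are hypothetical. It enforces a timeout so that a hung converter cannot block later jobs, and it checks that the expected output file really exists instead of trusting the exit code alone.

```python
import subprocess
from pathlib import Path


def run_conversion(command, expected_output, timeout_seconds=120):
    """Run a conversion command and report 'ok', 'timeout', 'failed' or 'no-output'."""
    expected = Path(expected_output)
    # Remove any stale output so an old file cannot mask a failed run.
    if expected.exists():
        expected.unlink()
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(command, capture_output=True,
                                timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "timeout"
    if result.returncode != 0:
        return "failed"
    # A zero exit code is not enough: verify that a non-empty file was written.
    if not expected.is_file() or expected.stat().st_size == 0:
        return "no-output"
    return "ok"
```

The command is passed in as a list, so the same check works for any converter; with LibreOffice it could be a headless invocation such as soffice --headless --convert-to pdf --outdir followed by the output directory and the source file. Note that the timeout only terminates the process that was started directly, so a separately running LibreOffice service would still need its own supervision.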

Alternatives we have been investigating include:

  • unoconv, a script which wraps LibreOffice's headless mode.
  • AbiWord is another word processor which can convert to PDF at the command line. Its quality of conversion to PDF still needs to be investigated, and in initial tests we found that it could not open some of our sample files.
  • Document Liberation Project. This project has built a number of open source libraries that are able to open many proprietary document formats and convert them to either ODF (the open office format) or EPUB.

Some of the above tools may be used in combination. For example, a Document Liberation Project library could be used to create an ODF before using AbiWord or LibreOffice to convert to a PDF.

Web archive formats

While there is not currently a default format policy for Websites, we have done some research and assessment work with our clients that may be of interest towards developing one.

See also

  • Formats that are considered Access formats
  • Formats that are considered Preservation formats