Email preservation

These are requirements for email preservation in Archivematica:

  • Preservation format
    • Proprietary closed formats such as PST should be converted to an open preservation format
    • Preservation format should be text- or XML-based
    • Email messages, calendars, contacts and other related entities should be normalized to the preservation format
    • Preservation format should preserve the significant characteristics of the email messages
    • Preservation format should be capable of being viewed as an access format or should be capable of generating an access format
  • Attachments
    • Attachments should be converted to preservation and access formats
    • Converted attachments should retain links to emails to which they were attached
  • Access format
    • Access format should be human-readable and should be recognizable as email
    • Access format should allow simple, intuitive navigation between messages, attachments, email boxes, contacts, calendars
    • Access format should allow navigation to normalized access copies of attachments
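
The requirement that converted attachments retain links to their parent emails can be illustrated with a short sketch. It is not Archivematica's implementation. It uses only the Python standard library `email` package, and the function name `attachment_manifest` and the manifest keys are hypothetical, chosen for illustration. It assumes the message is available as raw RFC 5322 bytes (for example, the contents of an EML file) and uses the Message-ID header as the link back to the email.

```python
import email
from email import policy


def attachment_manifest(raw_bytes):
    """List the attachments of one message, each tagged with its parent Message-ID.

    Hypothetical helper: the manifest keys are illustrative only.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    message_id = msg["Message-ID"]
    # A message may lack a Message-ID header; keep None rather than invent one.
    message_id = str(message_id) if message_id is not None else None

    rows = []
    for part in msg.iter_attachments():
        # decode=True undoes base64 / quoted-printable transfer encoding.
        data = part.get_payload(decode=True) or b""
        rows.append(
            {
                "message_id": message_id,
                "filename": part.get_filename(),
                "size": len(data),
            }
        )
    return rows
```

Each row could be stored alongside the normalized copy of the attachment, so that the preservation and access derivatives can be traced back to the message they came from.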

Email preservation research

Preserving Email (DPC Technology Watch Report 11-01, ISSN 2048-7916, Digital Preservation Coalition 2011), Chris Prom, University of Illinois.

  • "In general, if an institution can get email into one of the MBOX or EML formats, it has taken a very big step on the road toward preserving email." (p.23)
  • "XML conversion tools...can be very useful in achieving format neutrality. However, the author is aware of no general-purpose tools that are intended to facilitate the access, display, searching, or visualization of messages that have been migrated to XML. Until such tools have been developed – if they ever are – institutions will be forced to provide access to migrated messages using an email client of their choice or the user’s choice, recognizing that specific tools support different functionality." (p.23)
  • "Institutions are beginning to implement the Email Account Schema [CERP], but few tools exist to query, display and render messages that are stored in the format. If the digital preservation community were to develop tools that support the Email Account Schema or a different XML standard for email, that XML format would be a likely candidate for adoption as an International Council on Archives or even an ISO standard." (p.24)
    • "Until such applications are developed, it may seem that there is relatively little immediate benefit to be gained by migrating email into an XML-based format. Therefore, institutions that decide to keep email in an XML format should also keep a copy of messages in one of the IETF formats, preferable EML, since it allows attachments to be written as separate files." (p.24)
  • "Once messages have been migrated outside their native client/server architecture, they can be searched, retrieved and displayed by loading them back into another server or client application. Therefore, repositories should keep the original email format or MBOX/EML files as an access copy, which can be imported into email clients as needed." (pp.28-29)
  • "At present, several tools, including Hypermail and Aid4Mail, can convert messages to a static HTML format. In addition, a tool currently under development at Stanford University shows exciting potential for making preserved email useful. The Muse program, which can capture messages from several server environments to a local computer, also includes a search, browsing, visualization and analysis tool (Stanford University, Mobisocial Laboratory 2011)." (p.29)
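
The report's recommendation to hold email as MBOX or EML can be illustrated with a minimal sketch that splits one MBOX file into one EML file per message. It uses Python's standard `mailbox` module. The function name `mbox_to_eml` and the sequential file-naming scheme are assumptions made for this example, and differences between MBOX variants (such as how "From " lines in message bodies are quoted) are not handled.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an MBOX file to its own .eml file.

    Hypothetical helper; files are named 000001.eml, 000002.eml, ...
    in the order the messages appear in the mailbox.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    # create=False raises an error instead of silently creating an empty mailbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            target = out / f"{index:06d}.eml"
            # get_bytes() returns the message without the MBOX "From " separator line.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written
```

Writing the raw bytes, rather than re-serializing a parsed message, keeps the headers and bodies as they were stored in the mailbox.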


Reshaping the Repository: The Challenge of Email Archiving, Goethals, A. and Gogel, W. In 7th International Conference on Preservation of Digital Objects (iPRES2010), Vienna, Austria, 2010.

  • "We have preliminarily chosen to use the CERP/EMCAP schema, because we think it strikes the right balance between fully supporting the complexities of email headers and structure with a welcome lack of manipulation of the message bodies and attachments. Unlike most of the other schemas, it uses generic <Header> elements to store the names and values of the message headers. The advantage of this approach is that it can accommodate unanticipated headers, for example custom headers added by client systems, or those that will be added to future revisions of the email RFCs. It can support multiple message bodies per email, including HTML, and pointers to externally-stored attachments. They also have a separate schema for wrapping base64-encoded attachments, however we will likely decode attachments and store them in their original formats. While the CERP/EMCAP schema is designed to contain all the email messages for an account, we anticipate that it will work equally well at storing a single email message, which is how we intend to use it." (p.3)
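
The advantage of generic header elements described in this quote can be sketched as follows. This is an illustration of the idea only: the element names `Message`, `Header`, `Name` and `Value` are assumptions for the example and have not been checked against the CERP/EMCAP schema, and no XML namespace is applied. The function name `headers_to_xml` is likewise hypothetical. The sketch uses the Python standard library `email` and `xml.etree.ElementTree` modules.

```python
import email
import xml.etree.ElementTree as ET
from email import policy


def headers_to_xml(raw_bytes):
    """Store every header of a message as a generic name/value pair.

    Element names are illustrative, not taken from the CERP/EMCAP schema.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    root = ET.Element("Message")
    # items() yields all headers in their original order, including
    # custom ones, so no per-header element needs to be defined in advance.
    for name, value in msg.items():
        header = ET.SubElement(root, "Header")
        ET.SubElement(header, "Name").text = name
        ET.SubElement(header, "Value").text = str(value)
    return root
```

Because the header name is data rather than an element name, a custom header such as a client-added `X-` field is stored in the same way as `Subject` or `From`. The resulting tree can be serialized with `ET.tostring(root, encoding="unicode")`.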