Difference between revisions of "Email preservation"

From Archivematica
Latest revision as of 14:48, 23 March 2017

Functional Requirements

These are requirements for email preservation in Archivematica:

  • Preservation format
    • Proprietary closed formats such as PST should be converted to an open preservation format
    • Preservation format should be text- or XML-based
    • Email messages, calendars, contacts and other related entities should be normalized to the preservation format
    • Preservation format should preserve the significant characteristics of the email messages
    • Preservation format should be capable of being viewed as an access format or should be capable of generating an access format
  • Attachments
    • Attachments should be converted to preservation and access formats
    • Converted attachments should retain links to emails to which they were attached
  • Access format
    • Access format should be human-readable and should be recognizable as email
    • Access format should allow simple, intuitive navigation between messages, attachments, email boxes, contacts, calendars
    • Access format should allow navigation to normalized access copies of attachments
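
The requirement that converted attachments retain links to their emails could be met by recording the Message-ID next to each extracted attachment. The following is a minimal sketch using only Python's standard `email` package. It is not Archivematica code: the function names, the manifest layout and the sample message are hypothetical and exist only for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_message():
    """Build a small test email with one attachment (hypothetical data)."""
    msg = EmailMessage()
    msg["Message-ID"] = "<example-1@archive.invalid>"
    msg["Subject"] = "Report"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"col1,col2\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="data.csv",
    )
    return msg.as_bytes()


def attachment_manifest(raw_message):
    """Return one record per attachment, each pointing back to its email."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    header = msg["Message-ID"]
    message_id = str(header) if header is not None else None
    records = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True)  # decoded bytes of the attachment
        records.append(
            {
                "message_id": message_id,
                "filename": part.get_filename() or "unnamed",
                "size": len(payload),
            }
        )
    return records


if __name__ == "__main__":
    for record in attachment_manifest(build_sample_message()):
        print(record)
```

A manifest like this could be stored alongside the extracted attachments so that each normalized copy can be traced back to the message it came from.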


Formats

EML

From http://www.coolutils.com/Formats/EML:

  • EML, which stands for ‘E-mail’, is the file extension of the Outlook Express Saved Mail Messages files. It belongs to the Microsoft range of e-mail management programs and is used for saving e-mails for storage and forwarding purposes.
  • Since the object of an EML file is to store e-mail messages, it is a plain text file, and as a result, has a standard file structure. It consists of a short header and the main body. The header contains the e-mail addresses of the sender and the recipient, the subject, and the time and date of the message. The main message of the e-mail is in the body of the file. EML files can also contain hyperlinks and attachments.
  • Since EML files are created to comply with the industry RFC 822 standard, they can be used with most e-mail clients, servers and applications. Besides the Microsoft Outlook Express, EML files can be opened using most e-mail clients, such as Microsoft Outlook, Microsoft Entourage, Mozilla Thunderbird, Apple Mail, and IncrediMail. Since EML files are plaintext and formatted much like MHT (MIME HTML) files, they can also be opened directly in the Internet Explorer, Mozilla Firefox and Opera, by first changing the file extension from ‘.eml’ to ‘.mht’. It is also possible to view EML files using notepad or any other text editor.

PRONOM:

  • Defined in RFC5322 the Internet Message Format provides a syntax for text messages that are sent among computer users, within the framework of "electronic mail". Consisting of header fields and an optional body, encoded using ASCII characters. The Internet Message Format (*.eml) is a format used by numerous email systems. The format does not prevent the use of alternative formats for storage of electronic mail on those systems.
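
To illustrate the header-fields-plus-body structure described above, here is a small sketch that parses an EML-style message with Python's standard `email` package. The sample message, addresses and the `summarize_eml` helper are made up for illustration.

```python
from email import message_from_string, policy

# A hypothetical message: header fields, a blank line, then the body.
RAW_EML = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Lunch\n"
    "Date: Fri, 23 Mar 2012 12:00:00 +0000\n"
    "\n"
    "Are you free at noon?\n"
)


def summarize_eml(raw_text):
    """Return the header names, the subject and the plain-text body."""
    msg = message_from_string(raw_text, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    return {
        "headers": list(msg.keys()),
        "subject": str(msg["Subject"]),
        "body": body_part.get_content() if body_part is not None else "",
    }


if __name__ == "__main__":
    print(summarize_eml(RAW_EML))
```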


MBOX

From Inspect: Significant Properties Testing Report: Electronic Mail Gareth Knight, 30 March 2009:

  • "The mbox family refers to four related, but only semi-compatible formats for the storage of one or more email messages and attachments. The four formats - mboxo, mboxrd, mboxcl, and mboxcl2 – originate from different versions of Unix. Each mbox file represents a set of email messages that are ordered sequentially and grouped into a ‘folder’. Email messages are stored in their source format, e.g. plain text may be stored as ASCII or Unicode, binary data is stored as Base64-encoded text. The format is well supported by a number of email applications and, thanks to its text-based composition can be processed, rendered and converted by a wide range of text processing software."
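
One reason the variants are only semi-compatible is how they treat body lines that begin with "From ", the string that also marks the start of each message in the file. The sketch below shows the commonly described quoting rules for mboxo and mboxrd. It is an illustration of those two conventions only, not code taken from any of the tools discussed on this page, and the function names are hypothetical.

```python
import re


def quote_mboxo(body):
    """mboxo: prefix '>' to body lines that start with 'From '."""
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def quote_mboxrd(body):
    """mboxrd: prefix '>' to lines that start with zero or more '>' then 'From '."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)


def unquote_mboxrd(body):
    """Reverse mboxrd quoting by removing exactly one leading '>'."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)


if __name__ == "__main__":
    original = "From here on\n>From the start\nplain\n"
    print(quote_mboxo(original))   # both quoted lines now start with '>From '
    print(quote_mboxrd(original))  # the second line gains a second '>'
    print(unquote_mboxrd(quote_mboxrd(original)) == original)
```

With mboxo quoting the two "From" lines become indistinguishable, so the original text cannot be recovered reliably, while the mboxrd rule is reversible.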


Maildir

From http://wiki.dovecot.org/MailboxFormat/Maildir:

  • "This format debuted with the qmail server in the mid-1990s. Each mailbox folder is a directory and each message a file. This improves efficiency because individual emails can be modified, deleted and added without affecting the mailbox or other emails, and makes it safer to use on networked file systems such as NFS."

From Inspect: Significant Properties Testing Report: Electronic Mail Gareth Knight, 30 March 2009:

  • "Maildir is an organizational structure for the storage of one or more emails on a file system. Each email is stored as a distinct file in one of three sub-directories: the ‘tmp’ sub-directory temporarily stores emails during processing; ‘new’ contains newly delivered emails; and ‘cur’ contains emails that have been processed by the client’s mail-reader software. The storage of each email as a distinct file in the file structure is cited as workaround to file locking issues that affect compound formats, such as mbox that update the mail data file that the user is accessing. However, the filename convention used for the storage of emails may cause incompatibility in implementations of Maildir for Unix-compatible and Microsoft Windows operating systems. The colon character is an illegal character in Microsoft Windows. However, there is no standard on the alternative character that may be used in the environment."

Notes:

  • Archivematica sanitizes filenames to remove characters that are prohibited in various operating systems. In the case of maildir, a filename such as 1330052116_1.7424.dell-desktop,U=153,FMD5=f72fc74533f0b9f432010a357af35516:2, would be changed to 1330052116_1.7424.dell-desktop_U_153_FMD5_f72fc74533f0b9f432010a357af35516_2_. If we do ingest and preserve maildir files, we may need to skip the name cleanup micro-service, since email systems may depend on the format of the filename to recognize maildir files.
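
The effect of the renaming can be shown with a short sketch. The `sanitize_name` function below is an assumption: it is a simplified stand-in that reproduces the example above by replacing every character outside letters, digits, dot, underscore and hyphen with an underscore, and it is not the actual Archivematica micro-service. The `maildir_flags` helper is likewise hypothetical; it looks for the ":2," marker that separates the unique part of a maildir filename from its flags.

```python
import re


def sanitize_name(name):
    """Simplified stand-in for name cleanup (assumed rule, see the note above)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def maildir_flags(name):
    """Return the flags after ':2,', or None if the marker is missing."""
    _base, marker, flags = name.rpartition(":2,")
    return flags if marker else None


if __name__ == "__main__":
    original = (
        "1330052116_1.7424.dell-desktop,U=153,"
        "FMD5=f72fc74533f0b9f432010a357af35516:2,"
    )
    cleaned = sanitize_name(original)
    print(cleaned)
    print(repr(maildir_flags(original)))  # empty string: marker present, no flags set
    print(repr(maildir_flags(cleaned)))   # None: the marker no longer exists
```

After cleanup the ":2," marker is gone, so software that relies on it can no longer read the message flags from the filename.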


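Tools

  • Readpst (http://www.five-ten-sg.com/libpst/rn01re01.html), which can be downloaded from http://www.five-ten-sg.com/libpst/packages/, converts PST (MS Outlook Personal Folders) files to mbox and other formats.
  • OfflineImap (http://offlineimap.org/) connects to IMAP accounts and saves the contents locally as maildir backups.
  • md2mb.py (https://gist.github.com/1709069) is a Python script that converts maildir to mbox format.
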
Muse

Muse is a social media tool that can render mbox files as plain text messages with attachments.

  • Features:
    • Ability to tag restricted messages and export the untagged messages as an mbox file. Note that tagging one message also tags any threads containing the same message.
    • Ability to keyword search email messages.
    • Ability to key quickly through messages by using forward and back arrows on the keyboard.
    • Ability to click on sender or receiver to view all the messages sent or received by that individual.
    • Ability to click on selected terms in the email to view other messages with the same terms.
  • Drawbacks:
    • In alpha development, somewhat brittle in Linux environment.
    • Poorly designed user interface.
    • Poor user documentation.


Email preservation research

Preserving Email (DPC Technology Watch Report 11-01, ISSN 2048-7916, Digital Preservation Coalition 2011), Chris Prom, University of Illinois.

  • "In general, if an institution can get email into one of the MBOX or EML formats, it has taken a very big step on the road toward preserving email." (p.23)
  • "XML conversion tools...can be very useful in achieving format neutrality. However, the author is aware of no general-purpose tools that are intended to facilitate the access, display, searching, or visualization of messages that have been migrated to XML. Until such tools have been developed – if they ever are – institutions will be forced to provide access to migrated messages using an email client of their choice or the user’s choice, recognizing that specific tools support different functionality." (p.23)
  • "Institutions are beginning to implement the Email Account Schema [CERP], but few tools exist to query, display and render messages that are stored in the format. If the digital preservation community were to develop tools that support the Email Account Schema or a different XML standard for email, that XML format would be a likely candidate for adoption as an International Council on Archives or even an ISO standard." (p.24)
    • "Until such applications are developed, it may seem that there is relatively little immediate benefit to be gained by migrating email into an XML-based format. Therefore, institutions that decide to keep email in an XML format should also keep a copy of messages in one of the IETF formats, preferable EML, since it allows attachments to be written as separate files." (p.24)
  • "Once messages have been migrated outside their native client/server architecture, they can be searched, retrieved and displayed by loading them back into another server or client application. Therefore, repositories should keep the original email format or MBOX/EML files as an access copy, which can be imported into email clients as needed." (pp.28-29)
  • "At present, several tools, including Hypermail and Aid4Mail, can convert messages to a static HTML format. In addition, a tool currently under development at Stanford University shows exciting potential for making preserved email useful. The Muse program, which can capture messages from several server environments to a local computer, also includes a search, browsing, visualization and analysis tool (Stanford University, Mobisocial Laboratory 2011)." (p.29)


2010. Reshaping the Repository: The Challenge of Email Archiving, Goethals, A. and Gogel, W. In 7th International Conference on Preservation of Digital Objects (iPRES2010). Vienna, Austria.

  • "We have preliminarily chosen to use the CERP/EMCAP schema, because we think it strikes the right balance between fully supporting the complexities of email headers and structure with a welcome lack of manipulation of the message bodies and attachments. Unlike most of the other schemas, it uses generic <Header> elements to store the names and values of the message headers. The advantage of this approach is that it can accommodate unanticipated headers, for example custom headers added by client systems, or those that will be added to future revisions of the email RFCs. It can support multiple message bodies per email, including HTML, and pointers to externally-stored attachments. They also have a separate schema for wrapping base64-encoded attachments, however we will likely decode attachments and store them in their original formats. While the CERP/EMCAP schema is designed to contain all the email messages for an account, we anticipate that it will work equally well at storing a single email message, which is how we intend to use it." (p.3)


Archivematica 0.9 maildir ingest requirements

  • The Maildir directory as a whole forms the transfer
  • The Maildir directory is the preservation master in the AIP
  • Attachments must be extracted and normalized
  • Extracted attachments and their normalized versions will reside in a directory outside the Maildir directory
  • Maildir message names must NOT be sanitized (they contain colons, commas and equal signs)
  • The Maildir messages DO NOT have to go through FITS
  • The extracted attachments DO have to go through FITS
  • Each Maildir subdirectory is normalized to an mbox file for access purposes
  • Question: how will we link the mbox file back to the directory in the AIP? May not have to do this for 0.9.
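
As a rough illustration of the "normalize each Maildir subdirectory to an mbox file" requirement, the sketch below uses Python's standard `mailbox` module. It is not the Archivematica implementation; the function name and paths are hypothetical, and it assumes the source directory is a valid Maildir with cur, new and tmp sub-directories.

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message in a Maildir into a single mbox file.

    Returns the number of messages written. Messages in 'tmp' are not
    included, because mailbox.Maildir only exposes 'new' and 'cur'.
    """
    source = mailbox.Maildir(maildir_path, factory=None, create=False)
    target = mailbox.mbox(mbox_path)
    target.lock()
    count = 0
    try:
        for message in source:
            target.add(message)  # Maildir flags are mapped to mbox status headers
            count += 1
        target.flush()
    finally:
        target.unlock()
        target.close()
        source.close()
    return count


if __name__ == "__main__":
    import sys

    # Usage: python maildir_to_mbox.py /path/to/Maildir /path/to/output.mbox
    print(maildir_to_mbox(sys.argv[1], sys.argv[2]))
```

Because the conversion only reads from the Maildir, the original directory can remain untouched as the preservation master while the mbox file serves as the access copy.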