Email
Latest revision as of 13:44, 30 October 2013
See also Email preservation.
Significant characteristics of email
Preservation Format
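- Options:
  - CERP Project (http://siarchives.si.edu/cerp/) E-Mail Account Schema (http://www.archives.ncdcr.gov/mail-account)
  - mbox (http://www.qmail.org/man/man5/mbox.html) (implemented in 0.7)
  - maildir (http://cr.yp.to/proto/maildir.html) (implemented in 0.9)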
Access Format
- Options:
  - mbox (http://www.qmail.org/man/man5/mbox.html) (implemented in 0.7)
Attachments
- Attachments should be normalized according to the media type preservation plan for each attachment file format. Each attachment must remain linked to the email message it was sent with.
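One way to keep that link is to record the parent message's Message-ID next to each extracted attachment. The following is a minimal sketch using Python's standard `email` package, not a description of how Archivematica does it: the function names, the sample addresses and the manifest fields are all hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical payload standing in for a real attachment.
SAMPLE_PAYLOAD = b"%PDF-1.4 sample"


def build_sample_message():
    """Build a small message with one attachment (illustrative data only)."""
    msg = EmailMessage()
    msg["Message-ID"] = "<sample-1@example.org>"
    msg["From"] = "sender@example.org"
    msg["To"] = "archive@example.org"
    msg["Subject"] = "Report"
    msg.set_content("See attached.\n")
    msg.add_attachment(
        SAMPLE_PAYLOAD,
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return msg


def attachment_manifest(raw_bytes):
    """List each attachment together with the Message-ID of its parent."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    message_id = str(msg["Message-ID"])
    return [
        {
            "message_id": message_id,  # the link back to the email
            "filename": part.get_filename(),
            "media_type": part.get_content_type(),
            "size": len(part.get_content()),
        }
        for part in msg.iter_attachments()
    ]


if __name__ == "__main__":
    for entry in attachment_manifest(build_sample_message().as_bytes()):
        print(entry)
```

The media type in each entry is what a format policy would key on when choosing a normalization path, while the Message-ID lets the normalized file be traced back to its message.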
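Normalization tool
- Options:
  - OfflineImap (http://offlineimap.org/)
  - readpst (http://www.five-ten-sg.com/libpst/rn01re01.html) (implemented in 0.7)
  - Aperture (http://www.aduna-software.com/technology/aperture)
  - Tika (http://tika.apache.org/0.8/formats.html#Supported_Document_Formats) has an mbox extractor
  - PEDALS project email extractor (http://sourceforge.net/projects/pedalsemailextr/) (MS-Windows)
  - libpst (http://alioth.debian.org/projects/libpst/)
  - aid4mail (http://www.aid4mail.com/) (proprietary license, MS-Windows)
  - libpff (http://sourceforge.net/projects/libpff/) (not tested)
  - CERP's Email Preservation Parser (http://siarchives.si.edu/cerp/parserdownload.htm)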
Conversion test results
- PST to MBOX using readpst
- PST to Email Account XML Schema using CERP Email Parser
- Gmail to Maildir using OfflineImap
- Zimbra to Maildir using OfflineImap
Comments
General
- The PEDALS (Persistent Digital Archives and Library System) project (http://www.pedalspreservation.org/Default.aspx) has produced an open-source email extractor (http://sourceforge.net/projects/pedalsemailextr/) that converts .pst files to XML. However, the tool runs on Windows only, so users would need to extract the email outside Archivematica and submit the extracted messages as the SIP. For more information, see Library of Congress News and Events at http://www.digitalpreservation.gov/news/2010/20100924news_article_pedals_email_tool.html.
- mbox (http://en.wikipedia.org/wiki/Mbox) might be an acceptable preservation format for email. An mbox file aggregates the email messages of a mailbox into a single plain-text file.
  - The Bodleian Libraries at the University of Oxford use mbox as a preservation format for mailboxes. See http://www.dpconline.org/component/docman/doc_download/640-emailthomasjul2011.
- A detailed report on testing conversion of email from proprietary to open formats is available at http://www.significantproperties.org.uk/email-testingreport.html. The report includes information about testing conversions from PST to mbox using readpst (http://alioth.debian.org/projects/libpst/).
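To illustrate what "aggregation as plain text" means in practice, the sketch below writes two messages to an mbox file with Python's standard `mailbox` module and then reads them back. It is an illustration only: the helper names, addresses and subjects are made up, and it is not how readpst or Archivematica produce mbox files.

```python
import mailbox
from email.message import EmailMessage


def build_sample_mbox(path):
    """Write two illustrative messages into a single mbox file."""
    box = mailbox.mbox(path)
    box.lock()
    try:
        for n in (1, 2):
            msg = EmailMessage()
            msg["From"] = "sender@example.org"
            msg["To"] = "archive@example.org"
            msg["Subject"] = f"Sample message {n}"
            msg.set_content(f"Body of message {n}\n")
            box.add(msg)
        box.flush()
    finally:
        box.unlock()
        box.close()


def list_subjects(path):
    """Read the mbox back and return the Subject of every message."""
    box = mailbox.mbox(path, create=False)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()


def count_from_lines(path):
    """Count the 'From ' separator lines that delimit messages in the file."""
    with open(path, "rb") as handle:
        return sum(1 for line in handle if line.startswith(b"From "))


if __name__ == "__main__":
    import os
    import tempfile

    target = os.path.join(tempfile.mkdtemp(), "sample.mbox")
    build_sample_mbox(target)
    print(list_subjects(target))
    print(count_from_lines(target))
```

Because the whole mailbox is one text file, each message is found by scanning for a line that begins with "From " followed by a space, which is why the separator count equals the message count here.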
Maildir
- OfflineImap (http://offlineimap.org/) can connect to an active IMAP account and capture it as maildir.
From http://wiki.dovecot.org/MailboxFormat/Maildir:
- "This format debuted with the qmail server in the mid-1990s. Each mailbox folder is a directory and each message a file. This improves efficiency because individual emails can be modified, deleted and added without affecting the mailbox or other emails, and makes it safer to use on networked file systems such as NFS."
From Inspect: Significant Properties Testing Report: Electronic Mail (http://www.significantproperties.org.uk/email-testingreport.html), Gareth Knight, 30 March 2009:
- "Maildir is an organizational structure for the storage of one or more emails on a file system. Each email is stored as a distinct file in one of three sub-directories: the 'tmp' sub-directory temporarily stores emails during processing; 'new' contains newly delivered emails; and 'cur' contains emails that have been processed by the client's mail-reader software. The storage of each email as a distinct file in the file structure is cited as workaround to file locking issues that affect compound formats, such as mbox that update the mail data file that the user is accessing. However, the filename convention used for the storage of emails may cause incompatibility in implementations of Maildir for Unix-compatible and Microsoft Windows operating systems. The colon character is an illegal character in Microsoft Windows. However, there is no standard on the alternative character that may be used in the environment."
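The directory layout described in the quotation can be reproduced with Python's standard `mailbox.Maildir` class. This is a small sketch for orientation; the function names and sample message are hypothetical, and the target path is assumed not to exist yet so that the class creates the three sub-directories itself.

```python
import mailbox
import os
from email.message import EmailMessage

SUBDIRS = ("tmp", "new", "cur")


def create_sample_maildir(path):
    """Create a maildir at a path that does not exist yet and deliver one message."""
    box = mailbox.Maildir(path, create=True)  # makes tmp/, new/ and cur/
    msg = EmailMessage()
    msg["From"] = "sender@example.org"
    msg["To"] = "archive@example.org"
    msg["Subject"] = "Sample message"
    msg.set_content("Body of the sample message\n")
    key = box.add(msg)  # written to tmp/ first, then moved into new/
    box.close()
    return key


def layout(path):
    """Return the file names found in each of the three sub-directories."""
    return {
        sub: sorted(os.listdir(os.path.join(path, sub)))
        for sub in SUBDIRS
    }


if __name__ == "__main__":
    import tempfile

    target = os.path.join(tempfile.mkdtemp(), "Maildir")
    create_sample_maildir(target)
    for sub, names in layout(target).items():
        print(sub, names)
```

A freshly delivered message appears as one file under `new`; a mail client would later move it to `cur` and append its status flags to the filename.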
Notes:
- Archivematica sanitizes filenames to remove characters that are prohibited in various operating systems. In the case of maildir, a filename such as 1330052116_1.7424.dell-desktop,U=153,FMD5=f72fc74533f0b9f432010a357af35516:2, would be changed to 1330052116_1.7424.dell-desktop_U_153_FMD5_f72fc74533f0b9f432010a357af35516_2_. If we do ingest and preserve maildir files, we may need to skip the name cleanup micro-service, because email systems may depend on the format of the filename to recognize maildir files.
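The effect of the cleanup on a maildir filename can be sketched in a few lines of Python. The rule below (replace every character outside letters, digits, `.`, `_` and `-` with `_`) is an assumption chosen because it reproduces the example above; it is not taken from Archivematica's actual sanitization code, and both function names are hypothetical.

```python
import re

# Assumption: any character outside this safe set is replaced by "_".
# This reproduces the example in the note, not Archivematica's real rule set.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")

ORIGINAL = "1330052116_1.7424.dell-desktop,U=153,FMD5=f72fc74533f0b9f432010a357af35516:2,"


def sanitize_name(name):
    """Approximate the name cleanup described in the note."""
    return _UNSAFE.sub("_", name)


def maildir_flags(name):
    """Return the flag letters after the ':2,' marker, or None if it is absent."""
    _base, marker, flags = name.rpartition(":2,")
    return flags if marker else None


if __name__ == "__main__":
    cleaned = sanitize_name(ORIGINAL)
    print(cleaned)
    print(repr(maildir_flags(ORIGINAL)))  # marker present, no flags set
    print(repr(maildir_flags(cleaned)))   # marker lost after cleanup
```

In the original name the `:2,` marker is present with no flags after it, so the flags come back as an empty string. After cleanup the marker has become `_2_`, so software that looks for `:2,` can no longer find the flag section at all.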