Normalizing Office Documents

From Archivematica
Revision as of 17:51, 24 January 2011 by Joseph


This is a development page, not to be confused with the Media type preservation plans.


The typical open source approach to transcoding MS Office documents of various formats is to use unoconv to interface with OpenOffice. Unfortunately, unoconv has not worked reliably in our system.


In Archivematica 0.6 we faced a problem where, on boot, the first one or two conversions would fail because the server had not yet been initialised. We worked around that by creating a daemon that starts a unoconv listener.
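
The workaround described above can be sketched roughly as follows. This is not the daemon Archivematica shipped, only a minimal illustration of the idea: start the listener, then refuse to hand out conversions until it accepts connections. The helper names `wait_for_port` and `start_listener` are hypothetical, and the host and port (127.0.0.1:2002) are assumed to be the unoconv defaults; adjust them if your setup differs.

```python
import socket
import subprocess
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not up yet (refused or timed out): wait and try again.
            time.sleep(interval)
    return False


def start_listener(host="127.0.0.1", port=2002, timeout=30.0):
    """Start `unoconv --listener` and block until it accepts connections.

    Assumption: the listener binds to host:port (believed to be the default).
    """
    proc = subprocess.Popen(["unoconv", "--listener"])
    if not wait_for_port(host, port, timeout):
        proc.terminate()
        raise RuntimeError("unoconv listener did not come up in time")
    return proc
```

Calling `start_listener()` once at boot, before the first conversion is queued, is meant to avoid the failed first conversions described above.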


In Archivematica 0.6.2, with stricter error checking and the implementation of the MCP, we came across further problems with unoconv. It would appear to hang while processing, and occasionally report errors such as exit code 139 (segmentation fault). After much testing, we replaced unoconv with some alternative scripts we found online. Issue 304 was used to track this problem. After unoconv was replaced, the occasional segmentation fault would still appear. I believe the segmentation fault problem is in OpenOffice: it may be caused by collisions of temporary files or memory between returns and cleanups, but I'm only guessing. The problem seems less frequent when longer sleeps are inserted between calls.
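
The "longer sleeps between calls" mitigation could look something like the sketch below. It is an illustration only, not the replacement scripts mentioned above: `run_with_retry` is a hypothetical helper, and the retry count and delay are arbitrary assumptions. A segmentation fault is recognised either as exit code 139 (what a shell reports) or as the negative signal number that Python's `subprocess` reports when the child is killed by SIGSEGV.

```python
import signal
import subprocess
import time

# 139 = 128 + SIGSEGV as reported by a shell; subprocess reports -SIGSEGV directly.
SEGFAULT_CODES = {139, -signal.SIGSEGV}


def run_with_retry(cmd, retries=3, delay=5.0):
    """Run cmd, sleeping `delay` seconds and retrying if it segfaults.

    Returns (returncode, attempts). Any other exit code, success or
    failure, is returned immediately without retrying.
    """
    attempts = 0
    while True:
        attempts += 1
        returncode = subprocess.run(cmd).returncode
        if returncode not in SEGFAULT_CODES or attempts > retries:
            return returncode, attempts
        # Give OpenOffice time to finish its cleanup before the next call.
        time.sleep(delay)
```

A conversion command would be passed as a list, for example `run_with_retry(["unoconv", "-f", "pdf", "report.doc"])`, where `report.doc` is a placeholder file name. Retrying only masks the crash; it does not address whatever causes it.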


We are aware that MS Office can save documents in PDF format, and that this could probably be scripted or automated. Our concern is that this is not in keeping with the Archivematica goal of being a completely open source utility. Future revisions of Archivematica may provide the option to normalize in this manner, but we would like to provide an open source alternative.