Normalizing Office Documents
This is a development page, not to be confused with the Media type preservation plans.
The typical open source approach to transcoding MS Office documents of various formats is to use unoconv to interface with Open Office. Unfortunately, unoconv has not worked as well on our system as we had hoped.
In Archivematica 0.6 we faced a problem where, on boot, the first one or two conversions would fail because the server had not yet been initialised. We got around that by creating a daemon to start a unoconv listener.
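The boot-time failures described above are a readiness problem: conversions were submitted before the listener was accepting connections. The following is a minimal Python sketch of one way to guard against that, by polling the listener's port before the first job is sent. It is not the daemon that shipped with Archivematica. The helper name wait_for_listener is hypothetical, and the host and port (localhost and 2002, which we believe to be unoconv's default) are assumptions that should be adjusted to the local setup.

```python
import socket
import time


def wait_for_listener(host="127.0.0.1", port=2002, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: host and port are assumptions, not values taken
    from the Archivematica daemon.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and try again.
            time.sleep(interval)
    return False
```

A startup script could launch the listener (for example with unoconv --listener), call wait_for_listener(), and only begin accepting conversion jobs once it returns True. Note that an open port only shows that the process is listening, not that it is fully ready to convert documents.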
In Archivematica 0.6.2, with stricter error checking and the implementation of the MCP, we came across further problems with unoconv. It would appear to hang while processing, and would occasionally report errors such as 139 (segmentation fault). After much testing, we replaced unoconv with some alternative scripts we found online. Issue 304 was used to track this problem. Even after unoconv was replaced, the occasional segmentation fault would still appear.
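The two failure modes described above, hangs and error 139, can both be detected from a wrapper script. Status 139 follows the shell convention of 128 plus the signal number, and SIGSEGV is signal 11. The sketch below is an assumption about how such a check might look, not the error checking actually used in Archivematica. The function name run_conversion and the default timeout are hypothetical, and the command to run is supplied by the caller. Python reports a child killed by a signal as a negative return code (-11), while a command run through a shell reports 139, so the sketch accepts both.

```python
import signal
import subprocess

# Shell convention: exit status = 128 + signal number (139 for SIGSEGV).
SEGFAULT_SHELL_STATUS = 128 + signal.SIGSEGV


def run_conversion(cmd, timeout=120):
    """Run cmd and classify the outcome as 'ok', 'hang', 'segfault' or 'error'.

    Hypothetical helper: cmd is whatever conversion command the caller uses.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child was killed after exceeding the timeout: treat as a hang.
        return "hang"
    code = result.returncode
    if code == 0:
        return "ok"
    if code == -signal.SIGSEGV or code == SEGFAULT_SHELL_STATUS:
        return "segfault"
    return "error"
```

A caller could use the result to retry a bounded number of times on "hang" or "segfault" and to fail immediately on "error". Whether retrying is appropriate depends on how intermittent the crashes are.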
We are aware that MS Office can save documents in PDF format, and that it may be possible to script or automate this. Our concern is that doing so is not in keeping with the Archivematica goal of being a completely open source utility. Future revisions of Archivematica may provide the option to normalize in this manner, but we would like to provide an open source alternative.