OCR text in DIP

This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

Requirements

See related issues:

  • #6257

Add OCR text files to DIP and AIP

  • Add open-source OCR tool to Archivematica
    • Tesseract is the most actively developed. Other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
    • The options that still see releases are:
      • Tesseract - Very actively developed. Best accuracy of all the open-source options, good speed.
      • OCRad - Moderate-to-poor accuracy, excellent speed.
      • gOCR - Poor accuracy, slow speed.
    • Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
  • Add micro-service to OCR files in DIP (post-normalization)
    • Micro-service: Transcription
    • User choice Yes/No (default: No)
  • Add FPR purpose "Transcription", with OCR as the first and only tool in that section
  • Add OCR files to DIP in "[DIP]/OCRfiles" directory
  • Open question: run OCR on originals or preservation copies?
  • Add OCR files to AIP in "[AIP]/data/objects/metadata/OCRfiles"
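
The transcription step described above could be sketched roughly as follows. This is a minimal illustration, not Archivematica's actual micro-service code: the function names (`ocr_output_base`, `build_tesseract_command`, `transcribe`) and the `enabled` flag are hypothetical, and the sketch assumes the `tesseract` command-line tool is installed and on the PATH. It relies only on the standard Tesseract invocation `tesseract <image> <outputbase> -l <lang>`, which writes `<outputbase>.txt`.

```python
import subprocess
from pathlib import Path

OCR_DIRNAME = "OCRfiles"  # directory name taken from the requirements list


def ocr_output_base(dip_dir, source_file):
    """Return the output base path (no extension) inside [DIP]/OCRfiles."""
    return Path(dip_dir) / OCR_DIRNAME / Path(source_file).stem


def build_tesseract_command(source_file, output_base, language="eng"):
    """Build the argument list for `tesseract <image> <outputbase> -l <lang>`.

    Tesseract appends ".txt" to the output base itself.
    """
    return ["tesseract", str(source_file), str(output_base), "-l", language]


def transcribe(dip_dir, source_file, enabled=False, language="eng"):
    """Run OCR on one file if the user opted in.

    Returns the path of the text file, or None when OCR is skipped.
    `enabled` is a hypothetical stand-in for the Yes/No user choice,
    which the requirements say should default to No.
    """
    if not enabled:
        return None
    output_base = ocr_output_base(dip_dir, source_file)
    output_base.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if tesseract exits non-zero.
    subprocess.run(
        build_tesseract_command(source_file, output_base, language),
        check=True,
    )
    return output_base.parent / (output_base.name + ".txt")
```

Keeping the command construction separate from the `subprocess` call makes the path and argument logic easy to check without Tesseract installed. Whether `source_file` should be the original or the preservation copy is left open here, as it is in the requirements.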

</fileGrp>

<fileGrp USE="text/ocr">
  <file GROUPID="Group-67f7e276-0dd9-4b09-bb30-3589f3f3900e" ID="file-67f7e276-0dd9-4b09-bb30-3589f3f3900e">
    <FLocat xlink:href="objects/jp2/25080603-b3ffc8c4-0db9-454d-8fb6-eb01e0c1b07f.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
  </file>
  <file GROUPID="Group-5e8d85a8-802c-413c-a198-7e45466dfb04" ID="file-5e8d85a8-802c-413c-a198-7e45466dfb04">
    <FLocat xlink:href="objects/pdf/Glass_Hall-acffd9c8-f048-45d6-9ef5-10ca8a9d28ac.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
  </file>
  <file GROUPID="Group-dfb298bc-aa31-4054-918c-ba84e349e2fa" ID="file-dfb298bc-aa31-4054-918c-ba84e349e2fa">
    <FLocat xlink:href="objects/tif/43845161-5cc7ad6e-28bd-493f-a0fb-9ad13dccfe43.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
  </file>
</fileGrp>
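
A file group shaped like the `text/ocr` example above could be generated along these lines. This is a hedged sketch using Python's standard `xml.etree.ElementTree`, not the code Archivematica uses to write METS: the function name `build_ocr_file_group` is hypothetical, the elements are left unprefixed to mirror the excerpt, and reusing one UUID for both `GROUPID` and `ID` is an assumption read off the sample.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
# Make the serializer emit "xlink:href" rather than an auto-generated prefix.
ET.register_namespace("xlink", XLINK_NS)


def build_ocr_file_group(entries):
    """Build a <fileGrp USE="text/ocr"> element.

    `entries` is an iterable of (uuid_string, relative_path) pairs,
    where relative_path points at the OCR text file.
    """
    group = ET.Element("fileGrp", {"USE": "text/ocr"})
    for file_uuid, href in entries:
        file_el = ET.SubElement(
            group,
            "file",
            # Assumption: the same UUID is used in both attributes,
            # as in the sample excerpt.
            {"GROUPID": f"Group-{file_uuid}", "ID": f"file-{file_uuid}"},
        )
        ET.SubElement(
            file_el,
            "FLocat",
            {
                f"{{{XLINK_NS}}}href": href,
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "SYSTEM",
            },
        )
    return group
```

Calling `ET.tostring(build_ocr_file_group(entries), encoding="unicode")` serializes the group; the `xmlns:xlink` declaration is then written on the `fileGrp` element itself, whereas a full METS document would normally declare it on the root.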
  • Add configuration setting to administrative tab of the dashboard to pre-select OCR options