OCR text in DIP

From Archivematica
Latest revision as of 16:22, 11 February 2020

This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.

Requirements

See related issues:

  • #6257

Add OCR text files to DIP and AIP

  • Add open-source OCR tool to Archivematica
    • Tesseract is the most actively developed; other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
    • The options that still see releases are:
      • Tesseract - Very actively developed. Best accuracy of all the open-source options, good speed.
      • OCRad - Moderate-to-poor accuracy, excellent speed.
      • gOCR - Poor accuracy, slow speed.
    • Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
  • Add micro-service to OCR files in DIP (post-normalization)
    • Micro-service: Transcription
    • User choice Yes/No (default: No)
  • Add FPR purpose "Transcription", with OCR as the first and only tool in that section
  • Add OCR files to DIP in "[DIP]/OCRfiles" directory
  • Open question: run OCR on originals or on preservation copies?
  • Add OCR files to AIP in "[AIP]/data/objects/metadata/OCRfiles"
    • METS file: PREMIS event Transcription (add to PREMIS events); see PREMIS_metadata:_events#Transcription
    • METS file: use a text/ocr fileGrp in the fileSec; see METS#<fileSec>

    </fileGrp>
    <fileGrp USE="text/ocr">
      <file GROUPID="Group-67f7e276-0dd9-4b09-bb30-3589f3f3900e" ID="file-67f7e276-0dd9-4b09-bb30-3589f3f3900e">
        <FLocat xlink:href="objects/jp2/25080603-b3ffc8c4-0db9-454d-8fb6-eb01e0c1b07f.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </file>
      <file GROUPID="Group-5e8d85a8-802c-413c-a198-7e45466dfb04" ID="file-5e8d85a8-802c-413c-a198-7e45466dfb04">
        <FLocat xlink:href="objects/pdf/Glass_Hall-acffd9c8-f048-45d6-9ef5-10ca8a9d28ac.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </file>
      <file GROUPID="Group-dfb298bc-aa31-4054-918c-ba84e349e2fa" ID="file-dfb298bc-aa31-4054-918c-ba84e349e2fa">
        <FLocat xlink:href="objects/tif/43845161-5cc7ad6e-28bd-493f-a0fb-9ad13dccfe43.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
      </file>
    </fileGrp>
  • Add configuration setting to administrative tab of the dashboard to pre-select OCR options
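
The requirements above can be illustrated with a minimal Python sketch. It is not Archivematica's implementation: the function names tesseract_command and build_ocr_filegrp are hypothetical, the OCR command is only constructed (not executed), and the elements are left unprefixed to mirror the excerpt above, whereas a real METS file would place them in the METS namespace. The sketch relies on the standard Tesseract invocation "tesseract <image> <outputbase>", which writes <outputbase>.txt by default.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def tesseract_command(image_path, ocr_dir):
    """Return the argv for `tesseract <image> <outputbase>`.

    Tesseract appends ".txt" to the output base by default, so the
    text lands in <ocr_dir>/<image stem>.txt. Hypothetical helper.
    """
    image = Path(image_path)
    output_base = Path(ocr_dir) / image.stem
    return ["tesseract", str(image), str(output_base)]


def build_ocr_filegrp(entries):
    """Build a <fileGrp USE="text/ocr"> element.

    `entries` is an iterable of (file_uuid, relative_txt_path) pairs.
    The GROUPID/ID patterns are assumptions copied from the excerpt.
    """
    file_grp = ET.Element("fileGrp", {"USE": "text/ocr"})
    for file_uuid, txt_path in entries:
        file_el = ET.SubElement(
            file_grp,
            "file",
            {"GROUPID": f"Group-{file_uuid}", "ID": f"file-{file_uuid}"},
        )
        ET.SubElement(
            file_el,
            "FLocat",
            {
                f"{{{XLINK_NS}}}href": txt_path,
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "SYSTEM",
            },
        )
    return file_grp


if __name__ == "__main__":
    # Command that would write DIP/OCRfiles/43845161.txt if executed.
    print(tesseract_command("objects/tif/43845161.tif", "DIP/OCRfiles"))
    grp = build_ocr_filegrp(
        [(
            "dfb298bc-aa31-4054-918c-ba84e349e2fa",
            "objects/tif/43845161-5cc7ad6e-28bd-493f-a0fb-9ad13dccfe43.txt",
        )]
    )
    print(ET.tostring(grp, encoding="unicode"))
```

Keeping command construction separate from execution makes the "[DIP]/OCRfiles" output location easy to check without Tesseract installed; running the command, for example with subprocess.run, would be a separate step.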