Difference between revisions of "OCR text in DIP"

From Archivematica
Line 17:
  ***[http://jocr.sourceforge.net/ gOCR] - Poor accuracy, slow speed.
  **Tesseract appears to be the best solution in most cases. If speed was paramount and mediocre or poor accuracy would be acceptable, there might be an argument to use OCRad.
  *Add micro-service to OCR files in DIP (post-normalization)
- **Micro-service: OCR files
+ **Micro-service: Transcription
- **User choice Yes/No
+ **User choice Yes/No (default NO)
+ *Add FPR purpose - Transcription - OCR first and only tool in that section
  *Add OCR files to DIP in "[DIP]/OCRfiles" directory
- *Add OCR files to AIP in "[AIP]/data/objects/submissionDocumentation" -OR- "AIP/data/objects/metadata/OCRfiles" - needs discussion
+ *Run OCR on originals or preservation copies??
+ *Add OCR files to AIP in "AIP/data/objects/metadata/OCRfiles"
+ **METS file PREMIS event Transcription (add to PREMIS events)
+ **METS file new file group to file sec (Evelyn to mock up)
  *Add configuration setting to administrative tab of the dashboard to pre-select OCR options

Revision as of 18:44, 7 April 2014

Requirements

See related issues:

  • #6257

Add OCR text files to DIP and AIP

  • Add open-source OCR tool to Archivematica
    • Tesseract is the most actively developed; the other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
    • The options that still see releases are:
      • Tesseract - Very actively developed. Best accuracy of all the open-source options, good speed.
      • OCRad - Moderate-to-poor accuracy, excellent speed.
      • gOCR - Poor accuracy, slow speed.
    • Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
  • Add micro-service to OCR files in DIP (post-normalization)
    • Micro-service: Transcription
    • User choice Yes/No (default NO)
  • Add FPR purpose "Transcription", with OCR as the first and only tool in that section
  • Add OCR files to DIP in "[DIP]/OCRfiles" directory
  • Open question: run OCR on originals or on preservation copies?
  • Add OCR files to AIP in "AIP/data/objects/metadata/OCRfiles"
    • METS file: add a PREMIS event "Transcription" (add to PREMIS events)
    • METS file: add a new file group to the fileSec (Evelyn to mock up)
  • Add configuration setting to administrative tab of the dashboard to pre-select OCR options
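
The micro-service requirements above can be sketched roughly as follows. This is a minimal illustration, not Archivematica code: the function names, the `run_ocr` flag (standing in for the Yes/No user choice, default NO), and the injectable `runner` parameter are all assumptions made for the example. The only external detail relied on is the basic Tesseract command form, `tesseract <image> <outputbase>`, where Tesseract appends `.txt` to the output base itself.

```python
import subprocess
from pathlib import Path

# Directory name taken from the requirement: "[DIP]/OCRfiles".
OCR_DIR_NAME = "OCRfiles"


def ocr_output_base(dip_dir: Path, source_file: Path) -> Path:
    """Return the Tesseract output base (no extension) for one source file."""
    return dip_dir / OCR_DIR_NAME / source_file.stem


def build_tesseract_command(source_file: Path, output_base: Path) -> list[str]:
    """Build the command line; Tesseract adds the '.txt' suffix on its own."""
    return ["tesseract", str(source_file), str(output_base)]


def ocr_dip_files(dip_dir, files, run_ocr=False, runner=subprocess.run):
    """Hypothetical 'Transcription' step: OCR each file into [DIP]/OCRfiles.

    run_ocr models the Yes/No user choice and defaults to NO (False).
    runner is injectable so the step can be exercised without Tesseract.
    Returns the list of text files the step expects Tesseract to write.
    """
    if not run_ocr:
        return []
    (dip_dir / OCR_DIR_NAME).mkdir(parents=True, exist_ok=True)
    outputs = []
    for source_file in files:
        base = ocr_output_base(dip_dir, source_file)
        runner(build_tesseract_command(source_file, base), check=True)
        # Append rather than replace a suffix, so "page.one" -> "page.one.txt".
        outputs.append(Path(str(base) + ".txt"))
    return outputs
```

With the default `run_ocr=False` the step does nothing and creates no directory, which matches the "default NO" requirement. Whether `files` should hold originals or preservation copies is left to the caller, since that question is still open above.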