Difference between revisions of "OCR text in DIP"

From Archivematica

Revision as of 21:07, 27 March 2014

Requirements

See related issues:

  • #6257

Add OCR text files to DIP

  • Add open-source OCR tool to Archivematica
    • Tesseract is the most actively developed option. Other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically, without major improvements (OCRad, gOCR).
    • The options that still see releases are:
      • Tesseract (https://code.google.com/p/tesseract-ocr/) - Very actively developed. Best accuracy of all the open-source options, good speed.
      • OCRad (http://www.gnu.org/software/ocrad/) - Moderate-to-poor accuracy, excellent speed.
      • gOCR (http://jocr.sourceforge.net/) - Poor accuracy, slow speed.
    • Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
  • Add micro-service to OCR files in DIP (post-normalization)
    • Micro-service name: OCR files
    • User choice: Yes/No
  • Add OCR files to DIP in "[DIP]/OCR files" directory
  • Add configuration setting to administrative tab of the dashboard to pre-select OCR options
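
The requirements above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Archivematica's actual micro-service code: the function names (`ocr_dip`, `resolve_ocr_choice`, and so on) are hypothetical, and it assumes the DIP's access copies sit in a flat `objects` directory. The only external behaviour relied on is Tesseract's command-line form `tesseract <image> <outputbase>`, which writes the recognized text to `<outputbase>.txt`. The `runner` parameter exists only so the sketch can be exercised without Tesseract installed.

```python
import os
import subprocess

OCR_DIRNAME = "OCR files"    # from the requirement: "[DIP]/OCR files"
OBJECTS_DIRNAME = "objects"  # assumption: DIP access copies live here


def ocr_output_base(dip_dir, source_path):
    """Return the output path (without extension) for one source file."""
    stem = os.path.splitext(os.path.basename(source_path))[0]
    return os.path.join(dip_dir, OCR_DIRNAME, stem)


def tesseract_command(source_path, output_base):
    """Build the Tesseract call; it writes text to '<output_base>.txt'."""
    return ["tesseract", source_path, output_base]


def resolve_ocr_choice(preconfigured, ask_user):
    """Use the pre-selected setting if one exists, otherwise ask the user.

    `preconfigured` is True, False, or None (nothing pre-selected);
    `ask_user` is a hypothetical callable returning True or False.
    """
    if preconfigured is not None:
        return preconfigured
    return ask_user()


def ocr_dip(dip_dir, run_ocr, runner=subprocess.call):
    """OCR each file in [DIP]/objects into [DIP]/OCR files.

    Returns the text file paths for the files whose OCR command
    exited with status 0. Does nothing if the user chose "No".
    """
    if not run_ocr:
        return []
    objects_dir = os.path.join(dip_dir, OBJECTS_DIRNAME)
    ocr_dir = os.path.join(dip_dir, OCR_DIRNAME)
    if not os.path.isdir(ocr_dir):
        os.makedirs(ocr_dir)
    outputs = []
    for name in sorted(os.listdir(objects_dir)):
        source = os.path.join(objects_dir, name)
        if not os.path.isfile(source):
            continue  # this sketch does not descend into subdirectories
        base = ocr_output_base(dip_dir, source)
        if runner(tesseract_command(source, base)) == 0:
            outputs.append(base + ".txt")
    return outputs
```

A call such as `ocr_dip("/path/to/DIP", resolve_ocr_choice(None, ask))` would then prompt only when no option has been pre-selected. Note that two source files sharing a base name (for example `page.tif` and `page.png`) would collide in the output directory; a real micro-service would need a naming rule for that case.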