Difference between revisions of "OCR text in DIP"

Revision as of 21:07, 27 March 2014


Requirements

See related issues:

#6257

Add OCR text files to DIP

Add open-source OCR tool to Archivematica
- Tesseract is the most actively developed; other open-source OCR is either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
- The options that still see releases are:
  - Tesseract - Very actively developed. Best accuracy of all the open-source options, good speed.
  - OCRad - Moderate-to-poor accuracy, excellent speed.
  - gOCR - Poor accuracy, slow speed.
- Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
Add micro-service to OCR files in DIP (post-normalization)
- Micro-service: OCR files
- User choice Yes/No
Add OCR files to DIP in "[DIP]/OCR files" directory
Add configuration setting to administrative tab of the dashboard to pre-select OCR options
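
The requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Archivematica implementation: the function names, the `objects` sub-directory, the list of image suffixes, and the `run_ocr` flag (standing in for the user's Yes/No choice) are all hypothetical. The only external behaviour relied on is that the Tesseract command line takes an image and an output base and writes `<outputbase>.txt`.

```python
import subprocess
from pathlib import Path

# Assumption: which file types are worth sending to OCR.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def ocr_output_base(dip_dir, image_path):
    """Return the output base (no extension) under '<DIP>/OCR files'."""
    return Path(dip_dir) / "OCR files" / Path(image_path).stem


def tesseract_command(image_path, output_base):
    """Build the command line; tesseract appends '.txt' to the output base."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_dip(dip_dir, run_ocr, runner=subprocess.run):
    """OCR every image under '<DIP>/objects' if the user chose Yes.

    Returns the list of text files that tesseract is expected to write.
    """
    if not run_ocr:  # user chose "No": leave the DIP untouched
        return []
    dip_dir = Path(dip_dir)
    (dip_dir / "OCR files").mkdir(parents=True, exist_ok=True)
    written = []
    # Assumption: access copies live in an 'objects' directory inside the DIP.
    for image in sorted((dip_dir / "objects").rglob("*")):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            base = ocr_output_base(dip_dir, image)
            runner(tesseract_command(image, base), check=True)
            written.append(base.parent / (base.name + ".txt"))
    return written
```

A call such as `ocr_dip("/path/to/DIP", run_ocr=True)` would require the `tesseract` binary on the PATH. The `runner` parameter exists only so the sketch can be exercised without it. Files with the same name in different sub-directories would collide in this flat layout, so a real micro-service would need a naming scheme.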

@@ Line 1: / Line 1: @@
-Add OCR text files to DIP
+[[Main Page]] > [[Development]] > [[:Category:Development documentation|Development documentation]] > OCR test in DIP
--Test ocr tools with client supplied sample data
+[[Category:Development documentation]]
--Add open-source OCR tool to Archivematica
+== Requirements ==
--Add micro-service to OCR files in DIP
+See related issues:
+* #6257
--Add OCR files to DIP in "[DIP]/OCR files" directory
+===Add OCR text files to DIP===
--Add configuration setting to admin tab to pre-select OCR
+*Add open-source OCR tool to Archivematica
+**[https://code.google.com/p/tesseract-ocr/ Tesseract] is the most actively developed; other open-source OCR is either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
-option
+**The options that still see releases are:
+***[https://code.google.com/p/tesseract-ocr/ Tesseract] - Very actively developed. Best accuracy of all the open-source options, good speed.
+***[http://www.gnu.org/software/ocrad/ OCRad] - Moderate-to-poor accuracy, excellent speed.
+***[http://jocr.sourceforge.net/ gOCR] - Poor accuracy, slow speed.
+**Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
+*Add micro-service to OCR files in DIP (post-normalization)
+**Micro-service: OCR files
+**User choice Yes/No
+*Add OCR files to DIP in "[DIP]/OCR files" directory
+*Add configuration setting to administrative tab of the dashboard to pre-select OCR options
