Difference between revisions of "OCR text in DIP"
Revision as of 20:07, 27 March 2014
Requirements
See related issues:
- #6257
Add OCR text files to DIP
- Add open-source OCR tool to Archivematica
  - Tesseract is the most actively developed; other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
  - The options still receiving releases are:
    - Tesseract (https://code.google.com/p/tesseract-ocr/): very actively developed. Best accuracy of all the open-source options, good speed.
    - OCRad (http://www.gnu.org/software/ocrad/): moderate-to-poor accuracy, excellent speed.
    - gOCR (http://jocr.sourceforge.net/): poor accuracy, slow speed.
  - Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
- Add micro-service to OCR files in DIP (post-normalization)
  - Micro-service: OCR files
  - User choice: Yes/No
- Add OCR files to DIP in "[DIP]/OCR files" directory
- Add configuration setting to administrative tab of the dashboard to pre-select OCR options
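
The micro-service requirement above could be sketched roughly as follows. This is only an illustration, not Archivematica code: the function names, the `enabled` flag standing in for the Yes/No user choice, and the assumption that the DIP's files sit in an `objects` subdirectory are all hypothetical. The only external behaviour relied on is the basic Tesseract command line form `tesseract <image> <outputbase>`, which writes `<outputbase>.txt`.

```python
import subprocess
from pathlib import Path

# Directory name taken from the requirement: "[DIP]/OCR files".
OCR_DIR_NAME = "OCR files"


def ocr_output_base(dip_dir, source_file):
    """Return the output base path (without extension) for one source file."""
    return Path(dip_dir) / OCR_DIR_NAME / Path(source_file).stem


def tesseract_command(source_file, output_base):
    """Build the Tesseract call; Tesseract appends '.txt' to the base itself."""
    return ["tesseract", str(source_file), str(output_base)]


def ocr_dip(dip_dir, enabled, run=subprocess.check_call):
    """OCR every file in <dip_dir>/objects when the user chose 'Yes'.

    `enabled` is a hypothetical stand-in for the Yes/No choice; `run` is
    injectable so the sketch can be exercised without Tesseract installed.
    Returns the text file paths that Tesseract is expected to write.
    """
    if not enabled:
        return []
    dip_dir = Path(dip_dir)
    (dip_dir / OCR_DIR_NAME).mkdir(parents=True, exist_ok=True)
    expected = []
    # Assumption: the DIP keeps its access copies directly under 'objects'.
    for source in sorted((dip_dir / "objects").iterdir()):
        if not source.is_file():
            continue
        base = ocr_output_base(dip_dir, source)
        run(tesseract_command(source, base))
        expected.append(base.parent / (base.name + ".txt"))
    return expected
```

A call such as `ocr_dip("/path/to/DIP", enabled=True)` would then invoke Tesseract once per file. A pre-selected option from the dashboard's administrative tab could supply the value of `enabled` instead of prompting the user.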