Difference between revisions of "OCR text in DIP"
Revision as of 11:01, 4 April 2014
Requirements
See related issues:
- #6257
Add OCR text files to DIP
- Add open-source OCR tool to Archivematica
- Tesseract is the most actively developed open-source OCR tool; the others are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
- The actively-developed options are:
- Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
- Add micro-service to OCR files in DIP (post-normalization)
  - Micro-service: OCR files
  - User choice Yes/No
- Add OCR files to DIP in "[DIP]/OCRfiles" directory
- Add OCR files to AIP in "[AIP]/data/objects/submissionDocumentation" or "[AIP]/data/objects/metadata/OCRfiles" (needs discussion)
- Add configuration setting to administrative tab of the dashboard to pre-select OCR options
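
The requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Archivematica micro-service interface: the helper names `build_ocr_command` and `ocr_file`, the `run_ocr` flag (standing in for the Yes/No user choice) and the default language `eng` are all hypothetical. The only external behaviour relied on is the basic Tesseract command line form `tesseract <image> <outputbase> -l <lang>`, which writes `<outputbase>.txt`.

```python
import os
import subprocess

# Directory name taken from the requirement: "[DIP]/OCRfiles"
OCR_DIR_NAME = "OCRfiles"


def build_ocr_command(image_path, dip_dir, lang="eng"):
    """Return (command, expected_text_path) for one image (hypothetical helper)."""
    base = os.path.splitext(os.path.basename(image_path))[0]
    output_base = os.path.join(dip_dir, OCR_DIR_NAME, base)
    # Tesseract appends ".txt" to the output base itself.
    command = ["tesseract", image_path, output_base, "-l", lang]
    return command, output_base + ".txt"


def ocr_file(image_path, dip_dir, run_ocr=True, lang="eng"):
    """OCR one file into [DIP]/OCRfiles when the user chose Yes.

    Returns the path of the text file, or None when OCR was declined.
    Assumes the `tesseract` binary is installed and on PATH.
    """
    if not run_ocr:
        return None
    command, text_path = build_ocr_command(image_path, dip_dir, lang)
    os.makedirs(os.path.dirname(text_path), exist_ok=True)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return text_path
```

A pre-selected dashboard setting would simply supply the value of `run_ocr` instead of prompting the user. Copying the same text files into the AIP is left out here because the target directory is still under discussion.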