OCR text in DIP
Revision as of 17:44, 7 April 2014
Requirements
See related issues:
- #6257
Add OCR text files to DIP and AIP
- Add open-source OCR tool to Archivematica
  - Tesseract is the most actively developed; other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
  - The actively-developed options are:
    - gOCR (http://jocr.sourceforge.net/) - Poor accuracy, slow speed.
  - Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
- Add micro-service to OCR files in DIP (post-normalization)
  - Micro-service: Transcription
  - User choice: Yes/No (default: No)
- Add FPR purpose "Transcription", with OCR as the first and only tool in that section
- Add OCR files to DIP in "[DIP]/OCRfiles" directory
- Open question: run OCR on originals or on preservation copies?
- Add OCR files to AIP in "AIP/data/objects/metadata/OCRfiles"
  - METS file: add PREMIS event "Transcription" (add to PREMIS events)
  - METS file: add new file group to the fileSec (Evelyn to mock up)
- Add configuration setting to administrative tab of the dashboard to pre-select OCR options
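
The Transcription step described above could be sketched roughly as follows. This is a minimal illustration, not the Archivematica implementation: the function names (`ocr_output_base`, `build_tesseract_command`, `transcribe`) and the `enabled`/`runner` parameters are hypothetical. It assumes Tesseract's standard command-line form `tesseract <image> <output base>`, where Tesseract itself appends `.txt` to the output base.

```python
import subprocess
from pathlib import Path

# Directory name taken from the requirements above ("[DIP]/OCRfiles").
OCR_DIRNAME = "OCRfiles"


def ocr_output_base(dip_dir, image_path):
    """Return the output base path (without extension) under [DIP]/OCRfiles."""
    return Path(dip_dir) / OCR_DIRNAME / Path(image_path).stem


def build_tesseract_command(image_path, output_base):
    """Build the command line: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def transcribe(dip_dir, image_path, enabled=False, runner=subprocess.run):
    """Run OCR on one file only if the user opted in (the default is No).

    `runner` is a hypothetical hook so the step can be exercised without
    Tesseract installed; by default it is subprocess.run.
    """
    if not enabled:
        return None
    base = ocr_output_base(dip_dir, image_path)
    base.parent.mkdir(parents=True, exist_ok=True)
    runner(build_tesseract_command(image_path, base), check=True)
    # Tesseract writes "<output base>.txt".
    return base.parent / (base.name + ".txt")
```

Keeping the command construction separate from the execution makes the default-No choice and the output location easy to check in isolation. Whether the input should be the original or the preservation copy is left open here, as it is in the requirements.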
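
As a starting point for the METS work, a PREMIS event for the Transcription step might look something like the sketch below. It assumes the PREMIS 2 namespace and the usual event sub-elements; the event type value `transcription`, the helper name `premis_transcription_event`, and the contents of `eventDetail` are illustrative assumptions, and the mock-up of the new file group in the fileSec is not attempted here.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumption: PREMIS version 2 namespace.
PREMIS_NS = "info:lc/xmlschema/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def _q(tag):
    """Qualify a tag name with the PREMIS namespace."""
    return "{%s}%s" % (PREMIS_NS, tag)


def premis_transcription_event(detail, outcome="", when=None):
    """Build a premis:event element for an OCR (Transcription) run."""
    when = when or datetime.now(timezone.utc)
    event = ET.Element(_q("event"))
    ident = ET.SubElement(event, _q("eventIdentifier"))
    ET.SubElement(ident, _q("eventIdentifierType")).text = "UUID"
    ET.SubElement(ident, _q("eventIdentifierValue")).text = str(uuid.uuid4())
    # Hypothetical event type value, following the requirement above.
    ET.SubElement(event, _q("eventType")).text = "transcription"
    ET.SubElement(event, _q("eventDateTime")).text = when.isoformat()
    ET.SubElement(event, _q("eventDetail")).text = detail
    info = ET.SubElement(event, _q("eventOutcomeInformation"))
    ET.SubElement(info, _q("eventOutcome")).text = outcome
    return event
```

Calling `ET.tostring(premis_transcription_event("program=tesseract"))` yields a serialized `premis:event` that could be wrapped in the METS digiprovMD section.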