Difference between revisions of "OCR text in DIP"

Revision as of 18:44, 7 April 2014

See related issues:

Add open-source OCR tool to Archivematica
- Tesseract is the most actively developed; the other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
- The options still being developed are:
  - Tesseract - Very actively developed. Best accuracy of all the open-source options, good speed.
  - OCRad - Moderate-to-poor accuracy, excellent speed.
  - gOCR - Poor accuracy, slow speed.
- Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
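As a rough illustration of how a micro-service might call Tesseract, the sketch below builds and runs the standard command line `tesseract <image> <outputbase> -l <lang>`. The function names, the default language `eng`, and the choice of plain-text output are assumptions made for this example and are not part of the design notes above.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng"):
    """Return the argv list for one Tesseract run.

    Tesseract appends the extension itself, so an output base of
    "out/page1" produces "out/page1.txt" for plain-text output.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_file(image, output_dir, lang="eng"):
    """Run Tesseract on one image and return the text file path (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_base = output_dir / Path(image).stem
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt")
```

Keeping the command construction separate from the call makes it possible to check the arguments without having Tesseract installed.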
Add micro-service to OCR files in DIP (post-normalization)
- Micro-service: Transcription
- User choice Yes/No (default NO)
Add FPR purpose "Transcription", with OCR as the first and only tool in that section
Add OCR files to DIP in "[DIP]/OCRfiles" directory
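The directory layout could be expressed as a small path helper. This is only a sketch: the `OCRfiles` directory name comes from the note above, while the function name and the convention of appending `.txt` to the original file name are assumptions.

```python
from pathlib import Path

OCR_DIRNAME = "OCRfiles"  # directory name taken from the note above


def ocr_path_in_dip(dip_root, source_file):
    """Map a source file to its OCR text file under <DIP>/OCRfiles.

    Appending ".txt" to the full file name is an assumed naming
    convention; it keeps "page1.tif" and "page1.jpg" from colliding.
    """
    return Path(dip_root) / OCR_DIRNAME / (Path(source_file).name + ".txt")
```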
Run OCR on originals or preservation copies?
Add OCR files to AIP in "AIP/data/objects/metadata/OCRfiles"
- METS file: PREMIS event "Transcription" (to be added to the list of PREMIS events)
- METS file: new file group (fileGrp) in the fileSec (Evelyn to mock up)
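To make the PREMIS part more concrete, here is a minimal sketch of a transcription event built with Python's standard library. It assumes the PREMIS 2.x namespace, a UUID identifier, and the event type value `transcription`; the function name is hypothetical, and the eventual METS structure (including the new file group) is still to be mocked up.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PREMIS_NS = "info:lc/xmlns/premis-v2"  # PREMIS 2.x namespace; version is an assumption


def build_transcription_event(detail):
    """Return a premis:event element of type "transcription" (illustrative only)."""

    def sub(parent, tag, text=None):
        # Create a namespaced child element, optionally with text content.
        el = ET.SubElement(parent, f"{{{PREMIS_NS}}}{tag}")
        if text is not None:
            el.text = text
        return el

    event = ET.Element(f"{{{PREMIS_NS}}}event")
    identifier = sub(event, "eventIdentifier")
    sub(identifier, "eventIdentifierType", "UUID")
    sub(identifier, "eventIdentifierValue", str(uuid.uuid4()))
    sub(event, "eventType", "transcription")
    sub(event, "eventDateTime", datetime.now(timezone.utc).isoformat())
    sub(event, "eventDetail", detail)
    return event
```

Calling `ET.tostring(build_transcription_event("OCR with Tesseract"))` serializes the element for inspection.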
Add configuration setting to administrative tab of the dashboard to pre-select OCR options

@@ Line 17: / Line 17: @@
 ***[http://jocr.sourceforge.net/ gOCR] - Poor accuracy, slow speed.
 **Tesseract appears to be the best solution in most cases. If speed was paramount and mediocre or poor accuracy would be acceptable, there might be an argument to use OCRad.
 *Add micro-service to OCR files in DIP (post-normalization)
-**Micro-service: OCR files
+**Micro-service: Transcription
-**User choice Yes/No
+**User choice Yes/No (default NO)
+*Add FPR purpose - Transcription - OCR first and only tool in that section
 *Add OCR files to DIP in "[DIP]/OCRfiles" directory
-*Add OCR files to AIP in "[AIP]/data/objects/submissionDocumentation" -OR- "AIP/data/objects/metadata/OCRfiles" - needs discussion
+*Run OCR on originals or preservation copies??
+*Add OCR files to AIP in "AIP/data/objects/metadata/OCRfiles"
+**METS file PREMIS event Transcription (add to PREMIS events)
+**METS file new file group to file sec (Evelyn to mock up)
 *Add configuration setting to administrative tab of the dashboard to pre-select OCR options