OCR text in DIP
Latest revision as of 16:22, 11 February 2020
This page is no longer being maintained and may contain inaccurate information. Please see the Archivematica documentation for up-to-date information.
Requirements
See related issues:
- #6257
Add OCR text files to DIP and AIP
- Add open-source OCR tool to Archivematica
  - Tesseract is the most actively developed; other open-source OCR tools are either moribund (Cuneiform, OCRopus) or released only sporadically without major improvements (OCRad, gOCR).
  - The actively-developed options are:
    - gOCR (http://jocr.sourceforge.net/) - Poor accuracy, slow speed.
  - Tesseract appears to be the best solution in most cases. If speed were paramount and mediocre or poor accuracy were acceptable, there might be an argument for using OCRad.
- Add micro-service to OCR files in DIP (post-normalization)
  - Micro-service: Transcription
  - User choice: Yes/No (default: No)
- Add FPR purpose "Transcription", with OCR as the first and only tool in that section
- Add OCR files to DIP in "[DIP]/OCRfiles" directory
- Open question: run OCR on originals or on preservation copies?
- Add OCR files to AIP in "AIP/data/objects/metadata/OCRfiles"
  - METS file: add a PREMIS event of type Transcription (add to PREMIS events; see PREMIS_metadata:_events#Transcription)
  - METS file: use a text/ocr fileGrp in the METS <fileSec>:
</fileGrp>
<fileGrp USE="text/ocr">
  <file GROUPID="Group-67f7e276-0dd9-4b09-bb30-3589f3f3900e" ID="file-67f7e276-0dd9-4b09-bb30-3589f3f3900e">
    <FLocat xlink:href="objects/jp2/25080603-b3ffc8c4-0db9-454d-8fb6-eb01e0c1b07f.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
  </file>
  <file GROUPID="Group-5e8d85a8-802c-413c-a198-7e45466dfb04" ID="file-5e8d85a8-802c-413c-a198-7e45466dfb04">
    <FLocat xlink:href="objects/pdf/Glass_Hall-acffd9c8-f048-45d6-9ef5-10ca8a9d28ac.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
  </file>
  <file GROUPID="Group-dfb298bc-aa31-4054-918c-ba84e349e2fa" ID="file-dfb298bc-aa31-4054-918c-ba84e349e2fa">
    <FLocat xlink:href="objects/tif/43845161-5cc7ad6e-28bd-493f-a0fb-9ad13dccfe43.txt" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"/>
  </file>
</fileGrp>
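A fileGrp like the one in the sample above could be generated roughly as follows. This is a minimal sketch, not Archivematica's actual METS code: the function name `build_ocr_filegrp` and its input shape are hypothetical, and the elements are left unprefixed only to mirror the sample (a real METS document would normally put them in the METS namespace).

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def build_ocr_filegrp(entries):
    """Build a <fileGrp USE="text/ocr"> element (hypothetical helper).

    `entries` is an iterable of (file_uuid, relative_path) pairs, where
    relative_path points at the OCR .txt file inside the package.
    """
    file_grp = ET.Element("fileGrp", {"USE": "text/ocr"})
    for file_uuid, rel_path in entries:
        # GROUPID and ID follow the "Group-<uuid>" / "file-<uuid>" pattern
        # seen in the sample fileGrp.
        file_el = ET.SubElement(
            file_grp,
            "file",
            {"GROUPID": f"Group-{file_uuid}", "ID": f"file-{file_uuid}"},
        )
        ET.SubElement(
            file_el,
            "FLocat",
            {
                f"{{{XLINK_NS}}}href": rel_path,
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "SYSTEM",
            },
        )
    return file_grp


if __name__ == "__main__":
    grp = build_ocr_filegrp(
        [
            (
                "67f7e276-0dd9-4b09-bb30-3589f3f3900e",
                "objects/jp2/25080603-b3ffc8c4-0db9-454d-8fb6-eb01e0c1b07f.txt",
            )
        ]
    )
    print(ET.tostring(grp, encoding="unicode"))
```

Passing one pair per transcribed file yields one `<file>` child each, so the group grows with the number of OCR outputs.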
- Add configuration setting to administrative tab of the dashboard to pre-select OCR options
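
The Transcription step described in these requirements might look something like the sketch below. It assumes the `tesseract` command-line tool is installed and uses its basic `tesseract <image> <outputbase>` form, which writes `<outputbase>.txt`. The function names, the `enabled` flag standing in for the Yes/No user choice, and the use of the source file's stem as the output name are all hypothetical; this is not how Archivematica itself wires the micro-service.

```python
import subprocess
from pathlib import Path


def ocr_output_base(dip_dir, source_file):
    """Return the output base path (no extension) under [DIP]/OCRfiles."""
    return Path(dip_dir) / "OCRfiles" / Path(source_file).stem


def tesseract_command(source_file, output_base):
    """Build the command line; Tesseract appends ".txt" to the output base."""
    return ["tesseract", str(source_file), str(output_base)]


def transcribe(dip_dir, source_file, enabled=False):
    """Run OCR on one file if the user opted in.

    Returns the path of the resulting .txt file, or None when the step is
    skipped. `enabled` defaults to False to match "default: No".
    """
    if not enabled:
        return None
    base = ocr_output_base(dip_dir, source_file)
    base.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(tesseract_command(source_file, base), check=True)
    return base.parent / (base.name + ".txt")
```

Keeping the command construction separate from the call that runs it makes it straightforward to swap in a different OCR tool later, which fits the idea of treating OCR as one tool under a Transcription purpose.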