tesseract - OCR of PDF files with images

Question

1 Reply

深蓝 · Answer 1 · 2021-02-06T00:21:35+0000

There are two important flags that Tika uses when extracting text from PDFs:

X-Tika-PDFextractInlineImages (true/false). When false, all images are ignored, so it works fine for native PDFs: the text is extracted directly from the PDF. When true, the images are also used for text extraction.

X-Tika-PDFocrStrategy (https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.OCR_STRATEGY.html):
  NO_OCR - extract the text without OCR; works for native PDFs
  OCR_ONLY - only OCR is used, so the text of a "native" PDF is also sent to OCR
  OCR_AND_TEXT_EXTRACTION - does both: the NO_OCR text extraction and the OCR_ONLY pass

So when you have a fully native PDF, the combination X-Tika-PDFextractInlineImages: false, X-Tika-PDFocrStrategy: NO_OCR seems to be the best.

For fully scanned PDFs you can use X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY.
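
As a rough illustration, the two combinations above could be sent as HTTP request headers to a running tika-server. This is only a sketch: the server address http://localhost:9998/tika is an assumption about a local default setup, and the names HEADER_PRESETS, build_request and extract_text are made up for this example, not part of Tika.

```python
import urllib.request

# Assumed address of a locally running tika-server; adjust to your setup.
TIKA_URL = "http://localhost:9998/tika"

# Header combinations suggested in this answer, keyed by a
# hypothetical label for the kind of PDF.
HEADER_PRESETS = {
    "native": {
        "X-Tika-PDFextractInlineImages": "false",
        "X-Tika-PDFocrStrategy": "NO_OCR",
    },
    "scanned": {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_ONLY",
    },
}


def build_request(pdf_bytes, kind, url=TIKA_URL):
    """Build (but do not send) a PUT request for the given kind of PDF."""
    headers = dict(HEADER_PRESETS[kind])  # KeyError for an unknown kind
    headers["Accept"] = "text/plain"
    headers["Content-Type"] = "application/pdf"
    return urllib.request.Request(
        url, data=pdf_bytes, headers=headers, method="PUT"
    )


def extract_text(pdf_bytes, kind, url=TIKA_URL):
    """Send the PDF to the server and return the response body as text."""
    with urllib.request.urlopen(build_request(pdf_bytes, kind, url)) as resp:
        return resp.read().decode("utf-8")
```

With a server running, something like extract_text(open("scan.pdf", "rb").read(), "scanned") would return the text (scan.pdf is a placeholder file name). Keeping build_request separate from extract_text makes it possible to inspect the headers without any network access.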

But your document is probably a hybrid: it contains native parts (where you only need to extract the text) and images (which you need to OCR). In my opinion there is no way to handle a hybrid PDF in Tika.
