ocr - Tesseract - ambiguity in space and tab

Question


posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a TIFF file that contains columns of text separated by tabs (4 spaces). But when I extract the text from this image with Tesseract, I always get a single space between two columns. For example:

TIFF IMAGE:
col-a    col-b    col-c

desired output:
col-a    col-b    col-c

but I am getting the following:
col-a col-b col-c

I tried this with multiple images in the same format, but the result is always the same. How do I fix this? Can I train Tesseract to handle it?


1 Reply

深蓝 · Answer 1 · 2021-10-23T20:07:21+0000

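After a lot of research I found the solution. Here are the steps to follow:

Upgrade Tesseract to 3.04 or later.
Create config.txt in the directory that holds the input image file.
In the config file, define "preserve_interword_spaces".
After the word preserve_interword_spaces, give either 0 or 1. 0 is the default and collapses the gap between words into a single space; 1 keeps the spacing, which is what you want here. Ex:

preserve_interword_spaces 0

or

preserve_interword_spaces 1

Pass config.txt as the last argument when you run tesseract from that directory, then test the output. Cheers!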
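The steps above can be sketched in Python as follows. This is only a minimal illustration, assuming the tesseract binary (3.04 or later) is on your PATH and that the command is run from the directory holding the image; the helper names write_config, build_command and run_ocr and the file names are made up for this example and are not part of Tesseract.

```python
import subprocess
from pathlib import Path

# Content of config.txt: 1 asks Tesseract to keep the spacing between words.
CONFIG_TEXT = "preserve_interword_spaces 1\n"


def write_config(directory):
    """Create config.txt in the given directory and return its path."""
    path = Path(directory) / "config.txt"
    path.write_text(CONFIG_TEXT, encoding="utf-8")
    return path


def build_command(image_path, output_base, config_path):
    """Build the CLI call: tesseract <image> <output base> <config file>."""
    return ["tesseract", str(image_path), str(output_base), str(config_path)]


def run_ocr(image_path, output_base, config_path):
    """Run Tesseract and return the recognized text (requires tesseract installed)."""
    subprocess.run(build_command(image_path, output_base, config_path), check=True)
    # Tesseract writes plain-text output to <output base>.txt by default.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, run_ocr("page.tif", "page", write_config(".")) would write page.txt and return its text. The same variable can also be set without a config file by passing -c preserve_interword_spaces=1 on the command line.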
