Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
666 views
in Technique[技术] by (71.8m points)

ocr - How to preserve document structure in tesseract

I am using tesseract ocr to extract text from an image. Preserving the structure of the document is very important to me. Currently tesseract does not preserve the structure, infact it changes the order of text. My input is the image below.

input

and the output I am getting is as follows:

Someto the left
Someto the left

Some in the middle
Some in the middle

Some with some tab
Some with some tab

Some with some space between them
Some with some space between them

Sometext here
Sometext here

this much
this much

How do I get the desired output as of the same structure in image?

i.e. as follows:

                                                 Some text here
                                                 Some text here

Some to the left
Some to the left

                    Some in the middle
                    Some in the middle

        Some with some tab
        Some with some tab

Some with some space between them                       this much
Some with some space between them                       this much
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Newer versions of tesseract (3.04) have an option called preserve_interword_spaces which should do what you want.

Note that the number of spaces tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output this way -- the preserve_interword_spaces option does not attempt to do anything fancy, it merely preserves the spaces extraction found. By default tesseract collapses runs of spaces into one.

Details on this option are here.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...