Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Clean text images with OpenCV for OCR reading

I received some images that need to be cleaned up before I can OCR some information out of them. Here are the originals:

[Images: original 1, original 2, original 3, original 4]

After processing them with this code:

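import cv2
# Load as grayscale, apply a fixed threshold at 55, then a 2x2 morphological opening.
img = cv2.imread('original_1.jpg', cv2.IMREAD_GRAYSCALE)
ret, thresh = cv2.threshold(img, 55, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
cv2.imwrite('result_1.jpg', opening)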

I get these results:

[Images: result 1, result 2, result 3, result 4]

As you can see, some images come out clean enough for OCR, while others still have noise in the background.

Any suggestions on how to clean up the background?


1 Reply


MH304's answer is very nice and straightforward. If you can't use morphology or blurring to get a cleaner image, consider using an "area filter": discard every blob whose area is below a minimum.

Use OpenCV's connectedComponentsWithStats. Here is a C++ implementation of a very basic area filter:

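#include <cstdlib>
#include <vector>
#include <opencv2/opencv.hpp>

// bwImage: binary input (CV_8UC1, white blobs on a black background).
// Returns a color image in which blobs below minArea are painted pink.
//Area threshold: minArea, 10 px by default
cv::Mat colorSmallBlobs(const cv::Mat& bwImage, int minArea = 10, int connectivity = 8) {

    cv::Mat outputLabels, stats, img_color, centroids;

    int numberofComponents = cv::connectedComponentsWithStats(bwImage, outputLabels,
        stats, centroids, connectivity);

    //one random BGR color per label (labels run from 0 to numberofComponents-1):
    std::vector<cv::Vec3b> colors(numberofComponents);
    for( int i = 1; i < numberofComponents; i++ ) {
        colors[i] = cv::Vec3b(rand()%256, rand()%256, rand()%256);
    }

    //do not count the original background-> label = 0:
    colors[0] = cv::Vec3b(0,0,0);

    for( int i = 1; i < numberofComponents; i++ ) {

        //get the area of the current blob (row i of stats holds label i):
        auto blobArea = stats.at<int>(i, cv::CC_STAT_AREA);

        //apply the area filter:
        if ( blobArea < minArea )
        {
            //filter blob below minimum area:
            //small regions are painted with (ridiculous) pink color
            colors[i] = cv::Vec3b(248,48,213);
        }
    }

    //paint every pixel with the color assigned to its label:
    img_color = cv::Mat(bwImage.size(), CV_8UC3);
    for( int y = 0; y < img_color.rows; y++ ) {
        for( int x = 0; x < img_color.cols; x++ ) {
            img_color.at<cv::Vec3b>(y, x) = colors[outputLabels.at<int>(y, x)];
        }
    }
    return img_color;
}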

Using the area filter I get this result on your noisiest image:

[Image: area filter result on the noisiest image]

Additional info:

Basically, the algorithm goes like this:

  • Pass a binary image to connectedComponentsWithStats. The function computes the number of connected components, a matrix of labels and an additional matrix of statistics, including each blob's area.

  • Prepare a color vector of size “numberofComponents”. This helps visualize the blobs that are actually being filtered. The colors are generated randomly by the rand function: three values per color, each in the range 0 to 255, one per BGR channel.

  • The background is label 0 and is colored black, so skip this “connected component” and leave its color (black) alone.

  • Set an area threshold. All blobs or pixels below this area will be colored with a (ridiculous) pink.

  • Loop through all the connected components (blobs) found, retrieve the area of the current blob from the stats matrix and compare it to the area threshold.

  • If the area is below the threshold, color the blob pink (in this case, but usually you want black).
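
To make these steps concrete without depending on OpenCV, the following is a plain-Python sketch of the same idea: label the blobs, record each blob's area, then drop the ones below the threshold. The function area_filter and its list-of-lists input are assumptions for illustration only. It is a simple flood fill, not how connectedComponentsWithStats is implemented, and it blanks small blobs instead of coloring them.

```python
from collections import deque


def area_filter(image, min_area=10):
    """Remove 8-connected foreground blobs smaller than `min_area`.

    `image` is a list of rows holding 0/1 values (1 = foreground).
    Returns the cleaned image and the list of blob areas in scan order.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 = background / unlabeled
    areas = [0]  # areas[label]; index 0 is reserved for the background

    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 or labels[r][c] != 0:
                continue
            # New blob found: flood-fill it and count its pixels.
            label = len(areas)
            areas.append(0)
            labels[r][c] = label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                areas[label] += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = label
                            queue.append((ny, nx))

    # Keep a pixel only if its blob meets the area threshold.
    cleaned = [[1 if lab and areas[lab] >= min_area else 0 for lab in row]
               for row in labels]
    return cleaned, areas[1:]


grid = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
cleaned, areas = area_filter(grid, min_area=3)
print(areas)    # [4, 1]
print(cleaned)  # the 2x2 blob is kept, the single pixel is removed
```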

