Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Opening image file from URL for text recognition with pytesseract

I'm really new to Python (really, really new). This is the problem I need help solving:

I have a list of image URLs inside a txt file. There are around 80,000 URLs in this file. I need to scan all of these images with pytesseract and save the results in a csv file. I found a solution, but I want to optimize it.

I'm doing this now:

1. I download all the images to my computer using PowerShell (yes, I'm on Windows).
2. After they are all saved in a folder (and this is taking a long time), I use the following code (which I found on the internet) to scan all the images and save the extracted text and the image file name to a .csv file:

from PIL import Image 
from pytesseract import image_to_string
import pytesseract
import os 
import csv

def main(): 
    # path for the folder for getting the raw images 
    path = r"C:\Users\raphaelgomes\Desktop\Projeto OCR - Connect Marketplace\Imagens - Powershell"
  
    # link to the file in which output needs to be kept 
    fullTempPath = r"C:\Users\raphaelgomes\Desktop\Projeto OCR - Connect Marketplace\OCR Checker Python\results\outputFile.csv"
  
    # iterating the images inside the folder 
    for imageName in os.listdir(path): 
        inputPath = os.path.join(path, imageName) 
        img = Image.open(inputPath) 
  
        # applying ocr using pytesseract for python
        pytesseract.pytesseract.tesseract_cmd = r"C:\Users\raphaelgomes\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
        text = pytesseract.image_to_string(img, lang ="eng") 
  
        # saving the  text for appending it to the output.txt file 
        # a + parameter used for creating the file if not present 
        # and if present then append the text content 
        file1 = open(fullTempPath, "a+") 
  
        # providing the name of the image 
        file1.write(imageName+"\n")
  
        # providing the content in the image 
        file1.write(text+"\n")
        file1.close()  
  
    # for printing the output file 
    file2 = open(fullTempPath, 'r') 
    print(file2.read()) 
    file2.close()         
  
  
if __name__ == '__main__': 
    main() 

The point is: is there a way to skip the download step I'm currently doing with PowerShell? I'd really appreciate any help with this. The idea is to do the whole process in Python: as I said, I already have all the file links in a .txt file, so I need Python code that reads them one by one and saves the file name and the text extracted from each image to a .csv file.

Thank you very much :)


Fight dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss will gaze back…

1 Reply

Waiting for an expert to reply.
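
In the meantime, here is a minimal sketch of one way to skip the PowerShell download step: read each URL from the .txt file, fetch the image into memory, pass it to pytesseract, and write one CSV row per image. It is not a tested solution for the asker's setup. The function names (fetch_bytes, ocr_bytes, file_name_from_url, ocr_urls_to_csv), the file names urls.txt and outputFile.csv, and the CSV column names are all assumptions made for illustration. It also assumes one URL per line and that Pillow and pytesseract are installed as in the question.

```python
import csv
import io
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download the raw bytes of one image without saving it to disk."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def ocr_bytes(data, lang="eng"):
    """Run Tesseract on image bytes held in memory."""
    # Imported here so the rest of the script loads even if these are missing.
    from PIL import Image
    import pytesseract

    # On Windows, pytesseract.pytesseract.tesseract_cmd may need to be set
    # once (as in the question) before this function is called.
    with Image.open(io.BytesIO(data)) as img:
        return pytesseract.image_to_string(img, lang=lang)


def file_name_from_url(url):
    """Take the last path segment of the URL as the image file name."""
    return os.path.basename(urlparse(url).path)


def ocr_urls_to_csv(urls_path, csv_path, fetch=fetch_bytes, ocr=ocr_bytes):
    """Read one URL per line and write (file_name, text) rows to a CSV file."""
    with open(urls_path, encoding="utf-8") as urls_file, \
            open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["file_name", "text"])
        for line in urls_file:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            try:
                text = ocr(fetch(url))
            except Exception as exc:
                # Record the failure and keep going, so that one bad URL
                # does not stop a run over tens of thousands of images.
                text = f"ERROR: {exc}"
            writer.writerow([file_name_from_url(url), text])


# Example call (hypothetical file names):
# ocr_urls_to_csv("urls.txt", "outputFile.csv")
```

Using the csv module instead of writing raw lines means that text containing commas or line breaks is quoted correctly. With around 80,000 URLs, most of the time will probably go to downloading and OCR, so running several fetches in parallel (for example with concurrent.futures.ThreadPoolExecutor from the standard library) may be worth trying once the sequential version works.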
