Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
606 views
in Technique[技术] by (71.8m points)

python - Openpyxl optimizing cells search speed

I need to search the Excel sheet for cells containing some pattern. It takes more time than I can handle. The most optimized code I could write is below. Since the data patterns are usually row after row so I use iter_rows(row_offset=x). Unfortunately the code below finds the given pattern an increasing number of times in each for loop (starting from milliseconds and getting up to almost a minute). What am I doing wrong?

import openpyxl
import datetime
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "test_sheet"

print("Generating quite big excel file")

for i in range(1,10000):
    for j in range(1,20):
        ws.cell(row = i, column = j).value = "Cell[{},{}]".format(i,j)

print("Saving test excel file")
wb.save('test.xlsx')

def FindXlCell(search_str, last_r):
    t = datetime.datetime.utcnow()
    for row in ws.iter_rows(row_offset=last_r):
        for cell in row:
            if (search_str == cell.value):
                print(search_str, last_r, cell.row, datetime.datetime.utcnow() - t)
                last_r = cell.row
                return last_r
    print("record not found ",search_str, datetime.datetime.utcnow() - t)
    return 1

wb = openpyxl.load_workbook("test.xlsx", data_only=True)
t = datetime.datetime.utcnow()
ws = wb["test_sheet"]
last_row = 1
print("Parsing excel file in a loop for 3 cells")
for i in range(1,100,1):
    last_row = FindXlCell("Cell[0,0]", last_row)
    last_row = FindXlCell("Cell[1000,6]", last_row)
    last_row = FindXlCell("Cell[6000,6]", last_row)
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Looping over a worksheet multiple times is inefficient. The reason for the search getting progressively slower looks to be increasingly more memory being used in each loop. This is because last_row = FindXlCell("Cell[0,0]", last_row) means that the next search will create new cells at the end of the rows: openpyxl creates cells on demand because rows can be technically empty but cells in them are still addressable. At the end of your script the worksheet has a total of 598000 rows but you always start searching from A1.

If you wish to search a large file for text multiple times then it would probably make sense to create a matrix keyed by the text with the coordinates being the value.

Something like:

matrix = {}
for row in ws:
    for cell in row:
         matrix[cell.value] = (cell.row, cell.col_idx)

In a real-world example you'd probably want to use a defaultdict to be able to handle multiple cells with the same text.

This could be combined with read-only mode for a minimal memory footprint. Except, of course, if you want to edit the file.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...