Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - os.walk very slow, any way to optimise?

I am using os.walk to build a map of a data store; the map is used later in the tool I am building.

This is the code I currently use:

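import os

def find_children(tickstore):
    # os.walk lazily yields (dirpath, dirnames, filenames) tuples;
    # keep only the path of each directory visited.
    children = []
    for dirpath, dirnames, filenames in os.walk(tickstore):
        children.append(dirpath)
    return children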

I have done some analysis on it:

dir_list = os.walk(tickstore) runs instantly; if I do nothing with dir_list, the function completes instantly.

The time is spent iterating over dir_list. Even if I don't append anything, the loop alone takes a long time.

Tickstore is a big data store, with ~10,000 directories.

Currently this function takes approximately 35 minutes to complete.

Is there any way to speed it up?

I've looked at alternatives to os.walk but none of them seemed to provide much of an advantage in terms of speed.


1 Reply


Yes: use Python 3.5 (still a release candidate at the time of writing, but due out shortly). In Python 3.5, os.walk was rewritten on top of the new os.scandir function, which makes it considerably more efficient.

This work was done as part of PEP 471.
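
If upgrading is not possible right away, one option is the third-party scandir package on PyPI, which backports the same work and exposes a drop-in walk function. The sketch below assumes that package may or may not be installed (pip install scandir) and falls back to the standard library otherwise; find_children is the function from the question.

```python
try:
    # Third-party backport of the PEP 471 work for older Python versions
    # (assumption: installed separately with `pip install scandir`).
    from scandir import walk
except ImportError:
    # Fall back to the standard library; on Python 3.5+ this
    # os.walk is already implemented on top of os.scandir.
    from os import walk


def find_children(tickstore):
    """Return the path of every directory under tickstore, including itself."""
    return [dirpath for dirpath, _dirnames, _filenames in walk(tickstore)]
```

The calling code does not change, so this can be tried against the real data store to see whether the extra stat() calls were actually the bottleneck.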

Extracted from the PEP:

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. More benchmarks are linked from the PEP itself.
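
Since the question only needs directory paths, the idea described in the PEP can also be used directly through os.scandir. What follows is a minimal sketch, not the way os.walk itself is implemented: find_children_scandir is a hypothetical name, the traversal order differs from os.walk, and unlike os.walk it does not swallow errors such as PermissionError.

```python
import os


def find_children_scandir(tickstore):
    """Collect every directory under tickstore using os.scandir."""
    children = [tickstore]
    stack = [tickstore]  # directories still to be listed
    while stack:
        current = stack.pop()
        # Using scandir as a context manager requires Python 3.6+.
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir() normally relies on the type information that came
                # back with the directory listing, so in the common case no
                # extra stat() call is made per entry.
                if entry.is_dir(follow_symlinks=False):
                    children.append(entry.path)
                    stack.append(entry.path)
    return children
```

Symbolic links to directories are skipped here, which matches the default behaviour of os.walk of not descending into them.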

