Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Is a Python pandas DataFrame thread-safe?

I am using multiple threads to access and delete data in my pandas DataFrame. Because of this, I am wondering: is a pandas DataFrame thread-safe?


1 Reply


No, pandas is not thread-safe. And it's not thread-safe in surprising ways.

  • Can I delete from a pandas DataFrame while another thread is using it?

Fuggedaboutit! Nope. In general you can't do that with any data structure, not even the GIL-protected Python ones.

  • Can I read from a pandas object while someone else is writing to it?
  • Can I copy a pandas dataframe in my thread, and work on the copy?

Definitely not. There's a long-standing open issue: https://github.com/pandas-dev/pandas/issues/2728

Actually, I think this is pretty reasonable (i.e. expected) behavior. I wouldn't expect to be able to simultaneously write to and read from, or copy, any data structure unless either: i) it had been designed for concurrency, or ii) I have an exclusive lock on that object and all the view objects derived from it (.loc and .iloc can return views, and pandas has many others).
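As a rough illustration of option ii), here is a minimal sketch in which every read and every delete goes through one shared lock. The names df, df_lock, delete_rows and snapshot_sum are made up for this example. It only helps if no code touches the DataFrame, or anything derived from it, outside the lock.

```python
import threading

import pandas as pd

# Hypothetical shared state for illustration: one DataFrame, one lock.
df = pd.DataFrame({"value": range(100)})
df_lock = threading.Lock()


def delete_rows(labels):
    """Drop rows by index label while holding the lock."""
    global df
    with df_lock:
        # Rebind the name to the new frame instead of mutating in place.
        df = df.drop(index=labels, errors="ignore")


def snapshot_sum():
    """Read under the same lock, so no delete can run mid-read."""
    with df_lock:
        return int(df["value"].sum())


# Three threads each delete a disjoint quarter of the rows.
threads = [
    threading.Thread(target=delete_rows, args=(list(range(i, 100, 4)),))
    for i in range(1, 4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

remaining = snapshot_sum()  # only labels 0, 4, 8, ..., 96 are left
```

The lock serializes all access, so this buys safety rather than parallel speed-up.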

  • Can I read from a pandas object while no-one else is writing to it?

For almost all data structures in Python, the answer is yes. For pandas, no. And it seems it's not a design goal at present.

Typically, you can perform 'reading' operations on objects if no-one is performing mutating operations. You have to be a little cautious, though. Some data structures, including pandas, perform memoization to cache expensive operations that are otherwise functionally pure. It's generally easy to implement lockless memoization in Python:

@property
def thing(self):
    # Compute once on first access; later reads return the cached value.
    if self._thing is MISSING:
        self._thing = self._calc_thing()
    return self._thing

... it's simple and safe (assuming assignment is safely atomic, which has not always been the case for every language, but is in CPython, unless you override __setattr__). At worst, two threads each compute the same value once and one assignment overwrites the other.
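For concreteness, here is a self-contained sketch of that pattern that can be run from several threads. The Report class and the _MISSING sentinel are invented for this example, and it assumes _calc_total is functionally pure, so a duplicated computation is harmless.

```python
import threading

_MISSING = object()  # sentinel, so that None could still be a valid cached value


class Report:
    """Hypothetical class with one lazily computed, cached attribute."""

    def __init__(self, values):
        self._values = list(values)
        self._total = _MISSING

    def _calc_total(self):
        # Stand-in for an expensive but functionally pure computation.
        return sum(self._values)

    @property
    def total(self):
        # No lock: two threads may both compute, but they store the same value.
        if self._total is _MISSING:
            self._total = self._calc_total()
        return self._total


report = Report(range(100))
results = []


def read_total():
    results.append(report.total)  # list.append is atomic in CPython


threads = [threading.Thread(target=read_total) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread sees the same total. What the pattern does not promise is that the computation runs only once.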

In pandas, Series and DataFrame indexes are computed lazily, on first use. I hope (but I do not see guarantees in the docs) that they're done in a similarly safe way.

For all libraries (including pandas) I would hope that all types of read-only operations (or more specifically, 'functionally pure' operations) would be thread-safe if no-one is performing mutating operations. I think this is a reasonable, easily achievable, and common lower bar for thread safety.

For pandas, however, you cannot assume this. Even if you can guarantee no-one is performing 'functionally impure' operations on your object (e.g. writing to cells, adding/deleting columns), pandas is not thread-safe.

Here's a recent example: https://github.com/pandas-dev/pandas/issues/25870 (it's marked as a duplicate of the .copy-not-threadsafe issue, but it seems it could be a separate issue).

import pandas as pd

s = pd.Series(...)
f(s)  # Success!

# Thread 1:
while True: f(s)

# Thread 2:
while True: f(s)  # Exception!

... fails for f(s): s.reindex(..., copy=True), which returns its result as a new object, so you would think it would be functionally pure and thread-safe. Unfortunately, it is not.
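The two-thread experiment above can be wrapped in a small stdlib-only harness. This is a sketch: the name hammer and its parameters are made up here, and a clean run is not proof of thread safety, since races may need many iterations to show up.

```python
import threading


def hammer(f, shared, n_threads=4, n_calls=1000):
    """Call f(shared) repeatedly from several threads.

    Returns the exceptions raised. Each thread stops at its first failure.
    """
    errors = []

    def worker():
        for _ in range(n_calls):
            try:
                f(shared)
            except Exception as exc:
                errors.append(exc)  # list.append is atomic in CPython
                return

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


# A pure read on a plain list should produce no errors.
errors = hammer(len, [1, 2, 3])
```

To try it on the case above, pass the Series as shared and a function that calls s.reindex(..., copy=True) as f.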

The result of this is that we could not use pandas in production for our healthcare analytics system, and I now discourage it for internal development, since it makes in-memory parallelization of read-only operations unsafe. (!!)

The reindex behavior is weird and surprising. If anyone has ideas about why it fails, please answer here: What's the source of thread-unsafety in this usage of pandas.Series.reindex(, copy=True)?

The maintainers marked this as a duplicate of https://github.com/pandas-dev/pandas/issues/2728 . I'm suspicious, but if .copy is the source, then almost all of pandas is not thread safe in any situation (which is their advice).
