Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
768 views
in Technique[技术] by (71.8m points)

python - A faster alternative to Pandas `isin` function

I have a very large data frame df that looks like:

ID       Value1    Value2
1345      3.2      332
1355      2.2      32
2346      1.0      11
3456      8.9      322

And I have a list that contains a subset of IDs ID_list. I need to have a subset of df for the ID contained in ID_list.

Currently, I am using df_sub=df[df.ID.isin(ID_list)] to do it. But it takes a lot time. IDs contained in ID_list doesn't have any pattern, so it's not within certain range. (And I need to apply the same operation to many similar dataframes. I was wondering if there is any faster way to do this. Will it help a lot if make ID as the index?

Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

EDIT 2: Here's a link to a more recent look into the performance of various pandas operations, though it doesn't seem to include merge and join to date.

https://github.com/mm-mansour/Fast-Pandas

EDIT 1: These benchmarks were for a quite old version of pandas and likely are not still relevant. See Mike's comment below on merge.

It depends on the size of your data but for large datasets DataFrame.join seems to be the way to go. This requires your DataFrame index to be your 'ID' and the Series or DataFrame you're joining against to have an index that is your 'ID_list'. The Series must also have a name to be used with join, which gets pulled in as a new field called name. You also need to specify an inner join to get something like isin because join defaults to a left join. query in syntax seems to have the same speed characteristics as isin for large datasets.

If you're working with small datasets, you get different behaviors and it actually becomes faster to use a list comprehension or apply against a dictionary than using isin.

Otherwise, you can try to get more speed with Cython.

# I'm ignoring that the index is defaulting to a sequential number. You
# would need to explicitly assign your IDs to the index here, e.g.:
# >>> l_series.index = ID_list
mil = range(1000000)
l = mil
l_series = pd.Series(l)

df = pd.DataFrame(l_series, columns=['ID'])


In [247]: %timeit df[df.index.isin(l)]
1 loops, best of 3: 1.12 s per loop

In [248]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 549 ms per loop

# index vs column doesn't make a difference here
In [304]: %timeit df[df.ID.isin(l_series)]
1 loops, best of 3: 541 ms per loop

In [305]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 529 ms per loop

# query 'in' syntax has the same performance as 'isin'
In [249]: %timeit df.query('index in @l')
1 loops, best of 3: 1.14 s per loop

In [250]: %timeit df.query('index in @l_series')
1 loops, best of 3: 564 ms per loop

# ID must be the index for DataFrame.join and l_series must have a name.
# join defaults to a left join so we need to specify inner for existence.
In [251]: %timeit df.join(l_series, how='inner')
10 loops, best of 3: 93.3 ms per loop

# Smaller datasets.
df = pd.DataFrame([1,2,3,4], columns=['ID'])
l = range(10000)
l_dict = dict(zip(l, l))
l_series = pd.Series(l)
l_series.name = 'ID_list'


In [363]: %timeit df.join(l_series, how='inner')
1000 loops, best of 3: 733 μs per loop

In [291]: %timeit df[df.ID.isin(l_dict)]
1000 loops, best of 3: 742 μs per loop

In [292]: %timeit df[df.ID.isin(l)]
1000 loops, best of 3: 771 μs per loop

In [294]: %timeit df[df.ID.isin(l_series)]
100 loops, best of 3: 2 ms per loop

# It's actually faster to use apply or a list comprehension for these small cases.
In [296]: %timeit df[[x in l_dict for x in df.ID]]
1000 loops, best of 3: 203 μs per loop

In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
1000 loops, best of 3: 297 μs per loop

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...