
python - Pandas: Conditional filtering based on the last row of each group within a dataframe

I have a dataframe df as follows

Date          Group   Value   Duration
2018-01-01      A      20       30
2018-02-01      A      10       60
2018-03-01      A      25       88    <-----Last row for Group A
2018-01-01      B      15      180
2018-02-01      B      30      210
2018-03-01      B      25      238    <-----Last row of Group B

I want to drop Group A because its maximum Duration is less than 90. In other words, considering the last row of each Group, if the Duration value is less than 90, we omit that group. So my resulting dataframe should look like

    Date       Group   Value   Duration
 2018-01-01      B      15      180
 2018-02-01      B      30      210
 2018-03-01      B      25      238

To achieve this, my approach is as follows:

df_f = []
for k,v in df.groupby(['Group']):
    v_f = v[max(v['Duration'])>=90]
    df_f.append(v_f)

The above code snippet throws an error: KeyError: False

Am I missing something here?


1 Reply


You can test whether the maximal value per group is greater than or equal to 90 with GroupBy.transform, and then filter by boolean indexing:

df = df[df.groupby('Group')['Duration'].transform('max') >= 90]
#alternative
#df = df[df.groupby('Group')['Duration'].transform('max').ge(90)]
print (df)
         Date Group  Value  Duration
3  2018-01-01     B     15       180
4  2018-02-01     B     30       210
5  2018-03-01     B     25       238
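
As for the KeyError: False in your loop: max(v['Duration']) >= 90 evaluates to a single boolean, so v[False] is treated as a lookup for a column literally named False. A minimal sketch of the loop-based fix, assuming you start again from the original (unfiltered) df:

# use the scalar comparison as an if-condition, not as an indexer
kept = []
for k, v in df.groupby('Group'):
    if v['Duration'].max() >= 90:
        kept.append(v)

# concatenate the surviving groups (empty result if no group qualifies)
df_f = pd.concat(kept) if kept else df.iloc[0:0]
print (df_f)

The transform-based one-liner above is still preferable; the loop is only shown to explain the error.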

EDIT: I suggest not using filter for per-group filtering, because it is slow:

import numpy as np
import pandas as pd

#1k rows
np.random.seed(123)
N = 1000
df = pd.DataFrame({'Group': np.random.randint(100, size=N),
                   'Duration': np.random.randint(200, size=N)})

#print (df)

In [24]: %timeit df.groupby('Group').filter(lambda x: x.Duration.max()>=90)
39.8 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [25]: %timeit df[df.groupby('Group')['Duration'].transform('max') >= 90]
3.12 ms ± 37.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#100k rows
np.random.seed(123)
N = 100000
df = pd.DataFrame({'Group': np.random.randint(1000, size=N),
                   'Duration':np.random.randint(200,size=N)})

#print (df)


In [29]: %timeit df[df.groupby('Group')['Duration'].transform('max') >= 90]
11.8 ms ± 153 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [30]: %timeit df.groupby('Group').filter(lambda x: x.Duration.max()>=90)
394 ms ± 8.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
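
Both approaches keep exactly the same rows; transform is faster because it computes one vectorized max per group and builds a single boolean mask, while filter calls a Python lambda once per group. A quick sanity check on the synthetic df above (illustrative sketch):

out_transform = df[df.groupby('Group')['Duration'].transform('max') >= 90]
out_filter = df.groupby('Group').filter(lambda x: x.Duration.max() >= 90)

# both preserve the original row order and index, so the results should match
print (out_transform.equals(out_filter))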


...