
python - Pandas: Conditional filtering based on the last row of each group within a dataframe

I have a dataframe df as follows

Date          Group   Value   Duration
2018-01-01      A      20       30
2018-02-01      A      10       60
2018-03-01      A      25       88    <-----Last row for Group A
2018-01-01      B      15      180
2018-02-01      B      30      210
2018-03-01      B      25      238    <-----Last row of Group B

I want to drop Group A because its maximum Duration is less than 90. In other words, considering the last row of each Group, if the Duration value is less than 90, we omit that group. So my resulting dataframe should look like

    Date       Group   Value   Duration
 2018-01-01      B      15      180
 2018-02-01      B      30      210
 2018-03-01      B      25      238

To achieve this, my approach is as follows:

df_f = []
for k,v in df.groupby(['Group']):
    v_f = v[max(v['Duration'])>=90]
    df_f.append(v_f)

The above code snippet throws an error: KeyError: False

Am I missing something here?


1 Reply


You can test whether the maximal value per group is greater than or equal to 90 with GroupBy.transform, and then filter by boolean indexing:

df = df[df.groupby('Group')['Duration'].transform('max') >= 90]
#alternative
#df = df[df.groupby('Group')['Duration'].transform('max').ge(90)]
print (df)
         Date Group  Value  Duration
3  2018-01-01     B     15       180
4  2018-02-01     B     30       210
5  2018-03-01     B     25       238
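
As for the KeyError: False in your loop: max(v['Duration']) >= 90 evaluates to a single boolean, so v[False] is treated as a lookup for a column literally named False. A minimal sketch of the loop-based fix, assuming you start again from the original (unfiltered) df:

# use the scalar comparison as an if-condition, not as an indexer
kept = []
for k, v in df.groupby('Group'):
    if v['Duration'].max() >= 90:
        kept.append(v)

# concatenate the surviving groups (empty result if no group qualifies)
df_f = pd.concat(kept) if kept else df.iloc[0:0]
print (df_f)

The transform-based one-liner above is still preferable; the loop is only shown to explain the error.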

EDIT: I suggest not using filter for per-group filtering, because it is slow:

import numpy as np
import pandas as pd

#1k rows
np.random.seed(123)
N = 1000
df = pd.DataFrame({'Group': np.random.randint(100, size=N),
                   'Duration': np.random.randint(200, size=N)})

#print (df)

In [24]: %timeit df.groupby('Group').filter(lambda x: x.Duration.max()>=90)
39.8 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [25]: %timeit df[df.groupby('Group')['Duration'].transform('max') >= 90]
3.12 ms ± 37.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#100k rows
np.random.seed(123)
N = 100000
df = pd.DataFrame({'Group': np.random.randint(1000, size=N),
                   'Duration':np.random.randint(200,size=N)})

#print (df)


In [29]: %timeit df[df.groupby('Group')['Duration'].transform('max') >= 90]
11.8 ms ± 153 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [30]: %timeit df.groupby('Group').filter(lambda x: x.Duration.max()>=90)
394 ms ± 8.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
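
Both approaches keep exactly the same rows; transform is faster because it computes one vectorized max per group and builds a single boolean mask, while filter calls a Python lambda once per group. A quick sanity check on the synthetic df above (illustrative sketch):

out_transform = df[df.groupby('Group')['Duration'].transform('max') >= 90]
out_filter = df.groupby('Group').filter(lambda x: x.Duration.max() >= 90)

# both preserve the original row order and index, so the results should match
print (out_transform.equals(out_filter))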


...