Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Get consecutive occurrences based on grouping

I am trying to find a way to get groups of consecutive occurrences, grouped by host and sorted by time. Ideally I need only the groups that meet a certain threshold and have isCorrect == false.

Example

Time    |   Host    |   isCorrect   |
-------------------------------------
10:01   |   HostA   |   true        |
10:02   |   HostB   |   true        |
10:03   |   HostA   |   false       |
10:15   |   HostA   |   false       |
10:16   |   HostA   |   false       |
10:18   |   HostB   |   false       |
10:20   |   HostA   |   true        |
10:21   |   HostA   |   true        |
10:22   |   HostB   |   false       |
10:23   |   HostB   |   false       |

Threshold: >=3

This would result in 2 groups:

Time    |   Host    |   isCorrect   | Group
--------------------------------------------
10:03   |   HostA   |   false       |1
10:15   |   HostA   |   false       |1
10:16   |   HostA   |   false       |1

10:18   |   HostB   |   false       |2
10:22   |   HostB   |   false       |2
10:23   |   HostB   |   false       |2

I was reading https://towardsdatascience.com/pandas-dataframe-group-by-consecutive-certain-values-a6ed8e5d8cc but could not find a way to group by Host first.



1 Reply


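First keep only the False rows by inverting the boolean mask with ~, and sort the values (if necessary). Then keep only the hosts whose number of False rows meets the threshold, and finally create the Group column with factorize. Note that this counts every False row per host, whether or not the rows are consecutive: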

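import pandas as pd

# Keep only the rows where isCorrect is False, ordered by host and time
df = df[~df['isCorrect']].sort_values(['Host','Time'])
# True for rows whose host has at least 3 False rows
mask = df['Host'].map(df['Host'].value_counts()) >= 3

df = df[mask].copy()
# Number the remaining hosts 1, 2, ... in order of appearance
df['Group'] = pd.factorize(df['Host'])[0] + 1
print(df)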

    Time   Host  isCorrect  Group
2  10:03  HostA      False      1
3  10:15  HostA      False      1
4  10:16  HostA      False      1
5  10:18  HostB      False      2
8  10:22  HostB      False      2
9  10:23  HostB      False      2
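
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question (with Time kept as plain strings, which is an assumption), and hosts_with_enough_failures is a hypothetical helper name, not something from pandas. It uses groupby(...).transform('size') instead of map/value_counts, which should give the same per-host counts.

```python
import pandas as pd

# Sample data copied from the question; Time is kept as a string (assumption).
df = pd.DataFrame({
    "Time": ["10:01", "10:02", "10:03", "10:15", "10:16",
             "10:18", "10:20", "10:21", "10:22", "10:23"],
    "Host": ["HostA", "HostB", "HostA", "HostA", "HostA",
             "HostB", "HostA", "HostA", "HostB", "HostB"],
    "isCorrect": [True, True, False, False, False,
                  False, True, True, False, False],
})


def hosts_with_enough_failures(frame, threshold=3):
    """Hypothetical helper: keep the False rows of hosts with >= threshold failures."""
    # Keep only the failing rows, ordered by host and time.
    failed = frame[~frame["isCorrect"]].sort_values(["Host", "Time"])
    # Number of failing rows per host, broadcast back onto every row.
    counts = failed.groupby("Host")["Host"].transform("size")
    kept = failed[counts >= threshold].copy()
    # Number the surviving hosts 1, 2, ... in order of appearance.
    kept["Group"] = pd.factorize(kept["Host"])[0] + 1
    return kept


result = hosts_with_enough_failures(df)
print(result)
```

The helper returns a new frame and leaves df untouched, so it can be called again with a different threshold.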

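To group by consecutive False values instead, start again from the original, unfiltered df. Here "consecutive" is measured in the overall row order across all hosts, so HostB's three False rows, which are interrupted by HostA's True rows at 10:20 and 10:21, do not form one group: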

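# Mark the False rows
m = ~df['isCorrect']
# The running count of True rows is constant within a streak of False rows
df['Group'] = df['isCorrect'].cumsum()[m]

df = df[m].sort_values(['Host','Time'])

# Keep the (streak, host) combinations with at least 3 rows
mask = df.groupby(['Group', 'Host'])['Group'].transform('size') >= 3

df = df[mask].copy()
df['Group'] = pd.factorize(df['Host'])[0] + 1
print(df)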
    Time   Host  isCorrect  Group
2  10:03  HostA      False      1
3  10:15  HostA      False      1
4  10:16  HostA      False      1
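
The expected output in the question treats HostB's rows at 10:18, 10:22 and 10:23 as one group, which suggests that "consecutive" is meant per host rather than across the whole table. Under that assumption, a possible sketch is to sort by Host and Time first and take the cumulative sum of isCorrect within each host. The helper name consecutive_failure_runs and the temporary run_id column are hypothetical, and Time is again kept as plain strings.

```python
import pandas as pd

# Sample data copied from the question; Time is kept as a string (assumption).
df = pd.DataFrame({
    "Time": ["10:01", "10:02", "10:03", "10:15", "10:16",
             "10:18", "10:20", "10:21", "10:22", "10:23"],
    "Host": ["HostA", "HostB", "HostA", "HostA", "HostA",
             "HostB", "HostA", "HostA", "HostB", "HostB"],
    "isCorrect": [True, True, False, False, False,
                  False, True, True, False, False],
})


def consecutive_failure_runs(frame, threshold=3):
    """Hypothetical helper: label per-host streaks of consecutive False rows."""
    ordered = frame.sort_values(["Host", "Time"])
    # Within each host, the running count of True rows stays constant across
    # a streak of False rows, so (Host, run_id) identifies one streak.
    run_id = ordered["isCorrect"].astype(int).groupby(ordered["Host"]).cumsum()
    failed = ordered[~ordered["isCorrect"]].assign(run_id=run_id)
    # Length of each streak, broadcast back onto its rows.
    sizes = failed.groupby(["Host", "run_id"])["run_id"].transform("size")
    kept = failed[sizes >= threshold].copy()
    # ngroup numbers each (Host, run_id) pair in sorted order, starting at 0.
    kept["Group"] = kept.groupby(["Host", "run_id"]).ngroup() + 1
    return kept.drop(columns="run_id")


runs = consecutive_failure_runs(df)
print(runs)
```

Numbering the groups per (Host, run_id) pair rather than per host means that two separate long streaks on the same host would get different Group values.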
