Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

python - Find group of consecutive dates in Pandas DataFrame

I am trying to get the chunks of data where there's consecutive dates from the Pandas DataFrame. My df looks like below.

      DateAnalyzed           Val
1       2018-03-18      0.470253
2       2018-03-19      0.470253
3       2018-03-20      0.470253
4       2018-09-25      0.467729
5       2018-09-26      0.467729
6       2018-09-27      0.467729

In this df, I want to get the first 3 rows, do some processing and then get the last 3 rows and do processing on that.

I calculated the difference with 1 lag by applying following code.

df['Delta']=(df['DateAnalyzed'] - df['DateAnalyzed'].shift(1))

But after then I can't figure out that how to get the groups of consecutive rows without iterating.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It seems like you need two boolean masks: one to determine the breaks between groups, and one to determine which dates are in a group in the first place.

There's also one tricky part that can be fleshed out by example. Notice that df below contains an added row that doesn't have any consecutive dates before or after it.

>>> df
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253
4   2017-01-20  0.485949  # < watch out for this
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

>>> df.dtypes
DateAnalyzed    datetime64[ns]
Val                    float64
dtype: object

The answer below assumes that you want to ignore 2017-01-20 completely, without processing it. (See end of answer for a solution if you do want to process this date.)

First:

>>> dt = df['DateAnalyzed']
>>> day = pd.Timedelta('1d')
>>> in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
>>> in_block
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: DateAnalyzed, dtype: bool

Now, in_block will tell you which dates are in a "consecutive" block, but it won't tell you to which groups each date belongs.

The next step is to derive the groupings themselves:

>>> filt = df.loc[in_block]
>>> breaks = filt['DateAnalyzed'].diff() != day
>>> groups = breaks.cumsum()
>>> groups
1    1
2    1
3    1
5    2
6    2
7    2
Name: DateAnalyzed, dtype: int64

Then you can call df.groupby(groups) with your operation of choice.

>>> for _, frame in filt.groupby(groups):
...     print(frame, end='

')
... 
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253

  DateAnalyzed       Val
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

To incorporate this back into df, assign to it and the isolated dates will be NaN:

>>> df['groups'] = groups
>>> df
  DateAnalyzed       Val  groups
1   2018-03-18  0.470253     1.0
2   2018-03-19  0.470253     1.0
3   2018-03-20  0.470253     1.0
4   2017-01-20  0.485949     NaN
5   2018-09-25  0.467729     2.0
6   2018-09-26  0.467729     2.0
7   2018-09-27  0.467729     2.0

If you do want to include the "lone" date, things become a bit more straightforward:

dt = df['DateAnalyzed']
day = pd.Timedelta('1d')
breaks = dt.diff() != day
groups = breaks.cumsum()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...