Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
224 views
in Technique[技术] by (71.8m points)

python - Computing diffs within groups of a dataframe

Say I have a dataframe with 3 columns: Date, Ticker, Value (no index, at least to start with). I have many dates and many tickers, but each (ticker, date) tuple is unique. (But obviously the same date will show up in many rows since it will be there for multiple tickers, and the same ticker will show up in multiple rows since it will be there for many dates.)

Initially, my rows in a specific order, but not sorted by any of the columns.

I would like to compute first differences (daily changes) of each ticker (ordered by date) and put these in a new column in my dataframe. Given this context, I cannot simply do

df['diffs'] = df['value'].diff()

because adjacent rows do not come from the same ticker. Sorting like this:

df = df.sort(['ticker', 'date'])
df['diffs'] = df['value'].diff()

doesn't solve the problem because there will be "borders". I.e. after that sort, the last value for one ticker will be above the first value for the next ticker. And computing differences then would take a difference between two tickers. I don't want this. I want the earliest date for each ticker to wind up with an NaN in its diff column.

This seems like an obvious time to use groupby but for whatever reason, I can't seem to get it to work properly. To be clear, I would like to perform the following process:

  1. Group rows based on their ticker
  2. Within each group, sort rows by their date
  3. Within each sorted group, compute differences of the value column
  4. Put these differences into the original dataframe in a new diffs column (ideally leaving the original dataframe order in tact.)

I have to imagine this is a one-liner. But what am I missing?


Edit at 9:00pm 2013-12-17

Ok...some progress. I can do the following to get a new dataframe:

result = df.set_index(['ticker', 'date'])
    .groupby(level='ticker')
    .transform(lambda x: x.sort_index().diff())
    .reset_index()

But if I understand the mechanics of groupby, my rows will now be sorted first by ticker and then by date. Is that correct? If so, would I need to do a merge to append the differences column (currently in result['current'] to the original dataframe df?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

wouldn't be just easier to do what yourself describe, namely

df.sort(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()

and then correct for borders:

mask = df.ticker != df.ticker.shift(1)
df['diffs'][mask] = np.nan

to maintain the original index you may do idx = df.index in the beginning, and then at the end you can do df.reindex(idx), or if it is a huge dataframe, perform the operations on

df.filter(['ticker', 'date', 'value'])

and then join the two dataframes at the end.

edit: alternatively, ( though still not using groupby )

df.set_index(['ticker','date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan 

for idx in df.index.levels[0]:
    df.diffs[idx] = df.value[idx].diff()

for

   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

this will produce:

             value  diffs
ticker date              
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...