Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Computing age from to_timedelta is weird, and DateOffset is not scalable over a Series

I have two columns:

          date   age
0   2016-01-05  47.0
1   2016-01-05  43.0
2   2016-01-05  28.0
3   2016-01-05  46.0
4   2016-01-04  39.0

What I want is another column with the difference between the date and age:

          date   age           dob
0   2016-01-05  47.0    1969-01-05
1   2016-01-05  43.0    1973-01-05
2   2016-01-05  28.0    1988-01-05
3   2016-01-05  46.0    1970-01-05
4   2016-01-04  39.0    1977-01-04

This seems simple enough, but the obvious df['date'] - df['age'].astype('timedelta64[Y]') gives:

0   1969-01-04 14:27:36
1   1973-01-04 13:44:24
2   1988-01-05 05:02:24
3   1970-01-04 20:16:48
4   1977-01-03 13:01:12

Why is there an additional time component? pd.to_timedelta(df['age'], unit='Y') gives the same result, plus a warning that unit='Y' is deprecated.
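A likely explanation, sketched under the assumption that the 'Y' unit is converted as an average Gregorian year of 365.2425 days (31,556,952 seconds) rather than as a calendar year: 47 such years do not add up to a whole number of days, so the result lands partway through a day. The constant name below is illustrative and not part of pandas.

```python
import pandas as pd

# Assumption: 'Y' means an average Gregorian year of 365.2425 days.
# 365.2425 * 86400 = 31_556_952 seconds.
AVG_YEAR_SECONDS = 31_556_952

date = pd.Timestamp("2016-01-05")
age = 47

# Subtracting a fixed-length duration ignores the calendar entirely,
# so the result is not aligned to midnight.
dob = date - pd.Timedelta(seconds=age * AVG_YEAR_SECONDS)
print(dob)  # 1969-01-04 14:27:36
```

This reproduces the first row of the output above, which suggests that a fixed-length timedelta is the wrong tool here and a calendar-aware offset such as pd.DateOffset(years=...) is needed.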

Further, df['date'] - pd.DateOffset(years=df['age']) throws (understandably):

TypeError: cannot convert the series to <class 'int'>

I can work around this with apply, df['date'] - df['age'].apply(lambda a: pd.DateOffset(years=a)), which gives the correct result in a roundabout way but (understandably) emits PerformanceWarning: Adding/subtracting array of DateOffsets to DatetimeArray not vectorized.

What is a good (pythonic and vectorized) solution here?


1 Reply


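If you need a different non-standard offset (e.g. months or years) for every row, it can save time to loop over the unique offsets instead of over the rows. You can do this with a groupby: each group shares one offset, so subtracting it from the whole group is a single vectorized operation.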

The saving is largest when the number of unique offsets is much smaller than the number of rows in your DataFrame, which is very likely with realistic integer ages and a very long DataFrame.

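# Subtract each distinct age once, as a vectorized operation over its group of rows.
# int(age) converts the float group key; sort_index() restores the original row order.
pd.concat([gp.assign(dob=gp.date - pd.offsets.DateOffset(years=int(age)))
           for age, gp in df.groupby('age', sort=False)]).sort_index()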

        date   age        dob
0 2016-01-05  47.0 1969-01-05
1 2016-01-05  43.0 1973-01-05
2 2016-01-05  28.0 1988-01-05
3 2016-01-05  46.0 1970-01-05
4 2016-01-04  39.0 1977-01-04
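The same idea can be packaged as a small helper. This is only a sketch: the name subtract_years is hypothetical, and it assumes whole-number ages and a unique index (such as the default RangeIndex) so that the original row order can be restored with reindex.

```python
import pandas as pd


def subtract_years(dates: pd.Series, years: pd.Series) -> pd.Series:
    """Subtract a per-row number of whole years, looping over unique values only."""
    pieces = [
        # One vectorized subtraction per distinct number of years.
        grp - pd.DateOffset(years=int(n))
        for n, grp in dates.groupby(years, sort=False)
    ]
    # concat returns rows in group order; reindex restores the input order
    # (this step requires a unique index).
    return pd.concat(pieces).reindex(dates.index)


# The data from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-05"] * 4 + ["2016-01-04"]),
    "age": [47.0, 43.0, 28.0, 46.0, 39.0],
})
df["dob"] = subtract_years(df["date"], df["age"])
print(df)
```

Because the loop runs once per distinct age rather than once per row, the cost of constructing DateOffset objects stays small even for a very long DataFrame.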

Some timings:

import perfplot
import pandas as pd
import numpy as np


def with_groupby(df):
    # One vectorized subtraction per distinct age.
    s = pd.concat([gp.date - pd.offsets.DateOffset(years=idx)
                   for idx, gp in df.groupby('age', sort=False)])
    return s

def with_apply(df):
    # One DateOffset constructed per row.
    s = df.apply(lambda x: x['date'] - pd.DateOffset(years=int(x['age'])), axis=1)
    return s
    
    
perfplot.show(
    setup=lambda n: pd.DataFrame({'date': np.random.choice(pd.date_range('1980-01-01', 
                                                                         freq='50D', periods=100), n),
                                  'age': np.random.choice(range(100), n)}), 
    kernels=[lambda df: with_groupby(df),
             lambda df: with_apply(df)],
    labels=["groupby", "apply"],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=lambda x,y: x.sort_index().compare(y.sort_index()).empty,
    xlabel='len(df)'
)

[Figure: perfplot output comparing the runtime of the "groupby" and "apply" kernels against len(df)]

