Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Why doesn't groupby sum convert boolean to int or float?

I'll start with 3 simple examples:

import pandas as pd

pd.DataFrame([[True]]).sum()

0    1
dtype: int64

pd.DataFrame([True]).sum()

0    1
dtype: int64

pd.Series([True]).sum()

1

All of these are as expected. Here is a more complicated example.

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', True],
        ['b', 'B', True],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.Z.sum()

4

Also as expected. However, if I call df.groupby(['X', 'Y']).sum(), I get:

[image: grouped result in which Z still shows boolean values]

I expected it to look like:

[image: the same grouped table with Z shown as numeric sums]

I'm thinking bug. Is there another explanation?
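To see which dtype a given pandas installation actually returns for this groupby, a small check like the one below can help. It is only a sketch: it rebuilds the same df, and the name grouped is just for illustration. The printed dtype depends on the installed pandas version (newer releases may return integers rather than bool), so no expected output is shown.

```python
import pandas as pd

df = pd.DataFrame([
    ['a', 'A', True],
    ['a', 'B', False],
    ['a', 'C', True],
    ['b', 'A', True],
    ['b', 'B', True],
    ['b', 'C', False],
], columns=list('XYZ'))

# Every (X, Y) pair occurs exactly once, so each group sums a single bool.
grouped = df.groupby(['X', 'Y']).sum()

print(pd.__version__)      # the behaviour discussed here is version dependent
print(grouped['Z'].dtype)  # bool, int or float, depending on the version
print(grouped)
```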


Per @unutbu's answer

pandas is trying to recast the result to the original dtype. I had thought that maybe the groupby I'd performed didn't really group anything, so I tried this example to test the idea.

df = pd.DataFrame([
        ['a', 'A', False],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

I'll groupby('X') and sum. If @unutbu is correct, these sums should be 1 and 0, which are castable to bool, so we should see bool.

df.groupby('X').sum()

[image: grouped result in which Z is bool]

Sure enough... bool

But watch what happens when the process is the same and the values are slightly different:

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.groupby('X').sum()

[image: grouped result in which Z is numeric, not bool]

Lesson learned: always use astype(int) or something similar when doing this.

df.groupby('X').sum().astype(int)

gives consistent results for either scenario.
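As a minimal sketch of that advice, the snippet below shows two ways to get integer sums regardless of how many True values a group has. The names cast_after and cast_before are only illustrative. Z is selected explicitly on the assumption that newer pandas versions may also try to sum the string column Y, which would make a blanket astype(int) fail.

```python
import pandas as pd

df = pd.DataFrame([
    ['a', 'A', True],
    ['a', 'B', False],
    ['a', 'C', True],
    ['b', 'A', False],
    ['b', 'B', False],
    ['b', 'C', False],
], columns=list('XYZ'))

# Option 1: aggregate first, then cast the result to int.
cast_after = df.groupby('X')['Z'].sum().astype(int)

# Option 2: cast the bool column to int before grouping,
# so the aggregation never sees a bool dtype at all.
cast_before = df.assign(Z=df['Z'].astype(int)).groupby('X')['Z'].sum()

print(cast_after)
print(cast_before)
```

Both variants give 2 for group a and 0 for group b; casting before grouping has the small advantage that the result does not depend on any cast-back step.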

1 Reply

This occurs because _cython_agg_blocks calls _try_coerce_and_cast_result, which calls _try_cast_result, which tries to return a result of the same dtype as the original values (in this case, bool).

This returns something a little peculiar when Z has dtype bool (and all the groups have no more than one True value). If any of the groups have 2 or more True values, then the resulting values are floats since _try_cast_result does not convert 2.0 back to a boolean.

_try_cast_result does something more useful when Z has dtype int. Internally, the Cython aggregator used by df.groupby(['X', 'Y']).sum() returns a result of dtype float; _try_cast_result then converts that result back to dtype int.
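
The cast-back idea can be mimicked in a few lines of NumPy. This is only a sketch of the behaviour described above, not the pandas implementation: try_cast_back is a hypothetical helper, and its round-trip check is an assumption about what "castable to the original dtype" means here.

```python
import numpy as np


def try_cast_back(result, original_dtype):
    """Hypothetical stand-in for the cast-back step (not pandas code).

    Cast `result` to `original_dtype` only if converting back again
    reproduces the same values; otherwise leave `result` unchanged.
    """
    result = np.asarray(result)
    candidate = result.astype(original_dtype)
    if np.array_equal(candidate.astype(result.dtype), result):
        return candidate
    return result


# Float sums of 1.0 and 0.0 survive the round trip through bool.
print(try_cast_back(np.array([1.0, 0.0]), bool).dtype)   # bool

# 2.0 would turn into True and then 1.0, so the floats are kept.
print(try_cast_back(np.array([2.0, 0.0]), bool).dtype)   # float64

# With an int original, whole-number floats cast back to an integer dtype.
print(try_cast_back(np.array([2.0, 0.0]), int).dtype)
```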