Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How do numpy functions operate on pandas objects internally?

NumPy functions such as np.mean() and np.var() accept an array-like argument, such as an np.array or a list.

But passing in a pandas DataFrame also works. This means a DataFrame can pass for a NumPy array, which I find a little strange (even though I know that the underlying values of a DataFrame are NumPy arrays).

For an object to be array-like, I thought it should be sliceable with integer indexing the way a NumPy array is. So, for instance, df[1:3, 2:3] should work, but it raises an error.

So possibly a DataFrame gets converted into a NumPy array once it is inside the function. But if that is the case, why does np.mean(numpy_array) give a different result from np.mean(df)?

a = np.random.rand(4,2)
a
Out[13]: 
array([[ 0.86688862,  0.09682919],
       [ 0.49629578,  0.78263523],
       [ 0.83552411,  0.71907931],
       [ 0.95039642,  0.71795655]])

np.mean(a)
Out[14]: 0.68320065182041034

This gives a different result from the following:

df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))

df
Out[18]: 
          0         1
0  0.866889  0.096829
1  0.496296  0.782635
2  0.835524  0.719079
3  0.950396  0.717957

np.mean(df)
Out[21]: 
0    0.787276
1    0.579125
dtype: float64

The former output is a single number, whereas the latter is a column-wise mean. How does a NumPy function know about the structure of a DataFrame?

1 Reply

If you step through the np.mean(df) call in the debugger:

--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean

You can see that the type is not an ndarray, so it tries to call a.mean, which in this case is df.mean():

In [6]:

df.mean()
Out[6]:
0    0.572999
1    0.468268
dtype: float64

This is why the output is different: np.mean(df) returns whatever df.mean() returns rather than reducing a converted array.
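
The same delegation can be reproduced without pandas. This is a minimal sketch, assuming the installed NumPy still performs the `type(a) is not mu.ndarray` check shown in the trace above; the `Delegating` class is hypothetical and exists only for illustration:

```python
import numpy as np


class Delegating:
    """Hypothetical non-ndarray object that provides its own mean()."""

    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        # np.mean forwards axis, dtype and out as keyword arguments.
        return ("Delegating.mean", axis)


# The argument is not an ndarray and has a .mean attribute,
# so np.mean hands the call over instead of converting it to an array.
result = np.mean(Delegating())
print(result)

# A plain list has no .mean attribute, so NumPy converts it and reduces it itself.
list_result = np.mean([1.0, 2.0, 3.0])
print(list_result)
```

Any object with a `mean` method is treated this way, so the DataFrame is not a special case inside NumPy.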

Code to reproduce above:

In [3]:
a = np.random.rand(4,2)
a

Out[3]:
array([[ 0.96750329,  0.67623187],
       [ 0.44025179,  0.97312747],
       [ 0.07330062,  0.18341157],
       [ 0.81094166,  0.04030253]])

In [4]:    
np.mean(a)

Out[4]:
0.52063384885403818

In [5]:    
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))

df

Out[5]:
          0         1
0  0.967503  0.676232
1  0.440252  0.973127
2  0.073301  0.183412
3  0.810942  0.040303

NumPy output:

In [7]:
np.mean(df)

Out[7]:
0    0.572999
1    0.468268
dtype: float64

If you instead call .values to get a NumPy array, the output is the same as np.mean(a):

In [8]:
np.mean(df.values)

Out[8]:
0.52063384885403818
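
The relationship between the two results can also be checked directly. This is a sketch, assuming a pandas version that provides `DataFrame.to_numpy()` (`.values` behaves the same way here); the seed is arbitrary. It deliberately avoids calling `np.mean(df)` itself, because what that returns depends on how the installed pandas handles the `axis=None` argument that NumPy passes along:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
a = rng.random((4, 2))
df = pd.DataFrame(a)

# DataFrame.mean() defaults to a column-wise reduction (axis=0).
col_means = df.mean()
print(col_means)

# Extracting the underlying array gives the same scalar as np.mean(a).
scalar_mean = np.mean(df.to_numpy())
print(scalar_mean)

# The column-wise result corresponds to reducing the array along axis 0.
print(np.allclose(col_means.to_numpy(), a.mean(axis=0)))
print(np.isclose(scalar_mean, np.mean(a)))
```

If a specific reduction is wanted, calling `df.mean(axis=...)` or reducing `df.to_numpy()` explicitly makes the intent clear and does not rely on the delegation.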
