python - why does pandas rolling use single dimension ndarray

Question

Welcome To Ask or Share your Answers For Others

python - why does pandas rolling use single dimension ndarray

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - why does pandas rolling use single dimension ndarray

I was motivated to use pandas rolling feature to perform a rolling multi-factor regression (This question is NOT about rolling multi-factor regression). I expected that I'd be able to use apply after a df.rolling(2) and take the resulting pd.DataFrame extract the ndarray with .values and perform the requisite matrix multiplication. It didn't work out that way.

Here is what I found:

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)

What do objects look like:

print "
df = 
", df
print "
X = 
", X
print "
df.shape =", df.shape, ", X.shape =", X.shape

df = 
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

X = 
[[ 0.93]
 [ 0.83]]

df.shape = (5, 2) , X.shape = (2L, 1L)

Matrix multiplication behaves normally:

df.values.dot(X)

array([[ 0.7495],
       [ 0.8179],
       [ 0.4444],
       [ 1.4711],
       [ 1.3562]])

Using apply to perform row by row dot product behaves as expected:

df.apply(lambda x: x.values.dot(X)[0], axis=1)

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

Groupby -> Apply behaves as I'd expect:

df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

But when I run:

df.rolling(1).apply(lambda x: x.values.dot(X))

I get:

AttributeError: 'numpy.ndarray' object has no attribute 'values'

Ok, so pandas is using straight ndarray within its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try:

df.rolling(1).apply(lambda x: x.dot(X))

shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)

Wait! What?!

So I created a custom function to look at the what rolling is doing.

def print_type_sum(x):
    print type(x), x.shape
    return x.sum()

Then ran:

print df.rolling(1).apply(print_type_sum)

<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

My resulting pd.DataFrame is the same, that's good. But it printed out 10 single dimensional ndarray objects. What about rolling(2)

print df.rolling(2).apply(print_type_sum)

<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
      A     B
0   NaN   NaN
1  0.90  0.88
2  0.92  0.49
3  1.31  0.84
4  1.63  1.58

Same thing, expect output but it printed 8 ndarray objects. rolling is producing a single dimensional ndarray of length window for each column as opposed to what I expected which was an ndarray of shape (window, len(df.columns)).

Question is Why?

I now don't have a way to easily run a rolling multi-factor regression.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:10:17+0000

I wanted to share what I've done to work around this problem.

Given a pd.DataFrame and a window, I generate a stacked ndarray using np.dstack (see answer). I then convert it to a pd.Panel and using pd.Panel.to_frame convert it to a pd.DataFrame. At this point, I have a pd.DataFrame that has an additional level on its index relative to the original pd.DataFrame and the new level contains information about each rolled period. For example, if the roll window is 3, the new index level will contain be [0, 1, 2]. An item for each period. I can now groupby level=0 and return the groupby object. This now gives me an object that I can much more intuitively manipulate.

Roll Function

import pandas as pd
import numpy as np

def roll(df, w):
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    panel = pd.Panel(roll_array, 
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    return panel.to_frame().unstack().T.groupby(level=0)

Demonstration

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

print df

      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

Let's sum

rolled_df = roll(df, 2)

print rolled_df.sum()

major     A     B
1      0.90  0.88
2      0.92  0.49
3      1.31  0.84
4      1.63  1.58

To peek under the hood, we can see the stucture:

print rolled_df.apply(lambda x: x)

major      A     B
  roll            
1 0     0.44  0.41
  1     0.46  0.47
2 0     0.46  0.47
  1     0.46  0.02
3 0     0.46  0.02
  1     0.85  0.82
4 0     0.85  0.82
  1     0.78  0.76

But what about the purpose for which I built this, rolling multi-factor regression. But I'll settle for matrix multiplication for now.

X = np.array([2, 3])

print rolled_df.apply(lambda df: pd.Series(df.values.dot(X))) 

      0     1
1  2.11  2.33
2  2.33  0.98
3  0.98  4.16
4  4.16  3.84

Categories

python - why does pandas rolling use single dimension ndarray

python - why does pandas rolling use single dimension ndarray

Question is Why?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Roll Function

Demonstration

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags