Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
446 views
in Technique[技术] by (71.8m points)

python - Statsmodels: Calculate fitted values and R squared

I am running a regression as follows (df is a pandas dataframe):

import statsmodels.api as sm
est = sm.OLS(df['p'], df[['e', 'varA', 'meanM', 'varM', 'covAM']]).fit()
est.summary()

Which gave me, among others, an R-squared of 0.942. So then I wanted to plot the original y-values and the fitted values. For this, I sorted the original values:

orig = df['p'].values
fitted = est.fittedvalues.values
args = np.argsort(orig)
import matplotlib.pyplot as plt
plt.plot(orig[args], 'bo')
plt.plot(orig[args]-resid[args], 'ro')
plt.show()

This, however, gave me a graph where the values were completely off. Nothing that would suggest an R-squared of 0.9. Therefore, I tried to calculate it manually myself:

yBar = df['p'].mean()
SSTot = df['p'].apply(lambda x: (x-yBar)**2).sum()
SSReg = ((est.fittedvalues - yBar)**2).sum()  
1 - SSReg/SSTot
Out[79]: 0.2618159806908984

Am I doing something wrong? Or is there a reason why my computation is so far off what statsmodels is getting? SSTot, SSReg have values of 48084, 35495.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you do not include an intercept (constant explanatory variable) in your model, statsmodels computes R-squared based on un-centred total sum of squares, ie.

tss = (ys ** 2).sum()  # un-centred total sum of squares

as opposed to

tss = ((ys - ys.mean())**2).sum()  # centred total sum of squares

as a result, R-squared would be much higher.

This is mathematically correct. Because, R-squared should indicate how much of the variation is explained by the full-model comparing to the reduced model. If you define your model as:

ys = beta1 . xs + beta0 + noise

then the reduced model can be: ys = beta0 + noise, where the estimate for beta0 is the sample average, thus we have: noise = ys - ys.mean(). That is where de-meaning comes from in a model with intercept.

But from a model like:

ys = beta . xs + noise

you may only reduce to: ys = noise. Since noise is assumed zero-mean, you may not de-mean ys. Therefore, unexplained variation in the reduced model is the un-centred total sum of squares.

This is documented here under rsquared item. Set yBar equal to zero, and I would expect you will get the same number.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.9k users

...