Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


linear regression - lm function in R does not give coefficients for all factor levels in categorical data

I was trying out linear regression in R using categorical attributes and observed that I don't get a coefficient for each of the factor levels I have.

Please see my code below. I have 5 factor levels for states, but see only 4 coefficients.

> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
  states population
1     WA   0.5
2     TE   0.2
3     GE   0.6
4     LA   0.7
5     SF   0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)

Call:
lm(formula = population ~ states, data = df)

Coefficients:
(Intercept)     statesLA     statesSF     statesTE     statesWA  
        0.6          0.1          0.3         -0.4         -0.1

I also tried a larger data set by doing the following, but still see the same behavior:

for(i in 1:10)
{
    df = rbind(df,df)
}

EDIT: Thanks to the responses from eipi10, MrFlick and economy, I now understand that one of the levels is being used as the reference level. But when I get new test data whose state is "GE", what do I substitute into the equation y=m1x1+m2x2+...+c?

I also tried flattening the data so that each factor level gets its own column, but again I get NA as the coefficient for one of the columns. If I have new test data whose state is 'WA', how can I get the 'population' value? What do I substitute as its coefficient?

> df1
  population GE MI TE WA
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0

lm(formula = population ~ (GE+MI+TE+WA),data=df1)

Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)

Coefficients:
(Intercept)           GE           MI           TE           WA  
          1            1            0            1           NA  

1 Reply


GE comes first alphabetically, so R uses it as the reference level and absorbs it into the intercept term. As eipi10 stated, you can interpret the coefficients for the other levels of states with GE as the baseline (statesLA = 0.1 means LA is, on average, 0.1 higher than GE, in the units of population).
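To make the baseline idea concrete, here is a minimal sketch that rebuilds the data from the question and checks which level R treats as the reference. The names `fit`, `b` and `states_wa` are only illustrative, and `stringsAsFactors = TRUE` is an assumption added so that `states` is a factor on any R version.

```r
# Rebuild the question's data. stringsAsFactors = TRUE makes `states` a factor
# explicitly (R >= 4.0 no longer converts strings to factors by default).
df <- data.frame(
  states = c("WA", "TE", "GE", "LA", "SF"),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9),
  stringsAsFactors = TRUE
)

# The first level listed is the reference level used by lm()
levels(df$states)

fit <- lm(population ~ states, data = df)
b <- coef(fit)

# The intercept is the fitted value for GE; every other coefficient is an
# offset from it, so the fitted value for LA is intercept + statesLA
b["(Intercept)"] + b["statesLA"]

# To use a different baseline, relevel the factor before fitting
df$states_wa <- relevel(df$states, ref = "WA")
coef(lm(population ~ states_wa, data = df))
```

After releveling, the intercept becomes the fitted value for WA and a coefficient for GE appears instead; the fitted values themselves do not change.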

EDIT:

To respond to your updated question:

If you include all of the levels in a linear regression along with an intercept, you get a situation called perfect collinearity, which is responsible for the NA coefficient you're seeing when you force each category into its own variable. In short: each row has exactly one dummy variable equal to 1, so the dummy columns always add up to the intercept column, and there is no unique set of coefficients; R drops the redundant term and reports NA for it. I won't go into the full explanation here (any write-up of the "dummy variable trap" covers it). If you want to see a coefficient for every level, you can fit the regression without an intercept term, as suggested in the comments, but again, this is ill-advised unless you have a specific reason to.
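The collinearity can be checked directly on the design matrix. The sketch below uses the data from the first example in the question; `X_all`, `X_full` and `fit0` are illustrative names, and the no-intercept fit is shown only to demonstrate what it produces, not as a recommendation.

```r
df <- data.frame(
  states = c("WA", "TE", "GE", "LA", "SF"),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9),
  stringsAsFactors = TRUE
)

# One indicator column per level, with no intercept column
X_all <- model.matrix(~ states - 1, data = df)

# Every row sums to 1, i.e. the indicator columns add up to the intercept column
rowSums(X_all)

# Adding an intercept column makes the design matrix rank deficient:
# the rank is smaller than the number of columns
X_full <- cbind(intercept = 1, X_all)
qr(X_full)$rank
ncol(X_full)

# Without an intercept, each coefficient is simply that state's mean
fit0 <- lm(population ~ states - 1, data = df)
coef(fit0)
```

Both parameterisations give the same fitted values; they differ only in how the coefficients are expressed.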

As for GE in your y=mx+c equation: the x variables for the other states are binary indicators (zero or one), and if the state is GE they are all zero, so the expected y reduces to the intercept c.

e.g.

y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c

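If you don't have any other variables, as in your first example, the predicted value for GE is simply the intercept term (0.6). For any other state, add that state's coefficient to the intercept: for WA, 0.6 + (-0.1) = 0.5, which matches the data.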
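In practice you rarely need to substitute by hand, because `predict()` builds the indicator variables for new data itself. A minimal sketch using the question's first data set follows; `new_obs` and `pred` are illustrative names, and the new data is given the same factor levels as the training data as a precaution.

```r
df <- data.frame(
  states = c("WA", "TE", "GE", "LA", "SF"),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9),
  stringsAsFactors = TRUE
)
fit <- lm(population ~ states, data = df)

# New observations: one for the reference level (GE) and one for WA
new_obs <- data.frame(
  states = factor(c("GE", "WA"), levels = levels(df$states))
)
pred <- predict(fit, newdata = new_obs)
pred

# The same values by hand from the coefficients
b <- coef(fit)
b["(Intercept)"]                   # GE: all indicators are zero
b["(Intercept)"] + b["statesWA"]   # WA: intercept plus the WA offset
```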

