Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
635 views
in Technique[技术] by (71.8m points)

regression - Why the built-in lm function is so slow in R?

I always thought that the lm function was extremely fast in R, but as this example would suggest, the closed solution computed using the solve function is way faster.

data<-data.frame(y=rnorm(1000),x1=rnorm(1000),x2=rnorm(1000))
X = cbind(1,data$x1,data$x2)

library(microbenchmark)
microbenchmark(
solve(t(X) %*% X, t(X) %*% data$y),
lm(y ~ .,data=data))

Can someone explain me if this toy example is a bad example or it is the case that lm is actually slow?

EDIT: As suggested by Dirk Eddelbuettel, as lm needs to resolve the formula, the comparison is unfair, so better to use lm.fit which doesn't need to resolve the formula

microbenchmark(
solve(t(X) %*% X, t(X) %*% data$y),
lm.fit(X,data$y))


Unit: microseconds
                           expr     min      lq     mean   median       uq     max neval cld
solve(t(X) %*% X, t(X) %*% data$y)  99.083 108.754 125.1398 118.0305 131.2545 236.060   100  a 
                      lm.fit(X, y) 125.136 136.978 151.4656 143.4915 156.7155 262.114   100   b
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You are overlooking that

  • solve() only returns your parameters
  • lm() returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...
  • the main cost of your lm() call is not the projection but the resolution of the formula y ~ . from which the model matrix needs to be built.

To illustrate Rcpp we wrote a few variants of a function fastLm() doing more of what lm() does (ie a bit more than lm.fit() from base R) and measured it. See e.g. this benchmark script which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.

In short, you are doing the Right Thing by using benchmarking but you are doing it not all that correctly in trying to compare what is mostly incomparable: a subset with a much larger task.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...