Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Can't get aggregate() to work for regression by group

I want to use aggregate with this custom function:

#linear regression f-n
CalculateLinRegrDiff = function (sample){
  fit <- lm(value~ date, data = sample)
  diff(range(fit$fitted))
}

dataset2 = aggregate(value ~ id + col, dataset, CalculateLinRegrDiff(dataset))

I receive the error:

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'FUN' of mode 'function' was not found

What is wrong?


1 Reply


First, your call to aggregate is wrong. The FUN argument must be the function itself, CalculateLinRegrDiff, not the result of calling it, CalculateLinRegrDiff(dataset).
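
To make the point about FUN concrete, here is a minimal sketch on a toy data frame (d and its columns are made up for illustration). Passing the function object works; passing the result of a call fails, because aggregate then receives a number where it expects a function.

```r
# Toy data, assumed for illustration only
d <- data.frame(f = rep(c("a", "b"), each = 3),
                value = c(1, 2, 3, 10, 20, 30))

# Correct: FUN is the function itself
ok <- aggregate(value ~ f, data = d, FUN = mean)

# Wrong: FUN is the *result* of calling the function (a number);
# the error is captured so that the script keeps running
bad <- tryCatch(aggregate(value ~ f, data = d, FUN = mean(d$value)),
                error = function(e) e)
inherits(bad, "error")  # TRUE: a number cannot be used as FUN
```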

Second, you've chosen the wrong tool. aggregate can't fit a regression by group. It splits the vector on the LHS of ~ according to the combinations of the variables on the RHS, and then applies FUN to each piece of that vector. That is, FUN must be a function that works on an atomic vector, not on a data frame; mean, sd and quantile are typical examples. CalculateLinRegrDiff expects a data frame, so it will not work with aggregate.
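
A quick way to see what FUN is handed is to inspect its argument from inside an anonymous function. This is only a sketch on made-up data (d, f, date and value are illustrative names):

```r
# Toy data, assumed for illustration only
d <- data.frame(f = rep(c("a", "b"), each = 3),
                date = rep(1:3, 2),
                value = c(1, 2, 3, 10, 20, 30))

# For each group, report whether FUN received a plain atomic vector
seen <- aggregate(value ~ f, data = d,
                  FUN = function(x) is.atomic(x) && !is.data.frame(x))
seen$value  # TRUE for both groups; the date column never reaches FUN
```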

Note that sometimes we use cbind on the LHS, as in cbind(x, y) ~ f. This applies FUN separately to x ~ f and to y ~ f. The LHS variables are treated independently and are never passed to FUN together.
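
The following sketch, again on made-up data, checks that the cbind form gives the same numbers as two separate single-variable calls:

```r
# Toy data, assumed for illustration only
d <- data.frame(f = rep(c("a", "b"), each = 3),
                x = c(1, 2, 3, 4, 5, 6),
                y = c(10, 20, 30, 40, 50, 60))

both   <- aggregate(cbind(x, y) ~ f, data = d, FUN = mean)
only_x <- aggregate(x ~ f, data = d, FUN = mean)
only_y <- aggregate(y ~ f, data = d, FUN = mean)

# Each column of `both` matches the corresponding single-variable result
all.equal(both$x, only_x$x)
all.equal(both$y, only_y$y)
```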

The right tool here is the by function. It splits a data frame into sub data frames and applies FUN to each of them, so it is ideal for regression by group.

by(dataset[c("value", "date")], dataset[c("id", "col")], CalculateLinRegrDiff)

A simple reproducible example:

set.seed(0)
dataset <- data.frame(value = runif(20), date = runif(20),
                      f = sample(gl(2, 10)), g = sample(gl(4, 5)))
oo <- by(dataset[c("value", "date")], dataset[c("f", "g")], CalculateLinRegrDiff)
str(oo)
# by [1:2, 1:4] 0.307 0.251 0.109 0.201 0.472 ...
# - attr(*, "dimnames")=List of 2
#  ..$ f: chr [1:2] "1" "2"
#  ..$ g: chr [1:4] "1" "2" "3" "4"

Since CalculateLinRegrDiff returns a single number, by simplifies the result oo to an array rather than a list. This array is like a contingency table, so we can use the "table" method of as.data.frame to reshape it into a data frame:

oo <- as.data.frame.table(oo)
#  f g      Freq
#1 1 1 0.3069877
#2 2 1 0.2508591
#3 1 2 0.1087895
#4 2 2 0.2007295
#5 1 3 0.4715680
#6 2 3 0.4942069
#7 1 4 0.3223174
#8 2 4 0.4687340

The name "Freq" may be undesirable, but you can easily change it, for example with names(oo)[3] <- "foo".
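
Another option is to set the column name up front with the responseName argument of as.data.frame.table. A minimal sketch on made-up data follows; the data frame d and the name fitted_range are assumptions for illustration, and the function is repeated only to keep the snippet self-contained.

```r
CalculateLinRegrDiff <- function(sample) {
  fit <- lm(value ~ date, data = sample)
  diff(range(fitted(fit)))
}

# Toy data, assumed for illustration: two groups with exact linear trends
d <- data.frame(f = rep(c("a", "b"), each = 5),
                date = rep(1:5, 2),
                value = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10))

res <- by(d[c("value", "date")], d["f"], CalculateLinRegrDiff)

# Name the value column directly instead of renaming "Freq" afterwards
out <- as.data.frame.table(res, responseName = "fitted_range")
names(out)
```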

As I said in my comments under your question, we can also use split and lapply. But then there is no trivial way to convert the result into a good-looking data frame.

datlist <- split(dataset[c("value", "date")], dataset[c("f", "g")], drop = TRUE)
rr <- lapply(datlist, CalculateLinRegrDiff)
stack(rr)
#     values ind
#1 0.3069877 1.1
#2 0.2508591 2.1
#3 0.1087895 1.2
#4 0.2007295 2.2
#5 0.4715680 1.3
#6 0.4942069 2.3
#7 0.3223174 1.4
#8 0.4687340 2.4
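
One possible workaround, sketched here and not the only way to do it, is to split the full data frame so that each piece keeps its grouping columns, return a one-row data frame per group, and bind the rows. The helper name one_row and the toy data are made up for illustration.

```r
CalculateLinRegrDiff <- function(sample) {
  fit <- lm(value ~ date, data = sample)
  diff(range(fitted(fit)))
}

# Toy data, assumed for illustration: four groups with exact linear trends
d <- data.frame(f = rep(c("a", "b"), each = 10),
                g = rep(rep(c("u", "v"), each = 5), 2),
                date = rep(1:5, 4),
                value = c(1:5, 2 * (1:5), 3 * (1:5), 4 * (1:5)))

# Hypothetical helper: keep the group labels next to the statistic
one_row <- function(s) {
  data.frame(f = s$f[1], g = s$g[1], diff = CalculateLinRegrDiff(s))
}

pieces <- split(d, d[c("f", "g")], drop = TRUE)
out <- do.call(rbind, lapply(pieces, one_row))
rownames(out) <- NULL
out
```

The result has one row per non-empty group, with f and g as separate columns instead of the fused ind labels that stack produces.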

I suggest you read "Linear Regression and group by in R" for a thorough demonstration of regression by group.

