Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


aggregate - How to use R to aggregate the same field by multiple separate groups

I'm trying to count an indicator within several (actually hundreds of) groupings, each taken separately (NOT across all combinations of the groupings). I'll demonstrate with a simplified example:

Assume I have this dataset:

data<-cbind(c(1,1,1,2,2,2)
,c(1,1,2,2,2,3)
,c(3,2,1,2,2,3))
> data

      [,1] [,2] [,3]
[1,]    1    1    3
[2,]    1    1    2
[3,]    1    2    1
[4,]    2    2    2
[5,]    2    2    2
[6,]    2    3    3

and an indicator

some_indicator<-c(1,0,0,1,0,1)

Then, without loops (something like apply by column), I want to run the equivalent of:

aggregate(some_indicator,list(data[,1]),sum)
aggregate(some_indicator,list(data[,2]),sum)
aggregate(some_indicator,list(data[,3]),sum)

and combine the results into the following:

     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2

i.e. for each column (the set of values does not change much between columns), sum the indicator by value and merge the results.

Currently I do this with a loop over the columns, but I need a much more efficient way, since there are a lot of columns and it takes over an hour.

Thanks in advance, Michael.


1 Reply


1) tapply. The first argument of tapply is a copy of data in which every column has been replaced by some_indicator. The second argument says that we want to group both by the values in data and by the column number, so each column is aggregated separately.

result <- tapply(replace(data, TRUE, some_indicator), list(data, col(data)), sum)
replace(unname(result), is.na(result), 0)

For the input shown in the question, the last line gives:

     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
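
To make the two tapply arguments more concrete, here is a minimal sketch that builds them step by step for the data in the question. The names values, column_id and counts are introduced here only for illustration; they are not part of the original answer.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# Same shape as data, but every column is a copy of some_indicator
# (the 6 values are recycled column by column over the 18 cells).
values <- replace(data, TRUE, some_indicator)

# Same shape as data; each cell holds its own column number.
column_id <- col(data)

# Grouping by (cell value, column number) sums the indicator
# within each column separately.
result <- tapply(values, list(data, column_id), sum)

# Combinations that never occur come back as NA; turn them into 0.
counts <- replace(unname(result), is.na(result), 0)
print(counts)
```

Each row of counts corresponds to one distinct value in data (1, 2, 3) and each column to one column of data.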

1a) tapply. A somewhat longer tapply solution is the following. fun takes a column as its argument and uses tapply to sum some_indicator, using that column as the grouping variable. However, different columns can contain different sets of groups, so to ensure that they all share the same set of groups (for later alignment) we actually group by factor(x, levs). sapply then applies fun to each column of data. The as.data.frame is needed because data is a matrix, so sapply would otherwise iterate over each element rather than each column.

 levs <- levels(factor(data))
 fun <- function(x) tapply(some_indicator, factor(x, levs), sum)
 result <- sapply(as.data.frame(data), fun)
 replace(unname(result), is.na(result), 0)
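
The need for a shared set of levels can be seen in a small sketch on the question's data. The names per_column and aligned are hypothetical and used only for this illustration.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# Without shared levels, each column only reports the groups it contains,
# so the per-column results have different lengths and cannot be bound
# into a matrix directly.
per_column <- lapply(as.data.frame(data),
                     function(x) tapply(some_indicator, x, sum))
print(lengths(per_column))

# With shared levels, every column reports one entry per level
# (NA where a level does not occur), so the results line up.
levs <- levels(factor(data))
aligned <- sapply(as.data.frame(data),
                  function(x) tapply(some_indicator, factor(x, levs), sum))
print(dim(aligned))
```

Here the first column of data contains only the values 1 and 2, while the other two columns contain 1, 2 and 3, which is why the unaligned results differ in length.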

2) xtabs. This is quite similar to the tapply solution. It has two advantages: (1) the sum is implied by xtabs and so need not be specified, and (2) empty cells are filled with 0 rather than NA, which eliminates the extra step of replacing NAs with 0. On the other hand, we must unravel each component of the formula into a vector using c, since, unlike tapply, the xtabs formula will not accept matrices:

result <- xtabs(c(replace(data, TRUE, some_indicator)) ~ c(data) + c(col(data)))
dimnames(result) <- NULL

For the data in the question this gives:

> result
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
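
As a sanity check before running this on hundreds of columns, the xtabs result can be compared with a plain column-by-column computation on a larger input. The sketch below uses made-up random data; the sizes and the names big_data, big_indicator, by_column, by_xtabs and same are assumptions for illustration only.

```r
set.seed(1)
n_rows <- 1000   # hypothetical number of observations
n_cols <- 50     # hypothetical number of grouping columns

big_data <- matrix(sample(1:5, n_rows * n_cols, replace = TRUE),
                   nrow = n_rows, ncol = n_cols)
big_indicator <- rbinom(n_rows, 1, 0.5)

# Reference: one tapply per column, with shared levels for alignment.
levs <- levels(factor(big_data))
by_column <- sapply(seq_len(n_cols), function(j) {
  s <- tapply(big_indicator, factor(big_data[, j], levs), sum)
  as.vector(replace(s, is.na(s), 0))
})

# Single xtabs call, as in solution 2.
by_xtabs <- xtabs(c(replace(big_data, TRUE, big_indicator)) ~
                    c(big_data) + c(col(big_data)))

# Both should hold the same sums, one row per value and one column per column.
same <- all(as.vector(by_column) == as.vector(by_xtabs))
print(same)
```

Wrapping either computation in system.time() on your own data is a simple way to see which approach is faster in practice.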

REVISED: Revised the tapply solution and added the xtabs solution.

