r - grouping data with the same name and applying function

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I have a matrix like the one below. I want to group the columns that share the same name and apply a function to the rows of the matrix.

>data

      A  A  A  B  B  C
gene1 1  6 11 16 21 26
gene2 2  7 12 17 22 27
gene3 3  8 13 18 23 28
gene4 4  9 14 19 24 29
gene5 5 10 15 20 25 30

Basically, I want to put columns with the same name into one group (A into group 1, B into group 2, ...) and then calculate a t-test for each gene across the groups. Can anybody help me with this? First the grouping, then applying the t-test, which should return the t score for each gene between the different groups.


1 Reply

深蓝 · Answer 1 · 2022-01-31T07:13:45+0000

The OP hasn't mentioned what form they want in their output, but I'm entirely updating this answer with a possible solution.

First, some reproducible sample data to work with (that will actually work with t.test).

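# Note: R >= 3.6.0 changed the default sample() algorithm, so the values below
# are reproduced only after RNGkind(sample.kind = "Rounding")
set.seed(1)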
mymat <- matrix(sample(100, 40, replace = TRUE), 
                ncol = 8, dimnames = list(
                  paste("gene", 1:5, sep = ""), 
                  c("A", "A", "A", "B", "B", "B", "C", "C")))
mymat
#        A  A  A   B  B  B  C  C
# gene1 27 90 21  50 94 39 49 67
# gene2 38 95 18  72 22  2 60 80
# gene3 58 67 69 100 66 39 50 11
# gene4 91 63 39  39 13 87 19 73
# gene5 21  7 77  78 27 35 83 42

I've left all the hard work to the combn function. Within the combn function, I've made use of the FUN argument to add a function that creates a vector of the t.test "statistic" by each row (I'm assuming one gene per row). I've also added an attribute to the resulting vector to remind us which columns were used in calculating the statistic.

temp <- combn(unique(colnames(mymat)), 2, FUN = function(x) {
  # One t statistic per row (gene) for the pair of groups in x
  out <- numeric(nrow(mymat))
  for (i in seq_len(nrow(mymat))) {
    out[i] <- t.test(mymat[i, colnames(mymat) %in% x[1]], 
                     mymat[i, colnames(mymat) %in% x[2]])$statistic
  }
  # Remember which pair of groups produced this vector
  attr(out, "NAME") <- paste(x, collapse = "")
  out
}, simplify = FALSE)
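
To see what combn hands to FUN, here is a minimal sketch using only the group labels (the groups vector is a stand-in for colnames(mymat), not part of the solution itself). It also shows how the %in% test picks out the columns belonging to one label.

```r
# Hypothetical group labels, mirroring the column names used above
groups <- c("A", "A", "A", "B", "B", "B", "C", "C")

# combn() calls FUN once per pair of unique labels; with simplify = FALSE
# the results come back as a list, one element per pair
pairs <- combn(unique(groups), 2, FUN = function(x) {
  paste(x, collapse = "")
}, simplify = FALSE)

unlist(pairs)
# "AB" "AC" "BC"

# The logical test used inside t.test() selects the columns for one label
which(groups %in% "B")
# 4 5 6
```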

The output of the above is a list of vectors. It might be more convenient to convert this into a matrix. Since we know that each value in a vector represents one row, and each vector overall represents one column value combination (AB, AC, or BC), we can use that for the dimnames of the resulting matrix.

DimNames <- list(rownames(mymat), sapply(temp, attr, "NAME"))

final <- do.call(cbind, temp)
dimnames(final) <- DimNames
final
#               AB         AC           BC
# gene1 -0.5407966 -0.5035088  0.157386919
# gene2  0.5900350 -0.7822292 -1.645448267
# gene3 -0.2040539  1.7263502  1.438525163
# gene4  0.6825062  0.5933218  0.009627409
# gene5 -0.4384258 -0.9283003 -0.611226402
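
The list-to-matrix step can be tried in isolation. This is a small sketch with made-up numbers (the res list and its values are purely illustrative), assuming each vector carries a "NAME" attribute as in the code above.

```r
# Hypothetical list of per-pair vectors, each carrying a "NAME" attribute
res <- list(structure(c(1.5, -0.5), NAME = "AB"),
            structure(c(0.2, 2.0), NAME = "AC"))

# Bind the vectors as columns, then label rows by gene and columns by pair
mat <- do.call(cbind, res)
dimnames(mat) <- list(c("gene1", "gene2"), sapply(res, attr, "NAME"))

mat["gene2", "AC"]
# 2
```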

Some manual verification:

## Should be the same as final[1, "AC"]
t.test(mymat[1, colnames(mymat) %in% "A"],
       mymat[1, colnames(mymat) %in% "C"])$statistic
#          t 
# -0.5035088 

## Should be the same as final[5, "BC"]    
t.test(mymat[5, colnames(mymat) %in% "B"],
       mymat[5, colnames(mymat) %in% "C"])$statistic
#          t 
# -0.6112264 

## Should be the same as final[3, "AB"]
t.test(mymat[3, colnames(mymat) %in% "A"],
       mymat[3, colnames(mymat) %in% "B"])$statistic
#          t 
# -0.2040539
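
It may also help to see what the statistic actually is. By default t.test uses var.equal = FALSE, i.e. the Welch test, so the value can be recomputed by hand. The sketch below uses the gene1 values for A and C from the sample matrix printed above; the variable names are just for illustration.

```r
# Values for gene1 taken from the sample matrix printed above
a  <- c(27, 90, 21)   # columns named "A"
cc <- c(49, 67)       # columns named "C"

# Welch statistic computed by hand
welch_t <- (mean(a) - mean(cc)) /
  sqrt(var(a) / length(a) + var(cc) / length(cc))

# t.test() defaults to var.equal = FALSE, i.e. the Welch test
tt <- unname(t.test(a, cc)$statistic)

all.equal(welch_t, tt)
# TRUE
round(welch_t, 7)
# -0.5035088
```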

Update

Building on @EDi's answer, here's another approach. It uses melt from "reshape2" to convert the data into a "long" format. From there, as before, it's fairly straightforward subsetting work to get what you want. The output is transposed relative to the pure combn approach, but the values are the same.

library(reshape2)
mymatL <- melt(mymat)

byGene <- split(mymatL, mymatL$Var1)
RowNames <- combn(unique(as.character(mymatL$Var2)), 2, 
                  FUN = paste, collapse = "")

out <- sapply(byGene, function(combos) {
  combn(unique(as.character(mymatL$Var2)), 2, FUN = function(x) {
    t.test(value ~ Var2, combos[combos[, "Var2"] %in% x, ])$statistic
  }, simplify = TRUE)
})

rownames(out) <- RowNames
out
#         gene1      gene2      gene3       gene4      gene5
# AB -0.5407966  0.5900350 -0.2040539 0.682506188 -0.4384258
# AC -0.5035088 -0.7822292  1.7263502 0.593321770 -0.9283003
# BC  0.1573869 -1.6454483  1.4385252 0.009627409 -0.6112264
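
If "reshape2" isn't available, a similar "long" layout can be built in base R. This is only a sketch on a small made-up matrix (m and its values are illustrative); it mirrors the Var1/Var2/value column names used above, though the column types may differ from what melt returns.

```r
# Small deterministic matrix with repeated column names (illustrative values)
m <- matrix(1:12, nrow = 2,
            dimnames = list(c("gene1", "gene2"),
                            c("A", "A", "A", "B", "B", "B")))

# One row per cell: row name, column name, value (column-major order)
long <- data.frame(Var1 = rownames(m)[row(m)],
                   Var2 = colnames(m)[col(m)],
                   value = as.vector(m))

# Splitting by gene and filtering by group recovers the original row entries
byGene <- split(long, long$Var1)
byGene$gene1$value[byGene$gene1$Var2 == "A"]
# 1 3 5
```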

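The first option is considerably faster, at least on this small dataset (with fun1() and fun2() wrapping the first and second approaches, respectively):

library(microbenchmark)
microbenchmark(fun1(), fun2())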
# Unit: milliseconds
#    expr       min        lq    median       uq      max neval
#  fun1()  8.812391  9.012188  9.116896  9.20795 17.55585   100
#  fun2() 42.754296 43.388652 44.263760 45.47216 67.10531   100
