Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
138 views
in Technique[技术] by (71.8m points)

Updating individual elements of a sparse matrix in R is very slow - how can I do it faster?

The profvis analysis of a script that uses a sparse matrix revealed that the update of the sparse matrix elements is the slowest step in the process, by 1 order of magnitude.
I need to understand if I can do this better (and especially faster); I would appreciate if someone could please advise where to look or provide suggestions.

Here is some R code that reproduces the 'critical' part of my script:

require(Matrix)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L, 
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L, 
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L, 
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L, 
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1, 
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())

system.time(for (i in 1:10000) m[7,c(1,5,6,7)] <- c(0,0,1,0))

On my laptop, this takes about 7 s.
[BTW, obviously I do not repeat the same operation 10000 times; the rows and columns that are updated change each time, but indeed it happens a very large number of times. I did the above to simulate the operations that are performed in the real script and to get a measurable time that can be compared with faster solutions that might come up.]

Any ideas/suggestions?

PS
I had a similar problem in the past, but for a different case; and I can't find it back, as my activity history only goes back to a few months ago.

EDIT OK, I found out how to retrieve all my old posts, and discovered that the problem I am describing here was not covered.

EDIT 2 - following up from discussion with / suggestions by pseudospin

require(Matrix)
require(data.table)

m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L, 
                            7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L, 
                            1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L, 
                            9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L, 
                                            40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3", 
                                                                                            "4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4", 
                                                                                                                                         "5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1, 
                                                                                                                                                                                      3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
                                                                                                                                                                                      1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())

ms <- summary(m)
ms <- ms[order(ms$i,ms$j),]
msdt <- data.table(ms)

time_1 <- system.time(for (i in 1:5000) m[7,c(1,5,7,9)] <- c(0,0,1,0))
cat("
time_1 =", time_1)

time_2 <- system.time(for (i in 1:5000) ms[(ms$i == 7) & (ms$j %in% c(1,5,7,9)),"x"] <- c(0,0,1,0))
cat("
time_2 =", time_2)

time_3 <- system.time(for (i in 1:5000) msdt[(i == 7) & (j %in% c(1,5,7,9)),"x" := c(0,0,1,0)])
cat("
time_3 =", time_3)

which gave:

time_1 = 2.86 0 2.86 NA NA
time_2 = 0.23 0 0.24 NA NA
time_3 = 1.2 0.02 1.22 NA NA

Maybe this example is misleading, though, because normally I would have much higher max values of i and j, so perhaps subsetting a data.table will be much more efficient than subsetting a data.frame.
To be tested with my real data...

EDIT 3 - trial with real data, including testing the dense matrix method as suggested by GKi

Real data (too large to paste here): m is a sparse 5828 x 5828 matrix; 302986 / 33965584 = 0.9% of it is filled (hence its sparseness). It occupies 4.4 MB. The corresponding dense matrix dm = as.matrix(m) occupies 272.5 MB.

Testing the sparseMatrix (1), data.frame (2), data.table (3) and dense matrix (4) updating methods shows the following:

time_1 = 10.25 3.19 13.72 NA NA
time_2 = 41.32 10.94 52.52 NA NA
time_3 = 35.64 7.44 43.34 NA NA
time_4 = 0.05 0.03 0.08 NA NA

So, in agreement with GKi's results, the dense matrix method is by far the fastest, but at the cost of huge memory storage.
On the other hand, the simulated data used originally gave very different results for the sparseMatrix method, which with real data is instead the second fastest among these 4.

Unfortunately it looks like a catch-22 situation: to have fast editing I need to use a dense matrix, but a dense matrix takes far too much memory, so I need to use a sparse matrix, which however is slow to edit :(

Maybe I will need to reconsider the original suggestion by pseudospin, and use sparse vectors for each row of the matrix. For that I will need to find out how I can refer to a stored R object by an indirect reference (string).

question from:https://stackoverflow.com/questions/65644872/updating-individual-elements-of-a-sparse-matrix-in-r-is-very-slow-how-can-i-do

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

A comparison of different methods. Note that the methods DataFrame, DataTable and FastMatch will work only in case you are overwriting existing values in the sparse matrix but not when you insert new values.

library(microbenchmark)
microbenchmark(list = fun, control=list(order="block"))
#Unit: microseconds
#         expr     min       lq      mean   median       uq      max neval   cld
# SparseMatrix 618.307 623.1795 639.38102 627.2525 643.3895 1021.301   100     e
#  DenseMatrix   1.178   1.2210   1.36957   1.2635   1.3435    7.060   100 a    
#         Slam 259.703 264.3945 270.57151 265.9780 268.0745  426.610   100   c  
#        Spray 422.129 427.1310 463.21071 434.0705 440.6025 2880.787   100    d 
#    DataFrame  37.031  37.7910  38.98143  38.1660  38.6255   73.283   100  b   
#   DataFrameB  16.928  17.4480  17.85553  17.6910  18.0155   28.859   100 ab   
#   DataFrameC  21.007  21.7170  22.68689  21.9600  22.3735   38.175   100 ab   
#    DataTable 283.409 288.5710 299.43498 292.8395 301.1255  500.484   100   c  
#    FastMatch  40.138  40.7885  42.33623  41.2165  41.6575   82.274   100  b   
#         List   2.703   2.7900   3.28163   2.8535   2.9770   13.375   100 a    
#  Environment   2.157   2.2055   2.29915   2.2575   2.3340    4.211   100 a    

library(bench)
mark(exprs = fun, check = FALSE)
## A tibble: 11 x 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 SparseMatrix 624.43μs 651.98μs     1487.    1.65KB    10.4    712     5
# 2 DenseMatrix    1.61μs   1.86μs   490003.        0B    49.0   9999     1
# 3 Slam         261.46μs 271.96μs     3606.     7.2KB     8.24  1750     4
# 4 Spray        424.08μs 439.86μs     2206.    8.16KB    10.4   1063     5
# 5 DataFrame     37.26μs  40.03μs    24378.    2.08KB     9.75  9996     4
# 6 DataFrameB    18.25μs  19.72μs    49447.     1.7KB     9.89  9998     2
# 7 DataFrameC    22.72μs  24.82μs    39142.      840B    11.7   9997     3
# 8 DataTable    288.24μs 300.25μs     3252.   18.34KB     8.24  1579     4
# 9 FastMatch     41.05μs  43.73μs    22292.    2.46KB    11.2   9995     5
#10 List            3.4μs   3.71μs   257225.        0B    25.7   9999     1
#11 Environment    2.82μs   3.11μs   306445.        0B     0    10000     0

Methods:

fun <- alist(
  SparseMatrix = m[7,c(1,5,6,7)] <- c(0,0,1,0)
, DenseMatrix = dm[7,c(1,5,6,7)] <- c(0,0,1,0)
, Slam = slm[7,c(1,5,6,7)] <- c(0,0,1,0)
, Spray = spm[7,c(1,5,6,7)] <- c(0,0,1,0)
, DataFrame = ms[(ms$j == 7) & (ms$i %in% c(1,5,6,7)),"x"] <- c(0,0,1,0)
, DataFrameB = ms$x[(ms$j == 7) & (ms$i %in% c(1,5,6,7))] <- c(0,0,1,0)
, DataFrameC = {i <- which(ms$j == 7); ms$x[i[ms$i[i] %in% c(1,5,6,7)]] <- c(0,0,1,0)}
, DataTable = msdt[(j == 7) & (i %in% c(1,5,6,7)),"x" := c(0,0,1,0)]
, FastMatch = mf[mf$j %fin% 7 & (mf$i %fin% c(1,5,6,7)),"x"] <- c(0,0,1,0)
, List = ml[["7"]][c("1","5","6","7")] <- c(0,0,1,0)
, Environment = me[["7"]][c("1","5","6","7")] <- c(0,0,1,0)
)

Data:

library(Matrix)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L, 
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L, 
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L, 
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L, 
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1, 
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())

dm <- as.matrix(m) #Dense Matrix

library(slam)
slm <- as.simple_sparse_array(dm)

library(spray)
spm <- as.spray(dm)

ms <- summary(m)
ms <- ms[order(ms$i,ms$j),]

library(data.table)
msdt <- data.table(ms)

library(fastmatch)
mf <- ms

ml <- split(setNames(ms$x, ms$j), ms$i)

me <- list2env(ml, hash = TRUE)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...