Profiling a script of mine with profvis revealed that updating the elements of a sparse matrix is the slowest step in the process, by an order of magnitude.
I need to understand whether I can do this better (and especially faster); I would appreciate it if someone could advise where to look or offer suggestions.
Here is some R code that reproduces the 'critical' part of my script:
require(Matrix)
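# 11 x 11 dgCMatrix (compressed sparse column format), captured with dput()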
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L,
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L,
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L,
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L,
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1,
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())
system.time(for (i in 1:10000) m[7,c(1,5,6,7)] <- c(0,0,1,0))
On my laptop, this takes about 7 s. (Presumably each replacement that changes the nonzero pattern forces the compressed column structure, i.e. the i and p slots, to be rebuilt.)
[BTW, obviously I do not repeat the same operation 10000 times in the real script; the rows and columns being updated change each time, but the update does happen a very large number of times. The loop above simulates those operations and gives a measurable time against which any faster solutions can be compared.]
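A sketch of a benchmark closer to the real access pattern, with the updated row and columns randomized (the specific choices here are arbitrary):
set.seed(1)                  # arbitrary seed, for reproducibility
system.time(for (i in 1:10000) {
  r <- sample(nrow(m), 1)    # one random row to update
  cc <- sample(ncol(m), 4)   # four random columns
  m[r, cc] <- c(0, 0, 1, 0)
})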
Any ideas/suggestions?
PS
I had a similar problem in the past, but for a different case, and I can't find it again, as my activity history only goes back a few months.
EDIT
OK, I found out how to retrieve all my old posts, and discovered that the problem I am describing here was not covered.
EDIT 2 - following up on the discussion with / suggestions by pseudospin
require(Matrix)
require(data.table)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L,
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L,
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L,
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L,
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1,
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())
ms <- summary(m)                # triplet (i, j, x) representation, as a data.frame
ms <- ms[order(ms$i, ms$j), ]   # sort by row, then by column
msdt <- data.table(ms)          # the same triplets as a data.table
time_1 <- system.time(for (i in 1:5000) m[7,c(1,5,7,9)] <- c(0,0,1,0))
cat("
time_1 =", time_1)
time_2 <- system.time(for (i in 1:5000) ms[(ms$i == 7) & (ms$j %in% c(1,5,7,9)),"x"] <- c(0,0,1,0))
cat("
time_2 =", time_2)
time_3 <- system.time(for (i in 1:5000) msdt[(i == 7) & (j %in% c(1,5,7,9)), "x" := c(0,0,1,0)])
# NB: inside msdt[...], i and j refer to the data.table's columns, not to the loop index
cat("
time_3 =", time_3)
which gave:
time_1 = 2.86 0 2.86 NA NA
time_2 = 0.23 0 0.24 NA NA
time_3 = 1.2 0.02 1.22 NA NA
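Note that methods 2 and 3 edit the triplet representation, not m itself, so at some point the sparse matrix would presumably have to be reassembled. A minimal sketch of that step (not included in the timings above):
m_new <- sparseMatrix(i = ms$i, j = ms$j, x = ms$x,
                      dims = dim(m), dimnames = dimnames(m))
m_new <- drop0(m_new)   # remove entries that were set to exactly zero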
Maybe this example is misleading, though, because normally I would have much higher max values of i and j, so perhaps subsetting a data.table will be much more efficient than subsetting a data.frame.
To be tested with my real data...
EDIT 3 - trial with real data, including testing the dense matrix method as suggested by GKi
Real data (too large to paste here): m is a sparse 5828 x 5828 matrix; 302986 / 33965584 = 0.9% of it is filled (hence its sparseness). It occupies 4.4 MB. The corresponding dense matrix dm = as.matrix(m) occupies 272.5 MB.
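The dense-matrix benchmark code is not shown above; it was a loop analogous to the others, along these lines:
dm <- as.matrix(m)   # dense copy of the sparse matrix
time_4 <- system.time(for (i in 1:5000) dm[7, c(1,5,7,9)] <- c(0,0,1,0))
cat("\ntime_4 =", time_4)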
Testing the sparseMatrix
(1), data.frame
(2), data.table
(3) and dense matrix (4) updating methods shows the following:
time_1 = 10.25 3.19 13.72 NA NA
time_2 = 41.32 10.94 52.52 NA NA
time_3 = 35.64 7.44 43.34 NA NA
time_4 = 0.05 0.03 0.08 NA NA
So, in agreement with GKi's results, the dense matrix method is by far the fastest, but at the cost of a huge memory footprint.
On the other hand, the originally simulated data gave very different results for the sparseMatrix method, which with the real data is instead the second fastest of these 4.
Unfortunately it looks like a catch-22: for fast editing I need a dense matrix, but a dense matrix takes far too much memory, so I need a sparse matrix, which however is slow to edit :(
Maybe I will need to reconsider the original suggestion by pseudospin, and use sparse vectors for each row of the matrix. For that I will need to find out how I can refer to a stored R object by an indirect reference (a string).
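For example, something along these lines might work (just a sketch; the list-of-sparse-vectors layout and the names are illustrative):
require(Matrix)
# one sparse vector per row, stored in a named list
rows <- setNames(lapply(seq_len(nrow(m)),
                        function(k) as(m[k, ], "sparseVector")),
                 rownames(m))
key <- "7"                          # indirect reference by a string
v <- rows[[key]]
v[c(1, 5, 7, 9)] <- c(0, 0, 1, 0)   # update the row's sparse vector
rows[[key]] <- v                    # store the modified vector back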