r - All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches

Question

Welcome To Ask or Share your Answers For Others

r - All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches

What I want to do is more or less a combination of the problems discussed in the two following threads:

Perform non-pairwise all-to-all comparisons between two unordered character vectors --- The opposite of intersect --- all-to-all setdiff
Merge data frames based on numeric rownames within a chosen threshold and keeping unmatched rows as well

I have two numeric vectors:

b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)

I want to compare all elements in b_1 against all elements in b_2 and vice versa.

If element_i in b_1 is NOT equal to any number in the range element_j ± 0.045 in b_2 then element_i must be reported.

Likewise, if element_j in b_2 is NOT equal to any number in the range element_i ± 0.045 in b_1 then element_j must be reported.

Therefore, example answer based on the vectors provided above will be:

### based on threshold = 0.045
in_b1_not_in_b2 <- c(1002.5698, 301.569)
in_b2_not_in_b1 <- c(22.12, 53, 5666.31, 100.1)

Is there an R function that would do this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:38:08+0000

A vectorized beast:

D <- abs(outer(b_1, b_2, "-")) > 0.045

in_b1_not_in_b2 <- b_1[rowSums(D) == length(b_2)]
#[1] 1002.570  301.569

in_b2_not_in_b1 <- b_2[colSums(D) == length(b_1)]
#[1]   22.12   53.00 5666.31  100.10

hours later...

Henrik shared a question complaining the memory explosion when using outer for long vectors: Matching two very very large vectors with tolerance (fast! but working space sparing). However, the memory bottleneck for outer can be easily killed by blocking.

f <- function (b1, b2, threshold, chunk.size = 5000) {

  n1 <- length(b1)
  n2 <- length(b2)
  chunk.size <- min(chunk.size, n1, n2)

  RS <- numeric(n1)  ## rowSums, to be accumulated
  CS <- numeric(n2)  ## colSums, to be accumulated

  j <- 0
  while (j < n2) {
    chunk.size_j <- min(chunk.size, n2 - j)
    ind_j <- (j + 1):(j + chunk.size_j)
    b2_j <- b2[ind_j]
    i <- 0
    while (i < n1) {
      chunk.size_i <- min(chunk.size, n1 - i)
      ind_i <- (i + 1):(i + chunk.size_i)
      M <- abs(outer(b1[ind_i], b2_j, "-")) > threshold
      RS[ind_i] <- RS[ind_i] + rowSums(M)
      CS[ind_j] <- CS[ind_j] + colSums(M)
      i <- i + chunk.size_i
      }
    j <- j + chunk.size_j
    }

  list(in_b1_not_in_b2 = b1[RS == n2], in_b2_not_in_b1 = b2[CS == n1])
  }

With this function, outer never uses more memory than storing two chunk.size x chunk.size matrices. Now let's do something crazy.

b1 <- runif(1e+5, 0, 10000)
b2 <- b1 + runif(1e+5, -1, 1)

If we do a simple outer, we need memory to store two 1e+5 x 1e+5 matrices, which is up to 149 GB. However, on my Sandy Bridge (2011) laptop with only 4 GB RAM, computation is feasible.

system.time(oo <- f(b1, b2, 0.045, 5000))
#   user  system elapsed 
#365.800 167.348 533.912

The performance is actually good enough, given that we have been using a very poor algorithm.

All answers here do exhausted search, that has complexity length(b1) x length(b2). We could reduce this to length(b1) + length(b2) if we work on sorted arrays. But such deeply optimized algorithm can only be implemented with compiled language to obtain efficiency.

Categories

r - All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches

r - All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags