Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
766 views
in Technique[技术] by (71.8m points)

r - How to fuzzy join based on multiple columns and conditions?

I'm trying to left join two data frames (df1, df2). The data frames have two columns in common: zone and slope. Zone is a factor column and slope is numeric.

    df1 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)))
    df2 = data.frame(slope = c(2.4, 2.4,6.2), zone = c(rep("Low", 1), rep("High", 2)), other = c(rep("a", 1), rep("b", 1), rep("c", 1)))
    df1
    df2

I want to join the data frames such that they are first matched exactly on zone, and then the closest match for slope. If there are two slope values that equidistant, it doesn't matter if the join rounds up or down as long as the rule is applied consistently and it does not result in duplicate rows.

I'd prefer to do this with a fuzzy_join or dplyr rather than data.table.

The result should look something like:

    df3 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)), other = c(rep("a", 3), rep("b",1), rep("c",2)))
    df3

where the value of "other" is first determined by zone, and then the closest slope.

I've tried:

    distance_left_join(df, df2, by=c("zone"= "zone", "slope"="slope"))

as well as other types of fuzzy joins, but I think they may not be working because the columns are of different types. I suspect there is a fuzzy_left_join solution, but I don't understand how to create a match function.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here is how to do fuzzy joins with multiple match_funs. If you want to mix complex match_funs, you'll have to define them with a function yourself as I did here: Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

df1 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)))
df2 = data.frame(slope = c(2.4, 2.4,6.2), zone = c(rep("Low", 1), rep("High", 2)), other = c(rep("a", 1), rep("b", 1), rep("c", 1)))

library(fuzzyjoin); library(dplyr)

# First, need to define match_fun_distance. 
# This is copied from the source code for distance_join in https://github.com/dgrtwo/fuzzyjoin
match_fun_distance <- function(v1, v2) {
  
  # settings for this method
  method = "manhattan"
  max_dist = 99
  distance_col = "dist"
  
  if (is.null(dim(v1))) {
    v1 <- t(t(v1))
    v2 <- t(t(v2))
  }
  if (method == "euclidean") {
    d <- sqrt(rowSums((v1 - v2)^2))
  }
  else if (method == "manhattan") {
    d <- rowSums(abs(v1 - v2))
  }
  ret <- tibble::tibble(instance = d <= max_dist)
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- d
  }
  ret
}

(joined_result <- fuzzy_join(df1, df2, 
                             by=c("zone"= "zone", "slope"="slope"), 
                             match_fun = list(`==`, match_fun_distance),
                             mode = "left"))
#>   slope.x zone.x slope.y zone.y other slope.dist zone.dist
#> 1       1    Low     2.4    Low     a        1.4        NA
#> 2       2    Low     2.4    Low     a        0.4        NA
#> 3       3    Low     2.4    Low     a        0.6        NA
#> 4       4   High     2.4   High     b        1.6        NA
#> 5       4   High     6.2   High     c        2.2        NA
#> 6       5   High     2.4   High     b        2.6        NA
#> 7       5   High     6.2   High     c        1.2        NA
#> 8       6   High     2.4   High     b        3.6        NA
#> 9       6   High     6.2   High     c        0.2        NA

joined_result %>%
  group_by(slope.x, zone.x) %>%
  top_n(1, -slope.dist)
#> # A tibble: 6 x 7
#> # Groups:   slope.x, zone.x [6]
#>   slope.x zone.x slope.y zone.y other slope.dist zone.dist
#>     <int> <fct>    <dbl> <fct>  <fct>      <dbl>     <dbl>
#> 1       1 Low        2.4 Low    a          1.4          NA
#> 2       2 Low        2.4 Low    a          0.400        NA
#> 3       3 Low        2.4 Low    a          0.6          NA
#> 4       4 High       2.4 High   b          1.6          NA
#> 5       5 High       6.2 High   c          1.2          NA
#> 6       6 High       6.2 High   c          0.2          NA

Created on 2020-10-20 by the reprex package (v0.3.0)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...