Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
2.4k views
in Technique[技术] by (71.8m points)

r - Imperfect string match using data.table

Ok, so I posted a question a while back concerning writing an R function to accelerate string matching of large text files. I had my eyes opened to 'data.table' and my question was answered perfectly.

This is the link to that thread which includes all of the data and details:

Accelerate performance and speed of string match in R

But now I am running into another problem. Once in a while, the submitted VIN#s (in the 'vinDB' file) differ by one or two characters in the 'carFile' file due to human error when they fill out their car info at the DMV. Is there a way to edit the

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

line of that code (provided by @BrodieG in the above link) to allow for a recognition of VIN#s that differ by one or two characters?

I apologize if this is an easy correction. I am just overwhelmed by the power of the 'data.table' package in R and would love to learn as much as I can of its utility, and the knowledgable members of this forum have been absolutely pivotal to me.

**EDIT:

So I have been playing around with using 'lapply' and the 'agrep' functions as suggested and I must be doing something wrong:

I tried replacing this line:

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

with this:

dt <- dt[lapply(vin.vins, function(x) agrep(x,car.vins, max.distance=2)), list(NumTimesFound=.N), vin.names, allow.cartesian=TRUE]

But got the following error:

Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins,  : 
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'. 
Character columns must join to factor or character columns.

But they are both type 'chr'. Does anyone know why I am getting this error? And am I thinking about this the right way, ie: am I using lapply correctly here?

Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I finally got it.

The agrep-function has a value-option that needs to be altered from FALSE (default) to TRUE:

dt <- dt[lapply(car.vins, agrep, x = vin.vins, max.distance = c(cost=2, all=2), value = TRUE)
         , .(NumTimesFound = .N)
         , by = vin.names]

Note: the max.distance parameters can be altered based on Levenshtein distance, substitutions, deletions, etc. 'agrep' is a fascinating function!

Thanks again for all the help!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...