Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - R grep: Match one string against multiple patterns

In R, grep usually matches a vector of multiple strings against one regexp.

Q: Is it possible to match a single string against multiple regexps, without looping through each regexp pattern?

Some background:

I have 7000+ keywords that serve as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in column 1; the numbers indicate the categories each keyword belongs to):

ab  10  37  41
abbrach*    38
abbreche    39
abbrich*    39
abend*  37
abendessen* 60  63
aber    20  23  45
abermals    37

Concatenating that many keywords with "|" is not feasible (and I wouldn't know which of the keywords generated the hit). Simply swapping "patterns" and "strings" does not work either, because the patterns contain truncations (the trailing *), which would not work the other way round.

[related question, other programming language]


1 Reply


What about applying the regexpr function over a vector of keywords?

keywords <- c("dog", "cat", "bird")

strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

sapply(keywords, regexpr, strings, ignore.case=TRUE)

     dog cat bird
[1,]  15  -1   -1
[2,]  -1   4   15
[3,]  -1  -1   -1

sapply(keywords, regexpr, strings[1], ignore.case=TRUE)

 dog  cat bird 
  15   -1   -1 

Values returned are the position of the first character in the match, with -1 meaning no match.

If the position of the match is irrelevant, use grepl instead:

sapply(keywords, grepl, strings, ignore.case=TRUE)

       dog   cat  bird
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE
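
To see which keywords produced the hit for one particular string (the concern raised in the question), the named logical vector can be indexed directly. A minimal sketch; `text` and `hits` are illustrative names that do not appear in the original:

```r
keywords <- c("dog", "cat", "bird")

# A single example string to test against every keyword
text <- "My cat chased a bird."

# One logical per keyword; sapply names the result after the keywords
hits <- sapply(keywords, grepl, text, ignore.case = TRUE)

# Names of the keywords that matched
matched <- names(which(hits))
matched
```

Here `matched` contains "cat" and "bird", so each hit can be traced back to the keyword that caused it.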

Update: This runs relatively quickly on my system, even with a large number of keywords:

# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936

system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))

   user  system elapsed 
  7.495   0.155   7.596 

dim(matches)
[1]      3 234936
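
The dictionary in the question marks truncation with a trailing `*`, so each entry has to be turned into a regular expression before it can be passed to grepl. The following is only a rough sketch under assumptions the original does not state: `*` appears only at the end of an entry, keywords should match whole words, entries contain no other regex metacharacters, and the dictionary has been read into a named list mapping each keyword to its categories. The names `dict`, `to_regex`, `patterns` and `categories_for` are hypothetical.

```r
# A few entries from the question's dictionary: keyword -> category numbers
dict <- list(
  "ab"          = c(10, 37, 41),
  "abend*"      = 37,
  "abendessen*" = c(60, 63),
  "aber"        = c(20, 23, 45)
)

# Turn a dictionary entry into a regex (assumes '*' occurs only at the end):
#   "abend*" -> "\\babend\\w*"  (any word starting with "abend")
#   "aber"   -> "\\baber\\b"    (the whole word "aber")
to_regex <- function(keyword) {
  if (grepl("\\*$", keyword)) {
    paste0("\\b", sub("\\*$", "", keyword), "\\w*")
  } else {
    paste0("\\b", keyword, "\\b")
  }
}

# Named character vector: names are the original keywords
patterns <- vapply(names(dict), to_regex, character(1))

# Return the dictionary entries (keyword plus categories) that hit one string
categories_for <- function(text) {
  hits <- vapply(patterns, grepl, logical(1), text,
                 ignore.case = TRUE, perl = TRUE)
  dict[hits]
}

result <- categories_for("Aber das Abendessen war gut")
names(result)
```

For this example string, the entries `abend*`, `abendessen*` and `aber` are returned with their category numbers, while `ab` is not, because it only matches as a whole word. With non-ASCII keywords (for example German umlauts), the behaviour of `\\b` and `\\w` may depend on locale and encoding, so this would need checking against the real dictionary.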
