Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - R: fastest way to extract all substrings contained between two substrings

I am looking for an efficient way to extract all matches between two substrings in a character string. For example, say I want to extract all substrings contained between the string

start="strt"

and

stop="stp"
in the string
x="strt111stpblablastrt222stp"

I would like to get the vector

"111" "222"

What is the most efficient way to do this in R? Using a regular expression perhaps? Or are there better ways?

1 Reply

For something this simple, base R works just fine.

You can switch on PCRE with perl=TRUE and use lookaround assertions. (Spell out TRUE rather than T, since T is an ordinary variable that can be reassigned.)

x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=TRUE))[[1]]
# [1] "111" "222"

Explanation:

(?<=          # look behind to see if there is:
  strt        #   'strt'
)             # end of look-behind
.*?           # any character except \n (0 or more times, as few as possible)
(?=           # look ahead to see if there is:
  stp         #   'stp'
)             # end of look-ahead
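
The same pattern can be built for arbitrary delimiters. The following is only a sketch: extract_between is a hypothetical helper name, not a base R function, and it assumes the delimiters do not themselves contain the sequence \E. It wraps each delimiter in PCRE's \Q...\E quoting so that regex metacharacters in the delimiters are treated literally.

```r
# Hypothetical helper (not part of base R): extract every substring of x
# that lies between the literal markers `start` and `stop`.
extract_between <- function(x, start, stop) {
  # \Q...\E makes PCRE treat the markers as literal text.
  # Assumption: neither marker contains the sequence \E.
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", stop, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

extract_between("strt111stpblablastrt222stp", "strt", "stp")
# [1] "111" "222"

# Markers containing regex metacharacters are matched literally.
extract_between("[a]1[/a]x[a]2[/a]", "[a]", "[/a]")
# [1] "1" "2"
```

The [[1]] means the helper handles a single string; dropping it would return a list with one element per input string.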

EDIT: Updated the answers below to use the new syntax.

You may also consider using the stringi package.

library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"

Another option is rm_between from the qdapRegex package.

library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"
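
Since the question asks which approach is fastest, one rough way to check on your own data is to time the calls. This is only a sketch: the input size (1000 repetitions) and the repeat count (10) are arbitrary assumptions, and no timings are shown because they depend on the machine and on package versions. The stringi part runs only if that package is installed.

```r
# Rough timing sketch; input size and repeat count are arbitrary choices.
x_long <- paste(rep("strt111stpblablastrt222stp", 1000), collapse = "")
pattern <- "(?<=strt).*?(?=stp)"

# Base R: gregexpr + regmatches with PCRE enabled.
base_time <- system.time(
  for (i in 1:10) {
    res_base <- regmatches(x_long, gregexpr(pattern, x_long, perl = TRUE))[[1]]
  }
)
print(base_time)

# stringi, only if the package is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  stringi_time <- system.time(
    for (i in 1:10) {
      res_stri <- stringi::stri_extract_all_regex(x_long, pattern)[[1]]
    }
  )
  print(stringi_time)
  # Both approaches should return the same matches.
  stopifnot(all(res_base == res_stri))
}
```

For a more careful comparison, a dedicated benchmarking package and inputs that resemble your real strings would give more reliable numbers than a single system.time call.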
