Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Non-greedy string regular expression matching

I'm pretty sure I'm missing something obvious here, but I cannot get R to use non-greedy regular expressions:

> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")                                         
     [,1]   
[1,] "aaaab"

Base functions behave the same way:

> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

I would expect the match to be just ab, as per the comment on greedy matching in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)

Could someone please explain to me what's going on?

Update. What's crazy is that in some other cases non-greedy patterns behave as expected:

> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
     [,1]                                          
[1,] "<a href="abc">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
     [,1]              
[1,] "<a href="abc">"


1 Reply


Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.

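The string is searched from left to right, and the match that is returned is the one that starts furthest to the left. Yes, all of the strings aaaab, aaab, aab, and ab match your pattern, but aaaab starts earliest in the input, so it is the one that is returned. The non-greedy quantifier only affects how far a match extends from a given start position; it does not change where the match starts.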
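
To make the leftmost rule visible, here is a small base R sketch (not part of the original answer). It tries the pattern anchored at every start position, then runs the ordinary search. The variable names are my own, and perl = TRUE is an assumption made so that the lazy quantifier is handled by PCRE.

```r
# Base R only; perl = TRUE (an assumption) selects the PCRE engine.
x <- "xxx aaaab yyy"
pattern <- "a.*?b"

# Try the pattern anchored at every possible start position of x.
starts <- seq_len(nchar(x))
anchored <- vapply(starts, function(i) {
  tail_of_x <- substring(x, i)
  m <- regexpr(paste0("^", pattern), tail_of_x, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(tail_of_x, m)
}, character(1))

# Every substring that matches when the start position is forced.
candidates <- anchored[!is.na(anchored)]
print(candidates)

# The ordinary, unanchored search picks the earliest start position.
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(first_match)
```

The candidates are one match per start position that works, and the unanchored search returns the candidate with the earliest start.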

So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:

str_match('xxx aaaab yyy', "a.*?y") 
#      [,1]     
# [1,] "aaaab y"

Here the strings aaaab y, aaaab yy, and aaaab yyy all match the pattern and all start at the same position, but the shortest one is returned because the quantifier is non-greedy.
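
For comparison, a quick base R sketch (my own addition, assuming perl = TRUE for PCRE semantics) that runs the lazy and the greedy version of that pattern side by side. Both start at the same position; only the end of the match differs.

```r
# Base R only; variable names are illustrative.
x <- "xxx aaaab yyy"

lazy_pos   <- regexpr("a.*?y", x, perl = TRUE)  # lazy quantifier
greedy_pos <- regexpr("a.*y",  x, perl = TRUE)  # greedy quantifier

lazy   <- regmatches(x, lazy_pos)
greedy <- regmatches(x, greedy_pos)

# Same start position for both, different match lengths.
print(c(lazy_start = as.integer(lazy_pos), greedy_start = as.integer(greedy_pos)))
print(lazy)
print(greedy)
```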


So what can you do to catch that last ab? Use this:

str_match('xxx aaaab yyy', ".*(a.*b)")
#      [,1]        [,2]
# [1,] "xxx aaaab" "ab"

How does it work? The greedy .* at the front consumes as much of the string as it can while still letting the rest of the pattern match, which forces the captured group to start at the last possible a.
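
If you prefer base R, the same idea can be sketched with regexec (again assuming perl = TRUE). The second part is a different technique, not the one from the answer above: it excludes a from the middle of the match, which assumes that what you want is an a followed by b with no other a in between.

```r
# Base R only; variable names are illustrative.
x <- "xxx aaaab yyy"

# Same approach as above: greedy prefix, then read the captured group.
m <- regexec(".*(a.*b)", x, perl = TRUE)
captured <- regmatches(x, m)[[1]]  # element 1: full match, element 2: group
print(captured)

# Alternative (assumption about intent): forbid another a inside the match.
alt <- regmatches(x, regexpr("a[^a]*?b", x, perl = TRUE))
print(alt)
```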

