Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Question marks in regular expressions

I'm reading the regular expressions reference and wondering about the ? and ?? quantifiers. Could you explain their usefulness with some examples? I don't understand them well enough.

Thank you.



1 Reply


This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.

? - Optional (greedy) quantifier

The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:

https?

This pattern will match both inputs, because it makes the s optional.
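As a quick check of the optional s, here is a minimal Python sketch using the standard re module. The list of test strings is made up for illustration, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# `s?` means "zero or one s", so the trailing s is optional.
pattern = re.compile(r"https?")

# Hypothetical test strings, chosen only for illustration.
candidates = ["http", "https", "httpss", "htt"]

# fullmatch requires the entire string to match the pattern.
results = {text: bool(pattern.fullmatch(text)) for text in candidates}

for text, ok in results.items():
    print(f"{text:8} {'match' if ok else 'no match'}")
```

Only http and https match; httpss has one s too many and htt is missing the p.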

?? - Optional (lazy) quantifier

?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).
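To make the "same pass/fail, different grouping" point concrete, here is a small Python sketch. The toy patterns and the input aaa are assumptions chosen for illustration; each pair differs only in whether the quantifier on Group 1 is greedy or lazy.

```python
import re

text = "aaa"

# Each pair is (greedy form, lazy form) of the same quantifier on Group 1.
pairs = [
    (r"(a?)(a*)", r"(a??)(a*)"),
    (r"(a*)(a*)", r"(a*?)(a*)"),
    (r"(a+)(a*)", r"(a+?)(a*)"),
]

groups = {}
for greedy, lazy in pairs:
    for pattern in (greedy, lazy):
        match = re.fullmatch(pattern, text)
        # Every pattern here matches; only the split between groups differs.
        groups[pattern] = match.groups() if match else None
        print(f"{pattern:12} {groups[pattern]}")
```

All six patterns match the input. The greedy forms put as many characters as they can into Group 1, while the lazy forms leave as many as they can for Group 2.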

Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:

Input:       
http123
https456
httpsomething

Expected result:
Pass/Fail  Group 1   Group 2
Pass       http      123
Pass       https     456
Pass       http      something

You try the first thing that comes to mind, which is this:

^(http)([a-z\d]+)$

Pass/Fail  Group 1   Group 2    Grouped correctly?
Pass       http      123        Yes
Pass       http      s456       No
Pass       http      something  Yes

They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.

Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:

^(https?)([a-z]+|\d+)$

Pass/Fail  Group 1   Group 2   Grouped correctly?
Pass       http      123       Yes
Pass       https     456       Yes
Pass       https     omething  No

Now the second input is fine, but the third one is grouped wrong, because ? is greedy by default (the + is too, but the ? comes first in the pattern). When deciding whether the s belongs to https? or to [a-z]+|\d+, if the result is a pass either way, the regex engine gives it to the greedy quantifier on the left. So Group 2 loses the s because Group 1 sucked it up.
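The greedy behavior can be reproduced with a short Python sketch. It assumes the pattern is written with \d for digits and anchored with ^ and $, and it uses the three sample inputs from the example.

```python
import re

# Greedy `s?`: Group 1 takes the s whenever the overall match still succeeds.
greedy = re.compile(r"^(https?)([a-z]+|\d+)$")

inputs = ["http123", "https456", "httpsomething"]

greedy_groups = {}
for text in inputs:
    match = greedy.match(text)
    greedy_groups[text] = match.groups() if match else None
    print(f"{text:15} {greedy_groups[text]}")
```

For httpsomething this yields https and omething: Group 1 gets the first chance at the s and keeps it, since the rest of the input still satisfies Group 2.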

To fix this, you make one tiny change:

^(https??)([a-z]+|\d+)$

Pass/Fail  Group 1   Group 2    Grouped correctly?
Pass       http      123        Yes
Pass       https     456        Yes
Pass       http      something  Yes

Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.
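Here is a minimal Python sketch of the lazy version, again assuming \d for digits and ^ and $ anchors. The helper name parse is hypothetical. The last lines drop the $ anchor to show where the "if you have to" comes from: without it, nothing forces the engine to give the s back to Group 1.

```python
import re

# Lazy `s??`: Group 1 skips the s first and takes it only if the match would fail otherwise.
LAZY = re.compile(r"^(https??)([a-z]+|\d+)$")


def parse(text):
    """Return (group1, group2) for a valid input, or None if it does not match."""
    match = LAZY.match(text)
    return match.groups() if match else None


lazy_groups = {text: parse(text) for text in ["http123", "https456", "httpsomething"]}
for text, parts in lazy_groups.items():
    print(f"{text:15} {parts}")

# Without the trailing $, the lazy match can stop right after the s.
unanchored = re.match(r"(https??)([a-z]+|\d+)", "https456").groups()
print(unanchored)
```

With the anchor, https456 splits into https and 456, because leaving the s to Group 2 would strand the digits before the end of the string. Without the anchor, the same input stops early with http and s.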

