Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
394 views
in Technique[技术] by (71.8m points)

java - When both halves of an OR regex group match, is it defined which will be chosen?

I've run the following code:

public static void main(String[] args) {
    Pattern pattern = Pattern.compile("(asd|asdf).*");
    Pattern pattern2 = Pattern.compile("(asdf|asd).*");
    Matcher m = pattern.matcher("asdf");
    Matcher m2 = pattern2.matcher("asdf");
    if (m.matches()) {
        System.out.println(m.group(1));
    }
    if (m2.matches()) {
        System.out.println(m2.group(1));
    }
}

And I get the following output:

asd

asdf

It seems as though the left hand side of the OR group is chosen in cases when both match. However, I haven't been able to find this behaviour documented. Does anyone know if the behaviour is defined?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In non-POSIX regex flavors (as in Java, as The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5), the first alternative is matched. In POSIX, the longest alternative is matched.

See what Perl help says about alternation:

To match dog or cat, we form the regexp dog|cat. As before, Perl will try to match the regexp at the earliest possible point in the string. At each character position, Perl will first try to match the first alternative, dog. If dog doesn't match, Perl will then try the next alternative, cat. If cat doesn't match either, then the match fails and Perl moves to the next position in the string.

See the Alternation with The Vertical Bar or Pipe Symbol at regular-expressions.info that describes NFA-compliant alternation behavior:

The order of the alternatives matters. Suppose you want to use a regex to match a list of function names in a programming language: Get, GetValue, Set or SetValue. The obvious solution is Get|GetValue|Set|SetValue.

The regex engine starts at the first token in the regex, G, and at the first character in the string, S. The match fails. However, the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second G in the regex. The match fails again. The next token is the first S in the regex. The match succeeds, and the engine continues with the next character in the string, as well as the next token in the regex. The next token in the regex is the e after the S that just successfully matched. e matches e. The next token, t matches t.

At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no other tokens in the regex outside the alternation, so the entire regex has successfully matched Set in SetValue.

And then:

But the POSIX standard does mandate that the longest match be returned, even when a regex-directed engine is used. Such an engine cannot be eager. It has to continue trying all alternatives even after a match is found, in order to find the longest one.

However, the order of alternatives may be irrelevant if the context on either side is strictly defined. If you use anchors, ^(asd|asdf)$, to match the full string, you will only get the one that corresponds to the right alternative.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...