Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

java - get unique regex matcher results (without using maps or lists)

Is there a way to get only the unique matches? I don't want to filter the results through a list or a map after matching; I want the matcher output itself to be unique.

Sample input/output:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "This is a question from [userName] about finding unique regex matches for [inputString] without using any lists or maps. -[userName].";
// Match a bracketed token such as [userName]; backslashes are doubled inside a Java string literal.
Pattern pattern = Pattern.compile("\\[[^\\[\\]]*\\]");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String tokenName = matcher.group(0);
    System.out.println(tokenName);
}

This will output the following:

[userName]
[inputString]
[userName]

But I want it to output the following:

[userName]
[inputString]


1 Reply


Yes, there is. You can combine a negative lookahead and a backreference:

"(\\[[^\\[\\]]*\\])(?!.*\\1)"

This matches only if the text captured by your actual pattern does not occur again later in the string. Effectively, you always get the last occurrence of each distinct match, so the results come out in a different order:

[inputString]
[userName]
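To make that concrete, here is a minimal runnable sketch of the pattern applied to the sample input from the question. The class name `LastOccurrenceMatches` and the helper `lastOccurrences` are made up for this illustration; the results are joined with a `StringBuilder` only so that the example stays free of lists and maps.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastOccurrenceMatches {
    // Hypothetical helper: returns every distinct token once (its last occurrence), one per line.
    static String lastOccurrences(String input) {
        // Group 1 is the original token pattern. The negative lookahead rejects
        // a match if the same text (\1) appears again later in the input.
        Pattern pattern = Pattern.compile("(\\[[^\\[\\]]*\\])(?!.*\\1)");
        Matcher matcher = pattern.matcher(input);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(matcher.group(1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "This is a question from [userName] about finding unique regex matches "
                + "for [inputString] without using any lists or maps. -[userName].";
        // Prints [inputString] and then [userName], each on its own line.
        System.out.println(lastOccurrences(input));
    }
}
```

The first `[userName]` is skipped because the lookahead finds the same text near the end of the string, which is why the order differs from the order of first appearance.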

If the order is a problem for you (i.e. if it's crucial to order them by first occurrence), you won't be able to do this using regex only. You would need a variable-length lookbehind for that, and that is not supported by Java.
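If first-occurrence order does matter, one possible workaround is to leave the pattern as it was and add a small check in the loop. This is a sketch and not a pure-regex solution: it keeps a match only when `String.indexOf` reports that the same text does not appear earlier in the input. The names `FirstOccurrenceMatches` and `firstOccurrences` are hypothetical. It also assumes that every textual occurrence of a token is itself a regex match, which holds for this bracket pattern but not necessarily for arbitrary patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstOccurrenceMatches {
    // Hypothetical helper: returns every distinct token once, in order of first appearance.
    static String firstOccurrences(String input) {
        Matcher matcher = Pattern.compile("\\[[^\\[\\]]*\\]").matcher(input);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String token = matcher.group();
            // Keep the match only if no identical text occurs earlier in the input.
            if (input.indexOf(token) == matcher.start()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "This is a question from [userName] about finding unique regex matches "
                + "for [inputString] without using any lists or maps. -[userName].";
        // Prints [userName] and then [inputString], each on its own line.
        System.out.println(firstOccurrences(input));
    }
}
```

Each `indexOf` call rescans the input from the start, so this trades some speed on large inputs for not needing a list or a map.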


Some notes on a general solution

Note that this will work with any pattern whose matches are of non-zero width. The general solution is simply:

(yourPatternHere)(?!.*\1)

(I left out the double backslash, because that only applies to a few languages.)

If you want it to work with patterns that have zero-width matches (because you only want to know a position and are using lookarounds only for some reason), you could do this:

(zeroWidthPatternHere)(?!.+\1)

Also, note that (generally) you might have to use the "singleline" or "dotall" option, if your input may contain linebreaks (otherwise the lookahead will only check in the current line). If you cannot or do not want to activate that (because you have a pattern that includes periods which should not match line breaks; or because you use JavaScript), this is the general solution:

(yourPatternHere)(?![\s\S]*\1)
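The effect of line breaks can be seen in a small Java sketch. It runs three variants against a three-line input: the plain pattern, the same pattern compiled with `Pattern.DOTALL`, and the `[\s\S]` form. The class name `MultilineLastOccurrence`, the helper `matches` and the sample input are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineLastOccurrence {
    // Hypothetical helper: joins all group-1 matches with commas.
    static String matches(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(matcher.group(1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "[a] first line\n[b] second line\n[a] third line";

        // Without DOTALL the dot stops at the line break, so the lookahead
        // never sees the second [a] and the duplicate slips through.
        Pattern plain = Pattern.compile("(\\[[^\\[\\]]*\\])(?!.*\\1)");
        // DOTALL lets the dot match line terminators as well.
        Pattern dotAll = Pattern.compile("(\\[[^\\[\\]]*\\])(?!.*\\1)", Pattern.DOTALL);
        // [\s\S] matches any character regardless of flags.
        Pattern charClass = Pattern.compile("(\\[[^\\[\\]]*\\])(?![\\s\\S]*\\1)");

        System.out.println(matches(plain, input));     // [a],[b],[a]
        System.out.println(matches(dotAll, input));    // [b],[a]
        System.out.println(matches(charClass, input)); // [b],[a]
    }
}
```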

And to make this answer even more widely applicable, here is how you could match only the first occurrence of every match (in an engine with variable-length lookbehinds, like .NET):

(yourPatternHere)(?<!\1.*\1)
or
(yourPatternHere)(?<!\1[\s\S]*\1)
