Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - Regex only capturing last instance of capture group in match

I have the following regular expression, which produces the same odd result in two different languages (JavaScript and Flash). What I want to know is not how to fix it, but why this behavior occurs.

The Regular Expression:

\[(\\{2}|\\\]|[^\]])*\]

The goal here is to match a bracketed string, and ensure that I don't stop at an escaped bracket.

If I have the text input [abcdefg], it is correctly matched, but the only thing returned as part of the capture group is g, whereas I expect abcdefg. If I change the expression to \[((?:\\{2}|\\\]|[^\]])*)\], then I get the result that I want.

So why is this happening? Will this be consistent across other languages?

Note: Simplifying the expression to \[([^\]])*\] produces the same issue.



1 Reply


Regardless of the problem, ActionScript and JavaScript should always yield the same results, as they both implement ECMAScript (or a superset thereof, but for regular expressions they should not disagree).

But yes, this will happen in any language (or rather any regex flavor). The reason is that you are repeating the capturing group. Let's take a simpler example: match (.)* against abc. What we are repeating is (.). The first time it is tried, the engine enters the group, matches a with ., leaves the group and captures a. Only now does the quantifier kick in and repeat the whole thing. So we enter the group again, and match and capture b. This capture overwrites the previous one, so \1 now contains b. The same happens on the third repetition: the capture is overwritten with c.
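The overwriting described above can be observed directly. This is a minimal JavaScript sketch; the variable names are only illustrative:

```javascript
// Repeating a capturing group: each iteration overwrites the previous capture.
const repeated = /(.)*/.exec("abc");
console.log(repeated[0]); // whole match: "abc"
console.log(repeated[1]); // capture from the last iteration only: "c"

// Making the repeated group non-capturing and wrapping the repetition
// in a new group captures the whole repeated text instead.
const wrapped = /((?:.)*)/.exec("abc");
console.log(wrapped[1]); // "abc"
```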

I don't know of a regex flavor that behaves differently. The only one I know of that lets you access all previous captures (instead of just the last one) is .NET.

The solution is the one p.s.w.g. proposed. Make the grouping you need for the repetition non-capturing (this will improve performance, because you don't need all that capturing and overwriting anyway) and wrap the whole thing in a new group. Your expression has one little flaw though: you need to include the backslash in the negated character class. Otherwise, backtracking could give you a match in [abc\]. So here is an expression that will work as you expect:

\[((?:\\{2}|\\\]|[^\]\\])*)\]

Working demo. (unfortunately, it doesn't show the captures, but it shows that it gives correct matches in all cases)
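Since the demo does not show the captures, here is a small JavaScript sketch that does. It compares the expression without the backslash in the negated class (called flawed here) against the corrected one (strictFixed); both names are only illustrative:

```javascript
// Without the backslash in the negated class, [^\]] can match a lone "\".
const flawed = /\[((?:\\{2}|\\\]|[^\]])*)\]/;
// With the backslash excluded, a "\" can only be consumed as "\\" or "\]".
const strictFixed = /\[((?:\\{2}|\\\]|[^\]\\])*)\]/;

// The input is [abc\] : the closing bracket is escaped, so nothing should match.
const unterminated = "[abc\\]";
const flawedResult = flawed.exec(unterminated);
console.log(flawedResult && flawedResult[1]); // backtracking still finds a match
console.log(strictFixed.exec(unterminated));  // null

// The input is [abc\]def] : the escaped bracket is part of the content.
const escaped = strictFixed.exec("[abc\\]def]");
console.log(escaped[1]);
```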

Note that your expression does not allow for other escape sequences. In particular, a single \ followed by anything but a ] will cause your pattern to fail. If this is not what you desire, you can just use:

\[((?:\\.|[^\]\\])*)\]

Working demo.
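To illustrate the difference, the following JavaScript sketch runs both expressions against an input containing a different escape sequence. The names strictPattern and generalPattern are only illustrative:

```javascript
// Only allows the escapes "\\" and "\]".
const strictPattern = /\[((?:\\{2}|\\\]|[^\]\\])*)\]/;
// Allows a backslash followed by any character.
const generalPattern = /\[((?:\\.|[^\]\\])*)\]/;

// The input is [a\nb] with a literal backslash followed by the letter n.
const otherEscape = "[a\\nb]";
console.log(strictPattern.exec(otherEscape)); // null: "\n" is not an allowed escape
const generalResult = generalPattern.exec(otherEscape);
console.log(generalResult[1]); // the content including the backslash and the n
```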

Performance can further be improved with the "unrolling-the-loop" technique:

\[([^\]\\]*(?:\\.[^\]\\]*)*)\]

Working demo.
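As a sanity check, this JavaScript sketch confirms that the unrolled expression captures the same text as the alternation-based one on a few sample inputs. It only checks equivalence and does not measure performance; captureOf and the sample list are hypothetical helpers made up for this example:

```javascript
const alternation = /\[((?:\\.|[^\]\\])*)\]/;
const unrolled = /\[([^\]\\]*(?:\\.[^\]\\]*)*)\]/;

// Hypothetical helper: returns the first capture group, or null if no match.
function captureOf(re, input) {
  const m = re.exec(input);
  return m ? m[1] : null;
}

// Sample inputs (backslashes are doubled inside JavaScript string literals).
const samples = [
  "[abcdefg]",   // no escapes
  "[a\\]b]",     // escaped closing bracket
  "[a\\\\]",     // escaped backslash before the closing bracket
  "[a\\nb]",     // some other escape sequence
  "[abc\\]",     // closing bracket is escaped, so no match
];

const results = samples.map((s) => ({
  input: s,
  alternation: captureOf(alternation, s),
  unrolled: captureOf(unrolled, s),
}));
for (const r of results) {
  console.log(r.input, "=>", r.alternation, "|", r.unrolled);
}
```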

