Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
345 views
in Technique[技术] by (71.8m points)

.net - Why does this backreference not work inside a lookbehind?

Matching a repeated character in regex is simple with a backreference:

(.)1

Test it here.

However, I would like to match the character after the pair of characters, so I thought I could simply put this in a lookbehind:

(?<=(.)1).

Unfortunately, this doesn't match anything.

Why is that? In other flavours I wouldn't be surprised because there are strong restrictions on lookbehinds, but .NET usually supports arbitrarily complicated patterns inside lookbehinds.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The short version: Lookbehinds are matched from right to left. That means when the regex engine encounters the 1 it hasn't captured anything into that group yet, so the regex always fails. The solution is quite simple:

(?<=1(.)).

Test it here.

Unfortunately, the full story once you start using more complex patterns is a lot more subtle. So here is...

A guide to reading regular expressions in .NET

First, some important acknowledgements. The person who taught me that lookbehinds are matched from right to left (and figured this out on his own through a lot of experimentation), was Kobi in this answer. Unfortunately, the question I asked back then was a very convoluted example which doesn't make for a great reference for such a simple problem. So we figured it would make sense to make a new and more canonical post for future reference and as a suitable dupe target. But please consider giving Kobi an upvote for figuring out a very important aspect of .NET's regex engine that is virtually undocumented (as far as I know, MSDN mentions it in a single sentence on a non-obvious page).

Note that rexegg.com explains the inner workings of .NET's lookbehinds differently (in terms of reversing the string, the regex, and any potential captures). Although that wouldn't make a difference to the result of the match, I find that approach much harder to reason about, and from looking at the code it's fairly clear that this is not what the implementation actually does.

So. The first question is, why is it actually more subtle than the bolded sentence above. Let's try matching a character that is preceded by either a or A using a local case-insensitive modifier. Given the right-to-left matching behaviour, one might expect this to work:

(?<=a(?i)).

However, as you can see here this doesn't seem to use the modifier at all. Indeed, if we put the modifier in front:

(?<=(?i)a).

...it works.

Another example, that might be surprising with right-to-left matching in mind is the following:

(?<=2(.)(.)).

Does the 2 refer to the left or right capturing group? It refers to the right one, as this example shows.

A final example: when matched against abc, does this capture b or ab?

(?<=(b|a.))c

It captures b. (You can see the captures on the "Table" tab.) Once again "lookbehinds are applied from right to left" isn't the full story.

Hence, this post tries to be a comprehensive reference on all things regarding directionality of regex in .NET, as I'm not aware of any such resource. The trick to reading a complicated regex in .NET is doing so in three or four passes. All but the last pass are left-to-right, regardless of lookbehinds or RegexOptions.RightToLeft. I believe this is the case, because .NET processes these when parsing and compiling the regex.

First pass: inline modifiers

This is basically what the above example shows. If anywhere in your regex, you had this snippet:

...a(b(?i)c)d...

Regardless of where in the pattern that is or whether you're using the RTL option, c will be case-insensitive while a, b and d will not (provided they aren't affected by some other preceding or global modifier). That is probably the simplest rule.

Second pass: group numbers [unnamed groups]

For this pass you should completely ignore any named groups in the pattern, i.e. those of the form (?<a>...). Note this does not include groups with explicit numbers like (?<2>...) (which are a thing in .NET).

Capturing groups are numbered from left to right. It doesn't matter how complicated your regex is, whether you're using the RTL option or whether you nest dozens of lookbehinds and lookaheads. When you're only using unnamed capturing groups, they are numbered from left to right depending on the position of their opening parenthesis. An example:

(a)(?<=(b)(?=(.)).((c).(d)))(e)
└1┘    └2┘   └3┘  │└5┘ └6┘│ └7┘
                  └───4───┘

This gets a bit trickier when mixing unlabelled groups with explicitly numbered groups. You should still read all of these from left to right, but the rules are a bit trickier. You can determine the number of a group as follows:

  • If the group has an explicit number, its number is obviously that (and only that) number. Note that this may either add an additional capture to an already existing group number, or it may create a new group number. Also note that when you're giving explicit group numbers, they don't have to be consecutive. (?<1>.)(?<5>.) is a perfectly valid regex with group number 2 to 4 unused.
  • If the group is unlabelled, it takes the first unused number. Due to the gaps I just mentioned, this may be smaller than the maximum number that has already been used.

Here is an example (without nesting, for simplicity; remember to order them by their opening parentheses when they are nested):

(a)(?<1>b)(?<2>c)(d)(e)(?<6>f)(g)(h)
└1┘└──1──┘└──2──┘└3┘└4┘└──6──┘└5┘└7┘

Notice how the explicit group 6 creates a gap, then the group capturing g takes that unused gap between groups 4 and 6, whereas the group capturing h takes 7 because 6 is already used. Remember that there might be named groups anywhere in between these, which we're completely ignoring for now.

If you're wondering what the purpose of repeated groups like group 1 in this example is, you might want to read about balancing groups.

Third pass: group numbers [named groups]

Of course, you can skip this pass entirely if there are no named groups in the regex.

It's a little known feature that named groups also have (implicit) group numbers in .NET, which can be used in backreferences and substitution patterns for Regex.Replace. These get their numbers in a separate pass, once all the unnamed groups have been processed. The rules for giving them numbers are as follows:

  • When a name appears for the first time, the group gets the first unused number. Again, this might be a gap in the used numbers if the regex uses explicit numbers, or it might be one greater than the greatest group number so far. This permanently associates this new number with the current name.
  • Consequently, when a name appears again in the regex, the group will have the same number that was used for that name the last time.

A more complete example with all three types of groups, explicitly showing passes two and three:

         (?<a>.)(.)(.)(?<b>.)(?<a>.)(?<5>.)(.)(?<c>.)
Pass 2:  │     │└1┘└2┘│     ││     │└──5──┘└3┘│     │
Pass 3:  └──4──┘      └──6──┘└──4──┘          └──7──┘

Final pass: following the regex engine

Now that we know which modifiers apply to which tokens and which groups have which numbers, we finally get to the part that actually corresponds to the execution of the regex engine, and where we start going back and forth.

.NET's regex engine can process regex and string in two directions: the usual left-to-right mode (LTR) and its unique right-to-left mode (RTL). You can activate RTL mode for the entire regex with RegexOptions.RightToLeft. In that case, the engine will start trying to find a match at the end of the string and will go left through the regex and the string. For example, the simple regex

a.*b

Would match a b, then it would try to match .* to the left of that (backtracking as necessary) such that there's an a somewhere to the left of it. Of course, in this simple example, the result between LTR and RTL mode is identical, but it helps to make a conscious effort to follow the engine in its backtracking. It can make a difference for something as simple as ungreedy modifiers. Consider the regex

a.*?b

instead. We're trying to match axxbxxb. In LTR mode, you get the match axxb as expected, because the ungreedy quantifier is satisfied with the xx. However, in RTL mode, you'd actually match the entire string, since the first b is found at the end of the string, but then .*? needs to match all of xxbxx for a to match.

And clearly it also makes a difference for backreferences, as the example in the question and at the top of this answer shows. In LTR mode we use (.)1 to match repeated characters and in RTL mode we use 1(.), since we need to make sure that the regex engine encounters the capture before it tries to reference it.

With that in mind, we can view lookarounds in a new light. When the regex engine encounters a lookbehind, it processes it as follows:

  • It remembers its current position x in the target string as well as its current processing direction.
  • Now it enforces RTL mode, regardless of the mode it's currently in.
  • Then the contents of the lookbehind are matched from right to left, starting from the current position x.
  • Once the lookbehind is processed completely, if it passed, the position of the regex engine resets to position x and the original processing direction is restored.

While a lookahead seems a lot more innocuous (since we almost never encounter problems like the one in the question with them), its behaviour is actually virtually the same, except that it enforces LTR mode. Of course in most patterns which are LTR only, this is never noticed. But if the regex itself is matched in RTL mode, or we're doing something as crazy as putting a lookahead inside a lookbehind, then the lookahead will change the processing direction just like the lookbehind does.

So how should you actually read a regex that does funny stuff like this? The first step is to split it into separate components, which are usually individual tokens together with their relevant quantifiers. Then depending on whether the regex is LTR or RTL, start going from top to bottom or bottom to top, re


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...