Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
494 views
in Technique[技术] by (71.8m points)

regex - Fixing Catastrophic Backtracking in Regular Expression

The Problem

I'm using the following regular expression to check for valid file paths:

^(?:[a-zA-Z]:\|\\)([^\/:*?<>"|]+(\){0,1})+$

Using the test string V:Sample NamesLibrariesDeveloperLibDeveloperComDlgs es is recognized as valid. I can even add invalid characters to the beginning of the string without issues. However, when I add an invalid character towards the end of the string, the webpage freezes up from catastrophic backtracking.

What is causing this in this regex string?


Breaking Down the Regex

Full string: ^(?:[a-zA-Z]:\|\\)([^\/:*?<>"|]+(\){0,1})+$

First Group: (?:[a-zA-Z]:\|\\)

  • Checks for either
    • A capital or lowercase alphabetical letter followed by a colon and a backslash
    • A double backslash

Second Group: ([^\/:*?<>"|]+(\){0,1})

  • First Part: [^\/:*?<>"|]+
    • Makes sure there are no illegal characters ( / : * ? < > " | )
  • Second Part: (\){0,1}
    • Checks for a backslash between sections as many times as necessary

I think it may be the {0, 1} causing the issue since this allows for backtracking but I am not sure. Any thoughts?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Your current regex can be written as ^(?:[a-zA-Z]:\|\\)([^\/:*?<>"|]+\?)+$: pay attention at the ? quantifier (it is equal to {0,1} limiting quantifier) after \ inside a + quantified group.

Once such a pattern like (a+b?)+ is present inside a pattern, there is a high chance of a catastrophical backtracking. Everything is nice when there is a match, say, c:1234aaaaaaaaaaaaaaaaaaa is matched fine, but once a char that is not allowed appears causing a no-match, (try adding * at the end, c:1234aaaaaaaaaaaaaaaaaaa*), the issue will appear.

To solve this, the quantified subpatterns that can match the same text cannot follow one another in immediate succession. And using optional groups where each subpattern is obligatory enables this.

In most scenarios, you need to replace these pattern parts with unrolled a+(ba+)* (1 or more occurrences of a followed with 0 or more sequences of b (that is no longer optional by itself) and then 1 or more occurrences of a (so, between one a and the next a there must be a b). If you need to match an optional at the end (as ^(a+b?)+$ actually may match b at the end of the string), you need to add a b? at the end: a+(ba+)*b?.

So, translating this to your current scenario:

^(?:[a-zA-Z]:\|\\)[^\/:*?<>"|]+(?:\[^\/:*?<>"|]+)*$

or if the is allowed at the end:

^(?:[a-zA-Z]:\|\\)[^\/:*?<>"|]+(?:\[^\/:*?<>"|]+)*\?$
                     |      a+       (   b       a+       )* b?

See how it fails gracefully upon a no match, or matches as expected.

As @anubhava suggests, you can further enhance the performance by using possessive quantifiers (or atomic groups instead, since, e.g. .NET regex engine does not support possessives) that disallow any backtracking into the grouped patterns. Once matched, these patterns are not re-tried, thus, failure may come much quicker:

^(?:[a-zA-Z]:\|\\)[^\/:*?<>"|]+(?:\[^\/:*?<>"|]+)*+\?$
                                                            ^

or an atomic group example:

^(?:[a-zA-Z]:\|\\)(?>[^\/:*?<>"|]+(?:\[^\/:*?<>"|]+)*)\?$
                     ^^^                                       ^                          

Note that : is not a special regex metacharacter and should not be escaped. Inside a character class, only -, ^, and ] usually require escaping, all others are not special there either.

See more about catastrophical backtracking at The Explosive Quantifier Trap.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...