Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
708 views
in Technique[技术] by (71.8m points)

regex - latest Perl won't match certain regexes more than 32768 characters long

I am hoping some Perl gurus can opine on the following. This is the smallest possible example I could find that reproduces my problem:

>./perl -e 'print (("a".("f"x32767)."a") =~ /a(?:[^a]|bb)*a/)'
1

but

>./perl -e 'print (("a".("f"x32768)."a") =~ /a(?:[^a]|bb)*a/)'
>

and I did compile the latest Perl from source, just to see if it would fix the problem:

>./perl -v

This is perl 5, version 20, subversion 1 (v5.20.1) built for i686-linux

Is this a bug (looks like to me)?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This is a known bug reported since 2002 and it has yet to be fixed. You now know you are not the first person to encounter this bug (or feature, as you will see soon).

From this comment in the bug report, it seems that quantifiers (*, +, {n,m}, {n,}) are designed to have an upper limit on the number of repetitions, which prevents the engine from segfault when the stack used for backtracking overflows, but violates the very definition of Kleene operator in regular expression (repeat the pattern for arbitrary number of times) and gives wrong answer for the query1.

1 In contrast, Java's regex engine (Oracle's implemetation) simply allow StackOverflowError to occur for cases like this, but the quantifier has the upper limit of 232 - 1, which is sufficient for most use case. And there exists a workaround for cases like this, which is to use possessive quantifier.

The same comment also print the regex compilation debugging info, and the output clearly shows that * is translated to {0,32767}. It is also reproducible on my machine (perl v5.10.1 (*) built for x86_64-linux-thread-multi).

$ perl -Mre=debug -wce '/(A|B)*/'
Compiling REx "(A|B)*"
Final program:
   1: CURLYM[1] {0,32767} (15)
   5:   TRIE-EXACT[AB] (13)
        <A>
        <B>
  13:   SUCCEED (0)
  14: NOTHING (15)
  15: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*"

This following test further confirms the problem, and it shows that perl doesn't let you specify a repetition that exceeds the limit.

$ perl -e 'print (("a".("f"x32767)."a") =~ /a(?:[^a]|bb){0,32767}a/)'
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/a(?:[^a]|bb){ <-- HERE 0,32767}a/ at -e line 1.

Making the quantifier possessive *+ does not solve the problem, since the limit is still there:

$ perl -Mre=debug -wce '/(A|B)*+/'
Compiling REx "(A|B)*+"
Final program:
   1: SUSPEND (19)
   3:   CURLYM[1] {0,32767} (17)
   7:     TRIE-EXACT[AB] (15)
          <A>
          <B>
  15:     SUCCEED (0)
  16:   NOTHING (17)
  17:   SUCCEED (0)
  18: TAIL (19)
  19: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*+"

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...