Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
637 views
in Technique[技术] by (71.8m points)

performance - Why does adding one more alternative make my regex over 600 times slower?

I noticed something weird while testing a simple Perl script that's supposed to filter out filenames beginning with certain prefixes.

Basically, I'm constructing a regex like this:

my $regex = join "|", map quotemeta, @prefixes;
$regex = qr/^($regex)/;   # anchor the regex and precompile it

Here, in the scenario I originally tested, @prefixes consists of 32-character hexadecimal strings. What I found is that everything ran nice and smoothly up to 6,552 prefixes — but as soon as I tried 6,553, the execution time of the script jumped by a factor of over 25 (from 1.8 seconds to 51 seconds)!

I played around with it, and constructed the following benchmark. I originally used 32-character hex strings, like in my original program, but found that if I cut the length of the strings down to just 8 characters, the threshold value moved higher — to 16,383, in fact — while the slowdown factor got dramatically higher yet: the regexp with 16,383 alternatives is almost 650 times slower than the one with only 16,382!

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

my $count = shift || 10;

our @items = map unpack("H8", pack "V", $_), 0..99999;

our $nA = 16382; our $reA = join "|", @items[1 .. $nA];
our $nB = 16383; our $reB = join "|", @items[1 .. $nB];

$_ = qr/^($_)/ for $reA, $reB;  # anchor and compile regex

my $results = timethese( $count, {
    $nA => q{ our (@items, $nA, $reA); $nA == grep /$reA/, @items or die; },
    $nB => q{ our (@items, $nB, $reB); $nB == grep /$reB/, @items or die; },
} );
cmpthese( $results );

Here are the results of running this benchmark on my laptop, using Perl (v5.18.2):

Benchmark: timing 10 iterations of 16382, 16383...
     16382:  2 wallclock secs ( 1.79 usr +  0.00 sys =  1.79 CPU) @  5.59/s (n=10)
     16383: 1163 wallclock secs (1157.28 usr +  2.70 sys = 1159.98 CPU) @  0.01/s (n=10)
      s/iter  16383  16382
16383    116     --  -100%
16382  0.179 64703%     --

Note the 64,703% speed difference.

My original hunch, based on the numerical coincidence that 6553 ≈ 216 / 10, was that this might've been some kind of an arbitrary limit within the Perl regex engine, or maybe that there there might be some kind of an array of 10-byte structs that was limited to 64 kB, or something. But the fact that the threshold number of alternatives depends on their length makes things more confusing.

(On the other hand, it's clearly not just about the length of the regex, either; the original regex with 6,553 32-byte alternatives was 2 + 6,553×33 = 216,251 bytes long, while the one with 16,383 8-byte alternatives is only 2 + 16,383×9 = 147,450 bytes long.)

What is causing this weird jump in regex matching time, and why does it happen at that specific point?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

For a long time, perl's TRIE optimization has not been applied where the initial compilation of the regex produces longjmp instead of jmp (I think) instructions (which depends on the number of alternations and the total lengths of the strings involved and what else is (earlier?) in the regex).

See the difference between:

perl -we'use re "debug"; qr/@{[join"|","a".."afhd"]}/'

and

perl -we'use re "debug"; qr/@{[join"|","a".."afhe"]}/'

You can break your alternation down into smaller chunks and precompile them separately and do e.g. (??{$rxa})|(??{$rxb})|(??{$rxc}).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...