java - Is regex too slow? Real life examples where simple non-regex alternative is better

Question

Welcome To Ask or Share your Answers For Others

java - Is regex too slow? Real life examples where simple non-regex alternative is better

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

java - Is regex too slow? Real life examples where simple non-regex alternative is better

I've seen people here made comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.

I haven't really used regex in industrial setting, so I'm curious if there are applications where regex is demonstratably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.

Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.

What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.

Note: I would very much appreciate answers that demonstrate the following:

a beginner-level regex solution to a non-toy real-life problem that performs horribly
the simple non-regex solution
the expert-level regex rewrite that performs comparably

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T02:50:09+0000

I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.

The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.

In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).

1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match

We use a regex containing exactly 11 commas:

".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"

This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)

A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

So we replace the greedy quantifier with the reluctant one:

".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"

This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.

A performant regex replaces the dot by the character class "[^,]":

"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"

(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)

Categories

java - Is regex too slow? Real life examples where simple non-regex alternative is better

java - Is regex too slow? Real life examples where simple non-regex alternative is better

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags