I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.
The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.
In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).
1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match
We use a regex containing exactly 11 commas:
".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"
This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)
A simple non-regex solution would be to split()
each line and compare the 6th element. (Much faster: 9 microseconds per string.)
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
So we replace the greedy quantifier with the reluctant one:
".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"
This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.
A performant regex replaces the dot by the character class "[^,]":
"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"
(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)