Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
91 views
in Technique[技术] by (71.8m points)

python - Why do some regex engines match .* twice in a single input string?

Many regex engines match .* twice in a single-line string, e.g., when performing regex-based string replacement:

  • The 1st match is - by definition - the entire (single-line) string, as expected.
  • In many engines there is a 2nd match, namely the empty string; that is, even though the 1st match has consumed the entire input string, .* is matched again, which then matches the empty string at the end of the input string.

    • Note: To ensure that only one match is found, use ^.*

My questions are:

  • Is there a good reason for this behavior? Once the input string has been consumed in full, I wouldn't expect another attempt to find a match.

  • Other than trial and error, can you glean from the documentation / regex dialect/standard supported which engines exhibit this behavior?

Update: revo's helpful answer explains the how of the current behavior; as for the potential why, see this related question.

Languages/platforms that DO exhibit the behavior:

 # .NET, via PowerShell (behavior also applies to the -replace operator)
 PS> [regex]::Replace('a', '.*', '[$&]'
 [a][]  # !! Note the *2* matches, first the whole string, then the empty string

 # Node.js
 $ node -pe "'a'.replace(/.*/g, '[$&]')"
 [a][]

 # Ruby
 $ ruby -e "puts 'a'.gsub(/.*/, '[\0]')"
 [a][]

 # Python 3.7+ only
 $ python -c "import re; print(re.sub('.*', '[g<0>]', 'a'))"
 [a][] 

 # Perl 5
 $ echo a | perl -ple 's/.*/[$&]/g'
 [a][] 

 # Perl 6
 $ echo 'a' | perl6 -pe 's:g/.*/[$/]/'
 [a][]

 # Others?

Languages/platforms that do NOT exhibit the behavior:

# Python 2.x and Python 3.x <= 3.6
$ python -c "import re; print(re.sub('.*', '[g<0>]', 'a'))"
[a]  # !! Only 1 match found.

# Others?

bobble bubble brings up some good related points:

If you make it lazy like .*?, you'd even get 3 matches in some and 2 matches in others. Same with .??. As soon as we use a start anchor, I thought we should get only one match, but interestingly it seems ^.*? gives two matches in PCRE for a, whereas ^.* should result in one match everywhere.


Here's a PowerShell snippet for testing the behavior across languages, with multiple regexes:

Note: Assumes that Python 3.x is available as python3 and Perl 6 as perl6.
You can paste the whole snippet directly on the command line and recall it from the history to modify the inputs.

& {
  param($inputStr, $regexes)

  # Define the commands as script blocks.
  # IMPORTANT: Make sure that $inputStr and $regex are referenced *inside "..."*
  #            Always use "..." as the outer quoting, to work around PS quirks.
  $cmds = { [regex]::Replace("$inputStr", "$regex", '[$&]') },
          { node -pe "'$inputStr'.replace(/$regex/g, '[$&]')" },
          { ruby -e "puts '$inputStr'.gsub(/$regex/, '[\0]')" },
          { python -c "import re; print(re.sub('$regex', '[g<0>]', '$inputStr'))" },
          { python3 -c "import re; print(re.sub('$regex', '[g<0>]', '$inputStr'))" },
          { "$inputStr" | perl -ple "s/$regex/[$&]/g" },
          { "$inputStr" | perl6 -pe "s:g/$regex/[$/]/" }

  $regexes | foreach {
    $regex = $_
    Write-Verbose -vb "----------- '$regex'"
    $cmds | foreach { 
      $cmd = $_.ToString().Trim()
      Write-Verbose -vb ('{0,-10}: {1}' -f (($cmd -split '|')[-1].Trim() -split '[ :]')[0], 
                                           $cmd -replace '$inputStr', $inputStr -replace '$regex', $regex)
      & $_ $regex
    }
  }

} -inputStr 'a' -regexes '.*', '^.*', '.*$', '^.*$', '.*?'

Sample output for regex ^.*, which confirms bobble bubble's expectation that using the start anchor (^) yields only one match in all languages:

VERBOSE: ----------- '^.*'
VERBOSE: [regex]   : [regex]::Replace("a", "^.*", '[$&]')
[a]
VERBOSE: node      : node -pe "'a'.replace(/^.*/g, '[$&]')"
[a]
VERBOSE: ruby      : ruby -e "puts 'a'.gsub(/^.*/, '[\0]')"
[a]
VERBOSE: python    : python -c "import re; print(re.sub('^.*', '[g<0>]', 'a'))"
[a]
VERBOSE: python3   : python3 -c "import re; print(re.sub('^.*', '[g<0>]', 'a'))"
[a]
VERBOSE: perl      : "a" | perl -ple "s/^.*/[$&]/g"
[a]
VERBOSE: perl6     : "a" | perl6 -pe "s:g/^.*/[$/]/"
[a]
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Kinda interesting question. Instead of referring to your questions first, I'll go for your comment.

Once the input string has been consumed in full, why would you treat the fact that there is nothing left as the empty string?

A position called end of subject string is left. It's a position and can be matched. Like other zero-width assertions and anchors , B, ^, $... that assert, a dot-star .* can match an empty string. This is highly dependent on regex engine. E.g. TRegEx does it differently.

And if you do, shouldn't this result in an infinite loop?

No, this is of the main jobs of regex engines to handle. They raise a flag and store current cursor data to avoid such loops to occur. Perl docs explain it this way:

A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:

'foo' =~ m{ ( o? )* }x;

The o? matches at the beginning of foo, and since the position in the string is not moved by the match, o? would match again and again because of the * quantifier. Another common way to create a similar cycle is with the looping modifier /g...

Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{} , and for higher-level ones like the /g modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring.

Now back to your questions:

Is there a good reason for this behavior?

Yes, there is. Every regex engine has to meet a significant amount of challenges in order to process a text. One of which is dealing with zero-length matches. Your question raises another question,

Q: How does an engine should proceed after matching a zero-length string?

A: It all depends.

PCRE (or Ruby here) doesn't skip zero-length matches.

It matches it then raises a flag to not match the same position again with the (same)? pattern. In PCRE .* matches entire subject string then stops right after it. Being at the end, current position is a meaningful position in PCRE, positions can be matched or being asserted so there is a position (zero-length string) left to be matched. PCRE goes through the regex again (if g modifier is enabled) and finds a match at the end of subject.

PCRE then tries to advance to the next immediate position to run whole process again but it fails since there is no position left.

You see if you want to prevent the second match from being happened you need to tell engine in some way:

^.*

Or to provide a better insight into what is going on:

(?!$).*

See live demo here specially take a look at debugger window.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...