Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Inconsistency between $ and ^ in regex when using start/end arguments to re.search?

From what I've read, ^ should match the start of a string, and $ the end. However, with re.search(), it looks like the behavior of ^ continues to work fine, while $ 'breaks'. Example:

>>> import re
>>> a = re.compile( "^a" )
>>> print( a.search( "cat", 1, 3 ) )
None

This seems correct to me -- 'a' is not at the start of the string, even if it is at the start of the search.

>>> a = re.compile( "a$" )
>>> print( a.search( "cat", 0, 2 ) )
<_sre.SRE_Match object at 0x7f41df2334a8>

This seems wrong to me, or inconsistent at least.

The documentation on the re module explicitly mentions that the behavior of ^ does not change due to start/end arguments to re.search, but no change in behavior is mentioned for $ (that I've seen).

Can anyone explain why things were designed this way, and/or suggest a convenient workaround?

By workaround, I would like to compose a regex which always matches the end of the string, even when someone uses the end argument to re.search.

And why was re.search designed such that:

s.search( string, endpos=len(string) - 1 )

is the same as

s.search( string[:-1] )

when

s.search( string, pos=1 )

is explicitly and intentionally not the same as

s.search( string[1:] )

It seems to be less an issue of inconsistency between ^ and $, and more of an inconsistency within the re.search function.

question from:https://stackoverflow.com/questions/43108558/inconsistency-between-and-in-regex-when-using-start-end-arguments-to-re-sear


1 Reply


Short Answer

Use \A and \Z to match the literal beginning or end of a string. The relevant lines from the re module's docs:

6.2.1. Regular Expression Syntax

\A Matches only at the start of the string.

\Z Matches only at the end of the string.
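As a rough illustration of the difference (a minimal sketch; the sample strings are made up for this example), compare the four anchors on input with a trailing or embedded newline:

```python
import re

text = "foo\n"

# $ matches at the very end of the string and also just before a trailing newline.
dollar_hit = re.search(r"foo$", text)

# \Z matches only at the very end, so the trailing newline prevents a match.
z_hit = re.search(r"foo\Z", text)

# ^ also matches after each newline in MULTILINE mode; \A never does.
caret_hit = re.search(r"^b", "a\nb", re.MULTILINE)
a_hit = re.search(r"\Ab", "a\nb", re.MULTILINE)

print(dollar_hit is not None)  # True
print(z_hit is not None)       # False
print(caret_hit is not None)   # True
print(a_hit is not None)       # False
```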

Caveat about endpos

This won't work "even when someone uses the end argument to re.search". Unlike the "start" parameter pos, which just marks a starting point, the endpos parameter means the search (or match) will be conducted on only a portion of the string (emphasis added):

6.2.3. Regular Expression Objects

regex.search(string[, pos[, endpos]])

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, [...] rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

\Z matches the end of the string being searched, which is exactly what endpos changes.
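The asymmetry between pos and endpos can be seen directly. The sketch below also includes one possible workaround for end-anchored patterns: check after the fact that the match ends at the real end of the string. The helper name search_to_true_end is hypothetical, not part of the re module.

```python
import re

text = "cat"

# pos only moves the starting point; ^ still refers to the real start of the string.
caret = re.compile(r"^a")
caret_with_pos = caret.search(text, 1)    # no match
caret_on_slice = caret.search(text[1:])   # matches, because "at" starts with "a"

# endpos truncates the searched string, so \Z matches at endpos.
tail = re.compile(r"a\Z")
tail_with_endpos = tail.search(text, 0, 2)  # matches, as if text were "ca"
tail_on_slice = tail.search(text[:2])       # same result

def search_to_true_end(compiled, string, pos=0, endpos=None):
    """Hypothetical helper: accept a match only if it ends at the real end of string."""
    if endpos is None:
        endpos = len(string)
    match = compiled.search(string, pos, endpos)
    if match is not None and match.end() == len(string):
        return match
    return None

print(caret_with_pos, caret_on_slice)
print(tail_with_endpos, tail_on_slice)
print(search_to_true_end(tail, text, 0, 2))
```

The helper cannot make the pattern itself see past endpos; it only rejects matches that stop short of the true end, so it is suited to patterns that are already anchored with \Z or $.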

Background

The more-familiar ^ and $ don't do what you think they do:

^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

$ Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
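The examples in that documentation excerpt can be reproduced with a short script (a sketch; the variable names are chosen here for illustration, and the test strings contain real newline characters):

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, $ only matches at the end (or before the final newline).
normal = re.search(r"foo.$", text)

# With MULTILINE, $ also matches before every newline, so the first line wins.
multi = re.search(r"foo.$", text, re.MULTILINE)

# A bare $ finds two empty matches in 'foo\n'.
empties = [m.span() for m in re.finditer(r"$", "foo\n")]

print(normal.group())  # foo2
print(multi.group())   # foo1
print(empties)         # [(3, 3), (4, 4)]
```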

Python's regular expressions are heavily influenced by Perl's, which extended the old grep abilities with a host of its own. That included multi-line matching, which raised a question about metacharacters like ^: Was it matching the beginning of the string, or the beginning of the line? When grep was only matching one line at a time, those were equivalent concepts.

As you can see, ^ and $ ended up trying to match everything "start-like" and "end-ish". Perl introduced the new escape sequences \A and \z (lower-case) to match only the start-of-string and end-of-string.

Those escape sequences were adopted by Python, but with one difference: Python did not adopt Perl's \Z (upper-case), which matched both end-of-string and the special case newline-before-end-of-string... making it not quite the partner to \A that one would expect.

(I assume Python upper-cased Perl's \z for consistency, avoiding the lopsided '\Apattern\z' regexes that were recommended in books like Perl Best Practices.)

History of pos and endpos

It appears that the strange "not actually the start of the string" meaning of pos is as old as the parameter itself:

  • The Python 1.4 match function docs (25 Oct 1996 --- probably pre-dating the regex object) don't show the pos or endpos parameters at all.

  • The Python 1.5 match method docs (17 Feb 1998) introduce both the regular expression object and the pos and endpos parameters. It states that a ^ will match at pos, although later revisions suggest this was a typo. (Speaking of typos: The ^ character itself is missing. It came and went, until finally reappearing for good(?) in Python 2.1.)

  • The Python 1.5.1 match method docs (14 Apr 1998) insert the missing "not", reversing the previous docs.

  • The Python 1.5.1p1 match method docs (06 Aug 1998) clarify the unexpected effects of pos. They match Python 3.6.1's description of pos word-for-word... give or take that pesky ^ typo.

I suspect the numerous changes to the docs over a couple months of bug-fix releases reflect the docs catching up with reality --- not changes to the design of match (although I don't have Python 1 lying around to verify that).

The python-dev mailing list archives only go back to 1999, so unless the earlier messages were saved somewhere else, I think answering the "why" question would require guessing who wrote that code, and asking them.

