Short Answer
Use \A and \Z to match the literal beginning or end of a string.
The relevant lines from the re module's docs:
6.2.1. Regular Expression Syntax
\A
Matches only at the start of the string.
\Z
Matches only at the end of the string.
Caveat about endpos
This won't work "even when someone uses the end argument to re.search".
Unlike the "start" parameter pos, which just marks a starting point, the endpos parameter means the search (or match) will be conducted on only a portion of the string (emphasis added):
6.2.3. Regular Expression Objects
regex.search(string[, pos[, endpos]])
The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, [...] rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).
The \Z matches the end of the string being searched, which is exactly what endpos changes.
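A minimal sketch of that caveat, assuming a made-up sample string "foobar" and a compiled pattern named ends_with_foo (both are illustrative, not from the docs):

```python
import re

text = "foobar"                      # hypothetical sample input
ends_with_foo = re.compile(r"foo\Z")

# Over the whole string, "foo" is not at the end, so nothing matches.
whole = ends_with_foo.search(text)

# With endpos=3 the search behaves as if the string were just "foo",
# so \Z matches at index 3 even though "bar" follows in the real string.
truncated = ends_with_foo.search(text, 0, 3)

# Slicing first gives the same result, as the docs describe.
sliced = ends_with_foo.search(text[:3], 0)

print(whole)             # None
print(truncated.span())  # (0, 3)
print(sliced.span())     # (0, 3)
```

So \Z anchors to the end of the searched portion, not necessarily to the end of the original string.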
Background
The more-familiar ^ and $ don't do what you think they do:
^
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
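The examples from the quoted docs can be reproduced directly. This is only a sketch; the variable names are mine:

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches only at the very end or just before a final newline.
default_match = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline, so the first line wins.
multiline_match = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare $ finds two empty matches in "foo\n":
# one just before the newline (index 3) and one at the end (index 4).
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

print(default_match)     # foo2
print(multiline_match)   # foo1
print(dollar_positions)  # [3, 4]
```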
Python's regular expressions are heavily influenced by Perl's, which extended the old grep abilities with a host of its own. That included multi-line matching, which raised a question about metacharacters like ^: Was it matching the beginning of the string, or the beginning of the line? When grep was only matching one line at a time, those were equivalent concepts.
As you can see, ^ and $ ended up trying to match everything "start-like" and "end-ish".
Perl introduced the new escape sequences \A and \z (lower-case) to match only the start-of-string and end-of-string.
Those escape sequences were adopted by Python, but with one difference: Python did not adopt Perl's \Z (upper-case), which matched both end-of-string and the special case newline-before-end-of-string... making it not quite the partner to \A that one would expect.
(I assume Python upper-cased Perl's \z for consistency, avoiding the lopsided '\Apattern\z' regexes that were recommended in books like Perl Best Practices.)
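To make the difference concrete, here is a small sketch comparing Python's $ and \Z on a string with a trailing newline. The lookahead pattern named perl_style_upper_z is my own approximation of the Perl \Z behaviour described above, not something taken from the re docs:

```python
import re

text = "foo\n"

# $ tolerates a single trailing newline, so this matches.
dollar = re.search(r"foo$", text)

# Python's \Z behaves like Perl's \z: only the true end of string counts.
upper_z = re.search(r"foo\Z", text)

# Hypothetical emulation of Perl's \Z: end of string, or just before
# one final newline. The lookahead consumes nothing.
perl_style_upper_z = re.compile(r"foo(?=\n?\Z)")

print(dollar is not None)                      # True
print(upper_z)                                 # None
print(bool(perl_style_upper_z.search("foo")))      # True
print(bool(perl_style_upper_z.search("foo\n")))    # True
print(bool(perl_style_upper_z.search("foo\n\n")))  # False
```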
History of pos and endpos
It appears that the strange "not actually the start-start position" meaning of pos is as old as the parameter itself:
The Python 1.4 match function docs (25 Oct 1996, probably pre-dating the regex object) don't show the pos or endpos parameters at all.
The Python 1.5 match method docs (17 Feb 1998) introduce both the regular expression object and the pos and endpos parameters. They state that a ^ will match at pos, although later revisions suggest this was a typo. (Speaking of typos: The ^ character itself is missing. It came and went, until finally reappearing for good(?) in Python 2.1.)
The Python 1.5.1 match method docs (14 Apr 1998) insert the missing "not", reversing the previous docs.
The Python 1.5.1p1 match method docs (06 Aug 1998) clarify the unexpected effects of pos. They match Python 3.6.1's description of pos word-for-word... give or take that pesky ^ typo.
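The behaviour those later docs describe can be checked in a current Python 3. A minimal sketch, using a made-up string "xxfoo" and pattern names of my own choosing:

```python
import re

text = "xxfoo"                 # hypothetical sample input
caret = re.compile(r"^foo")
plain = re.compile(r"foo")

# pos only moves the starting point; ^ still refers to index 0
# of the real string, so this finds nothing.
caret_at_pos = caret.search(text, 2)

# Without the anchor, matching from pos=2 works as expected.
plain_at_pos = plain.match(text, 2)

# Slicing really does create a new start of string, so ^ matches here.
caret_on_slice = caret.search(text[2:])

print(caret_at_pos)           # None
print(plain_at_pos.span())    # (2, 5)
print(caret_on_slice.span())  # (0, 3)
```

This is the asymmetry with endpos: passing pos is not equivalent to slicing, while passing endpos is.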
I suspect the numerous changes to the docs over a couple months of bug-fix releases reflect the docs catching up with reality, not changes to the design of match (although I don't have Python 1 lying around to verify that).
The python-dev mailing list archives only go back to 1999, so unless the earlier messages were saved somewhere else, I think answering the "why" question would require guessing who wrote that code, and asking them.