Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How can I represent this regex to not get a "bad character range" error?

Is there a better way to do this?

$ python
Python 2.7.9 (default, Jul 16 2015, 14:54:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-55)] on linux2

Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub(u'[\U0001d300-\U0001d356]', "", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fast/services/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/fast/services/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range


1 Reply


Python narrow and wide build (Python versions below 3.3)

The error suggests that you are using a "narrow" (UCS-2) build, which only supports Unicode code points up to 65535 (U+FFFF) as one "Unicode character" [1]. Characters whose code points are above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']

In a "wide" (UCS-4) build, every code point up to 1114111 (U+10FFFF) is recognized as one "Unicode character", so the Unicode string u'\U0001d300' consists of exactly one "Unicode character", that is, one Unicode code point.

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']

[1] I use "Unicode character" (in quotes) to refer to one character in a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is the len() of the string. In a "narrow" build, one "Unicode character" is a 16-bit UTF-16 code unit, so one astral character appears as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.
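The surrogate values shown in the narrow-build session above follow directly from the UTF-16 encoding rules. Here is a minimal sketch of that arithmetic; the function name to_surrogate_pair is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into its two
    UTF-16 code units. Hypothetical helper, for illustration only."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


print([hex(unit) for unit in to_surrogate_pair(0x1D300)])
# ['0xd834', '0xdf00']
```

This is the same pair that a narrow build reports for u'\U0001d300'.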

Matching astral plane characters with regex

Wide build

The regex in the question compiles correctly in a "wide" build:

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>

Narrow build

However, the same regex won't work in a "narrow" build, since the engine does not recognize surrogate pairs. It treats \ud834 as one character on its own, then tries to create a character range from \udf00 to \ud834 and fails, because the start of that range (0xdf00) is greater than its end (0xd834).

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']

The workaround is to use the same method as in ECMAScript: construct the regex so that it matches the surrogate pairs representing the code points.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'

Using regexpu to derive astral plane regex for Python narrow build

Since the construction used to match astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts ES6 RegExp literals to ES5, to do the conversion for you.

Just paste the equivalent regex in ES6 (note the u flag and the \u{hh...h} syntax):

/[\u{1d300}-\u{1d356}]/u

and you get back a regex that can be used in both a Python narrow build and ES5:

/(?:\uD834[\uDF00-\uDF56])/

Remember to remove the / delimiters of the JavaScript RegExp literal when you use the regex in Python.

The tool is extremely useful when the range spans multiple high surrogates (U+D800 to U+DBFF). For example, if we have to match the character range

/[\u{105c0}-\u{1cb40}]/u

the equivalent regex for a Python narrow build and ES5 is

/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/

which is rather complex and error-prone to derive.
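If you would rather not depend on an external tool, the derivation can be sketched in a few lines of Python. This is only an illustration of the splitting logic, not regexpu's actual implementation, and the names astral_range_pattern, _split, _esc and _cls are made up here. It returns the pattern as escaped text (the same form regexpu prints), intended to be pasted into a u'...' literal on a narrow build.

```python
def _split(code_point):
    # UTF-16 surrogate pair for an astral code point.
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def _esc(unit):
    # One 16-bit code unit written as a \uXXXX escape.
    return "\\u%04X" % unit


def _cls(lo, hi):
    # A single unit, or a character class covering lo..hi.
    return _esc(lo) if lo == hi else "[%s-%s]" % (_esc(lo), _esc(hi))


def astral_range_pattern(start, end):
    """Surrogate-pair regex text for the astral range start..end.
    Hypothetical helper; a sketch, not regexpu's algorithm."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected an astral range with start <= end")
    h1, l1 = _split(start)
    h2, l2 = _split(end)
    if h1 == h2:
        # Whole range shares one high surrogate.
        return "(?:%s%s)" % (_esc(h1), _cls(l1, l2))
    parts = []
    first_full, last_full = h1, h2
    if l1 > 0xDC00:
        # First high surrogate only covers part of the low range.
        parts.append(_esc(h1) + _cls(l1, 0xDFFF))
        first_full = h1 + 1
    tail = None
    if l2 < 0xDFFF:
        # Last high surrogate only covers part of the low range.
        tail = _esc(h2) + _cls(0xDC00, l2)
        last_full = h2 - 1
    if first_full <= last_full:
        # High surrogates whose entire low range is included.
        parts.append(_cls(first_full, last_full) + _cls(0xDC00, 0xDFFF))
    if tail is not None:
        parts.append(tail)
    return "(?:%s)" % "|".join(parts)


print(astral_range_pattern(0x1D300, 0x1D356))
print(astral_range_pattern(0x105C0, 0x1CB40))
```

For the two ranges discussed above, this produces the same text as the regexpu output quoted earlier, minus the / delimiters.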

Python version 3.3 and above

Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from that version on, Python behaves like a wide build. This eliminates the problem in the question altogether.
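As a quick check, assuming Python 3.3 or later, the range from the question can be written directly, using the same sample text as in the narrow-build session:

```python
import re

# On Python 3.3+ every astral character is a single str element,
# so the character range compiles and matches as written.
pattern = re.compile("[\U0001d300-\U0001d356]")

text = "Sample \U0001d340. Another \U0001d305. Leave alone \U00011000"
cleaned = pattern.sub("", text)

print(len("\U0001d300"))   # 1
print(ascii(cleaned))      # 'Sample . Another . Leave alone \U00011000'
```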

Compatibility issues

While it's possible to work around the problem and match astral plane characters in Python narrow builds, going forward it's best to change the execution environment to a Python wide build, or to port the code to Python 3.3 or above.

The regex code for narrow build is hard to read and maintain for average programmers, and it has to be completely rewritten when porting to Python 3.
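If the same code has to run on both kinds of build during a migration, one option is to pick the pattern at import time based on sys.maxunicode. This is only a sketch; the names ASTRAL and strip_symbols are made up for illustration, and it assumes the u'...' literal syntax is available (Python 2, or Python 3.3 and above).

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: one code point per string element.
    ASTRAL = re.compile(u'[\U0001d300-\U0001d356]')
else:
    # Narrow build: match the surrogate pair explicitly.
    ASTRAL = re.compile(u'\ud834[\udf00-\udf56]')


def strip_symbols(text):
    """Remove characters in U+1D300..U+1D356 from a Unicode string."""
    return ASTRAL.sub(u'', text)
```

This keeps the hard-to-read surrogate pattern in one place, so it can be deleted once narrow builds no longer need to be supported.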
