Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - matching unicode characters in python regular expressions

I have read through the other questions on Stack Overflow, but I am still no closer. Sorry if this has already been answered, but I couldn't get anything proposed there to work.

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or something more Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters, like øæå? I'd like to be able to match those characters as well, in both the tag group above and the one for filename.


1 Reply

You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

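This is in Python 2. In Python 3 the u prefix is unnecessary (and is a syntax error in versions 3.0 to 3.2) because all strings are Unicode, and the re.UNICODE flag is already the default for str patterns.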
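
For Python 3, a minimal sketch of the same match might look like the following. It reuses the pattern from the question; the names PATTERN and parse_path are hypothetical helpers introduced only for illustration. It assumes the source file is saved as UTF-8 (the Python 3 default), and it also shows that passing re.ASCII restricts \w to ASCII letters, digits, and underscore, so the match fails again.

```python
import re

# Pattern taken from the question. In Python 3, \w in a str pattern is
# Unicode-aware by default, so no flag is needed.
# PATTERN and parse_path are hypothetical names used for this sketch.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')


def parse_path(path):
    """Return the named groups as a dict, or None if the path does not match."""
    m = PATTERN.match(path)
    return m.groupdict() if m else None


result = parse_path('/by_tag/påske/øyfjell.jpg')
print(result)  # {'tag': 'påske', 'filename': 'øyfjell.jpg'}

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the Norwegian path
# is rejected and re.match returns None.
ascii_match = re.match(PATTERN.pattern, '/by_tag/påske/øyfjell.jpg', re.ASCII)
print(ascii_match)  # None
```

Returning None from parse_path instead of calling groupdict() directly avoids the AttributeError shown in the question when a path does not match.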

