Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Evaluate UTF-8 literal escape sequences in a string in Python3

I have a string of the form:

s = '\\xe2\\x99\\xac'

I would like to convert this to the character ♬ by evaluating the escape sequences. However, everything I've tried either raises an error or prints garbage. How can I get Python to turn the escape sequences into the actual Unicode character?

What I've read elsewhere suggests that the following line of code should do what I want, but it raises a UnicodeEncodeError.

print(bytes(s, 'utf-8').decode('unicode-escape'))

I also tried the following, which has the same result:

import codecs
print(codecs.getdecoder('unicode_escape')(s)[0])

Both of these approaches produce the string 'â\x99¬', which print is subsequently unable to handle.

In case it makes any difference, the string is read from a UTF-8 encoded file and will ultimately be written to a different UTF-8 encoded file after processing.


1 Reply


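...decode('unicode-escape') will give you the three-character string '\xe2\x99\xac' (code points U+00E2, U+0099, U+00AC), not the character those three bytes spell in UTF-8.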

>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape')
'â\x99¬'
>>> _ == '\xe2\x99\xac'
True
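
To make the intermediate result easier to see, here is a small sketch that prints the length of the string before and after the unicode-escape step, along with the resulting code points. The variable names are only illustrative.

```python
# The input holds literal backslashes: 12 characters, not 3.
s = "\\xe2\\x99\\xac"

# Evaluate the escapes. Each \xNN becomes one code point U+00NN,
# which is a character, not a byte.
intermediate = s.encode().decode("unicode-escape")

code_points = [hex(ord(ch)) for ch in intermediate]

print(len(s), len(intermediate))  # 12 3
print(code_points)                # ['0xe2', '0x99', '0xac']
```

The values match the UTF-8 bytes of the target character, but they are stored as three separate characters, which is why printing them does not give ♬.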

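That string still has to be decoded as UTF-8. To do that, first encode it with latin1 (or iso-8859-1): latin1 maps each code point from U+0000 to U+00FF to the single byte with the same value, so the original bytes are preserved and can then be decoded as UTF-8.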

>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
'♬'
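
Since the question mentions reading from one UTF-8 file and writing to another, here is a minimal sketch of the same chain wrapped in a function and applied to a file round trip. The function name decode_utf8_escapes and the file names in.txt and out.txt are made up for illustration, and a temporary directory stands in for the real files.

```python
import tempfile
from pathlib import Path


def decode_utf8_escapes(text: str) -> str:
    """Turn literal \\xNN escapes that spell UTF-8 bytes into real characters."""
    # Evaluate the escapes: each \xNN becomes the code point U+00NN.
    evaluated = text.encode("utf-8").decode("unicode-escape")
    # latin1 turns code points U+0000..U+00FF back into bytes 0x00..0xFF,
    # and those bytes are then interpreted as UTF-8.
    return evaluated.encode("latin1").decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.txt"    # hypothetical input file
    dst = Path(tmp) / "out.txt"   # hypothetical output file

    # The input file contains literal backslash escapes.
    src.write_text("note: \\xe2\\x99\\xac\n", encoding="utf-8")

    converted = decode_utf8_escapes(src.read_text(encoding="utf-8"))
    dst.write_text(converted, encoding="utf-8")

    round_trip = dst.read_text(encoding="utf-8")

print(round_trip == "note: \u266c\n")  # True
```

Two caveats with this approach: every backslash sequence in the input is evaluated, not only the \xNN ones, and an escape such as \u266c that yields a code point above U+00FF will make the latin1 step raise a UnicodeEncodeError.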


...