Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
933 views
in Technique[技术] by (71.8m points)

utf 8 - How to decode the string representation of UTF-8 bytes with Python?

I have a unicode string like this:

\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7

I know it is the string representation of bytes that were encoded with UTF-8.

Note that the string \xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7 itself is <type 'unicode'>.

How do I decode it to the real string 山东 日照?



1 Reply

0 votes
by (71.8m points)

If what you printed is the repr() output of your unicode string, then you appear to have Mojibake: byte data that was decoded using the wrong encoding.

First encode back to bytes, then decode using the right codec. This may be as simple as encoding as Latin-1:

unicode_string.encode('latin1').decode('utf8')

This depends on how the incorrect decoding was applied, however. If a Windows codepage (like CP1252) was used, you can end up with Unicode data that cannot actually be encoded back to CP1252, because UTF-8 bytes outside the CP1252 range were force-decoded anyway.
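
To illustrate the CP1252 caveat, here is a minimal Python 3 sketch (the examples in this answer are otherwise Python 2). The helpers sloppy_cp1252_decode and sloppy_cp1252_encode are hypothetical names made up for this illustration. They assume the force-decoding passed the bytes that CP1252 leaves undefined through as Latin-1 control characters, which is only one way such data can come about.

```python
def sloppy_cp1252_decode(data: bytes) -> str:
    """Decode as CP1252, falling back to Latin-1 for undefined bytes (assumed behaviour)."""
    chars = []
    for value in data:
        byte = bytes([value])
        try:
            chars.append(byte.decode('cp1252'))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no CP1252 mapping in Python
            chars.append(byte.decode('latin1'))
    return ''.join(chars)


def sloppy_cp1252_encode(text: str) -> bytes:
    """Reverse of the helper above: CP1252 first, Latin-1 as the fallback."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode('cp1252')
        except UnicodeEncodeError:
            out += ch.encode('latin1')
    return bytes(out)


original = '\u201d'  # right double quotation mark, UTF-8 bytes E2 80 9D
mojibake = sloppy_cp1252_decode(original.encode('utf8'))

# A strict CP1252 round trip fails because U+009D cannot be encoded.
try:
    mojibake.encode('cp1252')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

repaired = sloppy_cp1252_encode(mojibake).decode('utf8')
print(strict_failed, repaired == original)
```

The strict encode raises UnicodeEncodeError, while the lenient round trip recovers the original character. For real data, a library built for this job is likely a safer choice than a hand-written fallback.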

The best way to repair such mistakes is to use the ftfy library, which knows how to deal with force-decoded Mojibake text for a variety of codecs.

For your small sample, Latin-1 appears to work just fine:

>>> unicode_string = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> print unicode_string.encode('latin1').decode('utf8')
山东 日照
>>> import ftfy
>>> print ftfy.fix_text(unicode_string)
山东 日照
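
For readers on Python 3, the same Latin-1 round trip might look like the sketch below. The function name fix_latin1_mojibake is a hypothetical helper introduced here, and the sketch assumes the text really was UTF-8 that got decoded as Latin-1.

```python
def fix_latin1_mojibake(text: str) -> str:
    """Undo a wrong Latin-1 decode of UTF-8 bytes (assumes that is what happened)."""
    # Latin-1 maps code points 0-255 straight back to the same byte values,
    # so encoding recovers the original bytes, which are then decoded as UTF-8.
    return text.encode('latin1').decode('utf8')


garbled = '\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
print(fix_latin1_mojibake(garbled))  # 山东 日照
```

If the input contains characters above U+00FF, or the recovered bytes are not valid UTF-8, this raises UnicodeEncodeError or UnicodeDecodeError rather than guessing.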

If you have the literal character \, then x, followed by two hex digits, you have another layer of encoding where the bytes were replaced by 4 characters each. You'd have to 'decode' those to actual bytes first, by asking Python to interpret the escapes with the string_escape codec:

>>> unicode_string = ur'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> unicode_string
u'\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7'
>>> print unicode_string.decode('string_escape').decode('utf8')
山东 日照

'string_escape' is a Python 2 only codec that produces a bytestring, so it is safe to decode that as UTF-8 afterwards.
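
Since string_escape does not exist in Python 3, a rough equivalent there is to go through the unicode_escape codec and then Latin-1. This is a sketch under the assumption that the input is pure ASCII containing literal \xNN escapes. The function name decode_escaped_utf8 is hypothetical.

```python
def decode_escaped_utf8(escaped: str) -> str:
    """Turn literal \\xNN escape text into the UTF-8 string it spells out."""
    # Step 1: interpret the escapes; each \xNN becomes the character U+00NN.
    as_chars = escaped.encode('ascii').decode('unicode_escape')
    # Step 2: map those characters back to raw bytes, then decode as UTF-8.
    return as_chars.encode('latin1').decode('utf8')


escaped = r'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
print(decode_escaped_utf8(escaped))  # 山东 日照
```

The encode('ascii') step fails loudly on non-ASCII input, which seems preferable to silently mixing escaped and already-decoded text.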



...