Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

python - Converting double slash utf-8 encoding

I cannot get this to work! I have a text file from a save game file parser with a bunch of UTF-8 Chinese names in it in byte form, like this in the source.txt:

\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89

But, no matter how I import it into Python (3 or 2), I get this string, at best:

\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89

I have tried, like other threads have suggested, to re-encode the string as UTF-8 and then decode it with unicode escape, like so:

stringName.encode("utf-8").decode("unicode_escape")

But then it messes up the original encoding, and gives this as the string:

'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing this string results in: æå æ)
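
To see why this happens, here is a minimal sketch, assuming Python 3 and that the text read from the file holds literal backslashes followed by x and two hex digits. The variable names are illustrative only. The unicode_escape codec maps each \xNN escape to the single code point U+00NN, so the three-byte UTF-8 sequences are never reassembled into Chinese characters.

```python
# Text as it arrives from the file: literal backslash, 'x', two hex digits.
raw = r"\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89"

# unicode_escape turns every \xNN into the code point U+00NN (Latin-1 semantics),
# so nine escaped bytes become nine separate characters.
mojibake = raw.encode("utf-8").decode("unicode_escape")

print(len(raw))       # 36 characters of escape text
print(len(mojibake))  # 9 code points, one per escaped byte
print(ascii(mojibake))
```

The nine code points still carry the original byte values, which is why the string can be repaired afterwards rather than being lost.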

Now, if I manually paste the original string into a bytes literal (b'...') and decode it as UTF-8, I get the correct text. For example:

b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8")

Results in: '扎加拉'

But, I can't do this programmatically. I can't even get rid of the double slashes.

To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:

with open('source.txt','r',encoding='utf-8') as f_open:
    source = f_open.read()

Okay, so I clicked the answer below (I think), but here is what works:

from ast import literal_eval
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')

I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringVariable) and then doing that works! Thank you!
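
For reference, the same approach wrapped in a function might look like the sketch below. The helper name decode_escaped_utf8 is hypothetical. It assumes each extracted value is ASCII-only escape text with no single quotes in it; a quote or a non-ASCII character would make the generated bytes literal invalid and literal_eval would raise an error.

```python
from ast import literal_eval


def decode_escaped_utf8(escaped):
    """Hypothetical helper: convert escaped byte text into the str it encodes."""
    # Build the source text of a bytes literal, let literal_eval interpret the
    # escapes (it only accepts literals, so no arbitrary code is executed),
    # then decode the resulting bytes as UTF-8.
    return literal_eval("b'{}'".format(escaped)).decode("utf-8")


name = decode_escaped_utf8(r"\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89")
print(name)  # 扎加拉
```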

To be more clear, the original file is not just these messed up utf encodings. It only uses them for certain fields. For example, here is the beginning of the file:

{'m_cacheHandles': ['s2ma\x00\x00CN\x1f\x1b"\x8d\xdb\x1fr \\\xbf\xd4D\x05R\x87\x10\x0b\x0f9\x95\x9b\xe8\x16T\x81b\xe4\x08\x1e\xa8U\x11',
                's2ma\x00\x00CN\x1a\xd9L\x12n\xb9\x8aL\x1d\xe7\xb8\xe6\xf8\xaa\xa1S\xdb\xa5+\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv',
                's2ma\x00\x00CN\x92\xd8\x17D\xc1D\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91M\x8btZ\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb',
                's2ma\x00\x00CN\xa1\xe9\xab\xcd?\xd2PS\xc9\x03\xab\x13R\xa6\x85u7(K2\x9d\x08\xb8k+\xe2\xdeI\xc3\xab\x7fC',
                's2ma\x00\x00CNN\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9HX\xb93S*sj\xe3\xf8\xe7\x84`\xf1Ye\x15~\xb93\x1f\xc90',
                's2ma\x00\x00CN8\xc6\x13F\x19\x1f\x97AH\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06\nL]z\xbb\x15\xdcI\x93\xd3V'],
'm_campaignIndex': 0,
'm_defaultDifficulty': 7,
'm_description': '',
'm_difficulty': '',
'm_gameSpeed': 4,
'm_imageFilePath': '',
'm_isBlizzardMap': True,
'm_mapFileName': '',
'm_miniSave': False,
'm_modPaths': None,
'm_playerList': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92,   'm_r': 36},
               'm_control': 2,
               'm_handicap': 0,
               'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',

All of the information before the 'm_hero': field is not utf-8. So using ShadowRanger's solution works if the file is only made up of these fake utf-encodings, but it doesn't work when I have already parsed m_hero as a string and try to convert that. Karin's solution does work for that.
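
One way to deal with a file where only some fields are escaped UTF-8 is to convert field by field and leave anything that does not decode cleanly untouched. The sketch below is an assumption-laden illustration, not what the parser does: decode_if_utf8 and the sample record are hypothetical, the shortened handle value is made up for the example, and the code targets Python 3.

```python
def decode_if_utf8(escaped):
    """Hypothetical helper: decode escaped UTF-8 text, else return it unchanged."""
    try:
        # Escape text -> code points U+00NN -> the raw bytes they stand for.
        raw_bytes = escaped.encode("ascii").decode("unicode_escape").encode("latin-1")
        # Raises UnicodeDecodeError when the bytes are not valid UTF-8.
        return raw_bytes.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return escaped


# Sample record with one UTF-8 name and one binary-looking field (made-up data).
record = {
    "m_control": 2,
    "m_hero": r"\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89",
    "m_handle": r"s2ma\x00\x00CN\x1f\x1b\x8d\xdb",
}

fixed = {
    key: decode_if_utf8(value) if isinstance(value, str) else value
    for key, value in record.items()
}
print(ascii(fixed["m_hero"]))  # '\u624e\u52a0\u62c9'
```

Binary data that happens to be valid UTF-8 would still be converted, so in practice it is safer to apply the helper only to fields known to hold names.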

1 Reply

The problem is that the unicode_escape codec is implicitly decoding the result of the escape fixes by assuming the bytes are latin-1, not utf-8. You can fix this by:

# Read the file as bytes:
with open(myfile, 'rb') as f:
    data = f.read()

# Decode with unicode-escape to get Py2 unicode/Py3 str, but interpreted
# incorrectly as latin-1
badlatin = data.decode('unicode-escape')

# Encode back as latin-1 to get back the raw bytes (it's a 1-1 encoding),
# then decode them properly as utf-8
goodutf8 = badlatin.encode('latin-1').decode('utf-8')

This (assuming the file contains the literal backslashes and codes, not the bytes they represent) leaves you with '\u624e\u52a0\u62c9'. That should be correct; I'm on a system without font support for those characters, so that's just the safe repr based on Unicode escapes. You could skip a step in Py2 by using the string-escape codec for the first stage decode (which I believe would let you omit the .encode('latin-1') step), but this solution should be portable, and the cost shouldn't be terrible.
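
The round trip works because Latin-1 maps byte value N to code point U+00NN, so encoding the intermediate string back to Latin-1 recovers the original bytes exactly. A small Python 3 check of that claim, using in-memory sample data in place of the file (the sample bytes are an assumption about what the file contains):

```python
# Latin-1 is a 1-1 mapping between the 256 byte values and U+0000..U+00FF,
# so decode followed by encode reproduces any byte string unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
print(round_tripped == all_bytes)  # True

# Sample file contents: literal backslash escapes, read as bytes.
data = rb"\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89"
badlatin = data.decode("unicode-escape")
goodutf8 = badlatin.encode("latin-1").decode("utf-8")
print(ascii(goodutf8))  # '\u624e\u52a0\u62c9'
```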

