Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

unicode - What character encoding is c3 82 c2 bf?

I have a source of text data that includes the byte sequence c3 82 c2 bf. In context I think it's supposed to be a capital Greek Phi symbol (Φ).

Anyway I can't figure out what encoding is being used; I'm writing a Python script to process this data into a database that expects Unicode, and it throws an exception on this particular sequence of data.

Any suggestions on how to handle it?

1 Reply

Interpreted as UTF-8, c3 82 is “Â” U+00C2 and c2 bf is “¿” U+00BF, which does not make much sense, but it’s technically valid UTF-8 data, so it should not be reported as a character-level data error. Interpreted as UTF-16, it’s Hangul syllables and possibly a CJK ideograph, depending on endianness, but still formally valid data, though most probably not what was meant.
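To check these readings yourself, here is a minimal Python 3 sketch that decodes the same four bytes under each encoding and lists the resulting code points. The helper name `describe` is only illustrative, not something from your script.

```python
import unicodedata

# The four bytes from the question.
raw = bytes.fromhex("c382c2bf")

def describe(text):
    """Return (code point, Unicode name) pairs for each character (hypothetical helper)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

as_utf8 = raw.decode("utf-8")          # two characters: U+00C2, U+00BF
as_utf16_be = raw.decode("utf-16-be")  # two code units: U+C382, U+C2BF
as_utf16_le = raw.decode("utf-16-le")  # two code units: U+82C3, U+BFC2

for label, text in [
    ("utf-8", as_utf8),
    ("utf-16-be", as_utf16_be),
    ("utf-16-le", as_utf16_le),
]:
    print(label, describe(text))
```

None of the three decodes raises, so if your script throws on this sequence, the exception probably comes from a different step (for example an implicit ASCII encode or the database driver) rather than from the bytes being invalid.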

This sounds like the result of double conversion, but it’s difficult to make educated guesses. If it stands for Φ, then the UTF-16 form is 03 A6 or A6 03 and the UTF-8 form is CE A6, which don’t really resemble the actual data. Information about the origin of the data might help in guessing what transcodings may have happened.
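As a rough illustration of the double-conversion idea, the sketch below first prints the encodings of Φ for comparison, then undoes one layer under the assumption that the text was UTF-8 that got misread as Latin-1 and encoded to UTF-8 a second time. That assumption is only a guess about your data, and `undo_double_encoding` is a hypothetical helper name.

```python
raw = bytes.fromhex("c382c2bf")

# What Φ (U+03A6) would look like in the encodings discussed above.
phi = "\u03a6"
print(phi.encode("utf-8").hex())      # cea6
print(phi.encode("utf-16-be").hex())  # 03a6
print(phi.encode("utf-16-le").hex())  # a603

def undo_double_encoding(data, wrong_codec="latin-1"):
    """Reverse one UTF-8 -> wrong_codec -> UTF-8 round trip.

    Assumption: the original UTF-8 bytes were decoded with `wrong_codec`
    and then re-encoded as UTF-8. Raises UnicodeError if the data does
    not fit that pattern.
    """
    return data.decode("utf-8").encode(wrong_codec).decode("utf-8")

repaired = undo_double_encoding(raw)
print(repaired, f"U+{ord(repaired):04X}")  # ¿ U+00BF
```

Under that assumption the sequence collapses to a single “¿” (U+00BF), not Φ. One possible reading is that the Φ was already replaced by a substitute character in an earlier conversion, before the double encoding happened, in which case the original character cannot be recovered from these bytes alone.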
