Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How do I compare a Unicode string that has different bytes, but the same value?

I'm comparing Unicode strings between JSON objects.

They have the same value:

a = '人口じんこうに膾炙かいしゃする'
b = '人口じんこうに膾炙かいしゃする'

But they have different Unicode representations:

String a: u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
String b: u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'

How can I compare two Unicode strings by their value?


1 Reply


Unicode normalization will get you there for this one:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
True

Use unicodedata.normalize on both of your strings before comparing them with == to check for canonical Unicode equivalence.
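
Here is a minimal sketch of that advice applied to the two strings from the question. It assumes Python 3 string literals, and `canonically_equal` is just a made-up helper name for illustration, not something from a library:

```python
import unicodedata

def canonically_equal(a, b, form="NFC"):
    """Return True if a and b are equal after Unicode normalization.

    Hypothetical helper for illustration; not part of any library.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The two strings from the question; they differ only in U+7099 vs U+F9FB.
a = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b"
b = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b"

print(a == b)                   # False: the code points differ
print(canonically_equal(a, b))  # True: both normalize to the same sequence
```

If the strings come out of JSON, the same idea should apply: normalize each decoded string value before comparing it.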

Character U+F9FB is a CJK Compatibility Ideograph. These characters have a canonical decomposition to a regular CJK character (here U+7099), so normalization replaces them with the regular character.
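
If you want to check the mapping yourself, something like the following should work. It is only a sketch, again assuming Python 3, and the variable names are arbitrary:

```python
import unicodedata

ch = "\uf9fb"

# The decomposition field from the Unicode database, as a hex code point string.
print(unicodedata.decomposition(ch))  # 7099

# Every normalization form maps the compatibility ideograph to U+7099.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, ch)
    print(form, "U+%04X" % ord(normalized))
```

Since all four forms give the same result for this character, the choice of form only matters for the other characters in your strings. NFC is a common default.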

