Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


unicode - How to identify UTF-8 encoded strings

What's the best way to determine whether a string is, or at least might be, UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And yes, I know that only characters above the ASCII range are encoded with more than one byte.


1 Reply


chardet: character set detection developed by Mozilla and used in Firefox. Source code is available.

jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.

NCharDet is a .NET (C#) port of a Java port of the C++ code used in the Mozilla and Firefox browsers.

A CodeProject C# sample uses Microsoft's MLang for character encoding detection.

UTRAC is a command-line tool and library written in C++ to detect string encoding.

cpdetector is a Java project used for encoding detection.

chsdet is a Delphi project: a stand-alone executable module for automatic charset/encoding detection of a given text or file.

Another useful post points to a lot of libraries that help you determine character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

You could also take a look at the related question "How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?"; it has some useful content.
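For the narrow question of whether a byte string might be UTF-8, a detection library is not strictly needed: you can attempt a strict decode and see whether it succeeds. The following is a minimal sketch in Python, not taken from any of the libraries listed; the function name `classify_utf8` and its three return labels are made up for illustration. It treats pure ASCII as a separate case, since ASCII is valid UTF-8 but carries no evidence either way.

```python
def classify_utf8(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'not-utf-8'.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding rejects malformed sequences, including
        # stray continuation bytes and truncated multi-byte sequences.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf-8"

    # Every byte below 0x80 means plain ASCII: valid UTF-8, but the
    # check tells us nothing about the intended encoding.
    if all(b < 0x80 for b in data):
        return "ascii"

    # At least one well-formed multi-byte sequence was present.
    return "utf-8"


if __name__ == "__main__":
    print(classify_utf8(b"hello"))
    print(classify_utf8("héllo".encode("utf-8")))
    print(classify_utf8("héllo".encode("latin-1")))
```

A result of "utf-8" is only a strong hint, not proof: text in another single-byte encoding can happen to form valid UTF-8 sequences (for example the Latin-1 bytes for "Ã©"), although this becomes increasingly unlikely as the amount of non-ASCII text grows.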

