
unicode - Why does Java char use UTF-16?

Recently I have read a lot about Unicode code points and how they evolved over time, including http://www.joelonsoftware.com/articles/Unicode.html.

But the one thing I couldn't find a real reason for is why Java uses UTF-16 for char.

For example, suppose I have a string of 1024 characters, all within the ASCII range. In Java that means 1024 * 2 bytes, i.e. 2 KB of memory for the character data, no matter what the content is.

If Java's base char were UTF-8 encoded, the same string would take just 1 KB. Even if the string contains characters that need several bytes, say 10 occurrences of "字" (3 bytes each in UTF-8), the memory consumption only grows modestly: (1014 * 1 byte) + (10 * 3 bytes) = 1044 bytes ≈ 1 KB + 20 bytes.

The comparison is clear: roughly 1 KB + 20 bytes vs. 2 KB. I'm not arguing for plain ASCII; my question is why the choice wasn't UTF-8, which handles multi-byte characters anyway. UTF-16 looks like a waste of memory for any string made up mostly of single-byte characters.
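For what it's worth, the arithmetic is easy to check directly. A minimal sketch (assuming Java 11+ for String.repeat; the 1014/10 split just mirrors the example above):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    public static void main(String[] args) {
        // 1014 ASCII letters plus 10 copies of "字" (illustrative data only)
        String s = "a".repeat(1014) + "字".repeat(10);

        // In-memory size of the UTF-16 char data: 2 bytes per char
        System.out.println("UTF-16 chars: " + s.length() * 2 + " bytes");        // 2048

        // Size of the same text encoded as UTF-8 bytes
        System.out.println("UTF-8 bytes : "
                + s.getBytes(StandardCharsets.UTF_8).length + " bytes");         // 1044
    }
}
```

(Note that since Java 9 the JVM can store strings containing only Latin-1 characters with one byte per char via compact strings, but a string containing "字" still falls back to the two-byte UTF-16 representation.)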

Is there any good reason behind this?



1 Reply


Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, along with the birth of UTF-16, is further explained on the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
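A consequence of that history is visible in today's Java API: a single char can no longer hold every Unicode character, and code points outside the Basic Multilingual Plane are stored as surrogate pairs. A minimal sketch (the emoji code point U+1F600 is just an arbitrary supplementary character):

```java
public class SurrogatePairs {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 represents it
        // with a surrogate pair of two char values.
        String smiley = new String(Character.toChars(0x1F600));

        System.out.println(smiley.length());                            // 2 (UTF-16 code units)
        System.out.println(smiley.codePointCount(0, smiley.length()));  // 1 (one code point)
        System.out.println(Character.isHighSurrogate(smiley.charAt(0))); // true
    }
}
```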

As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. That left UTF-16 as the easiest natural progression beyond it.
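To make the random-access point concrete, here is a minimal sketch (the sample string and the hand-rolled UTF-8 scan are purely illustrative, not how the JDK works internally): with a fixed-width 16-bit unit, charAt(i) is a direct array lookup, whereas reaching the i-th character in UTF-8 bytes requires scanning from the start because each character occupies 1 to 4 bytes.

```java
import java.nio.charset.StandardCharsets;

public class RandomAccess {
    public static void main(String[] args) {
        String s = "字字字字字abc";   // five characters that are 3 bytes each in UTF-8, then ASCII

        // With the UTF-16 char layout, indexing is a plain array lookup.
        char fromChars = s.charAt(6);   // constant time, lands on 'b'

        // With UTF-8 bytes, reaching character index 6 means scanning from the
        // start, advancing by the length of each encoded character.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int i = 0;
        for (int skipped = 0; skipped < 6; skipped++) {
            int lead = utf8[i] & 0xFF;
            if      (lead < 0x80) i += 1;   // 1-byte sequence (ASCII)
            else if (lead < 0xE0) i += 2;   // 2-byte sequence
            else if (lead < 0xF0) i += 3;   // 3-byte sequence
            else                  i += 4;   // 4-byte sequence
        }
        char fromUtf8 = (char) utf8[i];     // also 'b' (an ASCII byte here)

        System.out.println(fromChars + " == " + fromUtf8);  // b == b
    }
}
```

(Strictly speaking, charAt indexes UTF-16 code units rather than code points, but for BMP-only strings like this one the two coincide.)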


