Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.

Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.

Here is a program to demonstrate. It creates a file with a Unicode name then prints out URL-encoded versions of the file names obtained from the directly-created File, and the same file when listed under a parent directory (you should run this code in an empty directory). The results show the different encoding returned by the File.listFiles() method.

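import java.io.File;
import java.io.IOException;
import java.net.URLEncoder;

public class ListFilesDemo {
    public static void main(String[] args) throws IOException {
        // Decomposed name: 'i' + U+0302 (combining circumflex), 'a' + U+030A (combining ring above)
        String fileName = "Tri\u0302cky Na\u030Ame";
        File file = new File(fileName);
        file.createNewFile();
        System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

        // Get parent (current) dir and list file contents
        File parentDir = file.getAbsoluteFile().getParentFile();
        File[] children = parentDir.listFiles();
        for (File child : children) {
            System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
        }
    }
}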

Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.

OS X Snow Leopard:

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on same OS X system):

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names.


1 Reply


Using Unicode, there is more than one valid way to represent the same letter. The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

You say "Note the %CC versus %C3 character representations", but looking closer what you see are the sequences

i 0xCC 0x82 vs. 0xC3 0xAE
a 0xCC 0x8A vs. 0xC3 0xA5

That is, the first is the letter i followed by 0xCC 0x82, which is the UTF-8 encoding of Unicode \u0302, the "combining circumflex accent" character, while the second is the UTF-8 encoding of \u00EE, "latin small letter i with circumflex". Similarly for the other pair: the first is the letter a followed by 0xCC 0x8A, the "combining ring above" character (\u030A), and the second is "latin small letter a with ring above" (\u00E5). Both are valid UTF-8 encodings of valid Unicode character strings, but the first of each pair is in "decomposed" form and the second is in "composed" form.
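If you want to see the two byte sequences from Java itself, here is a minimal sketch. The class name FormsDemo and the helper utf8Hex are made up for illustration; only the standard String and Charset calls come from the JDK.

```java
import java.nio.charset.Charset;

class FormsDemo {
    // Hex-dump the UTF-8 bytes of a string, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName("UTF-8"))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "i\u0302"; // 'i' followed by combining circumflex accent
        String composed = "\u00EE";    // latin small letter i with circumflex

        System.out.println(utf8Hex(decomposed)); // 69 CC 82
        System.out.println(utf8Hex(composed));   // C3 AE

        // The two strings render the same but are not equal as Java strings.
        System.out.println(decomposed.equals(composed)); // false
    }
}
```

The last line is the crux of the problem in the question: String.equals compares code units, so the two forms never match without normalization.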

OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On other Unix file systems, a name is stored however the filesystem driver chooses to store it, so you can't make any blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

See Apple's Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

A recent email thread on Apple's java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
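In Java 6 this can be done with java.text.Normalizer. The following is only a sketch: the class NameCompare and the helpers toNfc and sameName are hypothetical names, and it assumes you normalize both sides to NFC (the composed form) before comparing, which is one reasonable choice rather than the only one.

```java
import java.text.Normalizer;

class NameCompare {
    // Convert a name to NFC (composed form) so equivalent names compare equal.
    static String toNfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    // Compare two names after normalizing both to the same form.
    static boolean sameName(String a, String b) {
        return toNfc(a).equals(toNfc(b));
    }

    public static void main(String[] args) {
        String created = "Tri\u0302cky Na\u030Ame"; // decomposed, as in the "File name" output
        String listed = "Tr\u00EEcky N\u00E5me";    // composed, as in the "Listed name" output

        System.out.println(created.equals(listed));    // false
        System.out.println(sameName(created, listed)); // true
    }
}
```

Normalizing both strings, rather than only the one you believe is decomposed, means the comparison does not depend on which form a particular filesystem or the remote storage system happens to return.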

