Converting UTF-8 to ISO-8859-1 in Java

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, a few characters are not displayed correctly, such as “, – and ’ (they display as ?).

Is it possible to convert these characters from UTF-8 to ISO-8859-1?

Here is a snippet of code I have written to attempt this:

BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();

String line = null;
while ((line = br.readLine()) != null) {
  sb.append(line);
}
br.close();

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

return new String(latin1);

I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with

byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");

I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:37:16+0000

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
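Both points can be checked with a short experiment. This is only a sketch, and the class name UnmappableDemo is made up for illustration. String.getBytes replaces every character the target charset cannot represent with that charset's replacement byte, which is ? for ISO-8859-1. That is where the question marks come from, not from readLine(). The second method runs java.text.Normalizer with compatibility decomposition (NFKD) to see whether it touches the curly quotes.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class; the name is not from the original post.
final class UnmappableDemo {
  private UnmappableDemo() {}

  // Round-trips text through ISO-8859-1, much like the code in the question.
  static String throughLatin1(String text) {
    byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
    return new String(latin1, StandardCharsets.ISO_8859_1);
  }

  // Applies compatibility decomposition (NFKD) from the standard library.
  static String normalize(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFKD);
  }

  public static void main(String[] args) {
    String quoted = "\u201Chi\u201D";
    // Unmappable characters are silently replaced: prints ?hi?
    System.out.println(throughLatin1(quoted));
    // Normalization leaves the curly quotes alone: prints true
    System.out.println(normalize(quoted).equals(quoted));
  }
}
```

So the characters are lost at the getBytes step, and normalization alone will not turn them into something ISO-8859-1 can hold.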

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using numeric character references, as shown here:

public final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

Example usage:

String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK (U+201C, “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
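Note that HtmlEncoder escapes everything outside Basic Latin, including characters such as é that ISO-8859-1 can represent directly. If you would rather escape only what the page encoding cannot hold, one possible variation is to ask a CharsetEncoder. The sketch below is an assumption on my part rather than part of the code above, and the names Latin1HtmlEncoder and escapeNonLatin1 are made up.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical variant of HtmlEncoder; not part of the original answer.
final class Latin1HtmlEncoder {
  private Latin1HtmlEncoder() {}

  static String escapeNonLatin1(CharSequence sequence) {
    CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (latin1.canEncode(ch)) {
        // representable in ISO-8859-1, so emit it unchanged
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // skip the low surrogate of a supplementary character
        i += Character.charCount(codepoint) - 1;
        out.append("&#x").append(Integer.toHexString(codepoint)).append(';');
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // prints: café &#x201c;quoted&#x201d; &#x1d50a;
    System.out.println(escapeNonLatin1("caf\u00E9 \u201Cquoted\u201D \uD835\uDD0A"));
  }
}
```

The result contains only characters that ISO-8859-1 can encode, so writing it out with that charset should no longer produce ? replacements.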

Care needs to be taken with this approach. If your text also needs to be escaped for HTML (for example &, < and >), that needs to be done before the above code; otherwise the ampersands in the generated entities end up being escaped as well.
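To illustrate why the order matters, here is a small sketch. The class and both helpers are hypothetical: escapeMarkup handles only &, < and >, and escapeNonAscii is a condensed stand-in for escapeNonLatin so that the example is self-contained.

```java
// Hypothetical helper class; names are illustrative, not from the code above.
final class EscapeOrderDemo {
  private EscapeOrderDemo() {}

  // Minimal markup escaping: only &, < and > are handled here.
  static String escapeMarkup(String text) {
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  // Condensed stand-in for escapeNonLatin: code points above U+007F become entities.
  static String escapeNonAscii(String text) {
    StringBuilder out = new StringBuilder();
    text.codePoints().forEach(cp -> {
      if (cp < 0x80) {
        out.appendCodePoint(cp);
      } else {
        out.append("&#x").append(Integer.toHexString(cp)).append(';');
      }
    });
    return out.toString();
  }

  public static void main(String[] args) {
    String raw = "Tom & \u201CJerry\u201D";
    // Markup first, then entities: Tom &amp; &#x201c;Jerry&#x201d;
    System.out.println(escapeNonAscii(escapeMarkup(raw)));
    // Entities first, then markup: Tom &amp; &amp;#x201c;Jerry&amp;#x201d;
    System.out.println(escapeMarkup(escapeNonAscii(raw)));
  }
}
```

In the second case the browser would show the literal text &#x201c; instead of the quote character.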
