java - Unicode issue with an HTML Title, question mark? 65533;


posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)


I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/

When I use the Apache Commons Lang StringEscapeUtils.escapeHtml method on the title element, I get the following:

Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010

However, when I display that in my web page with UTF-8 encoding, it shows only a question mark.

This is the code I am using:

String title = StringEscapeUtils.escapeHtml(myTitle);

If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27, I get the following output, which seems correct:

TITLE:

<title>Das hermetische Café: Rock &amp; Wrestling 2010</title>

BECOMES (which I was expecting the escapeHtml method to do):

<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>

Any ideas? Thanks.


1 Reply

深蓝 · Answer 1 · 2021-10-17T03:05:44+0000

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8 while the page is actually encoded in ISO-8859-1 (historically the HTTP default for text content when no charset is specified in the Content-Type header or equivalent).
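To make the mismatch concrete, here is a minimal sketch using only the JDK. It encodes "Café:" as ISO-8859-1 bytes and then decodes those bytes with UTF-8. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class WrongDecoderDemo {
    // Decodes raw bytes as UTF-8. With this String constructor, malformed
    // byte sequences are replaced by U+FFFD rather than raising an error.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Café:" in ISO-8859-1: the e-acute is the single byte 0xE9.
        byte[] raw = "Caf\u00e9:".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(raw);                            // wrong decoder
        String right = new String(raw, StandardCharsets.ISO_8859_1); // matching decoder

        System.out.println((int) wrong.charAt(3)); // prints 65533 (U+FFFD)
        System.out.println((int) right.charAt(3)); // prints 233 (U+00E9, e-acute)
    }
}
```

In UTF-8, the byte 0xE9 announces a three-byte sequence, but the next byte (the colon) is not a continuation byte, so the decoder substitutes U+FFFD and carries on.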

So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.
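The following sketch shows why the escaped output contains &#65533;. It is a hand-rolled stand-in, not Commons Lang's implementation: it escapes only `&`, `<`, `>`, and writes every non-ASCII character as a decimal numeric reference (the real escapeHtml also knows named entities such as &eacute;). The class and method names are hypothetical.

```java
public class NumericEscapeSketch {
    // Simplified illustration only: escapes &, <, > and turns any
    // character above 0x7F into a decimal numeric character reference.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp > 0x7F) {
                out.append("&#").append(cp).append(';'); // e.g. U+FFFD -> &#65533;
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The title as it looks AFTER the wrong decoder has already run.
        String damaged = "Das hermetische Caf\uFFFD: Rock & Wrestling 2010";
        System.out.println(escapeNonAscii(damaged));
    }
}
```

The escaper has no way to know that U+FFFD used to be an "é"; the information was lost at decoding time, so the fix belongs there.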

The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
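One way to do that, sketched with JDK classes only: take the charset from the Content-Type header value when it names one, and otherwise fall back to ISO-8859-1. The class and both method names are hypothetical, and the fallback is an assumption that suits this page; it does not look at any charset declared inside the HTML itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetAwareDecoder {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Falls back to ISO-8859-1 (an assumption)
    // when the parameter is missing or names an unknown charset.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Decodes the fetched body with the charset chosen above.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetFrom(contentType));
    }
}
```

With `java.net.HttpURLConnection`, the header value would come from `getContentType()` and the bytes from the connection's input stream; the resulting String can then be passed to escapeHtml.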
