.net - C# HtmlEncode - ISO-8859-1 Entity Names vs Numbers


Posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.

So for example, for the character é :

Entity Name : &eacute;

Entity Number : &#233;

Similarly, for the character > :

Entity Name : &gt;

Entity Number : &#62;

For a given string, the HttpUtility.HtmlEncode returns an HTML encoded String, but I can't figure out how it works. Here is what I mean :

Console.WriteLine(HttpUtility.HtmlEncode("é>"));
// Outputs: &#233;&gt;

It seems to be using the entity number for the é character but the entity name for the > character.

So does the HtmlEncode method really work with the ISO-8859-1 standard? If it does, is there a reason why it sometimes uses the entity name and other times the entity number? More importantly, can I force it to give me the entity name reliably?

EDIT : Thanks for the answers guys. I cannot decode the string before I perform the search though. Without getting into too many details, the text is stored in a SharePoint List and the "search" is done by SharePoint itself (using a CAML query). So basically, I can't.

I'm trying to think of a way to convert the entity numbers into names, is there a function in .NET that does that? Or any other idea?


深蓝 · Answer 1 · 2021-10-23T17:55:57+0000

That's how the method has been implemented. For a handful of known characters it writes the corresponding named entity, and for the remaining Latin-1 characters from U+00A0 upward it writes the decimal numeric character reference (not hex). There is not much you can do to change this behavior. Excerpt from the implementation of System.Net.WebUtility.HtmlEncode (as seen with Reflector):

...
if (ch <= '>')
{
    switch (ch)
    {
        case '&':
        {
            output.Write("&amp;");
            continue;
        }
        case '\'':
        {
            output.Write("&#39;");
            continue;
        }
        case '"':
        {
            output.Write("&quot;");
            continue;
        }
        case '<':
        {
            output.Write("&lt;");
            continue;
        }
        case '>':
        {
            output.Write("&gt;");
            continue;
        }
    }
    output.Write(ch);
    continue;
}
if ((ch >= '\x00a0') && (ch < '\u0100'))
{
    output.Write("&#");
    output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
    output.Write(';');
}
...

That being said, you shouldn't need to care, as this method will always produce valid, safe and correctly encoded HTML.
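If you really do need entity names (for example because the stored text you are searching against uses them), one workaround is to post-process the encoder's output and swap the numeric references for names. As far as I know there is no built-in .NET function for this, so the sketch below is only an illustration: `NamedEntityEncoder`, its `Encode` method and the lookup table are hypothetical names of mine, and the table is deliberately partial. It assumes `WebUtility.HtmlEncode` emits decimal references of the form `&#NNN;`, as the excerpt above shows.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET: encodes with WebUtility.HtmlEncode,
// then replaces decimal numeric references with entity names where one is known.
public static class NamedEntityEncoder
{
    // Deliberately partial map from code point to HTML 4 entity name.
    // Extend it with whichever entries of the Latin-1 table you need.
    private static readonly Dictionary<int, string> EntityNames = new Dictionary<int, string>
    {
        { 160, "nbsp" },   { 169, "copy" },   { 174, "reg" },
        { 224, "agrave" }, { 231, "ccedil" }, { 232, "egrave" },
        { 233, "eacute" }, { 252, "uuml" }
    };

    // Matches decimal references such as &#233; and captures the digits.
    private static readonly Regex DecimalReference = new Regex("&#([0-9]+);");

    public static string Encode(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        string encoded = WebUtility.HtmlEncode(text);
        return DecimalReference.Replace(encoded, ReplaceWithName);
    }

    private static string ReplaceWithName(Match match)
    {
        int codePoint;
        string name;
        if (int.TryParse(match.Groups[1].Value, out codePoint)
            && EntityNames.TryGetValue(codePoint, out name))
        {
            return "&" + name + ";";
        }

        // No name known for this code point: keep the numeric reference.
        return match.Value;
    }
}
```

With the table above, `NamedEntityEncoder.Encode("é>")` should give `&eacute;&gt;`. Any character missing from the table (including `'`, which is encoded as `&#39;`) keeps its numeric form, so the output is still valid HTML even when the map is incomplete.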
