Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

perl - Using Encode::encode with "utf8"

As you probably know, in Perl "utf8" names Perl's looser variant of UTF-8, which accepts code points that strict UTF-8 rejects, such as surrogates and values beyond U+10FFFF. By contrast, "UTF-8" (or "utf-8") names Perl's stricter variant, which does not allow those code points.

I have a few usage questions related to this distinction:

  1. Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?

  2. What happens when you read and write files that were opened using "UTF-8"? Are bad characters replaced with a substitution character, or does something else happen?

  3. What is the difference between using open with a layer like '>:utf8' and a layer like '>:encoding(utf8)'? Can both approaches be used with both 'utf8' and 'UTF-8'?


1 Reply

                   ╔════════════════════════════════════════════╤══════════════════════╗
                   ║                                            │                      ║
                   ║                  On Read                   │       On Write       ║
                   ║                                            │                      ║
        Perl       ╟─────────────────────┬──────────────────────┼──────────────────────╢
        5.26       ║                     │                      │                      ║
                   ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                   ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                   ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
                   ║                     │                      │                      ║
╔══════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║                  ║                     │                      │                      ║
║ :encoding(UTF-8) ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :encoding(utf8)  ║ Warns and Replaces  │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :utf8            ║ Corrupt scalar      │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╚══════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝


Note that :encoding(UTF-8) actually decodes using utf8, then checks if the resulting character is in the acceptable range. This reduces the number of error messages for bad input, so it's good.

(Encoding names are case-insensitive.)


Tests used to generate the above table:

On read

  • :encoding(UTF-8)

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(UTF-8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "\xFFFF" does not map to Unicode.
    utf8 "\xD800" does not map to Unicode.
    utf8 "\x200000" does not map to Unicode.
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
    5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
    5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
  • :encoding(utf8)

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(utf8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
  • :utf8

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":utf8";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
    0 (internal: 80, UTF8=1)
    

On write

  • :encoding(UTF-8)

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
    "\x{ffff}" does not map to utf8.
    "\x{d800}" does not map to utf8.
    "\x{200000}" does not map to utf8.

    $ od -t c a
    0000000 303 251  \n   \   x   {   F   F   F   F   }  \n   \   x   {   D
    0000020   8   0   0   }  \n   \   x   {   2   0   0   0   0   0   }  \n
    0000040

    $ cat a
    é
    \x{FFFF}
    \x{D800}
    \x{200000}
    
  • :encoding(utf8)

    $ perl -e'
       use open ":std", ":encoding(utf8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.

    $ od -t c a
    0000000 303 251  \n 355 240 200  \n 370 210 200 200 200  \n
    0000015

    $ cat a
    é
    ?
    ?
    
  • :utf8

    Same results as :encoding(utf8).

Tested using Perl 5.26.


Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?

Perl strings are sequences of 32-bit or 64-bit characters, depending on the build, and utf8 can encode any 72-bit integer. It can therefore encode every character it can be asked to encode, so with "utf8" there is never anything for encode to substitute.

