Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c - Why am I getting the last octet repeated when my Perl program outputs a UTF-8 encoded string in cmd.exe?

Update

As @ikegami suggested, I reported this as a bug.

Bug #121783 for perl5: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output

Consider the following C and Perl programs, both of which write the UTF-8 encoding of the string "αβγ" to standard output:

C version:

#include <stdio.h>

int main(void) {
    /* UTF-8 encoded alpha, beta, gamma */
    char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
    puts(x);
    return 0;
}
Output:
C:…> chcp 65001
Active code page: 65001

C:…> cttt.exe
αβγ

Perl version:

C:…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
αβγ
�

From what I can tell, the last octet, 0xb3 is being output again, on another line, which is being translated to U+FFFD.

Note that redirecting output eliminates this effect.

I can also verify that it is the last octet being repeated:

C:…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz
z
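
A minimal sketch, in C like the first program above, of why a repeated 0xb3 would be rendered as U+FFFD while a repeated "z" prints normally: 0xb3 is a UTF-8 continuation byte, so on its own it does not form a valid sequence. The function name utf8_next and the DEMO_MAIN switch are hypothetical, and the decoder is deliberately simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Decode one code point from s[0..n). Stores the number of bytes consumed
   in *used. Malformed input yields U+FFFD and consumes a single byte.
   Simplified: overlong encodings and surrogates are not rejected. */
unsigned long utf8_next(const unsigned char *s, size_t n, size_t *used)
{
    unsigned char b;
    unsigned long cp;
    size_t len, i;

    assert(s != NULL && used != NULL && n > 0);
    b = s[0];

    if (b < 0x80)                { len = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else { *used = 1; return 0xFFFDul; }      /* stray continuation or invalid lead byte */

    if (len > n) { *used = 1; return 0xFFFDul; }   /* sequence is truncated */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *used = 1; return 0xFFFDul; }
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    *used = len;
    return cp;
}

#ifdef DEMO_MAIN   /* hypothetical switch: build with -DDEMO_MAIN for a small driver */
int main(void)
{
    /* "αβγ\n" followed by a stray copy of the last octet of γ */
    const unsigned char bytes[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x0a, 0xb3 };
    size_t i = 0, used;

    while (i < sizeof bytes) {
        unsigned long cp = utf8_next(bytes + i, sizeof bytes - i, &used);
        printf("U+%04lX\n", cp);
        i += used;
    }
    return 0;
}
#endif
```

Built with the driver enabled, this prints U+03B1, U+03B2, U+03B3, U+000A and then U+FFFD for the trailing 0xb3. A trailing "z" (0x7a) is plain ASCII, so it decodes to itself, which is consistent with the second experiment.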

On the other hand, syswrite avoids this problem.

C:…>  perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz

I have observed this in cmd.exe windows on Windows 8.1 Pro 64-bit and Windows Vista Home 32-bit using both self-built perl 5.18.2 and ActiveState's 5.16.3.

I do not see the problem in Cygwin, Linux, or Mac OS X environments. Also, Cygwin's perl 5.14.4 produces correct output in cmd.exe.

Also, when the code page is set to 437, the output from both the C and the Perl versions is identical:

C:…> chcp 437
Active code page: 437

C:…> cttt.exe
╬▒╬▓╬│

C:…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
╬▒╬▓╬│

What is causing the last octet to be output twice when printing from a Perl program in cmd.exe with the code page set to 65001?

PS: I have some more information and screenshots on my blog. For this question, I have tried to distill everything to the simplest possible cases.

PPS: Leaving out the trailing \n results in something even more interesting:

C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
αβγxyzxyz
C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
αβγ�γ�

1 Reply


The following program produces the correct output:

use utf8;
use strict;
use warnings;
use warnings qw(FATAL utf8);

binmode(STDOUT, ":unix:encoding(utf8):crlf");

print 'αβγxyz', "\n";

Output:

C:…> chcp 65001
Active code page: 65001
C:…> perl pttt.pl
αβγxyz

This suggests to me that the problem lies with the default :crlf layer. I do not understand the internals well enough to say more at this point.

After many experiments, I have come to the conclusion that, if the console is already set to 65001 code page, binmode(STDOUT, ":unix:encoding(utf8):crlf"); will "work". However, note the following:

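# Dump is assumed to come from a YAML module loaded earlier (e.g. use YAML;)
binmode(STDOUT, ":unix:encoding(utf8):crlf");
print Dump [
    map {
        my $x = defined($_) ? $_ : '';
        # show the numeric flags word of each layer in hex
        $x =~ s/\A([0-9]+)\z/sprintf '0x%08x', $1/eg;
        $x;
    } PerlIO::get_layers(STDOUT, details => 1)
];
print "αβγxyz\n";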

gives me:

---
- unix
- ''
- 0x01205200
- crlf
- ''
- 0x00c85200
- unix
- ''
- 0x01201200
- encoding
- utf8
- 0x00c89200
- crlf
- ''
- 0x00c8d200
αβγxyz

As before, I do not know enough to know the full consequences of this. I do intend to build a debug perl at some point to further diagnose this.

I examined this a little further. Here are some observations from that post:

The flags for the first unix layer are 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG. Why is CRLF set for the unix layer on Windows? I do not know enough about the internals to understand this.

However, the flags for the second unix layer, the one pushed by my explicit binmode, are 0x01201200 = 0x01205200 & ~CRLF. This is what would have made sense to me to begin with.

The flags for the first crlf layer are 0x00c85200 = CANWRITE | TRUNCATE | CRLF | LINEBUF | FASTGETS | TTY. The flags for the second crlf layer, which I push after the :encoding(utf8) layer, are 0x00c8d200 = 0x00c85200 | UTF8.

Now, if I open a file using open my $fh, '>:encoding(utf8)', 'ttt', and dump the same information, I get:

---
- unix
- ''
- 0x00201200
- crlf
- ''
- 0x00405200
- encoding
- utf8
- 0x00409200

As expected, the unix layer does not set the CRLF flag.
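
The flag words quoted above can be decoded mechanically. What follows is a rough sketch in C, not perl's own code: the bit values in the table are the PERLIO_F_* constants as I understand them from perl's perliol.h, limited to the nine flags named in this answer, so treat them as assumptions and check them against your perl source. The names describe_flags and DEMO_MAIN are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct flag_name { unsigned long bit; const char *name; };

/* Assumed PERLIO_F_* bit values (from memory of perliol.h); only the flags
   mentioned in the answer are listed, in ascending bit order. */
static const struct flag_name FLAGS[] = {
    { 0x00000200ul, "CANWRITE" },
    { 0x00001000ul, "TRUNCATE" },
    { 0x00004000ul, "CRLF"     },
    { 0x00008000ul, "UTF8"     },
    { 0x00080000ul, "LINEBUF"  },
    { 0x00200000ul, "OPEN"     },
    { 0x00400000ul, "FASTGETS" },
    { 0x00800000ul, "TTY"      },
    { 0x01000000ul, "NOTREG"   },
};

/* Write "A | B | C" for the known bits set in flags into out (outsz bytes).
   Returns the bits that the table does not cover. */
unsigned long describe_flags(unsigned long flags, char *out, size_t outsz)
{
    size_t i;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';
    for (i = 0; i < sizeof FLAGS / sizeof FLAGS[0]; i++) {
        if (flags & FLAGS[i].bit) {
            if (out[0] != '\0')
                strncat(out, " | ", outsz - strlen(out) - 1);
            strncat(out, FLAGS[i].name, outsz - strlen(out) - 1);
            flags &= ~FLAGS[i].bit;
        }
    }
    return flags;
}

#ifdef DEMO_MAIN   /* hypothetical switch: build with -DDEMO_MAIN for a small driver */
int main(void)
{
    /* the flag words reported for STDOUT and for the file handle */
    const unsigned long words[] = {
        0x01205200ul, 0x00c85200ul, 0x01201200ul, 0x00c89200ul,
        0x00c8d200ul, 0x00201200ul, 0x00405200ul, 0x00409200ul
    };
    char buf[128];
    size_t i;

    for (i = 0; i < sizeof words / sizeof words[0]; i++) {
        unsigned long rest = describe_flags(words[i], buf, sizeof buf);
        printf("0x%08lx = %s (unknown bits: 0x%lx)\n", words[i], buf, rest);
    }
    return 0;
}
#endif
```

Under those assumed values, 0x01205200 decodes to CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG and 0x00201200 to CANWRITE | TRUNCATE | OPEN, so the console's unix layer differs from the file's unix layer in exactly the CRLF and NOTREG bits.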

