Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
775 views
in Technique[技术] by (71.8m points)

perl - How do I read UTF-8 with diamond operator (<>)?

I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.

So my script should be callable in these two ways, as usual, giving the same output:

./script.pl utf8.txt
cat utf8.txt | ./script.pl

But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:

#!/usr/bin/perl -w

binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';

while(<>){
    my @chars = split //, $_;
    print "$_
" foreach(@chars);
}

How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.

EDIT:

I realized I should probably describe the different outputs. My input file contains this sequence: axCAxA7b. The method with cat correctly outputs:

a
xCAxA7
b

But the other method gives me this:

a
xC3x8A
xC2xA7
b
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Try to use the pragma open instead:

use strict;
use warnings;
use open qw(:std :utf8);

while(<>){
    my @chars = split //, $_;
    print "$_" foreach(@chars);
}

You need to do this because the <> operator is magical. As you know it will read from STDIN or from the files in @ARGV. Reading from STDIN causes no problem as STDIN is already open thus binmode works well on it. The problem is when reading from the files in @ARGV, when your script starts and calls binmode the files are not open. This causes STDIN to be set to UTF-8, but this IO channel is not used when @ARGV has files. In this case the <> operator opens a new file handle for each file in @ARGV. Each file handle gets reset and loses it's UTF-8 attribute. By using the pragma open you force each new STDIN to be in UTF-8.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...