Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


encoding - R Corpus Is Messing Up My UTF-8 Encoded Text

I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus method from the tm package is not encoding the strings correctly.

Here is a reproducible example of my problem:

Load in the Russian text:

> data <- c("Renault Logan, 2005","Складское помещение, 345 м2",
          "Су-шеф","3-к квартира, 64 м2, 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")

Create a VectorSource:

> vs <- VectorSource(data)
> vs # outputs correctly

Then, create the corpus:

> corp <- Corpus(vs)
> inspect(corp) # output is not encoded properly

The output that I get is:

> inspect(corp)
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
Renault Logan, 2005

[[2]]
<<PlainTextDocument (metadata: 7)>>
?ê?à??ê?? ??ì?ù?íè?, 345 ì<U+00B2>

[[3]]
<<PlainTextDocument (metadata: 7)>>
?ó-???

[[4]]
<<PlainTextDocument (metadata: 7)>>
3-ê êaàeòèeà, 64 ì<U+00B2>, 3/5 yò.

[[5]]
<<PlainTextDocument (metadata: 7)>>
Samsung galaxy S4 mini GT-I9190 (÷?eí?é)

Why does it output incorrectly? There doesn't seem to be any option to set the encoding on the Corpus method. Is there a way to set it after the fact? I have tried this:

> corp <- tm_map(corp, enc2utf8)
Error in FUN(X[[1L]], ...) : argumemt is not a character vector

But it errors as shown.


1 Reply


Well, there seems to be good news and bad news.

The good news is that the data appears to be fine, even though it doesn't display correctly with inspect(). Try looking at a single document directly:

content(corp[[2]])
# [1] "Складское помещение, 345 м2"
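
The same check can be run over the whole corpus, and it also hints at why the tm_map() call in the question fails: tm_map() hands each document object (not a bare character vector) to the function, so enc2utf8 has to be wrapped in content_transformer(). A minimal sketch, assuming a tm version that provides content_transformer(); the shortened data vector is only for illustration, and the Russian title is written with \u escapes so the result does not depend on the script file's encoding:

```r
library(tm)  # attaching tm also attaches NLP, which provides content()

# Two of the titles from the question; the second is "Су-шеф"
data <- c("Renault Logan, 2005", "\u0421\u0443-\u0448\u0435\u0444")
corp <- VCorpus(VectorSource(data))

# Wrap enc2utf8 so it is applied to each document's text, not the document object
corp <- tm_map(corp, content_transformer(enc2utf8))

# Pull the text of every document back out as a plain character vector
texts <- unname(sapply(corp, content))
unchanged <- identical(texts, data)
```

If unchanged is TRUE, the corpus holds the original strings and only the printing is at fault.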

The reason it looks funny in inspect() is that the authors changed the way the print.PlainTextDocument function works. It formerly would cat() the value to the screen. Now, however, they feed the data through writeLines(), which uses the locale of the system to format the characters/bytes in the document. (The locale can be viewed with Sys.getlocale().) It turns out Linux and OS X have a proper "UTF-8" encoding, but Windows uses language-specific code pages. So if the characters aren't in the code page, they get escaped or translated to funny characters. This means the output should display just fine on a Mac, but not on a PC.
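
To see which situation a given machine is in, base R can report the locale and whether it is a UTF-8 one. A small sketch using only base functions; the sample string is one title from the question, and the variable names are just for illustration:

```r
x <- "\u0421\u0443-\u0448\u0435\u0444"  # "Су-шеф", written with \u escapes

# Character-type locale: typically a code page on Windows, a *.UTF-8 name on Linux/OS X
Sys.getlocale("LC_CTYPE")

# TRUE when the session locale is UTF-8
is_utf8_locale <- isTRUE(l10n_info()[["UTF-8"]])

# How R has marked the string internally
Encoding(x)

# Unicode code points of the string, independent of how it prints
code_points <- utf8ToInt(x)
```

If the code points are right but the printed text is not, the string itself is intact and the locale is what mangles the display.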

Try going a step further and building a DocumentTermMatrix:

dtm <- DocumentTermMatrix(corp)
Terms(dtm)

Hopefully you will see (as I do) the words correctly displayed.

If you like, this article about writing UTF-8 files on Windows has some more information about this OS-specific issue. I see no easy way to get writeLines() to output UTF-8 to stdout() on Windows. I'm not sure why the package maintainers changed the print method, but one might ask them, or submit a feature request to change it back.
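
If a file is acceptable instead of the console, one workaround is to write the UTF-8 bytes out untouched and read them back with the encoding declared. This is only a sketch, assuming the goal is to get the text out of R intact rather than to fix inspect(); the temporary file and the sample strings are illustrative:

```r
# Second element is "Су-шеф", written with \u escapes
x <- c("Renault Logan, 2005", "\u0421\u0443-\u0448\u0435\u0444")

path <- tempfile(fileext = ".txt")
con <- file(path, open = "w", encoding = "native.enc")
# useBytes = TRUE writes the bytes as they are, without re-encoding to the locale
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the file back, declaring its bytes to be UTF-8
back <- readLines(path, encoding = "UTF-8")
round_trip_ok <- identical(back, x)
```

Opening the resulting file in an editor that understands UTF-8 should then show the Cyrillic text, whatever the console prints.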

