Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
523 views
in Technique[技术] by (71.8m points)

text - bigrams instead of single words in termdocument matrix using R and Rweka

I've found a way to use use bigrams instead of single tokens in a term-document matrix. The solution has been posed on stackoverflow here: findAssocs for multiple terms in R

The idea goes something like this:

library(tm)
library(RWeka)
data(crude)

#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However the final line gives me the error:

Error in rep(seq_along(x), sapply(tflist, length)) : 
  invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

If I remove the tokenizer from the last line it creates a regular tdm, so I guess the problem is somewhere in the BigramTokenizer function although this is the same example that the Weka site gives here: http://tm.r-forge.r-project.org/faq.html#Bigrams.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Inspired by Anthony's comment, I found out that you can specify the number of threads that the parallel library uses by default (specify it before you call the NgramTokenizer):

# Sets the default number of threads to use
options(mc.cores=1)

Since the NGramTokenizer seems to hang on the parallel::mclapply call, changing the number of threads seems to work around it.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...