Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
3.2k views
in Technique[技术] by (71.8m points)

r - How to remove zero entries in a DFM when the matrix is too big for usual manipulation?

I have the following problem: I converted a corpus into a dfm, and this dfm has some zero entries that I need to remove before fitting an LDA model. I would usually do as follows:

OutDfm <- dfm_trim(
  dfm(corpus,
      tolower = TRUE,
      remove = c(stopwords("english"), stopwords("german"),
                 stopwords("french"), stopwords("italian")),
      remove_punct = TRUE,
      remove_numbers = TRUE,
      remove_separators = TRUE,
      stem = TRUE,
      verbose = TRUE),
  min_docfreq = 5)

Creating a dfm from a corpus input...
   ... lowercasing
   ... found 272,912 documents, 112,588 features
   ... removed 613 features
   ... stemming features (English), trimmed 27491 feature variants
   ... created a 272,912 x 84,515 sparse dfm
   ... complete. 
Elapsed time: 78.7 seconds.


# remove zero-entries
raw.sum=apply(OutDfm,1,FUN=sum)
which(raw.sum == 0)
OutDfm = OutDfm[raw.sum!=0,]

However, when I run these last operations I get the following error, which suggests that the matrix is too large to be manipulated: Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Has anyone encountered and solved this issue before? Is there an alternative strategy for removing the zero entries?

Thanks a lot!



1 Reply

0 votes
by (71.8m points)

Your apply() call with sum coerces the dfm from a sparse matrix to a dense matrix in order to compute the row sums, and a dense matrix of this size is too large to allocate.
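To make the scale concrete: a dense copy of a 272,912 x 84,515 matrix holds about 2.3e10 cells, roughly 185 GB at 8 bytes per value, which is presumably why the coercion fails. The sketch below uses the Matrix package rather than quanteda, with a made-up 4 x 3 toy matrix in place of the real dfm (the names m, row_totals, and m_nonempty are illustrative only). It assumes a dfm behaves like an ordinary sparse Matrix object for this purpose, which should hold because quanteda builds its dfm class on Matrix.

```r
library(Matrix)

# Size of the dense copy that apply() would need for the dfm in the question.
dense_cells <- 272912 * 84515        # documents x features
dense_gb    <- dense_cells * 8 / 1e9 # 8 bytes per double

# Toy stand-in for a dfm (hypothetical data): 4 documents x 3 features,
# where doc3 has no non-zero counts.
m <- sparseMatrix(
  i = c(1, 1, 2, 4),
  j = c(1, 3, 2, 3),
  x = c(2, 1, 5, 1),
  dims = c(4, 3),
  dimnames = list(paste0("doc", 1:4), c("a", "b", "c"))
)

# Matrix::rowSums() has a method for sparse matrices, so no dense copy is made.
row_totals <- Matrix::rowSums(m)

# Keep only the rows (documents) with at least one non-zero count.
keep <- unname(row_totals > 0)
m_nonempty <- m[keep, , drop = FALSE]
```

The subsetting step is the same as in the question; only the way the row sums are computed changes.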

You could use slam::row_sums, since slam functions work on sparse matrices, but better yet, just use quanteda::dfm_subset to select all the documents with more than 0 tokens.

dfm_subset(OutDfm, ntoken(OutDfm) > 0)

Example showing how it works, keeping only documents with more than 5,000 tokens:

library(quanteda)
x <- corpus(data_corpus_inaugural)
x <- dfm(x)
x
Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
                 features
docs              fellow-citizens  of the senate and house representatives : among vicissitudes
  1789-Washington               1  71 116      1  48     2               2 1     1            1

# subset based on the number of tokens.
dfm_subset(x, ntoken(x) > 5000)
Document-feature matrix of: 3 documents, 9,360 features (84.1% sparse) and 4 docvars.
               features
docs            fellow-citizens  of the senate and house representatives : among vicissitudes
  1841-Harrison              11 604 829      5 231     1               4 1     3            0
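
A minimal sketch of the situation in the question, assuming a recent quanteda in which a dfm is built from tokens() rather than directly from a corpus. The three toy documents and the min_docfreq = 2 threshold are made up for illustration: trimming leaves one document with no features, and dfm_subset() with ntoken() drops it.

```r
library(quanteda)

# Hypothetical toy documents; "banana" is the only feature in two documents.
txt <- c(d1 = "apple banana apple",
         d2 = "banana cherry",
         d3 = "durian")

x <- dfm(tokens(txt))

# Keep features that occur in at least 2 documents; d3 is left empty.
x_trim <- dfm_trim(x, min_docfreq = 2)

# Drop documents whose token count is zero after trimming.
x_nonempty <- dfm_subset(x_trim, ntoken(x_trim) > 0)

docnames(x_nonempty)
```

Because ntoken() works on the sparse representation, this avoids the dense coercion that apply() triggers.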
