text - Keep document ID with R corpus

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have searched Stack Overflow and the web and can only find partial solutions, or ones that no longer work due to changes in tm or qdap. The problem is below:

I have a data frame with two columns, ID and Text (a simple document id/name and then some text).

I have two issues:

Part 1: How can I create a tdm or dtm and keep the document name/id? inspect(tdm) only shows "character(0)" for the documents.
Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus, not in the tdm/dtm.

For Part 2, I used a solution I found here: "How to implement proximity rules in tm dictionary for counting words?"

That solution works on the tdm, though. Is there a better solution for Part 2 that uses something like "tm_map(my.corpus, keepOnlyWords, customlist)"?

Any help will be greatly appreciated. Thanks much!

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:50:00+0000

First, here's a sample data.frame:

dd <- data.frame(
    id = 10:13,
    text = c("No wonder, then, that ever gathering volume from the mere transit ",
      "So that in many cases such a panic did he finally strike, that few ",
      "But there were still other and more vital practical influences at work",
      "Not even at the present day has the original prestige of the Sperm Whale"),
    stringsAsFactors = FALSE
)

Now, to read extra attributes from a data.frame, we use the readTabular function to build a custom data.frame reader. This is all we need to do:

library(tm)
myReader <- readTabular(mapping=list(content="text", id="id"))

We just specify the column to use for the contents and the id in the data.frame. Now we read it in with DataframeSource but use our custom reader.

tm <- VCorpus(DataframeSource(dd), readerControl=list(reader=myReader))
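If readTabular is not available in your installed tm (to my knowledge it was removed around tm 0.7, which may be one of the "changes in TM" the question refers to), the following is a minimal sketch of the same idea. It assumes a tm version in which DataframeSource expects the first two columns of the data.frame to be named doc_id and text. The names dd2 and corpus are only illustrative and are not part of the original answer.

```r
library(tm)

dd <- data.frame(
  id = 10:13,
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

# Assumption: tm >= 0.7, where DataframeSource requires the first two
# columns to be called doc_id and text (in that order).
dd2 <- data.frame(
  doc_id = as.character(dd$id),
  text = dd$text,
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(dd2))

# Each document should now carry the data.frame id as its "id" metadata.
ids <- vapply(content(corpus), function(doc) meta(doc, "id"), character(1))
print(ids)
```

With this layout no custom reader is needed; any further columns in the data.frame are, as far as I know, kept as document-level metadata.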

Now if we want to only keep a certain set of words, we can create our own content_transformer function. One way to do this is

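keepOnlyWords <- content_transformer(function(x, words) {
    # Whole-word pattern; "\\b" must be escaped, since "\b" is a backspace in R
    pattern <- paste0("\\b(", paste(words, collapse="|"), ")\\b")
    # invert=TRUE replaces everything that is NOT a match with a space
    regmatches(x, gregexpr(pattern, x), invert=TRUE) <- " "
    x
})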

This will replace everything that's not in the word list with a space. Note that you probably want to run stripWhitespace after this. Thus our transformations would look like

keep<-c("wonder","then","that","the")

tm<-tm_map(tm, content_transformer(tolower))
tm<-tm_map(tm, keepOnlyWords, keep)
tm<-tm_map(tm, stripWhitespace)
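To see what the replacement inside keepOnlyWords does to a single string, the same regmatches idea can be tried in base R without tm. This is only a sketch: the helper name keep_only_words is hypothetical, and perl = TRUE is my own choice here so that \\b is handled by the PCRE engine; the answer's code uses the default engine.

```r
# Hypothetical helper: keep only whole-word matches of `words` in `x`,
# replacing every non-matching stretch with a single space.
keep_only_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m, invert = TRUE) <- " "
  x
}

keep <- c("wonder", "then", "that", "the")

s1 <- tolower("No wonder, then, that ever gathering volume from the mere transit ")
s2 <- tolower("But there were still other and more vital practical influences at work")

# Collapse runs of whitespace and trim, roughly what stripWhitespace is for.
kept1 <- trimws(gsub("\\s+", " ", keep_only_words(s1, keep)))
kept2 <- trimws(gsub("\\s+", " ", keep_only_words(s2, keep)))

print(kept1)
print(kept2)
```

Because of the word boundaries, "there" and "other" in the second string are not treated as occurrences of "the".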

And then we can turn that into a document term matrix

dtm<-DocumentTermMatrix(tm)
inspect(dtm)

# <<DocumentTermMatrix (documents: 4, terms: 4)>>
# Non-/sparse entries: 7/9
# Sparsity           : 56%
# Maximal term length: 6
# Weighting          : term frequency (tf)

#     Terms
# Docs that the then wonder
#   10    1   1    1      1
#   11    2   0    0      0
#   12    0   1    0      0
#   13    0   3    0      0

and you can see that it has our list of words and the proper document IDs from the data.frame.
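For comparison, here is a hedged sketch of the matrix-level alternative that the question wanted to avoid: restricting the terms with the dictionary control option of DocumentTermMatrix rather than transforming the corpus. It assumes a tm version in which DataframeSource takes doc_id and text columns. The object names are illustrative, and removePunctuation = TRUE is my own addition so that tokens such as "wonder," can match the dictionary.

```r
library(tm)

# Assumption: tm >= 0.7 style input with doc_id and text columns.
dd2 <- data.frame(
  doc_id = as.character(10:13),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)
corpus <- VCorpus(DataframeSource(dd2))

keep <- c("wonder", "then", "that", "the")

# Only terms listed in `dictionary` are counted; the corpus is left unchanged.
dtm_dict <- DocumentTermMatrix(
  corpus,
  control = list(dictionary = keep, removePunctuation = TRUE)
)

# Document IDs and terms can be read back from the matrix.
print(Docs(dtm_dict))
print(Terms(dtm_dict))
```

The corpus-level transformer is the better fit when later steps need the filtered text itself; the dictionary option is shorter when only the counts matter.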
