data mining - Finding 2 & 3 word Phrases Using R TM Package

Question

Welcome To Ask or Share your Answers For Others

data mining - Finding 2 & 3 word Phrases Using R TM Package

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:05:53+0000

You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it's fairly straightforward.

library(tm); library(tau);

tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n)))))

texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
corpus <- Corpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams))

Where n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in package RTextTools, which further simplifies things.

library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)

This returns a class of DocumentTermMatrix for use with package tm.

Categories

data mining - Finding 2 & 3 word Phrases Using R TM Package

data mining - Finding 2 & 3 word Phrases Using R TM Package

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags