Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
524 views
in Technique[技术] by (71.8m points)

text mining - delete stop words in R

I Have a dataframe wich has this structure :

Note.Reco Review Review.clean.lower
10 Good Products  good products
9 Nice film      nice film
....         ....

The first column is the rank of the film, then the second column is the custmer's review then the 3rd column is the review with lowercase letters.

I try now to delete stop words with this :

Data_clean$Raison.Reco.clean1 <- Corpus(VectorSource(Data_clean$Review.clean.lower))
Data_clean$Review.clean.lower1 <- tm_map(Data_clean$Review.clean.lower1, removeWords, stopwords("english"))

But R studio crashes

Can you help me to resolve this problem please?

Thank you

EDIT :

#clean up
# remove grammar/punctuation
Data_clean$Review.clean.lower <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))

Data_corpus <- Corpus(VectorSource(Data_clean$Review.clean.lower))

Data_clean <- tm_map(Data_corpus,  removeWords, stopwords("french"))

train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]

So I get error when I run the 2 last instructions .

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Try the below . You can do cleaning on the corpus and not column directly.

Data_corpus <-
  Corpus(VectorSource(Data_clean$Review.clean.lower))

  Data_clean <- tm_map(Data_corpus,  removeWords, stopwords("english"))

EDIT: As mentioned by you, you want to be able to access the output after removing stop words, try the below instead of the above:

library(tm)

stopWords <- stopwords("en")

Data_clean$Review.clean.lower<- as.character(Data_clean$Review.clean.lower)
 '%nin%' <- Negate('%in%')
 Data_clean$Review.clean.lower1<-lapply(Data_clean$Review.clean.lower, function(x) {
  chk <- unlist(strsplit(x," "))
  p <- chk[chk %nin% stopWords]
  paste(p,collapse = " ")
})

Sample Output of above code:

>  print(Data_clean)
>       note Note.Reco.Review Review.clean.lower Review.clean.lower1
>     1   10    Good Products      good products       good products
>     2    9        Nice film     is a nice film           nice film

Also check the below: R remove stopwords from a character vector using %in%


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...