Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
648 views
in Technique[技术] by (71.8m points)

text mining - build word co-occurence edge list in R

I have a chunk of sentences and I want to build the undirected edge list of word co-occurrence and see the frequency of every edge. I took a look at the tm package but didn't find similar functions. Is there some package/script I can use? Thanks a lot!

Note: A word doesn't co-occur with itself. A word which appears twice or more co-occurs with other words for only once in the same sentence.

DF:

sentence_id text
1           a b c d e
2           a b b e
3           b c d
4           a e
5           a
6           a a a

OUTPUT

word1 word2 freq
a     b     2
a     c     1
a     d     1
a     e     3
b     c     2
b     d     2
b     e     2
c     d     2
c     e     1
d     e     1
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It's convoluted so there's got to be a better approach:

dat <- read.csv(text="sentence_id, text
1,           a b c d e
2,           a b b e
3,           b c d
4,           a e", header=TRUE)


library(qdapTools); library(tidyr)
x <- t(mtabulate(with(dat, by(text, sentence_id, bag_o_words))) > 0)
out <- x %*% t(x)
out[upper.tri(out, diag=TRUE)] <- NA

out2 <- matrix2df(out, "word1") %>%
    gather(word2, freq, -word1) %>%
    na.omit() 

rownames(out2) <- NULL
out2

##    word1 word2 freq
## 1      b     a    2
## 2      c     a    1
## 3      d     a    1
## 4      e     a    3
## 5      c     b    2
## 6      d     b    2
## 7      e     b    2
## 8      d     c    2
## 9      e     c    1
## 10     e     d    1

Base only solution

out <- lapply(with(dat, split(text, sentence_id)), function(x) {
    strsplit(gsub("^\s+|\s+$", "", as.character(x)), "\s+")[[1]]
})

nms <- sort(unique(unlist(out)))

out2 <- lapply(out, function(x) {
    as.data.frame(table(x), stringsAsFactors = FALSE)
})

dat2 <- data.frame(x = nms)

for(i in seq_along(out2)) {
    m <- merge(dat2, out2[[i]], all.x = TRUE)
    names(m)[i + 1] <- dat[["sentence_id"]][i]
    dat2 <- m
}

dat2[is.na(dat2)] <- 0
x <- as.matrix(dat2[, -1]) > 0

out3 <- x %*% t(x)
out3[upper.tri(out3, diag=TRUE)] <- NA
dimnames(out3) <- list(dat2[[1]], dat2[[1]])

out4 <- na.omit(data.frame( 
        word1 = rep(rownames(out3), ncol(out3)),  
        word2 = rep(colnames(out3), each = nrow(out3)),
        freq = c(unlist(out3)),
        stringsAsFactors = FALSE)
)

row.names(out4) <- NULL

out4

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...