Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


xml - htmlParse missing values NA

I'm trying to scrape text from an HTML document using htmlParse (package: XML) in R. In the markup below, I would like to know how to return NA when a tag (e.g., <p class="neg">) is missing:

<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="pos">positive</p>
</div>
<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="neg">negative</p>
</div>

I want the result to look like this:

"positive" "negative"

"positive" NA

"positive" "negative"

NA "negative"

Thanks! Majesus

::::::::::::::::::::::::::::::::::::::::

Chris, I have included a new record (hotel_name):

<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="pos">positive</p>
</div>
<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="neg">negative</p>
</div>

<div class="hotel">
<h3 class="hotel_name">Hotel Bla</h3>
</div>


# select only the review divs, so the new hotel div does not add an empty row
y <- getNodeSet(doc, "//div[@class='review']")

y <- lapply(y, function(x){
       y  <- xpathSApply(x, ".//p[@class]", xmlValue)
 names(y) <- xpathSApply(x, ".//p[@class]", xmlGetAttr, "class") 
       y  
})

ldply(y, "rbind")


t <-getNodeSet(doc, "//div[@class='hotel']")

t <- lapply(t, function(x){
       t  <- xpathSApply(x, ".//h3[@class='hotel_name']", xmlValue)
 names(t) <- xpathSApply(x, ".//h3[@class='hotel_name']", xmlGetAttr, "class") 
       t  
})

ldply(t, "rbind")

How can I combine both results (y and t) into one table (a CSV file, for example) that opens in Excel? "pos", "neg" and "t" must be columns in the same table. Importantly, each "pos" and each "neg" may contain line breaks. I tried combining cbind and write.table, but the resulting table comes out garbled.
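One possible way to approach the combination is sketched below in base R. It is not taken from the thread: the `reviews` data frame and `hotel_name` value are hypothetical stand-ins for the scraped results, the `flatten` helper and the `hotel_name` column name are illustrative, and replacing line breaks with spaces is an assumption about what the spreadsheet should show.

```r
# Hypothetical stand-ins for the two scraped results: one row per review,
# and a single hotel name taken from the "hotel" div.
reviews <- data.frame(
  pos = c("positive\nsecond line", "positive", "positive", NA),
  neg = c("negative", NA, "negative", "negative"),
  stringsAsFactors = FALSE
)
hotel_name <- "Hotel Bla"

# Assumption: line breaks inside a review should become single spaces so
# that each review stays in one spreadsheet cell. NA values stay NA.
flatten <- function(x) gsub("[\r\n]+", " ", x)
reviews[] <- lapply(reviews, flatten)

# The single hotel name is recycled down every review row.
combined <- data.frame(hotel_name = hotel_name, reviews,
                       stringsAsFactors = FALSE)

# Write missing values as empty cells and omit the row-number column.
out_file <- tempfile(fileext = ".csv")  # hypothetical output path
write.csv(combined, out_file, row.names = FALSE, na = "")
```

write.csv quotes character fields by default, so embedded line breaks would also survive if they were left in; flattening them is only a convenience for viewing the file in Excel.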


1 Reply


You could select the div nodes, return a named vector for each one, and then row-bind the list. Any class missing from a div comes out as NA:

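library(XML)   # htmlParse, getNodeSet, xpathSApply, xmlValue, xmlGetAttr
library(plyr)  # ldply

# doc is the parsed document returned by htmlParse()
div <- getNodeSet(doc, "//div")

# for each div, collect the <p> texts and name them by their class attribute
y <- lapply(div, function(x) {
  vals <- xpathSApply(x, ".//p[@class]", xmlValue)
  names(vals) <- xpathSApply(x, ".//p[@class]", xmlGetAttr, "class")
  vals
})

# row-bind the per-div results; columns absent from a div are filled with NA
ldply(y, "rbind")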
       pos      neg
1 positive negative
2 positive     <NA>
3 positive negative
4     <NA> negative
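To see why the missing tags end up as NA, the alignment step can be reproduced without plyr. The following is a minimal base R sketch, not the answer's method: the `reviews` list is a hypothetical stand-in for the named vectors that the lapply() call would return for the four sample reviews, and the names `cols`, `rows` and `result` are illustrative.

```r
# Hypothetical stand-ins for the per-review named vectors.
reviews <- list(
  c(pos = "positive", neg = "negative"),
  c(pos = "positive"),
  c(pos = "positive", neg = "negative"),
  c(neg = "negative")
)

# Collect every name that occurs, in order of first appearance.
cols <- unique(unlist(lapply(reviews, names)))

# Index each vector by the full set of names; a name that is not
# present in a vector yields NA at that position.
rows <- lapply(reviews, function(v) unname(v[cols]))

# Stack the aligned rows into a character matrix, then a data frame.
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(result) <- cols
print(result)
```

Indexing by name is what does the work here: every vector is stretched to the same set of columns before the rows are bound, so a review without a "neg" paragraph simply has NA in that column.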
