r - Scraping with rvest - complete with NAs when tag is not present

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to parse this HTML and extract these elements from it:

a) p tag, with class: "normal_encontrado".
b) div with class: "price".

Sometimes the p tag is missing for a product. In that case, an NA should be added to the vector that collects the text from these nodes.

The idea is to end up with two vectors of the same length and then join them into a data.frame. Any ideas?

The HTML part:

<html>
<head></head>
<body>

<div class="product_price" id="product_price_186251">
  <p class="normal_encontrado">
    S/. 2,799.00
  </p>

  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 2,299.00
  </div>    
</div>

<div class="product_price" id="product_price_232046">
  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 4,999.00
  </div>
</div>
</body>
</html>

R Code:

library(rvest)

page_source <- read_html("r.html")

r.precio.antes <- page_source %>%
  html_nodes(".normal_encontrado") %>%
  html_text()

r.precio.actual <- page_source %>%
  html_nodes(".price") %>%
  html_text()

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:04:18+0000

Using the XML package, parse the input with xmlTreeParse and then use xpathSApply to iterate over the div nodes of class product_price. For each such node, the anonymous function returns the values of its p and div child nodes. The resulting matrix m is reworked into a data frame DF, and the columns are cleaned by removing any character that is not a dot or a digit, as well as any dot followed by a non-digit. The result is then converted to numeric. Note that no special processing is needed for the missing p case: a product without a p node simply yields NA.
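The cleaning step described above can be checked in isolation before running the full code. This is a minimal sketch in base R; the sample vector x is an assumption made up for illustration, shaped like the text the nodes return (surrounding whitespace included), with an NA standing in for a missing p node.

```r
# Hypothetical sample input: one raw price string and one missing value
x <- c("\n    S/. 2,799.00\n  ", NA)

# Remove every character that is not a dot or a digit,
# and also remove any dot that is followed by a non-digit (the dot in "S/. ")
cleaned <- gsub("[^.0-9]|[.]\\D", "", x)

# Convert to numeric; the NA passes through unchanged
prices <- as.numeric(cleaned)
print(prices)
```

Note the doubled backslash in "\\D": inside an R string literal a single backslash before D is not a valid escape.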

# input

Lines <- '<html>
<head></head>
<body>

<div class="product_price" id="product_price_186251">
  <p class="normal_encontrado">
    S/. 2,799.00
  </p>

  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 2,299.00
  </div>    
</div>

<div class="product_price" id="product_price_232046">
  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 4,999.00
  </div>
</div>
</body>
</html>'

# code to read input and produce a data.frame

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

m <- xpathSApply(doc, "//div[@class = 'product_price']", function(node) {
  list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]])) })

DF <- as.data.frame(t(m), stringsAsFactors = FALSE) # rework into data frame
DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x))) # clean

The result is:

> DF
     p  div
1 2799 2299
2   NA 4999
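Since the question uses rvest, the same per-product idea can also be sketched there. This is not part of the answer above, only an illustration under stated assumptions: it relies on html_node() (singular), which returns exactly one result per parent node and a missing node where nothing matches, so html_text() gives NA for it. The variable names products, precio_antes, precio_actual and DF2 are made up for this sketch, and the HTML is inlined instead of being read from r.html.

```r
library(rvest)

# Sample HTML from the question, inlined so the sketch is self-contained
html <- '<html><head></head><body>
<div class="product_price" id="product_price_186251">
  <p class="normal_encontrado">
    S/. 2,799.00
  </p>
  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 2,299.00
  </div>
</div>
<div class="product_price" id="product_price_232046">
  <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
    S/. 4,999.00
  </div>
</div>
</body></html>'

page <- read_html(html)

# One node per product; every later lookup is relative to these parents
products <- html_nodes(page, "div.product_price")

# html_node() yields one entry per product, so both vectors have equal length;
# a product without the p tag contributes NA
precio_antes  <- html_text(html_node(products, "p.normal_encontrado"), trim = TRUE)
precio_actual <- html_text(html_node(products, "div.price"), trim = TRUE)

DF2 <- data.frame(precio_antes, precio_actual, stringsAsFactors = FALSE)
print(DF2)
```

The columns are still character strings such as "S/. 2,799.00"; the gsub cleaning from the answer can be applied to them afterwards. In rvest 1.0.0 and later, html_element() and html_elements() are the newer names for html_node() and html_nodes().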
