Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.6k views
in Technique[技术] by (71.8m points)

web scraping - How to scrape tables inside a comment tag in html with R?

I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag to be #advanced for the table I want. However, I noticed it wasn't picking it up. Looking at the page source, I noticed that the tables are inside an html comment tag <!--

What is the best way to get the tables from inside the comment tags? Thanks!

Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:

library(rvest)

# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')

df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to a single string
    read_html() %>%    # reparse to HTML
    html_node('table#advanced') %>%    # select the desired table
    html_table() %>%    # parse table
    .[colSums(is.na(.)) < nrow(.)]    # get rid of spacer columns

df[, 1:15]
##    Rk           Player Age  G   MP  PER   TS%  3PAr   FTr ORB% DRB% TRB% AST% STL% BLK%
## 1   1        Pau Gasol  34 78 2681 22.7 0.550 0.023 0.317  9.2 27.6 18.6 14.4  0.5  4.0
## 2   2     Jimmy Butler  25 65 2513 21.3 0.583 0.212 0.508  5.1 11.2  8.2 14.4  2.3  1.0
## 3   3      Joakim Noah  29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0  1.2  2.6
## 4   4     Aaron Brooks  30 82 1885 14.4 0.534 0.383 0.213  1.9  7.5  4.8 24.2  1.5  0.6
## 5   5    Mike Dunleavy  34 63 1838 11.6 0.573 0.547 0.181  1.7 12.7  7.3  9.7  1.1  0.8
## 6   6       Taj Gibson  29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7  6.9  1.1  3.2
## 7   7   Nikola Mirotic  23 82 1654 17.9 0.556 0.502 0.455  4.3 21.8 13.3  9.7  1.7  2.4
## 8   8     Kirk Hinrich  34 66 1610  6.8 0.468 0.441 0.131  1.4  6.6  4.1 13.8  1.5  0.6
## 9   9     Derrick Rose  26 51 1530 15.9 0.493 0.325 0.224  2.6  8.7  5.7 30.7  1.2  0.8
## 10 10       Tony Snell  23 72 1412 10.2 0.550 0.531 0.148  2.5 10.9  6.8  6.8  1.2  0.6
## 11 11    E'Twaun Moore  25 56  504 10.3 0.504 0.273 0.144  2.7  7.1  5.0 10.4  2.1  0.9
## 12 12   Doug McDermott  23 36  321  6.1 0.480 0.383 0.140  2.1 12.2  7.3  3.0  0.6  0.2
## 13 13    Nazr Mohammed  37 23  128  8.7 0.431 0.000 0.100  9.6 22.3 16.1  3.6  1.6  2.8
## 14 14 Cameron Bairstow  24 18   64  2.1 0.309 0.000 0.357 10.5  3.3  6.8  2.2  1.6  1.1

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...