Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
357 views
in Technique[技术] by (71.8m points)

html - How can I read and parse the contents of a webpage in R

I'd like to read the contents of a URL (e.q., http://www.haaretz.com/) in R. I am wondering how I can do it

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Not really sure how you want to process that page, because it's really messy. As we re-learned in this famous stackoverflow question, it's not a good idea to do regex on html, so you will definitely want to parse this with the XML package.

Here's an example to get you started:

require(RCurl)
require(XML)
webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)  
# do some clean up with regular expressions
x <- unlist(strsplit(x, "
"))
x <- gsub("","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]

This results in a character vector of mostly just webpage text (along with some javascript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time:??16:48??(EST+7)"           
[4] "????Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()" 

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...