Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
438 views
in Technique[技术] by (71.8m points)

html - Cant Scrape webpage with Python Requests Library

I am trying to get some info from a webpage (link below) using Requests in python; however, the HTML data that I see in my browser doesn't seem to exist when I connect via python's request library. None of the xpath queries return any information. I am able to use requests for other sites such as amazon (the site below is actually owned by Amazon, but I can't seem to scrape any information from it).

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'
user_agent = {'User-agent': 'Mozilla/5.0'} 
page = requests.get(url, headers=user_agent)
tree = html.fromstring(page.text)
query = tree.xpath("//span[@id=ourPrice]/text()")
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The element is generated using javascript, you can use selenium to get the source, to get headless browsing combine it with phantomjs:

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get(url)
_html = browser.page_source

from bs4 import BeautifulSoup

print(BeautifulSoup(_html).find("span",{"id":"ourPrice"}).text)
$50

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...