Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
414 views
in Technique[技术] by (71.8m points)

python - How do I extract the email address using scrapy?

I'm trying to extract the email address of each restaurant on TripAdvisor.

I've tried this but keeps returning an [ ]:

response.xpath('//*[@class= "restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--89flT6"]')

Code snippet off the TripAdvisor page is below:

<div class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--1flT6"><span><a href="mailto:info@canopylounge.my?subject=?"><span class="ui_icon email restaurants-detail-overview-cards-LocationOverviewCard__detailLinkIcon--T_k32"></span><span class="restaurants-detail-overview-cards-LocationOverviewCard__detailLinkText--co3ei">Email</span><span class="ui_icon external-link-no-box restaurants-detail-overview-cards-LocationOverviewCard__upLinkIcon--1oVn1"></span></a></span></div>
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

First: you had mistake in class name.

Second: it is class in <div> but @href is in <a>. And <a> is not directly after <div> so you need

'//*[@class="..."]//a/@href'

(I skip class name because it is too long to display it)


But instead of so long class name you can try

'//a[contains(@href, "mailto")]/@href'

I tested xpath using lxml

text = '''<div class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--1flT6">
<span><a href="mailto:info@canopylounge.my?subject=?">
<span class="ui_icon email restaurants-detail-overview-cards-LocationOverviewCard__detailLinkIcon--T_k32"></span>
<span class="restaurants-detail-overview-cards-LocationOverviewCard__detailLinkText--co3ei">Email</span>
<span class="ui_icon external-link-no-box restaurants-detail-overview-cards-LocationOverviewCard__upLinkIcon--1oVn1"></span>
</a></span>
</div>'''

import lxml.html

soup = lxml.html.fromstring(text)

print(soup.xpath('//*[@class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--1flT6"]//a/@href'))
print(soup.xpath('//a[contains(@href, "mailto")]/@href'))

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...