Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
306 views
in Technique by (71.8m points)

python - Webscraping using Beautifulsoup 4 - extracting contact info

screenshot of html code

This is my first post, so please forgive me if I break any rules. I'm trying to scrape vendor information using code that looks like this:

  soup.find_all('span', class_ = "class-name")

Please refer to the attached image. I want to get the contact number, but it is not given as text or anything similar. Each digit seems to sit in its own span with its own class, and even inside that span the digit isn't present as text. I'm also not familiar with web development, so I would really appreciate any suggestions.

url : https://www.justdial.com/Pune/Sunrise-Enterprises-Budhwar-Peth/020PXX20-XX20-130817131104-Z3I2_BZDET?xid=UHVuZSBFbGVjdHJvbmljIENvbXBvbmVudCBEZWFsZXJz

another similar page with multiple contact details is : https://www.justdial.com/Pune/Galaxy-Enterprises-And-Electronics-Behind-Bharti-Vidyapeeth-Near-Ichapurti-Mandir-Ambegaon-Budruk/020PXX20-XX20-140930130951-K4X6_BZDET?xid=UHVuZSBFbGVjdHJvbmljIENvbXBvbmVudCBEZWFsZXJz

Thanks

question from:https://stackoverflow.com/questions/65892112/webscraping-using-beautifulsoup-4-extracting-contact-info

Fight with dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss gazes back…

1 Reply

0 votes
by (71.8m points)

The phone number never appears as text in the page. The second style tag contains CSS in which the order of the icon-xx classes determines which character each class stands for. The page uses these classes to display an image of each character, so there are no digits in the markup to scrape. The solution is to 1) map the icon-xx classes to characters based on their order in the CSS string; 2) find the phone number spans in the HTML body and look up the character that matches each icon class:

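import requests
from bs4 import BeautifulSoup

url = 'https://www.justdial.com/Pune/Sunrise-Enterprises-Budhwar-Peth/020PXX20-XX20-130817131104-Z3I2_BZDET?xid=UHVuZSBFbGVjdHJvbmljIENvbXBvbmVudCBEZWFsZXJz'
# Send a browser-like User-Agent header with the request.
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'})
soup = BeautifulSoup(r.text, 'html.parser')

# Step 1: the second non-empty <style type="text/css"> tag holds the icon-xx rules.
text = soup.find_all('style', {'type': 'text/css'}, string=True)[1]
# Keep what follows the 'smoothing:grayscale}' rule, up to the end of that line.
data = text.contents[0].split('smoothing:grayscale}', 1)[1].split('\n')[0]
# Split the rules on '.' and keep the part before the first ':' (the class name).
icon_items = [i.split(':')[0] for i in data.split('.') if len(i) > 0]
# The n-th class in the CSS stands for the n-th character in this list.
items = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '-', ')', '(']
full_list = dict(zip(icon_items, items))

# Step 2: every phone number is a span.telnowpr with one inner span per character.
phone_numbers = soup.find_all('span', {'class': 'telnowpr'})
for i in phone_numbers:
    numbers = i.find_all('span')
    # The second class on each inner span is its icon-xx name.
    number = [full_list[y.attrs['class'][1]] for y in numbers]
    print("phone number: " + ''.join(number))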

Result:

phone number: 07947197693
phone number: 07947197693
phone number: 07947197693
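
The same two steps can be tried offline on a small inline page, which may help when the live site changes or blocks requests. This is only a sketch under stated assumptions: the sample HTML, the class names icon-aa, icon-bb, icon-cc and mobilesv, and the helpers build_icon_map and decode_numbers are all made up for illustration. It also swaps the string splitting above for a regular expression that collects icon-xx names from every style tag, which is a different way of reading the CSS order and not the method used in the answer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: class names and CSS bodies are invented for this sketch.
SAMPLE_HTML = '''
<html><head>
<style type="text/css">
.icon-aa:before{content:"a"}.icon-bb:before{content:"b"}.icon-cc:before{content:"c"}
</style>
</head><body>
<span class="telnowpr"><span class="mobilesv icon-bb"></span><span class="mobilesv icon-aa"></span><span class="mobilesv icon-cc"></span><span class="mobilesv icon-bb"></span></span>
</body></html>
'''

# Character order taken from the answer's `items` list.
SYMBOLS = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '-', ')', '(']


def build_icon_map(css, symbols=SYMBOLS):
    """Map each icon class to a character by its order of first appearance in the CSS."""
    names = []
    for name in re.findall(r'\.(icon-[\w-]+)', css):
        if name not in names:
            names.append(name)
    # zip stops at the shorter sequence, so extra symbols are simply unused.
    return dict(zip(names, symbols))


def decode_numbers(html):
    """Return one decoded string per span.telnowpr found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # Join the contents of all <style> tags; empty tags contribute nothing.
    css = ''.join(tag.string or '' for tag in soup.find_all('style'))
    icon_map = build_icon_map(css)
    results = []
    for outer in soup.find_all('span', class_='telnowpr'):
        chars = []
        for inner in outer.find_all('span'):
            # Look at every class on the inner span instead of assuming a position.
            for cls in inner.get('class', []):
                if cls in icon_map:
                    chars.append(icon_map[cls])
        results.append(''.join(chars))
    return results


print(decode_numbers(SAMPLE_HTML))  # prints ['1021']
```

Checking every class on the inner span, instead of always taking the second one, avoids an IndexError or KeyError if a span carries fewer or differently ordered classes. It still relies on the CSS order matching the character order, which is the assumption the whole approach rests on.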
