beautifulsoup - Scrape the absolute URL instead of a relative path in python

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to get all the hrefs from an HTML document and store them in a list for future processing. For example:

Example URL: www.example-page-xl.com

 <body>
    <section>
    <a href="/helloworld/index.php"> Hello World </a>
    </section>
 </body>

I'm using the following code to list the hrefs:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

for url in section.find_all('a'):
    print(url.get('href'))

However, I would like to store the URL as www.example-page-xl.com/helloworld/index.php, not just the relative path /helloworld/index.php.

I'd rather not simply append the relative path to the URL, since the dynamic links may vary when the two are joined.

In a nutshell, I would like to scrape the absolute URL rather than the relative path alone (and without joining).

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:55:48+0000

In this case urljoin helps. In Python 3 it lives in urllib.parse (the urlparse module is Python 2 only). Modify your code like this:

import bs4 as bs
import urllib.request
from urllib.parse import urljoin

web_url = 'https://www.example-page-xl.com'
sauce = urllib.request.urlopen(web_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

for url in section.find_all('a'):
    # Resolve each href against the page URL; absolute hrefs pass through unchanged
    print(urljoin(web_url, url.get('href')))

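Here urljoin handles both absolute and relative paths: a relative href is resolved against web_url, and an href that is already absolute is returned unchanged.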
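
To see how urljoin treats the different kinds of href a page might contain, here is a small offline sketch. The base URL, the href values, and the helper name to_absolute are all made up for illustration; they are not taken from the page in the question.

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, chosen only for illustration.
base_url = 'https://www.example-page-xl.com/blog/post.html'

hrefs = [
    '/helloworld/index.php',        # root-relative path
    'images/logo.png',              # relative to the current page's directory
    '../about.html',                # parent-directory reference
    '//cdn.example.com/lib.js',     # scheme-relative URL
    'https://other.example.org/x',  # already absolute
    None,                           # an <a> tag with no href attribute
]

def to_absolute(base, links):
    """Resolve each href against base, skipping missing (None) values."""
    return [urljoin(base, href) for href in links if href is not None]

absolute_urls = to_absolute(base_url, hrefs)
for link in absolute_urls:
    print(link)
```

The page source only contains the relative path, so some resolution step against the page URL is unavoidable. urljoin applies the standard URL resolution rules rather than plain string concatenation, which is why the root-relative, directory-relative and already-absolute cases above each come out differently.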
