Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

web scraping - How to generate the start_urls dynamically in crawling?

I am crawling a site that may have many start URLs, such as:

http://www.a.com/list_1_2_3.htm

I want to populate start_urls with every URL matching list_\d+_\d+_\d+.htm, and extract items from URLs matching node_\d+.htm during the crawl.

Can I use CrawlSpider to do this? And how can I generate the start_urls dynamically while crawling?
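For illustration, the two URL shapes in the question can be written as Python regular expressions. The patterns below are an assumption based on the single example URL given; the site's real scheme may differ:

```python
import re

# Hypothetical patterns derived from http://www.a.com/list_1_2_3.htm:
# list pages look like list_<n>_<n>_<n>.htm, item pages like node_<n>.htm.
LIST_RE = re.compile(r"list_\d+_\d+_\d+\.htm$")
NODE_RE = re.compile(r"node_\d+\.htm$")

def classify(url):
    """Return 'list', 'node', or None depending on which pattern the URL matches."""
    if LIST_RE.search(url):
        return "list"
    if NODE_RE.search(url):
        return "node"
    return None
```

These are the same regexes you would hand to a CrawlSpider's LinkExtractor (via its allow argument) to follow list pages and parse node pages.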


1 Reply


The best way to generate URLs dynamically is to override the start_requests method of the spider:

from scrapy.http.request import Request

def start_requests(self):
    # urls.txt holds one URL per line; open in text mode and
    # strip the trailing newline before building the Request.
    with open('urls.txt') as urls:
        for url in urls:
            yield Request(url.strip(), callback=self.parse)
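If the list-page indices follow a numeric scheme, the start URLs can also be generated in code instead of read from a file. A minimal sketch, where the base URL and the index ranges are assumptions chosen for illustration:

```python
from itertools import product

def make_start_urls(base="http://www.a.com", pages=2, cats=2, subs=2):
    # Build every list_<i>_<j>_<k>.htm combination for the assumed ranges,
    # e.g. list_1_1_1.htm through list_2_2_2.htm.
    return [
        f"{base}/list_{i}_{j}_{k}.htm"
        for i, j, k in product(range(1, pages + 1),
                               range(1, cats + 1),
                               range(1, subs + 1))
    ]
```

Inside start_requests, each generated URL would then be yielded as a Request, exactly as in the file-based version above.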
