scrapy - Avoid Duplicate URL Crawling

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I wrote a simple crawler. Following the Scrapy documentation, I set this in settings.py:

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'

If I stop the crawler and restart it, it scrapes the same URLs again. Am I doing something wrong?



1 Reply

深蓝 · Answer 1 · 2021-10-23T18:34:42+0000

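I believe what you are looking for is "persistence support", which lets you pause and resume crawls. By default the duplicate filter keeps its record of seen requests only in memory, so that record is lost when the crawler stops. With a job directory, Scrapy saves this state to disk and reloads it on the next run.

To enable it, run the spider with the JOBDIR setting: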

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
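
Since -s only overrides a setting for that run, the same value can presumably be set in the project's settings.py instead. A minimal sketch, reusing the directory name from the command above (the name itself is only an example):

```python
# settings.py (fragment)
# Directory where Scrapy keeps the state of one crawl job, including
# the duplicate filter's record of requests it has already seen.
# Assumption: one directory per spider and per job; do not share it.
JOBDIR = "crawls/somespider-1"
```

Either way, stop the spider gracefully (a single Ctrl-C) so the state is written out, and resume by starting it again with the same JOBDIR.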

You can read more about it in the "Jobs: pausing and resuming crawls" section of the Scrapy documentation.
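
To see why a job directory makes the difference, here is a toy sketch in plain Python. It is not Scrapy's code: PersistentSeenSet and run_once are hypothetical names, and it stores raw URLs, whereas Scrapy stores request fingerprints. It only illustrates the idea that an in-memory "seen" set is forgotten between runs, while one backed by a file in the job directory is not.

```python
import os
import tempfile


class PersistentSeenSet:
    """Toy duplicate filter; optionally remembers URLs across runs."""

    def __init__(self, job_dir=None):
        self.seen = set()
        self.file = None
        if job_dir:
            os.makedirs(job_dir, exist_ok=True)
            path = os.path.join(job_dir, "requests.seen")
            # "a+" creates the file if missing; writes always append.
            self.file = open(path, "a+", encoding="utf-8")
            self.file.seek(0)
            # Reload everything recorded by earlier runs.
            self.seen.update(line.rstrip("\n") for line in self.file)

    def is_duplicate(self, url):
        if url in self.seen:
            return True
        self.seen.add(url)
        if self.file:
            self.file.write(url + "\n")
        return False

    def close(self):
        if self.file:
            self.file.close()


def run_once(urls, job_dir=None):
    """Simulate one crawler run; return the URLs it would fetch."""
    seen = PersistentSeenSet(job_dir)
    fetched = [u for u in urls if not seen.is_duplicate(u)]
    seen.close()
    return fetched


if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b"]
    # Without a job directory, a second run fetches everything again.
    print(run_once(urls), run_once(urls))
    # With one, the second run finds nothing new.
    with tempfile.TemporaryDirectory() as job_dir:
        print(run_once(urls, job_dir), run_once(urls, job_dir))
```

The first pair of runs behaves like the crawler in the question; the second pair behaves like a crawl started with JOBDIR.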