
Python Scrapy: how to use BaseDupeFilter

I have a website with many pages like this:

mywebsite/?page=1

mywebsite/?page=2

...

...

...

mywebsite/?page=n

Each page has links to players. When you click on any link, you go to that player's page.

Users can add players, so I end up with this situation:

Player1 has a link on page=1.

Player10 has a link on page=2.

After an hour, because users have added new players, I will have this situation:

Player1 has a link on page=3.

Player10 has a link on page=4.

And the new players, like Player100 and Player101, have links on page=1.

I want to scrape all players to get their information, but I don't want to re-scrape players that I have already scraped. My question is: how do I use BaseDupeFilter in Scrapy to identify which players have already been scraped and which have not? Remember, I still want to crawl the listing pages every time, because each page will contain different players each time.
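
To make the question concrete, this is roughly the kind of filter I have in mind, although I am not sure it is the right way to use the API. It is only a sketch: PlayerDupeFilter, the players.seen file and the player_id meta key are names I made up, and the spider would have to put the player's id into request.meta when yielding each player request.

# settings.py
# DUPEFILTER_CLASS = "myproject.dupefilters.PlayerDupeFilter"

# myproject/dupefilters.py
from scrapy.dupefilters import BaseDupeFilter


class PlayerDupeFilter(BaseDupeFilter):
    """Drop player requests whose player id was already scraped in a previous run."""

    def __init__(self, path="players.seen"):
        self.path = path
        try:
            with open(path) as f:
                self.seen = {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            self.seen = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        player_id = request.meta.get("player_id")   # the spider sets this when yielding
        if player_id is None:
            return False                            # never filter the listing pages
        if str(player_id) in self.seen:
            return True                             # already scraped -> drop the request
        self.seen.add(str(player_id))
        return False

    def close(self, reason):
        # persist the ids so the next run remembers them
        with open(self.path, "w") as f:
            f.write("\n".join(sorted(self.seen)))

In the spider I would then yield something like Request(url, meta={"player_id": player_id}, callback=self.parse_player). Is that a sensible way to use BaseDupeFilter, or is there a better approach?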

Thank you.



1 Reply


I'd take another approach: instead of trying to detect the last player during the spider run, launch the spider with a pre-calculated argument holding the last scraped player:

scrapy crawl <my spider> -a last_player=X

Then your spider might look something like this:

import scrapy


class MySpider(scrapy.Spider):
    start_urls = ["http://....mywebsite/?page=1"]
    ...                                           # name, custom settings, etc.

    def parse(self, response):
        ...
        last_player_met = False
        player_links = response.xpath(....).getall()   # site-specific XPath for the player links
        for player_link in player_links:
            player_id = int(player_link.split(....))   # extract the numeric player id from the link
            if player_id > int(self.last_player):      # -a arguments arrive as strings
                # newer than the last scraped player -> not seen yet, follow it
                yield scrapy.Request(url=player_link, callback=self.scrape_player)
            else:
                # we reached a player that was already scraped on a previous run
                last_player_met = True
        if not last_player_met:
            # try to xpath for 'Next' in pagination,
            # or use meta={} in the request to loop over pages like
            # "http://....mywebsite/?page=" + page_number
            yield scrapy.Request(url=..., callback=self.parse)
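
One detail to keep in mind: arguments passed with -a arrive on the spider instance as strings, so instead of calling int(self.last_player) inside parse you could convert it once in the constructor. A small sketch reusing the names from the code above:

import scrapy


class MySpider(scrapy.Spider):

    def __init__(self, last_player=0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # "scrapy crawl <my spider> -a last_player=X" passes X in as a string
        self.last_player = int(last_player)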
