Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
502 views
in Technique[技术] by (71.8m points)

python - Checking a url for a 404 error scrapy

I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number present in the url (e.g. "http://www.website.com/page/1")

I would like to use a for loop in scrapy to increment the current guess at the page and stop when it reaches a 404. I know the response that is returned from the request contains this information, but I'm not sure how to automatically get a response from a request.

Any ideas on how to do this?

Currently my code is something along the lines of :

def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while(stillExists):
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404: #This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can do something like this:

from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    #print fullURL
    try:
        req = urllib2.Request(fullURL)
        resp = urllib2.urlopen(req)
        if resp.getcode() == 404:
            #Do whatever you want if 404 is found
            print ("404 Found!")
        else:
            #Do your normal stuff here if page is found.
            print ("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except:
        print ("Could not connect to URL: {0} ".format(fullURL))

This iterates through the range and attempts to connect to each URL via urllib2. I don't know scapy or how your example function opens the URL but this is an example with how to do it via urllib2.

Note that most sites that utilize this type of URL format are normally running a CMS that can automatically redirect non-existent pages to a custom 404 - Not Found page which will still show up as a HTTP status code of 200. In this case, the best way to look for a page that may show up but is actually just the custom 404 page, you should do some screen scraping and look for anything that may not appear during a "normal" page return such as text that says "Page not found" or something similar and unique to the custom 404 page.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...