python - Scrapy works fine until page 12 of ASP.NET site, then 500 error

This is my first scraping project with Python/Scrapy. The site is http://pabigtrees.com/, which has 78 pages of results with 20 items (trees) per page. Below is the full spider, with a few changes to make a minimal demonstration (it scrapes only one value per page):

import scrapy
from pabigtrees.items import Tree

class TreesSpider(scrapy.Spider):
  name = "trees"
  start_urls = ["http://pabigtrees.com/view_tree.aspx"]
  allowed_domains = ["pabigtrees.com"]
  download_delay = 2

  def parse(self, response):
    for page in [1,11,12]:
    #for page in range(1,79):
      if page == 1:
        yield scrapy.FormRequest.from_response(
          response,
          #callback=self.parse_page
          callback=self.parse_test
        )
      else:
        yield scrapy.FormRequest.from_response(
          response,
          formdata={
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
            '__EVENTARGUMENT': "Page$" + str(page),
            'ctl00$ContentPlaceHolder1$genus_latin': '0',
            'ctl00$ContentPlaceHolder1$genus_common': '0',
            'ctl00$ContentPlaceHolder1$county': '0',
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__VIEWSTATEGENERATOR': response.css('input#__VIEWSTATEGENERATOR::attr(value)').extract_first(),
            '__SCROLLPOSITIONX': response.css('input#__SCROLLPOSITIONX::attr(value)').extract_first(),
            '__SCROLLPOSITIONY': response.css('input#__SCROLLPOSITIONY::attr(value)').extract_first(),
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
          },
          #callback=self.parse_page
          callback=self.parse_test
        )

  def parse_test(self, response):
    yield {
      'county': response.xpath('//a[contains(@href,"Select$1")]/../../../td[5]/font/text()').extract_first()
    }

  def parse_page(self, response):
    for tree in range(0, 20):
      yield scrapy.FormRequest.from_response(
        response,
        formdata={
          '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
          '__EVENTARGUMENT': "Select$" + str(tree)
        },
        # save the county from the list page because it is not available on the detail page
        meta={'county': response.xpath('//a[contains(@href,"Select$' + str(tree) + '")]/../../../td[5]/font/text()').extract_first()},
        callback=self.parse_results
      )

  def parse_results(self, response):
    item = Tree()
    genus = response.css('span#ctl00_ContentPlaceHolder1_tree_genus::text').extract()
    species = response.css('span#ctl00_ContentPlaceHolder1_tree_species::text').extract()
    circumference = response.css('span#ctl00_ContentPlaceHolder1_lblcircum::text').extract()
    spread = response.css('span#ctl00_ContentPlaceHolder1_lblSpread::text').extract()
    height = response.css('span#ctl00_ContentPlaceHolder1_lblHeight::text').extract()
    points = response.css('span#ctl00_ContentPlaceHolder1_lblPoints::text').extract()
    address = response.css('span#ctl00_ContentPlaceHolder1_lblAddress::text').extract()
    crew = response.xpath('//td[text()="Measuring Crew: "]/following-sibling::td/text()').extract()
    nominator = response.xpath('//td[text()="Original Nominator: "]/following-sibling::td/text()').extract()
    comments = response.xpath('//td[text()="Comments: "]/following-sibling::td/text()').extract()
    gps = response.xpath('//td[text()="GPS Coordinates: "]/following-sibling::td/text()').extract()
    technique = response.css('span#ctl00_ContentPlaceHolder1_lblTech::text').extract()
    yearnominated = response.css('span#ctl00_ContentPlaceHolder1_lblNom::text').extract()
    yearlastmeasured = response.css('span#ctl00_ContentPlaceHolder1_lblMeasured::text').extract()
    item['a_county'] = response.meta['county']
    item['b_genus'] = genus
    item['c_species'] = species
    item['d_circumference'] = circumference
    item['e_spread'] = spread
    item['f_height'] = height
    item['g_points'] = points
    item['h_address'] = address
    item['i_crew'] = crew
    item['j_nominator'] = nominator
    item['k_comments'] = comments
    item['l_gps'] = gps
    item['m_technique'] = technique
    item['n_yearnominated'] = yearnominated
    item['o_yearlastmeasured'] = yearlastmeasured
    return item

The crawler works fine up through page 11. On page 12 and above, I get 500 errors. I believe it has something to do with the pagination, but I think I am sending the correct __VIEWSTATE and the other hidden fields. Here's the output:

(python3) Al-Green:pabigtrees Tony$ scrapy crawl trees -o trees.csv
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: pabigtrees)
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-14 15:31:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pabigtrees', 'FEED_FORMAT': 'csv', 'FEED_URI': 'trees.csv', 'NEWSPIDER_MODULE': 'pabigtrees.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pabigtrees.spiders']}
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-14 15:31:18 [scrapy.core.engine] INFO: Spider opened
2018-04-14 15:31:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-14 15:31:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-14 15:31:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://pabigtrees.com/robots.txt> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://pabigtrees.com/view_tree.aspx> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Dauphin'}
2018-04-14 15:31:33 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Delaware'}
2018-04-14 15:31:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 1 times): 500 Internal Server Error
2018-04-14 15:31:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 2 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 3 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.core.engine] DEBUG: Crawled (500) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:39 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://pabigtrees.com/view_tree.aspx>: HTTP status code is not handled or not allowed
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-14 15:31:39 [scrapy.extensions.feedexport] INFO: Stored csv feed (2 items) in: trees.csv
2018-04-14 15:31:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 134895,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 5,
 'downloader/response_bytes': 98019,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 14, 19, 31, 39, 475017),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/500': 1,
 'item_scraped_count': 2,
 'log_count/DEBUG': 11,
 'log_count/INFO': 9,
 'memusage/max': 50180096,
 'memusage/startup': 50176000,
 'request_depth_max': 1,
 'response_received_count': 5,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/500 Internal Server Error': 2,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2018, 4, 14, 19, 31, 18, 563326)}
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Spider closed (finished)

I’m stumped, thanks!

1 Reply

The __VIEWSTATE is indeed what is causing you trouble.

If you take a look at the navigation of the site you're trying to scrape, you'll see it only links to 10 other pages:

[screenshot: the pager at the bottom of the results grid, showing numbered links for only 10 pages]

Those are the only 10 pages of this search you're allowed to request from the current page (with the current view state): the Page$N argument you post has to match one of the pager links that were actually rendered, or ASP.NET's event validation rejects the postback, which is where your 500s come from. The next 10 become accessible from page 11 of the search.

One possible solution would be to check in parse_page() whether you're on page 11 (or 21, or 31, ...), and if so, create the requests for the next 10 pages from that response, as in the sketch below.
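
A minimal sketch of that idea (untested, and it assumes the pager on pages 1, 11, 21, ... really does expose the next 10 Page$N links, per the screenshot above): carry the page number in meta, scrape the rows, and only fan out new paging requests from pages whose view state permits them:

  def parse_page(self, response):
    page = response.meta.get('page', 1)  # page 1 is the initial GET response

    # ... scrape the 20 rows of this page as before ...

    # only pages 1, 11, 21, ... render the pager links for the next block,
    # so only they can post Page$N requests that pass event validation
    if page % 10 == 1:
      for next_page in range(page + 1, min(page + 11, 79)):  # 78 pages total
        yield scrapy.FormRequest.from_response(
          response,
          formdata={
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
            '__EVENTARGUMENT': 'Page$' + str(next_page)
          },
          meta={'page': next_page},
          callback=self.parse_page
        )

With this chaining, parse() only needs to hand the initial response to parse_page() instead of looping over all 78 pages itself.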

Also, you only need to populate the formdata fields you want to change; FormRequest.from_response() takes care of the ones available in hidden input fields, such as __VIEWSTATE and __EVENTVALIDATION.
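
For example, the paging request in your spider could shrink to something like this (a sketch; it assumes the genus/county dropdowns default to the same '0' values you were sending explicitly, since from_response() also picks up select fields):

        yield scrapy.FormRequest.from_response(
          response,
          formdata={
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
            '__EVENTARGUMENT': 'Page$' + str(page)
          },
          callback=self.parse_test
        )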

