Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


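When pyspider retries a task, does it reuse the settings from the first fetch? For example, the proxy stays the same no matter how many times it retries.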

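I'm writing a crawler with pyspider and found that once a request fails, none of its retries ever succeed. My guess is that every retry uses the same proxy.
The relevant code is as follows: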

def on_start(self):
    # hc_proxy() is called once, here, when on_start runs
    self.crawl(
        'http://he.gsxt.gov.cn/notice/search/ent_announce_unit?announceType=0101&keyword=&pageNo={}&organ='.format(1),
        callback=self.index_page,
        proxy=hc_proxy(),
        save={'page': 1},
        headers=self.crawl_config['headers'],
        connect_timeout=60)

Every time, I call a function to obtain a proxy and pass it to crawl for fetching.
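
A likely explanation can be sketched as a simplified model (this is not pyspider's actual internals): Python evaluates `hc_proxy()` once, when `self.crawl(...)` is called, so the task stores a fixed proxy string, and a retry that re-runs the stored task would see the same value. Here `hc_proxy`, `make_task`, and `retry` are hypothetical stand-ins written for illustration, and the addresses are made up.

```python
import itertools

# Hypothetical stand-in for the poster's hc_proxy(): returns a different
# short-lived proxy address on every call.
_counter = itertools.count(1)


def hc_proxy():
    return "10.0.0.{}:8080".format(next(_counter))


def make_task(url, proxy):
    # By the time the task is built, the proxy is already a plain string.
    return {"url": url, "fetch": {"proxy": proxy}}


def retry(task, times):
    # Each retry re-runs the stored task; nothing calls hc_proxy() again.
    return [task["fetch"]["proxy"] for _ in range(times)]


task = make_task("http://example.com/", proxy=hc_proxy())
proxies_used = retry(task, 3)
print(proxies_used)
```

Under this model, a proxy that expires after one minute would make every later retry fail, which matches the symptom described.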

The end result is shown in the screenshot (clipboard.png).

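By the way, each of my proxies is only usable for one minute; after that it expires.
What exactly causes this behavior? I went through the source code myself, but I'm not experienced enough to find the answer.

I also tried other proxy setups, such as squid (where I connect to a single server and that server distributes requests across proxies for me), and that approach works.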



1 Reply


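pyspider does not plan to implement proxy management; even if I were to implement it, I would start a separate project for it.
For now squid works well. For example, here is a sample configuration that uses squid to distribute HTTP proxies: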

#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#
via off
forwarded_for off

request_header_access From deny all
request_header_access Server deny all
request_header_access WWW-Authenticate deny all
request_header_access Link deny all
request_header_access Cache-Control deny all
request_header_access Proxy-Connection deny all
request_header_access X-Cache deny all
request_header_access X-Cache-Lookup deny all
request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access Pragma deny all
request_header_access Keep-Alive deny all

include /etc/squid/peers.conf

never_direct allow all
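
With a setup like this, the crawler only needs one fixed proxy address, the squid server itself, and squid picks an upstream peer per request. A minimal sketch of the pyspider side, assuming squid listens on `127.0.0.1:3128` (3128 is squid's default `http_port`; the address is an assumption, not something given in the config above):

```python
# Assumed address where squid listens; adjust to your own deployment.
SQUID_PROXY = "127.0.0.1:3128"

# In a pyspider project this dict would be the handler's class-level
# crawl_config, which supplies default arguments for every self.crawl() call.
crawl_config = {
    "proxy": SQUID_PROXY,
    "connect_timeout": 60,
}
```

Because the address stored in each task never changes, retries keep working even while the upstream proxies behind squid rotate.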

peers.conf can be generated like this:

cache_peer ${IP} parent ${PORT} 0 login=${username}:${password} round-robin proxy-only no-query connect-fail-limit=2
cache_peer ${IP} parent ${PORT} 0 login=${username}:${password} round-robin proxy-only no-query connect-fail-limit=2
cache_peer ${IP} parent ${PORT} 0 login=${username}:${password} round-robin proxy-only no-query connect-fail-limit=2

Update peers.conf on a schedule and then run service squid reload.
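
The update step could be scripted roughly as follows. This is a sketch under assumptions: the proxy list is taken to be a list of dicts with `ip`, `port`, `username`, and `password` keys (a hypothetical shape; your proxy provider's format will differ), and `render_peers` and `update_squid` are illustrative names, not part of squid or pyspider.

```python
import subprocess

# Same cache_peer line as in the peers.conf template, with named placeholders.
PEER_TEMPLATE = (
    "cache_peer {ip} parent {port} 0 login={username}:{password} "
    "round-robin proxy-only no-query connect-fail-limit=2"
)


def render_peers(proxies):
    """Render one cache_peer line per proxy dict."""
    return "\n".join(PEER_TEMPLATE.format(**p) for p in proxies) + "\n"


def update_squid(proxies, path="/etc/squid/peers.conf"):
    """Rewrite peers.conf and ask squid to reload its configuration."""
    with open(path, "w") as f:
        f.write(render_peers(proxies))
    # Requires permission to write the file and to reload the service.
    subprocess.check_call(["service", "squid", "reload"])
```

Calling `update_squid(...)` from cron or a small loop with a fresh proxy list, at an interval shorter than the proxies' lifetime, would keep the peer list current.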

