Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.2k views
in Technique[技术] by (71.8m points)

python - Multiprocessing of Scrapy Spiders in Parallel Processes

There as several similar questions that I have already read on Stack Overflow. Unfortunately, I lost links of all of them, because my browsing history got deleted unexpectedly.

All of the above questions, couldn't help me. Either, some of them have used CELERY or some of them SCRAPYD, and I want to use the MULTIPROCESSISNG Library. Also, the Scrapy Official Documentation shows how to run multiple spiders on a SINGLE PROCESS, not on MULTIPLE PROCESSES.

None of them couldn't help me, and hence I decided to ask this question.

After several try's, I came up with this code.

My Output-:

Enter a product to search for: apple
2015-06-27 14:34:15 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-27 14:34:15 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-27 14:34:15 [scrapy] INFO: Optional features available: ssl, http11
2015-06-27 14:34:15 [scrapy] INFO: Optional features available: ssl, http11
2015-06-27 14:34:15 [scrapy] INFO: Overridden settings: {}
2015-06-27 14:34:15 [scrapy] INFO: Overridden settings: {}
2015-06-27 14:34:15 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-27 14:34:15 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-27 14:34:15 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-27 14:34:15 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-27 14:34:15 [scrapy] INFO: Enabled item pipelines: 
2015-06-27 14:34:15 [scrapy] INFO: Spider opened
2015-06-27 14:34:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-27 14:34:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-27 14:34:15 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-27 14:34:15 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-27 14:34:15 [scrapy] INFO: Enabled item pipelines: 
2015-06-27 14:34:15 [scrapy] INFO: Spider opened
2015-06-27 14:34:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-27 14:34:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2015-06-27 14:34:15 [twisted] ERROR: Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
    why = selectable.doWrite()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1117, in doWrite
    "doWrite called on a %s" % reflect.qual(self.__class__))
exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port

Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
    why = selectable.doWrite()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1117, in doWrite
    "doWrite called on a %s" % reflect.qual(self.__class__))
exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port
2015-06-27 14:34:16 [twisted] ERROR: Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
    why = selectable.doWrite()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1117, in doWrite
    "doWrite called on a %s" % reflect.qual(self.__class__))
exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port

Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
    why = selectable.doWrite()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1117, in doWrite
    "doWrite called on a %s" % reflect.qual(self.__class__))
exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port
2015-06-27 14:34:17 [scrapy] DEBUG: Crawled (200) <GET http://bigbasket.com/ps/?q=apple> (referer: None)
hello, world
Current second: 17
Current microsecond: 546862
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 170', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/10000007_18-fresho-apple-washington.jpg', 'product_link': 'http://bigbasket.com/pd/10000007/fresho-apple-washington-1-kg/', 'productname': 'Apple - Washington', 'current_price': 'Rs. 170'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 199', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/10000003_7-fresho-apple-fuji.jpg', 'product_link': 'http://bigbasket.com/pd/10000003/fresho-apple-fuji-1-kg/', 'productname': 'Apple - Fuji', 'current_price': 'Rs. 199'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 229', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/10000005_16-fresho-apple-royal-gala.jpg', 'product_link': 'http://bigbasket.com/pd/10000005/fresho-apple-royal-gala-1-kg/', 'productname': 'Apple - Royal Gala', 'current_price': 'Rs. 229'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 156.75', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/205988_2-american-garden-vinegar-apple-cider.jpg', 'product_link': 'http://bigbasket.com/pd/205988/american-garden-vinegar-apple-cider-473-ml-bottle/', 'productname': 'Vinegar - Apple Cider', 'current_price': 'Rs. 156.75'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 151', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/10000397_7-fresho-apple-green.jpg', 'product_link': 'http://bigbasket.com/pd/10000397/fresho-apple-green-500-gm/', 'productname': 'Apple - Green', 'current_price': 'Rs. 151'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 114', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/229785_5-tropicana-100-juice-apple.jpg', 'product_link': 'http://bigbasket.com/pd/229785/tropicana-100-juice-apple-1-ltr-tetra/', 'productname': '100% Juice - Apple', 'current_price': 'Rs. 114'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 266', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/40015763_1-mylife-vinegar-apple-cider.jpg', 'product_link': 'http://bigbasket.com/pd/40015763/mylife-vinegar-apple-cider-300-ml/', 'productname': 'Vinegar - Apple Cider', 'current_price': 'Rs. 266'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 175', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/40015525_1-fresho-apple-chilli.jpg', 'product_link': 'http://bigbasket.com/pd/40015525/fresho-apple-chilli-1-kg/', 'productname': 'Apple - Chilli', 'current_price': 'Rs. 175'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 94.05', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/229791_3-tropicana-juice-apple.jpg', 'product_link': 'http://bigbasket.com/pd/229791/tropicana-juice-apple-1-ltr-tetra/', 'productname': 'Juice - Apple', 'current_price': 'Rs. 94.05'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 93', 'imageurl': 'http://bigbasket.com/media/uploads/p/s/40015526_1-fresho-apple-chilli.jpg', 'product_link': 'http://bigbasket.com/pd/40015526/fresho-apple-chilli-500-gm/', 'productname': 'Apple - Chilli', 'current_price': 'Rs. 93'}
2015-06-27 14:34:17 [scrapy] DEBUG: Scraped from <200 http://bigbasket.com/ps/?q=apple>
{'outofstock_status': 'In Stock', 'offer': 'No additional offer available', 'mrp': 'Rs. 94.05', 'imageurl': 'http://bigb

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Scrapy is created with Twisted, and this framework already has its way of running multiple processes. There is nice question about this here. In your approach you are actually trying to marry two incompatible and competing libraries (Scrapy/Twisted + multiprocessing). This is probably not best idea, you can run into lots of problems with that.

If you would like to run Scrapy spiders in multiple processes it will much easier to just use Twisted. You could just read Twisted docs for spawnProcess and other calls and try to those tools for your goal. For example here's quick and dirty implementation that runs two spiders in two processes

from twisted.internet import defer, protocol, reactor
import os


class SpiderRunnerProtocol(protocol.ProcessProtocol):
    def __init__(self, d, inputt=None):
        self.deferred = d
        self.inputt = inputt
        self.output = ""
        self.err = ""

    def connectionMade(self):
        if self.inputt:
            self.transport.write(self.inputt)
        self.transport.closeStdin()

    def outReceived(self, data):
        self.output += data

    def processEnded(self, reason):
        print(reason.value)
        print(self.err)
        self.deferred.callback(self.output)

    def errReceived(self, data):
        self.err += data


def run_spider(cmd, *args, **kwargs):
    d = defer.Deferred()
    pipe = SpiderRunnerProtocol(d)
    args = [cmd] + list(args)
    env = os.environ.copy()
    x = reactor.spawnProcess(pipe, cmd, args, env=env)
    print(x.pid)
    print(x)
    return d


def print_out(result):
    print(result)

d = run_spider("scrapy", "crawl", "reddit")
d = run_spider("scrapy", "crawl", "dmoz")
d.addCallback(print_out)
d.addCallback(lambda _: reactor.stop())
reactor.run()

There's a nice blog post explaining usage of Twisted subprocesses here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...