twisted - Run a Scrapy spider in a Celery Task

Question

Welcome To Ask or Share your Answers For Others

twisted - Run a Scrapy spider in a Celery Task

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

twisted - Run a Scrapy spider in a Celery Task

This is not working anymore, scrapy's API has changed.

Now the documentation feature a way to "Run Scrapy from a script" but I get the ReactorNotRestartable error.

My task:

from celery import Task

from twisted.internet import reactor

from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from .spiders import MySpider



class MyTask(Task):
    def run(self, *args, **kwargs):
        spider = MySpider
        settings = get_project_settings()
        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()

        log.start()
        reactor.run()

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:30:21+0000

The twisted reactor cannot be restarted. A work around for this is to let the celery task fork a new child process for each crawl you want to execute as proposed in the following post:

Running Scrapy spiders in a Celery task

This gets around the "reactor cannot be restart-able" issue by utilizing the multiprocessing package. But the problem with this is that the workaround is now obsolete with the latest celery version due to the fact that you will instead run into another issue where a daemon process can't spawn sub processes. So in order for the workaround to work you need to go down in celery version.

Yes, and the scrapy API has changed. But with minor modifications (import Crawler instead of CrawlerProcess). You can get the workaround to work by going down in celery version.

The Celery issue can be found here: Celery Issue #1709

Here is my updated crawl-script that works with newer celery versions by utilizing billiard instead of multiprocessing:

from scrapy.crawler import Crawler
from scrapy.conf import settings
from myspider import MySpider
from scrapy import log, project
from twisted.internet import reactor
from billiard import Process
from scrapy.utils.project import get_project_settings
from scrapy import signals


class UrlCrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(settings)
        self.crawler.configure()
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()

def run_spider(url):
    spider = MySpider(url)
    crawler = UrlCrawlerScript(spider)
    crawler.start()
    crawler.join()

Edit: By reading the celery issue #1709 they suggest to use billiard instead of multiprocessing in order for the subprocess limitation to be lifted. In other words we should try billiard and see if it works!

Edit 2: Yes, by using billiard, my script works with the latest celery build! See my updated script.

Categories

twisted - Run a Scrapy spider in a Celery Task

twisted - Run a Scrapy spider in a Celery Task

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags