Correct way to nest Item data in Scrapy

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

What is the correct way to nest Item data?

For example, I want the output for a product to look like this:

{
    'price': price,
    'title': title,
    'meta': {
        'url': url,
        'added_on': added_on
    }
}

I have a scrapy.Item like this:

import scrapy
from itemloaders.processors import TakeFirst


class ProductItem(scrapy.Item):
    # Each field keeps only the first value collected by the loader.
    url = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(output_processor=TakeFirst())
    title = scrapy.Field(output_processor=TakeFirst())
    added_on = scrapy.Field(output_processor=TakeFirst())

Right now, I reformat the whole item in the pipeline according to a new item template:

class FormatedItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    meta = scrapy.Field()

and in the pipeline:

def process_item(self, item, spider):
    formated_item = FormatedItem()
    formated_item['title'] = item['title']
    formated_item['price'] = item['price']
    formated_item['meta'] = {
        'url': item['url'],
        'added_on': item['added_on']
    }
    return formated_item

Is this the correct way to approach this, or is there a more straightforward way that does not break the philosophy of the framework?
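
The reshaping step above does not strictly need a second Item class, because a Scrapy pipeline is an ordinary Python class and `process_item` may return a plain dict. A minimal sketch, assuming the flat item exposes the `url` and `added_on` keys from the question; the class name `NestMetaPipeline` and the `META_KEYS` attribute are hypothetical:

```python
class NestMetaPipeline:
    """Move selected flat fields under a nested 'meta' key."""

    # Hypothetical attribute: names of the fields that belong under 'meta'.
    META_KEYS = ("url", "added_on")

    def process_item(self, item, spider):
        # dict() accepts a plain dict as well as a scrapy.Item (a mapping).
        flat = dict(item)
        # Pop each meta field out of the top level, skipping absent ones.
        meta = {key: flat.pop(key) for key in self.META_KEYS if key in flat}
        flat["meta"] = meta
        return flat


# Usage with a plain dict standing in for a loaded ProductItem.
sample = {
    "price": "9.99",
    "title": "Widget",
    "url": "http://example.com/widget",
    "added_on": "2021-10-24",
}
nested = NestMetaPipeline().process_item(sample, spider=None)
```

Keeping the list of meta fields in one place means a new meta field only needs to be added to `META_KEYS`, not to every assignment in the pipeline.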

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:38:47+0000

UPDATE from comments: It looks like nested loaders are the updated approach. Another comment suggests that this approach will cause errors during serialization.

The best way to approach this is to create a main item class/loader and a meta item class/loader.

# scrapy.contrib has been removed from Scrapy; these are the current import paths.
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst


class MetaItem(Item):
    url = Field()
    added_on = Field()


class MainItem(Item):
    price = Field()
    title = Field()
    meta = Field(serializer=MetaItem)


class MainItemLoader(ItemLoader):
    default_item_class = MainItem
    default_output_processor = TakeFirst()


class MetaItemLoader(ItemLoader):
    default_item_class = MetaItem
    default_output_processor = TakeFirst()

Sample usage:

from scrapy import Spider
from scrapy.selector import Selector
from qwerty.items import MainItemLoader, MetaItemLoader


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]

    def parse(self, response):
        mainloader = MainItemLoader(selector=Selector(response))
        mainloader.add_value('title', 'test')
        mainloader.add_value('price', 'price')
        mainloader.add_value('meta', self.get_meta(response))
        return mainloader.load_item()

    def get_meta(self, response):
        metaloader = MetaItemLoader(selector=Selector(response))
        metaloader.add_value('url', response.url)
        metaloader.add_value('added_on', 'now')
        return metaloader.load_item()

After that, you can easily expand your items in the future by creating more "sub-items."
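
On the serialization concern mentioned in the update: one way to sidestep it is to convert nested items into plain dicts before they are exported. A minimal sketch, assuming nested values are mappings (a `scrapy.Item` is one) or lists; the helper name `to_plain` is hypothetical and is not part of Scrapy, and `MappingProxyType` is used only as a stand-in for a nested item:

```python
import json
from collections.abc import Mapping
from types import MappingProxyType


def to_plain(value):
    """Recursively convert mappings and sequences to plain dicts and lists."""
    if isinstance(value, Mapping):
        return {key: to_plain(inner) for key, inner in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(inner) for inner in value]
    return value


# Stand-in for a loaded MainItem whose 'meta' field holds a MetaItem.
# MappingProxyType is a mapping but not a dict, so json.dumps cannot
# encode it directly.
loaded = {
    "title": "test",
    "price": "price",
    "meta": MappingProxyType({"url": "http://example.com", "added_on": "now"}),
}

plain = to_plain(loaded)
encoded = json.dumps(plain, sort_keys=True)
```

If the `itemadapter` package is installed (it is a dependency of recent Scrapy releases), `ItemAdapter(item).asdict()` should perform a similar recursive conversion without a custom helper.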
