Add post-processing to AI spiders

In this chapter of the Zyte AI spiders tutorial, you will write post-processing code to edit output items with website-independent logic.

Compute a field out of another field

You will start with something simple: extending Product.additionalProperties to indicate the number of characters in the product description.

You could do this in a page object class, but that approach becomes a problem for changes that should apply regardless of the website.

You will instead create a Scrapy item pipeline.

Create a zyte_spider_templates_project/item_pipelines.py file with the following code:

zyte_spider_templates_project/item_pipelines.py
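from zyte_common_items import AdditionalProperty


class CustomItemPipeline:
    def process_item(self, item, spider):
        # description may be None when it could not be extracted.
        description = item.description or ""
        additional_property = AdditionalProperty(
            name="descriptionLength",
            value=str(len(description)),
        )
        # additionalProperties may also be None; start a new list if so.
        if item.additionalProperties is None:
            item.additionalProperties = []
        item.additionalProperties.append(additional_property)
        return item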

Then enable your item pipeline by adding the following code at the end of your zyte_spider_templates_project/settings.py file:

zyte_spider_templates_project/settings.py
ITEM_PIPELINES = {
    "zyte_spider_templates_project.item_pipelines.CustomItemPipeline": 0,
}
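The number assigned to each pipeline in ITEM_PIPELINES controls execution order: Scrapy passes items through pipelines from the lowest value to the highest, and values are customarily kept in the 0-1000 range. The fragment below is only a sketch of that ordering. AnotherPipeline is a hypothetical class that does not exist in this project, and run_order is an illustrative variable, not a Scrapy setting.

```python
# Sketch of a settings fragment; do not copy the second entry as is.
ITEM_PIPELINES = {
    "zyte_spider_templates_project.item_pipelines.CustomItemPipeline": 0,
    # Hypothetical second pipeline, shown for illustration only.
    "zyte_spider_templates_project.item_pipelines.AnotherPipeline": 500,
}

# Pipelines run from the lowest value to the highest, so sorting by value
# gives the order in which each item would visit them.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```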

The item pipeline code above reads the item description (item.description), calculates its length with len(), and converts that number into a string with str() to match the type of AdditionalProperty.value. The result is appended to item.additionalProperties.

If you run your spider now, you will be able to see this new additional property in the output of every item.
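To check this kind of logic without running a crawl, you can call process_item() directly. The sketch below does so with simplified stand-in classes: FakeProduct and FakeAdditionalProperty are hypothetical and only mimic the fields used here, they are not the real zyte_common_items classes. Treating a missing description as an empty string is also an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for the zyte_common_items classes, reduced to
# the fields this example needs. They are not the real classes.
@dataclass
class FakeAdditionalProperty:
    name: str
    value: str


@dataclass
class FakeProduct:
    description: Optional[str] = None
    additionalProperties: List[FakeAdditionalProperty] = field(default_factory=list)


class DescriptionLengthPipeline:
    """Mirrors the idea of the pipeline above, using the stand-in classes."""

    def process_item(self, item, spider):
        # Assumption: a missing description counts as zero characters.
        length = len(item.description or "")
        item.additionalProperties.append(
            FakeAdditionalProperty(name="descriptionLength", value=str(length))
        )
        return item


item = FakeProduct(description="A sturdy oak table.")
result = DescriptionLengthPipeline().process_item(item, spider=None)
print(result.additionalProperties[-1])
```

Because the pipeline only relies on attribute access, the same call pattern should work with a real item, although that is not verified here.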

Change the output data schema

Users of AI spiders often want the output data to follow a different schema. Some do not want nested fields; others prefer different field names, or different data types or formats for values.
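As an illustration of one such change, removing nested fields, the sketch below flattens nested dictionaries into a single level. The flatten() helper, its parameters, and the sample data are hypothetical; it works on plain dictionaries and is not part of Scrapy or zyte-common-items.

```python
def flatten(data, parent_key="", sep="_"):
    """Return a new dict in which nested dicts are merged into one level."""
    flat = {}
    for key, value in data.items():
        # Prefix nested keys with the key of their parent, e.g. mainImage_url.
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


nested = {
    "name": "Example product",
    "mainImage": {"url": "https://example.com/image.jpg"},
}
flat = flatten(nested)
print(flat)
```

A generic helper like this keeps every field; an explicit mapping is usually a better fit when you also want to rename or drop fields.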

While it is possible to declare custom item types with a custom schema, as in this example that adds a new field to the output of an AI spider, it is generally recommended to stick to the standard item types in page objects and spider code, and to apply schema changes in an item pipeline.

For example, replace your item pipeline code above with the following code:

zyte_spider_templates_project/item_pipelines.py
class CustomItemPipeline:
    def process_item(self, item, spider):
        # mainImage may be None when no image was extracted.
        image_url = item.mainImage.url if item.mainImage else None
        return {
            "name": item.name,
            "id": item.sku,
            "price": f"{item.currency} {item.price}",
            "image_url": image_url,
            "in_stock": item.availability == "InStock",
        }

The code above outputs items as 5-key dictionaries:

  • name is kept as is.

  • sku is renamed to id.

  • price gets the currency code prefixed, e.g. "GBP 13.92".

  • The nested url field of mainImage is replaced by a flat image_url field.

  • availability is replaced by a boolean in_stock field.

Run your spider again to see how the schema of the data in items.jsonl changes completely.
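To see the mapping on a single item without running a crawl, the sketch below applies the same field mapping to a stand-in object. FakeProduct, FakeImage, to_flat_dict(), and all sample values are hypothetical; the stand-ins only mimic the fields used here and are not the real zyte_common_items classes.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the real item classes, for illustration only.
@dataclass
class FakeImage:
    url: str


@dataclass
class FakeProduct:
    name: Optional[str] = None
    sku: Optional[str] = None
    currency: Optional[str] = None
    price: Optional[str] = None
    mainImage: Optional[FakeImage] = None
    availability: Optional[str] = None


def to_flat_dict(item):
    """Mirror the field mapping of the pipeline above."""
    return {
        "name": item.name,
        "id": item.sku,
        "price": f"{item.currency} {item.price}",
        # Assumption: a missing image maps to None instead of raising.
        "image_url": item.mainImage.url if item.mainImage else None,
        "in_stock": item.availability == "InStock",
    }


product = FakeProduct(
    name="Example product",
    sku="ABC-123",
    currency="GBP",
    price="13.92",
    mainImage=FakeImage(url="https://example.com/image.jpg"),
    availability="InStock",
)
flat_item = to_flat_dict(product)
print(flat_item)
```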

Next steps

You now know the recommended approach for making post-processing changes to your items in a website-independent way.

This is the end of the Zyte AI spiders tutorial.

See also