Add post-processing to AI spiders
In this chapter of the Zyte AI spiders tutorial, you will write post-processing code to edit output items with website-independent logic.
Compute a field out of another field
You will start with something simple: extending
Product.additionalProperties
to indicate the number of characters in the product description.
You could create a page object class that does the same, but page object classes are a poor fit for website-independent changes like this one.
You will instead create a Scrapy item pipeline.
Create a zyte_spider_templates_project/item_pipelines.py
file with the
following code:
from zyte_common_items import AdditionalProperty


class CustomItemPipeline:
    def process_item(self, item, spider):
        additional_property = AdditionalProperty(
            name="descriptionLength",
            value=str(len(item.description)),
        )
        item.additionalProperties.append(additional_property)
        return item
And then enable your item pipeline by adding the following code at the end
of your zyte_spider_templates_project/settings.py
file:
ITEM_PIPELINES = {
    "zyte_spider_templates_project.item_pipelines.CustomItemPipeline": 0,
}
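The 0 above is the pipeline priority: when multiple item pipelines are enabled, Scrapy runs them in ascending order of this number. For example, a second pipeline (the name below is hypothetical, used only for illustration) could be configured to run after CustomItemPipeline:

ITEM_PIPELINES = {
    "zyte_spider_templates_project.item_pipelines.CustomItemPipeline": 0,
    # Hypothetical second pipeline, run after CustomItemPipeline because
    # its priority number is higher:
    "zyte_spider_templates_project.item_pipelines.AnotherItemPipeline": 100,
}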
The item pipeline code above reads the item description (item.description), calculates its length with len(), and converts that number into a string with str() to conform to the typing of AdditionalProperty.value. The result is added to item.additionalProperties.
If you run your spider now, you will be able to see this new additional property in the output of every item.
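Note that, depending on the page, item.description and item.additionalProperties may be unset. A more defensive version of the pipeline (a sketch with assumed fallbacks, not part of the tutorial code above) could look like this:

from zyte_common_items import AdditionalProperty


class CustomItemPipeline:
    def process_item(self, item, spider):
        # Assumed fallback: treat a missing description as an empty string.
        description = item.description or ""
        # Assumed fallback: start an empty list if no additional properties
        # were extracted for this product.
        if item.additionalProperties is None:
            item.additionalProperties = []
        item.additionalProperties.append(
            AdditionalProperty(
                name="descriptionLength",
                value=str(len(description)),
            )
        )
        return item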
Change the output data schema
It is not uncommon for users of AI spiders to want the output data to follow a different schema: some do not want nested fields, others prefer different names for fields, or different data types or formats for values.
While it is possible to declare custom item types with a custom schema (as in this example, which adds a new field to the output of an AI spider), it is generally recommended to stick to the standard item types in page objects and spider code, and to apply schema changes in an item pipeline.
For example, replace your item pipeline code above with the following code:
class CustomItemPipeline:
    def process_item(self, item, spider):
        return {
            "name": item.name,
            "id": item.sku,
            "price": f"{item.currency} {item.price}",
            "image_url": item.mainImage.url,
            "in_stock": item.availability == "InStock",
        }
The code above outputs items as 5-key dictionaries:

- name is kept as is.
- sku is renamed to id.
- price gets the currency code prefixed, e.g. "GBP 13.92".
- The nested url field of mainImage is replaced by a flat image_url field.
- availability is replaced by a boolean in_stock field.
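Some of these fields may be missing for a given product, in which case attribute access like item.mainImage.url would raise an error. A more defensive variant (a sketch; the None fallbacks are assumptions, not part of the tutorial code) could be:

class CustomItemPipeline:
    def process_item(self, item, spider):
        # Assumed fallbacks: emit None when a source field is missing.
        return {
            "name": item.name,
            "id": item.sku,
            "price": (
                f"{item.currency} {item.price}"
                if item.currency and item.price
                else None
            ),
            "image_url": item.mainImage.url if item.mainImage else None,
            "in_stock": item.availability == "InStock",
        }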
Run your spider again to see how the
schema of the data in items.jsonl
changes completely.
Next steps
You now know the recommended approach to make post-processing changes to your item in a website-independent way.
This is the end of the Zyte AI spiders tutorial.
See also
Exporting scraped data, to learn how to export extracted data automatically to different data storage services.
Web scraping tutorial, featuring key features of Zyte API.