Use virtual spiders and templates#

In this chapter of the Zyte AI spiders tutorial, you will learn about Scrapy Cloud spider templates and virtual spiders.

Scrapy Cloud allows creating virtual spiders: spiders that are not actual Scrapy spiders, but rather named sets of spider arguments to be used when running an actual Scrapy spider, a spider template.

Create your first virtual spider#

Before you continue, deploy your latest code to Scrapy Cloud again if you have not done so since your latest changes:

shub deploy <project ID>

You will now create a virtual spider out of the e-commerce spider:

  1. Log into the Zyte dashboard and click on your Scrapy Cloud project.

    ../../../_images/open-project.png
  2. Click Spiders → Create spider, and under E-commerce, click Select.

    ../../../_images/create-spider.png

    Note

    You will notice there are 2 identical templates to choose from. Select the first one.

  3. Enter a name for your spider, e.g. books.toscrape.com / dark, and fill in the remaining fields with the same values that you passed when you last ran your custom spider locally:

    Extraction Source: httpResponseBody
    Crawl Strategy: Navigation
    Require Url Substring: dark
  4. Click Save at the bottom.

If you open Spiders → Dashboard now, you will see that, in addition to AI spiders and your custom spider, there is a new virtual spider:

../../../_images/virtual-spider-in-list.png

To run your new virtual spider, click its entry in the spider list and then click Run in the top-right corner. The virtual spider job will actually run your custom-ecommerce spider with the parameters that you specified when creating the virtual spider.
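Conceptually, running the virtual spider is equivalent to running your spider locally with the same arguments. A rough sketch, assuming the zyte-spider-templates argument names used in the earlier chapters (url, extract_from, crawl_strategy) and a hypothetical start URL:

scrapy crawl custom-ecommerce \
    -a url="http://books.toscrape.com" \
    -a extract_from=httpResponseBody \
    -a crawl_strategy=navigation \
    -a require_url_substring=dark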

You could now repeat the steps above, but change some of the parameters. For example, you could create a virtual spider called books.toscrape.com / murder and change Require Url Substring to murder. It would also show up as a separate spider in Scrapy Cloud, but it would actually run your custom-ecommerce spider with the specified parameters.

Virtual spiders make it easier to run and monitor jobs when the same spider needs to be called with different arguments on a regular basis.

If you modify your custom-ecommerce spider in the future, the changes will apply to all the virtual spiders that used it as a template.

Add metadata to your spider template#

Not every spider in a Scrapy project is a spider template. For a spider to be a spider template, it must declare spider metadata with the scrapy-spider-metadata Scrapy plugin and set template to True in that metadata.
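For illustration, a minimal sketch of what such a declaration could look like (MinimalTemplateSpider is a hypothetical example, not part of this tutorial's code):

from scrapy import Spider


class MinimalTemplateSpider(Spider):
    name = "minimal-template"
    # scrapy-spider-metadata reads this class attribute; "template": True
    # is what makes the spider selectable as a template in Scrapy Cloud.
    metadata = {
        "template": True,
        "title": "Minimal Template",
        "description": "A bare-bones spider template.",
    }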

All AI spiders are spider templates, and when you created the custom-ecommerce spider, it inherited the metadata of the ecommerce spider, which is why you saw 2 identical spider templates after clicking Create spider.

However, it is confusing to have 2 identical templates. Time to fix that. Replace your custom spider code with the following:

zyte_spider_templates_project/spiders/custom_ecommerce.py#
from typing import Iterable

from pydantic import Field
from scrapy import Request
from scrapy_poet import DummyResponse
from scrapy_spider_metadata import Args
from zyte_common_items import ProductNavigation
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class CustomEcommerceSpiderParams(EcommerceSpiderParams):
    require_url_substring: str = Field(
        title="Require URL substring",
        description="Only visit product URLs with this substring.",
        default="murder",
    )


class CustomEcommerceSpider(EcommerceSpider, Args[CustomEcommerceSpiderParams]):
    name = "custom-ecommerce"
    metadata = {
        **EcommerceSpider.metadata,
        "title": "Custom Ecommerce",
        "description": (
            "Ecommerce spider template that only visits books with URLs "
            "matching a specified substring."
        ),
    }

    def parse_navigation(
        self, response: DummyResponse, navigation: ProductNavigation
    ) -> Iterable[Request]:
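        # Filter the requests generated by the parent class: skip product
        # requests whose URL does not contain the required substring.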
        for request in super().parse_navigation(response, navigation):
            if (
                request.callback == self.parse_product
                and self.args.require_url_substring not in request.url
            ):
                continue
            yield request

The code above makes the following changes to your earlier code:

  • A new metadata class attribute is defined in your spider class. It inherits the metadata from the parent class and overrides the title and description keys, while keeping the template key, which is set to True in the parent class.

  • pydantic.Field() is now used to define additional metadata for your custom require_url_substring spider parameter.
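To verify the resulting metadata locally before deploying, you can print what scrapy-spider-metadata generates for your spider. A quick sketch, assuming the plugin's get_spider_metadata() helper and this project's module layout:

from scrapy_spider_metadata import get_spider_metadata

from zyte_spider_templates_project.spiders.custom_ecommerce import (
    CustomEcommerceSpider,
)

# Prints the metadata dictionary, including the JSON schema of the spider
# parameters, which Scrapy Cloud uses to render the spider creation form.
print(get_spider_metadata(CustomEcommerceSpider))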

Save your changes, re-deploy your code (shub deploy <project ID>) and open Spiders → Create spider again in Scrapy Cloud. You will now see that your custom spider template has changed accordingly.

../../../_images/custom-template.png

If you select your custom template and scroll down, you will notice that your custom parameter has changed accordingly as well.

../../../_images/custom-parameter.png

You could also remove your custom spider from the template list altogether by setting "template" to False in the metadata dictionary.
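For example (a sketch reusing the metadata dictionary from the code above):

    metadata = {
        **EcommerceSpider.metadata,
        "title": "Custom Ecommerce",
        "description": "...",
        # Hide this spider from the Create spider template list:
        "template": False,
    }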

Next steps#

Feel free to play around with spider templates and virtual spiders.

Once you are ready for more, move on to the next chapter.