Subclass AI spiders

In this chapter of the Zyte AI spiders tutorial, you will create a custom subclass of the e-commerce spider.

Page object classes are powerful, but they have limitations. When a feature cannot be implemented with page object classes, you can subclass an AI spider instead.

Add website-independent logic

One of the limitations of page objects is that they are poorly suited to implementing website-independent logic.

You can create a page object class that applies to any domain and give it a low priority, so that you can still override it for specific domains. However, if you implemented website-independent crawling logic that way, for example by changing the output of ProductNavigation.items, you would need to reproduce that logic in every website-specific page object class that you later implement with a higher priority.

Instead, for website-independent logic, it is usually better to create a new spider by subclassing the corresponding AI spider, and apply that website-independent logic to the spider itself.

You will now create a custom subclass of the e-commerce spider that only crawls book URLs with the string "murder" in them.

Create zyte_spider_templates_project/spiders/custom_ecommerce.py with the following code:

zyte_spider_templates_project/spiders/custom_ecommerce.py
from typing import Iterable

from scrapy import Request
from scrapy_poet import DummyResponse
from zyte_common_items import ProductNavigation
from zyte_spider_templates import EcommerceSpider


class CustomEcommerceSpider(EcommerceSpider):
    name = "custom-ecommerce"

    def parse_navigation(
        self, response: DummyResponse, navigation: ProductNavigation
    ) -> Iterable[Request]:
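        for request in super().parse_navigation(response, navigation):
            # Skip product requests whose URL lacks "murder"; other requests
            # (e.g. pagination) pass through unchanged.
            if (
                request.callback == self.parse_product
                and "murder" not in request.url
            ):
                continue
            yield request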

The code above:

  • Defines a new spider that subclasses EcommerceSpider.

  • Sets the name of your new spider to custom-ecommerce. To run your new spider, replace ecommerce with custom-ecommerce in your scrapy crawl command calls.

  • Overrides the parse_navigation callback of the e-commerce spider, the one that gets productNavigation and turns productNavigation.items and productNavigation.nextPage into new Scrapy requests. It iterates over the output of the parent class implementation, and skips book requests (i.e. those with self.parse_product as callback) that do not contain the string "murder" in their URL.
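The filtering pattern described above can be tried without Scrapy installed. The following is only a minimal sketch: FakeRequest, ParentSpider and FilteringSpider are hypothetical stand-ins for scrapy.Request, EcommerceSpider and CustomEcommerceSpider, and the example.com URLs are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""

    url: str
    callback: Callable


class ParentSpider:
    """Hypothetical stand-in for EcommerceSpider."""

    def parse_product(self, response):
        """Callback assigned to product (book) requests."""

    def parse_navigation(self, urls: Iterable[str]) -> Iterator[FakeRequest]:
        # Assumption for this sketch: URLs containing "/page-" are pagination
        # links; everything else is treated as a product link.
        for url in urls:
            if "/page-" in url:
                yield FakeRequest(url, self.parse_navigation)
            else:
                yield FakeRequest(url, self.parse_product)


class FilteringSpider(ParentSpider):
    def parse_navigation(self, urls: Iterable[str]) -> Iterator[FakeRequest]:
        for request in super().parse_navigation(urls):
            # Drop product requests without "murder" in the URL; keep the rest.
            if (
                request.callback == self.parse_product
                and "murder" not in request.url
            ):
                continue
            yield request


urls = [
    "https://example.com/a-murder-mystery_1/index.html",
    "https://example.com/a-quiet-garden_2/index.html",
    "https://example.com/page-2.html",
]
kept = [request.url for request in FilteringSpider().parse_navigation(urls)]
print(kept)
```

Only the first and third URLs survive: the second is a product request without "murder" in its URL, while the pagination request is never filtered because its callback is not parse_product.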

Run this spider:

scrapy crawl custom-ecommerce \
    -a url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html" \
    -a crawl_strategy=navigation \
    -a extract_from=httpResponseBody \
    -O items.jsonl

If you inspect items.jsonl, you may notice the following:

  • Now it only contains 6 books, instead of 32, because only 6 books have "murder" in their URL, where it is part of the book title.

  • The improvements you made in your AI spiders through page object classes also affect your new spider. aggregateRating is properly extracted, and only 2 requests are used for pagination.
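If you prefer to check the output programmatically rather than by eye, a short script along these lines may help. It is a sketch that assumes each line of items.jsonl is a JSON object with a url field; load_items and summarize are hypothetical helper names, not part of any Zyte or Scrapy API.

```python
import json
from pathlib import Path


def load_items(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize(items, substring):
    """Return the item count and the URLs that lack the substring."""
    unexpected = [
        item.get("url", "")
        for item in items
        if substring not in item.get("url", "")
    ]
    return len(items), unexpected


if __name__ == "__main__":
    path = Path("items.jsonl")
    if path.exists():
        count, unexpected = summarize(load_items(path), "murder")
        print(f"{count} items, {len(unexpected)} without 'murder' in the URL")
```

After the crawl above, you would expect the count to match the number of books reported in this chapter and the list of unexpected URLs to be empty.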

Add a spider parameter

You will now modify your spider so that the substring to filter by ("murder") is no longer set in code, and instead can be defined with a new spider argument on the command line.

Replace your custom spider code with the following:

zyte_spider_templates_project/spiders/custom_ecommerce.py
from typing import Iterable

from scrapy import Request
from scrapy_poet import DummyResponse
from scrapy_spider_metadata import Args
from zyte_common_items import ProductNavigation
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class CustomEcommerceSpiderParams(EcommerceSpiderParams):
    require_url_substring: str = "murder"


class CustomEcommerceSpider(EcommerceSpider, Args[CustomEcommerceSpiderParams]):
    name = "custom-ecommerce"

    def parse_navigation(
        self, response: DummyResponse, navigation: ProductNavigation
    ) -> Iterable[Request]:
        for request in super().parse_navigation(response, navigation):
            if (
                request.callback == self.parse_product
                and self.args.require_url_substring not in request.url
            ):
                continue
            yield request

Note

If you already know how to implement parameters in Scrapy spiders, you may have been surprised by the code above. AI spiders implement spider parameters using a more powerful approach enabled by the scrapy-spider-metadata Scrapy plugin.

In the code above:

  • You create a subclass of EcommerceSpiderParams that extends the parameters of the e-commerce spider with a new one, a parameter of type string called require_url_substring with "murder" as its default value.

  • You assign your new spider parameters class to your custom spider by adding Args[CustomEcommerceSpiderParams] to its parent classes.

  • You replace the previously hard-coded "murder" string in CustomEcommerceSpider.parse_navigation with self.args.require_url_substring.
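To see the "inherit the parameters, add one with a default" idea in isolation, here is a sketch that uses standard-library dataclasses as a stand-in for the real parameter classes, which are assumed here to be pydantic models. BaseParams and CustomParams are hypothetical names, and BaseParams shows only a single url field rather than the full set of e-commerce spider parameters.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseParams:
    """Hypothetical stand-in for EcommerceSpiderParams (one field shown)."""

    url: str


@dataclass
class CustomParams(BaseParams):
    """Adds a new parameter with a default, like CustomEcommerceSpiderParams."""

    require_url_substring: str = "murder"


# Omitting the new parameter falls back to its default value.
default_run = CustomParams(url="https://books.toscrape.com/")

# Passing it explicitly overrides the default, similar in spirit to
# "-a require_url_substring=dark" on the command line.
custom_run = CustomParams(
    url="https://books.toscrape.com/",
    require_url_substring="dark",
)

print([f.name for f in fields(CustomParams)])
print(default_run.require_url_substring, custom_run.require_url_substring)
```

The subclass keeps every inherited field and appends the new one, which is why existing command calls keep working unchanged.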

Because you set "murder" as the default value of the new parameter, if you run your spider with the same command call as before, without the parameter, it will also yield 6 items.

However, you can now filter by a different URL substring. For example, run:

scrapy crawl custom-ecommerce \
    -a url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html" \
    -a crawl_strategy=navigation \
    -a extract_from=httpResponseBody \
    -a require_url_substring=dark \
    -O items.jsonl

When inspecting items.jsonl, you will see that you now get 2 books instead, as only 2 books have the "dark" substring in their URL.

Next steps

Now that you know how to subclass AI spiders and extend them with new parameters, you can move on to the next chapter.