Subclass AI spiders
In this chapter of the Zyte AI spiders tutorial, you will create a custom subclass of the e-commerce spider.
While page object classes are very powerful, they have some limitations. To implement features in AI spiders that cannot be implemented with page object classes, you always have the option of subclassing those spiders.
Add website-independent logic
One of the limitations of page object classes is that they are not a good fit for implementing website-independent logic. You can create a page object class that applies to any domain, and give it a low priority so that you can still override it for specific domains. But if you implemented website-independent crawling logic that way, for example by changing the output of ProductNavigation.items, you would have to reproduce that logic in every website-specific page object class that you implement with a higher priority.
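For illustration, such a website-independent page object class might look like the following sketch. It assumes web-poet's handle_urls decorator, where an empty URL pattern applies to all domains and a priority below the default of 500 lets website-specific classes take precedence, and the AutoProductNavigationPage base class from zyte-common-items; the class name and the filtering logic are illustrative:

from typing import List, Optional

from web_poet import field, handle_urls
from zyte_common_items import AutoProductNavigationPage, ProbabilityRequest


# Hypothetical website-independent page object class: the empty pattern
# applies to every domain, and the below-default priority lets
# website-specific page object classes override it.
@handle_urls("", priority=400)
class FilterProductNavigationPage(AutoProductNavigationPage):
    @field
    def items(self) -> Optional[List[ProbabilityRequest]]:
        # Keep only product links whose URL contains "murder".
        return [
            request
            for request in self.product_navigation.items or []
            if "murder" in request.url
        ]

Any website-specific ProductNavigation page object class registered with a higher priority would replace this one entirely, filter included.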
Instead, for website-independent logic, it is usually better to create a new spider by subclassing the corresponding AI spider, and implement that logic in the spider itself.
You will now create a custom subclass of the e-commerce spider that only crawls book URLs with the string "murder"
in them.
Create zyte_spider_templates_project/spiders/custom_ecommerce.py
with the
following code:
from typing import Iterable

from scrapy import Request
from scrapy_poet import DummyResponse
from zyte_common_items import ProductNavigation

from zyte_spider_templates import EcommerceSpider


class CustomEcommerceSpider(EcommerceSpider):
    name = "custom-ecommerce"

    def parse_navigation(
        self, response: DummyResponse, navigation: ProductNavigation
    ) -> Iterable[Request]:
        for request in super().parse_navigation(response, navigation):
            if (
                request.callback == self.parse_product
                and "murder" not in request.url
            ):
                continue
            yield request
The code above:

- Defines a new spider that subclasses EcommerceSpider.
- Sets the name of your new spider to custom-ecommerce. To run your new spider, replace ecommerce with custom-ecommerce in your scrapy crawl command calls.
- Overrides the parse_navigation callback of the e-commerce spider, the one that gets productNavigation and turns productNavigation.items and productNavigation.nextPage into new Scrapy requests. It iterates over the output of the parent class implementation, and skips book requests (i.e. those with self.parse_product as callback) whose URL does not contain the string "murder". See the sketch after this list for what such a productNavigation input looks like.
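For reference, the productNavigation input that parse_navigation receives looks roughly like the following sketch (the URLs are illustrative values for the target website):

from zyte_common_items import ProbabilityRequest, ProductNavigation, Request

navigation = ProductNavigation(
    url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    items=[
        # Product detail pages; parse_navigation turns each of these into
        # a request with self.parse_product as its callback.
        ProbabilityRequest(
            url="https://books.toscrape.com/catalogue/sharp-objects_997/index.html"
        ),
    ],
    # Next pagination page; requested with parse_navigation as callback.
    nextPage=Request(
        url="https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html"
    ),
)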
If you run this spider:
scrapy crawl custom-ecommerce \
-a url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html" \
-a crawl_strategy=navigation \
-a extract_from=httpResponseBody \
-O items.jsonl
If you inspect items.jsonl, you may notice the following:

- It now contains only 6 books, instead of 32, because only 6 books have "murder" in their URL, as part of the book title.
- The improvements you made in your AI spiders through page object classes also affect your new spider: aggregateRating is properly extracted, and only 2 requests are used for pagination.
Add a spider parameter
You will now modify your spider so that the substring to filter by ("murder") is no longer set in code, and instead can be defined with a new spider argument on the command line.
Replace your custom spider code with the following:
from typing import Iterable

from scrapy import Request
from scrapy_poet import DummyResponse
from scrapy_spider_metadata import Args
from zyte_common_items import ProductNavigation

from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class CustomEcommerceSpiderParams(EcommerceSpiderParams):
    require_url_substring: str = "murder"


class CustomEcommerceSpider(EcommerceSpider, Args[CustomEcommerceSpiderParams]):
    name = "custom-ecommerce"

    def parse_navigation(
        self, response: DummyResponse, navigation: ProductNavigation
    ) -> Iterable[Request]:
        for request in super().parse_navigation(response, navigation):
            if (
                request.callback == self.parse_product
                and self.args.require_url_substring not in request.url
            ):
                continue
            yield request
Note
If you already know how to implement parameters in Scrapy spiders, you may have been surprised by the code above. AI spiders implement spider parameters using a more powerful approach enabled by the scrapy-spider-metadata Scrapy plugin.
In the code above:

- You create a subclass of EcommerceSpiderParams that extends the parameters of the e-commerce spider with a new one: a parameter of type string called require_url_substring with "murder" as its default value.
- You assign your new spider parameter class to your custom spider by adding Args to its parent classes.
- You replace the previously hard-coded "murder" string in CustomEcommerceSpider.parse_navigation with self.args.require_url_substring.
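Because parameter classes are pydantic models, you could go further and attach metadata and validation to the new parameter. For example, a sketch (the title, description, and length constraint are illustrative, not part of the tutorial code):

from pydantic import Field

from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class CustomEcommerceSpiderParams(EcommerceSpiderParams):
    # Hypothetical richer declaration of the same parameter.
    require_url_substring: str = Field(
        default="murder",
        title="Required URL substring",
        description="Only follow product links whose URL contains this string.",
        min_length=1,
    )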
Because you set "murder"
as the default value of the new parameter, if you
run your spider with the same command call as before, without the parameter, it will also yield
6 items.
However, you can now change the URL substring to filter to something else. For example, if you run:
scrapy crawl custom-ecommerce \
-a url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html" \
-a crawl_strategy=navigation \
-a extract_from=httpResponseBody \
-a require_url_substring=dark \
-O items.jsonl
When inspecting items.jsonl
, you will see that you now get 2 books
instead, as only 2 books have the "dark"
substring in their URL.
Next steps
Now that you know how to subclass AI spiders and extend them with new parameters, you can move on to the next chapter.