Add page objects to AI spiders#

In this chapter of the Zyte AI spiders tutorial, you will add custom code to AI spiders.

AI is not a perfect solution. Eventually, you will find that it does not extract some specific field from some specific website the way it should. If you are in a hurry, you do not need to wait for us to fix it: AI spiders are designed so that you are always in full control.

The first step to customize AI spiders is to download them. Our AI spiders are implemented as an open-source Scrapy project, so you just need to download that project, customize it, and run it wherever you want, e.g. locally or in Scrapy Cloud.

Download and run AI spiders#

Before you can customize AI spiders with code, you need to download their source code and make sure you can run it locally:

  1. Install Git.

  2. Install Python, version 3.8 or higher.

    Tip

    You can run python --version in a terminal window to make sure that your Python version is recent enough.

  3. Open a terminal window.

  4. Clone the https://github.com/zytedata/zyte-spider-templates-project repository and enter it:

    git clone https://github.com/zytedata/zyte-spider-templates-project.git
    cd zyte-spider-templates-project
    
  5. Create and activate a Python virtual environment, and install the requirements from requirements.txt into it.

    • On Windows:

      python -m venv venv
      venv\Scripts\activate.bat
      pip install -r requirements.txt
      
    • On macOS and Linux:

      python3 -m venv venv
      . venv/bin/activate
      pip install -r requirements.txt
      
  6. With a plain text editor or an IDE, set your Zyte API key in zyte_spider_templates_project/settings.py:

    ZYTE_API_KEY = "YOUR_API_KEY"
    

If you have followed these steps correctly, you should be able to run the e-commerce spider from your terminal window as follows:

scrapy crawl ecommerce \
    -a url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html" \
    -a crawl_strategy=navigation \
    -a extract_from=httpResponseBody \
    -O items.jsonl

The extracted data will be stored in an items.jsonl file in JSON Lines format: it should contain 32 lines, each with a JSON object ({…}).
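
For example, each line of items.jsonl should hold one book. A single line could look similar to the following sketch; the exact fields and values depend on what Zyte API extracts, so treat it as an illustration only:

{"url": "https://books.toscrape.com/catalogue/sharp-objects_997/index.html", "name": "Sharp Objects", "price": "47.82", "currencyRaw": "£", "availability": "InStock"}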

Customize product parsing#

When you switched your extraction source to httpResponseBody, you stopped getting the aggregateRating field. But you can fix that by writing custom code that parses that bit of data out of the webpage HTML.

AI spiders are implemented using web-poet, a web scraping framework with Scrapy support that allows implementing the parsing of a webpage as a Python class, called a page object class.

You can configure these page object classes to match a specific combination of URL pattern (e.g. a given domain) and output type (e.g. a product), and whenever your spider asks for data of that output type from a matching URL, your page object class will be used for the parsing.

So, first you need to add number-parser to requirements.txt, ideally with a pinned version, e.g.

requirements.txt#
number-parser==0.3.2

Then re-run the command to install requirements:

pip install -r requirements.txt

Finally, create a file at zyte_spider_templates_project/pages/books_toscrape_com.py with the following code:

zyte_spider_templates_project/pages/books_toscrape_com.py#
import attrs
from number_parser import parse_number
from web_poet import AnyResponse, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    response: AnyResponse

    @field
    def aggregateRating(self):
        element_class = self.response.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating_str = element_class.split(" ")[-1]
        rating = parse_number(rating_str)
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)

This is what the code above does:

  • The @handle_urls decorator indicates that this parsing code should only be used on the books.toscrape.com domain.

    Tip

    To write more complex URL patterns, check out the @handle_urls reference.

  • Do not think too hard about the @attrs.define decorator. You will learn why it is needed later on.

  • BooksToScrapeComProductPage is an arbitrary class name that follows naming conventions for page object classes. What’s important is that it subclasses AutoProductPage, a base class for page object classes that parse products. It uses Zyte API automatic extraction by default for all Product fields, so that you can override only specific fields with custom parsing code.

  • response: AnyResponse asks for an instance of AnyResponse through dependency injection, which your code can then access as self.response.

    Similarly to scrapy.http.TextResponse, AnyResponse gives you methods to easily run XPath or CSS selectors to get specific data.

    AnyResponse also allows your code to continue working as expected even if you change your extraction source from httpResponseBody to browserHtml in the future.

  • The @field-decorated aggregateRating method defines how to extract the Product.aggregateRating field.

    If you inspect the HTML code of the star rating in book webpages, you will notice that it starts with something like:

    <p class="star-rating Three">
    

    What this field implementation does is extract that class value, then take the last word, which is the rating in English (e.g. "Three"), and finally convert that into an actual number (e.g. 3) using number-parser.
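
    For example, in a Python shell (a quick sketch, assuming number-parser is installed in your virtual environment), the conversion should behave like this:

    >>> from number_parser import parse_number
    >>> parse_number("Three")
    3
    >>> parse_number("star-rating") is None
    True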

If you run your spider again, and you inspect the new content of items.jsonl, you will see that now the right rating is being extracted from every book webpage.

This is only one of many ways you can customize AI spiders through page objects. You can also:

  • Apply field processors (see the sketch after this list).

  • Use data from AI parsing in your custom field implementations by reading it from AutoProductPage.product.

    For example, to make product names all caps:

    zyte_spider_templates_project/pages/books_toscrape_com.py#
    @field
    def name(self):
        return self.product.name.upper()
    
  • Replace Product with a custom item class with custom fields.

    Tip

    It is recommended to stick to the standard item classes where possible, and use the additionalAttributes field to extract custom fields if their values can be strings.
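
For example, here is a minimal, standalone sketch of the field processors mentioned in the list above (rather than a change to the exact class you wrote earlier): web-poet's @field decorator accepts an out argument with a list of callables that are applied to the field value. The normalize_name helper is a hypothetical example, not part of the project:

from web_poet import field, handle_urls
from zyte_common_items import AutoProductPage


def normalize_name(value):
    # Hypothetical processor: collapse repeated whitespace in the product name.
    return " ".join(value.split()) if isinstance(value, str) else value


@handle_urls("books.toscrape.com")
class BooksToScrapeComProductPage(AutoProductPage):

    # Each callable in "out" runs, in order, on the value that the field returns.
    @field(out=[normalize_name])
    def name(self):
        return self.product.name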

Deploy and run your code on Scrapy Cloud#

Now that you have successfully customized your AI spiders project to improve its output for products from https://books.toscrape.com, and you have run the e-commerce spider locally to verify that it works as expected, you can deploy your custom code to Scrapy Cloud, so that you can run your improved AI spiders there as well:

  1. Install the latest version of shub, the Scrapy Cloud command-line application:

    pip install --upgrade shub
    
  2. Uncomment lines 5 and 6 in zyte-spider-templates-project/scrapinghub.yml and replace NNNNNN with the ID of your Scrapy Cloud project.

    To find your project ID, open the Zyte dashboard, select your project under Scrapy Cloud Projects, and copy your project ID from the browser address bar.

    For example, if the URL is https://app.zyte.com/p/000000/jobs, 000000 is your project ID.

  3. Copy your Scrapy Cloud API key from the Zyte dashboard.

  4. Run the following command and, when prompted, paste your API key and press Enter:

    shub login
    
  5. Run the following command, replacing 000000 with your actual project ID:

    shub deploy 000000
    

Your AI spiders project has now been deployed to your Scrapy Cloud project, replacing the default AI spiders tech stack.

Now you can run the e-commerce spider again to verify that aggregateRating is extracted as expected in Scrapy Cloud as well.

Whenever you make new changes to your AI spiders project locally, remember that you need to re-run the shub deploy <project ID> command to deploy your new changes to your Scrapy Cloud project.

Note

The default AI spiders tech stack, the one you get when you select Zyte’s AI-Powered Spiders during project creation, is automatically upgraded after new releases of the zyte-spider-templates library.

In contrast, after you deploy your own code to your Scrapy Cloud project, the tech stack of that project is no longer updated automatically. You are instead in full control, and decide when and how to upgrade it.

Customize navigation parsing#

AI spiders are extremely flexible. So far, you have extended the output of Zyte API automatic extraction for product. However, AI spiders also support replacing AI parsing altogether, or even using your own or third-party extraction instead of Zyte’s. No vendor lock-in!

You are now going to replace the AI parsing of productNavigation entirely with custom parsing code. Edit your zyte_spider_templates_project/pages/books_toscrape_com.py file to add new import statements and a new page object class:

zyte_spider_templates_project/pages/books_toscrape_com.py#
from zyte_common_items import ProbabilityRequest, ProductNavigationPage, Request


@handle_urls("books.toscrape.com")
class BooksToScrapeComProductNavigationPage(ProductNavigationPage):

    @field
    def items(self):
        return [
            ProbabilityRequest(url=self.response.urljoin(relative_url))
            for relative_url in self.response.css("article a::attr(href)").getall()
        ]

    @field
    def nextPage(self):
        relative_url = self.response.css(".next a::attr(href)").get()
        if not relative_url:
            return None
        return Request(url=self.response.urljoin(relative_url))

The code above:

  • Subclasses zyte_common_items.ProductNavigationPage instead of zyte_common_items.AutoProductNavigationPage. As a result, AI parsing is not used at all.

  • Implements only the necessary fields of ProductNavigation: items to parse the links to book pages, and nextPage to parse the link to the next page in a list of books.

  • Uses self.response. This time you do not need to declare it, because it is already declared by the ProductNavigationPage parent class.

    Note the use of self.response.urljoin(). This allows your code to handle relative URLs that you may find in some webpages, like ../a.html, and convert them to absolute URLs that Scrapy requests can use, e.g. https://example.com/a.html.

If you run your spider again, its output should be the same, but how it got that output has changed:

  • You no longer ask Zyte API for productNavigation, and instead ask only for httpResponseBody. As you can check for yourself in the Stats page, this makes the corresponding requests 80% cheaper; instead of $0.00030 each, they now cost $0.00006 each.

  • If you look at the downloader/request_count stats from your job output, or at the request count if you are using Scrapy Cloud, you might notice that the total number of requests went from 35 to 34. You are now using 1 fewer request to get the same content.

    The AI parsing of the second page of the Mystery category of books makes a mistake: it extracts the link to the previous page as nextPage. Scrapy skips already-seen URLs by default, but on https://books.toscrape.com there are 2 different URLs for the first page of a book category: one ending in index.html (the one you use as the start URL) and one ending in page-1.html (the one behind the Previous link on the 2nd page). You could have addressed this by changing your start URL to the one ending in page-1.html, but your custom implementation properly fixes the issue by not extracting the previous link in the first place.

All in all, you replaced 3 requests at $0.00030 each with 2 requests at $0.00006 each. Your navigation requests are 50% faster, and 87% cheaper. Your overall crawl is 3% faster and 12% cheaper.

While optimizing feels great, in the future it would be a good idea to analyze the potential gains of such a change before you decide to implement it: Do the cost savings justify the time investment?

In this specific case, it might be hard to justify making this change. Even in a best-case scenario where you run this crawl daily and the change only took you 10 minutes, if you value your time at $20/hour, you would only see a return on investment after 11.7 years.
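
For reference, this is the back-of-the-envelope arithmetic behind that estimate, assuming the figures above and one crawl per day:

time_invested = 10 / 60 * 20                    # 10 minutes at $20/hour ≈ $3.33
savings_per_crawl = 3 * 0.00030 - 2 * 0.00006   # $0.00078 saved per daily crawl
break_even = time_invested / savings_per_crawl / 365   # ≈ 11.7 years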

More importantly, while AI parsing is quite resilient to website changes, your custom parsing code could easily break. If a website change every 10 years broke your code, you might never see a return on investment.

That said, for crawls that run often and send thousands or even millions of requests, custom parsing code could sometimes offer significant savings. Also, while your custom code may break due to website changes, you will always have the option to switch to AI parsing when needed with a trivial code change (by commenting out the @handle_urls(…) line of the corresponding page object class).
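
For example, commenting out the decorator of the navigation page object class from earlier in this chapter would make the spider fall back to AI parsing of productNavigation on that website:

# @handle_urls("books.toscrape.com")
class BooksToScrapeComProductNavigationPage(ProductNavigationPage):
    ...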

Next steps#

You have learned how you can use page object classes to override, extend or replace AI parsing with custom parsing code for specific combinations of item types and URL patterns.

In the next chapter, you will go further and create entirely new spider classes based on AI spiders.