Generate crawling code with AI

Now that you have generated parsing code with AI, you will use AI to generate book URL discovery code by parsing navigation webpages (homepage, categories) from https://books.toscrape.com.

1. Generate navigation code

This time around, you can try generating an item class and a page object with a single prompt:

Now I want you to create a new item type, BookNavigation, for navigation data from the homepage and categories with the following optional fields: url, book_urls, next_page_url.

Then create a page object for that item type and books.toscrape.com.

The workflow will be the same as before.

The generated item should look something like:

copilot-tutorial/copilot_tutorial/items.py
from dataclasses import dataclass


@dataclass
class BookNavigation:
    url: str | None = None
    book_urls: list[str] | None = None
    next_page_url: str | None = None

2. Create a crawling spider

Finally, add a new spider to copilot-tutorial/copilot_tutorial/spiders/books.py that uses the new BookNavigation item to implement crawling:

from scrapy import Request, Spider

from copilot_tutorial.items import Book, BookNavigation


class BookNavigationSpider(Spider):
    name = "books"
    url: str

    async def start(self):
        yield Request(self.url, callback=self.parse_navigation)

    async def parse_navigation(self, response, navigation: BookNavigation):
        if navigation.next_page_url:
            yield response.follow(navigation.next_page_url, callback=self.parse_navigation)
        for url in navigation.book_urls or []:
            yield response.follow(url, callback=self.parse_book)

    async def parse_book(self, _, book: Book):
        yield book

Your new spider expects a navigation page URL as its url argument, follows pagination, and extracts every book it finds along the way.

Before you run it, however, it is best to add the following at the end of copilot-tutorial/copilot_tutorial/settings.py:

copilot-tutorial/copilot_tutorial/settings.py
DOWNLOAD_SLOTS = {
    "books.toscrape.com": {
        "delay": 0.01,
        "concurrency": 16,
    },
}

By default, Scrapy rate-limits requests. For https://books.toscrape.com, however, it is safe to use a higher concurrency and a lower delay, which makes running the spider much quicker.

Now run the books spider again with the following arguments:

-a url=https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
-o books.jsonl

It will add the 32 books from the Mystery category to books.jsonl.
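To double-check the result, you can count the records in the output file. The count_jsonl_records helper below is illustrative, not part of the project:

```python
import json


def count_jsonl_records(path: str) -> int:
    """Count the JSON records in a JSON Lines file, one record per line."""
    count = 0
    with open(path, encoding="utf-8") as file:
        for line in file:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed line
                count += 1
    return count
```

After the crawl, count_jsonl_records("books.jsonl") should return 32 for the Mystery category.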

Next steps

Congratulations! You have successfully used AI to generate maintainable web scraping code.

This concludes our AI-assisted web scraping tutorial.

If you are wondering what to do next, consider enabling Zyte API to avoid bans. See Enable Zyte API to avoid bans in the Web scraping tutorial.