Generate crawling code with AI#

Now that you have generated parsing code with AI, you will use AI to generate book URL discovery code by parsing navigation webpages (homepage, categories) from https://books.toscrape.com.

1. Generate navigation code#

This time around, you can try generating an item class and a page object with a single prompt:

Now I want you to create a new item type, BookNavigation, for navigation data from the homepage and categories with the following optional fields: url, book_urls, next_page_url.

Then create a page object for that item type and books.toscrape.com.

The workflow will be the same as before, except for the scrapy-poet setup, which is no longer necessary.

The generated item should look something like:

project/items.py#
from dataclasses import dataclass


@dataclass
class BookNavigation:
    url: str | None = None
    book_urls: list[str] | None = None
    next_page_url: str | None = None

2. Create a crawling spider#

Finally, add a new spider to project/spiders/books.py that uses the new BookNavigation item to implement crawling:

from scrapy import Request, Spider

from project.items import Book, BookNavigation


class BookNavigationSpider(Spider):
    name = "books"
    url: str

    async def start(self):
        yield Request(self.url, callback=self.parse_navigation)

    async def parse_navigation(self, response, navigation: BookNavigation):
        if navigation.next_page_url:
            yield response.follow(navigation.next_page_url, callback=self.parse_navigation)
        for url in navigation.book_urls or []:
            yield response.follow(url, callback=self.parse_book)

    async def parse_book(self, _, book: Book):
        yield book

Your new spider expects a navigation page URL as its url argument, follows pagination, and extracts every book it links to.
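The response.follow calls accept relative URLs like the ones on the site, resolving them against the current page URL much as urllib.parse.urljoin does. For example, assuming a next-page link of the form `page-2.html` on a category page:

```python
from urllib.parse import urljoin

# response.follow resolves relative links against the page URL,
# similar to urljoin; e.g. a "page-2.html" next link on a category page:
base = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
print(urljoin(base, "page-2.html"))
# → https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html
```

This is why the page object can return whatever URL form it finds in the HTML without normalizing it first.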

Before you run it, however, it is best to add the following at the end of project/settings.py:

project/settings.py#
DOWNLOAD_SLOTS = {
    "books.toscrape.com": {
        "delay": 0.01,
        "concurrency": 16,
    },
}

By default, Scrapy rate limits requests. For https://books.toscrape.com, however, it is safe to use higher concurrency and a lower delay, which makes the spider run much quicker.
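As a rough back-of-envelope (the request count below is hypothetical): Scrapy spaces consecutive requests to a slot by the configured delay, so that spacing alone puts a lower bound on total crawl time, while the concurrency setting controls how many responses can be in flight at once:

```python
# hypothetical crawl of 1,000 requests against one download slot
n_requests = 1_000
delay_ms = 10  # the 0.01 s delay configured above, in milliseconds

# enforced spacing alone: at least n_requests * delay seconds
min_seconds = n_requests * delay_ms / 1000
print(min_seconds)  # → 10.0
```

With the default, more conservative delay, the same spacing term dominates and the crawl takes correspondingly longer.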

Now run the spider:

scrapy crawl books -a url=https://books.toscrape.com/catalogue/category/books/mystery_3/index.html -o books.jsonl

It will add the 32 books from the Mystery category to books.jsonl.
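The output is JSON Lines: one JSON object per line. You can post-process it with the standard library alone; the field names below are made up for illustration, so use whatever your Book item actually defines:

```python
import json

# two lines as they might appear in books.jsonl (hypothetical fields)
sample = """\
{"title": "Sharp Objects", "price": "£47.82"}
{"title": "Tipping the Velvet", "price": "£53.74"}
"""

# parse each non-empty line as a separate JSON object
books = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(books))  # → 2
```

In a real run you would read the lines from `books.jsonl` instead of an inline string.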

Next steps#

Congratulations! You have successfully used AI to generate maintainable web scraping code.

This concludes our AI-assisted web scraping tutorial.

If you are wondering what to do next, here are some suggestions: