Generate crawling code with AI#
Now that you have generated parsing code with AI, you will use AI to generate book URL discovery code by parsing navigation webpages (homepage, categories) from https://books.toscrape.com.
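The spider you will write below relies on a BookNavigation item with two fields, next_page_url and book_urls. Its exact definition depends on the code you generated in the previous step, but as a rough sketch (assuming an attrs-based item, which is common in web-poet projects) it might look like this:

from typing import Optional

import attrs


@attrs.define
class BookNavigation:
    # URL of the next pagination page, if any.
    next_page_url: Optional[str] = None
    # URLs of the individual book pages listed on the current page.
    book_urls: Optional[list[str]] = None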
2. Create a crawling spider#
Finally, add a new spider to project/spiders/books.py that uses the new BookNavigation item to implement crawling:
from scrapy import Request, Spider

from project.items import Book, BookNavigation


class BookNavigationSpider(Spider):
    name = "books"

    # Start URL, passed on the command line with -a url=<navigation page URL>
    url: str

    async def start(self):
        yield Request(self.url, callback=self.parse_navigation)

    async def parse_navigation(self, response, navigation: BookNavigation):
        # Keep following pagination, and visit every book listed on the page.
        if navigation.next_page_url:
            yield response.follow(navigation.next_page_url, callback=self.parse_navigation)
        for url in navigation.book_urls or []:
            yield response.follow(url, callback=self.parse_book)

    async def parse_book(self, _, book: Book):
        yield book
Your new spider expects a navigation page as its url argument, and can follow pagination and extract all relevant books. Note that the callbacks contain no parsing logic of their own: the BookNavigation and Book items are built from each response by the page objects you generated earlier and passed to the callbacks.
Before you run it, however, you should add the following at the end of project/settings.py:
project/settings.py
DOWNLOAD_SLOTS = {
    "books.toscrape.com": {
        "delay": 0.01,
        "concurrency": 16,
    },
}
By default, Scrapy rate-limits requests. For https://books.toscrape.com, however, it is safe to use a higher concurrency and a lower delay, which makes the spider run much faster.
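If you prefer not to configure per-domain download slots, the global settings below achieve roughly the same effect; note that, unlike DOWNLOAD_SLOTS, they apply to every domain your spider contacts, not just books.toscrape.com:

# Roughly equivalent global settings (affect every domain, not just one).
DOWNLOAD_DELAY = 0.01
CONCURRENT_REQUESTS_PER_DOMAIN = 16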
Now run the spider:
scrapy crawl books -a url=https://books.toscrape.com/catalogue/category/books/mystery_3/index.html -o books.jsonl
It will add the 32 books from the Mystery category to books.jsonl.
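To double-check the result, you can load the JSON Lines output and count its records, for example:

import json

with open("books.jsonl") as f:
    books = [json.loads(line) for line in f]

# 32 if books.jsonl was empty before this run; -o appends to an existing file.
print(len(books))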
Next steps#
Congratulations! You have successfully used AI to generate maintainable web scraping code.
This concludes our AI-assisted web scraping tutorial.
If you are wondering what to do next, here are some suggestions:
Add scrapy-zyte-api to your project, so that you can generate page objects for any website without the risk of getting banned (see the sketch after this list).
Follow our other tutorials: Web scraping tutorial and Zyte AI spiders tutorial.
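As a rough sketch only, and assuming a recent scrapy-zyte-api release with add-on support, enabling it usually comes down to installing the package and adding something like the following to project/settings.py; check the scrapy-zyte-api documentation for the exact, up-to-date setup:

# Route requests through Zyte API via the scrapy-zyte-api add-on.
ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
# Your Zyte API key; it can also be provided through the ZYTE_API_KEY
# environment variable instead.
ZYTE_API_KEY = "YOUR_API_KEY"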