Generate crawling code with AI¶
Now that you have generated parsing code with AI, you will use AI to generate book URL discovery code by parsing navigation webpages (homepage, categories) from https://books.toscrape.com.
2. Create a crawling spider¶
Finally, add a new spider to
copilot-tutorial/copilot_tutorial/spiders/books.py that uses the new
BookNavigation item to implement crawling:

copilot-tutorial/copilot_tutorial/spiders/books.py¶
```python
from scrapy import Request, Spider

from copilot_tutorial.items import Book, BookNavigation


class BookNavigationSpider(Spider):
    name = "books"
    url: str

    async def start(self):
        yield Request(self.url, callback=self.parse_navigation)

    async def parse_navigation(self, response, navigation: BookNavigation):
        if navigation.next_page_url:
            yield response.follow(
                navigation.next_page_url, callback=self.parse_navigation
            )
        for url in navigation.book_urls or []:
            yield response.follow(url, callback=self.parse_book)

    async def parse_book(self, _, book: Book):
        yield book
```
Your new spider expects a navigation page URL as its url argument, and can
follow pagination and extract all listed books.
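For reference, a minimal sketch of the BookNavigation item that this spider expects; the actual class was generated with AI in the previous step and may differ in details:

```python
# A minimal sketch of the BookNavigation item the spider relies on.
# Field names match the spider code above; the dataclass form is an
# assumption, your generated item may use a different base.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BookNavigation:
    next_page_url: Optional[str] = None  # URL of the next pagination page, if any
    book_urls: Optional[List[str]] = None  # URLs of book detail pages on this page
```

Both fields default to None, which is why the spider guards them with an `if` check and an `or []` fallback.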
Before you run it, however, you should add the following at the end of
copilot-tutorial/copilot_tutorial/settings.py:
copilot-tutorial/copilot_tutorial/settings.py¶
```python
DOWNLOAD_SLOTS = {
    "books.toscrape.com": {
        "delay": 0.01,
        "concurrency": 16,
    },
}
```
By default, Scrapy rate-limits requests. For https://books.toscrape.com, however, it is safe to use a higher concurrency and a lower delay, and it will make running the spider much quicker.
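If you prefer, the same effect can be approximated globally rather than per domain; a sketch using Scrapy's standard throttling settings (values illustrative, and unlike DOWNLOAD_SLOTS they apply to every domain the spider visits):

```python
# Alternative: relax the global defaults instead of configuring a
# per-domain download slot. These are standard Scrapy settings.
DOWNLOAD_DELAY = 0.01  # seconds to wait between requests (default: 0)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # default: 8
```

DOWNLOAD_SLOTS is the safer choice here because it raises limits only for books.toscrape.com.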
Now run the books spider again with the following arguments:

```
-a url=https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
-o books.jsonl
```
It will add the 32 books from the Mystery category to books.jsonl.
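Once the spider finishes, you can inspect the output programmatically; a minimal sketch that reads the JSON Lines file (the `load_books` helper is hypothetical, not part of the tutorial project):

```python
# Load scraped books from a JSON Lines file: one JSON object per line.
import json


def load_books(path="books.jsonl"):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `len(load_books())` should report 32 entries after crawling the Mystery category.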
Next steps¶
Congratulations! You have successfully used AI to generate maintainable web scraping code.
This concludes our AI-assisted web scraping tutorial.
If you are wondering what to do next, consider enabling Zyte API to avoid bans; see Enable Zyte API to avoid bans in the web scraping tutorial.