Automate parsing#

Now that you are familiar with browser automation, it is time to learn about parsing automation.

Your first spider parsed 3 fields from book webpages: name, price, and URL.

When targeting other websites, you will face 2 challenges:

  • You will probably want more fields.

    Our schema recommendation for an e-commerce product has more than 25 fields. You need to write parsing logic for every field and for every target website.

  • Websites change, and when they do they can break your parsing code.

    You need to monitor your web scraping project for breaking website changes, and update your parsing code accordingly when they occur.

These issues are time-consuming and scale up with additional fields and websites. To avoid them, let Zyte API handle parsing for you.

Create a file in tutorial/spiders/ with the following code:

from scrapy import Spider

class BooksToScrapeComExtractSpider(Spider):
    name = "books_toscrape_com_extract"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        for request in response.follow_all(book_links, callback=self.parse_book):
            request.meta["zyte_api_automap"] = {"product": True}
            yield request

    def parse_book(self, response):
        yield response.raw_api_response["product"]

The code above is a modification of your first spider that uses automatic extraction, where:

  • At the end of the parse callback method, requests for book URLs include request metadata (zyte_api_automap) that asks Zyte API to return structured data for an e-commerce product.

  • The parse_book callback method yields the product data from the Zyte API response.
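For the request metadata above to take effect, the scrapy-zyte-api plugin must be enabled in your project settings. If you followed the earlier browser automation steps it is likely already configured; as a reference, a minimal settings.py sketch (assuming the scrapy-zyte-api add-on and a placeholder API key) might look like:

```python
# tutorial/settings.py (sketch; assumes the scrapy-zyte-api plugin is installed)

ADDONS = {
    # Enables Zyte API request handling via the scrapy-zyte-api add-on.
    "scrapy_zyte_api.Addon": 500,
}

ZYTE_API_KEY = "YOUR_API_KEY"  # placeholder; use your real key or an environment variable
```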

Now run your code:

scrapy crawl books_toscrape_com_extract -O books.csv

Your code will now extract many more fields from each book, all without you having to write a single line of parsing code.
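The extracted product is a plain dict, so if you only need a subset of its fields for export, you can trim it before yielding. A minimal sketch, using a hypothetical helper and a made-up sample payload (the field names shown, like name and price, follow the pattern of the fields your first spider extracted):

```python
def select_fields(product, fields=("name", "price", "url")):
    """Keep only the requested keys from a product dict; missing keys are skipped."""
    return {key: product[key] for key in fields if key in product}


# Hypothetical sample of an automatically extracted product payload:
sample = {
    "name": "A Light in the Attic",
    "price": "51.77",
    "currency": "GBP",
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/",
}

print(select_fields(sample, fields=("name", "price")))
# {'name': 'A Light in the Attic', 'price': '51.77'}
```

In a spider, you would apply such a helper inside parse_book before yielding, keeping the CSV columns limited to the fields you care about.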


Zyte API automatic extraction requires you to specify the kind of data you want to extract.

Your spider above uses product to request the data of a single e-commerce product, but automatic extraction supports many other types of data extraction, and more will be added based on user demand.

For example, if you need to extract a news article or a blog post, use the article data extraction type instead.
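Switching extraction types only changes the key you set in the request metadata and the key you read from the response. A minimal sketch with a hypothetical helper, automap_meta, that builds the metadata for any extraction type:

```python
def automap_meta(extraction_type):
    """Build zyte_api_automap request metadata for a given extraction type."""
    return {"zyte_api_automap": {extraction_type: True}}


print(automap_meta("product"))
# {'zyte_api_automap': {'product': True}}
print(automap_meta("article"))
# {'zyte_api_automap': {'article': True}}
```

In the article case, your parsing callback would then read response.raw_api_response["article"] instead of response.raw_api_response["product"].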

This concludes our web scraping tutorial. To learn more, check out our web scraping guides, our documentation for Zyte API and Scrapy Cloud, and the Scrapy documentation. You can also visit our Support Center or reach out to the wider Scrapy community.