Automate parsing and crawling
Now that you are familiar with browser automation, it is time to learn about parsing and crawling automation.
Automate parsing
Your first spider parsed 3 fields from book webpages of books.toscrape.com: name, price, and url.
When targeting other websites, there are 2 challenges you are going to face:
You will probably want more fields.
Our schema recommendation for an e-commerce product has more than 25 fields. You need to write parsing logic for every field and for every target website.
Websites change, and when they do they can break your parsing code.
You need to monitor your web scraping project for breaking website changes, and update your parsing code accordingly when they occur.
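To illustrate the monitoring burden, a hand-maintained project might check each scraped item for empty fields, since an empty field is often the first sign that a selector no longer matches. The following is only a sketch of one possible check; `missing_fields` and `REQUIRED_FIELDS` are hypothetical names, and the field list is taken from your first spider.

```python
# Fields parsed by the first spider of this tutorial.
REQUIRED_FIELDS = ("name", "price", "url")


def missing_fields(item, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a scraped item."""
    # A field counts as missing when its key is absent, None, or an empty string.
    return [field for field in required if item.get(field) in (None, "")]
```

A check like this has to be kept in sync with every field and every target website, which is part of what makes manual parsing costly.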
These issues are time-consuming and scale up with additional fields and websites. To avoid them, let Zyte API handle parsing for you.
Create a file at tutorial/spiders/books_toscrape_com_extract.py with the following code:

from scrapy import Spider


class BooksToScrapeComExtractSpider(Spider):
    name = "books_toscrape_com_extract"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        for request in response.follow_all(book_links, callback=self.parse_book):
            request.meta["zyte_api_automap"] = {"product": True}
            yield request

    def parse_book(self, response):
        yield response.raw_api_response["product"]

The code above is a modification of your first spider that uses automatic extraction, where:
In requests for book URLs, at the end of the parse callback method, you include request metadata to have Zyte API give you structured data for an e-commerce product.
The parse_book callback method yields the product data from the Zyte API response.
Now run your code:
scrapy crawl books_toscrape_com_extract -O books.csv
Your code will now extract many more fields from each book, all without you having to write a single line of parsing code.
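To see which fields the run produced, you could read the header row of the output file. The following is a minimal sketch that uses only the Python standard library; `csv_columns` is a hypothetical helper name, and the sketch assumes that the first row of the exported file holds the field names.

```python
import csv


def csv_columns(path):
    """Return the column names from the header row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        # The first row is assumed to be the header; an empty file yields [].
        return next(reader, [])
```

After the crawl finishes, calling `csv_columns("books.csv")` should return the list of exported field names, which you can compare with the 3 fields of your first spider.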
Note
Zyte API automatic extraction requires you to specify the kind of data you want to extract.
Your spider above uses product to request the data of a single e-commerce product, but automatic extraction supports many other types of data extraction, and more will come following user demand.
For example, if you need to extract a news article or a blog post, use the article data extraction type instead.
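Switching data types amounts to changing one key in the request metadata and reading the matching key from the response. The sketch below illustrates this with plain dictionaries; `automap_meta` and `extracted_item` are hypothetical helper names, and the sketch assumes that the response stores the result under the same key as the requested type, as it does for product in the spider above.

```python
def automap_meta(data_type):
    """Build request metadata that asks Zyte API for one extraction type."""
    # Mirrors request.meta["zyte_api_automap"] = {"product": True} in the spider.
    return {"zyte_api_automap": {data_type: True}}


def extracted_item(raw_api_response, data_type):
    """Return the extracted data for data_type from a raw API response dict."""
    # Assumption: the result is stored under the key of the requested type.
    return raw_api_response[data_type]
```

In a spider, you might call `request.meta.update(automap_meta("article"))` in the parse callback and yield `extracted_item(response.raw_api_response, "article")` in the callback that handles article pages.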
Automate crawling
Your spider above uses automatic parsing for product detail pages, but it still hard-codes the logic to find the next page link and the links to product detail pages. What if you could also automate that code? What if you did not need to write a spider at all?
AI spiders are ready-to-use spiders that rely on Zyte API automatic extraction to both crawl and parse any website of a supported type, e.g. any e-commerce website.
To be able to use these ready-to-use spiders in your Scrapy project:
Install the latest version of zyte-spider-templates:
pip install --upgrade zyte-spider-templates
Also add the following line to your requirements.txt file, so that your project also continues to work when running on Scrapy Cloud:
zyte-spider-templates
Edit your settings.py file:
Add the following lines at the beginning of the file:

from itemadapter import ItemAdapter
from zyte_common_items import ZyteItemAdapter

ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)

Add "zyte_spider_templates.spiders" to your existing SPIDER_MODULES setting:

SPIDER_MODULES = [
    "tutorial.spiders",
    "zyte_spider_templates.spiders",
]

Add "scrapy_poet.InjectionMiddleware": 543 to your existing DOWNLOADER_MIDDLEWARES setting:

DOWNLOADER_MIDDLEWARES = {
    "scrapy_poet.InjectionMiddleware": 543,
}

Define the following additional settings:

SPIDER_MIDDLEWARES = {
    "scrapy_poet.RetryMiddleware": 275,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    "zyte_spider_templates.middlewares.AllowOffsiteMiddleware": 500,
    "zyte_spider_templates.middlewares.CrawlingLogsMiddleware": 1000,
}
SCRAPY_POET_DISCOVER = [
    "zyte_spider_templates.pages",
]
CLOSESPIDER_TIMEOUT_NO_ITEM = 600
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

Now that zyte-spider-templates is configured in your project, you can use its e-commerce spider:
scrapy crawl ecommerce -a url="http://books.toscrape.com/catalogue/category/books/mystery_3/index.html" -O books.csv
The specified target URL is all the e-commerce spider needs to get you the same data as before. No custom code is needed, now or whenever the target website changes in the future.
This concludes our web scraping tutorial. The tutorial code is available on GitHub. To learn more, check out our web scraping guides, our documentation for Zyte API and Scrapy Cloud, and the Scrapy documentation. You can also visit our Support Center or reach out to the wider web scraping and Scrapy communities.