Automate parsing and crawling#

Now that you are familiar with browser automation, it is time to learn about parsing and crawling automation.

Tip

This page covers AI-powered parsing and crawling applied on every request. See also the AI-assisted web scraping tutorial for an alternative approach, where AI is used to generate parsing and crawling code instead.

The approach described on this page is more robust to website changes, but it has a per-request cost. Choose the approach that best fits your needs, combine both, or use generated code by default and fall back to AI parsing when a website change breaks it, as in the sketch below.
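
A minimal sketch of that last combination, assuming a project already configured for Zyte API; the spider name and the fallback callback are illustrative, while the URLs and selectors mirror the tutorial example:

from scrapy import Spider


class BooksToScrapeComFallbackSpider(Spider):
    # Illustrative spider name; not part of the tutorial project.
    name = "books_toscrape_com_fallback"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    def parse_book(self, response):
        # Hand-written (or AI-generated) parsing code runs first.
        name = response.css("h1::text").get()
        price = response.css(".price_color::text").get()
        if name and price:
            yield {"name": name, "price": price, "url": response.url}
            return
        # If the selectors stop matching (e.g. after a website change), retry
        # the same URL with Zyte API automatic extraction as a fallback.
        yield response.request.replace(
            meta={"zyte_api_automap": {"product": True}},
            callback=self.parse_book_ai,
            dont_filter=True,
        )

    def parse_book_ai(self, response):
        # Yield the product data extracted by Zyte API.
        yield response.raw_api_response["product"]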

Automate parsing#

Your first spider parsed 3 fields from the book pages of books.toscrape.com: name, price, and url.

When targeting other websites, you will face 2 challenges:

  • You will probably want more fields.

    For example, our automatic extraction product schema has more than 25 fields. You need to write parsing logic for every combination of target field and target website. Web Scraping Copilot can speed up this work significantly, but it cannot make it go away.

  • Websites change, and when they do, they can break your parsing code.

    You need to monitor your web scraping project for breaking website changes, and update your parsing code accordingly when they occur.

These issues are time-consuming and scale up with additional fields and websites. To avoid them altogether, you can let Zyte API handle parsing for you.

Create a file at project/spiders/books_toscrape_com_extract.py with the following code:

from scrapy import Spider


class BooksToScrapeComExtractSpider(Spider):
    name = "books_toscrape_com_extract"
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY": 0.01,
    }
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        # Follow pagination links through the category pages.
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        # Follow links to book detail pages, asking Zyte API for automatic
        # product extraction on each of those requests.
        book_links = response.css("article a")
        for request in response.follow_all(book_links, callback=self.parse_book):
            request.meta["zyte_api_automap"] = {"product": True}
            yield request

    def parse_book(self, response):
        # Yield the structured product data returned by Zyte API.
        yield response.raw_api_response["product"]

The code above modifies your first spider to use automatic extraction:

  • At the end of the parse callback method, requests for book URLs include request metadata (zyte_api_automap) that asks Zyte API to return structured data for an e-commerce product.

  • The parse_book callback method yields the product data from the Zyte API response.

Now run your new books_toscrape_com_extract spider with -O books.csv as Arguments.

Your code will now extract many more fields from each book, all without you having to write a single line of parsing code.
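
For reference, a product item yielded for one of these books could look roughly like the excerpt below. The exact fields and values depend on the page and on the current product schema, so treat this as illustrative only (values abridged):

{
    "url": "http://books.toscrape.com/catalogue/sharp-objects_997/index.html",
    "name": "Sharp Objects",
    "price": "47.82",
    "currencyRaw": "£",
    "availability": "InStock",
    "mainImage": {"url": "..."},
    "images": [{"url": "..."}],
    "description": "...",
    "breadcrumbs": [{"name": "Books", "url": "..."}, {"name": "Mystery", "url": "..."}],
    "metadata": {"probability": 0.99, "dateDownloaded": "..."},
}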

Note

Zyte API automatic extraction requires you to specify the kind of data you want to extract.

Your spider above uses product to request the data of a single e-commerce product, but automatic extraction supports many other types of data extraction.

For example, if you need to extract a news article or a blog post, use the article data extraction type instead.
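
A minimal sketch of what that could look like, mirroring the request metadata used above; the spider name and target URL are placeholders:

from scrapy import Request, Spider


class ExampleArticleSpider(Spider):
    # Hypothetical spider; the target URL is a placeholder.
    name = "example_article"

    def start_requests(self):
        yield Request(
            "https://example.com/some-article",
            # Ask Zyte API for automatic article extraction instead of product.
            meta={"zyte_api_automap": {"article": True}},
        )

    def parse(self, response):
        # The extracted article (headline, body, authors, publication date, etc.).
        yield response.raw_api_response["article"]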

Automate crawling#

Your spider above uses automatic parsing for product detail pages, but it still hard-codes the logic to find the next page link and the links to product detail pages. What if you could also automate that code? What if you did not need to write a spider at all?

AI spiders are ready-to-use spiders that rely on Zyte API automatic extraction to both crawl and parse any website of a supported type, e.g. any e-commerce website.

To use these spiders in your Scrapy project:

  1. Install the latest version of zyte-spider-templates:

    pip install --upgrade zyte-spider-templates
    

    Also add the following line to your requirements.txt file, so that your project continues to work when running on Scrapy Cloud:

    zyte-spider-templates
    
  2. Edit your settings.py file (a consolidated sketch of the resulting file follows this list):

    1. Add the following lines at the beginning of the file:

      from itemadapter import ItemAdapter
      from zyte_common_items import ZyteItemAdapter
      
      ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)
      
    2. Add "zyte_spider_templates.spiders" to your existing SPIDER_MODULES setting:

      SPIDER_MODULES = [
          "project.spiders",
          "zyte_spider_templates.spiders",
      ]
      
    3. Add scrapy_poet.Addon and zyte_spider_templates.Addon to your existing ADDONS setting:

      import scrapy_poet
      import scrapy_zyte_api
      import zyte_spider_templates
      
      ADDONS = {
          scrapy_poet.Addon: 300,
          scrapy_zyte_api.Addon: 500,
          zyte_spider_templates.Addon: 700,
      }
      
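
Once you are done with both steps, your settings.py could look roughly like the sketch below; settings from earlier parts of the tutorial (for example, BOT_NAME or your Zyte API configuration) are omitted here for brevity:

from itemadapter import ItemAdapter
from zyte_common_items import ZyteItemAdapter

import scrapy_poet
import scrapy_zyte_api
import zyte_spider_templates

ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)

SPIDER_MODULES = [
    "project.spiders",
    "zyte_spider_templates.spiders",
]

ADDONS = {
    scrapy_poet.Addon: 300,
    scrapy_zyte_api.Addon: 500,
    zyte_spider_templates.Addon: 700,
}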

Now that zyte-spider-templates is configured in your project, refresh the Spiders view, and run the ecommerce spider with the following Arguments:

-a url="http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
-O books.csv
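
If you also run spiders locally during development, the equivalent command line would be roughly:

scrapy crawl ecommerce -a url="http://books.toscrape.com/catalogue/category/books/mystery_3/index.html" -O books.csv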

The specified target URL is all the e-commerce spider needs to get you the same data as before. No custom code needed, now or whenever the target website changes in the future.

This concludes our web scraping tutorial. The tutorial code is available on GitHub. To learn more, check out our web scraping guides (including Zyte AI spiders tutorial, a tutorial focused on AI spiders), our documentation for Zyte API and Scrapy Cloud, and the Scrapy documentation. You can also visit our Support Center or reach out to the wider web scraping and Scrapy communities.