Automate parsing and crawling

Now that you are familiar with browser automation, it is time to learn about parsing and crawling automation.

Automate parsing

Your first spider parsed 3 fields from each book page of books.toscrape.com: name, price, and url.

When targeting other websites, you will face 2 challenges:

  • You will probably want more fields.

    Our schema recommendation for an e-commerce product has more than 25 fields. You need to write parsing logic for every field and for every target website.

  • Websites change, and when they do they can break your parsing code.

    You need to monitor your web scraping project for breaking website changes, and update your parsing code accordingly when they occur.

These issues are time-consuming and scale up with additional fields and websites. To avoid them, let Zyte API handle parsing for you.

Create a file at tutorial/spiders/books_toscrape_com_extract.py with the following code:

from scrapy import Spider


class BooksToScrapeComExtractSpider(Spider):
    name = "books_toscrape_com_extract"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

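    def parse(self, response):
        # Follow the pagination link, if any; these requests use the
        # default callback, parse.
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        # Request each book page, asking Zyte API for product data.
        book_links = response.css("article a")
        for request in response.follow_all(book_links, callback=self.parse_book):
            request.meta["zyte_api_automap"] = {"product": True}
            yield request

    def parse_book(self, response):
        # The extracted product is a dict inside the raw Zyte API response.
        yield response.raw_api_response["product"]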

The code above is a modification of your first spider that uses automatic extraction, where:

  • At the end of the parse callback method, each request for a book URL carries request metadata that asks Zyte API for the structured data of an e-commerce product.

  • The parse_book callback method yields the product data from the Zyte API response.

Now run your code:

scrapy crawl books_toscrape_com_extract -O books.csv

Your code will now extract many more fields from each book, all without you having to write a single line of parsing code.
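
With this many columns, it can help to check which fields actually carry data for your target website. The following is a minimal sketch that uses only the Python standard library; the function name non_empty_columns and the sample rows are illustrative assumptions, not part of Scrapy or Zyte API.

```python
import csv
import io


def non_empty_columns(csv_text):
    """Return the CSV columns that have a value in at least one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = []
    for row in reader:
        for column, value in row.items():
            # Skip empty cells and the None key used for surplus fields.
            if column is not None and value and column not in seen:
                seen.append(column)
    return seen


# Made-up sample that mimics a few columns of an exported file.
sample = "name,price,sku\nBook A,10.00,\nBook B,12.50,\n"
print(non_empty_columns(sample))
```

To try it on your own output, pass it the text of books.csv, for example non_empty_columns(open("books.csv", encoding="utf-8").read()), assuming the file was exported with the default UTF-8 encoding.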

Note

Zyte API automatic extraction requires you to specify the kind of data you want to extract.

Your spider above uses product to request the data of a single e-commerce product, but automatic extraction supports many other types of data extraction, and more will come following user demand.

For example, if you need to extract a news article or a blog post, use the article data extraction type instead.
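
Switching to another extraction type only changes the key set in the request metadata. As a sketch, the hypothetical helper below (automap_meta is not part of Scrapy or of any Zyte library) builds that metadata for a given type, using the two types named above.

```python
def automap_meta(extraction_type):
    """Build request metadata asking Zyte API for one extraction type.

    Hypothetical helper for illustration; extraction_type is expected to
    be a type name such as "product" or "article".
    """
    return {"zyte_api_automap": {extraction_type: True}}


print(automap_meta("product"))
print(automap_meta("article"))
```

In the spider above, request.meta.update(automap_meta("article")) would take the place of the line that sets {"product": True}, and the callback would then read response.raw_api_response["article"] instead of the "product" key.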

Automate crawling

Your spider above uses automatic parsing for product detail pages, but it still hard-codes the logic to find the next page link and the links to product detail pages. What if you could also automate that code? What if you did not need to write a spider at all?

AI spiders are ready-to-use spiders that rely on Zyte API automatic extraction to both crawl and parse any website of a supported type, e.g. any e-commerce website.

To be able to use these ready-to-use spiders in your Scrapy project:

  1. Install the latest version of zyte-spider-templates:

    pip install --upgrade zyte-spider-templates
    

    Also add the following line to your requirements.txt file, so that your project also continues to work when running on Scrapy Cloud:

    zyte-spider-templates
    
  2. Edit your settings.py file:

    1. Add the following lines at the beginning of the file:

      from itemadapter import ItemAdapter
      from zyte_common_items import ZyteItemAdapter
      
      ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)
      
    2. Add "zyte_spider_templates.spiders" to your existing SPIDER_MODULES setting:

      SPIDER_MODULES = [
          "tutorial.spiders",
          "zyte_spider_templates.spiders",
      ]
      
    3. Add "scrapy_poet.InjectionMiddleware": 543 to your existing DOWNLOADER_MIDDLEWARES setting:

      DOWNLOADER_MIDDLEWARES = {
          "scrapy_poet.InjectionMiddleware": 543,
      }
      
    4. Define the following additional settings:

      SPIDER_MIDDLEWARES = {
          "scrapy_poet.RetryMiddleware": 275,
          "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
          "zyte_spider_templates.middlewares.AllowOffsiteMiddleware": 500,
          "zyte_spider_templates.middlewares.CrawlingLogsMiddleware": 1000,
      }
      SCRAPY_POET_DISCOVER = [
          "zyte_spider_templates.pages",
      ]
      CLOSESPIDER_TIMEOUT_NO_ITEM = 600
      SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
      SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
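
If your settings.py already defines SPIDER_MODULES or DOWNLOADER_MIDDLEWARES, steps 2 and 3 mean extending those values, not replacing them. The sketch below shows one way to do that in plain Python; the helper names and the pre-existing tutorial.middlewares.ExampleMiddleware entry are hypothetical and only stand in for whatever your project already contains.

```python
def add_module(modules, module):
    """Return a list with module appended, unless it is already listed."""
    return list(modules) if module in modules else [*modules, module]


def add_middleware(middlewares, path, priority):
    """Return a copy of middlewares with path set to priority."""
    return {**middlewares, path: priority}


# Hypothetical values that a project's settings.py might already hold.
SPIDER_MODULES = ["tutorial.spiders"]
DOWNLOADER_MIDDLEWARES = {"tutorial.middlewares.ExampleMiddleware": 600}

# Extend the existing values, keeping every earlier entry.
SPIDER_MODULES = add_module(SPIDER_MODULES, "zyte_spider_templates.spiders")
DOWNLOADER_MIDDLEWARES = add_middleware(
    DOWNLOADER_MIDDLEWARES, "scrapy_poet.InjectionMiddleware", 543
)
```

Editing the literal list and dictionary by hand, as shown in the steps above, gives the same result; the helpers only make the "keep what is already there" intent explicit.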
      

Now that zyte-spider-templates is configured in your project, you can use its e-commerce spider:

scrapy crawl ecommerce -a url="http://books.toscrape.com/catalogue/category/books/mystery_3/index.html" -O books.csv

The specified target URL is all the e-commerce spider needs to get you the same data as before. No custom code is needed, now or whenever the target website changes in the future.
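
Because the URL is the only input, pointing the spider at another website only changes the command line. The sketch below builds that command with the Python standard library; the function name ecommerce_crawl_command is hypothetical, and https://example.com/shop and example.csv are placeholders, not a real target or output.

```python
import shlex


def ecommerce_crawl_command(url, output):
    """Build the scrapy command line that runs the e-commerce spider."""
    args = ["scrapy", "crawl", "ecommerce", "-a", f"url={url}", "-O", output]
    # shlex.join quotes any argument that is not shell-safe as written.
    return shlex.join(args)


print(ecommerce_crawl_command("https://example.com/shop", "example.csv"))
```

The printed string can be pasted into a shell at the root of your Scrapy project, or the same argument list can be passed to subprocess.run if you prefer to launch crawls from a script.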

This concludes our web scraping tutorial. The tutorial code is available on GitHub. To learn more, check out our web scraping guides, our documentation for Zyte API and Scrapy Cloud, and the Scrapy documentation. You can also visit our Support Center or reach out to the wider web scraping and Scrapy communities.