Zyte API tutorial#

In this tutorial you will learn to use some of the main features of Zyte API.

You will start by using the Zyte dashboard to send requests with a few clicks, and by the end you will have a production-ready web data extraction project.

Before you start to follow this tutorial, you need to sign up for Zyte API. You get $5 credit for free, and you should only need a fraction of that to complete this tutorial.

Use the Zyte dashboard#

The easiest way to send your first Zyte API requests is to use the Zyte dashboard:

  1. Log into the Zyte dashboard and, under Tools → Zyte API, select Run request.

  2. Under Test URL with the UI → Enter a URL, enter toscrape.com, and select Request.

    A Zyte API extract request is sent for http://toscrape.com/, and when it finishes the Request Summary page opens.

  3. Select the Response tab to see the request outputs.

    By default it shows the webpage screenshot. Select HTML to show the browser HTML instead.

You have sent your first Zyte API request, and seen some of the browser features of Zyte API, specifically the ability to get screenshot and browser HTML outputs from a webpage.

Get a response body#

Browser-based outputs have their benefits. For example, browser HTML can sometimes save you significant development time, and a screenshot can be helpful for quality assurance checks.

However, the response body output is usually faster, more cost-efficient, and it supports non-HTML content, which often makes it a better choice than browser HTML.

Now, you will send a new Zyte API request, also from the Zyte dashboard, but this time you will request a response body instead of a screenshot and browser HTML:

  1. Select Copy Request.

  2. Under Request configuration, disable Browser Rendering & Screenshot.

  3. Select Request.

    Your new request should be quicker. When it finishes, the Response tab will show the response body.

Notice that the browser HTML output from your first request and the response body you got now are very similar; the differences are mostly whitespace. It is important to understand why that is:

  • The response body output is the body of the HTTP response to an HTTP request for the target URL.

    To see the HTTP response body in a web browser, select View page source in the webpage context menu or press Ctrl+U.

  • The browser HTML output is the HTML representation of the webpage data as seen in a web browser, including normalization of whitespace and HTML tags, and including changes made using JavaScript, such as loading data from an API on start or as a response to user input.

    To see the browser HTML in a web browser, select Inspect in the webpage context menu or press F12.

Because http://toscrape.com/ does not load any content through JavaScript, the HTML code you get is basically the same in both outputs, making response body the better choice for this specific webpage.

Use curl#

So far you have been using Zyte API through the Zyte dashboard. However, Zyte API is an HTTP API, so you can use it with any HTTP client.

Now, you will repeat your last request using curl, a popular command-line HTTP client that comes pre-installed on most systems:

  1. Copy your Zyte API key from the Zyte dashboard.

  2. Open a terminal window.

  3. Enter the following command, replacing YOUR_API_KEY with your actual API key, and making sure you keep the colon (:) after your API key:

    curl \
        --user YOUR_API_KEY: \
        --header 'Content-Type: application/json' \
        --data '{"url": "http://toscrape.com/", "httpResponseBody": true}' \
        https://api.zyte.com/v1/extract
    

    After you run that command, you will get a JSON output like:

    {"url":"http://toscrape.com/","statusCode":200,"httpResponseBody":"PCFET0[…]RtbD4K"}
    

    The value of the httpResponseBody key, PCFET0…RtbD4K, is the base64-encoded response body. When decoded, you get <!DOCTYPE html>…</html>.

    Note

    httpResponseBody is Base64-encoded because it can contain binary data, such as an image or a PDF file, which JSON does not support. Base64 is a common way to represent binary content as text in JSON.

Instead of using the Zyte dashboard, you have now used Zyte API with an HTTP client, using your API key and sending parameters in JSON format.
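As a sketch, decoding the httpResponseBody value from such a JSON response takes one call to Python's standard base64 module. The response dictionary below is hypothetical and truncated for readability:

```python
import base64

# Hypothetical, truncated Zyte API response for illustration.
response = {
    "url": "http://toscrape.com/",
    "statusCode": 200,
    "httpResponseBody": "PCFET0NUWVBFIGh0bWw+",  # "<!DOCTYPE html>", Base64-encoded
}

# Decode the Base64 body back into bytes, then into text.
html = base64.b64decode(response["httpResponseBody"]).decode("utf-8")
print(html)  # <!DOCTYPE html>
```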

However, while the Zyte dashboard and curl can be good debugging tools, they are not good choices for a web data extraction project. For a web data extraction project, you usually want to use a web data extraction framework.

Start a Scrapy project#

You will now create a web data extraction project using Scrapy, a popular open source web scraping framework written in Python and maintained by Zyte.

Tip

The rest of this tutorial can be easier to follow if you first learn a bit of Python and Scrapy.

Follow these steps to start a Scrapy project and configure it to use Zyte API:

  1. Install Python, version 3.7 or later.

    Tip

    You can run python --version in a terminal window to make sure that you have a good-enough version of Python.

  2. Open a terminal window.

  3. Create a zyte-api-tutorial folder and make it your working folder:

    mkdir zyte-api-tutorial
    cd zyte-api-tutorial
    
  4. Create and activate a Python virtual environment.

    • On Windows:

      python -m venv tutorial-env
      tutorial-env\Scripts\activate.bat
      
    • On macOS and Linux:

      python3 -m venv tutorial-env
      . tutorial-env/bin/activate
      
  5. Install the latest version of the Python packages that you will use during this tutorial:

    pip install --upgrade scrapy scrapy-zyte-api
    
  6. Make zyte-api-tutorial a Scrapy project folder:

    scrapy startproject tutorial .
    

    Your zyte-api-tutorial folder should now contain the following folders and files:

    zyte-api-tutorial/
    ├── scrapy.cfg
    └── tutorial/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── __init__.py
    
  7. Configure scrapy-zyte-api in transparent mode by adding the following code at the end of tutorial/settings.py, replacing YOUR_API_KEY with your actual Zyte API key:

    DOWNLOAD_HANDLERS = {
        "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
        "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    }
    DOWNLOADER_MIDDLEWARES = {
        "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
    }
    REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
    ZYTE_API_KEY = "YOUR_API_KEY"
    ZYTE_API_TRANSPARENT_MODE = True
    

    Note

    The scrapy startproject command should have already set the TWISTED_REACTOR setting to the required value.

Extract mystery books#

Now that you are all set up, you will write code to extract data from all books in the Mystery category of books.toscrape.com.

Create a file at tutorial/spiders/books_toscrape_com.py with the following code:

from scrapy import Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css(".price_color::text").re_first("£(.*)"),
            "url": response.url,
        }

In the code above:

  • You define a Scrapy spider class, BooksToScrapeComSpider, whose name is books_toscrape_com.

  • Your spider starts by sending a request for the Mystery category URL, http://books.toscrape.com/catalogue/category/books/mystery_3/index.html (start_urls), and parses the response with the default callback method, parse.

  • The parse callback method:

    • Finds the link to the next page and, if found, yields a request for it, whose response will also be parsed by the parse callback method.

      As a result, the parse callback method eventually parses all pages of the Mystery category.

    • Finds links to book detail pages, and yields requests for them, whose responses will be parsed by the parse_book callback method.

      As a result, the parse_book callback method eventually parses all book detail pages from the Mystery category.

  • The parse_book callback method extracts a record of book information with the book name, price, and URL.
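The price extraction relies on re_first, which returns the first capture group of the first regular expression match, or None when there is no match. Roughly, it behaves like this standard-library sketch (the sample string is hypothetical):

```python
import re

# Hypothetical text as response.css(".price_color::text").get() might return it.
price_text = "£51.77"

# re_first("£(.*)") is roughly: apply the pattern, return the first group, or None.
match = re.search("£(.*)", price_text)
price = match.group(1) if match else None
print(price)  # 51.77
```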

Now run your code:

scrapy crawl books_toscrape_com -O books.csv

Once execution finishes, the generated books.csv file will contain records for all books from the Mystery category of books.toscrape.com in CSV format. You can open books.csv with any spreadsheet app.

The code above is valid, regular Scrapy code, as you will now see. Open tutorial/settings.py, and change the value of ZYTE_API_TRANSPARENT_MODE to False. Then run your code again. Your code will work the same, except that requests will be sent directly from your computer to books.toscrape.com instead of going through Zyte API.

For books.toscrape.com specifically, that is not a problem. Other websites, however, may slow you down, send you bad responses, or ban your requests. Zyte API protects you from that cost-efficiently.

Now, make sure you set ZYTE_API_TRANSPARENT_MODE back to True, and continue the tutorial.

Extract quotes through API requests#

Your next target will be http://quotes.toscrape.com/scroll, from which you will extract 100 quotes.

However, the HTML code of that page contains no quotes at all. All 100 quotes are loaded dynamically through API requests sent from JavaScript code. To extract all 100 quotes, you will need to reproduce those API requests.

Create a file at tutorial/spiders/quotes_toscrape_com_scroll_api.py with the following code:

import json
from scrapy import Spider


class QuotesToScrapeComScrollAPISpider(Spider):
    name = "quotes_toscrape_com_scroll_api"
    start_urls = [
        f"http://quotes.toscrape.com/api/quotes?page={n}" for n in range(1, 11)
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "author": quote["author"]["name"],
                "tags": quote["tags"],
                "text": quote["text"],
            }

The code above sends 10 requests to the API of quotes.toscrape.com, reproducing what JavaScript code at http://quotes.toscrape.com/scroll does, and then parses the JSON response to extract the desired data.
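As a sketch, each page of that API returns JSON along these lines (the sample below is trimmed and partly hypothetical, keeping only the fields the spider uses), and json.loads turns it into nested dictionaries and lists:

```python
import json

# Trimmed, partly hypothetical example of one page of the quotes API.
raw = """
{
  "quotes": [
    {
      "author": {"name": "Albert Einstein"},
      "tags": ["change", "world"],
      "text": "Quote text here."
    }
  ]
}
"""

data = json.loads(raw)
for quote in data["quotes"]:
    print(quote["author"]["name"], quote["tags"], quote["text"])
```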

Now run your code:

scrapy crawl quotes_toscrape_com_scroll_api -O quotes.csv

After all 10 requests are processed, all 100 quotes can be found in quotes.csv.

When the information that you want to extract is not readily available in the response HTML, but is loaded by JavaScript, reproducing that JavaScript code manually, as you did above by sending those 10 requests, is one option. Next you will try an alternative approach.

Extract quotes with a browser-based request#

In addition to raw HTTP responses, Zyte API can also provide browser-rendered HTML responses and screenshots. These browser-rendered outputs enable you to use browser actions.

You will now ask Zyte API for browser HTML instead of raw HTML, and will use a browser action to get all 100 quotes with a single Zyte API request.

Create a file at tutorial/spiders/quotes_toscrape_com_scroll_browser.py with the following code:

from scrapy import Request, Spider


class QuotesToScrapeComScrollBrowserSpider(Spider):
    name = "quotes_toscrape_com_scroll_browser"

    def start_requests(self):
        yield Request(
            "http://quotes.toscrape.com/scroll",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "scrollBottom",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".text::text").get()[1:-1],
            }

The code above sends a single request to http://quotes.toscrape.com/scroll, but this request includes some metadata. That is why the start_requests method is used instead of start_urls, since the latter does not allow defining request metadata.

The specified metadata indicates to Zyte API that you want the URL to be loaded in a web browser, that you want to execute the scrollBottom action, and that you want the HTML rendering of the webpage DOM after that. The scrollBottom action keeps scrolling to the bottom of a webpage until that webpage stops loading additional content.
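For comparison, outside Scrapy the same browser-based request can be sketched as a raw Zyte API call with curl, reusing the endpoint and API key placeholder from earlier. The command is illustrative, and running it spends request credit:

```shell
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data '{"url": "http://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{"action": "scrollBottom"}]}' \
    https://api.zyte.com/v1/extract
```

The JSON response would contain a browserHtml key with the rendered HTML as text, rather than a Base64-encoded httpResponseBody.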

Now run your code:

scrapy crawl quotes_toscrape_com_scroll_browser -O quotes.csv

quotes.csv will have the same data as before, except that this time it was generated with a completely different approach.

Which option is best, reproducing JavaScript code manually or using browser HTML and actions, depends on each scenario. To choose one option or the other you need to factor in development time, run time, request count, request cost, etc.

Interact with a webpage#

Sometimes, it can be really hard to reproduce JavaScript code manually, or the resulting code can break too easily, making the option of browser HTML and actions a clear winner.

You will now extract a quote from http://quotes.toscrape.com/search.aspx by interacting with the search form through actions.

Create a file at tutorial/spiders/quotes_toscrape_com_search.py with the following code:

from scrapy import Request, Spider


class QuotesToScrapeComSearchSpider(Spider):
    name = "quotes_toscrape_com_search"

    def start_requests(self):
        yield Request(
            "http://quotes.toscrape.com/search.aspx",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#author"},
                            "values": ["Albert Einstein"],
                        },
                        {
                            "action": "waitForSelector",
                            "selector": {
                                "type": "css",
                                "value": "[value=\"world\"]",
                                "state": "attached",
                            },
                        },
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#tag"},
                            "values": ["world"],
                        },
                        {
                            "action": "click",
                            "selector": {"type": "css", "value": "[type='submit']"},
                        },
                        {
                            "action": "waitForSelector",
                            "selector": {"type": "css", "value": ".quote"},
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".content::text").get()[1:-1],
            }

The code above sends a request that makes Zyte API load http://quotes.toscrape.com/search.aspx and perform the following actions:

  1. Select Albert Einstein as author.

  2. Wait for the “world” tag to load.

  3. Select the “world” tag.

  4. Click the Search button.

  5. Wait for a quote to load.

From the HTML rendering of the DOM after those actions are executed, your code extracts all displayed quotes.

Now run your code:

scrapy crawl quotes_toscrape_com_search -O quotes.csv

quotes.csv will have one quote from Albert Einstein about the world.

If you were to try to write alternative code that, instead of relying on the browser HTML feature of Zyte API, reproduces the underlying JavaScript code with regular requests, it might take you a while to build a working solution, and your solution could be more fragile, i.e. more likely to break when the server code changes.

Next steps#

Now that you are familiar with the main aspects of Zyte API, see Zyte API usage for more in-depth documentation of Zyte API features, and Zyte API HTTP API for complete reference documentation.

To learn how to run your code in the cloud, see Using Zyte API from Scrapy Cloud, which is a direct continuation of this tutorial.

If you have existing code that you wish to migrate to Zyte API, see Migrating to Zyte API.