Zyte Data API general usage#

This page contains general information about Zyte Data API usage.

Code examples#

The code examples in Zyte Data API usage documentation feature the following software choices:

- curl
- the zyte-api command-line client and the AsyncClient class from python-zyte-api
- the requests and aiohttp Python libraries
- Scrapy with the scrapy-zyte-api plugin

Each code example assumes that you have installed and configured the corresponding software. See the official documentation of each tool for installation and configuration instructions.

Authorization#

Zyte Data API uses HTTP Basic authentication.

With curl, use your API key as the user name, and leave the password empty:

curl --user YOUR_API_KEY: …

With the zyte-api command-line client, define the ZYTE_API_KEY environment variable with your API key.

Alternatively, pass your API key with the --api-key option:

zyte-api --api-key YOUR_API_KEY …

With Python's requests library, use the auth keyword argument:

requests.post(..., auth=('YOUR_API_KEY', ''), ...)
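
For example, a complete request could look as follows; the /v1/extract endpoint and the httpResponseBody option are covered later on this page:

import requests

# Authenticate with the API key as user name and an empty password.
response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://books.toscrape.com/catalogue/page-1.html',
        'httpResponseBody': True,
    },
)
print(response.json())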

With python-zyte-api, define the ZYTE_API_KEY environment variable with your API key.

Alternatively, pass your API key through the api_key keyword parameter of your AsyncClient object:

AsyncClient(api_key='YOUR_API_KEY')

With Scrapy and the scrapy-zyte-api plugin, define the ZYTE_API_KEY environment variable with your API key.

Alternatively, set the ZYTE_API_KEY Scrapy setting to your API key:

settings.py#
  ZYTE_API_KEY = 'YOUR_API_KEY'

Getting high throughput#

A single request to Zyte Data API can take tens of seconds to process. The response time depends on the target website and on the task performed (which API features are used). For example, with the browserHtml feature it is common to get a response in 10 to 30 seconds.

This means that, if requests are sent sequentially, throughput can be quite low: a few responses per minute.

To increase throughput, send many requests in parallel instead of sequentially.

For example, if the average response time for your target website is 15 seconds and you want a throughput of 1 response per second (1 RPS), you should send 15 requests in parallel.
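
In other words, the concurrency you need is the target throughput multiplied by the average response time, as in this quick sketch of the arithmetic:

from math import ceil


def required_concurrency(target_rps, avg_response_seconds):
    # concurrency = throughput (responses/s) * latency (s/response)
    return ceil(target_rps * avg_response_seconds)


print(required_concurrency(1, 15))  # 15 parallel requests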

With curl, you can use xargs to send multiple requests in parallel, 15 at a time here, reading them from a JSON Lines input file; the awk step appends the newline that each response lacks, keeping output.jsonl one JSON object per line:

input.jsonl#
  {"url": "https://books.toscrape.com/catalogue/page-1.html", "httpResponseBody": true}
  {"url": "https://books.toscrape.com/catalogue/page-2.html", "httpResponseBody": true}
cat input.jsonl \
| xargs -P 15 -d\\n -n 1 \
bash -c "
    curl \
        --user YOUR_API_KEY: \
        --header 'Content-Type: application/json' \
        --data \"\$0\" \
        https://api.zyte.com/v1/extract \
    | awk '{print \$0}' \
    >> output.jsonl
"
With the zyte-api command-line client, reuse the same input.jsonl file, and set the number of parallel connections with the --n-conn option:

zyte-api --n-conn 15 input.jsonl -o output.jsonl

Using aiohttp:

import asyncio

import aiohttp

urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
]
output = []


async def extract(client, url):
    response = await client.post(
        'https://api.zyte.com/v1/extract',
        json={'url': url, 'httpResponseBody': True},
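        # BasicAuth's password defaults to an empty string, as the API expects.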
        auth=aiohttp.BasicAuth('YOUR_API_KEY'),
    )
    output.append(await response.json())


async def main():
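    # Cap simultaneous connections to the API host at 15, matching the target concurrency.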
    connector = aiohttp.TCPConnector(limit_per_host=15)
    async with aiohttp.ClientSession(connector=connector) as client:
        await asyncio.gather(*[extract(client, url) for url in urls])

asyncio.run(main())

Using python-zyte-api's AsyncClient:

import asyncio

from zyte_api.aio.client import AsyncClient, create_session

urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
]
output = []


async def main():
    connection_count = 15
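    # n_conn caps the number of parallel connections to the API.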
    client = AsyncClient(n_conn=connection_count)
    requests = [{'url': url, 'httpResponseBody': True} for url in urls]
    async with create_session(connection_count) as session:
        responses = client.request_parallel_as_completed(
            requests,
            session=session,
        )
        for response in responses:
            output.append(await response)

asyncio.run(main())

Using Scrapy with the scrapy-zyte-api plugin:

from base64 import b64encode

from scrapy import Request, Spider

urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
]


class ToScrapeSpider(Spider):
    name = 'toscrape_com'

    custom_settings = {
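        # Allow Scrapy to send up to 15 requests in parallel.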
        'CONCURRENT_REQUESTS': 15,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 15,
    }

    def start_requests(self):
        for url in urls:
            yield Request(
                url,
                meta={
                    'zyte_api': {
                        'httpResponseBody': True,
                    },
                },
            )

    def parse(self, response):
        yield {
            "url": response.url,
            "httpResponseBody": b64encode(response.body).decode(),
        }

Throttling#

Increased concurrency will not always lead to increased throughput, for the reasons described below.

Per-user RPS limit#

By default, there is a 2 RPS limit associated with each API key (no more than 120 responses per minute).

Note

We can increase the limit on a case-by-case basis; please open a support ticket if your use case requires a higher limit.

When you hit the RPS limit, Zyte Data API starts returning HTTP 429 errors instead of accepting more requests: requests over the limit are rejected with an HTTP 429 error, while requests under the limit are accepted and processed as usual.

You are expected to handle these HTTP 429 errors, retrying them indefinitely with an exponential backoff algorithm (that is, increasing the time between retries). The official python-zyte-api client does this by default.
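
For example, if you build a client on top of requests, a minimal retry loop with exponential backoff could look like the following sketch (the attempt cap and delays are illustrative, not official recommendations):

import time

import requests


def extract_with_retries(query, api_key, max_attempts=8):
    # Retry HTTP 429 responses, doubling the delay after each attempt.
    for attempt in range(max_attempts):
        response = requests.post(
            'https://api.zyte.com/v1/extract',
            auth=(api_key, ''),
            json=query,
        )
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)  # 1 s, 2 s, 4 s, ...
    return response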

It is good practice to configure your client so that it does not receive too many 429 errors, by choosing appropriate concurrency options: reduce the number of connections if you are getting a lot of 429 errors, or slow down your client in some other way. However, a small percentage of 429 errors is normal and expected if you want to get close to the limits of your API key.

Per-website RPS limit#

Zyte Data API aims to be polite to websites: it tries hard not to cause them any issues or overload. Because of that, if too many requests are sent to a single website, Zyte Data API might start throttling those requests, even if your per-user rate limit has not been hit.

When a per-website limit is hit, the API also returns an HTTP 429 error, with the "/limits/over-domain-limit" value in the "type" field of the response body. Handle it in the same way as other HTTP 429 errors: retry with an exponential backoff algorithm.
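
For example, assuming the error response body is JSON with the "type" field described above, you could tell per-website throttling apart from other 429 errors like this (a sketch, not a feature of any official client):

if response.status_code == 429:
    error = response.json()
    if error.get('type') == '/limits/over-domain-limit':
        pass  # Per-website limit: back off on this website.
    else:
        pass  # Other 429 errors, such as the per-user RPS limit.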

Other limits#

When Zyte Data API is overloaded overall, it may return HTTP 503 errors to temporarily slow down some users during load peaks. Such errors should be retried after a delay.