API Usage

Overview

This document contains general information about the Zyte Data HTTP API. You can find the full API specification here.

Zyte Data API is an HTTP API that uses the JSON format for requests and responses. A client sends a POST request with a JSON-encoded body to one of the API’s endpoints, and receives JSON-encoded data in the response.

Authorisation

Zyte Data API uses HTTP Basic Auth. Use the API key you received after signing up as the user name, and leave the password empty.

For example, the standard way to set up Basic Auth in curl is to pass the --user user:pass argument. With Zyte Data API the password is empty, so pass --user YOUR_API_KEY: (note the trailing colon, followed by nothing):

curl \
   --user YOUR_API_KEY: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://example.com/foo/bar", "browserHtml": true}' \
   https://api.zyte.com/v1/extract
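
The same request can be sketched in Python using only the standard library. This is a minimal illustration rather than an official client: the helper names build_request and extract are hypothetical, while the endpoint, headers and request body are taken from the curl example above.

```python
import base64
import json
import urllib.request

API_URL = "https://api.zyte.com/v1/extract"  # endpoint from the curl example


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request with a JSON body and Basic Auth (empty password)."""
    # Basic Auth encodes "user:password". The password is empty, so the
    # credentials string is the API key followed by a single colon.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract(api_key: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (hypothetical helper)."""
    with urllib.request.urlopen(build_request(api_key, payload)) as response:
        return json.loads(response.read())
```

Calling extract("YOUR_API_KEY", {"url": "https://example.com/foo/bar", "browserHtml": True}) would then send the same request as the curl command.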

Getting high throughput from Zyte Data API

A single request to Zyte Data API can take tens of seconds to process. Response time depends on the target website and on the task performed (the API features used). For example, with the browserHtml feature it is common to get a response in 10 to 30 seconds.

This means that if requests are sent sequentially, throughput can be quite low: a few responses per minute.

To speed up the processing (increase the throughput), send many requests in parallel, instead of sending them sequentially.

Note

Example: if the average response time for your website is 15 seconds, and you want to achieve 1 RPS (1 response per second), you should send 15 requests in parallel.
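
The arithmetic behind this note is throughput multiplied by response time. A small sketch, with the function name required_concurrency chosen here purely for illustration:

```python
import math


def required_concurrency(target_rps: float, avg_response_seconds: float) -> int:
    """Number of parallel requests needed to sustain target_rps on average."""
    # Each connection completes 1 / avg_response_seconds requests per second,
    # so the needed connection count is throughput multiplied by latency.
    return math.ceil(target_rps * avg_response_seconds)


print(required_concurrency(1, 15))  # the example from the note: prints 15
```

This is an average-case estimate; real response times vary, so the throughput you observe will fluctuate around the target.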

An easy way to do this is to use the command-line interface provided by the python-zyte-api client library. An example using 25 connections:

python -m zyte_api --n-conn 25 urls.txt --output res.jl

python-zyte-api also provides an asyncio-based implementation which you can use from Python.

If you’re writing your own client, you can use threads, multiple processes or an event loop to send many HTTP requests in parallel.
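
As a rough sketch of the thread-based option, the standard library's ThreadPoolExecutor can keep a fixed number of requests in flight. The names fetch_all and fetch are hypothetical; fetch stands for whatever function in your client sends one request to the API and returns its result.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    n_conn: int = 25,
) -> List[T]:
    """Call fetch(url) for every URL, keeping up to n_conn calls in flight."""
    with ThreadPoolExecutor(max_workers=n_conn) as pool:
        # pool.map runs the calls concurrently and yields results in input order.
        return list(pool.map(fetch, urls))
```

Threads suit this workload because each worker spends almost all of its time waiting for the HTTP response. An event loop would achieve the same effect with fewer system resources.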

Throttling

Increased concurrency won’t always lead to increased throughput. The reasons for that are described below.

Per-user RPS limit

By default, each API key has a limit of 2 RPS (no more than 120 responses per minute).

Note

We can increase the limit on a case-by-case basis; please open a support ticket if your use case requires a higher limit.

When the client hits the RPS limit, Zyte Data API rejects the requests over the limit with an HTTP 429 error instead of accepting them. Requests under the limit are accepted and processed as usual.

Clients are expected to handle these HTTP 429 errors by retrying them indefinitely with exponential back-off (i.e. increasing the time between retries). The official python-zyte-api client does this by default.
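
For a custom client, the retry policy could look roughly like the sketch below. It is not the python-zyte-api implementation: the name send_with_backoff, the delay values, and the send callable (assumed to perform one request and return the status code and body) are all illustrative assumptions.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, bytes]  # (HTTP status code, response body)


def send_with_backoff(
    send: Callable[[], Response],
    base_delay: float = 1.0,   # assumed starting delay, in seconds
    max_delay: float = 60.0,   # assumed cap on the delay, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a status other than 429."""
    delay = base_delay
    while True:
        status, body = send()
        if status != 429:
            return status, body
        sleep(delay)
        # Double the wait after every rejected attempt, up to max_delay.
        delay = min(delay * 2, max_delay)
```

Capping the delay keeps an indefinitely retrying client from waiting unreasonably long between attempts. Clients that run many connections often also add random jitter to the delay so that retries do not arrive in bursts; that is omitted here for brevity.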

It is good practice to configure your client so that it does not get too many 429 errors: if you’re getting a lot of them, reduce the number of connections or slow down your client in some other way. However, getting a small percentage of 429 errors is normal and expected if you want to get close to the limits of your API key.

Per-website RPS limit

Zyte Data API aims to be polite to websites: it tries hard not to overload them or cause other issues. This means that if too many requests are sent to a single website, Zyte Data API might start throttling, even if the per-user rate limit has not been hit.

When a per-website limit is hit, the API also returns an HTTP 429 error, with the value “/limits/over-domain-limit” in the “type” field. It should be handled in the same way as other HTTP 429 errors: retried with exponential back-off (increasing the time between retries).
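
If your client logs or counts the two kinds of 429 errors separately, it can inspect the “type” field. The sketch below assumes the error body is a JSON object with “type” at the top level, which the text above implies but does not spell out; the function name is hypothetical.

```python
import json

DOMAIN_LIMIT_TYPE = "/limits/over-domain-limit"  # value quoted in the text above


def is_domain_limit_error(status: int, body: bytes) -> bool:
    """Return True for a 429 response whose "type" marks a per-website limit."""
    if status != 429:
        return False
    try:
        error = json.loads(body)
    except ValueError:
        return False  # body is not valid JSON, so the type is unknown
    # Assumption: the error is a JSON object with a top-level "type" field.
    return isinstance(error, dict) and error.get("type") == DOMAIN_LIMIT_TYPE
```

Telling the two cases apart helps with tuning: frequent per-website errors suggest spreading requests across more websites, whereas per-user errors suggest lowering overall concurrency.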

Other limits

When Zyte Data API as a whole is overloaded, it may return HTTP 503 errors. Such errors should be retried after a delay. Zyte Data API does this to temporarily slow down some clients during peak load.