Zyte Data API general usage#
This page contains general information about Zyte Data API usage.
Code examples#
The code examples in Zyte Data API usage documentation feature the following software choices:
CLI: common command-line interface applications, including base64, curl, jq, xargs, and xmllint.
CLI (best): command-line interface usage of python-zyte-api, the recommended command-line interface client. xmllint for HTML parsing.
Python: Requests for single requests, aiohttp for concurrent requests, Parsel for HTML parsing.
Python (best): asyncio API usage of python-zyte-api, the recommended Python client. Parsel for HTML parsing.
Scrapy: Scrapy with the scrapy-zyte-api plugin.
Each code example assumes that you have installed and configured the corresponding software. See the official documentation of each software, linked from the list above, for installation and configuration instructions.
Getting high throughput#
A single request to Zyte Data API can take tens of seconds to process.
The response time depends on the target website and on the task performed
(API features used). For example, if you use browserHtml
feature,
it is common to get a response in 10…30 seconds.
It means that, if requests are sent sequentially, the throughput could be quite low - a few responses per minute.
To speed up the processing (increase the throughput), send many requests in parallel, instead of sending them sequentially.
For example, if the average response time for your website is 15 seconds, and you want to achieve 1 RPS (1 response per second) speed, you should be sending 15 requests in parallel.
{"url": "https://books.toscrape.com/catalogue/page-1.html", "httpResponseBody": true}
{"url": "https://books.toscrape.com/catalogue/page-2.html", "httpResponseBody": true}
cat input.jsonl \
| xargs -P 15 -d\\n -n 1 \
bash -c "
curl \
--user YOUR_API_KEY: \
--header 'Content-Type: application/json' \
--data \"\$0\" \
https://api.zyte.com/v1/extract \
| awk '{print \$1}' \
>> output.jsonl
"
{"url": "https://books.toscrape.com/catalogue/page-1.html", "httpResponseBody": true}
{"url": "https://books.toscrape.com/catalogue/page-2.html", "httpResponseBody": true}
zyte-api --n-conn 15 input.jsonl -o output.jsonl
import asyncio
import aiohttp
urls = [
'https://books.toscrape.com/catalogue/page-1.html',
'https://books.toscrape.com/catalogue/page-2.html',
]
output = []
async def extract(client, url):
response = await client.post(
'https://api.zyte.com/v1/extract',
json={'url': url, 'httpResponseBody': True},
auth=aiohttp.BasicAuth('YOUR_API_KEY'),
)
output.append(await response.json())
async def main():
connector = aiohttp.TCPConnector(limit_per_host=15)
async with aiohttp.ClientSession(connector=connector) as client:
await asyncio.gather(*[extract(client, url) for url in urls])
asyncio.run(main())
import asyncio
from zyte_api.aio.client import AsyncClient, create_session
urls = [
'https://books.toscrape.com/catalogue/page-1.html',
'https://books.toscrape.com/catalogue/page-2.html',
]
output = []
async def main():
connection_count = 15
client = AsyncClient(n_conn=connection_count)
requests = [{'url': url, 'httpResponseBody': True} for url in urls]
async with create_session(connection_count) as session:
responses = client.request_parallel_as_completed(
requests,
session=session,
)
for response in responses:
output.append(await response)
asyncio.run(main())
import json
from base64 import b64encode
from scrapy import Request, Spider
urls = [
'https://books.toscrape.com/catalogue/page-1.html',
'https://books.toscrape.com/catalogue/page-2.html',
]
class ToScrapeSpider(Spider):
name = 'toscrape_com'
custom_settings = {
'CONCURRENT_REQUESTS': 15,
'CONCURRENT_REQUESTS_PER_DOMAIN': 15,
}
def start_requests(self):
for url in urls:
yield Request(
url,
meta={
'zyte_api': {
'httpResponseBody': True,
},
},
)
def parse(self, response):
yield {
"url": response.url,
"httpResponseBody": b64encode(response.body).decode(),
}
Throttling#
Increased concurrency won’t always lead to increased throughput. The reasons for that are described below.
Per-user RPS limit#
By default, there is a 2 RPS limit associated with each API key (no more than 120 responses per minute).
Note
We can increase the limit on a case-by-case basis; please open a support ticket if your use case requires a higher limit.
When you hit the RPS limit, Zyte Data API starts returning HTTP 429 errors instead of accepting more requests. Requests over the limit are rejected with HTTP 429 error; requests which are under the limit are accepted and processed as usual.
You are expected to handle these HTTP 429 errors, retrying them indefinitely with an exponential backoff algorithm (i.e. increasing the time between retries). The official python-zyte-api client does it by default.
It is a good practice to configure you client not to get too many 429 errors, by having proper concurrency options: reduce the amount of connections if you’re getting a lot of 429 errors, or slow down your client in some other way. However, getting a small percent of 429 errors is normal and expected if you want to get close to the limits of your API key.
Per-website RPS limit#
Zyte Data API aims to be polite to websites - it tries hard not to cause any issues or overload. Because of that, if too many requests are sent to a single website, Zyte Data API might start throttling those requests, even if your per-user rate limit is not hit.
When a per-website limit is hit, the API also returns an HTTP 429 error,
with the "/limits/over-domain-limit"
value in the "type"
field.
It should be handled in the same way as other HTTP 429 errors: retried
with an exponential backoff algorithm (with increased time between retries).
Other limits#
When Zyte Data API is overloaded overall, it may return HTTP 503 errors. Such errors should be retried after a delay. Zyte Data API does this to temporarily slow down some users during load peaks.