Zyte Data API data extraction#

To extract data from a URL, send a POST request to the https://api.zyte.com/v1/extract endpoint of Zyte Data API with a JSON object as the request body, with its url key set to the target URL.

Zyte Data API can extract the following data from a URL:

  • Browser HTML

  • A response body

  • Response headers

Note

At the moment you cannot extract both browser HTML and a response body from the same Zyte Data API request.

You can also customize the country of origin used during data extraction.

Extract browser HTML#

Zyte Data API can render a URL in a web browser and return an HTML representation of its Document Object Model (DOM).

Extracting browser HTML allows you to:

  • Extract dynamically-loaded webpage content without spending time recreating what the browser does through JavaScript and additional requests.

  • Emulate user interaction through browser actions.

To extract browser HTML, set the browserHtml key in your API request body to true.

The browserHtml key of the response JSON object is the browser HTML as a string.

input.json#
  {"url": "https://toscrape.com", "browserHtml": true}
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .browserHtml
input.jsonl#
  {"url": "https://toscrape.com", "browserHtml": true}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .browserHtml
import requests

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://toscrape.com',
        'browserHtml': True,
    },
)
browser_html: str = api_response.json()['browserHtml']
import asyncio

from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://toscrape.com',
            'browserHtml': True,
        }
    )
    browser_html: str = api_response['browserHtml']

asyncio.run(main())
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = 'toscrape_com'

    def start_requests(self):
        yield Request(
            'https://toscrape.com',
            meta={
                'zyte_api': {
                    'browserHtml': True,
                },
            },
        )

    def parse(self, response):
        browser_html: str = response.text

If you need non-HTML content, non-GET requests, or request headers beyond those supported for browser HTML extraction, extract a response body instead.

Set browser actions#

When extracting browser HTML, you can use the actions key in your API request body to define a sequence of actions to perform during browser rendering, and hence modify the DOM before browser HTML is generated for you.

Browser actions allow you to:

  • Type text into input fields.

  • Perform cursor actions (click, hover, scroll).

  • Wait for certain events or for a given time.

input.json#
  {
      "url": "https://quotes.toscrape.com/scroll",
      "browserHtml": true,
      "actions": [
          {
              "action": "scrollBottom"
          }
      ]
  }
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .browserHtml \
| xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null
input.jsonl#
  {"url":"https://quotes.toscrape.com/scroll","browserHtml":true,"actions":[{"action":"scrollBottom"}]}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .browserHtml \
| xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null
import requests
from parsel import Selector

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://quotes.toscrape.com/scroll',
        'browserHtml': True,
        'actions': [
            {
                'action': 'scrollBottom',
            },
        ],
    },
)
browser_html = api_response.json()['browserHtml']
quote_count = len(Selector(browser_html).css('.quote'))
import asyncio

from parsel import Selector
from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://quotes.toscrape.com/scroll',
            'browserHtml': True,
            'actions': [
                {
                    'action': 'scrollBottom',
                },
            ],
        },
    )
    browser_html = api_response['browserHtml']
    quote_count = len(Selector(browser_html).css('.quote'))

asyncio.run(main())
from scrapy import Request, Spider


class QuotesToScrapeComSpider(Spider):
    name = 'quotes_toscrape_com'

    def start_requests(self):
        yield Request(
            'https://quotes.toscrape.com/scroll',
            meta={
                'zyte_api': {
                    'browserHtml': True,
                    'actions': [
                        {
                            'action': 'scrollBottom',
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        quote_count = len(response.css('.quote'))

Look up actions in the specification for the complete actions API.

Set request headers#

When extracting browser HTML, you can set the requestHeaders key in your API request body to an object where keys are camelCase header names and values are header values, representing headers to include in your request.

input.json#
  {
      "url": "https://httpbin.org/anything",
      "browserHtml": true,
      "requestHeaders": {
          "referer": "https://example.org/"
      }
  }
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .browserHtml \
| xmllint --html --xpath '//text()' - 2> /dev/null \
| jq .headers
input.jsonl#
  {"url": "https://httpbin.org/anything", "browserHtml": true, "requestHeaders": {"referer": "https://example.org/"}}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .browserHtml \
| xmllint --html --xpath '//text()' - 2> /dev/null \
| jq .headers
import json

import requests
from parsel import Selector

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://httpbin.org/anything',
        'browserHtml': True,
        'requestHeaders': {
            'referer': 'https://example.org/',
        },
    },
)
browser_html = api_response.json()['browserHtml']
selector = Selector(browser_html)
response_json = selector.xpath('//text()').get()
response_data = json.loads(response_json)
headers = response_data['headers']
import asyncio
import json

from parsel import Selector
from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://httpbin.org/anything',
            'browserHtml': True,
            'requestHeaders': {
                'referer': 'https://example.org/',
            },
        }
    )
    browser_html = api_response['browserHtml']
    selector = Selector(browser_html)
    response_json = selector.xpath('//text()').get()
    response_data = json.loads(response_json)
    headers = response_data['headers']

asyncio.run(main())
import json

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = 'httpbin_org'

    def start_requests(self):
        yield Request(
            'https://httpbin.org/anything',
            meta={
                'zyte_api': {
                    'browserHtml': True,
                    'requestHeaders': {
                        'referer': 'https://example.org/',
                    },
                },
            },
        )

    def parse(self, response):
        response_json = response.xpath('//text()').get()
        response_data = json.loads(response_json)
        headers = response_data['headers']

At the moment, only the Referer header can be overridden.

Enable or disable JavaScript#

When extracting browser HTML, JavaScript execution is enabled by default for most websites.

For some websites, however, JavaScript execution is disabled by default, because disabling it improves data extraction.

You can set the javascript key in your API request body to true or false to force JavaScript execution on or off, regardless of the default value for the target website.

input.json#
 {
     "url": "https://www.whatismybrowser.com/detect/is-javascript-enabled",
     "browserHtml": true,
     "javascript": false
 }
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .browserHtml \
| xmllint --html --xpath '//*[@id="detected_value"]/text()' - 2> /dev/null
input.jsonl#
 {"url": "https://www.whatismybrowser.com/detect/is-javascript-enabled", "browserHtml": true, "javascript": false}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .browserHtml \
| xmllint --html --xpath '//*[@id="detected_value"]/text()' - 2> /dev/null
import requests
from parsel import Selector

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://www.whatismybrowser.com/detect/is-javascript-enabled',
        'browserHtml': True,
        'javascript': False,
    },
)
browser_html = api_response.json()['browserHtml']
selector = Selector(browser_html)
is_javascript_enabled: str = selector.css('#detected_value::text').get()
import asyncio

from parsel import Selector
from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://www.whatismybrowser.com/detect/is-javascript-enabled',
            'browserHtml': True,
            'javascript': False,
        }
    )
    browser_html = api_response['browserHtml']
    selector = Selector(browser_html)
    is_javascript_enabled: str = selector.css('#detected_value::text').get()

asyncio.run(main())
from scrapy import Request, Spider


class WhatIsMyBrowserComSpider(Spider):
    name = 'whatismybrowser_com'

    def start_requests(self):
        yield Request(
            'https://www.whatismybrowser.com/detect/is-javascript-enabled',
            meta={
                'zyte_api': {
                    'browserHtml': True,
                    'javascript': False,
                },
            },
        )

    def parse(self, response):
        is_javascript_enabled: str = response.css('#detected_value::text').get()

Extract a response body#

Extracting a response body allows you to:

  • Get faster response times.

  • Reduce costs.

  • Download non-HTML content (e.g. JSON, XML), including binary content (e.g. images).

  • Set a request method, body, and arbitrary headers.

To extract a response body, set the httpResponseBody key in your API request body to true.

The httpResponseBody key of the response JSON object is the Base64-encoded response body.

input.json#
  {"url": "https://toscrape.com", "httpResponseBody": true}
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .httpResponseBody \
| base64 --decode \
> output.unknown
input.jsonl#
  {"url": "https://toscrape.com", "httpResponseBody": true}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .httpResponseBody \
| base64 --decode \
> output.unknown
from base64 import b64decode

import requests

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://toscrape.com',
        'httpResponseBody': True,
    },
)
http_response_body: bytes = b64decode(
    api_response.json()['httpResponseBody']
)
import asyncio
from base64 import b64decode

from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://toscrape.com',
            'httpResponseBody': True,
        }
    )
    http_response_body: bytes = b64decode(
        api_response['httpResponseBody']
    )

asyncio.run(main())
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = 'toscrape_com'

    def start_requests(self):
        yield Request(
            'https://toscrape.com',
            meta={
                'zyte_api': {
                    'httpResponseBody': True,
                },
            },
        )

    def parse(self, response):
        http_response_body: bytes = response.body

If your response body is HTML, see Decode HTML.

Set a request method#

Response body extraction uses a GET request by default.

Use the httpRequestMethod key in your API request body to switch the request method to a different value: POST, PUT, DELETE, OPTIONS, TRACE, PATCH.

input.json#
  {"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST"}
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq .method
input.jsonl#
  {"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST"}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq .method
import json
from base64 import b64decode

import requests

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://httpbin.org/anything',
        'httpResponseBody': True,
        'httpRequestMethod': 'POST',
    },
)
http_response_body = b64decode(
    api_response.json()['httpResponseBody']
)
method = json.loads(http_response_body)['method']
import asyncio
import json
from base64 import b64decode

from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://httpbin.org/anything',
            'httpResponseBody': True,
            'httpRequestMethod': 'POST',
        }
    )
    http_response_body: bytes = b64decode(
        api_response['httpResponseBody']
    )
    method = json.loads(http_response_body)['method']

asyncio.run(main())
import json

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = 'httpbin_org'

    def start_requests(self):
        yield Request(
            'https://httpbin.org/anything',
            meta={
                'zyte_api': {
                    'httpResponseBody': True,
                    'httpRequestMethod': 'POST',
                },
            },
        )

    def parse(self, response):
        method = json.loads(response.body)['method']

The HEAD and CONNECT request methods are not supported.

Set a request body#

If you use a different request method, you may also need to set a request body.

Use the httpRequestBody key in the body of your API request to set a body for the extraction request. The value must be a Base64-encoded representation of the body bytes.

input.json#
  {
      "url": "https://httpbin.org/anything",
      "httpResponseBody": true,
      "httpRequestMethod": "POST",
      "httpRequestBody": "Zm9vCg=="
  }
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq --raw-output .data
input.jsonl#
  {"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestBody": "Zm9vCg=="}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq --raw-output .data
import json
from base64 import b64decode, b64encode

import requests

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://httpbin.org/anything',
        'httpResponseBody': True,
        'httpRequestMethod': 'POST',
        'httpRequestBody': b64encode(b'foo').decode(),
    },
)
http_response_body = b64decode(
    api_response.json()['httpResponseBody']
)
request_body: str = json.loads(http_response_body)['data']
import asyncio
import json
from base64 import b64decode, b64encode

from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://httpbin.org/anything',
            'httpResponseBody': True,
            'httpRequestMethod': 'POST',
            'httpRequestBody': b64encode(b'foo').decode(),
        }
    )
    http_response_body: bytes = b64decode(
        api_response['httpResponseBody']
    )
    request_body: str = json.loads(http_response_body)['data']

asyncio.run(main())
import json
from base64 import b64encode

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = 'httpbin_org'

    def start_requests(self):
        yield Request(
            'https://httpbin.org/anything',
            meta={
                'zyte_api': {
                    'httpResponseBody': True,
                    'httpRequestMethod': 'POST',
                    'httpRequestBody': b64encode(b'foo').decode(),
                },
            },
            dont_filter=True,
        )

    def parse(self, response):
        request_body = json.loads(response.body)['data']

Set request headers#

When extracting a response body, you can set the customHttpRequestHeaders key in your API request body to an array of objects with name and value keys representing headers to include in your request.

input.json#
  {
      "url": "https://httpbin.org/anything",
      "httpResponseBody": true,
      "customHttpRequestHeaders": [
          {
              "name": "Accept-Language",
              "value": "fa"
          }
      ]
  }
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq .headers
input.jsonl#
  {"url": "https://httpbin.org/anything", "httpResponseBody": true, "customHttpRequestHeaders": [{"name": "Accept-Language", "value": "fa"}]}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq .headers
import json
from base64 import b64decode

import requests

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://httpbin.org/anything',
        'httpResponseBody': True,
        'customHttpRequestHeaders': [
            {
                'name': 'Accept-Language',
                'value': 'fa',
            },
        ],
    },
)
http_response_body = b64decode(
    api_response.json()['httpResponseBody']
)
headers = json.loads(http_response_body)['headers']
import asyncio
import json
from base64 import b64decode

from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://httpbin.org/anything',
            'httpResponseBody': True,
            'customHttpRequestHeaders': [
                {
                    'name': 'Accept-Language',
                    'value': 'fa',
                },
            ],
        }
    )
    http_response_body: bytes = b64decode(
        api_response['httpResponseBody']
    )
    headers = json.loads(http_response_body)['headers']

asyncio.run(main())
import json

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = 'httpbin_org'

    def start_requests(self):
        yield Request(
            'https://httpbin.org/anything',
            meta={
                'zyte_api': {
                    'httpResponseBody': True,
                    'customHttpRequestHeaders': [
                        {
                            'name': 'Accept-Language',
                            'value': 'fa',
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        headers = json.loads(response.body)['headers']

Zyte Data API sends some headers automatically. In case of conflict, your custom headers will usually override Zyte Data API headers. However, Zyte Data API may silently override or drop some of your custom headers to reduce the chance of your request being banned. For example, you can never set custom Cookie or User-Agent headers.

If you set multiple headers with the same name, only the last header value will be sent. To overcome this limitation, join the header values with a comma into a single header value. For example, replace "customHttpRequestHeaders": [{"name": "foo", "value": "bar"}, {"name": "foo", "value": "baz"}] with "customHttpRequestHeaders": [{"name": "foo", "value": "bar,baz"}].
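The comma-joining workaround can be automated before building the API request. A minimal sketch (the merge_headers helper name is my own, not part of any Zyte library):

```python
def merge_headers(headers):
    """Collapse same-named headers into a single entry per name,
    joining values with commas and preserving first-seen order."""
    merged = {}
    for header in headers:
        name, value = header['name'], header['value']
        merged[name] = f'{merged[name]},{value}' if name in merged else value
    return [{'name': name, 'value': value} for name, value in merged.items()]
```

The result can be passed directly as the customHttpRequestHeaders value.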

Decode HTML#

While browser HTML is provided pre-decoded, as a string, HTML extracted as a response body needs to be decoded.

HTML content can use any of many character encodings, and you must determine which encoding was used so that you can decode the content correctly.

The best way to determine the encoding of HTML content is to follow the encoding sniffing algorithm defined in the HTML standard.

In addition to the HTML content, the encoding sniffing algorithm takes into account any character encoding declared in the optional charset parameter of the Content-Type response header, so make sure you also extract the response headers if you follow this algorithm.
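As an illustration, a heavily simplified version of such encoding detection can be sketched as follows. This is not a full implementation of the standard algorithm; it only checks a byte-order mark, the Content-Type charset parameter, and a meta charset declaration, falling back to UTF-8 (the sniff_encoding name is my own):

```python
import re


def sniff_encoding(body: bytes, content_type: str = '') -> str:
    """Guess the character encoding of an HTML response body.

    Simplified sketch: BOM, then Content-Type charset, then
    <meta charset>, then a UTF-8 fallback. The algorithm in the
    HTML standard has many more steps.
    """
    # 1. Byte-order mark
    if body.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if body.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    if body.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    # 2. charset parameter of the Content-Type header
    match = re.search(r'charset=([^;\s]+)', content_type, re.I)
    if match:
        return match.group(1).strip('"\'').lower()
    # 3. <meta ... charset=...> near the start of the document
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', body[:1024], re.I)
    if match:
        return match.group(1).decode('ascii').lower()
    # 4. Fallback
    return 'utf-8'
```

For production use, prefer a library that implements the full algorithm, such as the web-poet wrapper shown below.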

Use file to find the character encoding of a previously-downloaded response based solely on its body (i.e. without following the HTML encoding sniffing algorithm).

file --mime-encoding output.unknown

web-poet provides a response wrapper that automatically decodes the response body following an encoding sniffing algorithm similar to the one defined in the HTML standard.

Provided that you have extracted a response with both body and headers, and you have Base64-decoded the response body, you can decode the HTML bytes as follows:

from web_poet import HttpResponse

...

headers = tuple(
    (item['name'], item['value'])
    for item in http_response_headers
)
response = HttpResponse(
    url='https://example.com',
    body=http_response_body,
    status=200,
    headers=headers,
)
html = response.text

With scrapy-zyte-api, if you extract response headers as well, HTML responses are decoded automatically.

from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = 'toscrape_com'

    def start_requests(self):
        yield Request(
            'https://toscrape.com',
            meta={
                'zyte_api': {
                    'httpResponseBody': True,
                    'httpResponseHeaders': True,
                },
            },
        )

    def parse(self, response):
        html: str = response.text

Extract response headers#

When extracting browser HTML or a response body, set the httpResponseHeaders key in your API request body to true to also extract response headers.

When you do, the Zyte Data API response includes an httpResponseHeaders key with the headers as an array of objects with name and value keys.

input.json#
  {"url": "https://toscrape.com", "browserHtml": true, "httpResponseHeaders": true}
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq .httpResponseHeaders
input.jsonl#
  {"url": "https://toscrape.com", "browserHtml": true, "httpResponseHeaders": true}
zyte-api input.jsonl 2> /dev/null \
| jq .httpResponseHeaders
import requests

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'https://toscrape.com',
        'browserHtml': True,
        'httpResponseHeaders': True,
    },
)
http_response_headers = api_response.json()['httpResponseHeaders']
import asyncio

from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'https://toscrape.com',
            'browserHtml': True,
            'httpResponseHeaders': True,
        }
    )
    http_response_headers = api_response['httpResponseHeaders']

asyncio.run(main())
from scrapy import Request, Spider


class ToScrapeComSpider(Spider):
    name = 'toscrape_com'

    def start_requests(self):
        yield Request(
            'https://toscrape.com',
            meta={
                'zyte_api': {
                    'browserHtml': True,
                    'httpResponseHeaders': True,
                },
            },
        )

    def parse(self, response):
        headers = response.headers

Zyte Data API may exclude some headers from the result, such as Set-Cookie.
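Because httpResponseHeaders is an array of name/value objects, it can be convenient to convert it into a mapping with case-insensitive lookups. A minimal sketch (the headers_to_dict name is my own):

```python
def headers_to_dict(http_response_headers):
    """Map lowercase header names to values. Same-named headers,
    which HTTP allows, are joined with a comma."""
    headers = {}
    for header in http_response_headers:
        name = header['name'].lower()
        value = header['value']
        headers[name] = f'{headers[name]},{value}' if name in headers else value
    return headers
```

For example, headers_to_dict(api_response['httpResponseHeaders']).get('content-type') looks up the Content-Type header regardless of its capitalization.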

Set a country of origin#

Set the geolocation key in your API request body to a supported ISO 3166-1 alpha-2 country code to channel your request through an IP address associated with the corresponding country.

input.json#
  {"url": "http://ip-api.com/json", "httpResponseBody": true, "geolocation": "AU"}
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    https://api.zyte.com/v1/extract \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq .countryCode
input.jsonl#
  {"url": "http://ip-api.com/json", "httpResponseBody": true, "geolocation": "AU"}
zyte-api input.jsonl 2> /dev/null \
| jq --raw-output .httpResponseBody \
| base64 --decode \
| jq .countryCode
import json
from base64 import b64decode

import requests

api_response = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json={
        'url': 'http://ip-api.com/json',
        'httpResponseBody': True,
        'geolocation': 'AU',
    },
)
http_response_body = b64decode(
    api_response.json()['httpResponseBody']
)
country_code = json.loads(http_response_body)['countryCode']
import asyncio
import json
from base64 import b64decode

from zyte_api.aio.client import AsyncClient

async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            'url': 'http://ip-api.com/json',
            'httpResponseBody': True,
            'geolocation': 'AU',
        }
    )
    http_response_body: bytes = b64decode(
        api_response['httpResponseBody']
    )
    country_code = json.loads(http_response_body)['countryCode']

asyncio.run(main())
import json

from scrapy import Request, Spider


class IPAPIComSpider(Spider):
    name = 'ip_api_com'

    def start_requests(self):
        yield Request(
            'http://ip-api.com/json',
            meta={
                'zyte_api': {
                    'httpResponseBody': True,
                    'geolocation': 'AU',
                },
            },
        )

    def parse(self, response):
        country_code = json.loads(response.body)['countryCode']

Look up the geolocation key in the specification for the list of supported countries.

When the geolocation key is not specified, Zyte Data API aims to channel your request through a country that ensures a good response from the target website, meaning that the chosen country:

  • Does not cause unexpected locale changes in the response data, such as the wrong language, currency, date format, time zone, etc.

  • Does not cause your request to be banned.

Set request metadata#

Set the echoData key in your API request body to an arbitrary value, to get that value verbatim in the API response.

When sending multiple requests in parallel, this can be useful, for example, to keep track of the original request order.
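For example, once all parallel responses have been collected, their echoData values can restore the original request order. A minimal sketch, assuming api_responses is a list of parsed Zyte Data API response objects and each request set echoData to its original index:

```python
def sort_by_echo_data(api_responses):
    """Return API responses sorted by the index sent in echoData."""
    return sorted(api_responses, key=lambda response: response['echoData'])
```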

input.jsonl#
  {"url": "https://toscrape.com", "browserHtml": true, "echoData": 1}
  {"url": "https://books.toscrape.com", "browserHtml": true, "echoData": 2}
  {"url": "https://quotes.toscrape.com", "browserHtml": true, "echoData": 3}
cat input.jsonl \
| xargs -P 15 -d\\n -n 1 \
bash -c "
    curl \
        --user $ZYTE_API_KEY: \
        --header 'Content-Type: application/json' \
        --data \"\$0\" \
        https://api.zyte.com/v1/extract \
    | jq .echoData \
    | awk '{print \$1}' \
    >> output.jsonl
"
input.jsonl#
  {"url": "https://toscrape.com", "browserHtml": true, "echoData": 1}
  {"url": "https://books.toscrape.com", "browserHtml": true, "echoData": 2}
  {"url": "https://quotes.toscrape.com", "browserHtml": true, "echoData": 3}
zyte-api --n-conn 15 input.jsonl -o output.jsonl
import asyncio

import aiohttp

input_data = [
    ('https://toscrape.com', 1),
    ('https://books.toscrape.com', 2),
    ('https://quotes.toscrape.com', 3),
]
output = []


async def extract(client, url, index):
    response = await client.post(
        'https://api.zyte.com/v1/extract',
        json={'url': url, 'browserHtml': True, 'echoData': index},
        auth=aiohttp.BasicAuth('YOUR_API_KEY'),
    )
    output.append(await response.json())


async def main():
    connector = aiohttp.TCPConnector(limit_per_host=15)
    async with aiohttp.ClientSession(connector=connector) as client:
        await asyncio.gather(
            *[extract(client, url, index) for url, index in input_data]
        )

asyncio.run(main())
import asyncio

from zyte_api.aio.client import AsyncClient, create_session

input_data = [
    ('https://toscrape.com', 1),
    ('https://books.toscrape.com', 2),
    ('https://quotes.toscrape.com', 3),
]
output = []


async def main():
    connection_count = 15
    client = AsyncClient(n_conn=connection_count)
    requests = [
        {'url': url, 'browserHtml': True, 'echoData': index}
        for url, index in input_data
    ]
    async with create_session(connection_count) as session:
        responses = client.request_parallel_as_completed(
            requests,
            session=session,
        )
        for response in responses:
            output.append(await response)

asyncio.run(main())

from scrapy import Request, Spider

input_data = [
    ('https://toscrape.com', 1),
    ('https://books.toscrape.com', 2),
    ('https://quotes.toscrape.com', 3),
]


class ToScrapeSpider(Spider):
    name = 'toscrape_com'

    custom_settings = {
        'CONCURRENT_REQUESTS': 15,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 15,
    }

    def start_requests(self):
        for url, index in input_data:
            yield Request(
                url,
                meta={
                    'zyte_api': {
                        'browserHtml': True,
                        'echoData': index,
                    },
                },
            )

    def parse(self, response):
        yield {
            'index': response.raw_api_response['echoData'],
            'html': response.text,
        }

Alternatively, you can use Scrapy’s Request.cb_kwargs directly for a similar purpose:

...

    def start_requests(self):
        for url, index in input_data:
            yield Request(
                url,
                cb_kwargs={'index': index},
                meta={
                    'zyte_api': {
                        'browserHtml': True,
                    },
                },
            )

    def parse(self, response, index):
        yield {
            'index': index,
            'html': response.text,
        }

There is another metadata field that you can set and get verbatim on the API response: jobId. When running your requests from a Zyte Scrapy Cloud job, this field is meant to indicate the corresponding job ID. scrapy-zyte-api fills this field automatically.