Zyte API automatic extraction#

To let Zyte API parse the contents of the target URL automatically and give you structured data, enable one of the automatic extraction request fields. Their schemas are covered in the Zyte API reference. Each schema is a subset of the latest version of its matching Zyte Data schema.

Example

Note

Install and configure code example requirements and the Zyte CA certificate to run the example below.

C#:

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"},
    {"product", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var product = data.RootElement.GetProperty("product").ToString();

Console.WriteLine(product);

zyte-api command-line client:

input.jsonl#
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}
zyte-api input.jsonl \
    | jq --raw-output .product

curl:

input.json#
{
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "product": true
}
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .product

Java:

import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url",
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "product",
            true);
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    try (CloseableHttpClient client = HttpClients.createDefault()) {
      try (CloseableHttpResponse response = client.execute(request)) {
        HttpEntity entity = response.getEntity();
        String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
        JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
        String product = jsonObject.get("product").toString();
        System.out.println(product);
      }
    }
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes(StandardCharsets.UTF_8));
    return "Basic " + encodedAuth;
  }
}

JavaScript (axios):

const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    product: true
  },
  {
    auth: { username: 'YOUR_API_KEY' }
  }
).then((response) => {
  const product = response.data.product
  console.log(product)
})

PHP (Guzzle):

<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'product' => true,
    ],
]);
$data = json_decode($response->getBody());
$product = json_encode($data->product);
echo $product.PHP_EOL;

Python (requests):

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": (
            "https://books.toscrape.com/catalogue"
            "/a-light-in-the-attic_1000/index.html"
        ),
        "product": True,
    },
)
product = api_response.json()["product"]
print(product)

Python (zyte_api, asyncio):

import asyncio
import json

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            "product": True,
        }
    )
    product = api_response["product"]
    print(json.dumps(product, indent=2, ensure_ascii=False))


asyncio.run(main())

Scrapy:

from scrapy import Request, Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"

    def start_requests(self):
        yield Request(
            (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            meta={
                "zyte_api_automap": {
                    "product": True,
                },
            },
        )

    def parse(self, response):
        product = response.raw_api_response["product"]
        print(product)

Output (first 5 lines):

{
  "name": "A Light in the Attic",
  "price": "51.77",
  "currency": "GBP",
  "currencyRaw": "£",

Automatic extraction supports geolocation, cookies, sessions, redirection, response headers, and metadata, plus additional features depending on your extraction source.

You can also use our ready-to-use spiders to automate crawling and parsing.

Request fields#

Enable any of the following request fields to get matching structured data:

Note

You can enable only one automatic extraction field per Zyte API request.
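
Because each request can enable only one automatic extraction field, extracting several data types from the same URL takes one request per type. The following is a minimal sketch of that idea, not an official helper: `build_requests` is a hypothetical function, and `article` appears only as an assumed second field name (check the Zyte API reference for the fields that actually exist). Each resulting dictionary could be sent the same way as the request body in the example above.

```python
def build_requests(url, fields):
    """Return one request body per automatic extraction field."""
    # Each body enables exactly one extraction field.
    return [{"url": url, field: True} for field in fields]


bodies = build_requests(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    ["product", "article"],  # "article" is an assumed field name
)
for body in bodies:
    print(body)
```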

Extraction source#

Automatic extraction can be performed using either a browser request or an HTTP request. To choose between them, use the corresponding extractFrom option, e.g. productOptions.extractFrom when extracting a product.

Currently, automatic extraction defaults to using a browser request. In the future, however, the default value may depend on the target website.

Automatic extraction using an HTTP request supports HTTP request attributes for method, body, and headers.

Automatic extraction using a browser request supports browser HTML, screenshots, some request headers, actions, network capture, and toggling JavaScript. The limitations of browser requests also apply in this case.

When deciding whether to use automatic extraction from a browser request or from an HTTP request, consider the following:

  • Extraction using an HTTP request is typically much faster and costs less than extraction from a browser request.

  • For some websites, extraction from an HTTP request produces extremely poor results (such as low probability and missing fields), which often happens when JavaScript execution is required.

  • It is helpful to test both methods and choose extraction from a browser request if it provides better quality.
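
To test both methods on the same page, you can build two request bodies that differ only in productOptions.extractFrom. The sketch below is illustrative: `product_request` is a hypothetical helper, and the values `"httpResponseBody"` and `"browserHtml"` are assumptions that should be confirmed against the Zyte API reference. It only builds the request bodies; sending them works as in the example at the top of this page.

```python
import json

URL = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"


def product_request(url, extract_from=None):
    """Build a product extraction request body, optionally setting extractFrom."""
    body = {"url": url, "product": True}
    if extract_from is not None:
        # productOptions.extractFrom selects the extraction source.
        body["productOptions"] = {"extractFrom": extract_from}
    return body


# Assumed option values; confirm them in the API reference.
http_body = product_request(URL, "httpResponseBody")
browser_body = product_request(URL, "browserHtml")
default_body = product_request(URL)  # omits the option, so the API default applies

print(json.dumps(http_body, indent=2))
```

Sending both bodies and comparing the extracted fields side by side is one way to decide whether the cheaper HTTP-based extraction is good enough for a given website.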

Spiders#

zyte-spider-templates provides spiders that use Zyte API automatic extraction to crawl and parse data from any website of a supported type, e.g. any e-commerce website. Use these spiders however you prefer:

These spiders are spider templates with parameters that support complete customization.

Model pinning#

Zyte API automatic extraction uses AI models that are retrained regularly, usually a few times per year. While new model versions aim to improve overall accuracy, they may become less accurate for specific fields of specific websites.

For certain data types, we provide an option to pin a specific model version, which allows you to postpone an update to the latest model.

To pin a model, use the corresponding model option, e.g. productOptions.model when extracting a product.
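
As a minimal sketch, a request body that pins the product model could look like the following. `pinned_product_request` is a hypothetical helper, and "2024-02-01" is the product model name listed in this section; the resulting dictionary can be sent like the request body in the example at the top of this page.

```python
import json


def pinned_product_request(url, model):
    """Build a product extraction request body pinned to a model version."""
    return {
        "url": url,
        "product": True,
        # productOptions.model pins the extraction model version.
        "productOptions": {"model": model},
    }


body = pinned_product_request(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "2024-02-01",
)
print(json.dumps(body, indent=2))
```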

Model versions remain available for at least 1 year after their release. For example, product model “2024-02-01” will remain available until at least 1 February 2025. A deprecation date will be announced by email and in the table below at least 3 months before the model is removed. Use that time to compare the results of your pinned version with those of the latest version and to report any issues you find so that we can address them. Once you are happy with the results of the latest version, pin that version instead.
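
The timeline can be worked out with simple date arithmetic. The sketch below assumes that a model name is its release date in ISO format, as the “2024-02-01” example suggests. `add_months` and `pinning_timeline` are hypothetical helpers. The second returned date is the latest day a deprecation announcement could arrive if the model were removed at the earliest possible date.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def pinning_timeline(model_name):
    """Return (earliest removal date, latest announcement date for it)."""
    release = date.fromisoformat(model_name)  # assumes name == release date
    earliest_removal = add_months(release, 12)  # available for at least 1 year
    latest_notice = add_months(earliest_removal, -3)  # at least 3 months notice
    return earliest_removal, latest_notice


earliest_removal, latest_notice = pinning_timeline("2024-02-01")
print(earliest_removal, latest_notice)  # prints: 2025-02-01 2024-11-01
```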

Models available for pinning:

Data type    Model name    Description
product      2024-02-01    Default product model. Removal date not set yet, but not earlier than 2025-02-01.