Zyte API automatic extraction#

Automatic extraction turns web data into structured data.

Automatic extraction supports AI-powered extraction of e-commerce, article and job posting data from any website, as well as non-AI extraction of Google Search results.

You can use Zyte API requests to get structured data from webpages or use AI spiders to get structured data from websites.

Structured data types#

In a Zyte API request, enable any of the following fields to get matching structured data:

Note

You can enable only one of these fields per Zyte API request.

Google Search (spider): serp (output), non-AI

Example

Note

Install and configure code example requirements and the Zyte CA certificate to run the example below.

C#

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"},
    {"product", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var product = data.RootElement.GetProperty("product").ToString();

Console.WriteLine(product);

Command-line client

input.jsonl#
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}

zyte-api input.jsonl \
    | jq --raw-output .product

curl

input.json#
{
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "product": true
}

curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .product

Java

import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url",
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "product",
            true);
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          JsonObject product = jsonObject.get("product").getAsJsonObject();
          Gson gson = new GsonBuilder().setPrettyPrinting().create();
          System.out.println(gson.toJson(product));
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}

JavaScript

const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    product: true
  },
  {
    auth: { username: 'YOUR_API_KEY' }
  }
).then((response) => {
  const product = response.data.product
  console.log(product)
})

PHP

<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'product' => true,
    ],
]);
$data = json_decode($response->getBody());
$product = json_encode($data->product);
echo $product.PHP_EOL;

Python

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": (
            "https://books.toscrape.com/catalogue"
            "/a-light-in-the-attic_1000/index.html"
        ),
        "product": True,
    },
)
product = api_response.json()["product"]
print(product)

Python (asyncio)

import asyncio
import json

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            "product": True,
        }
    )
    product = api_response["product"]
    print(json.dumps(product, indent=2, ensure_ascii=False))


asyncio.run(main())

Scrapy

from scrapy import Request, Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"

    def start_requests(self):
        yield Request(
            (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            meta={
                "zyte_api_automap": {
                    "product": True,
                },
            },
        )

    def parse(self, response):
        product = response.raw_api_response["product"]
        print(product)

Output (first 5 lines):

{
  "name": "A Light in the Attic",
  "price": "51.77",
  "currency": "GBP",
  "currencyRaw": "£",

AI-powered extraction#

Automatic extraction uses AI-powered extraction for the following structured data types: product, productList, productNavigation, article, articleList, articleNavigation, forumThread, jobPosting, jobPostingNavigation.

AI-powered extraction also supports LLM-based extraction of custom attributes, as well as geolocation, IP type, cookies, sessions, redirection, response headers, and metadata, plus additional features depending on your extraction source.

Extraction source#

Automatic extraction can be performed using either a browser request or an HTTP request. To choose between them, set the corresponding extractFrom option, e.g. productOptions.extractFrom when extracting a product.
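
As a minimal sketch of setting this option, the snippet below builds a request body with productOptions.extractFrom. The helper name build_product_request is hypothetical, and the value "httpResponseBody" is an assumption about the accepted values that should be checked against the Zyte API reference.

```python
import json


def build_product_request(url, extract_from=None):
    """Build a request body for product extraction (hypothetical helper).

    When extract_from is None, productOptions is omitted and the API
    default extraction source applies.
    """
    body = {"url": url, "product": True}
    if extract_from is not None:
        body["productOptions"] = {"extractFrom": extract_from}
    return body


URL = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

# Assumed value: "httpResponseBody" requests extraction from an HTTP request.
http_body = build_product_request(URL, extract_from="httpResponseBody")
print(json.dumps(http_body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to https://api.zyte.com/v1/extract, in the same way as the request bodies in the example above.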

Currently, automatic extraction defaults to using a browser request. In the future, however, the default value may depend on the target website.

Automatic extraction using an HTTP request supports HTTP request attributes for method, body, and headers.

Automatic extraction using a browser request supports browser HTML, screenshots, some request headers, actions, network capture, and toggling JavaScript. The limitations of browser requests also apply in this case.

When deciding whether to use automatic extraction from a browser request or from an HTTP request, consider the following:

  • Extraction using an HTTP request is typically much faster and has a lower cost compared to extraction from a browser request.

  • For some websites, extraction from an HTTP request produces extremely poor results (such as low probability and missing fields), which often happens when JavaScript execution is required.

  • It is helpful to test both methods and choose extraction from a browser request if it provides better quality.
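
One way to compare the two methods is to extract the same page with each and score the results. The sketch below is illustrative only: the function names are hypothetical, the metadata.probability field and the extractFrom values are assumptions to verify against the Zyte API reference, and the two records are made-up placeholders rather than real API output.

```python
def quality(record, expected_fields):
    """Return (expected fields present, probability) for an extracted record."""
    present = sum(
        1 for field in expected_fields if record.get(field) not in (None, "", [])
    )
    # Assumption: the record reports its probability under metadata.probability.
    probability = record.get("metadata", {}).get("probability", 0.0)
    return (present, probability)


def pick_extract_from(http_record, browser_record, expected_fields):
    """Keep the faster, cheaper HTTP request unless the browser scores strictly better."""
    if quality(browser_record, expected_fields) > quality(http_record, expected_fields):
        return "browserHtml"
    return "httpResponseBody"


EXPECTED = ["name", "price"]

# Made-up records for illustration, not real API output.
http_sample = {"name": "A Light in the Attic", "metadata": {"probability": 0.4}}
browser_sample = {
    "name": "A Light in the Attic",
    "price": "51.77",
    "metadata": {"probability": 0.9},
}

print(pick_extract_from(http_sample, browser_sample, EXPECTED))
```

Records are compared first by the number of expected fields present and then by probability. Ties go to the HTTP request because it is typically faster and cheaper.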

Model pinning#

The AI models of AI-powered extraction are retrained regularly, usually a few times per year. While new model versions aim to improve overall accuracy, they may become less accurate for specific fields of specific websites.

For certain data types, we provide an option to pin a specific model version, which allows you to postpone an update to the latest model.

To pin a model, use the corresponding model option, e.g. productOptions.model when extracting a product.
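
A minimal sketch of a request body that pins the product model, assuming the model option takes the model name as a string. The helper name build_pinned_product_request is hypothetical.

```python
import json


def build_pinned_product_request(url, model):
    """Build a product extraction request body pinned to a model version."""
    return {
        "url": url,
        "product": True,
        # productOptions.model pins the model version used for extraction.
        "productOptions": {"model": model},
    }


pinned_body = build_pinned_product_request(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    model="2024-02-01",
)
print(json.dumps(pinned_body, indent=2))
```

Omitting productOptions.model leaves the request on the current default model.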

Model versions remain available for at least 1 year after their release. For example, a product model version "2024-02-01" would remain available at least until the 1st of February 2025.
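
The one-year guarantee can be worked out from a model name, assuming (as in the example above) that the name is the release date in ISO format. The function name earliest_removal_date is hypothetical.

```python
from datetime import date


def earliest_removal_date(model_version):
    """Return the earliest date a model version may stop being available.

    Assumes the model name is its release date in YYYY-MM-DD format.
    """
    released = date.fromisoformat(model_version)
    try:
        return released.replace(year=released.year + 1)
    except ValueError:
        # 29 February has no counterpart in the following year; use 1 March.
        return released.replace(year=released.year + 1, month=3, day=1)


print(earliest_removal_date("2024-02-01"))  # 2025-02-01
```

This is only a lower bound. A model version may stay available for longer.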

When we decide to remove a model version, we announce its end-of-life date by email to its users at least 3 months in advance, and we list that date in the table below.

Data type    Model name    Description
product      2024-02-01
product      2024-09-16    Default product model