Zyte API automatic extraction
To let Zyte API parse the contents of the target URL automatically and give you structured data, enable one of the automatic extraction request fields. Their schemas are covered in the Zyte API reference. Each schema is a subset of the latest version of its matching Zyte Data schema.
Example
Note
Install and configure code example requirements and the Zyte CA certificate to run the example below.
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);
var apiKey = "YOUR_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);
client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");
var input = new Dictionary<string, object>(){
    {"url", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"},
    {"product", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");
HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();
var data = JsonDocument.Parse(body);
var product = data.RootElement.GetProperty("product").ToString();
Console.WriteLine(product);
With input.jsonl containing:
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}
run:
zyte-api input.jsonl \
    | jq --raw-output .product
With input.json containing:
{
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "product": true
}
run:
curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .product
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;
class Example {
  private static final String API_KEY = "YOUR_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url",
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "product",
            true);
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    try (CloseableHttpClient client = HttpClients.createDefault()) {
      try (CloseableHttpResponse response = client.execute(request)) {
        HttpEntity entity = response.getEntity();
        String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
        JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
        String product = jsonObject.get("product").toString();
        System.out.println(product);
      }
    }
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    product: true
  },
  {
    auth: { username: 'YOUR_API_KEY' }
  }
).then((response) => {
  const product = response.data.product
  console.log(product)
})
<?php
$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'product' => true,
    ],
]);
$data = json_decode($response->getBody());
$product = json_encode($data->product);
echo $product.PHP_EOL;
import requests
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": (
            "https://books.toscrape.com/catalogue"
            "/a-light-in-the-attic_1000/index.html"
        ),
        "product": True,
    },
)
product = api_response.json()["product"]
print(product)
import asyncio
import json
from zyte_api import AsyncZyteAPI
async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            "product": True,
        }
    )
    product = api_response["product"]
    print(json.dumps(product, indent=2, ensure_ascii=False))
asyncio.run(main())
from scrapy import Request, Spider
class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"

    def start_requests(self):
        yield Request(
            (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            meta={
                "zyte_api_automap": {
                    "product": True,
                },
            },
        )

    def parse(self, response):
        product = response.raw_api_response["product"]
        print(product)
Output (first 5 lines):
{
"name": "A Light in the Attic",
"price": "51.77",
"currency": "GBP",
"currencyRaw": "£",
Automatic extraction supports LLM-based extraction of custom attributes, as well as geolocation, IP type, cookies, sessions, redirection, response headers, and metadata, plus additional features depending on your extraction source.
Note
Currently, serp can only be combined with the url Zyte API request field; it does not yet support any of the other Zyte API features listed above.
You can also use our ready-to-use spiders to automate crawling and parsing.
Request fields
Enable any of the following request fields to get matching structured data:
Note
You can enable only one automatic extraction property per Zyte API request.
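Because only one automatic extraction field is allowed per request, a client-side check can catch mistakes before a request is sent. The following is a minimal sketch, not part of Zyte API or its client libraries: `EXTRACTION_FIELDS`, `enabled_extraction_fields`, and `validate_request` are hypothetical names, and the set lists only the data types named on this page (`product`, `article`, `serp`), so extend it from the Zyte API reference.

```python
# Hypothetical helper, not part of Zyte API or any client library.
# Only the data types named on this page are listed here; extend the
# set with the other fields from the Zyte API reference as needed.
EXTRACTION_FIELDS = {"product", "article", "serp"}


def enabled_extraction_fields(request: dict) -> list:
    """Return the automatic extraction fields set to true in a request body."""
    return sorted(key for key in EXTRACTION_FIELDS if request.get(key) is True)


def validate_request(request: dict) -> None:
    """Raise ValueError if more than one extraction field is enabled."""
    enabled = enabled_extraction_fields(request)
    if len(enabled) > 1:
        raise ValueError(
            "Only one automatic extraction field is allowed per request; "
            "got: " + ", ".join(enabled)
        )


# A request that enables a single extraction field passes the check.
validate_request({"url": "https://example.com", "product": True})
```

Calling `validate_request` with both `product` and `article` set to `True` raises `ValueError` in this sketch.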
Custom attributes
In addition to the extraction of standard data types, such as product or article, Zyte API allows extraction of user-defined attributes from any unstructured web page.
Extraction source
Automatic extraction can be performed using either a browser request or an HTTP request. Choose which using the corresponding extractFrom option, e.g. productOptions.extractFrom when extracting a product.
Currently, automatic extraction defaults to using a browser request, except for serp, which defaults to using an HTTP request. In the future, however, the default value may depend on the target website.
Automatic extraction using an HTTP request supports HTTP request attributes for method, body, and headers.
Automatic extraction using a browser request supports browser HTML, screenshots, some request headers, actions, network capture, and toggling JavaScript. The limitations of browser requests also apply in this case.
When deciding whether to use automatic extraction from a browser request or from an HTTP request, consider the following:
Extraction using an HTTP request is typically much faster and has a lower cost compared to extraction from a browser request.
For some websites, extraction from an HTTP request would produce extremely poor results (such as low probability and missing fields), which often happens when JavaScript execution is required.
It is helpful to test both methods and choose extraction from a browser request if it provides better quality.
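One way to test both methods is to send the same URL twice, changing only `productOptions.extractFrom`. The sketch below only builds the two request bodies. It assumes that the option accepts the values `"httpResponseBody"` and `"browserHtml"`, which are not stated on this page, so confirm them in the Zyte API reference first. `build_product_request` is a hypothetical helper name.

```python
import json

# Assumed values for productOptions.extractFrom; they are not stated on
# this page, so verify them against the Zyte API reference.
HTTP_SOURCE = "httpResponseBody"
BROWSER_SOURCE = "browserHtml"


def build_product_request(url: str, extract_from: str) -> dict:
    """Build a product extraction request body for a given extraction source."""
    return {
        "url": url,
        "product": True,
        "productOptions": {"extractFrom": extract_from},
    }


url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
for source in (HTTP_SOURCE, BROWSER_SOURCE):
    # Each body could be sent to the extract endpoint as a separate request.
    print(json.dumps(build_product_request(url, source)))
```

Each body can be posted to https://api.zyte.com/v1/extract in the same way as the examples at the top of this page, and the two product results compared field by field.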
AI spiders
Tip
The best way to learn all there is to know about AI spiders is to follow the AI spiders tutorial.
zyte-spider-templates provides spiders that use Zyte API automatic extraction to crawl and parse data from any website of a supported type, e.g. any e-commerce website. Use these spiders however you prefer:
Start a Scrapy project with AI spiders (covered in the AI spiders tutorial)
Add AI spiders to an existing Scrapy project (covered in the web scraping tutorial)
These spiders are spider templates with parameters that support complete customization.
Model pinning
Zyte API automatic extraction uses AI models that are retrained regularly, usually a few times per year. While new model versions aim to improve overall accuracy, they may become less accurate for specific fields of specific websites.
For certain data types, we provide an option to pin a specific model version, which allows you to postpone an update to the latest model.
To pin a model, use the corresponding model option, e.g. productOptions.model when extracting a product.
Model versions remain available for at least 1 year after release: for example, product model “2024-02-01” remains available until at least 1 February 2025. A deprecation date will be announced via email and in the table below at least 3 months before the model is removed. Use that time to compare the results of your pinned version with those of the latest version and report any issues you find so that we can address them. Once you are happy with the results of the latest version, pin that version instead.
Models available for pinning:
Data type | Model name | Description
---|---|---
product | 2024-02-01 | Default product model. Removal date not set yet, but not earlier than 2025-02-01.
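As an illustration of pinning and of the one-year availability rule, the sketch below builds a request body with `productOptions.model` set and computes the earliest date on which a model could be removed. It assumes that model names are release dates in YYYY-MM-DD form, as the “2024-02-01” example suggests. `pinned_product_request` and `earliest_removal_date` are hypothetical helper names, not part of any Zyte library.

```python
from datetime import date


def pinned_product_request(url: str, model: str) -> dict:
    """Build a product request body pinned to a specific model version."""
    return {
        "url": url,
        "product": True,
        "productOptions": {"model": model},
    }


def earliest_removal_date(model: str) -> date:
    """Return the release date plus one year.

    Assumes the model name is its release date in YYYY-MM-DD form.
    """
    released = date.fromisoformat(model)
    try:
        return released.replace(year=released.year + 1)
    except ValueError:
        # 29 February has no counterpart in the following year; use 1 March.
        return date(released.year + 1, 3, 1)


print(earliest_removal_date("2024-02-01"))  # 2025-02-01
```

The computed date is only a lower bound: the actual removal date is the one announced by email and in the table above.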