Zyte API automatic extraction

Automatic extraction turns web content into structured data.

Automatic extraction supports AI-powered extraction of e-commerce, article and job posting data from any website, as well as non-AI extraction of Google Search results.

You can use Zyte API requests to get structured data from webpages or use AI spiders to get structured data from websites.

Structured data types

In a Zyte API request, enable any of the following fields to get matching structured data:

Note

You can only enable 1 of these fields per Zyte API request.

E-commerce (spider)

product (output) ai
productList (output) ai
productNavigation (output) ai

Articles (spider)

article (output) ai
articleList (output) ai
articleNavigation (output) ai
forumThread (output) ai

Job postings (spider)

jobPosting (output) ai
jobPostingNavigation (output) ai

Google Search (spider)

serp (output) non-ai
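
As a minimal sketch of enabling one field, the helper below builds a request body for a single structured data type and rejects names that are not in the list above. The helper name `build_extract_payload` and the example URL are hypothetical. The sketch only builds the JSON body; it assumes you would send it in a Zyte API request with your own API key.

```python
# Field names taken from the list of structured data types above.
STRUCTURED_FIELDS = {
    "product", "productList", "productNavigation",
    "article", "articleList", "articleNavigation", "forumThread",
    "jobPosting", "jobPostingNavigation",
    "serp",
}


def build_extract_payload(url, field):
    """Return a request body that enables exactly one structured data field.

    Hypothetical helper for illustration; not part of any Zyte library.
    """
    if field not in STRUCTURED_FIELDS:
        raise ValueError(f"unknown structured data field: {field!r}")
    return {"url": url, field: True}


payload = build_extract_payload("https://example.com/product/1", "product")
```

Because the helper takes a single field name, a payload built this way cannot enable two structured data fields at once.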

AI-powered extraction

Automatic extraction uses AI-powered extraction for the following structured data types: product, productList, productNavigation, article, articleList, articleNavigation, forumThread, jobPosting, jobPostingNavigation.

AI-powered extraction also supports LLM-based extraction of custom attributes, as well as geolocation, IP type, cookies, sessions, redirection, response headers, and metadata. Additional features are available depending on your extraction source.

Extraction source

Automatic extraction can be performed using either a browser request or an HTTP request. Choose between them with the corresponding extractFrom option, e.g. productOptions.extractFrom when extracting a product.

Currently, automatic extraction defaults to using a browser request. In the future, however, the default value may depend on the target website.
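
The following sketch shows one way to set the extractFrom option explicitly instead of relying on the default. The helper `with_extract_from` is hypothetical, and the option values `"httpResponseBody"` and `"browserHtml"` are assumptions that this page does not list; check the API reference for the accepted values.

```python
def with_extract_from(payload, data_type, source):
    """Return a copy of payload with <data_type>Options.extractFrom set.

    Hypothetical helper; `source` values are assumed, not confirmed here.
    """
    options_key = f"{data_type}Options"  # e.g. "productOptions"
    options = dict(payload.get(options_key, {}))
    options["extractFrom"] = source
    return {**payload, options_key: options}


base = {"url": "https://example.com/product/1", "product": True}

# Assumed value selecting extraction from an HTTP request.
http_payload = with_extract_from(base, "product", "httpResponseBody")
# Assumed value selecting extraction from a browser request.
browser_payload = with_extract_from(base, "product", "browserHtml")
```

Setting the option explicitly keeps behavior stable if the default later starts to depend on the target website.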

Automatic extraction using an HTTP request supports HTTP request attributes for method, body, and headers.

Automatic extraction using a browser request supports browser HTML, screenshots, some request headers, actions, network capture, and toggling JavaScript. The limitations of browser requests also apply in this case.

When deciding whether to use automatic extraction from a browser request or from an HTTP request, consider the following:

Extraction using an HTTP request is typically much faster and has a lower cost compared to extraction from a browser request.
For some websites, extraction from an HTTP request produces extremely poor results (such as low probability and missing fields), which often happens when the page requires JavaScript execution.
It is helpful to test both methods and choose extraction from a browser request if it provides better quality.
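
The comparison described above could be sketched as follows. This is an illustration under stated assumptions: each extracted record is taken to be a dictionary with a `metadata.probability` value, the `required_fields` list is something you define for your own use case, and the function names are hypothetical. Quality is scored by how many required fields are filled, with probability as the tie-breaker.

```python
def quality(record, required_fields):
    """Score a record as (number of filled required fields, probability)."""
    present = sum(
        1 for field in required_fields
        if record.get(field) not in (None, "", [])
    )
    # Assumed location of the probability value within the record.
    probability = record.get("metadata", {}).get("probability", 0.0)
    return (present, probability)


def choose_source(http_record, browser_record, required_fields):
    """Pick "browser" only when it scores strictly better than "http"."""
    if quality(browser_record, required_fields) > quality(http_record, required_fields):
        return "browser"
    # HTTP is typically faster and cheaper, so it wins ties.
    return "http"


http_record = {"name": "Shoe", "metadata": {"probability": 0.4}}
browser_record = {"name": "Shoe", "price": "10", "metadata": {"probability": 0.95}}
choice = choose_source(http_record, browser_record, ["name", "price"])
```

In practice you would run this over a sample of URLs from the target website rather than a single page before settling on one source.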

Model pinning

The AI models of AI-powered extraction are retrained regularly, usually a few times per year. While new model versions aim to improve overall accuracy, they may become less accurate for specific fields of specific websites.

For certain data types, we provide an option to pin a specific model version, which allows you to postpone an update to the latest model.

To pin a model, use the corresponding model option, e.g. productOptions.model when extracting a product.
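
For illustration, a request body that pins the product model might look like the sketch below. The URL is a placeholder and the helper `pin_model` is hypothetical; the version string comes from the table in this section.

```python
def pin_model(payload, data_type, version):
    """Return a copy of payload with <data_type>Options.model set.

    Hypothetical helper for illustration only.
    """
    options_key = f"{data_type}Options"  # e.g. "productOptions"
    options = dict(payload.get(options_key, {}))
    options["model"] = version
    return {**payload, options_key: options}


pinned = pin_model(
    {"url": "https://example.com/product/1", "product": True},
    "product",
    "2024-02-01",
)
```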

Model versions remain available for at least 1 year after their release. For example, a product model version "2024-02-01" would remain available at least until the 1st of February 2025.
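
The date arithmetic in the example above can be reproduced with a short sketch. It assumes that a model version name is the ISO date of its release, which matches the example but is not stated as a rule; the function name is hypothetical.

```python
from datetime import date


def guaranteed_until(model_version):
    """Return the date until which a model version is guaranteed available.

    Assumes the version name is the release date in YYYY-MM-DD form.
    """
    released = date.fromisoformat(model_version)
    try:
        return released.replace(year=released.year + 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 1 March.
        return date(released.year + 1, 3, 1)


until = guaranteed_until("2024-02-01")
```

The result is a lower bound only: a version may stay available longer, and removal is announced separately.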

When we decide to remove a model version, we announce its end-of-life date by email to its users at least 3 months in advance, and we list that date in the table below.

Data type	Model name	Description
product	2024-02-01
product	2024-09-16	Default product model