Zyte API reference documentation

This is the complete reference documentation of the HTTP API of Zyte API.
For topic-based usage documentation, see Zyte API usage documentation.
All requests require basic authentication, with your API key as username, and no password.
For example, if your API key is foo, you base64-encode foo: as Zm9vOg==
and send the Authorization header with value Basic Zm9vOg==.
Authorization: Basic Zm9vOg==
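
As a minimal sketch of the encoding step described above, the following Python snippet builds the Authorization header value from an API key. The function name basic_auth_header is illustrative and not part of any Zyte library.

```python
import base64


def basic_auth_header(api_key: str) -> str:
    """Build a Basic auth header value: API key as username, empty password."""
    # Basic authentication encodes "username:password"; the password is empty here.
    credentials = f"{api_key}:".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("foo"))  # Basic Zm9vOg==
```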

Web Data Extraction API (1.0.0)


A single API for web scraping

Process a single URL, return the result

Process a single URL, return the result.

This endpoint blocks until the result is ready. It is intended for short-running operations.

At least one of the following request fields must be set to true:

All automatic extraction data types support performing extraction using either a browser request or an HTTP request. Choose which using the corresponding extractFrom option, e.g. productOptions.extractFrom when extracting a product.

When extractFrom is not specified, automatic extraction currently defaults to a browser request, except for serp, which defaults to an HTTP request. In the future, however, the default value may depend on the target website.

When automatic extraction uses a browser request, it can be combined with any fields compatible with browserHtml, e.g. screenshot. When automatic extraction uses an HTTP request, it can be combined with any fields compatible with httpResponseBody. serp cannot be combined with any other fields besides serpOptions and url.

You cannot combine multiple automatic extraction request fields (e.g. product and productList) on the same request.

You cannot combine httpResponseBody with a request field that is exclusive to browser requests (e.g. httpResponseBody and browserHtml).

httpResponseHeaders can be requested alone or with any other valid combination of request fields except for serp.

The request body size limit is 5MiB.
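
The combination rules above can be checked on the client before sending a request. The following is a rough sketch, not an official validator: the function name check_request is hypothetical, the field sets only cover fields named in this reference, and the size check measures a local JSON serialization, which may differ slightly from the bytes actually sent.

```python
import json

# Automatic extraction fields named in this reference (illustrative, possibly incomplete).
EXTRACTION_FIELDS = {
    "article", "articleList", "articleNavigation", "forumThread",
    "jobPosting", "jobPostingNavigation",
    "product", "productList", "productNavigation", "serp",
}
# Fields documented as not compatible with HTTP requests.
BROWSER_ONLY_FIELDS = {"browserHtml", "screenshot"}
MAX_BODY_BYTES = 5 * 1024 * 1024  # 5 MiB


def check_request(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    enabled = {key for key, value in payload.items() if value is True}

    extraction = sorted(enabled & EXTRACTION_FIELDS)
    if len(extraction) > 1:
        problems.append("multiple automatic extraction fields: " + ", ".join(extraction))

    if "httpResponseBody" in enabled and enabled & BROWSER_ONLY_FIELDS:
        problems.append("httpResponseBody combined with a browser-only field")

    if "serp" in enabled:
        extra = sorted(set(payload) - {"serp", "serpOptions", "url"})
        if extra:
            problems.append("serp combined with other fields: " + ", ".join(extra))

    if len(json.dumps(payload).encode("utf-8")) > MAX_BODY_BYTES:
        problems.append("request body exceeds 5 MiB")

    return problems


print(check_request({"url": "https://example.com", "product": True, "productList": True}))
```

The sketch only reports problems it knows about; a request that passes it can still be rejected by the API.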

Authorizations:
Request Body schema: application/json

An extraction request body

url
required
string <= 8192 characters

An absolute URL to extract data from.

The host name must be a domain name; it cannot be an IP address.

requestHeaders
object (RequestHeaders)

HTTP request headers.

Can only be used in a browser request. For HTTP requests, see customHttpRequestHeaders.

At the moment it only supports the Referer header.

See an example.

referer
tags
object or null

Assign arbitrary key-value pairs to the request that you can use for filtering in the Stats API.

Keys must be strings. Values must be strings or null.

For example: {"tags": {"foo": "bar", "baz": null}}.

property name*
additional property
string
ipType
string
Enum: "datacenter" "residential"

Type of IP address from which the request should be sent.

If not specified, Zyte API will use an IP type that, for the target website, does not cause bans or unexpected response data.

If you believe Zyte API is using the wrong default IP type for a website, please reach out to our expert anti-ban team.

See an example.

httpRequestMethod
string
Enum: "GET" "POST" "PUT" "DELETE" "OPTIONS" "TRACE" "PATCH" "HEAD"
httpRequestBody
string <byte> <= 400000 characters

Base64-encoded data to send as request body.

Can only be used in combination with httpResponseBody.

It usually needs to be used in combination with httpRequestMethod.

If you only need to send UTF-8-encoded text, use httpRequestText instead to skip Base64-encoding. Note that you cannot combine both fields on the same request.

See an example. See also: customHttpRequestHeaders.

httpRequestText
string [ 1 .. 400000 ] characters

UTF-8 text to send as request body.

Can only be used in combination with httpResponseBody.

It usually needs to be used in combination with httpRequestMethod.

If you need to send a binary or non-UTF-8 request body, use httpRequestBody instead. Note that you cannot combine both fields on the same request.

See an example. See also: customHttpRequestHeaders.
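
To illustrate the choice between httpRequestText and httpRequestBody described above, here is a small sketch that builds a POST request body for either case. The helper name post_payload and the example URL are assumptions made for illustration.

```python
import base64


def post_payload(url: str, body) -> dict:
    """Build a request that sends `body` with POST and asks for the HTTP response body.

    str bodies go into httpRequestText as-is; bytes bodies are Base64-encoded
    into httpRequestBody. The two fields are never set together.
    """
    payload = {"url": url, "httpResponseBody": True, "httpRequestMethod": "POST"}
    if isinstance(body, str):
        payload["httpRequestText"] = body
    else:
        payload["httpRequestBody"] = base64.b64encode(body).decode("ascii")
    return payload


print(post_payload("https://example.com/api", '{"query": "shoes"}'))
print(post_payload("https://example.com/upload", b"\x00\x01"))
```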

customHttpRequestHeaders
Array of objects (CustomHttpRequestHeader) <= 200 items [ items ]

HTTP request headers.

Can only be used in combination with httpResponseBody. To set headers with other outputs, see requestHeaders.

Setting HTTP request headers has some caveats:

  • Zyte API sends some headers automatically for ban avoidance, and may silently override or drop some of your custom headers for that purpose.

    However, your custom headers may override those automatic headers, and in doing so they can break the ban avoidance capabilities of Zyte API, as some websites may ban based on the presence, values, or order of certain headers.

  • You cannot set the Cookie header. Use requestCookies instead.

  • If you set multiple headers with the same name, only the last header value will be sent. To overcome this limitation, join the header values with a comma into a single header value. For example, replace "customHttpRequestHeaders": [{"name": "foo", "value": "bar"}, {"name": "foo", "value": "baz"}] with "customHttpRequestHeaders": [{"name": "foo", "value": "bar,baz"}].

See an example. See also: httpRequestMethod, httpRequestText, httpRequestBody, httpResponseHeaders.
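
The comma-joining workaround for repeated header names can be automated. The sketch below uses a hypothetical helper, merge_duplicate_headers, and assumes that header names should be compared case-insensitively, as is usual for HTTP headers; the casing of the first occurrence is kept.

```python
def merge_duplicate_headers(headers: list) -> list:
    """Join the values of same-named headers with a comma, keeping first-seen order."""
    merged = {}  # lowercased header name -> header dict
    for header in headers:
        key = header["name"].lower()
        if key in merged:
            merged[key]["value"] += "," + header["value"]
        else:
            merged[key] = dict(header)  # copy, so the input is not modified
    return list(merged.values())


headers = [{"name": "foo", "value": "bar"}, {"name": "foo", "value": "baz"}]
print(merge_duplicate_headers(headers))  # [{'name': 'foo', 'value': 'bar,baz'}]
```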

Array (<= 200 items)
name
string <= 200 characters
value
string <= 2000 characters
httpResponseBody
boolean
Default: false

Set to true to get the HTTP response body in the httpResponseBody response field.

This field is not compatible with browser automation.

See an example. See also: httpRequestMethod, httpRequestText, httpRequestBody, customHttpRequestHeaders.

httpResponseHeaders
boolean
Default: false

Set to true to get the HTTP response headers in the httpResponseHeaders response field.

See an example. See also: customHttpRequestHeaders, requestHeaders.

browserHtml
boolean
Default: false

Set to true to get the browser HTML in the browserHtml response field.

This field is not compatible with HTTP requests.

If you use actions, the browser HTML is generated after action execution has finished or timed out.

See an example. See also: screenshot, requestHeaders.

screenshot
boolean
Default: false

Set to true to get a page screenshot in the screenshot response field.

This field is not compatible with HTTP requests.

To adjust the screenshot contents you can use screenshotOptions and viewport.

If you use actions, the screenshot is generated after action execution has finished or timed out.

See an example. See also: browserHtml, requestHeaders.

screenshotOptions
object (ScreenshotOptions)

Options for the screenshot taken when the screenshot request field is true.

format
string
Default: "jpeg"
Enum: "png" "jpeg"

File format.

JPEG screenshots are taken with a quality of 75%.

fullPage
boolean
Default: false

When true, the screenshot features the full page. When false, it features only what is visible on the browser window (viewport).

Full page screenshots:

  • Are only available in JPEG format.

  • Have a minimum resolution of 1920x1080, i.e. for pages smaller than 1920x1080, the screenshot looks the same regardless of the value of fullPage.

  • Are clipped to 5000 (width) x 10000 (height) pixels if they exceed those dimensions.
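
As a sketch of how the screenshot options above fit together, the following helper builds a screenshot request and rejects the one combination the list rules out, a full-page screenshot in PNG format. The helper name screenshot_payload is hypothetical.

```python
def screenshot_payload(url: str, full_page: bool = False, image_format: str = "jpeg") -> dict:
    """Build a screenshot request, enforcing that full-page screenshots use JPEG."""
    if image_format not in ("jpeg", "png"):
        raise ValueError("format must be 'jpeg' or 'png'")
    if full_page and image_format != "jpeg":
        raise ValueError("full-page screenshots are only available in JPEG format")
    return {
        "url": url,
        "screenshot": True,
        "screenshotOptions": {"format": image_format, "fullPage": full_page},
    }


print(screenshot_payload("https://example.com", full_page=True))
```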

article
boolean
Default: false

Set to true to get article data in the article response field.

The target page should only contain a single article, such as a blog post or a news article. For pages with multiple articles consider using articleList instead.

To combine this field with HTTP requests, set articleOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, articleNavigation, browserHtml, screenshot, requestHeaders.

articleOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

articleList
boolean
Default: false

Set to true to get article list data in the articleList response field.

The target page should contain multiple articles, usually as links or short snippets. Examples of such pages are main or category pages of news sites, main pages of blogs showing multiple posts, and other pages with multiple articles.

Article list data is especially useful to get basic information about articles on a website, like a headline and a link to the article details, using a smaller number of requests, when article attributes are extracted directly from an article list page, without making individual article requests.

To implement article crawling from article list pages, use articleNavigation, which also enables navigation through pagination links.

To combine this field with HTTP requests, set articleListOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, browserHtml, screenshot, requestHeaders.

articleListOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

articleNavigation
boolean
Default: false

Set to true to get article navigation data in the articleNavigation response field.

The target page should contain multiple articles and/or subcategories that can be followed.

Article navigation data is especially useful for implementing article crawling, i.e. following links to article pages, as well as to subcategories and pagination that can in turn link to more article pages.

Article navigation data can also be used to get basic information of articles and subcategories on a website, obtaining the URLs and link names of the articles and subcategories, without making individual requests for those articles.

To combine this field with HTTP requests, set articleNavigationOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, article, articleList, browserHtml, screenshot, requestHeaders.

articleNavigationOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

forumThread
boolean
Default: false

Set to true to get forum thread data in the forumThread response field.

The target page should be an individual forum thread page on a forum website.

To combine this field with HTTP requests, set forumThreadOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, article, browserHtml, screenshot, requestHeaders.

forumThreadOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

jobPosting
boolean
Default: false

Set to true to get job posting data in the jobPosting response field.

The target page should be an individual job posting page on a company website or on a job website.

To combine this field with HTTP requests, set jobPostingOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, browserHtml, screenshot, requestHeaders.

jobPostingOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

jobPostingNavigation
boolean
Default: false

Set to true to get job posting navigation data in the jobPostingNavigation response field.

The target page should contain multiple job postings and/or subcategories that can be followed.

Job posting navigation data is especially useful for implementing job posting crawling, i.e. following links to job posting pages, as well as pagination that can in turn link to more job posting pages.

Job posting navigation data can also be used to get basic information of job postings on a website, obtaining the URLs and link names of the job postings, without making individual requests for them.

To combine this field with HTTP requests, set jobPostingNavigationOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, jobPosting, browserHtml, screenshot, requestHeaders.

jobPostingNavigationOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

product
boolean
Default: false

Set to true to get product data in the product response field.

The target page should only contain a single product. For pages with multiple products consider using productList instead.

To combine this field with HTTP requests, set productOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See an example. See also: List of all automatic extraction request fields, productNavigation, browserHtml, screenshot, requestHeaders.

productOptions
object

Additional options for product extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

model
string
Enum: "2024-02-01" "2024-09-16"

Model version to use for product extraction. If not specified, the "2024-02-01" version is used.

Available product models:

  • "2024-02-01"

  • "2024-09-16"

See Model pinning.
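
A product request that sets both options above might be assembled as follows. This is a sketch under the enum values listed in this reference; the helper name product_payload is hypothetical, and options that are not passed are left out so that the API defaults apply.

```python
PRODUCT_MODELS = ("2024-02-01", "2024-09-16")
EXTRACT_FROM_VALUES = ("httpResponseBody", "browserHtml")


def product_payload(url: str, extract_from: str = None, model: str = None) -> dict:
    """Build a product extraction request, including only the options that were given."""
    options = {}
    if extract_from is not None:
        if extract_from not in EXTRACT_FROM_VALUES:
            raise ValueError(f"unsupported extractFrom: {extract_from!r}")
        options["extractFrom"] = extract_from
    if model is not None:
        if model not in PRODUCT_MODELS:
            raise ValueError(f"unknown product model: {model!r}")
        options["model"] = model
    payload = {"url": url, "product": True}
    if options:
        payload["productOptions"] = options
    return payload


print(product_payload("https://example.com/item", "httpResponseBody", "2024-09-16"))
```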

productList
boolean
Default: false

Set to true to get product list data in the productList response field.

The target page should contain a list or a grid of products.

Product list data is especially useful to get basic information about products on a website using a smaller number of requests, when product attributes are extracted directly from a product list page, without making individual product requests.

To implement product crawling from product list pages, use productNavigation, which also enables navigation through pagination links.

To combine this field with HTTP requests, set productListOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, browserHtml, screenshot, requestHeaders.

productListOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

productNavigation
boolean
Default: false

Set to true to get product navigation data in the productNavigation response field.

The target page should contain multiple products and/or subcategories that can be followed.

Product navigation data is especially useful for implementing product crawling, i.e. following links to product pages, as well as to subcategories and pagination that can in turn link to more product pages.

Product navigation data can also be used to get basic information of products and subcategories on a website, obtaining the URLs and link names of the products and subcategories, without making individual requests for those products.

To combine this field with HTTP requests, set productNavigationOptions.extractFrom to "httpResponseBody".

If you use actions, data extraction happens after action execution has finished or timed out.

See also: List of all automatic extraction request fields, product, productList, browserHtml, screenshot, requestHeaders.

productNavigationOptions
object (ExtractionOptions)

Options for datatype extraction.

extractFrom
string
Enum: "httpResponseBody" "browserHtml"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, browserHtml is currently used by default. In the future, the default value may depend on the target website.

customAttributes
object or null

Schema of the custom attributes to extract. This is a subset of the OpenAPI specification, using JSON syntax.

Zyte custom attributes extraction uses a Large Language Model (LLM) operated by Zyte to obtain any structured data specified by this schema from any unstructured web page. This lets you perform extraction similar to standard schemas, such as article or product, but much more flexibly.

When this field is specified, the customAttributes.values field in the response contains the extracted data.

When custom attributes extraction is requested, a standard extraction field must also be specified (e.g. product). This determines which part of the web page is passed to the LLM for custom attributes extraction. For example, when a web page is a product page, only the product information is passed, ignoring other parts of the page, such as the menu or footer, which makes extraction cheaper and more accurate.

See detailed documentation. Additionally, to see a request example, scroll up to the right-hand sidebar Request samples, and select “Extract Custom Attributes along with Article information” under Example.

additional property
object (CustomAttribute)
description
string <= 300 characters
type
required
string
customAttributesOptions
object

Additional options for custom attributes extraction.

method
string
Default: "generate"
Enum: "generate" "extract"

Method to use for custom attributes extraction:

  • "generate" (default) generates extracted data with the help of a generative Large Language Model (LLM). It is the most powerful and versatile extraction method, but also the most expensive one, with variable per-request cost.

  • "extract" locates extracted data in the requested web page with the help of a non-generative LLM. It only supports a subset of the schema (only string, integer and number types), and can't perform generative tasks such as summarization or data transformation. It is however much cheaper compared to the generative method and has a fixed per-request cost.

maxInputTokens
integer >= 1

Limit on the number of input tokens for custom attribute extraction with the "generate" method.

This includes the schema as well, but not our internal fixed prompt with the LLM instruction.

When the number of tokens for schema and page text is above the specified maxInputTokens, we truncate the page text to fit in maxInputTokens. This may result in quality degradation or data not extracted from the page because it was truncated.

Tokens are words or word pieces, for example {"price": "2.00 $"} is 9 tokens: {", price, ":, ", 2, ., 00, $, "}.

maxOutputTokens
integer >= 1

Limit on the number of output tokens for extracted custom attributes with the "generate" method. This field can be set to limit the extraction cost, but may result in quality degradation.

See an example of token counting in the maxInputTokens field above.
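
The pieces above (a schema, a standard extraction field, and the extraction options) can be combined as in the following sketch. The request field names customAttributes and customAttributesOptions are assumptions here, as is the helper name custom_attributes_payload; the check for the "extract" method follows the type restriction stated above.

```python
EXTRACT_METHOD_TYPES = {"string", "integer", "number"}


def custom_attributes_payload(url, schema, method="generate",
                              max_input_tokens=None, max_output_tokens=None):
    """Build a product request that also asks for custom attributes."""
    if method not in ("generate", "extract"):
        raise ValueError("method must be 'generate' or 'extract'")
    if method == "extract":
        # The "extract" method only supports string, integer and number types.
        unsupported = sorted(
            name for name, spec in schema.items()
            if spec["type"] not in EXTRACT_METHOD_TYPES
        )
        if unsupported:
            raise ValueError("types not supported by 'extract': " + ", ".join(unsupported))
    options = {"method": method}
    if max_input_tokens is not None:
        options["maxInputTokens"] = max_input_tokens
    if max_output_tokens is not None:
        options["maxOutputTokens"] = max_output_tokens
    return {
        "url": url,
        "product": True,  # a standard extraction field is required
        "customAttributes": schema,
        "customAttributesOptions": options,
    }


schema = {"color": {"type": "string", "description": "Main color of the product"}}
print(custom_attributes_payload("https://example.com/item", schema, max_output_tokens=200))
```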

geolocation
string (CountryCode)
Enum: "AW" "AF" "AO" "AI" "AX" "AL" "AD" "AE" "AR" "AM" "AS" "AQ" "TF" "AG" "AU" "AT" "AZ" "BI" "BE" "BJ" "BQ" "BF" "BD" "BG" "BH" "BS" "BA" "BL" "BY" "BZ" "BM" "BO" "BR" "BB" "BN" "BT" "BV" "BW" "CF" "CA" "CC" "CH" "CL" "CN" "CI" "CM" "CD" "CG" "CK" "CO" "KM" "CV" "CR" "CU" "CW" "CX" "KY" "CY" "CZ" "DE" "DJ" "DM" "DK" "DO" "DZ" "EC" "EG" "ER" "EH" "ES" "EE" "ET" "FI" "FJ" "FK" "FR" "FO" "FM" "GA" "GB" "GE" "GG" "GH" "GI" "GN" "GP" "GM" "GW" "GQ" "GR" "GD" "GL" "GT" "GF" "GU" "GY" "HK" "HM" "HN" "HR" "HT" "HU" "ID" "IM" "IN" "IO" "IE" "IR" "IQ" "IS" "IL" "IT" "JM" "JE" "JO" "JP" "KZ" "KE" "KG" "KH" "KI" "KN" "KR" "KW" "LA" "LB" "LR" "LY" "LC" "LI" "LK" "LS" "LT" "LU" "LV" "MO" "MF" "MA" "MC" "MD" "MG" "MV" "MX" "MH" "MK" "ML" "MT" "MM" "ME" "MN" "MP" "MZ" "MR" "MS" "MQ" "MU" "MW" "MY" "YT" "NA" "NC" "NE" "NF" "NG" "NI" "NU" "NL" "NO" "NP" "NR" "NZ" "OM" "PK" "PA" "PN" "PE" "PH" "PW" "PG" "PL" "PR" "KP" "PT" "PY" "PS" "PF" "QA" "RE" "RO" "RU" "RW" "SA" "SD" "SN" "SG" "GS" "SH" "SJ" "SB" "SL" "SV" "SM" "SO" "PM" "RS" "SS" "ST" "SR" "SK" "SI" "SE" "SZ" "SX" "SC" "SY" "TC" "TD" "TG" "TH" "TJ" "TK" "TM" "TL" "TO" "TT" "TN" "TR" "TV" "TW" "TZ" "UG" "UA" "UM" "UY" "US" "UZ" "VA" "VC" "VE" "VG" "VI" "VN" "VU" "WF" "WS" "YE" "ZA" "ZM" "ZW"

ISO 3166-1 alpha-2 code of a country from which the request should be sent, i.e. the request geolocation.

If not specified, Zyte API will use a geolocation that, for the target website, does not cause bans or unexpected locale changes in the response data, such as the wrong language, currency, date format, time zone, etc.

If you believe Zyte API is using the wrong default geolocation for a website, please reach out to our expert anti-ban team.

For some websites, however, you might want to set a custom geolocation. For example, you may be interested in visiting the same URL from different locations.

Zyte API provides 2 sets of geolocations. Standard geolocations are AU, BE, BR, CA, CN, DE, ES, FR, GB, IN, IT, JP, KR, MX, NL, PL, RU, TR, US, and ZA. All other geolocations are extended geolocations.

See an example.
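
Because the two sets of geolocations are defined by a fixed list, a client can tell them apart with a simple lookup. The sketch below uses a hypothetical helper, geolocation_tier; it does not check that the code is one of the supported country codes.

```python
# Standard geolocations as listed above; every other supported code is extended.
STANDARD_GEOLOCATIONS = frozenset(
    "AU BE BR CA CN DE ES FR GB IN IT JP KR MX NL PL RU TR US ZA".split()
)


def geolocation_tier(country_code: str) -> str:
    """Return 'standard' or 'extended' for an ISO 3166-1 alpha-2 country code."""
    return "standard" if country_code.upper() in STANDARD_GEOLOCATIONS else "extended"


print(geolocation_tier("DE"))  # standard
print(geolocation_tier("PT"))  # extended
```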

javascript
boolean

Forces JavaScript execution on a browser request to be enabled (true) or disabled (false).

By default Zyte API enables or disables JavaScript execution for a request depending on which option makes it easier to avoid bans. Use this request field to override that choice.

Passing this request field when requesting automatic extraction (product, article, etc.) may impact the quality of the returned data, as it might override the optimal value for automatic extraction.

This field is not compatible with HTTP requests.

See an example.

actions
Array of click (object) or doubleClick (object) or evaluate (object) or goto (object) or hide (object) or hover (object) or interaction (object) or keyPress (object) or reload (object) or scrollBottom (object) or scrollTo (object) or searchKeyword (object) or select (object) or setLocation (object) or type (object) or waitForNavigation (object) or waitForRequest (object) or waitForResponse (object) or waitForSelector (object) or waitForTimeout (object) (ActionSequence) [ items ]

Sequence of browser actions to execute.

Select an action below to see its API reference.

When using actions, you get the actions response field with debug information about action execution.

See an example.

Array
One of
action
required
any
Value: "click"

Click on an element.

required
object (ActionSelector)

A CSS or XPath selector to search for an element.

type
required
string
Enum: "css" "xpath"

The type of selector - CSS or XPath

value
required
string [ 1 .. 500 ] characters
state
string
Default: "visible"
Enum: "attached" "visible" "hidden"

State can be either of the following values and defaults to visible

  • 'visible' - The element has a non-empty bounding box and no visibility:hidden. Note that an element without content or with display:none has an empty bounding box, and is not considered visible.
  • 'hidden' - The element is either detached from the DOM, or has an empty bounding box or visibility:hidden. This is the opposite of the 'visible' option.
  • 'attached' - The element is present in the DOM; it can be visible or hidden
button
string
Default: "left"
Enum: "left" "right" "middle"

Mouse button to click

delay
number [ 0 .. 3 ]
Default: 0

Time to wait between mousedown and mouseup, in seconds.

waitForNavigationTimeout
number [ 0 .. 20 ]
Default: 0

Maximum waiting time in seconds for the navigation event during the click action.

If navigation happens within the defined duration, waiting stops and the next action is executed after the new page is loaded. If the page loading does not finish, the next action ends with an error, and following actions may not be executed, depending on the onError property. If no navigation happens within the defined duration, the next action is executed.

onError
string (onError)
Default: "return"
Enum: "continue" "return"

Handle errors encountered while executing a particular action.

  • continue - When a particular action fails, the action sequence continues, executing the next actions
  • return - When a particular action fails, the action sequence stops, not executing any more actions

When an action sequence finishes prematurely the service will return the entire response body up until the point of execution.
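
To show how the click properties above fit into a request, here is a sketch that builds a single click action and checks the documented ranges. The key names selector and actions are assumptions, since the property names of the selector object and of the action sequence are not shown in the text above; click_action is a hypothetical helper.

```python
def click_action(value, selector_type="css", delay=0,
                 wait_for_navigation_timeout=0, on_error="return"):
    """Build a click action, validating values against the documented limits."""
    if selector_type not in ("css", "xpath"):
        raise ValueError("selector type must be 'css' or 'xpath'")
    if not 1 <= len(value) <= 500:
        raise ValueError("selector value must be 1 to 500 characters long")
    if not 0 <= delay <= 3:
        raise ValueError("delay must be between 0 and 3 seconds")
    if not 0 <= wait_for_navigation_timeout <= 20:
        raise ValueError("waitForNavigationTimeout must be between 0 and 20 seconds")
    if on_error not in ("continue", "return"):
        raise ValueError("onError must be 'continue' or 'return'")
    return {
        "action": "click",
        "selector": {"type": selector_type, "value": value},  # assumed key name
        "delay": delay,
        "waitForNavigationTimeout": wait_for_navigation_timeout,
        "onError": on_error,
    }


payload = {
    "url": "https://example.com",
    "browserHtml": True,
    "actions": [  # assumed key name
        click_action("button.load-more", wait_for_navigation_timeout=5, on_error="continue"),
    ],
}
print(payload)
```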

jobId
string <= 100 characters

ID of the Scrapy Cloud job from which this request has been sent, to be returned in the jobId response field.

This field is meant to help with request tracking.

scrapy-zyte-api fills this request field automatically.

See an example. See also: echoData.

echoData
any

This field is returned in the echoData response field, verbatim.

This field can be useful, for example, to keep track of the original request order when sending multiple requests in parallel.

The request can be rejected if the data is too big.

See an example. See also: jobId.
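
The request-order use case mentioned above could look like the following sketch: each request carries its position in echoData, and responses that arrive out of order are sorted back by that value. The helper names and the simulated responses are illustrative; no requests are actually sent.

```python
def tag_requests(urls):
    """Build one request per URL, storing the original position in echoData."""
    return [
        {"url": url, "httpResponseBody": True, "echoData": index}
        for index, url in enumerate(urls)
    ]


def restore_order(responses):
    """Sort responses by the echoData value that was echoed back verbatim."""
    return sorted(responses, key=lambda response: response["echoData"])


requests_to_send = tag_requests(["https://a.example", "https://b.example"])
# Simulated responses, arriving in a different order than the requests were sent.
received = [
    {"url": "https://b.example", "echoData": 1},
    {"url": "https://a.example", "echoData": 0},
]
print([response["url"] for response in restore_order(received)])
```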

viewport
object (Viewport)
width
integer [ 320 .. 5120 ]
Default: 1920

Viewport width, in pixels.

height
integer [ 360 .. 4096 ]
Default: 1080

Viewport height, in pixels.

followRedirect
boolean

Whether to follow HTTP redirection or not.

Only supported in HTTP requests; browser requests always follow redirection.

sessionContext
Array of objects (SessionContext) [ items <= 10 items ]

User-defined name-value pairs to request a server-managed session initialized with sessionContextParameters.

For every subsequent request with the same session context, Zyte API will either reuse an available session created for the same session context or create a new session using sessionContextParameters.

Server-managed sessions expire after 4 hours or 3 ban responses. If you are targeting websites that silently expire their sessions before the 4-hour mark, i.e. they revert the effects of your sessionContextParameters but requests continue working as expected otherwise, consider using client-managed sessions for higher session control.

See an example. See also: requestCookies, responseCookies.

Array
name
required
string [ 1 .. 30 ] characters

Name of the context identifier.

value
required
string [ 1 .. 100 ] characters

Value of the context identifier.

sessionContextParameters
object (SessionContextParameters)

Parameters to create a server-managed session for a given sessionContext.

See an example. See also: actions.

actions
Array of click (object) or doubleClick (object) or evaluate (object) or goto (object) or hide (object) or hover (object) or interaction (object) or keyPress (object) or reload (object) or scrollBottom (object) or scrollTo (object) or searchKeyword (object) or select (object) or setLocation (object) or type (object) or waitForNavigation (object) or waitForRequest (object) or waitForResponse (object) or waitForSelector (object) or waitForTimeout (object) (SessionContextActionSequence) [ items ]

Actions to run to initialize a server-managed session for a given sessionContext.

Array
One of
action
required
any
Value: "click"

Click on an element.

required
object (ActionSelector)

A CSS or XPath selector to search for an element.

type
required
string
Enum: "css" "xpath"

The type of selector - CSS or XPath

value
required
string [ 1 .. 500 ] characters
state
string
Default: "visible"
Enum: "attached" "visible" "hidden"

State can be either of the following values and defaults to visible

  • 'visible' - The element has a non-empty bounding box and no visibility:hidden. Note that an element without content or with display:none has an empty bounding box, and is not considered visible.
  • 'hidden' - The element is either detached from the DOM, or has an empty bounding box or visibility:hidden. This is the opposite of the 'visible' option.
  • 'attached' - The element is present in the DOM; it can be visible or hidden
button
string
Default: "left"
Enum: "left" "right" "middle"

Mouse button to click

delay
number [ 0 .. 3 ]
Default: 0

Time to wait between mousedown and mouseup, in seconds.

waitForNavigationTimeout
number [ 0 .. 20 ]
Default: 0

Maximum waiting time in seconds for the navigation event during the click action.

If navigation happens within the defined duration, waiting stops and the next action is executed after the new page is loaded. If the page loading does not finish, the next action ends with an error, and following actions may not be executed, depending on the onError property. If no navigation happens within the defined duration, the next action is executed.

onError
string (onError)
Default: "return"
Enum: "continue" "return"

Handle errors encountered while executing a particular action.

  • continue - When a particular action fails, the action sequence continues, executing the next actions
  • return - When a particular action fails, the action sequence stops, not executing any more actions

When an action sequence finishes prematurely the service will return the entire response body up until the point of execution.

session
object (Session)

Parameters to create or reuse a client-managed session.

If id does not match one of your running sessions, a new session is created with that session ID. Otherwise, the matching running session is reused.

Client-managed sessions may expire due to any of the following:

  • 15 minutes (900 seconds) have passed since the session was created.

  • 2 minutes (120 seconds) have passed since the session was last used.

  • Requests using this session got banned 3 times in a row.

For 5-10 minutes after a session expires, Zyte API keeps track of the expired session and does not allow re-using it. After that time, attempts to reuse the session will instead create a new session.

See an example.

id
string

User-defined session ID.

It must be a version 4 UUID, i.e. a randomly-generated UUID.
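
A client that manages its own sessions needs a valid session ID and some way to guess when a session has timed out. The following sketch generates a version 4 UUID and applies the two time limits listed above; it cannot detect ban-related expiry. The helper names are hypothetical, and the request key session is an assumption, since the field name is not shown in the text above.

```python
import time
import uuid

SESSION_MAX_AGE = 900   # seconds since the session was created
SESSION_MAX_IDLE = 120  # seconds since the session was last used


def new_session_id() -> str:
    """Return a randomly generated (version 4) UUID as a string."""
    return str(uuid.uuid4())


def may_have_expired(created_at: float, last_used_at: float, now: float = None) -> bool:
    """Client-side estimate of expiry based only on the documented time limits."""
    now = time.time() if now is None else now
    return (now - created_at) >= SESSION_MAX_AGE or (now - last_used_at) >= SESSION_MAX_IDLE


payload = {
    "url": "https://example.com",
    "browserHtml": True,
    "session": {"id": new_session_id()},  # assumed key name
}
print(payload["session"]["id"])
```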

networkCapture
Array of objects (NetworkCaptureFilterSequence) <= 10 items [ items ]

Filters to capture browser network responses.

HTTP responses received during browser rendering (including action execution) will be returned in the networkCapture response field if they match any of the filters defined here.

You can capture up to 10 responses, provided the sum of their bodies does not exceed 5 MiB. If they do exceed that limit, only the first captured responses within the limit are returned.

See an example.

Array (<= 10 items)
filterType
required
string
httpResponseBody
boolean
Default: false

Set to true to get the body of the captured response in the networkCapture[].httpResponseBody response field.

value
required
string [ 3 .. 8192 ] characters

A string to compare with the URL of network responses according to matchType.

matchType
required
string (PatternMatchingOptions)
Default: "contains"
Enum: "startsWith" "endsWith" "contains" "exact"

How to compare a user-defined string with a target string:

  • contains matches if the user-defined string is a substring of the target string.

  • exact matches if the user-defined string is an exact match of the target string.

  • startsWith matches if the target string starts with the user-defined string.

  • endsWith matches if the target string ends with the user-defined string.

Comparisons are case-sensitive. Regular expressions or wildcard characters are not supported.
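The matching rules above can be approximated locally, for example to check a filter before sending it. This is a sketch of the documented semantics, not the service's implementation; the function name url_matches is hypothetical.

```python
def url_matches(value: str, match_type: str, url: str) -> bool:
    """Compare a filter value with a response URL according to matchType.

    Comparisons are case-sensitive and use no regular expressions or wildcards.
    """
    if match_type == "contains":
        return value in url
    if match_type == "exact":
        return value == url
    if match_type == "startsWith":
        return url.startswith(value)
    if match_type == "endsWith":
        return url.endswith(value)
    raise ValueError(f"unsupported matchType: {match_type!r}")
```

For instance, url_matches("/api/", "contains", "https://example.com/api/items") is true, while the same call with "/API/" is false because the comparison is case-sensitive.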

device
string
Enum: "desktop" "mobile"

Type of device to emulate during your request.

A desktop device is emulated by default.

Can only be used in combination with httpResponseBody.

cookieManagement
any
Default: "auto"
Enum: "auto" "discard"

Cookie management method

It determines how to handle user cookies (defined through requestCookies) and automatic cookies (cookies automatically generated by Zyte API). auto (default) uses user cookies if defined, or automatic cookies otherwise.

discard uses user cookies if defined, or no cookies otherwise.

Array of objects (Cookie) <= 100 items [ items ]

A list of cookies to be sent with a request.

You can use the contents of the responseCookies response field as a value for this request field.

See an example.
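As a sketch of feeding responseCookies back into requestCookies, the hypothetical helper below builds a follow-up request body from a previous API response. Combining the cookies with httpResponseBody and trimming the list to the 100-item limit are illustrative choices, not requirements stated here.

```python
def follow_up_request(url: str, previous_response: dict) -> dict:
    """Build a request body that sends back cookies from a previous response."""
    body = {"url": url, "httpResponseBody": True, "responseCookies": True}
    cookies = previous_response.get("responseCookies")
    if cookies:
        # requestCookies accepts at most 100 items, so keep only the first 100.
        body["requestCookies"] = cookies[:100]
    return body
```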

Array (<= 100 items)
name
required
string <= 4085 characters

Cookie name

value
required
string <= 4085 characters

Cookie value

domain
required
string <= 253 characters

Domain the cookie belongs to

path
string

Path the cookie belongs to

expires
integer <int64>

Unix time in seconds.

httpOnly
boolean
secure
boolean
sameSite
string
Enum: "Strict" "Lax" "Extended" "None"
responseCookies
boolean
Default: false

Set to true to get the list of cookies set during a request in the responseCookies response field.

See an example. See also: requestCookies.

serp
boolean

Set to true to get the data of a search engine results page (SERP) in the serp response field.

The target URL should be a search URL that belongs to a Google domain.

Currently, you cannot combine this field with any other request fields besides serpOptions and url.

See also: List of all automatic extraction request fields.
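The restriction on combining serp with other fields can be checked client-side before sending a request. This sketch takes the rule literally (only url, serp and serpOptions are allowed); the helper name invalid_serp_fields is hypothetical.

```python
ALLOWED_WITH_SERP = {"url", "serp", "serpOptions"}


def invalid_serp_fields(request_body: dict) -> set:
    """Return the request fields that cannot be combined with serp."""
    if not request_body.get("serp"):
        return set()  # The rule only applies to SERP requests.
    return set(request_body) - ALLOWED_WITH_SERP
```

An empty result means the request body respects the rule; any returned names should be removed or sent in a separate request.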

object (SerpOptions)

Options for SERP extraction.

extractFrom
string
Enum: "browserHtml" "httpResponseBody"

Input to use for extraction, either httpResponseBody or browserHtml.

If not specified, httpResponseBody is currently used by default. In the future, the default value may depend on the target website.

includeIframes
boolean
Default: false

Whether to add the content of iframes into browserHtml.

Note that iframes are visible in screenshots even if this is set to false.

See also: browserHtml.

Responses

Response Schema: application/json
url
required
string

URL the data was extracted from.

Could be different from the input URL in case of redirection.

See also: statusCode.

statusCode
integer

The HTTP status code retrieved from the target page.

If redirection is followed, this is the status code of the response after redirection.

See also: url.

httpResponseBody
string <byte>

Base64-encoded HTTP response body.

To get this response field, set the httpResponseBody request field to true.

Unlike browserHtml, this field supports binary response bodies, such as image files or PDF files. This is why the field is Base64-encoded: JSON does not support binary data.

See an example.
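A minimal sketch of decoding this field with Python's standard base64 module. The helper name and the sample response are hypothetical; a real response comes from the API.

```python
from base64 import b64decode


def decode_http_response_body(api_response: dict) -> bytes:
    """Return the raw bytes of the httpResponseBody response field."""
    return b64decode(api_response["httpResponseBody"])


# Hypothetical API response fragment; "aGVsbG8=" is Base64 for b"hello".
sample_response = {"url": "https://example.com", "httpResponseBody": "aGVsbG8="}
body = decode_http_response_body(sample_response)
```

The result is bytes, so it can be written to a file as-is for binary content, or decoded to text with the appropriate character encoding.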

Array of objects (HTTPHeader) [ items ]

HTTP response headers.

To get this response field, set the httpResponseHeaders request field to true.

The Content-Encoding header value (e.g. gzip, br, etc.) should not be used to decompress httpResponseBody, because Zyte API already decompresses the body of compressed responses.

The Set-Cookie header value, when present, contains the header value received from the main HTTP response. These cookies could have changed later on, e.g. during browser rendering. Usually you will want to ignore this header in favor of responseCookies, which provides the final cookies.

See an example.

Array
name
required
string non-empty

The name of the header

value
required
string

The value of the header

browserHtml
string

Browser HTML.

To get this response field, set the browserHtml request field to true.

Browser HTML does not include the contents of iframes or the shadow DOM.

See an example.

object (Session)

Parameters to create or reuse a client-managed session.

If id does not match one of your running sessions, a new session is created with that session ID. Otherwise, the matching running session is reused.

Client-managed sessions may expire due to any of the following:

  • 15 minutes (900 seconds) have passed since the session was created.

  • 2 minutes (120 seconds) have passed since the session was last used.

  • Requests using this session got banned 3 times in a row.

For 5-10 minutes after a session expires, Zyte API keeps track of the expired session and does not allow re-using it. After that time, attempts to reuse the session will instead create a new session.

See an example.

id
string

User-defined session ID.

It must be a version 4 UUID, i.e. a randomly-generated UUID.

screenshot
string <byte>

Base64-encoded page screenshot file data.

To get this response field, set the screenshot request field to true.

screenshotOptions.format determines the file format of the screenshot data.

See an example.

object

Article data.

To get this response field, set the article request field to true.

headline
string

Article headline or title.

articleBody
string

Clean text of the article, including sub-headings, with newline separators.

articleBodyHtml
string

Simplified and standardized HTML of the article body, including sub-headings, image captions and embedded content (videos, tweets, etc.).

description
string

A short summary of the article. It can be either human-provided (if available), or auto-generated.

datePublished
string

Publication date. ISO-formatted with 'T' separator, may contain a timezone. If the actual publication date is not found, the "dateModified" value is used.

datePublishedRaw
string

Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.

dateModified
string

The date when the article was most recently modified. ISO-formatted with 'T' separator, may contain a timezone.

dateModifiedRaw
string

Same date as "dateModified", but before parsing/normalization, i.e. as it appears on the website.

Array of objects (Author) [ items ]

Authors of the article.

Array
name
required
string

Full name of the author, e.g. "Alice".

nameRaw
string

Text from which this author name was extracted, e.g. "Alice and Bob".

inLanguage
string

Language of the article, as an ISO 639-1 language code. Example: "en". Sometimes article language is not the same as the web page overall language; to get the detected web page languages, see "webPageInfo".

Array of objects (Breadcrumb) [ items ]

A list of breadcrumbs (a specific navigation element) with optional name and url.

Array
name
string

Text of the breadcrumb, as it appears on the website.

url
string

Absolute URL of the breadcrumb.

object (Image)

Image.

url
required
string

URL of an image.

Array of objects (Image) [ items ]

All images of the item (may include the main image).

Array
url
required
string

URL of an image.

Array of objects[ items ]

A list of all videos inside the article body.

Array
url
required
string

Absolute URL of the video.

Array of objects[ items ]

A list of all audios inside the article body.

Array
url
required
string

Absolute URL of the audio.

url
required
string

URL of a page where this article was extracted.

canonicalUrl
string

Canonical URL of the article, if available.

required
object (schemas-Metadata)

Extracted item metadata for single-item data types.

probability
required
number [ 0 .. 1 ]

Probability that the extracted item is of the requested data type. It is closer to 0 when the page does not contain the requested data type. For example, when single product extraction is requested with "product: true" but the page does not contain a product, the probability is close to 0. If an item of the requested type can be extracted from the page, the probability is closer to 1. The recommended probability threshold is 0.5, but we return extracted data even if the probability is very low.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"
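Because low-probability items are still returned, callers typically apply the threshold themselves. The sketch below assumes the metadata object is exposed under a metadata key of the extracted item; the helper name is_reliable is hypothetical.

```python
def is_reliable(item: dict, threshold: float = 0.5) -> bool:
    """Return True if the item's probability reaches the threshold.

    Assumes the item carries its metadata under the "metadata" key.
    """
    return item["metadata"]["probability"] >= threshold
```

The default of 0.5 mirrors the recommended threshold; a stricter value trades recall for precision.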

object

Article list data.

To get this response field, set the articleList request field to true.

Array of objects[ items ]

List of articles available on this page.

Array
url
string

URL of a detailed article page. Pass this URL with "article: true" in the request to extract detailed information about the article.

headline
string

Article headline or title.

articleBody
string

Text of the article as it appears on the list page, including sub-headings, with newline separators.

datePublished
string

Publication date. ISO-formatted with 'T' separator, may contain a timezone.

datePublishedRaw
string

Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.

Array of objects (Author) [ items ]

Authors of the article.

Array
name
required
string

Full name of the author, e.g. "Alice".

nameRaw
string

Text from which this author name was extracted, e.g. "Alice and Bob".

inLanguage
string

Language of the article, as an ISO 639-1 language code. Example: "en". Sometimes article language is not the same as the web page overall language; to get the detected web page languages, see "webPageInfo".

object (Image)

Image.

url
required
string

URL of an image.

Array of objects (Image) [ items ]

All images of the item (may include the main image).

Array
url
required
string

URL of an image.

required
object (MetadataListItem)

Item-level metadata for list data types.

probability
required
number [ 0 .. 1 ]

Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated.

url
required
string

URL of a page where this article list was extracted.

required
object (MetadataList)

Top-level metadata for list data types.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"
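A dateDownloaded value in the documented format can be parsed with the standard library. This is a small sketch; the function name parse_date_downloaded is hypothetical.

```python
from datetime import datetime, timezone


def parse_date_downloaded(value: str) -> datetime:
    """Parse a "YYYY-MM-DDThh:mm:ssZ" timestamp into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing "Z" means UTC, so attach the UTC timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)
```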

object

Article navigation data.

To get this response field, set the articleNavigation request field to true.

object (PaginationNext)

A link to the next page in the list.

url
required
string

URL of the next page in the list.

name
string

Text of the link to the next page, if available.

pageNumber
integer (PageNumber)

Integer describing the current page number. Starts at 1.

Array of objects[ items ]

List of articles available on this page.

Array
url
required
string

URL of a detailed article page. Pass this URL with "article: true" in the request to extract detailed information about the article.

name
string

The name of the article or article link text.

datePublished
string

Publication date. ISO-formatted with 'T' separator, may contain a timezone.

datePublishedRaw
string

Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.

required
object (MetadataListItem)

Item-level metadata for list data types.

probability
required
number [ 0 .. 1 ]

Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated.

url
required
string

URL of a page containing the list of articles.

required
object (MetadataList)

Top-level metadata for list data types.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"

object

Forum thread data.

To get this response field, set the forumThread request field to true.

object

Topic that is discussed on the page.

name
required
string

Name of the topic.

Array of objects[ items ]

List of posts available on this page, including the first or top post.

Array
text
string

Text of the post.

datePublished
string

Publication date. ISO-formatted with 'T' separator, may contain a timezone.

datePublishedRaw
string

Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.

object

Details of reactions to this post.

likes
integer >= 0

Number of up-votes or likes/stars received by the post.

replies
integer >= 0

Number of replies received by the post.

required
object (MetadataListItem)

Item-level metadata for list data types.

probability
required
number [ 0 .. 1 ]

Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated.

url
required
string

URL of a page where this forum post list was extracted.

required
object (MetadataList)

Top-level metadata for list data types.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"

object

Job posting data.

To get this response field, set the jobPosting request field to true.

jobTitle
string

The title of the job.

datePublished
string

Publication date of the job posting. ISO-formatted with 'T' separator, may contain a timezone.

datePublishedRaw
string

Same date as 'datePublished', but before parsing/normalization, i.e. as it appears on the website.

validThrough
string

The date after which the job posting is not valid, e.g. the end of an offer. ISO-formatted with 'T' separator, may contain a timezone.

description
string

A description of the job posting including sub-headings, with newline separators.

descriptionHtml
string

Simplified HTML of the description, including sub-headings, image captions and embedded content.

employmentType
string

Type of employment (e.g. full-time, part-time, contract, temporary, seasonal, internship).

object

Information about the organization offering the job position.

name
required
string

Name of the organization.

object

The base salary of the job or of an employee in the proposed role.

raw
string

Salary amount as it appears on the website.

valueMax
string

The maximum value of the base salary as a number string. In case of only one value given for the salary instead of a range, valueMax is used to represent it.

currency
string

Currency associated with the salary amount. ISO 4217 standard.

currencyRaw
string

Currency associated with the salary amount, without normalization.

object

A (typically single) geographic location associated with the job position.

raw
required
string

Job location as it appears on the website.

url
required
string

URL of a page where this job posting was extracted.

required
object (schemas-Metadata)

Extracted item metadata for single-item data types.

probability
required
number [ 0 .. 1 ]

Probability that the extracted item is of the requested data type. It is closer to 0 when the page does not contain the requested data type. For example, when single product extraction is requested with "product: true" but the page does not contain a product, the probability is close to 0. If an item of the requested type can be extracted from the page, the probability is closer to 1. The recommended probability threshold is 0.5, but we return extracted data even if the probability is very low.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"

object

Job posting navigation data.

To get this response field, set the jobPostingNavigation request field to true.

object (PaginationNext)

A link to the next page in the list.

url
required
string

URL of the next page in the list.

name
string

Text of the link to the next page, if available.

pageNumber
integer (PageNumber)

Integer describing the current page number. Starts at 1.

Array of objects[ items ]

List of job postings available on this page.

Array
url
required
string

URL of a detailed job posting page. Pass this URL with "jobPosting: true" in the request to extract detailed information about the job posting.

name
string

The name of the job posting or job posting link text.

required
object (MetadataListItem)

Item-level metadata for list data types.

probability
required
number [ 0 .. 1 ]

Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated.

url
required
string

URL of a page.

required
object (MetadataList)

Top-level metadata for list data types.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"

object

Product data.

To get this response field, set the product request field to true.

name
string (Name)

The name of the product.

price
string (Price) ^[0-9]+(\.[0-9]+)?$

The price at which the product is being offered. If there is only one price associated with the offer, it is returned in this field.

currency
string (Currency) ^[A-Z]{3}$

The currency of the price, as an ISO 4217 code.

currencyRaw
string (CurrencyRaw)

The currency as given on the website, without extra normalization (for example, both "$" and "USD" are possible currencies).

regularPrice
string (RegularPrice) ^[0-9]+(\.[0-9]+)?$

The price before any discount or special offer.
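Since price and regularPrice are decimal strings matching ^[0-9]+(\.[0-9]+)?$, they can be compared exactly with Python's decimal module. The sketch below computes a discount percentage; the helper name discount_percent is hypothetical.

```python
from decimal import Decimal
from typing import Optional


def discount_percent(product: dict) -> Optional[Decimal]:
    """Return the discount as a percentage of regularPrice, or None if unknown."""
    price = product.get("price")
    regular = product.get("regularPrice")
    if price is None or regular is None:
        return None
    price_value, regular_value = Decimal(price), Decimal(regular)
    if regular_value == 0:
        return None  # Avoid dividing by zero on a zero regular price.
    return (regular_value - price_value) / regular_value * 100
```

Decimal is used instead of float so that prices such as "19.99" are not subject to binary rounding.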

availability
string (Availability)
Enum: "InStock" "OutOfStock"

Availability, as a string. Allowed values:

  • "InStock" - includes limited availability, presale, preorder, and in-store only.
  • "OutOfStock" - includes discontinued and sold out.
sku
string (Sku)

The Stock Keeping Unit (SKU), i.e. a merchant-specific identifier for the product, assigned by the seller.

mpn
string (Mpn)

The Manufacturer Part Number (MPN) of the product. It is issued by the manufacturer, and is the same across different e-commerce websites.

Array of objects (Gtin) [ items ]

Standardized GTIN product identifier which is unique for a product across different sellers.

Array
type
required
string
Enum: "gtin8" "gtin13" "gtin14" "isbn10" "isbn13" "ismn" "issn" "upc"

gtin14 corresponds to former names EAN/UCC-14, SCC-14, DUN-14, UPC Case Code, UPC Shipping Container Code.

gtin13 also includes the JAN (Japanese Article Number).

value
required
string

The GTIN value as a string.

object

Brand or manufacturer of the product.

name
required
string

Name of the brand.

Array of objects (Breadcrumb) [ items ]

A list of breadcrumbs (a specific navigation element) with optional name and url.

Array
name
string

Text of the breadcrumb, as it appears on the website.

url
string

Absolute URL of the breadcrumb.

object (Image)

Image.

url
required
string

URL of an image.

Array of objects (Image) [ items ]

All images of the item (may include the main image).

Array
url
required
string

URL of an image.

description
string

Description of the product.

descriptionHtml
string

Simplified HTML of the description, including sub-headings, image captions and embedded content.

object

The overall rating, based on a collection of reviews or ratings.

ratingValue
number

The average rating value.

bestRating
number

The highest value allowed in this rating system.

reviewCount
integer >= 0

The total number of reviews or ratings for the product.

color
string (Color)

Color of the product.

size
string (Size)

A standardized size of a product, specified through a simple textual string (for example "XL", "32Wx34L"). A single product dimension (height, width) is not considered as the size.

object (Weight)
value
number

A weight value expressed as a floating point number.

unit
string

A normalized unit of weight, like kilogram / ounce / pound and others.

rawUnit
string

The unit of weight as it was extracted from the page, without normalization. The normalized version of rawUnit is in the unit attribute.

material
string

The materials from which the product is made. Contains all product materials on the page.

style
string (Style)

Style of the product. It can also be referred to as pattern/finish on the product page. Example values: "Polka dots", "Striped", "Nickel finish with Translucent glass", etc.

Array of objects (AdditionalProperty) [ items ]

A list of properties or characteristics.

  • name field contains the property name,
  • value field contains the property value.

Array
name
required
string

Property name.

value
string

Property value.

features
Array of strings

A list of features of the Product.

The features of a product are generally found on the product page arranged in a list, which is usually bulleted.

url
required
string (Url)

URL of a page where this product was extracted.

canonicalUrl
string (CanonicalUrl)

Canonical URL of the product, if available.

required
object (schemas-Metadata)

Extracted item metadata for single-item data types.

probability
required
number [ 0 .. 1 ]

Probability that the extracted item is of the requested data type. It is closer to 0 when the page does not contain the requested data type. For example, when single product extraction is requested with "product: true" but the page does not contain a product, the probability is close to 0. If an item of the requested type can be extracted from the page, the probability is closer to 1. The recommended probability threshold is 0.5, but we return extracted data even if the probability is very low.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"

Array of objects[ items ]

Array of product variants, using the same Product schema. Represents extra information available about the variants of a product. All variants are included in this array, including the variant shown on the page. If a field of a variant is empty, it means either that the value is the same as in the top-level product, or that the extraction API did not manage to extract it.
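One way to consume variants is to fill missing variant fields from the top-level product. This is only a heuristic sketch: a missing field may also mean that extraction failed, in which case the inherited value can be wrong. The key names variants and metadata, and the helper name resolve_variants, are assumptions.

```python
def resolve_variants(product: dict) -> list:
    """Return variants with missing fields filled in from the top-level product."""
    # Fields that describe the product as a whole are not inherited.
    inherited = {
        key: value
        for key, value in product.items()
        if key not in ("variants", "metadata")
    }
    # Values set on a variant take precedence over inherited ones.
    return [{**inherited, **variant} for variant in product.get("variants", [])]
```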

Array
name
string (Name)

The name of the product.

price
string (Price) ^[0-9]+(\.[0-9]+)?$

The price at which the product is being offered. If there is only one price associated with the offer, it is returned in this field.

currency
string (Currency) ^[A-Z]{3}$

The currency of the price, as an ISO 4217 code.

currencyRaw
string (CurrencyRaw)

The currency as given on the website, without extra normalization (for example, both "$" and "USD" are possible currencies).

regularPrice
string (RegularPrice) ^[0-9]+(\.[0-9]+)?$

The price before any discount or special offer.

availability
string (Availability)
Enum: "InStock" "OutOfStock"

Availability, as a string. Allowed values:

  • "InStock" - includes limited availability, presale, preorder, and in-store only.
  • "OutOfStock" - includes discontinued and sold out.
sku
string (Sku)

The Stock Keeping Unit (SKU), i.e. a merchant-specific identifier for the product, assigned by the seller.

mpn
string (Mpn)

The Manufacturer Part Number (MPN) of the product. It is issued by the manufacturer, and is the same across different e-commerce websites.

Array of objects (Gtin) [ items ]

Standardized GTIN product identifier which is unique for a product across different sellers.

Array
type
required
string
Enum: "gtin8" "gtin13" "gtin14" "isbn10" "isbn13" "ismn" "issn" "upc"

gtin14 corresponds to former names EAN/UCC-14, SCC-14, DUN-14, UPC Case Code, UPC Shipping Container Code.

gtin13 also includes the JAN (Japanese Article Number).

value
required
string

The GTIN value as a string.

object (Image)

Image.

url
required
string

URL of an image.

Array of objects (Image) [ items ]

All images of the item (may include the main image).

Array
url
required
string

URL of an image.

color
string (Color)

Color of the product.

size
string (Size)

A standardized size of a product, specified through a simple textual string (for example "XL", "32Wx34L"). A single product dimension (height, width) is not considered as the size.

style
string (Style)

Style of the product. It can also be referred to as pattern/finish on the product page. Example values: "Polka dots", "Striped", "Nickel finish with Translucent glass", etc.

Array of objects (AdditionalProperty) [ items ]

A list of properties or characteristics.

  • name field contains the property name,
  • value field contains the property value.

Array
name
required
string

Property name.

value
string

Property value.

url
string (Url)

URL of a page where this product was extracted.

canonicalUrl
string (CanonicalUrl)

Canonical URL of the product, if available.

object

Product list data.

To get this response field, set the productList request field to true.

Array of objects (Breadcrumb) [ items ]

A list of breadcrumbs (a specific navigation element) with optional name and url.

Array
name
string

Text of the breadcrumb, as it appears on the website.

url
string

Absolute URL of the breadcrumb.

Array of objects[ items ]

List of products available on this page.

Array
url
string

URL of a detailed product page. Pass this URL with "product: true" in the request to extract detailed information about the product.

name
string

The name of the product.

price
string (Price) ^[0-9]+(\.[0-9]+)?$

The price at which the product is being offered. If there is only one price associated with the offer, it is returned in this field.

currencyRaw
string (CurrencyRaw)

The currency as given on the website, without extra normalization (for example, both "$" and "USD" are possible currencies).

currency
string (Currency) ^[A-Z]{3}$

The currency of the price, as an ISO 4217 code.

regularPrice
string (RegularPrice) ^[0-9]+(\.[0-9]+)?$

The price before any discount or special offer.

object (Image)

Image.

url
required
string

URL of an image.

required
object (MetadataListItem)

Item-level metadata for list data types.

probability
required
number [ 0 .. 1 ]

Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated.

url
required
string

URL of a page where this product list was extracted.

required
object (MetadataList)

Top-level metadata for list data types.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"

categoryName
string

Name of the category in which the listed products are.

object

Product navigation data.

To get this response field, set the productNavigation request field to true.

categoryName
string

Name of the category in which the listed products are found.

object (PaginationNext)

A link to the next page in the list.

url
required
string

URL of the next page in the list.

name
string

Text of the link to the next page, if available.

pageNumber
integer (PageNumber)

Integer describing the current page number. Starts at 1.

Array of objects[ items ]

List of products available on this page.

Array
url
required
string

URL of a detailed product page. Pass this URL with "product: true" in the request to extract detailed information about the product.

name
string

The name of the product or product link text.

required
object (MetadataListItem)

Item-level metadata for list data types.

probability
required
number [ 0 .. 1 ]

Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated.

Array of objects[ items ]

List of subcategory links found on this page.

Array
url
required
string

URL of the subcategory.

name
string

The name of the subcategory or subcategory link text.

required
object (MetadataListItem)

Item-level metadata for list data types.

probability
required
number [ 0 .. 1 ]

Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated.

url
required
string

URL of a page.

required
object (MetadataList)

Top-level metadata for list data types.

dateDownloaded
required
string (DateDownloaded)

The timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"

object
object

Values of extracted custom attributes, extracted according to the requested customAttributes schema.

property name*
additional property
any
object
inputTokens
integer

Total number of used input tokens, excluding our internal fixed prompt with the LLM instruction, when using the "generate" method.

outputTokens
integer

Total number of used output tokens, when using the "generate" method.

textInputTokens
integer

Total number of input tokens used for the text of the web page, excluding the schema and our internal fixed prompt with the LLM instruction, when using the "generate" method. Already included in the customAttributes.metadata.inputTokens field.

textInputTokensBeforeTruncation
integer

textInputTokens before the text was truncated to fit into the input limits, either set via customAttributesOptions.maxInputTokens or due to the model limitation returned in customAttributes.metadata.maxInputTokens, when using the "generate" method.

maxInputTokens
integer

Maximum number of allowed input tokens for the model, when using the "generate" method.

excludedPIIAttributes
Array of strings

A list of all attributes dropped from the output due to a risk of PII (Personally Identifiable Information) extraction.

error
string
  • The extraction/unparsable-response error is given when the LLM response could not be parsed or recovered. If this error happens, we suggest simplifying the task or reducing the number of attributes.
  • The extraction/schema-size-exceeded error is given when the schema did not fit into the input limits, leaving no space for the input text, and therefore the LLM could not be used. If this error happens, we suggest either making the schema smaller (fewer attributes and/or shorter descriptions), or increasing customAttributesOptions.maxInputTokens.
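The token counters above make it possible to detect when the page text was truncated before being sent to the LLM. A minimal sketch, taking the customAttributes.metadata object as a dict; the helper name was_truncated is hypothetical.

```python
def was_truncated(metadata: dict) -> bool:
    """Return True if the page text was cut to fit the input token limit."""
    used = metadata.get("textInputTokens")
    before = metadata.get("textInputTokensBeforeTruncation")
    if used is None or before is None:
        return False  # Counters are absent, so truncation cannot be determined.
    return used < before
```

When truncation is reported, raising customAttributesOptions.maxInputTokens (up to the model limit given in maxInputTokens) or shrinking the schema may let more of the page text through.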
echoData
object

Arbitrary data set on the echoData request field.

See an example.

jobId
string <= 100 characters

Scrapy Cloud job ID set on the jobId request field.

See an example.

Array of objects (ActionResult) [ items ]

Debug information about the execution of the action sequence set in the actions request field.

Action order in the response always matches that of the request.

Array
action
required
string

The type of action submitted

elapsedTime
required
number

Elapsed time in seconds

status
required
string
Enum: "success" "continued" "returned" "notExecuted"

Status of execution of a particular action

  • success - When the action finishes execution successfully without any errors
  • continued - When the action fails, but the execution of the action sequence is continued
  • returned - When the action fails and stops execution
  • notExecuted - When a prior action has failed, thereby not executing the current action
error
string

Detailed information about the underlying error.
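A short sketch of inspecting these results: the hypothetical helper failed_actions returns every entry whose status is not success. The sample list, including its error message, is invented for illustration.

```python
def failed_actions(action_results: list) -> list:
    """Return the results of actions that did not finish successfully."""
    return [result for result in action_results if result["status"] != "success"]


# Hypothetical action results; the error text is made up.
sample_actions = [
    {"action": "click", "elapsedTime": 0.5, "status": "success"},
    {
        "action": "waitForSelector",
        "elapsedTime": 5.0,
        "status": "returned",
        "error": "hypothetical error message",
    },
    {"action": "scrollBottom", "elapsedTime": 0.0, "status": "notExecuted"},
]
problems = failed_actions(sample_actions)
```

Because action order in the response matches the request, the first entry in problems points at the action where the sequence started to go wrong.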

Array of objects (InteractionLogEntry) [ items ]

Messages logged with console.log() from browser scripts.

Array
time
string

The time of the log entry, in ISO 8601 format

level
string
Enum: "debug" "info" "warning" "error" "warn"

The log level

message
string

The log message

Array of objects (Cookie) [ items ]

List of cookies set during the request.

To get this response field, set the responseCookies request field to true.

See an example. See also: requestCookies.

Array
name
required
string <= 4085 characters

Cookie name

value
required
string <= 4085 characters

Cookie value

domain
required
string <= 253 characters

Domain the cookie belongs to

path
string

Path the cookie belongs to

expires
integer <int64>

Expiration time of the cookie, as Unix time in seconds.

httpOnly
boolean
secure
boolean
sameSite
string
Enum: "Strict" "Lax" "Extended" "None"
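
The following sketch shows one way the cookie objects described above could be filtered by their expires value. It assumes that a cookie without an expires field is a session cookie and should be kept; that interpretation, the helper name unexpired_cookies and the sample data are assumptions, not part of this reference.

```python
import time
from typing import Optional


def unexpired_cookies(cookies: list[dict], now: Optional[float] = None) -> list[dict]:
    """Keep cookies that have no expires value or that expire after `now`.

    `now` is a Unix time in seconds; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    return [c for c in cookies if "expires" not in c or c["expires"] > now]


# Hypothetical sample shaped like the cookie objects described above.
sample_cookies = [
    {"name": "session", "value": "abc", "domain": "example.com"},
    {"name": "old", "value": "1", "domain": "example.com", "expires": 1000},
    {"name": "pref", "value": "dark", "domain": "example.com", "expires": 2000},
]

fresh = unexpired_cookies(sample_cookies, now=1500)
print([c["name"] for c in fresh])
```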
networkCapture
Array of objects (CapturedResponse) [ items ]

Responses captured by filters specified in the networkCapture request field.

Array
interceptionStatus
object

Exit status of the network capture.

If interceptionStatus.status is error, httpResponseBody is not delivered.

Possible causes of error include the matching responses exceeding, in total, the maximum body size of 5 MiB.

status
string
Enum: "success" "error"
error
string

Error message.

This field is only present if interceptionStatus.status is error.

statusCode
integer

HTTP status code of the captured response.

httpResponseBody
string <byte>

Base64-encoded body of the captured response.

To get this response field, set the networkCapture[].httpResponseBody request field to true.

url
string <uri>

Captured response URL.

headers
object

Captured response headers.

object (NetworkCaptureFilter)

Filter defined in the networkCapture request field that matched the captured response.

filterType
required
string
httpResponseBody
boolean
Default: false

Set to true to get the body of the captured response in the networkCapture[].httpResponseBody response field.

value
required
string [ 3 .. 8192 ] characters

A string to compare with the URL of network responses according to matchType.

matchType
required
string (PatternMatchingOptions)
Default: "contains"
Enum: "startsWith" "endsWith" "contains" "exact"

How to compare a user-defined string with a target string:

  • contains matches if the user-defined string is a substring of the target string.

  • exact matches if the user-defined string is an exact match of the target string.

  • startsWith matches if the target string starts with the user-defined string.

  • endsWith matches if the target string ends with the user-defined string.

Comparisons are case-sensitive. Regular expressions or wildcard characters are not supported.
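
The four matchType values can be illustrated with a small Python function. This is only a client-side sketch of the comparison rules listed above, not the service's implementation; the function name url_matches is hypothetical.

```python
def url_matches(value: str, match_type: str, url: str) -> bool:
    """Compare a filter value with a response URL according to match_type.

    All comparisons are case-sensitive plain string operations; no regular
    expressions or wildcards are involved.
    """
    if match_type == "contains":
        return value in url
    if match_type == "exact":
        return value == url
    if match_type == "startsWith":
        return url.startswith(value)
    if match_type == "endsWith":
        return url.endswith(value)
    raise ValueError(f"unknown matchType: {match_type!r}")


url = "https://example.com/api/items"
print(url_matches("/api/", "contains", url))
print(url_matches("/API/", "contains", url))  # differs only in case
```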

object

Request that resulted in the captured response.

url
string

URL of the captured request.

headers
object

Headers of the captured request.

method
string

HTTP method of the captured request.

body
string

Body of the captured request, if any.
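
As a rough sketch of how captured responses might be consumed, the following Python function decodes the Base64-encoded httpResponseBody of each captured response and skips entries whose interceptionStatus.status is error or that carry no body. The helper name captured_bodies and the sample data are hypothetical.

```python
import base64


def captured_bodies(network_capture: list[dict]) -> dict[str, bytes]:
    """Map captured response URLs to their decoded bodies."""
    bodies: dict[str, bytes] = {}
    for capture in network_capture:
        status = capture.get("interceptionStatus", {})
        if status.get("status") == "error":
            continue  # httpResponseBody is not delivered on error
        encoded = capture.get("httpResponseBody")
        if encoded is None:
            continue  # body was not requested for this filter
        bodies[capture["url"]] = base64.b64decode(encoded)
    return bodies


# Hypothetical sample shaped like the captured responses described above.
sample_capture = [
    {
        "url": "https://example.com/api/items",
        "statusCode": 200,
        "interceptionStatus": {"status": "success"},
        "httpResponseBody": base64.b64encode(b'{"ok": true}').decode("ascii"),
    },
    {
        "url": "https://example.com/api/large",
        "statusCode": 200,
        "interceptionStatus": {"status": "error", "error": "example error"},
    },
]

print(captured_bodies(sample_capture))
```

If several captured responses share the same URL, this sketch keeps only the last one; a list of pairs would preserve all of them.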

serp
object (SearchResultsPage)

Search engine results page data.

To get this response field, set the serp request field to true.

Array of objects (OrganicResult) [ items ]

List of search results excluding paid results.

Array
description
string

Result excerpt.

name
string

Result title.

url
string (OrganicResultURL) ^https?://[\S]+$

Result URL.

rank
integer

Result position among organic results in the search page.

The first result of a search page is always 1, regardless of the value of serp.pageNumber.

url
string (SearchURL) ^https?://[\S]+$

Search URL.

Should match url.

pageNumber
integer >= 1

Page number.

object (Metadata)

Metadata.

displayedQuery
string

Search query as seen in the webpage.

searchedQuery
string

Search query as specified in the input URL.

totalOrganicResults
integer <int64> >= 0

Total number of organic results reported by the search engine.

dateDownloaded
string

Timestamp at which the data was downloaded, in UTC and in ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ"
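
A dateDownloaded value in the format above can be turned into a timezone-aware datetime with the Python standard library. This is a minimal sketch; the function name parse_date_downloaded and the sample timestamp are hypothetical.

```python
from datetime import datetime, timezone


def parse_date_downloaded(value: str) -> datetime:
    """Parse a "YYYY-MM-DDThh:mm:ssZ" timestamp into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # strptime returns a naive datetime; the trailing Z means UTC.
    return parsed.replace(tzinfo=timezone.utc)


print(parse_date_downloaded("2024-01-02T03:04:05Z").isoformat())
```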

Request samples

Content type
application/json
Example
{}
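
As an illustration of the authentication scheme described at the top of this reference, the following Python sketch builds the Authorization header from an API key and serializes a request body. The API key foo is the example key from this documentation; the target URL and the choice of httpResponseBody are illustrative assumptions. Sending the request is left out because the endpoint URL is not stated in this section.

```python
import base64
import json


def basic_auth_header(api_key: str) -> str:
    """Build a Basic auth header value with the API key as username and no password."""
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return f"Basic {token}"


headers = {
    "Authorization": basic_auth_header("foo"),
    "Content-Type": "application/json",
}

# Illustrative request body: one URL and one output field set to true.
body = json.dumps({"url": "https://example.com", "httpResponseBody": True})

print(headers["Authorization"])
print(body)
```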

Response samples

Content type
application/json
{
  "statusCode": 200,
  "httpResponseBody": "string",
  "httpResponseHeaders": [ ],
  "browserHtml": "<html>Downloaded data.</html>",
  "session": { },
  "screenshot": "string",
  "article": {},
  "articleList": {},
  "articleNavigation": {},
  "forumThread": { },
  "jobPosting": { },
  "jobPostingNavigation": {},
  "product": { },
  "productList": {},
  "productNavigation": {},
  "customAttributes": { },
  "echoData": { },
  "jobId": "example-job-1",
  "actions": [ ],
  "responseCookies": [ ],
  "networkCapture": [ ],
  "serp": { }
}