Warning

Zyte Automatic Extraction will be discontinued starting April 30th, 2024. It is replaced by Zyte API. See Migrating from Automatic Extraction to Zyte API.

Automatic Extraction API#

How the API works#

Currently, the API has a single endpoint: https://autoextract.scrapinghub.com/v1/extract

A request is composed of one or more queries. Each query contains a URL to extract from, and a page type that indicates what the extraction result should be (article, job posting, product, etc.).

Requests and responses are transmitted in JSON format over HTTPS. Authentication is performed using HTTP Basic Authentication, where your API key is the username and the password is empty.
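For example, here is a minimal Python sketch of a single-query request using the requests library (YOUR_API_KEY is a placeholder for your actual key):

import requests

# Sketch: send one article query. The API key is the Basic Auth
# username and the password is empty.
response = requests.post(
    "https://autoextract.scrapinghub.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json=[
        {
            "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
            "pageType": "article",
        }
    ],
)
response.raise_for_status()
print(response.json())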

API data formats#

Requests#

A request consists of a JSON array of queries. Each query is a map with the following fields:

| Name | Required | Type | Description |
| --- | --- | --- | --- |
| url | Yes | String | URL of the web page to extract from. Must be a valid http:// or https:// URL. |
| pageType | Yes | String | Type of extraction to perform. Must be article, articleList, comments, forumPosts, jobPosting, product, productList, realEstate, reviews or vehicle. |
| meta | No | String | User-supplied UTF-8 string, passed through the extraction pipeline and returned in the query result. Maximum size: 4 KB. |
| articleBodyRaw | No | Boolean | Whether to include raw article HTML in article extractions. True by default. Setting this to false can reduce response size significantly if the HTML is not required. |
| fullHtml | No | Boolean | Include the full, raw HTML of the target web page in the query result. This is a premium feature that is disabled by default; please open a support ticket if you wish to have it enabled for your account. |
| customHtml | No | String | HTML source to extract from. Extraction is performed on the provided HTML, with additional resources (images, CSS, etc.) downloaded from the provided url. JavaScript processing is disabled. The string must be UTF-8 encoded. Maximum length: 2,000,000 characters; longer requests are rejected. |
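For example, a request body for a single article query that omits the raw article HTML and attaches a user-supplied meta value might look like this (the meta string below is an arbitrary illustration):

[
    {
        "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
        "pageType": "article",
        "articleBodyRaw": false,
        "meta": "my-tracking-id"
    }
]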

Responses#

API responses are wrapped in a JSON array to facilitate query batching. A query response for a single article extraction looks like this (some large fields are truncated):

[
  {
    "query": {
      "id": "1564747029122-9e02a1868d70b7a1",
      "domain": "scrapinghub.com",
      "userQuery": {
        "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
        "pageType": "article"
      }
    },
    "article": {
      "articleBody": "Unbeknownst to many..",
      "articleBodyHtml": "<article>Unbeknownst to many..",
      "articleBodyRaw": "<span id=...",
      "headline": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
      "inLanguage": "en",
      "datePublished": "2018-06-19T00:00:00",
      "datePublishedRaw": "June 19, 2018",
      "author": "Ian Kerins",
      "authorsList": [
        "Ian Kerins"
      ],
      "mainImage": "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg#keepProtocol",
      "images": [
        "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg"
      ],
      "description": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
      "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
      "probability": 0.7369686365127563
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"}
      ]
    },
    "algorithmVersion": "20.8.1"
  }
]

Output fields#

Query#

All API responses include the original query along with some additional information such as the query ID:

# Enriched query
print(response.json()[0]['query'])

Result fields#

Which result fields are available depends on the requested page type. Some result fields are always present if extraction succeeds.

Full HTML#

If you have upgraded your account to support the fullHtml query parameter, then queries with this parameter set to true will return the raw HTML of the entire target web page in the html field of the query result. This differs from the articleBodyRaw field in article extractions in two respects:

  1. It includes the HTML of the entire page, not just the section of the page containing article text

  2. It can be used with all extraction types

Note that this is a premium feature that is disabled by default. Please open a support ticket if you wish to enable it.
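Assuming the feature has been enabled for your account, a query requesting the full page HTML might look like this:

[
    {
        "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
        "pageType": "article",
        "fullHtml": true
    }
]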

Errors#

Errors fall into two broad categories: request-level and query-level. Request-level errors occur when the HTTP API server can’t process the input that it receives. Query-level errors occur when a specific query cannot be processed. You can detect these by checking the error field in query results.

Some errors can be detected immediately when a request is received. Users are not charged for these requests. However, users are charged for requests that result in errors further along the extraction process.

Request-level#

Examples include:

  • Authentication failure

  • Malformed request JSON

  • Too many queries in request

  • Request payload size too large

If a request-level error occurs, the API server returns a 4xx or 5xx response code. When possible, a JSON response body with content type application/problem+json is returned, describing the error in accordance with RFC 7807 (Problem Details for HTTP APIs).

For example, if you exceed the query limit in a batched request, you get an error response with 413 as status code, the Content-Type header set to application/problem+json, and a body similar to this:

{
    "title": "Limit of 100 queries per request exceeded",
    "type": "http://errors.xod.scrapinghub.com/queries-limit-reached"
}

Use the type field to check the error type, as it will not change in subsequent versions. Depending on the error, additional fields may provide further details, e.g. the delay before retrying. Such responses can easily be parsed and used for programmatic error handling.

If it is not possible to return a JSON description of the error, then no content type header will be set for the response and the response body will be empty.
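As a sketch, request-level errors might be handled along these lines (assuming response comes from a requests.post() call like the one shown earlier):

# Sketch of request-level error handling.
if response.status_code >= 400:
    content_type = response.headers.get("Content-Type", "")
    if content_type.startswith("application/problem+json"):
        problem = response.json()
        # Match on the stable `type` field, not the human-readable title.
        if problem["type"] == "http://errors.xod.scrapinghub.com/queries-limit-reached":
            print("Too many queries in one request:", problem["title"])
        else:
            print("Request failed:", problem.get("title", "unknown error"))
    else:
        # No JSON description was available; the body is empty.
        print("Request failed with HTTP status", response.status_code)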

Query-level#

If the error field is present in an extraction result, then an error has occurred and the extraction result will not be available.

[
    {
        "query": {
            "id": "1587642195276-9386233af6ce1b9f",
            "domain": "example.com",
            "userQuery": {
                "url": "http://www.example.com/this-page-does-not-exist",
                "pageType": "article"
            }
        },
        "error": "Downloader error: http404",
        "algorithmVersion": "20.8.1"
    }
]

The algorithmVersion field is optional and is only included when it can be reported.
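A minimal sketch of separating successful extractions from query-level failures (again assuming response comes from a requests.post() call as shown earlier):

# Sketch: check the `error` field in each query result.
for result in response.json():
    url = result["query"]["userQuery"]["url"]
    if "error" in result:
        print("Failed:", url, "-", result["error"])
    else:
        # Successful results are keyed by page type, e.g. "article".
        print("Succeeded:", url)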

Reference#

Request-level#

| Type | Description | Billed |
| --- | --- | --- |
| http://errors.xod.scrapinghub.com/queries-limit-reached.html | Limit of 100 queries per request exceeded | No |
| http://errors.xod.scrapinghub.com/malformed-json.html | Could not parse request JSON | No |
| http://errors.xod.scrapinghub.com/rate-limit-exceeded.html | System-wide rate limit exceeded | No |
| http://errors.xod.scrapinghub.com/user-rate-limit-exceeded.html | User rate limit exceeded | No |
| http://errors.xod.scrapinghub.com/account-disabled.html | Account has been disabled - contact support | No |
| http://errors.xod.scrapinghub.com/unrecognized-content-type.html | Unsupported request content type: should be application/json | No |
| http://errors.xod.scrapinghub.com/empty-request.html | Empty request body - should be a JSON document | No |
| http://errors.xod.scrapinghub.com/malformed-request.html | Unparseable request | No |
| http://errors.xod.scrapinghub.com/http-pipelining-not-supported.html | Attempt to send a second HTTP request over the same TCP connection | No |
| http://errors.xod.scrapinghub.com/unknown-uri.html | Invalid API endpoint | No |
| http://errors.xod.scrapinghub.com/method-not-allowed.html | Invalid HTTP method (only POST is supported) | No |

Query-level#

| error contains | Description | Billed |
| --- | --- | --- |
| query timed out | 10 minute timeout for the query was reached | No |
| malformed URL | Requested URL cannot be parsed | No |
| URL cannot be longer than 4096 UTF-16 characters | URL is too long | No |
| non-HTTP schemes are not allowed | Only http and https schemes are allowed | No |
| Domain … is occupied, please retry in … seconds | Per-domain rate limiting was applied; it is recommended to retry after the specified interval | No |
| Extraction not permitted for this URL | This domain or URL has been blacklisted | No |
| InternalError | Internal extraction pipeline error | No |
| Downloader error: httpXXX | Remote server returned non-success HTTP status code XXX | Yes |
| Downloader error: No visible elements | There are no visible elements in the downloaded content | Yes |
| Downloader error: internal | Internal downloader error | Yes |
| Proxy error: banned | Anti-ban measures in action; content could not be fetched after several attempts | Yes |
| Proxy error: internal_error | Internal proxy error | Yes |

Other, rarer errors are also possible. In general, errors starting with “Downloader error…” or “Proxy error…” are billed, while other types of error are not.
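For instance, a client might resubmit queries that hit per-domain rate limiting after the suggested delay. The sketch below assumes the delay can be extracted from the error string with a simple regular expression; the exact numeric format is an assumption:

import re

def retry_delay(error):
    # Hypothetical helper: extract the suggested delay (in seconds) from a
    # "Domain ... is occupied, please retry in ... seconds" error.
    match = re.search(r"please retry in ([\d.]+) seconds", error)
    return float(match.group(1)) if match else None

# A caller could sleep for retry_delay(result["error"]) seconds and then
# resubmit the failed query.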

Restrictions and Failure Modes#

  • Users with a standard $60 per month subscription receive a quota of 100,000 queries per monthly billing cycle. If the quota is exceeded, additional queries will be billed at the end of the billing cycle on a pro rata basis. There is a limit of 500,000 queries per monthly billing cycle.

    The monthly limit can be increased if requested; please open a support ticket if you anticipate that your usage will exceed 500,000 queries in a billing cycle.

  • A rate limit of 5 queries per second is enforced. Please open a support ticket if you require a rate limit increase.

    Sequential single queries will result in low throughput. To achieve higher request rates you will need to issue a number of concurrent API requests (see the sketch after this list). You can find more information about this in the documentation for the AutoExtract Python client.

  • There is a global timeout of 10 minutes for queries.

    Queries can time out for a number of reasons, such as difficulties during content download. If a query in a batched request times out, the API will return the results of the extractions that did succeed along with errors for those that timed out.

    We therefore recommend that you set the HTTP timeout for API requests to over 10 minutes.

  • A maximum of 100 queries may be submitted in a single batched request.

  • In general, it is not possible to access the Automatic Extraction API from browser environments due to browser CORS restrictions. This restriction cannot be circumvented using the no-cors mode of the browser fetch API, as the correct Authorization and Content-Type headers will not be included in the HTTP requests to the API. In any case, we strongly recommend performing Automatic Extraction API requests on the backend in order to preserve the secrecy of your API key.
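As referenced above, here is a minimal concurrency sketch using a thread pool, with one query per request and an HTTP timeout above the 10 minute query timeout (the helper name and worker count are illustrative):

import requests
from concurrent.futures import ThreadPoolExecutor

API_URL = "https://autoextract.scrapinghub.com/v1/extract"

def extract(url, page_type="product"):
    # One query per request, as recommended for high throughput.
    response = requests.post(
        API_URL,
        auth=("YOUR_API_KEY", ""),
        json=[{"url": url, "pageType": page_type}],
        timeout=660,  # above the 10 minute query timeout
    )
    response.raise_for_status()
    return response.json()[0]

urls = [
    "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
]

# Keep concurrency within your rate limit (5 queries per second by default).
with ThreadPoolExecutor(max_workers=5) as pool:
    for result in pool.map(extract, urls):
        print(result.get("product") or result.get("error"))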

Batching Queries#

Multiple queries can be submitted in a single API request, resulting in an equivalent number of query results.

Warning

We do not recommend batching queries in most cases. If you wish to achieve the highest throughput and avoid the limit of 100 URLs per batch, we recommend sending multiple concurrent requests, each with a single URL, and handling responses as they become available; zyte-autoextract already handles this out of the box. If you batch a large number of URLs instead, full results only become available once all of them have been processed, and overall throughput is lower.

Note

When using batch requests, each query is accounted towards usage limits separately. For example, sending a batch request with 10 queries will incur the same cost as sending 10 requests with 1 query each.

For example, a batched request body with three product queries:

[
    {
        "url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "pageType": "product"
    },
    {
        "url": "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        "pageType": "product"
    },
    {
        "url": "http://books.toscrape.com/catalogue/soumission_998/index.html",
        "pageType": "product"
    }
]

Note that query results are not necessarily returned in the same order as the original queries. If you need an easy way to associate the results with the queries that generated them, you can pass an additional meta field in the query. The value that you pass will appear as the query/userQuery/meta field in the corresponding query result.

For example, if your request body is:

[
    {
        "url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "pageType": "product",
        "meta": "query1"
    },
    {
        "url": "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        "pageType": "product",
        "meta": "query2"
    },
    {
        "url": "http://books.toscrape.com/catalogue/soumission_998/index.html",
        "pageType": "product",
        "meta": "query3"
    }
]

The response may be (irrelevant content omitted):

[
    {
        "query": {
            "userQuery": {
                "meta": "query2"
            }
        },
        "product": {}
    },
    {
        "query": {
            "userQuery": {
                "meta": "query1"
            }
        },
        "product": {}
    },
    {
        "query": {
            "userQuery": {
                "meta": "query3"
            }
        },
        "product": {}
    }
]
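A sketch of re-associating batched results with their originating queries via meta (assuming response holds the reply to the request above):

# Sketch: index results by the user-supplied meta value, since result
# order is not guaranteed to match query order.
by_meta = {
    result["query"]["userQuery"]["meta"]: result
    for result in response.json()
}
print(by_meta["query1"].get("product"))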