Automatic Extraction API

How the API works

Currently, the API has a single endpoint: https://autoextract.scrapinghub.com/v1/extract

A request is composed of one or more queries. Each query contains a URL to extract from, and a page type that indicates what the extraction result should be (article, job posting, product, etc.).

Requests and responses are transmitted in JSON format over HTTPS. Authentication is performed using HTTP Basic Authentication, where your API key is the username and the password is empty.
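For illustration, a minimal authenticated request can be built with the Python standard library as follows. The API key and URL below are placeholders; the endpoint and authentication scheme are as described above.

```python
import base64
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder; substitute your own key

queries = [{"url": "https://example.com/some-article", "pageType": "article"}]

# HTTP Basic Authentication: the API key is the username, the password is empty.
token = base64.b64encode(f"{API_KEY}:".encode("utf-8")).decode("ascii")

request = urllib.request.Request(
    "https://autoextract.scrapinghub.com/v1/extract",
    data=json.dumps(queries).encode("utf-8"),
    headers={
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# To actually send it (network call, so commented out here):
# results = json.load(urllib.request.urlopen(request, timeout=630))
```

The same request can of course be made with any HTTP client (e.g. the requests library) as long as it sends Basic Auth credentials and a JSON body.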

API data formats

Requests

Requests consist of a JSON array of queries. Each query is a map containing the following fields:

url (String, required)
    URL of the web page to extract from. Must be a valid http:// or https:// URL.

pageType (String, required)
    Type of extraction to perform. Must be article, articleList, comments, forumPosts, jobPosting, product, productList, productReviews, realEstate or vehicle.

meta (String, optional)
    User-supplied UTF-8 string, which is passed through the extraction pipeline and returned in the query result. Maximum size: 4 KB.

articleBodyRaw (boolean, optional)
    Whether or not to include article HTML in article extractions. True by default. Setting this to false can reduce response size significantly if the HTML is not required.

fullHtml (boolean, optional)
    Include the full, raw HTML of the target web page in the query result. This is a premium feature that is disabled by default. Please contact autoextractsales@zyte.com if you wish to have it enabled for your account.

customHtml (String, optional)
    HTML source to be scraped. Extraction is done from the provided HTML, with additional resources (images, CSS, etc.) downloaded from the provided url. JavaScript processing is disabled. The string must be UTF-8 encoded. The maximum length is 2,000,000 characters; longer requests will be rejected.
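For illustration, a query exercising the optional fields above might look like this (the URL and meta value are placeholders):

```python
query = {
    "url": "https://example.com/some-article",  # placeholder URL
    "pageType": "article",
    "meta": "my-correlation-id",   # echoed back in the query result
    "articleBodyRaw": False,       # drop raw article HTML to shrink the response
}

# The meta field is limited to 4 KB of UTF-8 text.
assert len(query["meta"].encode("utf-8")) <= 4096
```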

Responses

API responses are wrapped in a JSON array (this is to facilitate query batching). A query response for a single article extraction looks like this (some large fields are truncated):

[
  {
    "query": {
      "id": "1564747029122-9e02a1868d70b7a1",
      "domain": "scrapinghub.com",
      "userQuery": {
        "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
        "pageType": "article"
      }
    },
    "article": {
      "articleBody": "Unbeknownst to many..",
      "articleBodyHtml": "<article>Unbeknownst to many..",
      "articleBodyRaw": "<span id=...",
      "headline": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
      "inLanguage": "en",
      "datePublished": "2018-06-19T00:00:00",
      "datePublishedRaw": "June 19, 2018",
      "author": "Ian Kerins",
      "authorsList": [
        "Ian Kerins"
      ],
      "mainImage": "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg#keepProtocol",
      "images": [
        "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg"
      ],
      "description": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
      "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
      "probability": 0.7369686365127563
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"}
      ]
    },
    "algorithmVersion": "20.8.1"
  }
]

Output fields

Query

All API responses include the original query along with some additional information such as the query ID:

# Enriched query
print(response.json()[0]['query'])

Result fields

The following result fields are available when requested with the corresponding page type:

The following result fields are always present if extraction succeeds:

Full HTML

If you have upgraded your account to support the fullHtml query parameter, then queries with this parameter set to true will return the raw HTML of the entire target web page in the html field of the query result. This differs from the articleBodyRaw field in article extractions in two respects:

  1. It includes the HTML of the entire page, not just the section of the page containing article text

  2. It can be used with all extraction types

Note that this is a premium feature that is disabled by default. Please contact autoextractsales@zyte.com if you wish to enable it.

Errors

Errors fall into two broad categories: request-level and query-level. Request-level errors occur when the HTTP API server can’t process the input that it receives. Query-level errors occur when a specific query cannot be processed. You can detect these by checking the error field in query results.

Some errors can be detected immediately when a request is received. Users are not charged for these requests. However, users are charged for requests that result in errors further along the extraction process.

Request-level

Examples include:

  • Authentication failure

  • Malformed request JSON

  • Too many queries in request

  • Request payload size too large

If a request-level error occurs, the API server will return a 4xx or 5xx response code. If possible, a JSON response body with content type application/problem+json will be returned that describes the error in accordance with RFC 7807 (Problem Details for HTTP APIs).

For example, if you exceed the query limit in a batched request, you get an error response with status code 413, the Content-Type header set to application/problem+json, and a body similar to this:

{
    "title": "Limit of 100 queries per request exceeded",
    "type": "http://errors.xod.scrapinghub.com/queries-limit-reached"
}

The type field should be used to check the error type, as it will not change in subsequent versions. Depending on the error, additional, more specific fields may be present that provide further details, e.g. a delay to wait before retrying. Such responses can be easily parsed and used for programmatic error handling.
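A problem+json body like the one above can be handled by dispatching on the stable type field; a minimal sketch, using the example payload from this section:

```python
import json

# Example application/problem+json body, as shown above.
body = """
{
    "title": "Limit of 100 queries per request exceeded",
    "type": "http://errors.xod.scrapinghub.com/queries-limit-reached"
}
"""

def classify_request_error(payload: str) -> str:
    """Return a short error code derived from the stable `type` field."""
    problem = json.loads(payload)
    # The last path segment of the type URL identifies the error kind.
    return problem["type"].rstrip("/").rsplit("/", 1)[-1]

print(classify_request_error(body))  # queries-limit-reached
```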

If it is not possible to return a JSON description of the error, then no content type header will be set for the response and the response body will be empty.

Query-level

If the error field is present in an extraction result, then an error has occurred and the extraction result will not be available.

[
    {
        "query": {
            "id": "1587642195276-9386233af6ce1b9f",
            "domain": "example.com",
            "userQuery": {
                "url": "http://www.example.com/this-page-does-not-exist",
                "pageType": "article"
            }
        },
        "error": "Downloader error: http404",
        "algorithmVersion": "20.8.1"
    }
]

The algorithmVersion field is optional and is only included when it can be reported.
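Separating failed queries from successful ones then amounts to checking for the error field; a minimal sketch over a decoded response array (the sample data mirrors the example above, with content trimmed):

```python
# Decoded API response: one failed query and one successful article extraction.
results = [
    {
        "query": {"id": "a", "userQuery": {"url": "http://example.com/missing",
                                           "pageType": "article"}},
        "error": "Downloader error: http404",
    },
    {
        "query": {"id": "b", "userQuery": {"url": "http://example.com/ok",
                                           "pageType": "article"}},
        "article": {"headline": "Hello"},
    },
]

succeeded = [r for r in results if "error" not in r]
failed = [r for r in results if "error" in r]

for r in failed:
    print(r["query"]["userQuery"]["url"], "->", r["error"])
```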

Reference

Request-level

Each entry below lists the error type, its description, and whether the request is billed.

http://errors.xod.scrapinghub.com/queries-limit-reached.html
    Limit of 100 queries per request exceeded. Billed: No

http://errors.xod.scrapinghub.com/malformed-json.html
    Could not parse request JSON. Billed: No

http://errors.xod.scrapinghub.com/rate-limit-exceeded.html
    System-wide rate limit exceeded. Billed: No

http://errors.xod.scrapinghub.com/user-rate-limit-exceeded.html
    User rate limit exceeded. Billed: No

http://errors.xod.scrapinghub.com/account-disabled.html
    Account has been disabled; contact support. Billed: No

http://errors.xod.scrapinghub.com/unrecognized-content-type.html
    Unsupported request content type: should be application/json. Billed: No

http://errors.xod.scrapinghub.com/empty-request.html
    Empty request body; should be a JSON document. Billed: No

http://errors.xod.scrapinghub.com/malformed-request.html
    Unparseable request. Billed: No

http://errors.xod.scrapinghub.com/http-pipelining-not-supported.html
    Attempt to send a second HTTP request over the same TCP connection (HTTP pipelining is not supported). Billed: No

http://errors.xod.scrapinghub.com/unknown-uri.html
    Invalid API endpoint. Billed: No

http://errors.xod.scrapinghub.com/method-not-allowed.html
    Invalid HTTP method (only POST is supported). Billed: No

Query-level

The following values may appear (as substrings) in the error field. Each entry lists the error, its description, and whether the query is billed.

query timed out
    The 10-minute timeout for the query was reached. Billed: No

malformed URL
    Requested URL cannot be parsed. Billed: No

URL cannot be longer than 4096 UTF-16 characters
    URL is too long. Billed: No

non-HTTP schemes are not allowed
    Only http and https schemes are allowed. Billed: No

Domain … is occupied, please retry in … seconds
    Per-domain rate limiting was applied. It is recommended to retry after the specified interval. Billed: No

Downloader error: No response (network301)
    Cannot honor the request because the protocol is not known. Billed: Yes

Downloader error: No response (network5)
    Remote server closed the connection before the transfer was finished. Billed: Yes

Downloader error: No visible elements
    There are no visible elements in the downloaded content. Billed: Yes

Downloader error: http304
    Remote server returned HTTP status code 304 (not modified). Billed: Yes

Downloader error: http404
    Remote server returned HTTP status code 404 (not found). Billed: Yes

Downloader error: http500
    Remote server returned HTTP status code 500 (internal server error). Billed: Yes

Extraction not permitted for this URL
    This domain or URL has been blacklisted. Billed: No

Proxy error: ssl_tunnel_error
    SSL proxy tunneling error. Billed: Yes

Proxy error: banned
    Antiban measures are in action, but this does not mean that the proxy pool is exhausted. Retrying is recommended. Billed: Yes

Proxy error: domain_forbidden
    Domain is forbidden on the Smart Proxy Manager side. Billed: Yes

Proxy error: internal_error
    Internal proxy error. Billed: Yes

Proxy error: nxdomain
    Smart Proxy Manager wasn’t able to resolve the domain through DNS. Billed: Yes

Other, rarer errors are also possible. In general, errors starting with “Downloader error…” or “Proxy error…” are billed, while other types of error are not.
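The billing rule above can be captured in a small helper. This is a sketch of the stated heuristic, not an exhaustive classification:

```python
def is_billed(error: str) -> bool:
    """Heuristic from the docs: downloader and proxy errors are billed."""
    return error.startswith(("Downloader error", "Proxy error"))

print(is_billed("Downloader error: http404"))  # True
print(is_billed("query timed out"))            # False
```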

Restrictions and Failure Modes

  • Users are limited to 10,000 queries during the Automatic Extraction trial period of two weeks. Users automatically convert to the standard $60 subscription if they use all of their trial quota, or the trial period elapses.

    Users with a standard $60 per month subscription receive a quota of 100,000 queries per monthly billing cycle. If the quota is exceeded, additional queries will be billed at the end of the billing cycle on a pro rata basis. There is a limit of 500,000 queries per monthly billing cycle.

    The monthly limit can be increased if requested; please reach out to our sales team (autoextractsales@zyte.com) if you anticipate that your usage will exceed 500,000 queries in a billing cycle.

  • A rate limit of 5 queries per second is enforced. Please reach out to our sales team (autoextractsales@zyte.com) if you require a rate limit increase.

    Sending single queries sequentially will result in low throughput. To achieve higher request rates, you will need to use a number of concurrent API requests. You can find more information about this in the documentation for the AutoExtract Python client.

  • There is a global timeout of 10 minutes for queries.

    Queries can time out for a number of reasons, such as difficulties during content download. If a query in a batched request times out, the API will return the results of the extractions that did succeed along with errors for those that timed out.

    We therefore recommend that you set the HTTP timeout for API requests to over 10 minutes.

  • A maximum of 100 queries may be submitted in a single batched request.

  • In general, it is not possible to access the Automatic Extraction API from browser environments due to browser CORS restrictions. It is not possible to circumvent this restriction using the no-cors mode of the browser fetch API, as the correct Authorization and Content-Type headers will not be included in the HTTP requests to the API. In any case, we strongly recommend performing Automatic Extraction API requests at the backend in order to preserve the secrecy of your API key.
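The batching, rate-limit, and timeout constraints above suggest splitting work into batches of at most 100 queries and sending a few batches concurrently, each with an HTTP timeout above the 10-minute query timeout. The sketch below uses a hypothetical send_batch stand-in for the actual HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the HTTP call to the API. A real implementation
# would POST the batch (as in the earlier request example) with an HTTP
# timeout above 600 seconds and return the decoded JSON response array.
def send_batch(queries):
    return [{"query": {"userQuery": q}} for q in queries]

queries = [
    {"url": f"http://example.com/page-{i}", "pageType": "article"}
    for i in range(250)
]

# Split into batches of at most 100 queries (the per-request limit) and
# send a small number of batches concurrently, keeping within the
# 5 queries/second rate limit.
batches = [queries[i:i + 100] for i in range(0, len(queries), 100)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = [r for batch_result in pool.map(send_batch, batches)
               for r in batch_result]

print(len(batches), len(results))  # 3 250
```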

Batching Queries

Multiple queries can be submitted in a single API request, resulting in an equivalent number of query results.

Note

When using batch requests, each query counts towards usage limits separately. For example, sending a batch request with 10 queries incurs the same cost as sending 10 requests with 1 query each.

[
    {
        "url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "pageType": "product"
    },
    {
        "url": "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        "pageType": "product"
    },
    {
        "url": "http://books.toscrape.com/catalogue/soumission_998/index.html",
        "pageType": "product"
    }
]

Note that query results are not necessarily returned in the same order as the original queries. If you need an easy way to associate the results with the queries that generated them, you can pass an additional meta field in the query. The value that you pass will appear as the query/userQuery/meta field in the corresponding query result.

For example, if your request body is:

[
    {
        "url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "pageType": "product",
        "meta": "query1"
    },
    {
        "url": "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        "pageType": "product",
        "meta": "query2"
    },
    {
        "url": "http://books.toscrape.com/catalogue/soumission_998/index.html",
        "pageType": "product",
        "meta": "query3"
    }
]

The response may be (irrelevant content omitted):

[
    {
        "query": {
            "userQuery": {
                "meta": "query2"
            }
        },
        "product": {}
    },
    {
        "query": {
            "userQuery": {
                "meta": "query1"
            }
        },
        "product": {}
    },
    {
        "query": {
            "userQuery": {
                "meta": "query3"
            }
        },
        "product": {}
    }
]
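Re-associating out-of-order results with their originating queries is then a dictionary lookup on meta; a minimal sketch using the example response above (content trimmed):

```python
# Out-of-order results, as in the example response above.
results = [
    {"query": {"userQuery": {"meta": "query2"}}, "product": {}},
    {"query": {"userQuery": {"meta": "query1"}}, "product": {}},
    {"query": {"userQuery": {"meta": "query3"}}, "product": {}},
]

# Index results by the meta value echoed back by the API.
by_meta = {r["query"]["userQuery"]["meta"]: r for r in results}

# Restore the original query order.
ordered = [by_meta[m] for m in ("query1", "query2", "query3")]
print([r["query"]["userQuery"]["meta"] for r in ordered])
# ['query1', 'query2', 'query3']
```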