Automatic Extraction API#
How the API works#
Currently, the API has a single endpoint: https://autoextract.scrapinghub.com/v1/extract
A request is composed of one or more queries. Each query contains a URL to extract from, and a page type that indicates what the extraction result should be (article, job posting, product, etc.).
Requests and responses are transmitted in JSON format over HTTPS. Authentication is performed using HTTP Basic Authentication, where your API key is the username and the password is empty.
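For example, a minimal single-query request using the Python requests library might look like this (a sketch; YOUR_API_KEY and the target URL are placeholders):
# Minimal sketch of a request to the Automatic Extraction API.
# HTTP Basic Auth: the API key is the username, the password is empty.
import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('YOUR_API_KEY', ''),
    json=[{'url': 'https://example.com/some-article', 'pageType': 'article'}],
)
print(response.json())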
API data formats#
Requests#
Requests consist of a JSON array of queries. Each query is a map containing the following fields:

| Name | Required | Type | Description |
|---|---|---|---|
| url | Yes | String | URL of web page to extract from. Must be a valid HTTP or HTTPS URL. |
| pageType | Yes | String | Type of extraction to perform. Must be article, articleList, comments, forumPosts, jobPosting, product, productList, realEstate, reviews or vehicle. |
| meta | No | String | User UTF-8 string, which will be passed through the extraction pipeline and returned in the query result. Max size 4 KB. |
| articleBodyRaw | No | Boolean | Whether or not to include article HTML in article extractions. True by default. Setting this to false can reduce response size significantly if HTML is not required. |
| fullHtml | No | Boolean | Include the full, raw HTML of the target web page in the query result. This is a premium feature that is disabled by default. Please open a support ticket if you wish to have it enabled for your account. |
| | No | String | HTML source to be scraped. Extraction will be done from the provided HTML, with additional resources (images, CSS, etc.) downloaded from the provided url. |
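For example, a request body submitting one article query with the optional meta and articleBodyRaw fields might look like this (the meta value is an arbitrary user string):
[
    {
        "url": "https://example.com/some-article",
        "pageType": "article",
        "meta": "my-tracking-id",
        "articleBodyRaw": false
    }
]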
Responses#
API responses are wrapped in a JSON array (this is to facilitate query batching). A query response for a single article extraction looks like this (some large fields are truncated):
[
{
"query": {
"id": "1564747029122-9e02a1868d70b7a1",
"domain": "scrapinghub.com",
"userQuery": {
"url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
"pageType": "article"
}
},
"article": {
"articleBody": "Unbeknownst to many..",
"articleBodyHtml": "<article>Unbeknownst to many..",
"articleBodyRaw": "<span id=...",
"headline": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
"inLanguage": "en",
"datePublished": "2018-06-19T00:00:00",
"datePublishedRaw": "June 19, 2018",
"author": "Ian Kerins",
"authorsList": [
"Ian Kerins"
],
"mainImage": "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg#keepProtocol",
"images": [
"https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg"
],
"description": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
"url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
"probability": 0.7369686365127563
},
"webPage": {
"inLanguages": [
{"code": "en"}
]
},
"algorithmVersion": "20.8.1"
}
]
Output fields#
Query#
All API responses include the original query along with some additional information such as the query ID:
# Enriched query
print(response.json()[0]['query'])
Result fields#
The following result fields are available when requested with the corresponding page type:
- article - see Article Extraction
- articleList - see Article List Extraction
- comments - see Comment Extraction
- forumPosts - see Forum Post Extraction
- jobPosting - see Job Posting Extraction
- product - see Product Extraction
- productList - see Product List Extraction
- realEstate - see Real Estate Extraction
- reviews - see Review Extraction
- vehicle - see Vehicle Extraction
The following result fields are always present if extraction succeeds:
- webPage - see General Web Page Information
- algorithmVersion - the version of the Automatic Extraction AI-enabled engine
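For example, the result fields of an article query can be read like this (a sketch; assumes response is the requests.Response from the call shown earlier):
# Reading result fields from an article extraction.
result = response.json()[0]
article = result.get('article')            # present for pageType "article"
web_page = result.get('webPage')           # always present if extraction succeeds
version = result.get('algorithmVersion')   # extraction engine version
if article is not None:
    print(article['headline'], web_page['inLanguages'], version)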
Full HTML#
If you have upgraded your account to support the fullHtml query parameter, then queries with this parameter set to true will return the raw HTML of the entire target web page in the html field of the query result. This differs from the articleBodyRaw field in article extractions in two respects:
- It includes the HTML of the entire page, not just the section of the page containing article text
- It can be used with all extraction types
Note that this is a premium feature that is disabled by default. Please open a support ticket if you wish to enable it.
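For example, a query requesting the full page HTML (assuming the feature is enabled for your account; the URL is a placeholder):
[
    {
        "url": "https://example.com/some-article",
        "pageType": "article",
        "fullHtml": true
    }
]
The raw page HTML is then returned in the html field of the query result.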
Errors#
Errors fall into two broad categories: request-level and query-level. Request-level errors occur when the HTTP API server cannot process the input that it receives. Query-level errors occur when a specific query cannot be processed; you can detect them by checking the error field in query results.
Some errors can be detected immediately when a request is received. Users are not charged for these requests. However, users are charged for requests that result in errors further along the extraction process.
Request-level#
Examples include:
- Authentication failure
- Malformed request JSON
- Too many queries in request
- Request payload size too large
If a request-level error occurs, the API server will return a 4xx or 5xx response code. If possible, a JSON response body with content type application/problem+json will be returned that describes the error in accordance with RFC 7807 - Problem Details for HTTP APIs.
For example, if you exceed the query limit in a batched request, you get an error response with a 413 status code, the Content-Type header set to application/problem+json, and a body similar to this:
{
"title": "Limit of 100 queries per request exceeded",
"type": "http://errors.xod.scrapinghub.com/queries-limit-reached"
}
The type field should be used to check the error type, as it will not change in subsequent versions. Depending on the error, the response may include more specific fields providing additional details, e.g. the delay before retrying. Such responses can be easily parsed and used for programmatic error handling.
If it is not possible to return a JSON description of the error, then no content type header will be set for the response and the response body will be empty.
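A minimal sketch of handling request-level errors programmatically (assumes response is the requests.Response from the call shown earlier):
# Request-level error handling.
if response.status_code >= 400:
    if response.headers.get('Content-Type') == 'application/problem+json':
        problem = response.json()
        # Match on the stable "type" field, not the human-readable "title".
        if problem['type'] == 'http://errors.xod.scrapinghub.com/queries-limit-reached':
            print('Too many queries in one request')
        else:
            print('Request failed:', problem.get('title'))
    else:
        # No JSON description of the error is available; the body is empty.
        print('Request failed with status', response.status_code)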
Query-level#
If the error field is present in an extraction result, then an error has occurred and the extraction result will not be available.
[
{
"query": {
"id": "1587642195276-9386233af6ce1b9f",
"domain": "example.com",
"userQuery": {
"url": "http://www.example.com/this-page-does-not-exist",
"pageType": "article"
}
},
"error": "Downloader error: http404",
"algorithmVersion": "20.8.1"
}
]
The algorithmVersion field is optional and is only included when it is possible to report it.
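Query-level errors can be handled per result, for example (a sketch; assumes response holds the parsed API response):
# Checking each query result for a query-level error.
for result in response.json():
    if 'error' in result:
        # No extraction result is available for this query.
        print(result['query']['id'], 'failed:', result['error'])
    else:
        print(result['query']['id'], 'succeeded')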
Reference#
Request-level#
| Type | Description | Billed |
|---|---|---|
| http://errors.xod.scrapinghub.com/queries-limit-reached.html | Limit of 100 queries per request exceeded | No |
| | Could not parse request JSON | No |
| | System-wide rate limit exceeded | No |
| http://errors.xod.scrapinghub.com/user-rate-limit-exceeded.html | User rate limit exceeded | No |
| | Account has been disabled - contact support | No |
| http://errors.xod.scrapinghub.com/unrecognized-content-type.html | Unsupported request content type: should be application/json | No |
| | Empty request body - should be JSON document | No |
| | Unparseable request | No |
| http://errors.xod.scrapinghub.com/http-pipelining-not-supported.html | Attempt to send a second HTTP request over the same TCP connection | No |
| | Invalid API endpoint | No |
| | Invalid HTTP method (only POST is supported) | No |
Query-level#
| error contains | Description | Billed |
|---|---|---|
| query timed out | 10-minute timeout for query reached | No |
| malformed URL | Requested URL cannot be parsed | No |
| URL cannot be longer than 4096 UTF-16 characters | URL is too long | No |
| non-HTTP schemes are not allowed | Only http and https schemes are allowed | No |
| Domain … is occupied, please retry in … seconds | Per-domain rate limiting was applied. It is recommended to retry after the specified interval. | No |
| Extraction not permitted for this URL | This domain or URL has been blacklisted | No |
| InternalError | Internal extraction pipeline error | No |
| Downloader error: httpXXX | Remote server returned HTTP non-success status code XXX | Yes |
| Downloader error: No visible elements | There are no visible elements in downloaded content | Yes |
| Downloader error: internal | Internal downloader error | Yes |
| Proxy error: banned | Anti-ban measures in action; could not fetch content after several attempts | Yes |
| Proxy error: internal_error | Internal proxy error | Yes |
Other, rarer errors are also possible. In general, errors starting with “Downloader error…” or “Proxy error…” are billed, while other types of error are not.
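For example, a simple billing check on a query-level error string, based on the rule above:
# Decide whether a failed query was billed.
def is_billed(error_message):
    # Billed errors start with "Downloader error" or "Proxy error".
    return error_message.startswith(('Downloader error', 'Proxy error'))

print(is_billed('Downloader error: http404'))  # True
print(is_billed('query timed out'))            # False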
Restrictions and Failure Modes#
Users are limited to 10,000 queries during the Automatic Extraction trial period of two weeks. Users are automatically converted to the standard $60 subscription when they use up their trial quota or when the trial period elapses.
Users with a standard $60 per month subscription receive a quota of 100,000 queries per monthly billing cycle. If the quota is exceeded, additional queries will be billed at the end of the billing cycle on a pro rata basis. There is a limit of 500,000 queries per monthly billing cycle.
The monthly limit can be increased if requested; please open a support ticket if you anticipate that your usage will exceed 500,000 queries in a billing cycle.
A rate limit of 5 queries per second is enforced. Please open a support ticket if you require a rate limit increase.
Sending single queries sequentially results in low throughput. To achieve higher request rates you will need to use a number of concurrent API requests (see the sketch below). You can find more information about this in the documentation for the AutoExtract Python client.
There is a global timeout of 10 minutes for queries.
Queries can time out for a number of reasons, such as difficulties during content download. If a query in a batched request times out, the API will return the results of the extractions that did succeed along with errors for those that timed out.
We therefore recommend that you set the HTTP timeout for API requests to over 10 minutes.
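Putting both recommendations together, a minimal concurrency sketch using requests and the standard library might look like this (YOUR_API_KEY and the URLs are placeholders):
# Concurrent single-query requests for higher throughput.
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = 'https://autoextract.scrapinghub.com/v1/extract'

def extract(url):
    # One query per request; client timeout set above the 10-minute query timeout.
    response = requests.post(
        API_URL,
        auth=('YOUR_API_KEY', ''),
        json=[{'url': url, 'pageType': 'article'}],
        timeout=660,
    )
    return response.json()[0]

urls = ['https://example.com/article-1', 'https://example.com/article-2']
# Concurrency chosen with the default rate limit of 5 queries per second in mind.
with ThreadPoolExecutor(max_workers=5) as pool:
    for result in pool.map(extract, urls):
        print(result.get('error') or result.get('article', {}).get('headline'))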
A maximum of 100 queries may be submitted in a single batched request.
In general, it is not possible to access the Automatic Extraction API from within browser environments due to browser CORS restrictions. It is not possible to circumvent this restriction using the no-cors mode of the browser fetch API, as the correct Authorization and Content-Type headers will not be included in the HTTP requests to the API. In any case, we strongly recommend that Automatic Extraction API requests are performed on the backend in order to preserve the secrecy of your API key.
Batching Queries#
Multiple queries can be submitted in a single API request, resulting in an equivalent number of query results.
Warning
We don’t recommend batching queries in most cases. If you wish to achieve the highest throughput and avoid the limit of 100 URLs per batch, we recommend sending multiple concurrent requests, each with a single URL, and handling responses as they become available; zyte-autoextract already handles this out of the box. If you instead batch a large number of URLs, the full results only become available once all of them have been processed, and overall throughput is lower.
Note
When using batch requests, each query counts towards usage limits separately. For example, sending a batch request with 10 queries incurs the same cost as sending 10 requests with 1 query each.
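For example, the following request body submits three product queries in a single batch: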
[
{
"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"pageType": "product"
},
{
"url": "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
"pageType": "product"
},
{
"url": "http://books.toscrape.com/catalogue/soumission_998/index.html",
"pageType": "product"
}
]
Note that query results are not necessarily returned in the same order as the original queries. If you need an easy way to associate the results with the queries that generated them, you can pass an additional meta field in the query. The value that you pass will appear as the query/userQuery/meta field in the corresponding query result.
For example, if your request body is:
[
{
"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"pageType": "product",
"meta": "query1"
},
{
"url": "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
"pageType": "product",
"meta": "query2"
},
{
"url": "http://books.toscrape.com/catalogue/soumission_998/index.html",
"pageType": "product",
"meta": "query3"
}
]
The response may be (irrelevant content omitted):
[
{
"query": {
"userQuery": {
"meta": "query2"
}
},
"product": {}
},
{
"query": {
"userQuery": {
"meta": "query1"
}
},
"product": {}
},
{
"query": {
"userQuery": {
"meta": "query3"
}
},
"product": {}
}
]
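Given such a response, results can be matched back to their originating queries via the echoed meta value, for example (a sketch using the meta values from the request above):
# Re-associate unordered results with their queries using "meta".
results = response.json()
by_meta = {r['query']['userQuery'].get('meta'): r for r in results}
print(by_meta['query1'].get('product'))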