Migrating from Automatic Extraction to Zyte API
Learn how to migrate from Automatic Extraction to Zyte API, which supports automatic extraction.
Key differences
The following table summarizes the feature differences between both products:
Feature | Automatic Extraction | Zyte API |
---|---|---|
Type support | Product, product list, article, article list, comment, forum post, review, real estate, vehicle, job posting | Product, product list, product navigation, article, article list, article navigation, job posting |
Data schemas | | |
Browser HTML | Premium only | Yes |
Screenshots | No | Yes |
Actions | No | Yes |
JavaScript toggle | No | Yes |
Geolocation | No | Yes |
Cookies | No | Yes |
Response headers | No | Yes |
Browserless input | No | Yes |
Batch queries | Yes | No |
Custom input | Yes | No |
Pricing | | More granular and flexible. For example, getting both automatic extraction and browser HTML on the same request no longer requires an Enterprise account. |
Updating requests
If you are using an HTTP client, update your API requests as follows:
- Update your endpoint, from:
  https://autoextract.scrapinghub.com/v1/extract
  to:
  https://api.zyte.com/v1/extract
- Update your API key to your Zyte API key.
- Update your request body from an array of queries:
  [{"…": "…"}]
  to a single query object:
  {"…": "…"}
  Zyte API does not support query batching. If you were sending multiple queries per request, you must split them into separate requests with 1 query each.
- Replace "pageType": "TYPE" with "TYPE": true.
  For example, replace "pageType": "product" with "product": true.
- Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.
- Replace fullHtml with browserHtml. See Browser HTML.
- Remove articleBodyRaw; Zyte API can only return articleBodyHtml.
- Remove customHtml; Zyte API does not support providing a custom HTML document as input.
Example (curl)
Automatic Extraction:
curl \
--user YOUR_API_KEY: \
--header 'Content-Type: application/json' \
--data '[{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]' \
--compressed \
https://autoextract.scrapinghub.com/v1/extract
Zyte API:
curl \
--user YOUR_API_KEY: \
--header 'Content-Type: application/json' \
--data '{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}' \
--compressed \
https://api.zyte.com/v1/extract
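The request-body changes listed above can also be applied programmatically. The following is a minimal sketch, not part of any Zyte library: the function names migrate_query and migrate_batch are hypothetical, and the code assumes each Automatic Extraction query is a plain dict that uses only the parameters discussed in this section.

```python
def migrate_query(old_query):
    """Convert one Automatic Extraction query dict into a Zyte API query dict.

    Hypothetical helper: it only covers the field changes described above.
    """
    if "customHtml" in old_query:
        # Zyte API does not accept a custom HTML document as input.
        raise ValueError("customHtml has no Zyte API counterpart")
    new_query = {}
    for key, value in old_query.items():
        if key == "pageType":
            new_query[value] = True  # "pageType": "product" -> "product": true
        elif key == "meta":
            new_query["echoData"] = value
        elif key == "fullHtml":
            new_query["browserHtml"] = value
        elif key == "articleBodyRaw":
            continue  # removed, Zyte API only returns articleBodyHtml
        else:
            new_query[key] = value
    return new_query


def migrate_batch(old_queries):
    """Turn an array of queries into one request body per query."""
    return [migrate_query(query) for query in old_queries]
```

Each item returned by migrate_batch is meant to be sent as the body of its own request, since Zyte API does not support query batching.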
To replace the command-line interface of zyte-autoextract:
- Install python-zyte-api:
  pip install zyte-api
- Replace the python -m autoextract command with zyte-api.
- Update your API key to your Zyte API key.
  If you were setting your API key with the ZYTE_AUTOEXTRACT_KEY environment variable, use ZYTE_API_KEY instead now.
- If you were using a list of URLs as input, switch to a JSON Lines file as input.
  Tip
  You do not need to pass --intype jl on the command line; zyte-api automatically detects your input format.
  Instead of passing --page-type TYPE on the command line, use "TYPE": true in each query of your input JSON Lines file.
  For example, replace --page-type product on the command line with "product": true on every query.
- If you were using a JSON Lines file as input:
  - Replace "pageType": "TYPE" with "TYPE": true.
    For example, replace "pageType": "product" with "product": true.
  - Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.
  - Replace fullHtml with browserHtml. See Browser HTML.
  - Remove articleBodyRaw; Zyte API can only return articleBodyHtml.
  - Remove customHtml; Zyte API does not support providing a custom HTML document as input.
- If you are using --api-endpoint ENDPOINT, find out what your Zyte API endpoint is and use --api-url ENDPOINT instead, or remove the command-line parameter altogether to use the default endpoint.
- Remove --batch-size NUMBER; Zyte API does not support query batching.
- Remove --max-query-error-retries. zyte-api performs some retries automatically, but it does not allow customizing its retry policy from the command line, other than disabling error retries altogether with --dont-retry-errors.
- Remove --disable-cert-validation.
  If you get SSL errors, install our CA certificate.
Example
Automatic Extraction:
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}
python -m autoextract --intype jl input.jsonl
Zyte API:
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}
zyte-api input.jsonl
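If your previous input was a plain list of URLs, a short script can produce the JSON Lines input described above. This is only a sketch under assumptions: the function name urls_to_jsonl is hypothetical, and the input is assumed to contain one URL per line, with blank lines ignored.

```python
import json


def urls_to_jsonl(urls_text, page_type):
    """Turn newline-separated URLs into JSON Lines queries.

    Every output line is one query object holding the URL and "TYPE": true,
    which takes the place of the --page-type TYPE command-line option.
    """
    lines = []
    for raw_line in urls_text.splitlines():
        url = raw_line.strip()
        if not url:
            continue  # skip blank lines
        lines.append(json.dumps({"url": url, page_type: True}))
    return ("\n".join(lines) + "\n") if lines else ""
```

Writing the returned string to a file such as input.jsonl gives you a file that can be passed to zyte-api as in the example above.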
To replace the Python asyncio interface of zyte-autoextract:
- Install python-zyte-api:
  pip install zyte-api
- In your import statements, change autoextract to zyte_api.
- Instead of calling the request_raw and request_parallel_as_completed functions, create an instance of AsyncClient and call its same-name methods.
- Update your query to be a single dict, instead of a list of dict or Request objects.
  Zyte API does not support query batching. If you were sending multiple queries per request, you must split them into separate requests with 1 query each.
- Update your query dict or Request object to be a dict with the following field changes:
  - Remove pageType, use its previous value as a field name instead, and set it to True.
    For example, replace Request(pageType="product") with {"product": True}.
  - Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.
  - Remove articleBodyRaw; Zyte API can only return articleBodyHtml.
  - Replace fullHtml with browserHtml. See Browser HTML.
  - Remove extra. extra.customHtml has no replacement, as Zyte API does not support providing a custom HTML document as input.
- Pass the api_key parameter to AsyncClient, with your Zyte API key as value.
  If you were setting your API key with the ZYTE_AUTOEXTRACT_KEY environment variable, use ZYTE_API_KEY instead now.
- If you are using the endpoint parameter, find out what your Zyte API URL and endpoint are, and use api_url in AsyncClient (default: https://api.zyte.com/v1/) and endpoint in client methods (default: extract) instead, or omit the parameters altogether to use their default values.
- If you are creating an aiohttp session with create_session, drop the disable_cert_validation parameter.
  If you get SSL errors, install our CA certificate.
- Remove the agg_stats parameter, or pass it to AsyncClient instead.
- Remove the max_query_error_retries parameter. To customize the retry policy, use the retrying parameter of AsyncClient instead.
- If you are using request_raw:
  - Remove the handle_retries parameter. To customize the retry policy, use the retrying parameter of AsyncClient instead.
  - Remove the headers parameter; python-zyte-api does not support customizing the HTTP headers sent to Zyte API.
    Note
    Not to be confused with Zyte API parameters to set request headers: customHttpRequestHeaders (HTTP) and requestHeaders (browser).
  - Pass the retrying parameter to AsyncClient instead.
- If you are using request_parallel_as_completed:
  - Pass the n_conn parameter to AsyncClient instead.
  - Remove the batch_size parameter; Zyte API does not support query batching.
Example
Automatic Extraction:
import asyncio

from autoextract.aio.client import request_raw


async def main():
    # Automatic Extraction takes a list of queries, even for a single URL.
    api_response = await request_raw(
        [
            {
                "url": (
                    "https://books.toscrape.com/catalogue"
                    "/a-light-in-the-attic_1000/index.html"
                ),
                "pageType": "product",
            },
        ],
    )
    print(api_response)


asyncio.run(main())
Zyte API:
import asyncio

from zyte_api.aio.client import AsyncClient


async def main():
    client = AsyncClient()
    # Zyte API takes a single query dict per request.
    api_response = await client.request_raw(
        {
            "url": (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            "product": True,
        },
    )
    print(api_response)


asyncio.run(main())
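Because Zyte API does not support query batching, queries that used to travel in one batch need to be sent as separate requests. The sketch below shows one way to do that with plain asyncio. The names send_each and max_concurrency are hypothetical, and send stands for any coroutine function that accepts a single query dict, for example the request_raw method of an AsyncClient instance.

```python
import asyncio


async def send_each(queries, send, max_concurrency=5):
    """Send every query as its own request and return responses in order.

    `send` is any coroutine function that takes a single query dict.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def send_one(query):
        async with semaphore:  # cap the number of in-flight requests
            return await send(query)

    # gather() preserves the order of the input queries in its result.
    return await asyncio.gather(*(send_one(query) for query in queries))
```

The concurrency cap is a design choice of this sketch; the n_conn parameter of AsyncClient mentioned above is the library's own way to control parallel connections.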
To replace the scrapy-autoextract middleware:
- Install and configure scrapy-zyte-api following this page of the web scraping tutorial, including your Zyte API key.
- Instead of setting a page type with the AUTOEXTRACT_PAGE_TYPE setting, the page_type spider attribute, or the autoextract.pageType request metadata key, set "zyte_api_automap": {"TYPE": True} on the request metadata, where TYPE is the target type, e.g. product.
  For example, replace Request(meta={"autoextract": {"pageType": "product"}}) with Request(meta={"zyte_api_automap": {"product": True}}).
- If you are using the AUTOEXTRACT_URL setting, find out what your Zyte API endpoint is and use ZYTE_API_URL instead, or let the default endpoint be used.
- scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_SLOT_POLICY setting; a per-domain policy is always used. Moreover, Zyte API and non-Zyte-API requests are always treated as targeting different domains.
- If you are using the autoextract.extra request metadata key, map its values to values in the zyte_api_automap request metadata key as follows:
  - Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.
  - Replace fullHtml with browserHtml. See Browser HTML.
  - Remove articleBodyRaw; Zyte API can only return articleBodyHtml.
  - Remove customHtml; Zyte API does not support providing a custom HTML document as input.
- Remove the autoextract.headers parameter; scrapy-zyte-api does not support customizing the HTTP headers sent to Zyte API.
  Note
  Not to be confused with Zyte API parameters to set request headers: customHttpRequestHeaders (HTTP) and requestHeaders (browser).
Example
Automatic Extraction:
from scrapy import Request, Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"

    def start_requests(self):
        yield Request(
            (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            meta={
                "autoextract": {
                    "enabled": True,
                    "pageType": "product",
                },
            },
        )

    def parse(self, response):
        print(response.meta["autoextract"])
Zyte API:
from scrapy import Request, Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"

    def start_requests(self):
        yield Request(
            (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            meta={
                "zyte_api_automap": {
                    "product": True,
                },
            },
        )

    def parse(self, response):
        print(response.raw_api_response)
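If many requests in a project set the autoextract metadata key, the mapping described above can be collected in one place. The following is a sketch only: migrate_meta and default_page_type are hypothetical names, and the code assumes that a request used Automatic Extraction only when its autoextract metadata had enabled set to a true value.

```python
def migrate_meta(meta, default_page_type=None):
    """Return a copy of Scrapy request meta that uses zyte_api_automap.

    Hypothetical helper. `default_page_type` stands in for a page type that
    used to come from AUTOEXTRACT_PAGE_TYPE or the page_type spider attribute.
    """
    new_meta = {key: value for key, value in meta.items() if key != "autoextract"}
    old = meta.get("autoextract") or {}
    if not old.get("enabled"):
        return new_meta  # assumption: the request did not use Automatic Extraction
    params = {}
    page_type = old.get("pageType", default_page_type)
    if page_type:
        params[page_type] = True
    extra = old.get("extra") or {}
    if "meta" in extra:
        params["echoData"] = extra["meta"]
    if "fullHtml" in extra:
        params["browserHtml"] = extra["fullHtml"]
    # articleBodyRaw and customHtml have no Zyte API counterpart and are dropped.
    new_meta["zyte_api_automap"] = params
    return new_meta
```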
Note
As of scrapy-zyte-api 0.9.0, page object support is limited to the product item type.
To replace the scrapy-autoextract page object providers:
- Install and configure scrapy-zyte-api following this page of the web scraping tutorial, including your Zyte API key.
- Upgrade your versions of web-poet and scrapy-poet; chances are you are using old versions that still work with scrapy-autoextract.
- Replace scrapy_autoextract.AutoExtractProvider with scrapy_zyte_api.providers.ZyteApiProvider in your SCRAPY_POET_PROVIDERS setting.
- Replace autoextract_poet.pages.AutoExtractProductPage with zyte_common_items.Product.
  Note
  zyte_common_items.Product is not a page object but an item, i.e. the result of calling to_item() on a page object.
- Replace autoextract_poet.pages.AutoExtractWebPage with web_poet.BrowserResponse. See Browser HTML.
- If you are using the AUTOEXTRACT_URL setting, find out what your Zyte API endpoint is and use ZYTE_API_URL instead, or let the default endpoint be used.
- scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_MAX_QUERY_ERROR_RETRIES setting. See Customizing the retry policy to achieve something similar.
- scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_CONCURRENT_REQUESTS_PER_DOMAIN setting; use the CONCURRENT_REQUESTS_PER_DOMAIN setting instead.
- scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_CACHE_FILENAME and AUTOEXTRACT_CACHE_GZIP settings.
Example
Automatic Extraction:
from autoextract_poet.pages import AutoExtractProductPage
from scrapy import Spider
from scrapy_poet import DummyResponse


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    ]

    def parse(self, response: DummyResponse, product_page: AutoExtractProductPage):
        # The page object must be converted into an item explicitly.
        print(product_page.to_item())
Zyte API:
from scrapy import Spider
from scrapy_poet import DummyResponse
from zyte_common_items import Product


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    ]

    def parse(self, response: DummyResponse, product: Product):
        # Product is already an item, so there is no to_item() call.
        print(product)
Updating response expectations
If you are using an HTTP client, update your API response expectations as follows:
- You get a JSON object ({"…": "…"}), not an array ([{"…": "…"}]).
- You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API.
- There are no query, webPage, or algorithmVersion keys in the response.
  You can use metadata to replace query.
- A url key exists, but it contains the response URL rather than the request URL that you get in query.userQuery.url. The response URL can differ from the request URL, e.g. due to redirections.
- Error response handling is similar; rate limiting is more generous. See Zyte API error handling.
If you are using the command-line interface of zyte-autoextract, update your API response expectations as follows:
- You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.
- There are no query, webPage, or algorithmVersion keys in the response.
  You can use metadata to replace query.
- A url key exists, but it contains the response URL rather than the request URL that you get in query.userQuery.url. The response URL can differ from the request URL, e.g. due to redirections.
- Rate limiting is more generous. See Zyte API error handling.
If you are using the Python asyncio interface of zyte-autoextract, update your API response expectations as follows:
- You get a dict ({"…": "…"}), not a list of dict ([{"…": "…"}]).
- You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.
- There are no query, webPage, or algorithmVersion keys in the response.
  You can use metadata to replace query.
- A url key exists, but it contains the response URL rather than the request URL that you get in query.userQuery.url. The response URL can differ from the request URL, e.g. due to redirections.
- Rate limiting is more generous. See Zyte API error handling.
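Since the response no longer carries a query key, one way to keep track of the original request URL is to send it in echoData and read it back from the response. The sketch below assumes that Zyte API returns the echoData value unchanged in the response, as described in Metadata. The function names and the requestUrl key are hypothetical.

```python
def with_request_url(query):
    """Return a copy of the query whose echoData carries the request URL."""
    tagged = dict(query)
    tagged["echoData"] = {"requestUrl": query["url"]}
    return tagged


def request_url_of(response):
    """Read the original request URL back from a Zyte API response dict.

    Falls back to the response URL when no matching echoData is present.
    """
    echo = response.get("echoData")
    if isinstance(echo, dict) and "requestUrl" in echo:
        return echo["requestUrl"]
    return response["url"]
```

Note that this sketch overwrites any echoData already set on the query; if you also need your own metadata, nest it inside the same structure.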
If you are using the scrapy-autoextract middleware, update your API response expectations as follows:
- You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.
- There are no original_url or timing meta keys in the response.
  You can read the original URL from response.request.url.
  There is no built-in alternative for the timing data. If you want it, you need to implement it on your own, for example with a custom Scrapy downloader middleware.
- Rate limiting is more generous. See Zyte API error handling.
- scrapy-zyte-api is smarter about retries, at the cost of handling retries outside of Scrapy. See Customizing the retry policy.
If you are using scrapy-autoextract page object providers, update your API response expectations as follows:
- zyte_common_items.Product is not a page object but an item, i.e. the result of calling to_item() on a page object.
  Its API is also slightly different from that of autoextract_poet.items.Product, which is what autoextract_poet.AutoExtractProductPage.to_item() returns.
- Rate limiting is more generous. See Zyte API error handling.
Example
Automatic Extraction:
[
{
"query": {
"id": "1686644367537-712b96d0aa96c12a",
"domain": "toscrape.com",
"userAgent": "curl/8.1.0",
"userQuery": {
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"pageType": "product"
}
},
"webPage": {
"inLanguages": [
{
"code": "en"
}
]
},
"product": {
"name": "A Light in the Attic",
"description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
"mainImage": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg",
"images": [
"https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
],
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"additionalProperty": [
{
"name": "upc",
"value": "a897fe39b1053632"
},
{
"name": "product type",
"value": "Books"
},
{
"name": "price (excl. tax)",
"value": "£51.77"
},
{
"name": "price (incl. tax)",
"value": "£51.77"
},
{
"name": "tax",
"value": "£0.00"
},
{
"name": "availability",
"value": "In stock (22 available)"
},
{
"name": "number of reviews",
"value": "0"
}
],
"offers": [
{
"price": "51.77",
"currency": "£",
"availability": "InStock"
}
],
"sku": "1000",
"breadcrumbs": [
{
"name": "Home",
"link": "https://books.toscrape.com/index.html"
},
{
"name": "Books",
"link": "https://books.toscrape.com/catalogue/category/books_1/index.html"
},
{
"name": "Poetry",
"link": "https://books.toscrape.com/catalogue/category/books/poetry_23/index.html"
},
{
"name": "A Light in the Attic"
}
],
"probability": 0.9982717,
"aggregateRating": {
"reviewCount": 0
},
"descriptionHtml": "<article>\n\n<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more</p>\n\n</article>",
"color": "Books"
},
"algorithmVersion": "21.12.7"
}
]
Zyte API:
{
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"statusCode": 200,
"product": {
"name": "A Light in the Attic",
"price": "51.77",
"currency": "GBP",
"currencyRaw": "£",
"availability": "InStock",
"sku": "a897fe39b1053632",
"brand": {
"name": "Books to Scrape"
},
"breadcrumbs": [
{
"name": "Home",
"url": "https://books.toscrape.com/index.html"
},
{
"name": "Books",
"url": "https://books.toscrape.com/catalogue/category/books_1/index.html"
},
{
"name": "Poetry",
"url": "https://books.toscrape.com/catalogue/category/books/poetry_23/index.html"
},
{
"name": "A Light in the Attic"
}
],
"mainImage": {
"url": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
},
"images": [
{
"url": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
}
],
"description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
"descriptionHtml": "<article>\n\n<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more</p>\n\n</article>",
"aggregateRating": {
"reviewCount": 0
},
"additionalProperties": [
{
"name": "upc",
"value": "a897fe39b1053632"
},
{
"name": "product type",
"value": "Books"
},
{
"name": "price (excl. tax)",
"value": "£51.77"
},
{
"name": "price (incl. tax)",
"value": "£51.77"
},
{
"name": "tax",
"value": "£0.00"
},
{
"name": "availability",
"value": "In stock (22 available)"
},
{
"name": "number of reviews",
"value": "0"
}
],
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"metadata": {
"probability": 0.9947898387908936,
"dateDownloaded": "2023-06-13T08:19:46Z"
}
}
}
Schema changes
Data schemas in Zyte API are different from those used in Automatic Extraction.
Zyte API data schemas are based on Zyte Data schemas, implementing a subset of their fields. For detailed reference of the Zyte API data schemas, find the corresponding data type response section of the Zyte API reference.
Select a data type below to see how its schema has changed:
Product
- New fields: currency, features, metadata.dateDownloaded.
- Fields price, regularPrice, and availability, previously nested under offers, have been unnested. currency has been unnested and renamed to currencyRaw. offers has been removed.
  { "price": "9999.99", "regularPrice": "11999.99", "currency": "USD", "currencyRaw": "$", "availability": "InStock" }
- brand has become an object, with the brand name on the nested name field instead.
  { "brand": { "name": "Ka-pow" } }
- In breadcrumbs, the link nested field is now called url instead.
  { "breadcrumbs": [ { "url": "http://example.com/level1", "name": "Level 1" }, { "url": "http://example.com/level1/level2", "name": "Level 2" } ] }
- mainImage and the images list items are no longer strings, but objects with a url field instead.
  { "mainImage": { "url": "https://img.example.com/products/22.jpeg" }, "images": [ { "url": "https://img.example.com/products/22.jpeg" } ] }
- additionalProperty is now additionalProperties.
- probability is now nested under the new metadata field.
  { "metadata": { "probability": 0.9999 } }
- hasVariants is now variants, and its nested items are also affected by all root schema changes listed above.
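For code that still expects to feed on stored Automatic Extraction product data, the product changes listed above can be approximated with a conversion function. This is a sketch, not an official converter: migrate_product and _rename_key are hypothetical names, only the first entry of offers is used, and the new currency, features and metadata.dateDownloaded fields are left out because they cannot be derived from old data.

```python
def _rename_key(item, old_key, new_key):
    """Return a copy of a dict with one key renamed, if present."""
    return {(new_key if k == old_key else k): v for k, v in item.items()}


def migrate_product(old):
    """Reshape an Automatic Extraction product dict towards the Zyte API schema."""
    new = {}
    for key, value in old.items():
        if key == "offers":
            offer = value[0] if value else {}  # assumption: first offer only
            for field in ("price", "regularPrice", "availability"):
                if field in offer:
                    new[field] = offer[field]
            if "currency" in offer:
                new["currencyRaw"] = offer["currency"]
        elif key == "brand":
            new["brand"] = {"name": value}
        elif key == "breadcrumbs":
            new["breadcrumbs"] = [_rename_key(b, "link", "url") for b in value]
        elif key == "mainImage":
            new["mainImage"] = {"url": value}
        elif key == "images":
            new["images"] = [{"url": url} for url in value]
        elif key == "additionalProperty":
            new["additionalProperties"] = value
        elif key == "probability":
            new.setdefault("metadata", {})["probability"] = value
        elif key == "hasVariants":
            # Variants are affected by the same root-level changes.
            new["variants"] = [migrate_product(variant) for variant in value]
        else:
            new[key] = value
    return new
```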
Product list
- paginationNext has been moved to the new productNavigation data type as nextPage, with its nested text field renamed to name.
- paginationPrevious has been removed.
- New fields: products[].currency, metadata, categoryName.
- For items in products:
  - Fields price and regularPrice, previously nested under offers, have been unnested. currency has been unnested and renamed to currencyRaw. offers[].availability and offers have been removed.
    { "price": "9999.99", "regularPrice": "11999.99", "currency": "USD", "currencyRaw": "$" }
  - mainImage is no longer a string, but an object with a url field instead.
    { "mainImage": { "url": "https://img.example.com/products/22.jpeg" } }
  - probability is now nested under the new metadata field.
    { "metadata": { "probability": 0.9999 } }
  - The following fields have been removed: sku, brand, images, description, descriptionHtml, aggregateRating.
- In breadcrumbs, the link nested field is now called url instead.
  { "breadcrumbs": [ { "url": "http://example.com/level1", "name": "Level 1" }, { "url": "http://example.com/level1/level2", "name": "Level 2" } ] }
Article
- New fields: metadata.dateDownloaded.
- author and authorsList have been replaced by authors, a list of objects with name and nameRaw fields. Specifically, authors.name replaced authorsList and authors.nameRaw replaced author.
  { "authors": [ { "name": "Alice", "nameRaw": "Alice and Bob" }, { "name": "Bob", "nameRaw": "Alice and Bob" } ] }
- In breadcrumbs, the link nested field is now called url instead.
  { "breadcrumbs": [ { "url": "http://example.com/level1", "name": "Level 1" }, { "url": "http://example.com/level1/level2", "name": "Level 2" } ] }
- mainImage and the images list items are no longer strings, but objects with a url field instead.
  { "mainImage": { "url": "https://img.example.com/products/22.jpeg" }, "images": [ { "url": "https://img.example.com/products/22.jpeg" } ] }
- audioUrls and videoUrls have been replaced by audios and videos respectively, which are arrays of objects with url fields, rather than arrays of strings.
  { "audios": [ { "url": "https://audio.example.com/products/22.mp3" } ], "videos": [ { "url": "https://video.example.com/products/22.mp4" } ] }
- probability is now nested under the new metadata field.
  { "metadata": { "probability": 0.9999 } }
- The articleBodyRaw field has been removed.
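The author and media changes for articles can be illustrated with a small conversion function. It is a partial sketch: migrate_article is a hypothetical name, the breadcrumbs, image and probability changes are intentionally not handled here, and an author value without an accompanying authorsList is dropped because the text above does not say how to map it.

```python
def migrate_article(old):
    """Apply the author, audio/video and articleBodyRaw changes to an article dict."""
    replaced = {"author", "authorsList", "audioUrls", "videoUrls", "articleBodyRaw"}
    new = {key: value for key, value in old.items() if key not in replaced}
    if "authorsList" in old:
        raw_author = old.get("author")
        authors = []
        for name in old["authorsList"]:
            author = {"name": name}  # authors.name replaces authorsList
            if raw_author is not None:
                author["nameRaw"] = raw_author  # authors.nameRaw replaces author
            authors.append(author)
        new["authors"] = authors
    if "audioUrls" in old:
        new["audios"] = [{"url": url} for url in old["audioUrls"]]
    if "videoUrls" in old:
        new["videos"] = [{"url": url} for url in old["videoUrls"]]
    return new
```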
Article list
- paginationNext has been moved to the new articleNavigation data type as nextPage, with its nested text field renamed to name.
- paginationPrevious has been removed.
- New fields: metadata.
- For items in articles:
  - author and authorsList have been replaced by authors, a list of objects with name and nameRaw fields. Specifically, authors.name replaced authorsList and authors.nameRaw replaced author.
    { "authors": [ { "name": "Alice", "nameRaw": "Alice and Bob" }, { "name": "Bob", "nameRaw": "Alice and Bob" } ] }
  - mainImage and the images list items are no longer strings, but objects with a url field instead.
    { "mainImage": { "url": "https://img.example.com/products/22.jpeg" }, "images": [ { "url": "https://img.example.com/products/22.jpeg" } ] }
  - probability is now nested under the new metadata field.
    { "metadata": { "probability": 0.9999 } }
Job posting
- New fields: baseSalary.currency, datePublishedRaw, metadata.dateDownloaded.
- title is now jobTitle.
- datePosted is now datePublished.
- hiringOrganization.raw is now hiringOrganization.name.
- Under baseSalary:
  - value is now valueMax, and it is a number string instead of a float number.
  - currency is now currencyRaw.
  { "baseSalary": { "raw": "$53,251 a year", "valueMax": "53251.00", "currency": "USD", "currencyRaw": "$" } }
- probability is now nested under the new metadata field.
  { "metadata": { "probability": 0.9999 } }
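The job posting renames above can be sketched as a conversion function as well. migrate_job_posting is a hypothetical name. Formatting valueMax with two decimals is an assumption based on the example value "53251.00", and the new baseSalary.currency, datePublishedRaw and metadata.dateDownloaded fields are not produced because they cannot be derived from old data.

```python
def migrate_job_posting(old):
    """Reshape an Automatic Extraction job posting dict towards the Zyte API schema."""
    new = {}
    for key, value in old.items():
        if key == "title":
            new["jobTitle"] = value
        elif key == "datePosted":
            new["datePublished"] = value
        elif key == "hiringOrganization":
            new[key] = {("name" if k == "raw" else k): v for k, v in value.items()}
        elif key == "baseSalary":
            salary = {}
            for field, field_value in value.items():
                if field == "value":
                    # Float becomes a number string (two decimals assumed).
                    salary["valueMax"] = format(field_value, ".2f")
                elif field == "currency":
                    salary["currencyRaw"] = field_value
                else:
                    salary[field] = field_value
            new[key] = salary
        elif key == "probability":
            new.setdefault("metadata", {})["probability"] = value
        else:
            new[key] = value
    return new
```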