Migrating from Automatic Extraction to Zyte API#

Learn how to migrate from Automatic Extraction to Zyte API, which supports automatic extraction.

Key differences#

The following table summarizes the feature differences between the two products:

| Feature | Automatic Extraction | Zyte API |
| --- | --- | --- |
| Type support | Product, product list, article, article list, comment, forum post, review, real estate, vehicle, job posting | Product, product list, product navigation, article, article list, article navigation, job posting |
| Data schemas | | Improved |
| Extraction from raw HTML | No | Yes |
| Crawling | UI only | UI and API |
| Browser HTML | Premium only | Yes |
| Screenshots | No | Yes |
| Actions | No | Yes |
| Network capture | No | Yes |
| JavaScript toggle | No | Yes |
| Geolocation | No | Yes |
| Cookies | No | Yes |
| Sessions | No | Yes |
| Response headers | No | Yes |
| Browserless input | No | Yes |
| Batch queries | Yes | No |
| Custom input | Yes (customHtml) | No |
| Pricing | | More granular and flexible. For example, getting both automatic extraction and browser HTML on the same request no longer requires an Enterprise account. |

Updating your subscription#

Zyte API requires a separate subscription. Follow the getting started guide to get one.

Also remember to cancel your existing Automatic Extraction subscription once you have completed your migration to Zyte API automatic extraction.

Updating requests#

If you are using an HTTP client, check examples of making Zyte API calls from different languages, and update your API requests as follows:

  • Update your endpoint, from:
    https://autoextract.scrapinghub.com/v1/extract
    to:
    https://api.zyte.com/v1/extract
  • Update your API key to your Zyte API key.

  • Update your request body from an array of queries:
    [{"…": "…"}]
    to a single query object:
    {"…": "…"}

    Zyte API does not support query batching. If you were sending multiple queries per request, you must split them into separate requests with 1 query each.

  • Replace "pageType": "TYPE" with "TYPE": true.

    For example, replace "pageType": "product" with "product": true.

  • Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.

  • Replace fullHtml with browserHtml. See Browser HTML.

  • Remove articleBodyRaw; Zyte API can only return articleBodyHtml.

  • Remove customHtml; Zyte API does not support providing a custom HTML document as input.
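
The request body changes listed above can be sketched as a small Python helper. This is only an illustration under stated assumptions: migrate_query and split_batch are hypothetical names, not part of any Zyte library, and the sketch covers only the fields named in the list above.

```python
def migrate_query(ae_query):
    """Convert one Automatic Extraction query dict into a Zyte API query dict.

    Hypothetical helper: it only handles the fields named in this guide.
    """
    query = dict(ae_query)  # work on a copy, leave the input untouched

    # "pageType": "product" becomes "product": true.
    page_type = query.pop("pageType", None)
    if page_type is not None:
        query[page_type] = True

    # meta becomes echoData.
    if "meta" in query:
        query["echoData"] = query.pop("meta")

    # fullHtml becomes browserHtml.
    if "fullHtml" in query:
        query["browserHtml"] = query.pop("fullHtml")

    # These fields have no Zyte API counterpart and are dropped.
    query.pop("articleBodyRaw", None)
    query.pop("customHtml", None)
    return query


def split_batch(ae_batch):
    """Turn an array of queries into one request body per query."""
    return [migrate_query(query) for query in ae_batch]


batch = [
    {"url": "https://example.com/a", "pageType": "product", "meta": "item-1"},
    {"url": "https://example.com/b", "pageType": "article", "fullHtml": True},
]
bodies = split_batch(batch)  # send each element as its own request
```

Each element of bodies is meant to be sent as the body of a separate request. The helper does not check whether a page type is supported by Zyte API, so types that exist only in Automatic Extraction still need to be handled separately.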

Example (curl)

Automatic Extraction:

curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data '[{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]' \
    --compressed \
    https://autoextract.scrapinghub.com/v1/extract

Zyte API:

curl \
    --user YOUR_API_KEY: \
    --header 'Content-Type: application/json' \
    --data '{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}' \
    --compressed \
    https://api.zyte.com/v1/extract

To replace the command-line interface of zyte-autoextract:

  • Install python-zyte-api:

    pip install zyte-api
    
  • Replace the python -m autoextract command with zyte-api.

  • Update your API key to your Zyte API key.

    If you were setting your API key with the ZYTE_AUTOEXTRACT_KEY environment variable, use ZYTE_API_KEY instead now.

  • If you were using a list of URLs as input, switch to a JSON Lines file as input.

    Tip

    You do not need to pass --intype jl on the command line; zyte-api automatically detects your input format.

  • Instead of passing --page-type TYPE on the command line, use "TYPE": true in each query of your input JSON Lines file.

    For example, replace --page-type product on the command line with "product": true on every query.

  • If you were using a JSON Lines file as input:

    • Replace "pageType": "TYPE" with "TYPE": true.

      For example, replace "pageType": "product" with "product": true.

    • Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.

    • Replace fullHtml with browserHtml. See Browser HTML.

    • Remove articleBodyRaw; Zyte API can only return articleBodyHtml.

    • Remove customHtml; Zyte API does not support providing a custom HTML document as input.

  • If you are using --api-endpoint ENDPOINT, find out what your Zyte API endpoint is and use --api-url ENDPOINT instead, or remove the command-line parameter altogether to use the default endpoint.

  • Remove --batch-size NUMBER; Zyte API does not support query batching.

  • Remove --max-query-error-retries.

    zyte-api performs some retries automatically, but it does not allow customizing its retry policy from the command line, other than disabling error retries altogether with --dont-retry-errors.

  • Remove --disable-cert-validation.

    If you get SSL errors, install our CA certificate.
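
If your input was a plain list of URLs combined with --page-type, the switch to JSON Lines input described above could be scripted along these lines. This is a minimal sketch: urls_to_jsonl is a hypothetical helper, not part of zyte-api, and it uses only the Python standard library.

```python
import json


def urls_to_jsonl(urls, page_type):
    """Build JSON Lines input from a list of URLs and a single page type.

    Hypothetical helper: every query gets "<page_type>": true, which is
    what --page-type used to apply to all URLs.
    """
    lines = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines from the original URL list
        lines.append(json.dumps({"url": url, page_type: True}))
    return "".join(line + "\n" for line in lines)


jsonl = urls_to_jsonl(
    [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html\n",
        "\n",
    ],
    "product",
)
```

Writing the returned string to a file such as input.jsonl gives one query per line, in the same shape as the input.jsonl example in this section.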

Example

Automatic Extraction:

input.jsonl#
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}
python -m autoextract --intype jl input.jsonl

Zyte API:

input.jsonl#
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}
zyte-api input.jsonl

To replace the Python asyncio interface of zyte-autoextract:

  • Install python-zyte-api:

    pip install zyte-api
    
  • In your import statements, change
    autoextract
    to
    zyte_api
  • Instead of calling the request_raw and request_parallel_as_completed functions, create an instance of AsyncClient and call its same-name methods.

  • Update your query to be a single dict, instead of a list of dict or Request objects.

    Zyte API does not support query batching. If you were sending multiple queries per request, you must split them into separate requests with 1 query each.

  • Update your query dict or Request object to be a dict with the following field changes:

    • Remove pageType, use its previous value as a field name instead, and set it to True.

      For example, replace Request(pageType="product") with {"product": True}.

    • Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.

    • Remove articleBodyRaw; Zyte API can only return articleBodyHtml.

    • Replace fullHtml with browserHtml. See Browser HTML.

    • Remove extra.

      extra.customHtml has no replacement, as Zyte API does not support providing a custom HTML document as input.

  • Pass the api_key parameter to AsyncClient, with your Zyte API key as value.

    If you were setting your API key with the ZYTE_AUTOEXTRACT_KEY environment variable, use ZYTE_API_KEY instead now.

  • If you are using the endpoint parameter, find out what your Zyte API URL and endpoint are, and use api_url in AsyncClient (default: https://api.zyte.com/v1/) and endpoint in client methods (default: extract) instead, or omit the parameters altogether to use their default values.

  • If you are creating an aiohttp session with create_session, drop the disable_cert_validation parameter.

    If you get SSL errors, install our CA certificate.

  • Remove the agg_stats parameter, or pass it to AsyncClient instead.

  • Remove the max_query_error_retries parameter. To customize the retry policy, use the retrying parameter of AsyncClient instead.

  • If you are using request_raw:

    • Remove the handle_retries parameter. To customize the retry policy, use the retrying parameter of AsyncClient instead.

    • Remove the headers parameter; python-zyte-api does not support customizing the HTTP headers sent to Zyte API.

      Note

      Not to be confused with Zyte API parameters to set request headers: customHttpRequestHeaders (HTTP) and requestHeaders (browser).

    • Pass the retrying parameter to AsyncClient instead.

  • If you are using request_parallel_as_completed:

    • Pass the n_conn parameter to AsyncClient instead.

    • Remove the batch_size parameter; Zyte API does not support query batching.

Example

Automatic Extraction:

import asyncio

from autoextract.aio.client import request_raw


async def main():
    api_response = await request_raw(
        [
            {
                "url": (
                    "https://books.toscrape.com/catalogue"
                    "/a-light-in-the-attic_1000/index.html"
                ),
                "pageType": "product",
            },
        ],
    )
    print(api_response)


asyncio.run(main())

Zyte API:

import asyncio

from zyte_api.aio.client import AsyncClient


async def main():
    client = AsyncClient()
    api_response = await client.request_raw(
        {
            "url": (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            "product": True,
        },
    )
    print(api_response)


asyncio.run(main())

To replace the scrapy-autoextract middleware:

  • Install and configure scrapy-zyte-api following this page of the web scraping tutorial, including your Zyte API key.

  • Instead of setting a page type with the AUTOEXTRACT_PAGE_TYPE setting, the page_type spider attribute, or the autoextract.pageType request metadata key, set "zyte_api_automap": {"TYPE": True} on the request metadata, where TYPE is the target type, e.g. product.

    For example, replace Request(meta={"autoextract": {"pageType": "product"}}) with Request(meta={"zyte_api_automap": {"product": True}}).

  • If you are using the AUTOEXTRACT_URL setting, find out what your Zyte API endpoint is and use ZYTE_API_URL instead, or let the default endpoint be used.

  • scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_SLOT_POLICY setting; a per-domain policy is always used. Moreover, Zyte API and non-Zyte-API requests are always treated as targeting different domains.

  • If you are using the autoextract.extra request metadata key, map its values to values in the zyte_api_automap request metadata key as follows:

    • Replace meta with echoData, which can be any JSON structure, not only a string. See Metadata.

    • Replace fullHtml with browserHtml. See Browser HTML.

    • Remove articleBodyRaw; Zyte API can only return articleBodyHtml.

    • Remove customHtml; Zyte API does not support providing a custom HTML document as input.

  • Remove the autoextract.headers parameter; scrapy-zyte-api does not support customizing the HTTP headers sent to Zyte API.

    Note

    Not to be confused with Zyte API parameters to set request headers: customHttpRequestHeaders (HTTP) and requestHeaders (browser).
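
The request metadata changes above can be sketched as a plain dictionary conversion. This is an illustration only: migrate_meta and its default_page_type parameter are hypothetical (the latter stands in for a page type that used to come from the AUTOEXTRACT_PAGE_TYPE setting or the page_type spider attribute), and the sketch assumes the old metadata uses the enabled, pageType, extra and headers keys mentioned in this guide.

```python
def migrate_meta(meta, default_page_type=None):
    """Return a copy of Scrapy request meta that uses zyte_api_automap.

    Hypothetical helper covering only the keys named in this guide.
    """
    new_meta = dict(meta)
    ae = new_meta.pop("autoextract", None)
    if not ae or not ae.get("enabled"):
        # Automatic Extraction was not enabled for this request.
        return new_meta

    page_type = ae.get("pageType", default_page_type)
    if not page_type:
        raise ValueError("no page type found for this request")

    automap = {page_type: True}
    extra = ae.get("extra") or {}
    if "meta" in extra:
        automap["echoData"] = extra["meta"]  # meta becomes echoData
    if "fullHtml" in extra:
        automap["browserHtml"] = extra["fullHtml"]  # fullHtml becomes browserHtml
    # articleBodyRaw, customHtml and autoextract.headers are dropped.

    new_meta["zyte_api_automap"] = automap
    return new_meta


old_meta = {
    "autoextract": {
        "enabled": True,
        "pageType": "product",
        "extra": {"meta": "item-1", "customHtml": "<html></html>"},
    },
}
new_meta = migrate_meta(old_meta)
```

The resulting dictionary can be passed as the meta argument of a Scrapy Request.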

Example

Automatic Extraction:

from scrapy import Request, Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"

    def start_requests(self):
        yield Request(
            (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            meta={
                "autoextract": {
                    "enabled": True,
                    "pageType": "product",
                },
            },
        )

    def parse(self, response):
        print(response.meta["autoextract"])

Zyte API:

from scrapy import Request, Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"

    def start_requests(self):
        yield Request(
            (
                "https://books.toscrape.com/catalogue"
                "/a-light-in-the-attic_1000/index.html"
            ),
            meta={
                "zyte_api_automap": {
                    "product": True,
                },
            },
        )

    def parse(self, response):
        print(response.raw_api_response)

To replace the scrapy-autoextract page object providers:

  • Install and configure scrapy-zyte-api following this page of the web scraping tutorial, including your Zyte API key.

  • Upgrade your versions of web-poet and scrapy-poet. Chances are you are using old versions that still work with scrapy-autoextract.

  • Remove scrapy_autoextract.AutoExtractProvider from your SCRAPY_POET_PROVIDERS setting.

  • Replace autoextract_poet.pages.AutoExtract<type>Page with zyte_common_items.<type>, e.g. autoextract_poet.pages.AutoExtractProductPage with zyte_common_items.Product.

    Note

    zyte_common_items.Product is not a page object class but an item, i.e. the result of calling to_item() on a page object.

  • Replace autoextract_poet.pages.AutoExtractWebPage with web_poet.AnyResponse, which wraps web_poet.HttpResponse or web_poet.BrowserResponse. Which input is actually used depends on your custom page object dependencies, if any, or your extraction source, if defined (see Dependency annotations in scrapy-poet integration).

  • If you are using the AUTOEXTRACT_URL setting, find out what your Zyte API endpoint is and use ZYTE_API_URL instead, or let the default endpoint be used.

  • scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_MAX_QUERY_ERROR_RETRIES setting. See Retries to achieve something similar.

  • scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_CONCURRENT_REQUESTS_PER_DOMAIN setting. Use the CONCURRENT_REQUESTS_PER_DOMAIN setting instead.

  • scrapy-zyte-api does not provide a counterpart to the AUTOEXTRACT_CACHE_FILENAME and AUTOEXTRACT_CACHE_GZIP settings.

Example

Automatic Extraction:

from autoextract_poet.pages import AutoExtractProductPage
from scrapy import Spider
from scrapy_poet import DummyResponse


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    ]

    def parse(self, response: DummyResponse, product_page: AutoExtractProductPage):
        print(product_page.to_item())

Zyte API:

from scrapy import Spider
from scrapy_poet import DummyResponse
from zyte_common_items import Product


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    ]

    def parse(self, response: DummyResponse, product: Product):
        print(product)

Updating response expectations#

If you are using an HTTP client, update your API response expectations as follows:

  • You get a JSON object ({"…": "…"}), not an array ([{"…": "…"}]).

  • You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API.

  • There are no query, webPage, or algorithmVersion keys in the response.

    You can use metadata to replace query.

    A url key exists, but it is not the request URL that you get in query.userQuery.url, but the response URL, which could be different from the request URL, e.g. due to redirections.

  • Error response handling is similar, but rate limiting is more generous. See Zyte API error handling.
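
During a gradual migration, code that must accept parsed responses from either product could tell them apart by shape, as in the sketch below. unpack_response is a hypothetical helper, and the sample responses are trimmed-down assumptions modeled on the examples in this guide.

```python
def unpack_response(response, page_type):
    """Return the extracted item and the URL available in the response.

    Hypothetical helper: handles one result per response.
    """
    if isinstance(response, list):
        # Automatic Extraction: an array with one result per query.
        result = response[0]
        return {
            "item": result[page_type],
            # The request URL, as sent in the query.
            "request_url": result["query"]["userQuery"]["url"],
            "response_url": None,
        }
    # Zyte API: a single object. Its url key is the response URL, which
    # may differ from the request URL, e.g. due to redirections.
    return {
        "item": response[page_type],
        "request_url": None,
        "response_url": response["url"],
    }


ae_response = [
    {
        "query": {"userQuery": {"url": "https://example.com/p", "pageType": "product"}},
        "product": {"name": "A Light in the Attic"},
        "algorithmVersion": "21.12.7",
    }
]
zyte_api_response = {
    "url": "https://example.com/p",
    "statusCode": 200,
    "product": {"name": "A Light in the Attic"},
}

old = unpack_response(ae_response, "product")
new = unpack_response(zyte_api_response, "product")
```

Note that the item dictionaries themselves still follow different schemas, so only the envelope is unified here.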

If you are using the command-line interface of zyte-autoextract, update your API response expectations as follows:

  • You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.

  • There are no query, webPage, or algorithmVersion keys in the response.

    You can use metadata to replace query.

    A url key exists, but it is not the request URL that you get in query.userQuery.url, but the response URL, which could be different from the request URL, e.g. due to redirections.

  • Rate limiting is more generous. See Zyte API error handling.

If you are using the Python asyncio interface of zyte-autoextract, update your API response expectations as follows:

  • You get a dict ({"…": "…"}), not a list of dict ([{"…": "…"}]).

  • You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.

  • There are no query, webPage, or algorithmVersion keys in the response.

    You can use metadata to replace query.

    A url key exists, but it is not the request URL that you get in query.userQuery.url, but the response URL, which could be different from the request URL, e.g. due to redirections.

  • Rate limiting is more generous. See Zyte API error handling.

If you are using the scrapy-autoextract middleware, update your API response expectations as follows:

  • You also get a key matching your page type, e.g. product, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.

  • There are no original_url or timing meta keys in the response.

    You can read the original URL from response.request.url.

    There is no built-in alternative for the timing data. If you need it, you must implement it on your own, for example with a custom Scrapy downloader middleware.

  • Rate limiting is more generous. See Zyte API error handling.

  • scrapy-zyte-api is smarter about retries, at the cost of handling retries outside Scrapy. See Retries.
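
As a rough idea of the custom downloader middleware mentioned above for timing data, the sketch below records how long each request takes. It is an assumption-laden illustration: TimingMiddleware and the timing_start and timing_seconds meta keys are hypothetical names, the method signatures follow the long-standing Scrapy downloader middleware interface, and the result is a single elapsed time rather than the detailed timing data of scrapy-autoextract.

```python
import time


class TimingMiddleware:
    """Hypothetical downloader middleware that records elapsed time."""

    def process_request(self, request, spider):
        # Remember when the request entered the downloader.
        request.meta["timing_start"] = time.monotonic()
        return None  # let Scrapy continue processing the request

    def process_response(self, request, response, spider):
        # Store the elapsed time, in seconds, on the request metadata.
        start = request.meta.pop("timing_start", None)
        if start is not None:
            request.meta["timing_seconds"] = time.monotonic() - start
        return response
```

To use something like this, you would enable the class in the DOWNLOADER_MIDDLEWARES setting of your project and read response.meta["timing_seconds"] in your callback.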

If you are using scrapy-autoextract page object providers, update your API response expectations as follows:

  • zyte_common_items.Product is not a page object but an item, i.e. the result of calling to_item() on a page object.

    Its API is also slightly different from that of autoextract_poet.items.Product, which is what autoextract_poet.AutoExtractProductPage.to_item() returns.

    See the zyte_common_items.Product API reference.

  • Rate limiting is more generous. See Zyte API error handling.

Example

Automatic Extraction:

[
  {
    "query": {
      "id": "1686644367537-712b96d0aa96c12a",
      "domain": "toscrape.com",
      "userAgent": "curl/8.1.0",
      "userQuery": {
        "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "pageType": "product"
      }
    },
    "webPage": {
      "inLanguages": [
        {
          "code": "en"
        }
      ]
    },
    "product": {
      "name": "A Light in the Attic",
      "description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
      "mainImage": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg",
      "images": [
        "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
      ],
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "additionalProperty": [
        {
          "name": "upc",
          "value": "a897fe39b1053632"
        },
        {
          "name": "product type",
          "value": "Books"
        },
        {
          "name": "price (excl. tax)",
          "value": "£51.77"
        },
        {
          "name": "price (incl. tax)",
          "value": "£51.77"
        },
        {
          "name": "tax",
          "value": "£0.00"
        },
        {
          "name": "availability",
          "value": "In stock (22 available)"
        },
        {
          "name": "number of reviews",
          "value": "0"
        }
      ],
      "offers": [
        {
          "price": "51.77",
          "currency": "£",
          "availability": "InStock"
        }
      ],
      "sku": "1000",
      "breadcrumbs": [
        {
          "name": "Home",
          "link": "https://books.toscrape.com/index.html"
        },
        {
          "name": "Books",
          "link": "https://books.toscrape.com/catalogue/category/books_1/index.html"
        },
        {
          "name": "Poetry",
          "link": "https://books.toscrape.com/catalogue/category/books/poetry_23/index.html"
        },
        {
          "name": "A Light in the Attic"
        }
      ],
      "probability": 0.9982717,
      "aggregateRating": {
        "reviewCount": 0
      },
      "descriptionHtml": "<article>\n\n<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more</p>\n\n</article>",
      "color": "Books"
    },
    "algorithmVersion": "21.12.7"
  }
]

Zyte API:

{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "statusCode": 200,
  "product": {
    "name": "A Light in the Attic",
    "price": "51.77",
    "currency": "GBP",
    "currencyRaw": "£",
    "availability": "InStock",
    "sku": "a897fe39b1053632",
    "brand": {
      "name": "Books to Scrape"
    },
    "breadcrumbs": [
      {
        "name": "Home",
        "url": "https://books.toscrape.com/index.html"
      },
      {
        "name": "Books",
        "url": "https://books.toscrape.com/catalogue/category/books_1/index.html"
      },
      {
        "name": "Poetry",
        "url": "https://books.toscrape.com/catalogue/category/books/poetry_23/index.html"
      },
      {
        "name": "A Light in the Attic"
      }
    ],
    "mainImage": {
      "url": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
    },
    "images": [
      {
        "url": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
      }
    ],
    "description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
    "descriptionHtml": "<article>\n\n<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more</p>\n\n</article>",
    "aggregateRating": {
      "reviewCount": 0
    },
    "additionalProperties": [
      {
        "name": "upc",
        "value": "a897fe39b1053632"
      },
      {
        "name": "product type",
        "value": "Books"
      },
      {
        "name": "price (excl. tax)",
        "value": "£51.77"
      },
      {
        "name": "price (incl. tax)",
        "value": "£51.77"
      },
      {
        "name": "tax",
        "value": "£0.00"
      },
      {
        "name": "availability",
        "value": "In stock (22 available)"
      },
      {
        "name": "number of reviews",
        "value": "0"
      }
    ],
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "metadata": {
      "probability": 0.9947898387908936,
      "dateDownloaded": "2023-06-13T08:19:46Z"
    }
  }
}

Schema changes#

Data schemas in Zyte API are different from those used in Automatic Extraction.

Zyte API data schemas are based on Zyte Data schemas, implementing a subset of their fields. For detailed reference of the Zyte API data schemas, find the corresponding data type response section of the Zyte API reference.

The schema of each supported data type has changed as follows:

Product:

  • New fields: currency, features, metadata.dateDownloaded.

  • Fields price, regularPrice, and availability, previously nested under offers, have been unnested. currency has been unnested and renamed to currencyRaw. offers has been removed.

    {
        "price": "9999.99",
        "regularPrice": "11999.99",
        "currency": "USD",
        "currencyRaw": "$",
        "availability": "InStock"
    }
    
  • brand has become an object, with the brand name on the nested name field instead.

    {
        "brand": {
            "name": "Ka-pow"
        }
    }
    
  • In breadcrumbs, the link nested field is now called url instead.

    {
        "breadcrumbs": [
            {
                "url": "http://example.com/level1",
                "name": "Level 1"
            },
            {
                "url": "http://example.com/level1/level2",
                "name": "Level 2"
            }
        ]
    }
    
  • mainImage and the images list items are no longer strings, but objects with a url field instead.

    {
        "mainImage": {
            "url": "https://img.example.com/products/22.jpeg"
        },
        "images": [
            {
                "url": "https://img.example.com/products/22.jpeg"
            }
        ]
    }
    
  • additionalProperty is now additionalProperties.

  • probability is now nested under the new metadata field.

    {
        "metadata": {
            "probability": 0.9999
        }
    }
    
  • hasVariants is now variants, and its nested items are also affected by all root schema changes listed above.

  • paginationNext has been moved to the new productNavigation data type as nextPage, with its nested text field renamed to name. paginationPrevious has been removed.
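
To make the product field mapping above concrete, the following sketch rearranges a product dictionary in the old layout into the new one. It is an illustration only, for example for converting records stored before the migration: upgrade_product is a hypothetical helper, it keeps only the first offer (an assumption), and it cannot produce the new currency, features or metadata.dateDownloaded fields because the old data does not contain them.

```python
def upgrade_product(old):
    """Rearrange an old-layout product dict into the new layout (sketch)."""
    moved = {
        "offers", "brand", "breadcrumbs", "mainImage", "images",
        "additionalProperty", "probability", "hasVariants",
        "paginationNext", "paginationPrevious",
    }
    # Fields not affected by the changes are copied as they are.
    new = {key: value for key, value in old.items() if key not in moved}

    # price, regularPrice and availability are unnested from offers;
    # the old currency becomes currencyRaw. Assumption: first offer only.
    offers = old.get("offers") or []
    if offers:
        for field in ("price", "regularPrice", "availability"):
            if field in offers[0]:
                new[field] = offers[0][field]
        if "currency" in offers[0]:
            new["currencyRaw"] = offers[0]["currency"]

    if "brand" in old:
        new["brand"] = {"name": old["brand"]}
    if "breadcrumbs" in old:
        # The nested link field is renamed to url.
        new["breadcrumbs"] = [
            {("url" if key == "link" else key): value for key, value in crumb.items()}
            for crumb in old["breadcrumbs"]
        ]
    if "mainImage" in old:
        new["mainImage"] = {"url": old["mainImage"]}
    if "images" in old:
        new["images"] = [{"url": url} for url in old["images"]]
    if "additionalProperty" in old:
        new["additionalProperties"] = old["additionalProperty"]
    if "probability" in old:
        new["metadata"] = {"probability": old["probability"]}
    if "hasVariants" in old:
        # Variants are affected by the same root-level changes.
        new["variants"] = [upgrade_product(variant) for variant in old["hasVariants"]]
    return new


old_product = {
    "name": "A Light in the Attic",
    "brand": "Ka-pow",
    "mainImage": "https://img.example.com/products/22.jpeg",
    "images": ["https://img.example.com/products/22.jpeg"],
    "offers": [{"price": "51.77", "currency": "£", "availability": "InStock"}],
    "breadcrumbs": [
        {"name": "Level 1", "link": "http://example.com/level1"},
        {"name": "Level 2"},
    ],
    "additionalProperty": [{"name": "upc", "value": "a897fe39b1053632"}],
    "probability": 0.9999,
}
new_product = upgrade_product(old_product)
```

Pagination fields are dropped rather than converted, since they now belong to a different data type.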

Product list:

  • New fields: products[].currency, metadata, categoryName.

  • For items in products:

    • Fields price and regularPrice, previously nested under offers, have been unnested. currency has been unnested and renamed to currencyRaw. offers[].availability and offers have been removed.

      {
          "price": "9999.99",
          "regularPrice": "11999.99",
          "currency": "USD",
          "currencyRaw": "$"
      }
      
    • mainImage is no longer a string, but an object with a url field instead.

      {
          "mainImage": {
              "url": "https://img.example.com/products/22.jpeg"
          }
      }
      
    • probability is now nested under the new metadata field.

      {
          "metadata": {
              "probability": 0.9999
          }
      }
      
    • The following fields have been removed: sku, brand, images, description, descriptionHtml, aggregateRating.

  • In breadcrumbs, the link nested field is now called url instead.

    {
        "breadcrumbs": [
            {
                "url": "http://example.com/level1",
                "name": "Level 1"
            },
            {
                "url": "http://example.com/level1/level2",
                "name": "Level 2"
            }
        ]
    }
    
Article:

  • New fields: metadata.dateDownloaded.

  • author and authorsList have been replaced by authors, a list of objects with name and nameRaw fields. Specifically, authors.name replaced authorsList and authors.nameRaw replaced author.

    {
        "authors": [
            {
                "name": "Alice",
                "nameRaw": "Alice and Bob"
            },
            {
                "name": "Bob",
                "nameRaw": "Alice and Bob"
            }
        ]
    }
    
  • In breadcrumbs, the link nested field is now called url instead.

    {
        "breadcrumbs": [
            {
                "url": "http://example.com/level1",
                "name": "Level 1"
            },
            {
                "url": "http://example.com/level1/level2",
                "name": "Level 2"
            }
        ]
    }
    
  • mainImage and the images list items are no longer strings, but objects with a url field instead.

    {
        "mainImage": {
            "url": "https://img.example.com/products/22.jpeg"
        },
        "images": [
            {
                "url": "https://img.example.com/products/22.jpeg"
            }
        ]
    }
    
  • audioUrls and videoUrls have been replaced by audios and videos respectively, which are arrays of objects with url fields, rather than arrays of strings.

    {
        "audios": [
            {
                "url": "https://audio.example.com/products/22.mp3"
            }
        ],
        "videos": [
            {
                "url": "https://video.example.com/products/22.mp4"
            }
        ]
    }
    
  • probability is now nested under the new metadata field.

    {
        "metadata": {
            "probability": 0.9999
        }
    }
    
  • The articleBodyRaw field has been removed.

  • paginationNext has been moved to the new articleNavigation data type as nextPage, with its nested text field renamed to name. paginationPrevious has been removed.
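
The article changes above, in particular the merge of author and authorsList into authors, can be sketched as follows. upgrade_article is a hypothetical helper for converting dictionaries in the old layout, for example stored records; it cannot produce the new metadata.dateDownloaded field, and it simply drops the fields that were removed or moved to another data type.

```python
def upgrade_article(old):
    """Rearrange an old-layout article dict into the new layout (sketch)."""
    changed = {
        "author", "authorsList", "breadcrumbs", "mainImage", "images",
        "audioUrls", "videoUrls", "probability", "articleBodyRaw",
        "paginationNext", "paginationPrevious",
    }
    new = {key: value for key, value in old.items() if key not in changed}

    # authorsList entries become authors[].name; the old author string
    # becomes authors[].nameRaw on every entry.
    authors = []
    for name in old.get("authorsList") or []:
        author = {"name": name}
        if old.get("author") is not None:
            author["nameRaw"] = old["author"]
        authors.append(author)
    if authors:
        new["authors"] = authors

    if "breadcrumbs" in old:
        new["breadcrumbs"] = [
            {("url" if key == "link" else key): value for key, value in crumb.items()}
            for crumb in old["breadcrumbs"]
        ]
    if "mainImage" in old:
        new["mainImage"] = {"url": old["mainImage"]}
    if "images" in old:
        new["images"] = [{"url": url} for url in old["images"]]
    # Arrays of strings become arrays of objects with a url field.
    if "audioUrls" in old:
        new["audios"] = [{"url": url} for url in old["audioUrls"]]
    if "videoUrls" in old:
        new["videos"] = [{"url": url} for url in old["videoUrls"]]
    if "probability" in old:
        new["metadata"] = {"probability": old["probability"]}
    return new


old_article = {
    "headline": "Example headline",
    "author": "Alice and Bob",
    "authorsList": ["Alice", "Bob"],
    "audioUrls": ["https://audio.example.com/products/22.mp3"],
    "videoUrls": [],
    "articleBodyRaw": "<p>Example</p>",
    "probability": 0.9999,
}
new_article = upgrade_article(old_article)
```

With the sample input, authors ends up as one object per name, each carrying the original author string as nameRaw, which matches the authors example in the list above.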

Article list:

  • New fields: metadata.

  • For items in articles:

    • author and authorsList have been replaced by authors, a list of objects with name and nameRaw fields. Specifically, authors.name replaced authorsList and authors.nameRaw replaced author.

      {
          "authors": [
              {
                  "name": "Alice",
                  "nameRaw": "Alice and Bob"
              },
              {
                  "name": "Bob",
                  "nameRaw": "Alice and Bob"
              }
          ]
      }
      
    • mainImage and the images list items are no longer strings, but objects with a url field instead.

      {
          "mainImage": {
              "url": "https://img.example.com/products/22.jpeg"
          },
          "images": [
              {
                  "url": "https://img.example.com/products/22.jpeg"
              }
          ]
      }
      
    • probability is now nested under the new metadata field.

      {
          "metadata": {
              "probability": 0.9999
          }
      }
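
If some of your code still reads the old flat values, the field changes listed above can be bridged with a small helper. The following is only a sketch working on plain dictionaries: `flatten_new_item` is a hypothetical name, it is not part of zyte-common-items, and it only covers the fields shown in the examples above.

```python
def flatten_new_item(item: dict) -> dict:
    """Hypothetical helper: read new-schema fields as old-style flat values."""
    main_image = item.get("mainImage")
    authors = item.get("authors", [])
    return {
        # Images are now objects with a "url" field instead of strings.
        "mainImage": main_image["url"] if main_image else None,
        "images": [image["url"] for image in item.get("images", [])],
        # audios/videos replaced audioUrls/videoUrls.
        "audioUrls": [audio["url"] for audio in item.get("audios", [])],
        "videoUrls": [video["url"] for video in item.get("videos", [])],
        # probability now lives under metadata.
        "probability": item.get("metadata", {}).get("probability"),
        # authors.name replaced authorsList; authors.nameRaw replaced author.
        "authorsList": [author["name"] for author in authors],
        "author": authors[0]["nameRaw"] if authors else None,
    }


sample = {
    "mainImage": {"url": "https://img.example.com/products/22.jpeg"},
    "images": [{"url": "https://img.example.com/products/22.jpeg"}],
    "audios": [{"url": "https://audio.example.com/products/22.mp3"}],
    "videos": [{"url": "https://video.example.com/products/22.mp4"}],
    "metadata": {"probability": 0.9999},
    "authors": [
        {"name": "Alice", "nameRaw": "Alice and Bob"},
        {"name": "Bob", "nameRaw": "Alice and Bob"},
    ],
}
flat = flatten_new_item(sample)
```

Missing fields fall back to `None` or an empty list, so the helper can be applied to partial items as well.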
      

Keeping the old schema#

We recommend migrating to the new schema to get richer, better-typed data.

However, if you use Python, you can speed up your initial migration by automatically downgrading new items to the old schema, so that you can postpone updating your code and processes that still rely on the old schema.

First install or upgrade zyte-common-items:

pip install --upgrade zyte-common-items

Then use it as follows:

If you have migrated from the Python asyncio interface of zyte-autoextract to that of python-zyte-api, you can convert your extracted data with zyte_common_items.ae.downgrade.

For example, for a product:

from zyte_common_items import Product, ae

...

zyte_api_product = Product.from_dict(response["product"])
ae_product = ae.downgrade(zyte_api_product)

The resulting object is compatible with itemadapter, e.g. you can turn it into a dictionary as follows:

from itemadapter import ItemAdapter

ae_product_dict = ItemAdapter(ae_product).asdict()

If you have migrated from the scrapy-autoextract middleware to scrapy-zyte-api (without scrapy-poet), you have 2 options to convert your extracted data.

If you keep the extracted data unchanged, e.g. your callback looks something like this:

from scrapy import Spider

...

class MySpider(Spider):

    ...

    def parse(self, response):
        yield response.raw_api_response["product"]

You can change your callback to something like this:

from scrapy import Spider
from zyte_common_items import Product

...

class MySpider(Spider):

    ...

    def parse(self, response):
        yield Product.from_dict(response.raw_api_response["product"])

And enable the AEPipeline item pipeline to convert your data:

settings.py#
ITEM_PIPELINES = {
    "zyte_common_items.pipelines.AEPipeline": 500,
}

Tip

If you have item pipelines that rely on the old schema, you might need to use a value lower than theirs, instead of 500, to run AEPipeline before them.

Also note that AEPipeline returns an attrs object instead of a dict. Use itemadapter to interact with items.
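
As an illustration of the ordering tip above, the following sketch assumes a hypothetical project pipeline, `myproject.pipelines.LegacySchemaPipeline`, that still expects the old schema. Scrapy runs item pipelines in ascending order of their values, so giving AEPipeline a lower value makes it run first. The values 400 and 500 are arbitrary examples.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    # Lower value: runs first and converts items to the old schema.
    "zyte_common_items.pipelines.AEPipeline": 400,
    # Hypothetical pipeline that still relies on the old schema.
    "myproject.pipelines.LegacySchemaPipeline": 500,
}

# Pipelines execute in ascending order of their values.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```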

However, if you do make changes to the extracted data, check if you still need those changes. If you do, you have 2 options:

  • Update your custom extraction code to use the new schema, and let the item pipeline change the item schema later. For example:

    from scrapy import Spider
    from zyte_common_items import Product
    
    ...
    
    class MySpider(Spider):
    
        ...
    
        def parse(self, response):
            product = Product.from_dict(response.raw_api_response["product"])
            product.price = response.css(".hidden-price").get()
            yield product
    
  • Change the item schema at the beginning of your callback, before your custom code. For example:

    from scrapy import Spider
    from zyte_common_items import Product, ae
    
    ...
    
    class MySpider(Spider):
    
        ...
    
        def parse(self, response):
            zyte_api_product = Product.from_dict(response.raw_api_response["product"])
            ae_product = ae.downgrade(zyte_api_product)
            ae_product.offers = [
                ae.AEOffer(
                    price=response.css(".hidden-price").get(),
                )
            ]
            yield ae_product
    

    The object that ae.downgrade returns is compatible with itemadapter, e.g. you can turn it into a dictionary as follows:

    from itemadapter import ItemAdapter
    
    ae_product_dict = ItemAdapter(ae_product).asdict()
    

If you have migrated from scrapy-autoextract page object providers to scrapy-zyte-api page object providers, you have 2 options to convert your extracted data.

If you keep the extracted data unchanged, e.g. you have no custom page object class and your callback looks something like this:

from scrapy import Spider
from scrapy_poet import DummyResponse
from zyte_common_items import Product

...

class MySpider(Spider):

    ...

    def parse(self, response: DummyResponse, product: Product):
        yield product

Enable the AEPipeline item pipeline to convert your data:

settings.py#
ITEM_PIPELINES = {
    "zyte_common_items.pipelines.AEPipeline": 500,
}

Tip

If you have item pipelines that rely on the old schema, you might need to use a value lower than theirs, instead of 500, to run AEPipeline before them.

Also note that AEPipeline returns an attrs object instead of a dict. Use itemadapter to interact with items.

However, if you do make changes to the extracted data, check if you still need those changes. If you do, you have 2 options:

  • Upgrade your custom extraction code, and let the item pipeline change the item schema later.

    For every website or set of websites that requires custom extraction code, write a page object class that subclasses an Auto-prefixed page object class (e.g. zyte_common_items.AutoProductPage) and has a proper rule, and use fields to implement your custom extraction code. For example:

    myproject/pages/books_toscrape_com.py#
    import attrs
    from web_poet import AnyResponse, field, handle_urls
    from zyte_common_items import AutoProductPage
    
    
    @handle_urls("books.toscrape.com")
    @attrs.define
    class BooksToScrapeComProductPage(AutoProductPage):
        response: AnyResponse
    
        @field
        def price(self):
            # Return the extracted string, not the selector list.
            return self.response.css(".hidden-price").get()
    

    Tip

    The order of decorators matters. Also, see Overriding parsing for another example.

    Make sure you migrate all your custom extraction code into those page object classes, both from old-style custom page object classes, where extraction logic lived in the to_item method instead of fields, and from your callback, which should no longer contain any extraction logic.

    Also make sure you point the SCRAPY_POET_DISCOVER setting to a module containing your new page object classes, directly or indirectly. We recommend having a pages module under your project, e.g. myproject.pages, and keeping all page objects there, so that you can use:

    settings.py#
    SCRAPY_POET_DISCOVER = ["myproject.pages"]
    
  • Move all your custom extraction code to your callback, and change the item schema at the beginning of your callback, before your custom code. For example:

    from scrapy import Spider
    from scrapy_poet import DummyResponse
    from web_poet import AnyResponse
    from zyte_common_items import Product, ae
    
    ...
    
    class MySpider(Spider):
    
        ...
    
        def parse(self, response: DummyResponse, zyte_api_product: Product, _response: AnyResponse):
            ae_product = ae.downgrade(zyte_api_product)
            ae_product.offers = [
                ae.AEOffer(
                    price=_response.css(".hidden-price").get(),
                )
            ]
            yield ae_product
    

    The object that ae.downgrade returns is compatible with itemadapter, e.g. you can turn it into a dictionary as follows:

    from itemadapter import ItemAdapter
    
    ae_product_dict = ItemAdapter(ae_product).asdict()
    

Handling custom extraction code#

On average, Zyte API automatic extraction performs better than Zyte Automatic Extraction. If your code includes custom extraction code to fix extraction issues from Zyte Automatic Extraction, see if you can remove some of that code now.

If the output of Zyte API is still imperfect, please open a support ticket indicating which field is not being properly extracted on which website.