Migrating from scrapy-zyte-smartproxy to scrapy-zyte-api#

This migration guide provides the steps necessary to migrate from scrapy-zyte-smartproxy or scrapy-crawlera to scrapy-zyte-api.

Note

If you use Smart Proxy Manager, see Migrating from Smart Proxy Manager to Zyte API for general migration information.

Maybe keep scrapy-zyte-smartproxy#

If you use scrapy-zyte-smartproxy to integrate Scrapy with Smart Proxy Manager, and you only want to migrate to Zyte API for better ban avoidance or pricing, you can keep using scrapy-zyte-smartproxy: versions 2.3.1 and higher support the proxy mode of Zyte API.

Tip

If you are using scrapy-crawlera, you first need to migrate to scrapy-zyte-smartproxy to use Zyte API proxy mode. See the release notes of scrapy-zyte-smartproxy 2.0.0 for details. It might be worth migrating directly to scrapy-zyte-api instead.

To switch from Smart Proxy Manager to the proxy mode of Zyte API, replace your Smart Proxy Manager API key with your Zyte API key, and set the ZYTE_SMARTPROXY_URL setting to "http://api.zyte.com:8011". Alternatively, you can enable Zyte API proxy mode for specific requests.
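As a minimal sketch, the switch described above could look as follows in settings.py. It assumes that your project enables the plugin through ZYTE_SMARTPROXY_ENABLED and passes its key through ZYTE_SMARTPROXY_APIKEY; the key value is a placeholder.

```python
# settings.py (sketch): point scrapy-zyte-smartproxy at Zyte API proxy mode.

# Assumption: the plugin is enabled project-wide through this setting.
ZYTE_SMARTPROXY_ENABLED = True

# Placeholder: use your Zyte API key here, not your Smart Proxy Manager key.
# Assumption: your project passes the key through this setting.
ZYTE_SMARTPROXY_APIKEY = "YOUR_ZYTE_API_KEY"

# Proxy-mode endpoint of Zyte API, as given in the paragraph above.
ZYTE_SMARTPROXY_URL = "http://api.zyte.com:8011"
```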

You should also add 520 and 521 to the RETRY_HTTP_CODES setting:

settings.py#
from scrapy.settings.default_settings import RETRY_HTTP_CODES as DEFAULT_RETRY_HTTP_CODES

RETRY_HTTP_CODES = DEFAULT_RETRY_HTTP_CODES + [520, 521]

scrapy-zyte-smartproxy automatically translates Smart Proxy Manager headers into their Zyte API counterparts where possible, and drops them otherwise. You should still update your headers eventually; see Parameter mapping.

Using scrapy-zyte-smartproxy for Zyte API makes it easier to migrate from Smart Proxy Manager. However, the proxy mode of Zyte API has limitations. Continue reading to learn how to migrate to scrapy-zyte-api and get access to all Zyte API features.

Tip

You can keep both scrapy-zyte-smartproxy and scrapy-zyte-api, and use one or the other for different requests or spiders.

Set up scrapy-zyte-api#

  1. You need Python 3.8 or higher to use the latest version of scrapy-zyte-api.

  2. You need Scrapy 2.0.1 or higher to use the latest version of scrapy-zyte-api.

    If you are using a lower version of Scrapy, please upgrade to a higher Scrapy version, and make sure your code works as expected with the newer Scrapy version before you continue the migration process.

    The Scrapy release notes of every Scrapy version cover backward-incompatible changes and deprecation removals, which should help you upgrade your existing code as you upgrade Scrapy.

  3. Install the latest version of scrapy-zyte-api:

    pip install --upgrade scrapy-zyte-api
    
  4. Configure scrapy-zyte-api in your settings.py file. If your Scrapy version is 2.10 or higher, add the following settings:

    ADDONS = {
        "scrapy_zyte_api.Addon": 500,
    }
    ZYTE_API_TRANSPARENT_MODE = False
    

    Otherwise add the following settings:

    DOWNLOAD_HANDLERS = {
        "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
        "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    }
    DOWNLOADER_MIDDLEWARES = {
        "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
    }
    REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
    SPIDER_MIDDLEWARES = {
        "scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware": 100,
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    

    If any of these settings already exists in your settings.py file, modify the existing setting as needed instead of re-defining it. For example, if you already have DOWNLOADER_MIDDLEWARES defined, add "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000, to your existing definition, keeping existing downloader middlewares untouched.

    Also, make sure that these settings are not being overridden elsewhere. For example, make sure they are not defined more than once in your settings.py file, and that they are not overridden in your Scrapy Cloud project settings.

    Note

    On projects that were not using the asyncio Twisted reactor, your existing code may need changes, such as:

    • Handling a pre-installed Twisted reactor.

      Some Twisted imports install the default, non-asyncio Twisted reactor as a side effect. Once a reactor is installed, it cannot be changed for the whole run time.

    • Converting Twisted Deferreds into asyncio Futures.

      Note that you might be using Deferreds without realizing it through some Scrapy functions and methods. For example, when you yield the return value of self.crawler.engine.download() from a spider callback, you are yielding a Deferred.

  5. Add your API key to settings.py as well:

    ZYTE_API_KEY = "YOUR_API_KEY"
    
  6. To enable cookie support, the COOKIES_ENABLED setting is not enough. You must also define an additional setting in settings.py:

    ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED = True
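For the case in step 4 where a setting already exists, the following sketch shows one way to merge the scrapy-zyte-api entry into an existing DOWNLOADER_MIDDLEWARES definition. The "myproject.middlewares.CustomMiddleware" entry and its priority are hypothetical stand-ins for whatever your project already defines.

```python
# settings.py (sketch): extend an existing setting instead of re-defining it.
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical pre-existing entry; keep your own entries untouched.
    "myproject.middlewares.CustomMiddleware": 543,
    # Entry added for scrapy-zyte-api, with the priority given in step 4.
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
```

Defining DOWNLOADER_MIDDLEWARES a second time further down the file would replace the whole dictionary, which is why a single merged definition is safer.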
    

Migrate#

Your next steps depend on how you want to approach your migration. You can migrate some requests, migrate some spiders, or migrate your entire project.

Migrate a request#

Migrating requests makes sense if you want to keep scrapy-zyte-smartproxy but you need to drive specific requests through scrapy-zyte-api to overcome the limitations of proxy mode.

To migrate a Scrapy request, set the following fields in the request metadata:

yield Request(
    ...,
    meta={
        "dont_proxy": True,
        "zyte_api_automap": True,
    },
)

Tip

If your spider stops with the plugin_conflict finish reason, make sure the ZYTE_API_TRANSPARENT_MODE setting is False. Only set ZYTE_API_TRANSPARENT_MODE to True when migrating an entire spider or project.
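If many requests need the same two metadata keys, a small helper can reduce repetition. The sketch below is not part of scrapy-zyte-api; zyte_api_request_meta is a hypothetical name, and the function only builds the dictionary shown in the snippet above.

```python
from typing import Any, Dict, Optional


def zyte_api_request_meta(meta: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of ``meta`` with the keys used above to migrate a request.

    Hypothetical helper: the input dictionary is not modified.
    """
    merged = dict(meta or {})
    # Same keys as in the snippet above: keep the request out of the proxy
    # and let scrapy-zyte-api handle it with automatic parameter mapping.
    merged["dont_proxy"] = True
    merged["zyte_api_automap"] = True
    return merged


# Example with a made-up "page" key standing in for your own metadata.
example_meta = zyte_api_request_meta({"page": 2})
```

In a spider callback, the result could be passed as the meta argument of a request, e.g. meta=zyte_api_request_meta({"page": 2}).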

Migrate a spider#

Compared to migrating an entire project, migrating spiders one by one can be more time-consuming, but it is also less disruptive, and it gives you time to validate the migration of each spider separately.

To migrate a Scrapy spider, use custom_settings or update_settings to toggle scrapy-zyte-smartproxy, scrapy-crawlera, and scrapy-zyte-api:

from scrapy import Spider


class MySpider(Spider):
    custom_settings = {
        "ZYTE_API_TRANSPARENT_MODE": True,
        "ZYTE_SMARTPROXY_ENABLED": False,
        "CRAWLERA_ENABLED": False,  # Only needed if you use scrapy-crawlera
    }

To check that the migration was successful, look at the stats of a crawl after migration. There should be stats prefixed with scrapy-zyte-api, which indicate scrapy-zyte-api usage. There should be no scrapy-zyte-smartproxy stats, which are prefixed with either zyte_smartproxy (Smart Proxy Manager) or zyte_api_proxy (Zyte API), and no scrapy-crawlera stats, which are prefixed with crawlera.
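The stats check can also be scripted. The sketch below works on a plain dictionary of stats and uses only the prefixes named above; the function names and the sample stat keys are illustrative assumptions, not part of any plugin.

```python
from typing import Any, Dict, List

# Prefixes named in the text above.
NEW_PREFIX = "scrapy-zyte-api"
LEGACY_PREFIXES = ("zyte_smartproxy", "zyte_api_proxy", "crawlera")


def legacy_stat_keys(stats: Dict[str, Any]) -> List[str]:
    """Return the stat keys left behind by scrapy-zyte-smartproxy or scrapy-crawlera."""
    return sorted(key for key in stats if key.startswith(LEGACY_PREFIXES))


def migration_looks_successful(stats: Dict[str, Any]) -> bool:
    """True if scrapy-zyte-api stats are present and no legacy stats remain."""
    uses_new_plugin = any(key.startswith(NEW_PREFIX) for key in stats)
    return uses_new_plugin and not legacy_stat_keys(stats)


# Illustrative stats only; real crawls report many more keys.
sample_stats = {"scrapy-zyte-api/success": 10, "downloader/request_count": 10}
print(migration_looks_successful(sample_stats))
```

In a running spider, the dictionary could come from self.crawler.stats.get_stats(); for finished jobs, the same keys appear in the stats dumped at the end of the crawl log.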

Migrate a project#

To migrate a Scrapy project:

  1. Disable scrapy-zyte-smartproxy or scrapy-crawlera.

    scrapy-zyte-smartproxy is enabled through the ZYTE_SMARTPROXY_ENABLED setting, and scrapy-crawlera through the CRAWLERA_ENABLED setting.

    To disable a plugin, find where you define its setting (e.g. settings.py, Scrapy Cloud settings), and remove it.

    Also, make sure you are not enabling those settings on specific spiders, e.g. through the custom_settings class attribute of a spider class, or in your cloud platform (e.g. Scrapy Cloud, which allows overriding settings for specific spiders).

  2. Configure Zyte API to run in transparent mode.

    If you use scrapy_zyte_api.Addon, remove the ZYTE_API_TRANSPARENT_MODE = False line from settings.py. The add-on enables transparent mode automatically.

    If you do not use scrapy_zyte_api.Addon, add the following line to settings.py:

    ZYTE_API_TRANSPARENT_MODE = True
    

To check that the migration was successful, you can either check stats for each spider or remove scrapy-zyte-smartproxy and scrapy-crawlera.

Remove proxy headers#

Regardless of whether you are migrating only some spiders or your whole project, review the code of requests that now go through Zyte API to look for proxy headers, i.e. those prefixed with X-Crawlera- or Zyte- (case-insensitive), and replace them with Zyte API counterparts according to this table.

Tip

You can usually find Scrapy requests by searching your code for uses of the Request class, but keep in mind that there are other ways to create requests, including request.copy(), request.replace(), request.from_curl(), request_from_dict(), response.follow() and response.follow_all().

You can specify those parameters through a zyte_api_automap dictionary in request metadata. For example, to set the geolocation of a request to the USA:

yield Request(
    ...,
    meta={
        "zyte_api_automap": {
            "geolocation": "US",
        },
    },
)

For details, see Automatic request parameters.
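When reviewing request headers, a small filter can flag the proxy headers described in this section. The sketch below matches the two prefixes case-insensitively; find_proxy_headers is a hypothetical helper, and the sample header names are only illustrations.

```python
from typing import Iterable, List

# Lowercase forms of the prefixes named in this section.
PROXY_HEADER_PREFIXES = ("x-crawlera-", "zyte-")


def find_proxy_headers(header_names: Iterable[str]) -> List[str]:
    """Return the header names that look like proxy headers, in input order."""
    return [
        name
        for name in header_names
        if name.lower().startswith(PROXY_HEADER_PREFIXES)
    ]


# Illustrative header names; only the first two match a proxy prefix.
flagged = find_proxy_headers(["X-Crawlera-Profile", "zyte-geolocation", "User-Agent"])
```

Each flagged header then needs to be replaced by hand with its Zyte API counterpart, since the mapping depends on the header.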

Handle retries#

scrapy-zyte-api implements an advanced retry mechanism, with a default retry policy that should work for most scenarios.

If requests keep exhausting their retries for temporary download errors and you want to allow more retries, or if you want to retry permanent download errors, you can try switching to the aggressive retry policy:

settings.py#
ZYTE_API_RETRY_POLICY = "zyte_api.aggressive_retrying"

You can also create a custom retry policy. See the reference documentation of RetryFactory and AggressiveRetryFactory for examples.

When retries are exceeded for a given request, an exception is raised, and if not caught, an error message is logged. See Retrying non-successful Zyte API responses to learn how to handle such exceptions.

Adjust crawl speed#

If you find that the migration has negatively affected the run time of your spiders, increase the CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN settings accordingly.
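As a sketch, raising concurrency in settings.py could look like this. The values are illustrative rather than recommendations; they assume Scrapy's defaults of 16 and 8, respectively, as the starting point.

```python
# settings.py (sketch): raise concurrency above Scrapy's defaults.
# Illustrative values only; tune them per project and watch your stats.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
```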

If a higher concurrency does not improve your run time, the cause may be rate limiting; if the scrapy-zyte-api/throttle_ratio Scrapy stat is high, you may open a support ticket to request a higher rate limit for your account.

scrapy-zyte-api bypasses the AutoThrottle Scrapy extension (enabled by default in Scrapy Cloud) and lets Zyte API handle rate limiting. If you find that your spiders are too fast after migration, lower concurrency settings accordingly.

Memory may increase#

Zyte API HTTP response bodies are Base64-encoded, making them 33-37% larger, hence increasing memory usage.
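The lower end of that range follows from how Base64 works: every 3 input bytes become 4 output characters. The following sketch measures the overhead with the standard library only; it covers plain Base64 encoding and makes no claim about other sources of overhead in a response.

```python
import base64


def base64_overhead(size: int) -> float:
    """Return the relative size increase of Base64-encoding ``size`` zero bytes."""
    raw = bytes(size)
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw) - 1


# 3,000,000 raw bytes become 4,000,000 encoded bytes: one third more.
overhead = base64_overhead(3_000_000)
print(f"{overhead:.1%}")
```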

If your spider runs out of memory after migration, consider:

  • Increasing available memory. If you use Scrapy Cloud, use more units.

  • Lowering SCRAPER_SLOT_MAX_ACTIVE_SIZE to a value that prevents exceeding available memory while allowing an acceptable crawl speed.
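For the second option, a settings.py sketch might look as follows. It assumes Scrapy's default of 5,000,000 bytes as the baseline; the halved value is only an illustration, and the right value depends on your memory budget and response sizes.

```python
# settings.py (sketch): lower the soft limit, in bytes, on response data
# that is being processed at the same time.
# Illustrative value: half of the assumed default of 5_000_000.
SCRAPER_SLOT_MAX_ACTIVE_SIZE = 2_500_000
```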

Remove scrapy-zyte-smartproxy (optional)#

Once you have migrated all your code and are happy with the result, you can remove scrapy-zyte-smartproxy and scrapy-crawlera:

pip uninstall scrapy-zyte-smartproxy scrapy-crawlera

Also remove from your code and from Scrapy Cloud any related Scrapy settings, i.e. those prefixed with either ZYTE_SMARTPROXY_ or CRAWLERA_, including those that you used to disable the plugins in an earlier migration step: there is no need to disable something that is no longer installed.
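To find leftover settings in your code, a text search for the two prefixes is usually enough. The sketch below does that search with a regular expression; find_legacy_settings is a hypothetical helper, and it cannot see settings that are defined only in Scrapy Cloud.

```python
import re
from typing import List

# Matches setting names that start with either legacy prefix.
LEGACY_SETTING = re.compile(r"\b(?:ZYTE_SMARTPROXY|CRAWLERA)_[A-Z0-9_]+\b")


def find_legacy_settings(source: str) -> List[str]:
    """Return the distinct legacy setting names found in ``source``, sorted."""
    return sorted(set(LEGACY_SETTING.findall(source)))


# Illustrative settings.py content; ZYTE_API_KEY must not be flagged.
sample_source = '''
ZYTE_SMARTPROXY_ENABLED = False
CRAWLERA_ENABLED = False
ZYTE_API_KEY = "YOUR_API_KEY"
'''
leftovers = find_legacy_settings(sample_source)
```

To scan a real project, the same function could be applied to the text of each Python file, e.g. by reading files with pathlib.Path.read_text().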