Migrating from scrapy-zyte-smartproxy to scrapy-zyte-api#
This migration guide provides the steps necessary to migrate from scrapy-zyte-smartproxy or scrapy-crawlera to scrapy-zyte-api.
Note
If you use Smart Proxy Manager, see Migrating from Smart Proxy Manager to Zyte API for general migration information.
Maybe keep scrapy-zyte-smartproxy#
If you use scrapy-zyte-smartproxy for Scrapy integration with Smart Proxy Manager, and you only want to migrate to Zyte API to enjoy better ban avoidance or pricing, you can continue using scrapy-zyte-smartproxy: scrapy-zyte-smartproxy 2.3.1 and higher support the proxy mode of Zyte API.
Tip
If you are using scrapy-crawlera, you would need to migrate to scrapy-zyte-smartproxy to use Zyte API proxy mode. See the release notes of scrapy-zyte-smartproxy 2.0.0 for details. It might be worth migrating to scrapy-zyte-api instead.
To switch from Smart Proxy Manager to the proxy mode of Zyte API, replace your Smart Proxy Manager API key with your Zyte API key, and set the ZYTE_SMARTPROXY_URL setting to "http://api.zyte.com:8011".
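As a rough sketch, the resulting settings.py for proxy mode might look like the following. The middleware entry, its priority, and the ZYTE_SMARTPROXY_APIKEY setting name are assumptions based on a typical scrapy-zyte-smartproxy setup rather than something stated in this guide; keep whatever your project already defines.

```python
# settings.py (sketch): scrapy-zyte-smartproxy pointed at Zyte API proxy mode.
# Assumption: the plugin is enabled through its usual middleware entry and
# settings; adjust names and priorities to match your existing configuration.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
}
ZYTE_SMARTPROXY_ENABLED = True
# Use your Zyte API key here, not your Smart Proxy Manager key.
ZYTE_SMARTPROXY_APIKEY = "YOUR_ZYTE_API_KEY"
ZYTE_SMARTPROXY_URL = "http://api.zyte.com:8011"
```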
Alternatively, you can enable Zyte API proxy mode for specific requests.
You should also add 520 and 521 to the RETRY_HTTP_CODES setting:
from scrapy.settings.default_settings import RETRY_HTTP_CODES as DEFAULT_RETRY_HTTP_CODES
RETRY_HTTP_CODES = DEFAULT_RETRY_HTTP_CODES + [520, 521]
scrapy-zyte-smartproxy will automatically translate Smart Proxy Manager headers into their Zyte API counterparts where possible, and drop them when not. However, you should eventually update your headers; see Parameter mapping.
Using scrapy-zyte-smartproxy for Zyte API makes it easier to migrate from Smart Proxy Manager. However, the proxy mode of Zyte API has limitations. Continue reading to learn how to migrate to scrapy-zyte-api and get access to all Zyte API features.
Set up scrapy-zyte-api#
You need Python 3.8 or higher to use the latest version of scrapy-zyte-api.
You need Scrapy 2.0.1 or higher to use the latest version of scrapy-zyte-api.
If you are using a lower version of Scrapy, please upgrade to a higher Scrapy version, and make sure your code works as expected with the newer Scrapy version before you continue the migration process.
The Scrapy release notes of every Scrapy version cover backward-incompatible changes and deprecation removals, which should help you upgrade your existing code as you upgrade Scrapy.
Install the latest version of scrapy-zyte-api:
pip install --upgrade scrapy-zyte-api
Configure scrapy-zyte-api in your settings.py file. If your Scrapy version is 2.10 or higher, add the following settings:
ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
ZYTE_API_TRANSPARENT_MODE = False
Otherwise add the following settings:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
SPIDER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware": 100,
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
If any of these settings already exists in your settings.py file, modify the existing setting as needed instead of re-defining it. For example, if you already have DOWNLOADER_MIDDLEWARES defined, add "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000, to your existing definition, keeping existing downloader middlewares untouched.
Also, make sure that these settings are not being overridden elsewhere. For example, make sure they are not defined in multiple lines of your settings.py file, and that they are not overridden in your Scrapy Cloud project settings.
Note
On projects that were not using the asyncio Twisted reactor, your existing code may need changes, such as:
Handling a pre-installed Twisted reactor.
Some Twisted imports install the default, non-asyncio Twisted reactor as a side effect. Once a reactor is installed, it cannot be changed for the whole run time.
Converting Twisted Deferreds into asyncio Futures.
Note that you might be using Deferreds without realizing it through some Scrapy functions and methods. For example, when you yield the return value of self.crawler.engine.download() from a spider callback, you are yielding a Deferred.
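One way to track down an import that installs a reactor as a side effect is to check whether the twisted.internet.reactor module has been loaded yet. The following is a minimal diagnostic sketch; the helper name is hypothetical, and it only reports that some reactor is installed, not which one.

```python
import sys


def reactor_is_installed() -> bool:
    """Return True if a Twisted reactor has already been installed.

    Twisted registers the installed reactor as the
    ``twisted.internet.reactor`` module, so its presence in ``sys.modules``
    indicates that a reactor (default or asyncio) is already in place.
    """
    return "twisted.internet.reactor" in sys.modules


# Hypothetical usage: call this before and after a suspicious import to see
# whether that import installed a reactor.
print("reactor installed:", reactor_is_installed())
```

If the result changes from False to True across an import, that import is a candidate for being moved inside the function that needs it.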
Add your API key to settings.py as well:
ZYTE_API_KEY = "YOUR_API_KEY"
To enable cookie support, the COOKIES_ENABLED setting is not enough; you must also define an additional setting in settings.py:
ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED = True
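For a project that already defines downloader middlewares, the advice about modifying existing settings instead of re-defining them might translate into something like the sketch below. The myproject.middlewares.CustomMiddleware entry and its priority are hypothetical placeholders for whatever your project already has.

```python
# settings.py (sketch): merging scrapy-zyte-api into an existing definition.
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical pre-existing middleware, left untouched.
    "myproject.middlewares.CustomMiddleware": 543,
    # Entry added for scrapy-zyte-api.
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
ZYTE_API_KEY = "YOUR_API_KEY"
ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED = True
```

Defining DOWNLOADER_MIDDLEWARES once, with both entries, avoids a later assignment silently replacing an earlier one.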
Migrate#
Your next steps depend on how you want to approach your migration. You can migrate some requests, migrate some spiders, or migrate your entire project.
Migrate a request#
Migrating requests makes sense if you want to keep scrapy-zyte-smartproxy but you need to drive specific requests through scrapy-zyte-api to overcome the limitations of proxy mode.
To migrate a Scrapy request, set the following fields in the request metadata:
yield Request(
    ...,
    meta={
        "dont_proxy": True,
        "zyte_api_automap": True,
    },
)
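If many requests need this treatment, the metadata could be built by a small helper. This is only a sketch: zyte_api_meta is a hypothetical name, not part of scrapy-zyte-api.

```python
def zyte_api_meta(**params):
    """Return request metadata that sends a request through scrapy-zyte-api.

    Keyword arguments, if any, are passed as Zyte API parameters through the
    zyte_api_automap dictionary; with none, automatic mapping is just enabled.
    """
    return {
        "dont_proxy": True,
        "zyte_api_automap": params if params else True,
    }


print(zyte_api_meta())
print(zyte_api_meta(geolocation="US"))
```

In a spider callback, this could be used as yield Request(url, meta=zyte_api_meta(geolocation="US")).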
Tip
If your spider stops with the plugin_conflict finish reason, make sure the ZYTE_API_TRANSPARENT_MODE setting is False. Only set ZYTE_API_TRANSPARENT_MODE to True when migrating an entire spider or project.
Migrate a spider#
Compared to migrating an entire project, migrating spiders one by one, incrementally, can be more time consuming, but also less disruptive, giving you time to validate the migration of each spider separately.
To migrate a Scrapy spider, use custom_settings or update_settings to toggle scrapy-zyte-smartproxy, scrapy-crawlera, and scrapy-zyte-api:
from scrapy import Spider

class MySpider(Spider):
    custom_settings = {
        "ZYTE_API_TRANSPARENT_MODE": True,
        "ZYTE_SMARTPROXY_ENABLED": False,
        "CRAWLERA_ENABLED": False,  # Only needed if you use scrapy-crawlera
    }
You can look at the stats of a crawl after migration to check that the migration was successful: there should be scrapy-zyte-api-prefixed stats indicating scrapy-zyte-api usage, there should be no scrapy-zyte-smartproxy stats, which are prefixed with either zyte_smartproxy (Smart Proxy Manager) or zyte_api_proxy (Zyte API), and there should be no scrapy-crawlera stats, which are prefixed with crawlera.
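The stats check described above can be sketched as a small function over a mapping of stat names. The function name and the sample stat names are illustrative assumptions; only the prefixes come from this guide.

```python
NEW_PREFIX = "scrapy-zyte-api"
LEGACY_PREFIXES = ("zyte_smartproxy", "zyte_api_proxy", "crawlera")


def check_migration_stats(stats):
    """Return (migrated, legacy_keys) for a mapping of Scrapy stats.

    migrated is True only if some stat uses the scrapy-zyte-api prefix and
    no stat uses a scrapy-zyte-smartproxy or scrapy-crawlera prefix.
    """
    keys = [str(key) for key in stats]
    uses_new = any(key.startswith(NEW_PREFIX) for key in keys)
    legacy_keys = sorted(key for key in keys if key.startswith(LEGACY_PREFIXES))
    return uses_new and not legacy_keys, legacy_keys


# Illustrative stat names only; real crawls report their own keys.
sample = {"scrapy-zyte-api/success": 10, "crawlera/request": 2}
print(check_migration_stats(sample))  # (False, ['crawlera/request'])
```

The stats mapping could come from the crawl log dump or, inside a spider, from self.crawler.stats.get_stats().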
Migrate a project#
To migrate a Scrapy project:
Disable scrapy-zyte-smartproxy or scrapy-crawlera.
scrapy-zyte-smartproxy is enabled through the ZYTE_SMARTPROXY_ENABLED setting, and scrapy-crawlera through the CRAWLERA_ENABLED setting.
To disable either plugin, find where you define that setting (e.g. settings.py, Scrapy Cloud settings), and remove it.
Also, make sure you are not enabling those settings on specific spiders, e.g. through the custom_settings class attribute of a spider class, or in your cloud (e.g. in Scrapy Cloud, which allows overriding settings for specific spiders).
Configure Zyte API to run in transparent mode.
If you use scrapy_zyte_api.Addon, remove the ZYTE_API_TRANSPARENT_MODE = False line from settings.py. The add-on enables transparent mode automatically.
If you do not use scrapy_zyte_api.Addon, add the following line to settings.py:
ZYTE_API_TRANSPARENT_MODE = True
To check that the migration was successful, you can either check stats for each spider or remove scrapy-zyte-smartproxy and scrapy-crawlera.
Remove proxy headers#
Regardless of whether you are migrating only some spiders or your whole project, review the code of requests that now go through Zyte API to look for proxy headers, i.e. those prefixed with X-Crawlera- or Zyte- (case-insensitive), and replace them with Zyte API counterparts according to this table.
Tip
You can usually find Scrapy requests by searching your code for uses of the Request class, but mind that there are other ways to create requests, including: request.copy(), request.replace(), request.from_curl(), request_from_dict(), response.follow() and response.follow_all().
You can specify those parameters through a zyte_api_automap dictionary in request metadata. For example, to set the geolocation of a request to the USA:
yield Request(
    ...,
    meta={
        "zyte_api_automap": {
            "geolocation": "US",
        },
    },
)
For details, see Automatic request parameters.
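To help locate leftover proxy headers, a helper along these lines could flag header names by prefix. It is a sketch under assumptions: find_proxy_headers is a hypothetical name, header names are assumed to be plain strings, and the sample headers are only illustrative.

```python
PROXY_HEADER_PREFIXES = ("x-crawlera-", "zyte-")


def find_proxy_headers(headers):
    """Return the header names that start with a proxy prefix.

    The comparison is case-insensitive, matching the rule given above.
    """
    return [
        name for name in headers
        if name.lower().startswith(PROXY_HEADER_PREFIXES)
    ]


# Illustrative headers only.
headers = {"X-Crawlera-Profile": "desktop", "zyte-geolocation": "US", "Accept": "*/*"}
print(find_proxy_headers(headers))  # ['X-Crawlera-Profile', 'zyte-geolocation']
```

Each flagged header still needs to be replaced by hand with its Zyte API counterpart; the helper only points at candidates.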
Handle retries#
scrapy-zyte-api implements an advanced retry mechanism, with a default retry policy that should work for most scenarios.
If retries for temporary download errors are being exceeded and you want to increase retries, or if you want to retry permanent download errors, you can try switching to the aggressive retry policy:
ZYTE_API_RETRY_POLICY = "zyte_api.aggressive_retrying"
You can also create a custom retry policy; see the reference documentation of RetryFactory and AggressiveRetryFactory for examples.
When retries are exceeded for a given request, an exception is raised, and if not caught, an error message is logged. See Retrying non-successful Zyte API responses to learn how to handle such exceptions.
Adjust the crawl speed#
If your crawl speed lowers significantly after migrating:
Ensure that you are not setting a DOWNLOAD_DELAY, which scrapy-zyte-smartproxy and scrapy-crawlera ignore, but scrapy-zyte-api respects.
If you need to keep a download delay for some domains, you can use the DOWNLOAD_SLOTS setting. Note that requests sent through scrapy-zyte-api use a different slot, prefixed with zyte-api@ (e.g. zyte-api@example.com).
Increase the CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN settings as needed.
If a higher concurrency does not improve your crawl speed, the cause may be rate limiting; if the scrapy-zyte-api/throttle_ratio Scrapy stat is high, you may open a support ticket to request a higher rate limit for your account.
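The speed-related settings above might be combined as in the following sketch. All values are illustrative placeholders rather than recommendations, and example.com stands in for a domain that still needs a delay.

```python
# settings.py (sketch): illustrative values, tune them for your project.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
# Leave the global DOWNLOAD_DELAY unset, and keep a delay only where needed.
# The slot name uses the zyte-api@ prefix; example.com and the 2-second
# delay are placeholders.
DOWNLOAD_SLOTS = {
    "zyte-api@example.com": {"delay": 2},
}
```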
If your crawl speed increases too much after migrating:
If the AutoThrottle Scrapy extension is enabled (i.e. AUTOTHROTTLE_ENABLED is True, as it is by default in Scrapy Cloud), scrapy-zyte-api bypasses the extension for Zyte API requests, to let Zyte API handle rate limiting on its own.
Set the ZYTE_API_PRESERVE_DELAY setting to True to prevent scrapy-zyte-api from bypassing the extension.
Memory may increase#
Zyte API HTTP response bodies are Base64-encoded, making them 33-37% larger, hence increasing memory usage.
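The lower end of that range follows directly from how Base64 works: every 3 bytes of input become 4 bytes of output. A quick check with the standard library, using a hypothetical 300,000-byte body:

```python
import base64

body = b"x" * 300_000  # stand-in for a response body of 300,000 bytes
encoded = base64.b64encode(body)
overhead = len(encoded) / len(body) - 1
print(f"{len(body)} bytes -> {len(encoded)} bytes ({overhead:.0%} larger)")
# 300000 bytes -> 400000 bytes (33% larger)
```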
If your spider runs out of memory after migration, consider:
Increasing available memory. If you use Scrapy Cloud, use more units.
Lowering SCRAPER_SLOT_MAX_ACTIVE_SIZE to a value that prevents exceeding available memory while allowing an acceptable crawl speed.
Remove scrapy-zyte-smartproxy (optional)#
Once you have migrated all your code and are happy with the result, you can remove scrapy-zyte-smartproxy and scrapy-crawlera:
pip uninstall scrapy-zyte-smartproxy scrapy-crawlera
Also remove from your code and from Scrapy Cloud any related Scrapy settings, i.e. those prefixed with either ZYTE_SMARTPROXY_ or CRAWLERA_, including those that you used to disable scrapy-zyte-smartproxy in an earlier migration step (no need to disable something that is not installed anymore).