Warning
Zyte API is replacing Smart Proxy Manager. It is no longer possible to sign up to Smart Proxy Manager. If you are an existing Smart Proxy Manager user, see Migrating from Smart Proxy Manager to Zyte API.
Using Smart Proxy Manager with Scrapy#
Warning
For the code below to work you must first install the Zyte CA certificate.
Note
All the code in this documentation has been tested with Scrapy 2.5.0 and scrapy-zyte-smartproxy 2.1.0.
The recommended way to use Smart Proxy Manager with Scrapy is the Zyte proxy middleware, which can be installed with:
pip install scrapy-zyte-smartproxy
You can enable the middleware by adding the following settings to your Scrapy project:
DOWNLOADER_MIDDLEWARES = {'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = '<API key>'
See the scrapy-zyte-smartproxy documentation for more information.
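If you would rather not commit the API key to version control, the settings above can be filled in from the environment. The following is a minimal sketch, not an official recommendation: the environment variable name is an assumption made for this example, and enabling the middleware only when a key is present is a design choice you may not want.

```python
import os

# settings.py (sketch): same settings as above, with the key read at startup.
DOWNLOADER_MIDDLEWARES = {"scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610}

# Assumption: the key is exported in an environment variable with this name.
# Use whatever name your deployment actually provides.
ZYTE_SMARTPROXY_APIKEY = os.environ.get("ZYTE_SMARTPROXY_APIKEY", "")

# Enable the middleware only when a key was actually found.
ZYTE_SMARTPROXY_ENABLED = bool(ZYTE_SMARTPROXY_APIKEY)
```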
Increasing Crawl Speed#
To increase the crawl rate when using Smart Proxy Manager with Scrapy, modify the following settings:
Increase CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN. The number of requests per second increases with them.
Disable AUTOTHROTTLE_ENABLED (recommended). Read more about it here.
Increase DOWNLOAD_TIMEOUT.
Here are the combined settings:
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
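To see why raising concurrency raises the request rate, a rough back-of-the-envelope estimate helps: with a fixed number of requests in flight, throughput is approximately concurrency divided by average response time. The helper below is a hypothetical illustration, not part of Scrapy or scrapy-zyte-smartproxy, and the 4-second response time is an assumed figure. It ignores retries and any server-side limits, so treat the result as an upper bound.

```python
def estimated_requests_per_second(concurrent_requests: int,
                                  avg_response_seconds: float) -> float:
    """Rough upper bound on throughput for a given concurrency level."""
    if avg_response_seconds <= 0:
        raise ValueError("avg_response_seconds must be positive")
    # With N requests always in flight and each taking T seconds on average,
    # about N / T requests complete per second.
    return concurrent_requests / avg_response_seconds


# Assumed average response time of 4 seconds per request.
print(estimated_requests_per_second(16, 4.0))  # 4.0
print(estimated_requests_per_second(32, 4.0))  # 8.0
```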
Updating Headers#
There are multiple ways to set Zyte Smart Proxy Manager headers in Scrapy:
Configuring the Headers Per Project#
Headers can be listed inside a Scrapy project’s settings.py file as follows:
DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
}
Configuring the Headers Per Spider#
Headers can be set at the spider level by adding DEFAULT_REQUEST_HEADERS to the custom_settings property of the spider class:
import scrapy


class SomeSpider(scrapy.Spider):
    name = 'somename'
    # Spider-level settings take precedence over the project's settings.py.
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "X-Crawlera-Profile": "desktop",
            "X-Crawlera-Cookies": "disable",
        }
    }
Configuring the Headers Per Request#
To set headers per request in Scrapy, pass them as a dictionary to the scrapy.Request object:
def start_requests(self):
    headers = {
        "X-Crawlera-Profile": "desktop",
        "X-Crawlera-Cookies": "disable",
    }
    for start_url in self.start_urls:
        yield scrapy.Request(
            url=start_url,
            headers=headers,
        )
Note
If headers are set at all three levels, priority is given to headers passed to scrapy.Request, then to the custom_settings property of the spider class, and lastly to DEFAULT_REQUEST_HEADERS defined in settings.py.
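The precedence described in this note can be modelled in plain Python. This is a simplified sketch, not Scrapy's actual code. It assumes that a spider's custom_settings value replaces the project-level DEFAULT_REQUEST_HEADERS as a whole rather than merging with it, and that the defaults only fill in headers the request does not already set. The effective_headers function is a hypothetical helper, the "mobile" value is illustrative, and the case-insensitive header matching that Scrapy performs is ignored.

```python
from typing import Dict, Optional

Headers = Dict[str, str]


def effective_headers(project_defaults: Headers,
                      spider_defaults: Optional[Headers],
                      request_headers: Optional[Headers]) -> Headers:
    """Model which headers end up on a request (hypothetical helper)."""
    # Assumption: a spider-level setting replaces the project-level one wholesale.
    defaults = spider_defaults if spider_defaults is not None else project_defaults
    merged = dict(defaults)
    # Headers passed to the request win over any default with the same name.
    merged.update(request_headers or {})
    return merged


project = {"X-Crawlera-Profile": "desktop", "X-Crawlera-Cookies": "disable"}
spider = {"X-Crawlera-Profile": "mobile"}

# No spider or request headers: the project defaults apply unchanged.
print(effective_headers(project, None, None))
# Spider-level defaults replace the project ones; the request adds its own header.
print(effective_headers(project, spider, {"X-Crawlera-Cookies": "disable"}))
# A request-level header overrides the spider-level default of the same name.
print(effective_headers(project, spider, {"X-Crawlera-Profile": "desktop"}))
```

Under this model, a header defined only in settings.py is not sent once the spider defines its own DEFAULT_REQUEST_HEADERS, so repeat any project-level header you still need at the spider level.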
Using Scrapy with Splash#
To use Zyte Smart Proxy Manager with Splash and Scrapy, see Using Smart Proxy Manager with Splash and Scrapy.