Warning
Zyte API is replacing Smart Proxy Manager. See Migrating from Smart Proxy Manager to Zyte API.
Using Smart Proxy Manager with Scrapy
Warning
For the code below to work you must first install the Zyte CA certificate.
Note
All the code in this documentation has been tested with Scrapy 2.5.0 and scrapy-zyte-smartproxy 2.1.0.
The recommended way to use Smart Proxy Manager with Scrapy is the Zyte proxy middleware, which can be installed with:
pip install scrapy-zyte-smartproxy
You can enable the middleware by adding the following settings to your Scrapy project:
DOWNLOADER_MIDDLEWARES = {'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = '<API key>'
See the scrapy-zyte-smartproxy documentation for more information.
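To keep the API key out of version control, one option is to build these settings from the environment. The following is a minimal sketch, not part of scrapy-zyte-smartproxy: the helper name smartproxy_settings and the environment variable SMARTPROXY_APIKEY are assumptions made for illustration.

```python
import os


def smartproxy_settings(environ=os.environ):
    """Build the middleware settings shown above from environment variables.

    SMARTPROXY_APIKEY is a hypothetical variable name chosen for this sketch.
    """
    api_key = environ.get("SMARTPROXY_APIKEY", "")
    return {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
        },
        # Only enable the middleware when a key is actually available.
        "ZYTE_SMARTPROXY_ENABLED": bool(api_key),
        "ZYTE_SMARTPROXY_APIKEY": api_key,
    }
```

In settings.py, calling globals().update(smartproxy_settings()) would expose the returned names as module-level settings, which is how Scrapy picks them up.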
Increasing Crawl Speed
To increase the crawl rate when using Smart Proxy Manager with Scrapy, adjust the following settings:
Increase CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN. Raising them increases the number of requests per second.
Disable AUTOTHROTTLE_ENABLED. Read more about it here.
Increase DOWNLOAD_TIMEOUT.
Here are the combined settings:
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
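As a rough guide to how concurrency affects crawl rate, steady-state throughput is approximately the number of requests in flight divided by the average response time (Little's law). The sketch below is a back-of-the-envelope estimate under that assumption only. The function name and the 4-second response time are illustrative, and real rates also depend on the target site and any other limits in place.

```python
def estimated_requests_per_second(concurrent_requests, avg_response_seconds):
    """Estimate steady-state throughput: requests in flight / average latency."""
    if avg_response_seconds <= 0:
        raise ValueError("avg_response_seconds must be positive")
    return concurrent_requests / avg_response_seconds


# Hypothetical numbers: Scrapy's default of 16 concurrent requests versus the
# 32 used above, assuming each response takes 4 seconds on average.
print(estimated_requests_per_second(16, 4.0))  # 4.0
print(estimated_requests_per_second(32, 4.0))  # 8.0
```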
Updating Headers
There are multiple ways to set Zyte Smart Proxy Manager headers in Scrapy:
Configuring the Headers Per Project
Headers can be listed inside a Scrapy project’s settings.py file as follows:
DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
}
Configuring the Headers Per Spider
Headers can be set at the spider level by adding DEFAULT_REQUEST_HEADERS to the custom_settings attribute of the spider class:
import scrapy


class SomeSpider(scrapy.Spider):
    name = 'somename'
    # Spider-level settings take precedence over the project's settings.py.
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "X-Crawlera-Profile": "desktop",
            "X-Crawlera-Cookies": "disable",
        }
    }
Configuring the Headers Per Request
To set headers per request in Scrapy, pass them as a dictionary to the scrapy.Request object:
# Method of a scrapy.Spider subclass.
def start_requests(self):
    headers = {
        "X-Crawlera-Profile": "desktop",
        "X-Crawlera-Cookies": "disable",
    }
    for start_url in self.start_urls:
        yield scrapy.Request(
            url=start_url,
            headers=headers,
        )
Note
If headers are set at all three levels, headers passed to scrapy.Request take priority, followed by the custom_settings attribute of the spider class, and finally DEFAULT_REQUEST_HEADERS defined in settings.py.
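The priority order in the note above can be illustrated with plain dictionaries. This is a simplified model, not Scrapy's actual code. It assumes that a spider-level DEFAULT_REQUEST_HEADERS replaces the project-level dictionary as a whole rather than merging with it, and that default headers are applied only for names the request has not already set. The function name effective_headers and the header values other than those shown earlier are illustrative, and unlike Scrapy this model treats header names as case-sensitive.

```python
def effective_headers(project_defaults, spider_defaults=None, request_headers=None):
    """Simplified model of header precedence; not Scrapy's actual code."""
    # Assumption: the spider-level dictionary, when present, replaces the
    # project-level dictionary entirely instead of being merged into it.
    defaults = project_defaults if spider_defaults is None else spider_defaults
    headers = dict(request_headers or {})
    # Defaults only fill in names the request did not set itself.
    for name, value in defaults.items():
        headers.setdefault(name, value)
    return headers


project = {"X-Crawlera-Profile": "desktop", "X-Crawlera-Cookies": "disable"}
spider = {"X-Crawlera-Profile": "mobile"}   # illustrative value
request = {"X-Crawlera-Profile": "pass"}    # illustrative value

print(effective_headers(project))
# {'X-Crawlera-Profile': 'desktop', 'X-Crawlera-Cookies': 'disable'}
print(effective_headers(project, spider))
# {'X-Crawlera-Profile': 'mobile'}
print(effective_headers(project, spider, request))
# {'X-Crawlera-Profile': 'pass'}
```

Under these assumptions, X-Crawlera-Cookies disappears once the spider defines its own dictionary, so a spider that overrides DEFAULT_REQUEST_HEADERS may need to repeat any project-level headers it still wants.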
Using Scrapy with Splash
To use Zyte Smart Proxy Manager with Splash and Scrapy, see Using Smart Proxy Manager with Splash and Scrapy.