Using Smart Proxy Manager with Scrapy

Warning

The zyte-smartproxy-ca.crt certificate must be installed in your operating system for the code below to work. You can follow these instructions to install it.

Note

All the code in this documentation has been tested with Scrapy 2.5.0 and scrapy-zyte-smartproxy 2.1.0.

The recommended way to use Smart Proxy Manager with Scrapy is through the Zyte proxy middleware, which can be installed with:

pip install scrapy-zyte-smartproxy

You can enable the middleware by adding the following settings to your Scrapy project:

DOWNLOADER_MIDDLEWARES = {'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = '<API key>'

See the scrapy-zyte-smartproxy documentation for more information.
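To keep the API key out of version control, settings.py could read it from the environment instead of hard-coding it. The following is a minimal sketch rather than part of the official setup: the environment variable name ZYTE_SMARTPROXY_APIKEY is an assumption chosen for illustration, and settings.py reads it explicitly.

```python
import os

# Enable the middleware exactly as in the settings above.
DOWNLOADER_MIDDLEWARES = {"scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610}
ZYTE_SMARTPROXY_ENABLED = True

# Assumption: the key is exported in an environment variable with this
# (hypothetical) name before the crawl starts. Falls back to an empty string.
ZYTE_SMARTPROXY_APIKEY = os.environ.get("ZYTE_SMARTPROXY_APIKEY", "")
```

With this approach the variable has to be set in the shell or deployment environment that launches the crawl.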

Increasing Crawl Speed

To increase the crawl rate when using Smart Proxy Manager with Scrapy, modify the following settings:

  1. Increase CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN. Raising these values increases the number of requests per second.

  2. Disable AUTOTHROTTLE_ENABLED (recommended). Read more about it here.

  3. Increase DOWNLOAD_TIMEOUT.

Here are the combined settings:

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
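As a rough illustration of why higher concurrency raises the request rate: if requests take about the same time to complete, the sustainable rate is bounded by the number of requests in flight divided by the average response time. The sketch below is back-of-the-envelope arithmetic under that assumption. The function name and the 4-second response time are hypothetical, and real throughput also depends on factors outside Scrapy, such as the target site.

```python
def estimated_requests_per_second(concurrent_requests, avg_response_seconds):
    """Upper bound on the request rate when all slots are always in use."""
    if avg_response_seconds <= 0:
        raise ValueError("avg_response_seconds must be positive")
    return concurrent_requests / avg_response_seconds


# Hypothetical average response time of 4 seconds per request.
print(estimated_requests_per_second(16, 4.0))  # 4.0
print(estimated_requests_per_second(32, 4.0))  # 8.0
```

Under these assumptions, doubling the concurrency from 16 to 32 doubles the upper bound on requests per second.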

Updating Headers

There are several ways to set Zyte Smart Proxy Manager headers in Scrapy: per project, per spider, or per request.

Configuring the Headers Per Project

Headers can be listed inside a Scrapy project’s settings.py file as follows:

DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
}

Configuring the Headers Per Spider

Headers can be set at the spider level by adding DEFAULT_REQUEST_HEADERS to the custom_settings attribute of the spider class:

import scrapy


class SomeSpider(scrapy.Spider):
    name = 'somename'
    # Spider-level settings take precedence over the project's settings.py.
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "X-Crawlera-Profile": "desktop",
            "X-Crawlera-Cookies": "disable",
        }
    }

Configuring the Headers Per Request

To set headers per request in Scrapy, pass them as a dictionary to the scrapy.Request object:

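import scrapy


class SomeSpider(scrapy.Spider):
    name = 'somename'

    def start_requests(self):
        # These headers apply only to the requests created below.
        headers = {
            "X-Crawlera-Profile": "desktop",
            "X-Crawlera-Cookies": "disable",
        }
        for start_url in self.start_urls:
            yield scrapy.Request(
                url=start_url,
                headers=headers,
            )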

Note

If headers are set at all three levels, the headers passed to scrapy.Request take priority, followed by the custom_settings attribute of the spider class, and finally DEFAULT_REQUEST_HEADERS defined in settings.py.
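The precedence described in this note can be illustrated with plain dictionaries. This is a simplified sketch, not Scrapy's actual implementation: the function name resolve_headers is hypothetical, header names are compared case-sensitively here, the "mobile" value is used only for illustration, and the sketch assumes that a spider-level DEFAULT_REQUEST_HEADERS replaces the project-level one as a whole rather than being merged with it.

```python
def resolve_headers(project_defaults, spider_defaults, request_headers):
    """Return the headers a request ends up with under the stated assumptions."""
    # Assumption: custom_settings on the spider replaces the settings.py value.
    defaults = spider_defaults if spider_defaults is not None else project_defaults
    # Defaults only fill in headers that the request did not set itself.
    effective = dict(defaults)
    effective.update(request_headers)
    return effective


project = {"X-Crawlera-Profile": "desktop", "X-Crawlera-Cookies": "disable"}
spider = {"X-Crawlera-Profile": "desktop"}
request = {"X-Crawlera-Profile": "mobile"}

print(resolve_headers(project, spider, request))
# {'X-Crawlera-Profile': 'mobile'}
```

In this sketch the request-level value wins, and X-Crawlera-Cookies is absent because the spider-level dictionary did not repeat it.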

Using Scrapy with Splash

To use Zyte Smart Proxy Manager with Splash and Scrapy, see Using Smart Proxy Manager with Splash and Scrapy.