Using Smart Proxy Manager with Scrapy

Note

All the code in this documentation has been tested with Scrapy 2.5.0 and scrapy-zyte-smartproxy 2.1.0.

The recommended way to use Smart Proxy Manager with Scrapy is through the scrapy-zyte-smartproxy middleware, which can be installed with:

pip install scrapy-zyte-smartproxy

You can enable the middleware by adding the following settings to your Scrapy project:

DOWNLOADER_MIDDLEWARES = {'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = '<API key>'

See the scrapy-zyte-smartproxy documentation for more information.

Increasing Crawl Speed

To increase the crawl rate when using Smart Proxy Manager with Scrapy, adjust the following settings:

  1. Increase CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN; the number of requests per second grows with these limits.

  2. Disable AutoThrottle (AUTOTHROTTLE_ENABLED = False), since it throttles based on response latency, which is typically higher when requests go through a proxy.

  3. Increase DOWNLOAD_TIMEOUT, since requests routed through Smart Proxy Manager can take longer to complete.

Here are the combined settings:

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600

Updating Headers

There are multiple ways to set Zyte Smart Proxy Manager headers in Scrapy:

Configuring the Headers Per Project

Headers can be listed inside a Scrapy project’s settings.py file as follows:

DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
}

Configuring the Headers Per Spider

Headers can be set at the spider level by adding DEFAULT_REQUEST_HEADERS to the custom_settings attribute of the spider class:

class SomeSpider(scrapy.Spider):
    name = 'somename'
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "X-Crawlera-Profile": "desktop",
            "X-Crawlera-Cookies": "disable",
        }
    }

Configuring the Headers Per Request

To set headers per request, pass a headers dictionary to the scrapy.Request object:

def start_requests(self):
    headers = {
        "X-Crawlera-Profile": "desktop",
        "X-Crawlera-Cookies": "disable",
    }
    for start_url in self.start_urls:
        yield scrapy.Request(
            url=start_url,
            headers=headers,
        )

Note

If headers are set at all three levels, headers passed to scrapy.Request take the highest precedence, followed by those in the spider's custom_settings, and finally DEFAULT_REQUEST_HEADERS defined in settings.py.

Using Scrapy with Splash

To use Zyte Smart Proxy Manager with Splash and Scrapy, see Using Smart Proxy Manager with Splash and Scrapy.