Using Smart Proxy Manager with Splash and Scrapy

Note

All the code in this documentation has been tested with Splash 3.5, Scrapy 2.5.0, and Python 3.9.5.

Using Smart Proxy Manager with Scrapy and Using Smart Proxy Manager with Splash are each covered in detail in their own guides. This guide combines the two, so that Scrapy and Splash work together through Smart Proxy Manager.

Installation

We have created a repository with sample code that will be used to explain the integration.

  1. Clone the sample code repository with the following command.

git clone https://github.com/scrapinghub/sample-projects.git
  2. Enter the directory containing Splash + Scrapy + Smart Proxy Manager integration code.

cd sample-projects/splash_smart_proxy_manager_example

Note

The repository contains various code samples, but this guide only uses the splash_smart_proxy_manager_example directory. You can ignore the rest.

  3. Create and activate a Python virtual environment.

python -m venv venv && source venv/bin/activate
  4. Install Python dependencies.

pip install -r requirements.txt
  5. Set up the Zyte SmartProxy (formerly Crawlera) Headless Proxy as described in Using Headless Browsers with Zyte Smart Proxy Manager.

  6. Download and install Splash by following the official installation guide: https://splash.readthedocs.io/en/stable/install.html

Assuming you installed Splash using Docker, proceed to run Splash with:

docker run -it -p 8050:8050 --rm scrapinghub/splash

You can confirm Splash is running by accessing the Splash web UI at http://localhost:8050

Note

After installation, you should have both Splash and Zyte SmartProxy (formerly Crawlera) Headless Proxy up and running. We’ll run the sample code in the next step.
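
Before moving on, it can help to confirm that both services are accepting connections. The snippet below is a minimal sketch that only checks whether a TCP port is open. Port 8050 comes from the docker command above; port 3128 and the host localhost for Headless Proxy are assumptions, so adjust them to match how you started it.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    # 8050 is the Splash port used in this guide; 3128 is an ASSUMED
    # Headless Proxy port, change it to the one you configured.
    for name, port in (("Splash", 8050), ("Headless Proxy", 3128)):
        state = "up" if port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that something is listening there. It does not verify that the proxy credentials are valid.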

Running the Project

After installation, simply run:

scrapy crawl quotes-js

This will run the quotes-js spider using Splash and Headless Proxy. Splash will render the requested URL (i.e. http://quotes.toscrape.com/js/ in our example code) and return the rendered page to the spider.

Let’s understand all the important elements of the example project.

  1. splash_scrapy_spm_headless_proxy_example/settings.py - On line 18, SPLASH_URL is set to http://localhost:8050, which tells Scrapy where to reach Splash. Change this value if Splash is running at a different address.

  2. splash_scrapy_spm_headless_proxy_example/spiders/quotes-js.py - The Scrapy spider class is defined in this file. It reads the Lua file and sends its contents to Splash as an argument, together with the URL to request (i.e. http://quotes.toscrape.com/js/, defined on line 23).

  3. splash_scrapy_spm_headless_proxy_example/scripts/smart_proxy_manager.lua - The Lua script that Splash executes. It receives the URL to request (i.e. http://quotes.toscrape.com/js/, sent as an argument by the Scrapy spider). On line 28, it sets Zyte SmartProxy Headless Proxy as the proxy to use when requesting the URL.
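
To make the flow between these three files more concrete, here is a rough, standard-library-only sketch of the message that ends up at Splash: a POST to Splash's execute endpoint carrying the Lua source and the target URL. This is not the code from the repository, where the spider sends the request through Scrapy. The helper name build_execute_request is hypothetical.

```python
import json
import urllib.request


def build_execute_request(
    splash_url: str, target_url: str, lua_source: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for Splash's /execute endpoint.

    Hypothetical helper for illustration only.
    """
    # Every key in the JSON body is exposed to the Lua script via
    # splash.args, so the script can read the target as splash.args.url.
    payload = {"lua_source": lua_source, "url": target_url}
    return urllib.request.Request(
        url=splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder script; the real one lives in scripts/smart_proxy_manager.lua.
    lua = "function main(splash) splash:go(splash.args.url) return splash:html() end"
    req = build_execute_request(
        "http://localhost:8050", "http://quotes.toscrape.com/js/", lua
    )
    print(req.get_method(), req.full_url)
```

Passing the resulting request to urllib.request.urlopen would actually send it, provided Splash is running at the given address.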

Customizing the Lua Script

The Lua file in the example project also shows an additional use of Splash scripting: skipping requests for some resources, such as images. If you need more control over Splash, see Splash’s official documentation on scripting.
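
As an illustration of this kind of customization, the sketch below generates a small Lua script from Python. It is not the script shipped in the repository, which may block resources differently. It assumes the proxy is reachable from Splash at the host and port you pass in, and it turns off image loading through Splash's images_enabled attribute. The function name make_lua_script and the host value in the usage example are hypothetical.

```python
def make_lua_script(proxy_host: str, proxy_port: int, block_images: bool = True) -> str:
    """Return Lua source that routes Splash requests through a proxy.

    Hypothetical helper; proxy_host and proxy_port must be reachable
    from wherever Splash runs (for example, from inside its container).
    """
    images_value = "false" if block_images else "true"
    return f"""
function main(splash)
    -- Skip image downloads when block_images is set.
    splash.images_enabled = {images_value}
    -- Send every outgoing request through the proxy.
    splash:on_request(function(request)
        request:set_proxy{{host = "{proxy_host}", port = {proxy_port}}}
    end)
    assert(splash:go(splash.args.url))
    return splash:html()
end
""".strip()


if __name__ == "__main__":
    # "headless-proxy" and 3128 are placeholder values, not taken from the repo.
    print(make_lua_script("headless-proxy", 3128))
```

Generating the script from parameters keeps the proxy address out of the Lua file itself, which can be convenient when Splash and the proxy run on different hosts in different environments.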