Using Smart Proxy Manager with Splash and Scrapy
All the code in this documentation has been tested with Splash 3.5, Scrapy 2.5.0, and Python 3.9.5.
Using Smart Proxy Manager with Scrapy and Using Smart Proxy Manager with Splash are each covered in detail in their own guides. This doc combines the two: Scrapy sends requests to Splash, and Splash routes them through Smart Proxy Manager.
We have created a repository with sample code that will be used to explain the integration.
Clone the sample code repository using the following command.
git clone https://github.com/scrapinghub/sample-projects.git
Enter the directory containing the Splash + Scrapy + Smart Proxy Manager integration code.
The repository contains several code samples, but only the splash_smart_proxy_manager_example directory is relevant here. Please ignore the rest for the purpose of this documentation.
Create and activate a Python virtual environment.
python -m venv venv && source venv/bin/activate
Install Python dependencies.
pip install -r requirements.txt
Set up the Zyte SmartProxy (formerly Crawlera) Headless Proxy as described in Using Headless Browsers with Zyte Smart Proxy Manager.
Download and install Splash by following the installation guide: https://splash.readthedocs.io/en/stable/install.html
Assuming you installed Splash using Docker, proceed to run Splash with:
docker run -it -p 8050:8050 --rm scrapinghub/splash
You can confirm Splash is running by accessing the Splash web UI at http://localhost:8050
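The same check can be scripted instead of opening the web UI. The following is a minimal sketch, assuming Splash is reachable at http://localhost:8050 and exposes its `/_ping` health endpoint; the function name `splash_is_up` is hypothetical and is not part of the sample repository.

```python
import urllib.error
import urllib.request


def splash_is_up(base_url: str = "http://localhost:8050", timeout: float = 3.0) -> bool:
    """Return True if Splash answers its /_ping endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    print("Splash reachable:", splash_is_up())
```

If this reports that Splash is not reachable, check that the Docker container is still running and that port 8050 is published.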
After installation, you should have both Splash and the Zyte SmartProxy (formerly Crawlera) Headless Proxy up and running. We'll run the sample code in the next step.
Running the Project
After installation, simply run:
scrapy crawl quotes-js
This will run the quotes-js spider using Splash and Headless Proxy. Splash will render the requested URL (i.e. http://quotes.toscrape.com/js/ in our example code) and return the rendered result to the spider.
The important elements of the example project are described below.
splash_scrapy_spm_headless_proxy_example/settings.py: On line 18, SPLASH_URL is set to http://localhost:8050, which lets Scrapy connect to Splash. Change this value if Splash is running at another URL.
splash_scrapy_spm_headless_proxy_example/spiders/quotes-js.py: The Scrapy spider class is defined in this file. It reads the Lua file and sends its contents to Splash as an argument, alongside the URL to request (i.e. http://quotes.toscrape.com/js/, defined on line 23).
splash_scrapy_spm_headless_proxy_example/scripts/smart_proxy_manager.lua: The Lua file used by Splash. It receives the URL to request (i.e. http://quotes.toscrape.com/js/, sent as an argument by the Scrapy spider). On line 28, the Zyte SmartProxy Headless Proxy is set as the proxy to use when requesting the URL.
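To illustrate how the Lua source and the target URL travel together, here is a minimal sketch that builds the request body for Splash's `/execute` HTTP endpoint using only the standard library. It is not the code from the sample spider, which is expected to go through Scrapy rather than call Splash directly; the function name `build_execute_request` and its parameters are hypothetical.

```python
import json
from pathlib import Path


def build_execute_request(lua_path: str, url: str,
                          splash_url: str = "http://localhost:8050"):
    """Return (endpoint, JSON body) for a call to Splash's /execute endpoint."""
    # Read the whole Lua script; Splash expects it under the "lua_source" key.
    lua_source = Path(lua_path).read_text(encoding="utf-8")
    # Any other key (here "url") is made available to the script as splash.args.<key>.
    payload = {"lua_source": lua_source, "url": url}
    return f"{splash_url}/execute", json.dumps(payload)


if __name__ == "__main__":
    # Assumed relative path, based on the file layout described above.
    endpoint, body = build_execute_request(
        "scripts/smart_proxy_manager.lua", "http://quotes.toscrape.com/js/"
    )
    print(endpoint, len(body))
```

With the scrapy-splash plugin, the same two values would typically be passed as `SplashRequest(url, endpoint="execute", args={"lua_source": lua_source})`, so the spider never has to build the JSON body by hand.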
Customizing the Lua Script
The Lua script in the example project also shows how Splash can be told to skip requests for certain resources, such as images. If you need more control over Splash, see the official Splash documentation on scripting.
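As a starting point for a customized script, the sketch below generates a small Lua script from Python. It is not the script shipped in the sample repository. The proxy host is a placeholder you must supply (from inside a Docker container, "localhost" does not point at the host machine), and port 3128 is assumed to be the port your Headless Proxy listens on. The names `LUA_TEMPLATE` and `render_lua` are hypothetical.

```python
from string import Template

# Hypothetical Lua template: disables image loading and sends every
# request through the Headless Proxy before rendering args.url.
LUA_TEMPLATE = Template("""
function main(splash, args)
    -- Do not download images at all.
    splash.images_enabled = false
    splash:on_request(function(request)
        -- Route each outgoing request through the Headless Proxy.
        request:set_proxy{host = "$proxy_host", port = $proxy_port}
    end)
    assert(splash:go(args.url))
    return splash:html()
end
""")


def render_lua(proxy_host: str, proxy_port: int = 3128) -> str:
    """Fill in the proxy address (assumed values) and return the Lua source."""
    return LUA_TEMPLATE.substitute(proxy_host=proxy_host, proxy_port=proxy_port)


if __name__ == "__main__":
    # Placeholder address; replace it with the address of your Headless Proxy.
    print(render_lua("192.168.0.10"))
```

Generating the script from a template keeps the proxy address out of the Lua file itself, which is convenient when the address differs between environments.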