Warning
Zyte API is replacing Smart Proxy Manager. It is no longer possible to sign up to Smart Proxy Manager. If you are an existing Smart Proxy Manager user, see Migrating from Smart Proxy Manager to Zyte API.
Using Smart Proxy Manager with Splash and Scrapy#
Note
All the code in this documentation has been tested with Splash 3.5, Scrapy 2.5.0, and Python 3.9.5.
We have discussed Using Smart Proxy Manager with Scrapy and Using Smart Proxy Manager with Splash in detail. In this doc, we’ll combine Scrapy and Splash and integrate them with Smart Proxy Manager.
Installation#
We have created a repository with sample code that will be used to explain the integration.
Clone the sample code repository using the following command.
git clone https://github.com/scrapinghub/sample-projects.git
Enter the directory containing Splash + Scrapy + Smart Proxy Manager integration code.
cd sample-projects/splash_smart_proxy_manager_example
Note
There are various code samples in the repository, but we are only concerned with the splash_smart_proxy_manager_example directory. Please ignore the rest for the purpose of this documentation.
Create and activate a Python virtual environment.
python -m venv venv && source venv/bin/activate
Install Python dependencies.
pip install -r requirements.txt
Set up the Zyte SmartProxy (formerly Crawlera) Headless Proxy as described in Using Headless Browsers with Zyte Smart Proxy Manager.
Download and install Splash as explained in the official installation guide: https://splash.readthedocs.io/en/stable/install.html
Assuming you installed Splash using Docker, proceed to run Splash with:
docker run -it -p 8050:8050 --rm scrapinghub/splash
You can confirm Splash is running by accessing the Splash web UI at http://localhost:8050.
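If you prefer to verify Splash from code, you can also hit its render.html HTTP endpoint. The snippet below is a minimal sketch and is not part of the sample project; it assumes Splash is listening on localhost:8050 and that the requests package is installed.

import requests

# Ask Splash to render a JavaScript-heavy page and return the resulting HTML.
# "url" and "wait" are parameters of Splash's render.html endpoint.
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 2},
    timeout=60,
)
response.raise_for_status()
print(response.text[:200])  # A successful render prints the start of the page HTML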
Note
After the installation steps above, you should have both Splash and the Zyte SmartProxy (formerly Crawlera) Headless Proxy up and running. We’ll run the sample code in the next step.
Running the Project#
After installation, simply run:
scrapy crawl quotes-js
This will run the quotes-js spider using Splash and the Headless Proxy. Splash will render the requested URL (i.e. http://quotes.toscrape.com/js/ in our example code) and return the rendered page to the spider.
Let’s understand all the important elements of the example project.
splash_scrapy_spm_headless_proxy_example/settings.py - On line 18, SPLASH_URL is set to http://localhost:8050, which lets Scrapy establish a connection with Splash. This can be changed if Splash is running at another URL.
splash_scrapy_spm_headless_proxy_example/spiders/quotes-js.py - The Scrapy spider class is defined in this file. It reads the Lua script and sends it as an argument to Splash alongside the URL to request (i.e. http://quotes.toscrape.com/js/, defined on line 23). A sketch of this flow is shown after this list.
splash_scrapy_spm_headless_proxy_example/scripts/smart_proxy_manager.lua - The Lua script used by Splash. It receives the URL to request through Splash (i.e. http://quotes.toscrape.com/js/, sent as an argument by the Scrapy spider). On line 28, we set the Zyte SmartProxy Headless Proxy as the proxy to use while requesting the URL.
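To make the pieces above concrete, here is a rough sketch of what a spider like quotes-js can look like when it sends the Lua script to Splash via scrapy-splash. It is an approximation for illustration only, not a copy of the file in the repository; the path to the Lua script and the parsing logic are assumptions.

import scrapy
from scrapy_splash import SplashRequest


class QuotesJsSpider(scrapy.Spider):
    name = "quotes-js"

    # Load the Lua script that Splash will execute for every request.
    # The relative path is an assumption for this sketch.
    with open("scripts/smart_proxy_manager.lua") as f:
        lua_script = f.read()

    def start_requests(self):
        # endpoint="execute" tells Splash to run the Lua script; the script
        # receives the URL and routes the request through the headless proxy.
        yield SplashRequest(
            url="http://quotes.toscrape.com/js/",
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": self.lua_script},
        )

    def parse(self, response):
        # The response here is the HTML already rendered by Splash.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

If your copy of the repository lays files out differently, adjust the path to the Lua script accordingly.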
Customizing the Lua Script#
The Lua script in the example project also shows how Splash can be used to avoid requesting certain resources, such as images. If you need more control over Splash, check out Splash’s official documentation about scripting.