Using Headless Browsers with Zyte Smart Proxy Manager
In this documentation you may see references to “Crawlera”, the previous name of the “Smart Proxy Manager” service. The two names are effectively interchangeable.
What's different about headless browsers?
Smart Proxy Manager does not provide raw IPs like most regular (data center) proxy providers. Instead, it delivers requests on your behalf. It also introduces artificial delays so that websites are crawled politely, which also increases success rates.
Because of these artificial delays, using the standard Smart Proxy Manager directly with a headless browser causes pages to load very slowly and time out frequently.
To use headless browsers properly, you have to use Smart Proxy Manager in “Headless Mode”, which is different from the standard mode. Using Smart Proxy Manager in Headless Mode involves two things: running the Headless Proxy tool, and using a Headless browser user type in Smart Proxy Manager.
The Headless Proxy tool
Smart Proxy Manager provides a tool that you can use to connect your headless browser with Smart Proxy Manager. This tool is called “Zyte SmartProxy (formerly Crawlera) Headless Proxy” and it’s basically a proxy (yes, another proxy) that sits between Smart Proxy Manager and your headless browser.
The Headless Proxy takes care of the following tasks, which are typically not required for a non-headless environment (such as a Python script or Scrapy spider):
manages Smart Proxy Manager sessions automatically, because requests from a browser need to go out from the same IP and sessions are the way to achieve that with Smart Proxy Manager
blocks ads, to speed up page loads and save costs (remember that Smart Proxy Manager is priced per request)
For more details about how each of these works, please refer to the Headless Proxy README on GitHub. Please note that the Headless Proxy code is open source.
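To make the session management task more concrete, here is a minimal sketch of the bookkeeping involved. It assumes the `X-Crawlera-Session` header described in the Smart Proxy Manager API documentation, where the value `create` requests a new session and the response carries the session ID to reuse. The `SessionPinner` class is a hypothetical name for illustration and is not how the Headless Proxy is actually implemented.

```python
from typing import Dict, Mapping, Optional

# Header name assumed from the Smart Proxy Manager API documentation.
SESSION_HEADER = "X-Crawlera-Session"


class SessionPinner:
    """Hypothetical helper that keeps requests on one Smart Proxy Manager session."""

    def __init__(self) -> None:
        self.session_id: Optional[str] = None

    def request_headers(self) -> Dict[str, str]:
        # Ask for a new session on the first request, reuse its ID afterwards.
        return {SESSION_HEADER: self.session_id or "create"}

    def update(self, response_headers: Mapping[str, str]) -> None:
        # Header names are case-insensitive, so normalise them before the lookup.
        lowered = {name.lower(): value for name, value in response_headers.items()}
        session_id = lowered.get(SESSION_HEADER.lower())
        if session_id:
            self.session_id = session_id
```

A client would merge `request_headers()` into every outgoing request and call `update()` with each response's headers. The Headless Proxy does this kind of tracking for you, which is why a browser does not need to know about sessions at all.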
Installing the Headless Proxy tool
The recommended way to install the Headless Proxy is to download an executable compiled for your platform. There are other ways to install the Headless Proxy, including via a Docker image. For more information, check the project README.
Download the Headless Proxy from the Releases page. There are executables available for Windows, Linux, macOS (darwin), FreeBSD, and OpenBSD.
Download this config.toml file into the same folder as the Headless Proxy executable. This config comes with ad blocking and direct access enabled.
Download the `ca.crt <https://raw.githubusercontent.com/zytedata/zyte-smartproxy-headless-proxy/master/ca.crt>`_ certificate file and install it as explained in this page.
You are now ready to start the Headless Proxy by running:
crawlera-headless-proxy -c config.toml -a API_KEY
If you are on a recent version of macOS, you may need to give this executable permission to run, as explained in this page.
Get your API_KEY from https://app.zyte.com/o/smart-proxy-manager/setup if you already have an account, or sign up at https://app.zyte.com/account/signup/smart-proxy-manager and then go to the aforementioned URL.
Make sure you are using a Headless browser user type in Smart Proxy Manager, otherwise your headless browser will run very slowly.
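If you prefer to start the Headless Proxy from a script rather than a terminal, the start-up command above can be wrapped as in the following sketch. The helper names (`build_command`, `wait_for_port`, `start_proxy`) are hypothetical, the executable is assumed to be on your `PATH`, and port 3128 is an assumption: use whatever port your config.toml binds to.

```python
import socket
import subprocess
import time
from typing import List


def build_command(api_key: str, config_path: str = "config.toml") -> List[str]:
    # Same invocation as the command shown above, expressed as an argument list.
    return ["crawlera-headless-proxy", "-c", config_path, "-a", api_key]


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; retry shortly.
            time.sleep(0.2)
    return False


def start_proxy(api_key: str, port: int = 3128) -> subprocess.Popen:
    # The port is an assumption: adjust it to match your configuration.
    process = subprocess.Popen(build_command(api_key))
    if not wait_for_port("127.0.0.1", port):
        process.terminate()
        raise RuntimeError("Headless Proxy did not start listening in time")
    return process
```

Calling `start_proxy("API_KEY")` returns the running process, so the script can call `terminate()` on it once the browser session is finished.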
Using the Headless Proxy tool
Please note that while using the Headless Proxy, the first page may take more time to load while Smart Proxy Manager warms up a pool of clean IPs for the target website you’re trying to access. Typically, the first page can take up to 30 seconds, and subsequent page loads from the same site take 2-5 seconds.
This warm-up time should be seen as a small inconvenience. In exchange, you get a perpetually refreshing pool of healthy IPs, which translates to better performance in the long run.
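Because the first page of a site loads much more slowly than later ones, it can help to give the browser a longer navigation timeout for the first visit to each host. The sketch below uses the figures quoted above. The `TimeoutPlanner` class and the safety factor of 2 are hypothetical choices for illustration, not documented values.

```python
from typing import Set
from urllib.parse import urlsplit

FIRST_PAGE_SECONDS = 30.0  # warm-up figure quoted above
LATER_PAGE_SECONDS = 5.0   # upper end of the 2-5 second range quoted above
SAFETY_FACTOR = 2.0        # hypothetical margin, not a documented value


class TimeoutPlanner:
    """Hypothetical helper that picks a navigation timeout per page load."""

    def __init__(self) -> None:
        self._seen_hosts: Set[str] = set()

    def timeout_for(self, url: str) -> float:
        host = urlsplit(url).hostname or ""
        if host in self._seen_hosts:
            # The IP pool for this host should already be warm.
            return LATER_PAGE_SECONDS * SAFETY_FACTOR
        # First visit to this host: allow for the warm-up period.
        self._seen_hosts.add(host)
        return FIRST_PAGE_SECONDS * SAFETY_FACTOR
```

The returned value is in seconds. Convert it if your browser automation tool expects milliseconds.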
Once you have set up the Headless Proxy, you can use Smart Proxy Manager with multiple headless browsers. Please refer to the documentation of the tool you use for code samples and more specific instructions: