Get started with web scraping

Web scraping is the automated extraction of data from websites into a structured format that you can process.

Use cases include price intelligence, market analysis, competitor intelligence, vendor management, lead generation, investment research, and brand monitoring.

Below we expand on the steps and challenges of web scraping, and the solutions that we offer or recommend.

Tip

For a more hands-on experience, see the tutorial.

Steps in web scraping

Getting structured data from a website involves the following steps, illustrated in the code sketch after this list:

  1. Building a list of target URLs from which you want to get structured data.

    You can build the list manually or get it from an external source; however, you will often want to use web scraping itself to find all URLs of interest on a given website.

    This process, known as web crawling, is itself a web scraping process with the same steps (target URLs, download, parsing): the starting target URL is often the homepage of the target website, and the output is usually the list of target URLs.

  2. Downloading the webpages at those URLs.

    The main challenge at the download stage is avoiding bans. However, the complexity of this step can also be influenced by your output needs (e.g. screenshots) and parsing choices (e.g. browser automation).

  3. Parsing those webpages to extract the data of interest as structured output.

    Parsing can be a complex step, and it may involve downloading additional URLs or using browser automation.

    In the long term, however, the main challenge of the parsing stage is dealing with breaking changes in the target website.
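
For example, here is a minimal sketch of a spider written with Scrapy (the framework discussed in the next section) that covers all three steps against quotes.toscrape.com, a demo site for scraping practice; the CSS selectors match that site's markup at the time of writing and are otherwise illustrative:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal sketch covering the three steps above."""

    name = "quotes"
    # Step 1: the list of target URLs, seeded with a single page;
    # the crawling code below expands it with pagination links.
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Step 3: parse the downloaded page (step 2, the download
        # itself, is handled by Scrapy) into structured records.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Step 1, continued: discover further target URLs (crawling).
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this can be run with scrapy runspider quotes_spider.py -o quotes.jsonl, which writes the structured output as JSON Lines.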

Choosing a framework

Choosing the right technology to write your code is key for the long-term success of your web scraping project. To make your choice, you should consider aspects like development speed, performance, maintainability, and vendor lock-in.

At Zyte we use and maintain Scrapy, a popular open source web scraping framework written in Python. Scrapy is a powerful, extensible framework that favors writing maintainable code.

For most Zyte products and services we provide Scrapy plugins that make integration with Scrapy seamless.
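
For example, enabling the scrapy-zyte-api plugin is mostly a matter of project settings. A minimal sketch, assuming a recent Scrapy version with add-on support; the key is a placeholder, and the exact settings for your versions are best checked against the plugin documentation:

```python
# settings.py of a Scrapy project: a sketch of enabling scrapy-zyte-api.
ADDONS = {
    "scrapy_zyte_api.Addon": 500,  # the plugin's add-on shortcut
}
ZYTE_API_KEY = "YOUR_API_KEY"  # placeholder; use your own key
```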

Avoiding bans

An increasing number of websites ban some of their traffic, typically traffic they suspect of being automated.

What you need to do to avoid bans depends on the target website and can vary widely. Proxy rotation is often necessary, but on top of that you may need extra logic, including cookie and session handling, browser-like JavaScript execution, and browser-like HTTP protocol handling.

While you can build ban avoidance on your own by combining different services and tools, such a setup can be time-consuming to implement, maintain, and scale.

For this, we provide Zyte API, an API that handles ban avoidance for you automatically and cost-efficiently.
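
As a sketch of what that looks like, the snippet below fetches a page through Zyte API with the requests library and decodes the returned HTTP response body; the API key is a placeholder, and parameter names are best double-checked against the Zyte API documentation:

```python
import base64

import requests

# A minimal sketch of downloading a page through Zyte API; ban avoidance
# (proxy rotation, sessions, and so on) happens on the Zyte side.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),  # placeholder API key as the username
    json={
        "url": "https://example.com",
        "httpResponseBody": True,  # request the raw HTTP response body
    },
)
api_response.raise_for_status()
# The body is base64-encoded in the JSON response.
html = base64.b64decode(api_response.json()["httpResponseBody"]).decode()
```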

Saving time with browser automation

For websites that use JavaScript to load content, there are two approaches you can take:

  • Reverse-engineering and recreating the JavaScript code that loads the content.

  • Letting a browser automation tool run the JavaScript code.

Reverse-engineering usually takes more development time, but needs fewer resources once implemented, and can uncover useful, hidden data.

Browser automation usually saves development time, but requires additional resources, and can be hard to scale.

Zyte API provides browser automation features (see the sketch after this list) and, unlike regular browser automation tools, it:

  • Scales easily, offsetting one of the main drawbacks of using browser automation tools.

  • Supports special actions: high-level actions built with website-specific knowledge, such as performing a search or entering location data, to save you even more time.
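
A sketch of requesting browser-rendered HTML through Zyte API, with the same placeholder key as before; the scrollBottom action is an illustrative example, and the list of supported actions is in the Zyte API documentation:

```python
import requests

# Sketch: render a JavaScript-heavy page in a Zyte-managed browser and
# get the resulting HTML back, with no local browser to maintain.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),  # placeholder API key
    json={
        "url": "https://example.com",
        "browserHtml": True,  # run the page in a browser, return its HTML
        # Browser actions run before the HTML is captured; illustrative.
        "actions": [{"action": "scrollBottom"}],
    },
)
api_response.raise_for_status()
html = api_response.json()["browserHtml"]  # plain text, not base64
```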

Taking screenshots

Sometimes you want a screenshot of the webpage from which you are extracting data.

Screenshots can be handy as a visual record of the pages you scrape, and they can also be used, for example, for random quality checks in which you compare a screenshot against the data extracted from the same page.

To take webpage screenshots at scale, use Zyte API.
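
A sketch, under the same assumptions as the earlier snippets (placeholder API key; parameter names to be verified against the Zyte API documentation):

```python
import base64

import requests

# Sketch: capture a screenshot of a webpage through Zyte API.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),  # placeholder API key
    json={
        "url": "https://example.com",
        "screenshot": True,  # render the page and return a screenshot
    },
)
api_response.raise_for_status()
# The screenshot is base64-encoded in the JSON response.
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(api_response.json()["screenshot"]))
```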

Running your code

When running your web scraping code, you usually want a system where you can easily start, schedule, monitor, and inspect your web scraping jobs; where jobs can run uninterrupted for as long as needed; and where you can run as many parallel jobs as you wish.

Scrapy Cloud is our solution for running web scraping code in the cloud.

Avoiding breaking website changes

Websites change, and when they do, they can break your parsing code.

Monitoring your web scraping solution for breaking website changes, and addressing those changes, can be very time-consuming, and the effort grows as you target more websites.

One way to avoid this issue altogether is not to write parsing code to begin with. Instead, you can let Zyte API handle parsing for you.
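
As a sketch of what that looks like, the snippet below asks Zyte API for automatically extracted product data instead of raw HTML; the URL is illustrative, and the available data types and field names should be checked in the Zyte API documentation:

```python
import requests

# Sketch: let Zyte API parse the page and return structured data, so
# changes in the website's markup do not break parsing code on your side.
api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),  # placeholder API key
    json={
        "url": "https://example.com/some-product",  # illustrative URL
        "product": True,  # request automatically extracted product data
    },
)
api_response.raise_for_status()
product = api_response.json()["product"]
print(product.get("name"), product.get("price"))
```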