Introduction to web data extraction

Web data extraction, or web scraping, is the process of downloading data from websites into a structured format that you can process.

Use cases include price intelligence, market analysis, competitor intelligence, vendor management, lead generation, investment research, and brand monitoring.

Getting structured data from a website involves the following steps, illustrated in the sketch after the list:

  1. Building a list of target URLs from which you want to get structured data.

    You can build this list manually or get it from an external source; however, you will often want to use web data extraction itself to find all URLs of interest on a given website.

    This URL discovery process, or web crawling, is a web data extraction process in itself, with the same steps (target URLs, download, extraction): the target URL is often the homepage of the target website, and the extraction output is usually the list of target URLs.

  2. Downloading the webpages at those URLs.

    The main challenge at the download stage is avoiding bans. However, the complexity of this step can also be influenced by your extraction needs (e.g. screenshots) and extraction choices (e.g. browser automation).

  3. Extracting the data of interest from those webpages into a structured output format.

    Extraction can be a complex step, and it may involve downloading additional URLs or using browser automation.

    In the long term, however, the main challenge of the extraction stage is dealing with breaking changes in the target website.
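
Here is a minimal sketch of these three steps as a Scrapy spider, targeting books.toscrape.com (a public scraping sandbox); the CSS selectors and output fields are assumptions based on that site's markup:

```python
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    # Step 1 seed: the homepage of the target website.
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # Step 1 (crawling): discover the target URLs, here book detail pages.
        for url in response.css("article.product_pod h3 a::attr(href)").getall():
            # Step 2 (download): Scrapy downloads each followed URL.
            yield response.follow(url, callback=self.parse_book)
        # Keep discovering target URLs by following pagination links.
        yield from response.follow_all(css="li.next a", callback=self.parse)

    def parse_book(self, response):
        # Step 3 (extraction): yield the data of interest as structured output.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
        }
```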

If you prefer to leave this work in the hands of experts, consider our web data extraction services: share your requirements with us, and let us design, build, monitor, and maintain the solution you need.

Otherwise, below we cover some of the challenges that you may face while doing web data extraction on your own, and the solutions that we offer or recommend for every problem.

Choosing a framework

Choosing the right technology to write your code is key to the long-term success of your web data extraction project. When making your choice, consider aspects like development speed, performance, maintainability, and vendor lock-in.

At Zyte we use and maintain Scrapy, a popular open source web scraping framework written in Python. Scrapy is a powerful, extensible framework that favors writing maintainable code.

For most Zyte products and services we provide Scrapy plugins that make integration with Scrapy seamless.
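
For example, scrapy-zyte-api, the Scrapy plugin for Zyte API, can be enabled through Scrapy settings. The sketch below reflects the plugin's documented manual setup at the time of writing; check the plugin documentation for the current recommended configuration:

```python
# settings.py of a Scrapy project, enabling the scrapy-zyte-api plugin.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
# Zyte API requests are handled asynchronously, which requires the asyncio reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ZYTE_API_KEY = "YOUR_API_KEY"
# Route all regular Scrapy requests through Zyte API.
ZYTE_API_TRANSPARENT_MODE = True
```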

Avoiding bans

An increasing number of websites ban some of their traffic.

What you need to do to avoid bans depends on the target websites, and it can vary wildly. Proxy rotation is often necessary, but on top of that you may need extra logic, including cookie and session handling, browser-like JavaScript execution, and browser-like HTTP protocol handling.

While you can build ban avoidance on your own by combining different services and tools, such a setup can be time-consuming to implement, maintain, and scale.

To avoid bans, we provide Zyte API, which handles ban avoidance for you automatically and cost-efficiently.
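
For instance, a single Zyte API request that downloads a webpage with ban avoidance handled for you can look like the following sketch, using the requests library; replace YOUR_API_KEY, and note that toscrape.com is a public scraping sandbox:

```python
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    # HTTP basic authentication: the API key as username, empty password.
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "httpResponseBody": True,  # request the raw HTTP response body
    },
)
# The response body is returned Base64-encoded.
html = b64decode(api_response.json()["httpResponseBody"])
```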

Saving time with browser automation

For websites that use JavaScript to load content, there are two approaches you can take:

  • Reverse-engineering and recreating the JavaScript code that loads the content.

  • Letting a browser automation tool run the JavaScript code.

Reverse-engineering usually requires more development time, but it consumes fewer resources once implemented and can uncover useful, hidden data.

Browser automation usually saves development time, but it requires additional resources and can be hard to scale.

Zyte API provides browser automation features and, unlike regular browser automation tools, it:

  • Scales easily, offsetting one of the main drawbacks of using browser automation tools.

  • Supports special actions: high-level actions built with website-specific knowledge, such as searching or filling location data, to save you even more time.
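
As a sketch, a Zyte API browser request that scrolls to the bottom of the page to trigger lazy-loaded content could look like this; example.com stands in for a real target, and scrollBottom is one of the documented browser actions at the time of writing:

```python
import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://example.com",
        "browserHtml": True,  # render the page in a browser
        "actions": [
            # Scroll to the bottom so lazy-loaded content gets rendered.
            {"action": "scrollBottom"},
        ],
    },
)
# Unlike httpResponseBody, browserHtml is returned as plain text.
browser_html = api_response.json()["browserHtml"]
```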

Taking screenshots

Sometimes you want a screenshot of the webpage from which you are extracting data.

Screenshots can be handy as a visual representation of the extracted data, but they can also be used, for example, to perform random quality checks where you compare the screenshot with the extracted data.

To take webpage screenshots at scale, use Zyte API.
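
For example, a minimal screenshot request could look like the following sketch; the output file name is an assumption, and the image format can be tuned through the API's screenshot options:

```python
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://example.com",  # placeholder target webpage
        "screenshot": True,  # render the page and capture a screenshot
    },
)
# The screenshot is returned Base64-encoded.
with open("screenshot.jpeg", "wb") as screenshot_file:
    screenshot_file.write(b64decode(api_response.json()["screenshot"]))
```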

Running your code

When running your web data extraction code, you usually want a system where you can easily start, schedule, monitor, and inspect your web data extraction jobs, where those jobs can run uninterrupted for as long as needed, and where you can run as many parallel jobs as you wish.

Scrapy Cloud is our solution for running web data extraction code in the cloud.
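
For example, once a Scrapy project is deployed to Scrapy Cloud, jobs can be scheduled and inspected programmatically with the scrapinghub Python client; in this sketch, the project ID and spider name are placeholders:

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(123456)  # placeholder Scrapy Cloud project ID

# Schedule a job for a spider deployed in the project.
job = project.jobs.run("books")  # placeholder spider name

# Once the job has finished, iterate over its scraped items.
for item in job.items.iter():
    print(item)
```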

Avoiding breaking website changes

Websites change, and when they do they can break your web data extraction code.

Monitoring your web data extraction solution for breaking website changes, and addressing those changes, can be very time-consuming, and it scales up as you target more websites.

One way to avoid this issue altogether is not to write web data extraction code to begin with. Instead, you can let our Automatic Extraction service handle web data extraction for you.
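
As a sketch of what automatic extraction looks like from code, here is a hypothetical product-page request using the automatic extraction feature of Zyte API; the URL is a placeholder, and the available fields depend on the page:

```python
import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://example.com/product/1",  # placeholder product page
        "product": True,  # request automatically extracted product data
    },
)
# A structured record of the product, with fields such as name and price.
product = api_response.json()["product"]
print(product.get("name"), product.get("price"))
```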