Handle JavaScript content

Now that you know how to handle bans, you will learn how to handle websites that load content dynamically using JavaScript.

You will first reproduce what the JavaScript code does using regular HTTP requests, then use browser automation to achieve the same result, and finally interact with a page through a sequence of browser actions.

Reproduce JavaScript requests

Your next target will be http://quotes.toscrape.com/scroll, from which you will extract 100 quotes.

However, the HTML code of that page contains no quotes at all. All 100 quotes are loaded dynamically: the webpage uses JavaScript code to send requests to its own API. To get all 100 quotes, you will need to reproduce those requests.
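You can see those requests in the network panel of your browser's developer tools. You can also confirm what the API returns from a Python shell; here is a minimal sketch using only the standard library (the endpoint is the same one the spider below targets):

import json
from urllib.request import urlopen

# Fetch the first page of the API that the page's JavaScript calls.
with urlopen("http://quotes.toscrape.com/api/quotes?page=1") as response:
    data = json.load(response)

# The payload is JSON, not HTML, and carries a batch of quotes.
print(len(data["quotes"]))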

Create a file at tutorial/spiders/quotes_toscrape_com_scroll_api.py with the following code:

import json
from scrapy import Spider


class QuotesToScrapeComScrollAPISpider(Spider):
    name = "quotes_toscrape_com_scroll_api"
    start_urls = [
        f"http://quotes.toscrape.com/api/quotes?page={n}" for n in range(1, 11)
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "author": quote["author"]["name"],
                "tags": quote["tags"],
                "text": quote["text"],
            }

The code above sends 10 requests to the API of quotes.toscrape.com, reproducing what the JavaScript code at http://quotes.toscrape.com/scroll does, and then parses each JSON response to extract the desired data.

Now run your code:

scrapy crawl quotes_toscrape_com_scroll_api -O quotes.csv

After all 10 requests are processed, you will find all 100 quotes in quotes.csv.
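Hardcoding pages 1 to 10 works here because you know in advance that there are exactly 100 quotes. When the page count is unknown, you could instead follow the pagination information in the API responses themselves. Here is a sketch of that variant; the has_next and page fields are assumptions about the JSON structure, so inspect a real response to confirm them:

import json

from scrapy import Request, Spider


class QuotesToScrapeComScrollPaginatedSpider(Spider):
    name = "quotes_toscrape_com_scroll_paginated"
    start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "author": quote["author"]["name"],
                "tags": quote["tags"],
                "text": quote["text"],
            }
        # "has_next" and "page" are assumptions about the API's JSON;
        # inspect a real response to confirm the field names.
        if data.get("has_next"):
            yield Request(
                f"http://quotes.toscrape.com/api/quotes?page={data['page'] + 1}"
            )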

When the information that you want to extract is not readily available in the response HTML, but is instead loaded by JavaScript, reproducing that JavaScript logic manually, as you did above by sending those 10 requests, is one option. Next you will try an alternative approach.

Use browser automation

You will now ask Zyte API to use browser automation to render the page contents and return browser HTML, instead of raw HTML, rendering all 100 quotes with a single request.

Create a file at tutorial/spiders/quotes_toscrape_com_scroll_browser.py with the following code:

from scrapy import Request, Spider


class QuotesToScrapeComScrollBrowserSpider(Spider):
    name = "quotes_toscrape_com_scroll_browser"

    def start_requests(self):
        yield Request(
            "http://quotes.toscrape.com/scroll",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "scrollBottom",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                # [1:-1] strips the curly quotes wrapping each quote text.
                "text": quote.css(".text::text").get()[1:-1],
            }

The code above sends a single request to http://quotes.toscrape.com/scroll, but this request includes some metadata. That is why the start_requests method is used instead of start_urls: the latter does not allow defining request metadata.

The specified metadata tells Zyte API that you want the URL to be loaded in a web browser, that you want the scrollBottom action to be executed, and that you want the HTML rendering of the webpage DOM after that. The scrollBottom action keeps scrolling to the bottom of a webpage until the webpage stops loading additional content, so you get all 100 quotes, not only the first 10.

Now run your code:

scrapy crawl quotes_toscrape_com_scroll_browser -O quotes.csv

quotes.csv will contain the same data as before, only now it has been generated with a completely different approach.
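If you ever need to confirm that scrolling actually finished before the HTML snapshot was taken, you could log how many quotes were rendered. A minimal sketch of an adjusted parse method:

    def parse(self, response):
        quotes = response.css(".quote")
        # If scrollBottom had stopped early, fewer than 100 quotes
        # would appear in the browser HTML.
        self.logger.info("Rendered %d quotes", len(quotes))
        for quote in quotes:
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".text::text").get()[1:-1],
            }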

Which option is best, reproducing JavaScript code manually or using browser automation, depends on the scenario. To choose between them, factor in development time, run time, request count, request cost, and so on.

Use an action sequence

Sometimes, it can be really hard to reproduce JavaScript code manually, or the resulting code can break too easily, making the browser automation option a clear winner.

You will now extract a quote from http://quotes.toscrape.com/search.aspx by interacting with the search form through browser actions.

Create a file at tutorial/spiders/quotes_toscrape_com_search.py with the following code:

from scrapy import Request, Spider


class QuotesToScrapeComSearchSpider(Spider):
    name = "quotes_toscrape_com_search"

    def start_requests(self):
        yield Request(
            "http://quotes.toscrape.com/search.aspx",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#author"},
                            "values": ["Albert Einstein"],
                        },
                        {
                            # Wait until the "world" option exists in the
                            # DOM ("attached"); tag options only load after
                            # an author is selected.
                            "action": "waitForSelector",
                            "selector": {
                                "type": "css",
                                "value": '[value="world"]',
                                "state": "attached",
                            },
                        },
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#tag"},
                            "values": ["world"],
                        },
                        {
                            "action": "click",
                            "selector": {"type": "css", "value": "[type='submit']"},
                        },
                        {
                            "action": "waitForSelector",
                            "selector": {"type": "css", "value": ".quote"},
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".content::text").get()[1:-1],
            }

The code above sends a request that makes Zyte API load http://quotes.toscrape.com/search.aspx and perform the following actions:

  1. Select Albert Einstein as the author.

  2. Wait for the “world” tag to load.

  3. Select the “world” tag.

  4. Click the Search button.

  5. Wait for a quote to load.

From the HTML rendering of the DOM after those actions are executed, your code extracts all displayed quotes.

Now run your code:

scrapy crawl quotes_toscrape_com_search -O quotes.csv

quotes.csv will contain a single quote from Albert Einstein about the world.

If you were to write alternative code that reproduces the underlying JavaScript logic with regular HTTP requests, instead of relying on the browser HTML feature of Zyte API, it might take you a while to build a working solution, and your solution might be more fragile, i.e. more likely to break when the server code changes.
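For example, a manual reproduction of that search form would need to chain form submissions, since the tag options only exist after an author has been chosen. A first attempt might look like the following sketch, which relies on Scrapy's FormRequest.from_response to carry over the page's hidden form fields; the field names ("author", "tag") and the two-step flow are assumptions about how the page behaves:

from scrapy import FormRequest, Spider


class QuotesToScrapeComSearchManualSpider(Spider):
    name = "quotes_toscrape_com_search_manual"
    start_urls = ["http://quotes.toscrape.com/search.aspx"]

    def parse(self, response):
        # First submission: choosing an author makes the server return
        # the tag options available for that author.
        yield FormRequest.from_response(
            response,
            formdata={"author": "Albert Einstein"},
            callback=self.parse_tag_options,
        )

    def parse_tag_options(self, response):
        # Second submission: with the tag options loaded, select a tag
        # and run the search.
        yield FormRequest.from_response(
            response,
            formdata={"author": "Albert Einstein", "tag": "world"},
            callback=self.parse_quotes,
        )

    def parse_quotes(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".content::text").get()[1:-1],
            }

Code like this is tightly coupled to the form's current field names and submission flow, which is exactly why it tends to break when the server code changes.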

Continue to the next chapter to learn how you can avoid the need to write and maintain parsing and crawling code.