Handle JavaScript content#
Now that you know how to handle bans, you will learn how to handle websites that load content dynamically using JavaScript.
You will first reproduce what the JavaScript code does with regular HTTP requests, then you will use browser automation and network capture to achieve the same result, and finally you will interact with a page through an action sequence.
Reproduce JavaScript requests#
Your next target will be http://quotes.toscrape.com/scroll, from which you will extract 100 quotes.
However, the HTML code of that page contains no quotes at all. All 100 quotes are loaded dynamically: the webpage uses JavaScript code to send requests to its own API. To get all 100 quotes, you will need to reproduce those requests.
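You can confirm this yourself with scrapy shell. Since scrapy shell fetches the raw, non-rendered HTML, the quote selector should match nothing:

scrapy shell http://quotes.toscrape.com/scroll
>>> len(response.css(".quote"))
0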
Create a file at tutorial/spiders/quotes_toscrape_com_scroll_api.py with the following code:
import json

from scrapy import Spider


class QuotesToScrapeComScrollAPISpider(Spider):
    name = "quotes_toscrape_com_scroll_api"
    start_urls = [
        f"http://quotes.toscrape.com/api/quotes?page={n}" for n in range(1, 11)
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "author": quote["author"]["name"],
                "tags": quote["tags"],
                "text": quote["text"],
            }
The code above sends 10 requests to the API of quotes.toscrape.com, reproducing what the JavaScript code at http://quotes.toscrape.com/scroll does, and then parses the JSON responses to extract the desired data.
Now run your code:
scrapy crawl quotes_toscrape_com_scroll_api -O quotes.csv
After all 10 requests are processed, all 100 quotes can be found in quotes.csv.
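Hardcoding 10 pages works here, but if you did not know the page count in advance, you could follow the API's pagination instead. The sketch below is an illustration only: the has_next and page field names are assumptions about the API response, so verify them against the actual JSON before relying on them, and the spider name is made up for this example.

import json

from scrapy import Request, Spider


class QuotesToScrapeComScrollPaginatedSpider(Spider):
    name = "quotes_toscrape_com_scroll_paginated"
    start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "author": quote["author"]["name"],
                "tags": quote["tags"],
                "text": quote["text"],
            }
        # "has_next" and "page" are assumed field names; check the JSON that
        # the API actually returns before relying on them.
        if data.get("has_next"):
            next_page = data["page"] + 1
            yield Request(f"http://quotes.toscrape.com/api/quotes?page={next_page}")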
When the information that you want to extract is not readily available in the response HTML but is loaded by JavaScript, reproducing those JavaScript requests manually, as you did above with those 10 API requests, is one option. Next you will try an alternative approach.
Use browser automation#
You will now ask Zyte API to use browser automation to render the page and return browser HTML instead of raw HTML, getting all 100 quotes rendered with a single Zyte API request.
Create a file at tutorial/spiders/quotes_toscrape_com_scroll_browser.py with the following code:
from scrapy import Request, Spider


class QuotesToScrapeComScrollBrowserSpider(Spider):
    name = "quotes_toscrape_com_scroll_browser"

    def start_requests(self):
        yield Request(
            "http://quotes.toscrape.com/scroll",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "scrollBottom",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".text::text").get()[1:-1],
            }
The code above sends a single request to http://quotes.toscrape.com/scroll, but this request includes some metadata. That is why the start_requests method is used instead of start_urls, since the latter does not allow defining request metadata.
The specified metadata indicates to Zyte API that you want the URL to be loaded in a web browser, that you want to execute the scrollBottom action, and that you want the HTML rendering of the webpage DOM after that. The scrollBottom action keeps scrolling to the bottom of a webpage until that webpage stops loading additional content, so that you get all 100 quotes, and not only the first 10.
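The [1:-1] slice in the parse method drops the curly quotation marks that wrap each quote in the rendered HTML. If you prefer something more explicit, you could strip those characters by name instead; a minimal sketch, assuming the text is wrapped in “ and ”:

def clean_text(text):
    # Strip the surrounding curly quotation marks, if present, instead of
    # relying on a positional [1:-1] slice.
    return text.strip("\u201c\u201d")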
Now run your code:
scrapy crawl quotes_toscrape_com_scroll_browser -O quotes.csv
quotes.csv will have the same data as before, except that now it has been generated through browser rendering.
Use network capture#
What if you could have the best of both worlds, i.e. use browser rendering to avoid reverse engineering, while still getting the API responses and not only what is loaded into the DOM?
You will now ask Zyte API to render the page in a browser and use network capture to record the API responses that the page triggers.
Create a file at tutorial/spiders/quotes_toscrape_com_scroll_capture.py with the following code:
import json
from base64 import b64decode

from scrapy import Request, Spider


class QuotesToScrapeComScrollCaptureSpider(Spider):
    name = "quotes_toscrape_com_scroll_capture"

    def start_requests(self):
        yield Request(
            "http://quotes.toscrape.com/scroll",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "scrollBottom",
                        },
                    ],
                    "networkCapture": [
                        {
                            "filterType": "url",
                            "httpResponseBody": True,
                            "value": "/api/",
                            "matchType": "contains",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for capture in response.raw_api_response["networkCapture"]:
            text = b64decode(capture["httpResponseBody"]).decode()
            data = json.loads(text)
            for quote in data["quotes"]:
                yield {
                    "author": quote["author"]["name"],
                    "tags": quote["tags"],
                    "text": quote["text"],
                }
The specified metadata indicates that you want to capture the body of any network response whose URL contains /api/.
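Captured response bodies arrive base64-encoded, which is why the spider decodes them before parsing. If your URL filter could ever match responses that are not JSON, you could guard the decoding step; a minimal sketch using the same capture structure as above (the helper name is made up):

import json
from base64 import b64decode


def iter_captured_quotes(captures):
    # Decode each captured body and skip any that is not valid JSON, so a
    # stray non-API capture does not break parsing.
    for capture in captures:
        text = b64decode(capture["httpResponseBody"]).decode()
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        yield from data.get("quotes", [])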
Now run your code:
scrapy crawl quotes_toscrape_com_scroll_capture -O quotes.csv
quotes.csv will have the same data as before, except that now it has been generated through network capture.
Which option is best (reproducing JavaScript requests manually, using browser-rendered HTML, or using network capture) depends on the scenario. To choose one, you need to factor in website specifics, development time, run time, request count, request cost, and so on.
Use an action sequence#
Sometimes, it can be really hard to reproduce JavaScript code manually, or the resulting code can break too easily, making the browser automation option a clear winner.
You will now extract a quote from http://quotes.toscrape.com/search.aspx by interacting with the search form through browser actions.
Create a file at tutorial/spiders/quotes_toscrape_com_search.py with the following code:
from scrapy import Request, Spider


class QuotesToScrapeComSearchSpider(Spider):
    name = "quotes_toscrape_com_search"

    def start_requests(self):
        yield Request(
            "http://quotes.toscrape.com/search.aspx",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#author"},
                            "values": ["Albert Einstein"],
                        },
                        {
                            "action": "waitForSelector",
                            "selector": {
                                "type": "css",
                                "value": "[value=\"world\"]",
                                "state": "attached",
                            },
                        },
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#tag"},
                            "values": ["world"],
                        },
                        {
                            "action": "click",
                            "selector": {"type": "css", "value": "[type='submit']"},
                        },
                        {
                            "action": "waitForSelector",
                            "selector": {"type": "css", "value": ".quote"},
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".content::text").get()[1:-1],
            }
The code above sends a request that makes Zyte API load http://quotes.toscrape.com/search.aspx and perform the following actions:
1. Select Albert Einstein as author.
2. Wait for the “world” tag to load.
3. Select the “world” tag.
4. Click the Search button.
5. Wait for a quote to load.
From the HTML rendering of the DOM after those actions are executed, your code extracts all displayed quotes.
Now run your code:
scrapy crawl quotes_toscrape_com_search -O quotes.csv
quotes.csv will have 1 quote from Albert Einstein about the world.
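If you needed to run the same search for other author and tag combinations, you could build the action sequence from parameters instead of hardcoding the values. A sketch under that assumption (the helper name and any values you pass to it are made up for illustration; it reuses the same Zyte API actions as the spider above):

def search_actions(author, tag):
    # Build the same action sequence as above for an arbitrary author/tag pair.
    return [
        {
            "action": "select",
            "selector": {"type": "css", "value": "#author"},
            "values": [author],
        },
        {
            "action": "waitForSelector",
            "selector": {
                "type": "css",
                "value": f'[value="{tag}"]',
                "state": "attached",
            },
        },
        {
            "action": "select",
            "selector": {"type": "css", "value": "#tag"},
            "values": [tag],
        },
        {
            "action": "click",
            "selector": {"type": "css", "value": "[type='submit']"},
        },
        {
            "action": "waitForSelector",
            "selector": {"type": "css", "value": ".quote"},
        },
    ]

You could then pass the returned list as the actions value in the request metadata, yielding one request per author and tag combination.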
If you instead tried to write alternative code that reproduces the underlying JavaScript logic with regular requests, rather than relying on the browser HTML feature of Zyte API, it might take you a while to build a working solution, and that solution would likely be more fragile, i.e. more likely to break when the server-side code changes.
Continue to the next chapter to learn how you can avoid the need to write and maintain parsing and crawling code.