Web scraping with LLMs

With Scrapy, you can easily use a large language model (LLM) to automate or augment your web parsing.

There are multiple ways to use LLMs to help with web scraping. In this guide, we will invoke an LLM on every page from which we want to extract a set of attributes that we define, without having to write any selectors or train any models.

1. Start a Scrapy project

Follow the instructions on the Start a Scrapy project page of the web-scraping tutorial to start a Scrapy project.

2. Install LLM dependencies

This guide will use LiteLLM, which provides a unified API for many different LLMs.

For the purposes of this guide, we will run the Mistral 7B LLM through Ollama, but LiteLLM supports almost any LLM, as you will see later.

  1. Install html2text, LiteLLM, and the Ollama Python client (the ollama commands below also require the Ollama app itself, available from ollama.com):

    pip install html2text litellm ollama
    
  2. Start the Ollama server:

    ollama serve
    
  3. Open a second terminal, and install Mistral 7B:

    ollama pull mistral
    

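Optionally, before moving on, you can check that LiteLLM can reach your local Ollama server with a quick one-off script. This is only a sanity check, not part of the spider, and the prompt text is arbitrary:

from litellm import completion

# Send a trivial prompt to the local Mistral 7B model served by Ollama
# (LiteLLM reaches the Ollama server on its default port, 11434).
response = completion(
    model="ollama/mistral",
    messages=[{"role": "user", "content": "Reply with the single word: OK"}],
)
print(response["choices"][0]["message"]["content"])
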
3. Use an LLM in your spider

Now that you have a Scrapy project with a simple spider and an LLM ready to use, create an alternative to your first spider at tutorial/spiders/books_toscrape_com_llm.py with the following code:

import json
from json.decoder import JSONDecodeError
from logging import getLogger

import ollama
from html2text import HTML2Text
from litellm import acompletion
from scrapy import Spider

html_cleaner = HTML2Text()
logger = getLogger(__name__)


async def llm_parse(response, prompts):
    key_list = ", ".join(prompts)
    formatted_scheme = "\n".join(f"{k}: {v}" for k, v in prompts.items())
    markdown = html_cleaner.handle(response.text)
    llm_response = await acompletion(
        messages=[
            {
                "role": "user",
                "content": (
                    f"Return a JSON object with the following root keys: "
                    f"{key_list}\n"
                    f"\n"
                    f"Data to scrape:\n"
                    f"{formatted_scheme}\n"
                    f"\n"
                    f"Scrape it from the following Markdown text:\n"
                    f"\n"
                    f"{markdown}"
                ),
            },
        ],
        model="ollama/mistral",
    )
    data = llm_response["choices"][0]["message"]["content"]
    try:
        return json.loads(data)
    except JSONDecodeError:
        logger.error(f"LLM returned an invalid JSON for {response.url}: {data}")
        return {}


class BooksToScrapeComLLMSpider(Spider):
    name = "books_toscrape_com_llm"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    async def parse_book(self, response):
        prompts = {
            "name": "Product name",
            "price": "Product price as a number, without the currency symbol",
        }
        llm_data = await llm_parse(response, prompts)
        yield {
            "url": response.url,
            **llm_data,
        }

In the code above:

  • An llm_parse function is defined that accepts a Scrapy response and a dictionary mapping the fields to extract to their field-specific prompts.

    Then, the response is converted into Markdown syntax to make it easier for the LLM to parse, and the LLM is sent a prompt asking for a JSON object with the corresponding fields.

    Note

    Building a prompt that gets the expected data in the expected format is the hardest part of this process. The example prompt here works well for Mistral 7B and books.toscrape.com, but it might not work well for other LLMs or other websites.

    The LLM result is parsed as JSON and, if valid, returned; otherwise an error is logged and an empty dictionary is returned. A sketch for handling JSON wrapped in Markdown code fences follows this list.

  • llm_parse is called with field prompts to extract name and price, and the resulting dictionary is yielded together with an extra field that does not come from the LLM (url).
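
As noted above, the LLM does not always return valid JSON; some LLMs wrap their answer in Markdown code fences (for example, a leading ```json line). The helper below is a minimal, optional sketch of how such fences could be stripped before parsing; it is not part of the tutorial code and assumes the fences are the only extra decoration:

def strip_code_fences(text: str) -> str:
    """Remove a single leading and trailing Markdown code fence, if present."""
    text = text.strip()
    if text.startswith("```"):
        # Drop the opening fence line, which may carry a language tag (e.g. ```json).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence, if there is one.
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return text.strip()

With this helper in place, llm_parse could call json.loads(strip_code_fences(data)) instead of json.loads(data).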

You can now run your code:

scrapy crawl books_toscrape_com_llm -O books.csv

Execution will take a long time on most computers. The logs in the terminal running ollama serve will show how your LLM gets prompts and generates responses for them.

Once execution finishes, the generated books.csv file will contain records for all books from the Mystery category of books.toscrape.com in CSV format. You can open books.csv with any spreadsheet app.

Next steps

Here are some ideas for next steps to follow:

  • Try other LLMs.

    The following line from the code above is what selects Mistral 7B, served by a local instance of Ollama, as the LLM to use:

    model="ollama/mistral"
    

    If you have access to other LLMs, you can change this line to use some other LLM instead, and see how the change affects speed, quality, and costs.

    See the LiteLLM documentation for setup instructions for many different LLMs.

  • See if you can get the same output as Zyte API automatic extraction (e.g. product) with comparable speed, quality and cost.

  • See if you can also automate the crawling part and achieve something similar to what Zyte’s AI-powered spiders can do.

  • Try extracting data that is not available in a structured way in the source HTML, such as the book author, which can sometimes be found within the book description.

  • Try extracting data that is not directly available in the source HTML, such as the book language (English), the currency code (GBP), or a summary of the book description. A sketch that extends the prompts dictionary with these fields follows this list.

  • Try different approaches to HTML cleanup, or no cleanup at all.

    The code above converts the response HTML into Markdown because that allowed Mistral 7B to work as expected. Other LLMs may work with the raw HTML, maybe after some cleanup (see clear-html), and thus enable extracting some additional data that may be lost in the conversion to Markdown.

    Mind, however, that LLMs have a limited context length, so HTML cleanup and trimming can be necessary to fit the HTML into your prompt without exceeding it. A cleanup and trimming sketch follows this list.

  • If you have access to an LLM that supports image parsing, see if you can extend the spider to download the book covers, and extract additional information from them, such as the book author.

  • Instead of using an LLM per page, use your LLM to generate CSS selectors for the desired fields given the raw HTML of a first page, and use those selectors to parse all other pages.

    This minimizes LLM usage for better speed and cost, at the risk of lower quality on websites that use multiple layouts, run layout A/B tests, or change their layout mid-crawl. A sketch of this approach follows this list.
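
For the ideas about the book author, language, currency, and summary, the only change needed in the spider above is a larger prompts dictionary. The extra field names and prompt wording below are illustrative and may need tuning for your LLM:

prompts = {
    "name": "Product name",
    "price": "Product price as a number, without the currency symbol",
    # Not a structured field; sometimes mentioned in the book description.
    "author": "Author of the book, or null if not mentioned",
    # Not present in the HTML at all; the LLM has to infer these.
    "language": "Language the book is written in, e.g. English",
    "currency": "ISO 4217 code of the currency of the product price, e.g. GBP",
    "summary": "One-sentence summary of the book description",
}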
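
For the HTML cleanup idea, html2text exposes options that drop parts of the page during conversion, and you can trim the resulting Markdown to stay within your LLM's context length. The options and the character cap below are illustrative assumptions, not tuned values:

html_cleaner = HTML2Text()
html_cleaner.ignore_links = True   # drop hyperlink targets from the Markdown
html_cleaner.ignore_images = True  # drop image references from the Markdown

# Rough cap on prompt size; characters are only a crude proxy for tokens.
MAX_MARKDOWN_CHARS = 20_000


def to_markdown(response):
    """Convert the response HTML to Markdown and trim it to the cap."""
    markdown = html_cleaner.handle(response.text)
    return markdown[:MAX_MARKDOWN_CHARS]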
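
The selector-generation idea could take a shape like the sketch below, which reuses the imports from the spider module above: ask the LLM once for a CSS selector per field, cache the selectors on the spider, and use plain response.css() for every other page. The prompt wording, passing the raw HTML (which may need trimming to fit the context length), and the assumption that the LLM returns working Scrapy selectors ending in ::text are all untested assumptions:

async def llm_generate_selectors(response, prompts):
    """Ask the LLM for one Scrapy CSS selector per field, as a JSON object."""
    formatted_scheme = "\n".join(f"{k}: {v}" for k, v in prompts.items())
    llm_response = await acompletion(
        messages=[
            {
                "role": "user",
                "content": (
                    f"Return a JSON object that maps each of the following "
                    f"field names to a CSS selector, ending in ::text, that "
                    f"extracts that field from the HTML document below.\n"
                    f"\n"
                    f"Fields:\n"
                    f"{formatted_scheme}\n"
                    f"\n"
                    f"HTML document:\n"
                    f"\n"
                    f"{response.text}"
                ),
            },
        ],
        model="ollama/mistral",
    )
    data = llm_response["choices"][0]["message"]["content"]
    try:
        return json.loads(data)
    except JSONDecodeError:
        logger.error(f"LLM returned invalid selectors for {response.url}: {data}")
        return {}


class BooksToScrapeComLLMSelectorsSpider(Spider):
    name = "books_toscrape_com_llm_selectors"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]
    selectors = None  # generated from the first book page, then reused

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    async def parse_book(self, response):
        prompts = {
            "name": "Product name",
            "price": "Product price as a number, without the currency symbol",
        }
        if self.selectors is None:
            # Simplification: concurrent callbacks may each trigger an LLM call
            # before the first one finishes; a real spider would guard this.
            self.selectors = await llm_generate_selectors(response, prompts)
        yield {
            "url": response.url,
            **{
                field: response.css(selector).get()
                for field, selector in self.selectors.items()
            },
        }

Because response.css(selector).get() returns raw strings, a field like price would keep its currency symbol unless you post-process it, so the output will differ somewhat from the LLM-per-page approach.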