# Start a Scrapy project
To build your web scraping project, you will use Scrapy, a popular open source web scraping framework written in Python and maintained by Zyte.
## Set up your project
Install Python, version 3.9 or higher.

Tip: You can run `python --version` in a terminal window to make sure that you have a good-enough version of Python.

Open a terminal window.

Create a `web-scraping-tutorial` folder and make it your working folder:

```shell
mkdir web-scraping-tutorial
cd web-scraping-tutorial
```
Create and activate a Python virtual environment.

On Windows:

```shell
python3 -m venv venv
venv\Scripts\activate.bat
```

On macOS or Linux:

```shell
python3 -m venv venv
. venv/bin/activate
```
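The `python3 -m venv` command above is backed by the standard-library `venv` module, so the same environment can be created from Python itself. A minimal sketch, using a throwaway directory and `with_pip=False` to keep it fast (the command above bootstraps pip by default):

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway virtual environment to see what `python3 -m venv` produces.
# with_pip=False skips pip bootstrapping to keep the example fast.
env_dir = Path(tempfile.mkdtemp()) / "venv"
venv.create(env_dir, with_pip=False)

# Every virtual environment contains a pyvenv.cfg marker file.
print((env_dir / "pyvenv.cfg").exists())  # True
```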
Install Scrapy:

```shell
pip install scrapy==2.13.3
```

Make `web-scraping-tutorial` a Scrapy project folder:

```shell
scrapy startproject tutorial .
```
Your `web-scraping-tutorial` folder should now contain the following folders and files:

```
web-scraping-tutorial/
├── scrapy.cfg
└── tutorial/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```
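The generated `scrapy.cfg` is what marks the folder as a Scrapy project. After the command above it should contain a pointer to the project settings module along these lines (exact generated contents may vary by Scrapy version):

```ini
[settings]
default = tutorial.settings
```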
## Create your first spider
Now that you are all set up, you will write code to extract data from all books in the Mystery category of books.toscrape.com.
Create a file at `tutorial/spiders/books_toscrape_com.py` with the following code:

```python
from scrapy import Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY": 0.01,
    }
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css(".price_color::text").re_first("£(.*)"),
            "url": response.url,
        }
```
In the code above:

- You define a Scrapy spider class named `books_toscrape_com`.
- You set custom values for `CONCURRENT_REQUESTS_PER_DOMAIN` and `DOWNLOAD_DELAY` to speed up crawls during the tutorial. https://toscrape.com is a test site, so it is safe to do so.
- Your spider starts by sending a request for the Mystery category URL, http://books.toscrape.com/catalogue/category/books/mystery_3/index.html (`start_urls`), and parses the response with the default callback method, `parse`.
- The `parse` callback method:
  - Finds the link to the next page and, if found, yields a request for it, whose response will also be parsed by the `parse` callback method. As a result, `parse` eventually parses all pages of the Mystery category.
  - Finds links to book detail pages and yields requests for them, whose responses will be parsed by the `parse_book` callback method. As a result, `parse_book` eventually parses all book detail pages from the Mystery category.
- The `parse_book` callback method extracts a record of book information with the book name, price, and URL.
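The `re_first("£(.*)")` call in `parse_book` runs a regular expression over the selected text and returns the first capture group, so the price is extracted without the currency symbol. The pattern can be tried in isolation with Python's built-in `re` module (the sample price text below is just an illustrative value):

```python
import re

# Sample price text as it would appear on a book detail page
price_text = "£51.77"

# Same pattern the spider passes to re_first(): capture everything after £
match = re.search("£(.*)", price_text)
price = match.group(1) if match else None
print(price)  # 51.77
```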
Now run your code:

```shell
scrapy crawl books_toscrape_com -O books.csv
```
Once execution finishes, the generated books.csv file will contain records
for all books from the Mystery category of books.toscrape.com in CSV
format. You can open books.csv with any spreadsheet app.
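If you prefer to inspect the results programmatically rather than in a spreadsheet app, the standard-library `csv` module can read the file. A minimal sketch, using an inline stand-in for books.csv (the sample row and URL are illustrative; the column names match what the spider yields):

```python
import csv
import io

# Inline stand-in for books.csv; the real file has one row per scraped book
sample = (
    "name,price,url\n"
    "Sharp Objects,47.82,http://books.toscrape.com/catalogue/sharp-objects_997/index.html\n"
)

# With a real file, replace io.StringIO(sample) with open("books.csv", newline="")
with io.StringIO(sample) as f:
    rows = list(csv.DictReader(f))

for row in rows:
    print(row["name"], row["price"])  # Sharp Objects 47.82
```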
Continue to the next chapter to learn how you can easily deploy and run your web scraping project on the cloud.