Scrapy Cloud tutorial

In this tutorial you will learn to deploy and run a Scrapy project in Scrapy Cloud.

Tip

The rest of this tutorial will be easier to follow if you first learn a bit of Python and Scrapy.

Start a Scrapy project

  1. Install Python, version 3.7 or later.

    Tip

    You can run python --version in a terminal window to make sure that your Python version is recent enough.

  2. Open a terminal window.

  3. Create a scrapy-cloud-tutorial folder and make it your working folder:

    mkdir scrapy-cloud-tutorial
    cd scrapy-cloud-tutorial
    
  4. Create and activate a Python virtual environment.

    • On Windows:

      python -m venv tutorial-env
      tutorial-env\Scripts\activate.bat
      
    • On macOS and Linux:

      python3 -m venv tutorial-env
      . tutorial-env/bin/activate
      
  5. Install the latest versions of the Python packages that you will use during this tutorial:

    pip install --upgrade scrapy shub
    
  6. Make scrapy-cloud-tutorial a Scrapy project folder:

    scrapy startproject tutorial .
    

    Your scrapy-cloud-tutorial folder should now contain the following folders and files:

    scrapy-cloud-tutorial/
    ├── scrapy.cfg
    └── tutorial/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── __init__.py
    
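If you want to double-check the setup before moving on, both tools can report their versions:

scrapy version
shub version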

Write a Scrapy spider

Now that you are all set up, you will write code to extract data from all books in the Mystery category of books.toscrape.com.

Create a file at tutorial/spiders/books_toscrape_com.py with the following code:

from scrapy import Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css(".price_color::text").re_first("£(.*)"),
            "url": response.url,
        }

In the code above:

  • You define a Scrapy spider class, BooksToScrapeComSpider, with books_toscrape_com as its name, the identifier that you will use to run the spider.

  • Your spider starts by sending a request for the Mystery category URL in start_urls, http://books.toscrape.com/catalogue/category/books/mystery_3/index.html, and parses the response with the default callback method, parse.

  • The parse callback method:

    • Finds the link to the next page and, if found, yields a request for it, whose response will also be parsed by the parse callback method.

      As a result, the parse callback method eventually parses all pages of the Mystery category.

    • Finds links to book detail pages, and yields requests for them, whose responses will be parsed by the parse_book callback method.

      As a result, the parse_book callback method eventually parses all book detail pages from the Mystery category.

  • The parse_book callback method extracts a record of book information with the book name, price, and URL.
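
Tip

If you want to experiment with these CSS selectors before running the spider, you can load the category page in the Scrapy shell, which fetches a URL and opens a Python prompt with the response already available:

scrapy shell "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"

Then, at the >>> prompt:

response.css(".next a::attr(href)").get()  # relative URL of the next page, if any
response.css("article a")                  # link selectors for the books on the page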

Now run your code:

scrapy crawl books_toscrape_com -O books.csv

Once execution finishes, the generated books.csv file will contain records for all books from the Mystery category of books.toscrape.com in CSV format; the -O option tells Scrapy to overwrite books.csv if it already exists. You can open books.csv with any spreadsheet app.
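
Since the crawl writes plain CSV, you can also inspect the output with a few lines of Python instead of a spreadsheet app. Here is a minimal sketch, assuming books.csv is in your current working folder:

import csv

# Print the name and price of every scraped book.
with open("books.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(f"{row['name']}: £{row['price']}")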

Deploy your Scrapy project to Scrapy Cloud

Now that you have a working Scrapy spider, you will deploy your Scrapy project to a Scrapy Cloud project.

  1. Copy your Zyte dashboard API key.

  2. Run the following command and, when prompted, paste your API key and press Enter:

    shub login
    
  3. On the Zyte dashboard, select your Scrapy Cloud project under Scrapy Cloud Projects, and copy your Scrapy Cloud project ID from the web browser URL bar.

    For example, if the URL is https://app.zyte.com/p/000000/jobs, 000000 is your Scrapy Cloud project ID.

  4. Make sure scrapy-cloud-tutorial is your current working folder.

  5. Run the following command, replacing 000000 with your actual project ID:

    shub deploy 000000
    

Your Scrapy project has now been deployed to your Scrapy Cloud project.
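
shub also records the target project in a scrapinghub.yml file at the root of scrapy-cloud-tutorial, so that later deployments do not need the ID on the command line. Assuming project ID 000000, the file typically looks like this:

projects:
  default: 000000

With that file in place, running shub deploy with no arguments deploys to the same project.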

Run a Scrapy Cloud job

Now that you have deployed your Scrapy project to your Scrapy Cloud project, it is time to run your spider on Scrapy Cloud:

  1. On the Zyte dashboard, select your Scrapy Cloud project under Scrapy Cloud Projects.

  2. On the Dashboard page of your project, select Run in the top-right corner.

  3. In the Run dialog box:

    1. Select the Spiders field and, from the spider list that appears, select your spider name.

    2. Select Run.


    A new Scrapy Cloud job will appear in the Running job list. Once the job finishes, it will move to the Completed job list.

  4. Follow the link in the Job column, 1/1.

  5. On the job page, select the Items tab.

  6. On the Items page, select Export → CSV.

The downloaded file will have the same data as the books.csv file that you generated locally earlier.
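
If you prefer to fetch job items programmatically instead of through the dashboard, you can use the scrapinghub Python client (pip install scrapinghub). Here is a minimal sketch, assuming project ID 000000, job 1/1, and YOUR_API_KEY as a placeholder for your Zyte dashboard API key:

from scrapinghub import ScrapinghubClient

# Authenticate with your Zyte dashboard API key (placeholder below).
client = ScrapinghubClient("YOUR_API_KEY")

# Job IDs follow the <project>/<spider>/<job> pattern.
job = client.get_job("000000/1/1")

# Each item is the dict that your spider yielded: name, price, and url.
for item in job.items.iter():
    print(item)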

Next steps

Now that you know how to deploy your code and run a job in Scrapy Cloud, see Scrapy Cloud usage for more in-depth documentation.