Scrapy Cloud tutorial
In this tutorial you will learn to deploy and run a Scrapy project in Scrapy Cloud.
Tip
The rest of this tutorial will be easier to follow if you first learn a bit of Python and Scrapy.
Start a Scrapy project
Install Python, version 3.7 or higher.

Tip

You can run

```
python --version
```

in a terminal window to make sure that you have a recent-enough version of Python.

Open a terminal window.
Create a scrapy-cloud-tutorial folder and make it your working folder:

```
mkdir scrapy-cloud-tutorial
cd scrapy-cloud-tutorial
```
Create and activate a Python virtual environment.

On Windows:

```
python3 -m venv tutorial-env
tutorial-env\Scripts\activate.bat
```

On macOS and Linux:

```
python3 -m venv tutorial-env
. tutorial-env/bin/activate
```
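If you want to double-check that the virtual environment is active, one quick sketch (not part of the original tutorial) is to print sys.prefix, which should point inside the tutorial-env folder while the environment is active:

```
python -c "import sys; print(sys.prefix)"
```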
Install the latest versions of the Python packages that you will use during this tutorial:

```
pip install --upgrade scrapy shub
```
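To confirm that the installation worked, you can check what got installed; for example, `scrapy version` prints the installed Scrapy version, and `pip show shub` prints details about the installed shub package:

```
scrapy version
pip show shub
```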
Make scrapy-cloud-tutorial a Scrapy project folder:

```
scrapy startproject tutorial .
```
Your scrapy-cloud-tutorial folder should now contain the following folders and files:

```
scrapy-cloud-tutorial/
├── scrapy.cfg
└── tutorial/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```
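Among these files, scrapy.cfg marks the project root and tells Scrapy which settings module the project uses. The generated file may vary slightly across Scrapy versions, but its key section looks roughly like this:

```
[settings]
default = tutorial.settings
```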
Write a Scrapy spider
Now that you are all set up, you will write code to extract data from all books in the Mystery category of books.toscrape.com.
Create a file at tutorial/spiders/books_toscrape_com.py with the following code:
```python
from scrapy import Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        # Follow the pagination link, if any, so every category page is parsed.
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        # Follow links to book detail pages, parsed by parse_book.
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    def parse_book(self, response):
        # Extract one record per book detail page.
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css(".price_color::text").re_first("£(.*)"),
            "url": response.url,
        }
```
In the code above:

- You define a Scrapy spider class named books_toscrape_com.
- Your spider starts by sending a request for the Mystery category URL, http://books.toscrape.com/catalogue/category/books/mystery_3/index.html (start_urls), and parses the response with the default callback method, parse.
- The parse callback method:
  - Finds the link to the next page and, if found, yields a request for it, whose response will also be parsed by the parse callback method. As a result, the parse callback method eventually parses all pages of the Mystery category.
  - Finds links to book detail pages and yields requests for them, whose responses will be parsed by the parse_book callback method. As a result, the parse_book callback method eventually parses all book detail pages from the Mystery category.
- The parse_book callback method extracts a record of book information with the book name, price, and URL.
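If you want to see how these CSS selectors behave before running a full crawl, Scrapy's interactive shell is handy. A minimal sketch (the selector calls are illustrative; the exact return values depend on the live page):

```
scrapy shell "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
>>> response.css(".next a::attr(href)").get()   # relative URL of the next page, if any
>>> len(response.css("article a"))              # number of book links on this page
>>> response.css("h1::text").get()              # heading text of the current page
```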
Now run your code:

```
scrapy crawl books_toscrape_com -O books.csv
```
Once execution finishes, the generated books.csv file will contain records for all books from the Mystery category of books.toscrape.com in CSV format. You can open books.csv with any spreadsheet app.
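The -O option tells Scrapy to overwrite the output file, and the feed format is inferred from the file extension, so exporting the same data in another format only takes a different extension. For example, to write JSON instead:

```
scrapy crawl books_toscrape_com -O books.json
```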
Deploy your Scrapy project to Scrapy Cloud
Now that you have a working Scrapy spider, you will deploy your Scrapy project to a Scrapy Cloud project.
Run the following command and, when prompted, paste your API key and press Enter:

```
shub login
```
On the Zyte dashboard, select your Scrapy Cloud project under Scrapy Cloud Projects, and copy your Scrapy Cloud project ID from the web browser URL bar. For example, if the URL is https://app.zyte.com/p/000000/jobs, 000000 is your Scrapy Cloud project ID.

Make sure scrapy-cloud-tutorial is your current working folder.

Run the following command, replacing 000000 with your actual project ID:

```
shub deploy 000000
```
Your Scrapy project has now been deployed to your Scrapy Cloud project.
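Typing the project ID on every deploy is optional: shub can read it from a scrapinghub.yml file in the project root, and may offer to save it there for you after the first deploy. A minimal sketch, with 000000 again standing in for your real project ID:

```yaml
# scrapinghub.yml at the project root
projects:
  default: 000000
```

With that file in place, `shub deploy` alone deploys to the default project.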
Run a Scrapy Cloud job
Now that you have deployed your Scrapy project to your Scrapy Cloud project, it is time to run your spider on Scrapy Cloud:
On the Zyte dashboard, select your Scrapy Cloud project under Scrapy Cloud Projects.
On the Dashboard page of your project, select Run at the top-right corner.
In the Run dialog box:
Select the Spiders field and, from the spider list that appears, select your spider name.
Select Run.
A new Scrapy Cloud job will appear in the Running job list.
Once the job finishes, it will move to the Completed job list.
Follow the link from the Job column, 1/1.
On the job page, select the Items tab.
On the Items page, select Export → CSV.
The downloaded file will have the same data as the books.csv file that you generated locally earlier.
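The dashboard is not the only way to run jobs and fetch items; shub exposes the same operations on the command line. A sketch, assuming project ID 000000 and the job numbered 1/1 as above:

```
shub schedule 000000/books_toscrape_com   # start a new job for this spider
shub items 000000/1/1                     # print the items of job 1/1 as JSON lines
```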
Next steps
Now that you know how to deploy your code and run a job in Scrapy Cloud, see Scrapy Cloud usage for more in-depth documentation of Scrapy Cloud.