Scrapy Cloud spiders#
A Scrapy Cloud spider is a Scrapy spider that is part of a Scrapy project that has been deployed into a Scrapy Cloud project. You can start jobs to execute the code of a spider.
Our web scraping tutorial covers creating, deploying, and running spiders. For more information, see the Scrapy documentation.
It is also possible to create spiders without code.
Spider templates and virtual spiders#
Scrapy Cloud supports defining spider templates that you can use from the Scrapy Cloud UI to create virtual spiders, which run the code of the corresponding spider template with predefined parameters.
Tip
Zyte’s AI-powered spiders are good examples of spider templates that you can customize to create new templates.
Spider templates#
To create a spider template:
Add scrapy-spider-metadata as a dependency to your Scrapy Cloud project.
On the spiders that you wish to use as templates, define metadata including a title and description of your choice, and setting template to True:
from scrapy import Spider


class MySpider(Spider):
    ...
    metadata = {
        "title": "My Template",
        "description": "Description of my template.",
        "template": True,
    }
When you redeploy your code, you can start creating virtual spiders from your spider templates.
Note
Spider templates are also regular spiders, and can be executed directly as well.
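Before redeploying, it can help to check which spider classes are actually marked as templates. The following is a minimal sketch that only inspects the metadata attribute; is_template is a hypothetical helper, not part of scrapy-spider-metadata, and the two classes are stand-ins for real scrapy.Spider subclasses.

```python
def is_template(spider_cls: type) -> bool:
    """Return True if the class metadata marks it as a spider template."""
    metadata = getattr(spider_cls, "metadata", None) or {}
    return metadata.get("template") is True


# Stand-in classes; real spiders would subclass scrapy.Spider.
class TemplateSpider:
    metadata = {
        "title": "My Template",
        "description": "Description of my template.",
        "template": True,
    }


class PlainSpider:
    pass


# Collect the names of the classes flagged as templates.
template_names = [
    cls.__name__ for cls in (TemplateSpider, PlainSpider) if is_template(cls)
]
print(template_names)
```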
Virtual spiders#
To create a virtual spider from a spider template, go to your Scrapy Cloud project page and, on the left-hand sidebar, under Spiders, select Create spider.
On the Create Spider page, you can select a template, define the parameters of your new virtual spider, and save your spider.
You can then use your virtual spider from Scrapy Cloud as if it were a regular spider.
Virtual spiders exist only in Scrapy Cloud, not in your code. However, changes to the code of their spider template will affect them.
Spider parameters#
Spider templates exist so that you can create multiple virtual spiders from them, each behaving differently according to its predefined parameters.
To expose parameters to the Scrapy Cloud UI so that they can be defined when creating a virtual spider, add a parameter specification to your template spiders using scrapy-spider-metadata:
from pydantic import BaseModel
from scrapy import Spider
from scrapy_spider_metadata import Args


class MyParams(BaseModel):
    foo: str


class MySpider(Args[MyParams], Spider):
    ...
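To see what a parameter specification looks like as data, you can ask pydantic for the JSON Schema of the model. The sketch below uses only pydantic (version 2 is assumed); that Scrapy Cloud builds its form from an equivalent schema is an assumption, and max_pages is a hypothetical parameter added for illustration.

```python
from pydantic import BaseModel, ValidationError


class MyParams(BaseModel):
    foo: str
    max_pages: int = 10  # hypothetical optional parameter


schema = MyParams.model_json_schema()
print(schema["required"])  # parameters without a default are required
print(schema["properties"]["max_pages"])

# Spider arguments often arrive as strings; pydantic coerces them by default.
params = MyParams(foo="books", max_pages="3")
print(params.max_pages)

# Omitting a required parameter fails validation.
try:
    MyParams()
    missing_rejected = False
except ValidationError as error:
    missing_rejected = True
    print(len(error.errors()), "validation error(s)")
```

Giving a parameter a default value is what makes it optional; parameters without one must be filled.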
Parameter types#
Scrapy Cloud supports the following parameter types:
bool
int, float (with gt, lt, ge, and le numeric constraint support)
str (with string constraint support). Scrapy Cloud also supports defining a placeholder through json_schema_extra:
from pydantic import BaseModel, Field


class MyParams(BaseModel):
    url: str = Field(
        json_schema_extra={
            "placeholder": "https://books.toscrape.com",
        },
    )

str + Enum. Define enumMeta in json_schema_extra to give your enumeration choices an optional title and description:
from enum import Enum

from pydantic import BaseModel, Field


class Foo(str, Enum):
    bar: str = "bar"
    baz: str = "baz"


class MyParams(BaseModel):
    foo: Foo = Field(
        json_schema_extra={
            "enumMeta": {
                Foo.bar: {
                    "title": "Bar",
                    "description": "Bar description.",
                },
                Foo.baz: {
                    "title": "Baz",
                    "description": "Baz description.",
                },
            },
        },
    )
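As an illustration of the numeric and string constraints mentioned above, the following sketch declares them with pydantic's Field and shows how they surface in the model's JSON Schema. It assumes pydantic version 2; max_pages and query are hypothetical parameters, and how the Scrapy Cloud UI presents each constraint is not covered here.

```python
from pydantic import BaseModel, Field, ValidationError


class MyParams(BaseModel):
    # Hypothetical parameters, for illustration only.
    max_pages: int = Field(default=10, gt=0, le=100)
    query: str = Field(default="", max_length=50)


properties = MyParams.model_json_schema()["properties"]
# gt and le become exclusiveMinimum and maximum in the schema.
print(properties["max_pages"]["exclusiveMinimum"], properties["max_pages"]["maximum"])
# max_length becomes maxLength.
print(properties["query"]["maxLength"])

# The same constraints are enforced when the model is instantiated.
try:
    MyParams(max_pages=0)  # violates gt=0
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```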
Widgets#
Scrapy Cloud also supports a few special UI widgets that you can enable through the widget key of json_schema_extra, e.g.:
from pydantic import BaseModel, Field


class MyParams(BaseModel):
    foo: int = Field(
        json_schema_extra={
            "widget": "widget-id",
        },
    )
The following widgets are supported:
custom-attrs, to specify a custom attributes schema.
request-limit, to specify a maximum number of requests.
textarea, for multi-line text input.
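As a concrete case, a parameter that takes one URL per line could use the textarea widget. The sketch below is an assumption-laden example rather than a prescribed pattern: the urls parameter, its title and description, and the line-splitting logic are all hypothetical, and pydantic version 2 is assumed.

```python
from pydantic import BaseModel, Field


class MyParams(BaseModel):
    # Hypothetical multi-line parameter using the textarea widget.
    urls: str = Field(
        default="",
        title="URLs",
        description="One URL per line.",
        json_schema_extra={"widget": "textarea"},
    )


# The widget key ends up next to the other properties of the parameter.
url_schema = MyParams.model_json_schema()["properties"]["urls"]
print(url_schema["widget"])

# Split the multi-line value into a clean list, skipping blank lines.
params = MyParams(urls="https://books.toscrape.com\n\nhttps://example.com\n")
url_list = [line.strip() for line in params.urls.splitlines() if line.strip()]
print(url_list)
```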
Parameter groups#
Scrapy Cloud also supports grouping two or more optional parameters so that exactly one of them must be filled:
from pydantic import BaseModel, ConfigDict, Field


class MyParams(BaseModel):
    model_config = ConfigDict(
        json_schema_extra={
            "groups": [
                {
                    "id": "a-or-b",
                    "title": "A or B",
                    "description": "Fill A or B.",
                    "widget": "exclusive",
                },
            ],
        },
    )

    # A default keeps each field optional at the model level.
    a: str = Field(
        default="",
        json_schema_extra={
            "group": "a-or-b",
            "exclusiveRequired": True,
        },
    )
    b: str = Field(
        default="",
        json_schema_extra={
            "group": "a-or-b",
            "exclusiveRequired": True,
        },
    )
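The group metadata describes the rule to the UI. If you also want the model itself to reject invalid combinations, for example when a template spider is run directly, one option is a pydantic model validator. This is a sketch assuming pydantic version 2; the exactly_one_filled validator and the is_valid helper are hypothetical, and whether Scrapy Cloud enforces the group outside the UI is not stated here.

```python
from pydantic import BaseModel, Field, ValidationError, model_validator


class MyParams(BaseModel):
    a: str = Field(
        default="",
        json_schema_extra={"group": "a-or-b", "exclusiveRequired": True},
    )
    b: str = Field(
        default="",
        json_schema_extra={"group": "a-or-b", "exclusiveRequired": True},
    )

    @model_validator(mode="after")
    def exactly_one_filled(self):
        # Count the group members that received a non-empty value.
        filled = [name for name in ("a", "b") if getattr(self, name)]
        if len(filled) != 1:
            raise ValueError("Fill exactly one of a or b.")
        return self


def is_valid(**kwargs) -> bool:
    """Return True if the given arguments pass validation."""
    try:
        MyParams(**kwargs)
    except ValidationError:
        return False
    return True


print(is_valid(a="x"), is_valid(a="x", b="y"), is_valid())
```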
Examples#
For more examples, see the source code of zyte_spider_templates.params.