Scrapy Cloud spiders#

A Scrapy Cloud spider is a Scrapy spider from a Scrapy project that has been deployed to a Scrapy Cloud project. You can start jobs to execute the code of a spider.

Our web scraping tutorial covers creating, deploying, and running spiders. For more information, see the Scrapy documentation.

It is also possible to create spiders without code.

Spider templates and virtual spiders#

Scrapy Cloud supports defining spider templates, which you can use from the Scrapy Cloud UI to create virtual spiders that run the code of the corresponding spider template with predefined parameters.

Tip

Zyte’s AI-powered spiders are good examples of spider templates that you can customize to create new templates.

Spider templates#

To create a spider template:

  1. Add scrapy-spider-metadata as a dependency to your Scrapy Cloud project.

  2. On the spiders that you wish to use as templates, define a metadata dict that includes a title and description of your choice and sets template to True:

    from scrapy import Spider
    
    class MySpider(Spider):
        ...
        metadata = {
            "title": "My Template",
            "description": "Description of my template.",
            "template": True,
        }
    

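For example, assuming the usual deployment setup where scrapinghub.yml points at a requirements file (the project ID below is illustrative), the dependency is one extra line in that file:

```yaml
# scrapinghub.yml (sketch; your project ID will differ)
projects:
  default: 12345
requirements:
  file: requirements.txt
```

Then list scrapy-spider-metadata on its own line in requirements.txt.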
When you redeploy your code, you can start creating virtual spiders from your spider templates.

Note

Spider templates are also regular spiders, and can be executed directly as well.

Virtual spiders#

To create a virtual spider from a spider template, go to your Scrapy Cloud project page and, on the left-hand sidebar, under Spiders, select Create spider.

On the Create Spider page, you can select a template, define the parameters of your new virtual spider, and save your spider.

You can then use your virtual spider from Scrapy Cloud as if it were a regular spider.

Virtual spiders exist only in Scrapy Cloud, not in your code. However, changes to the code of their spider template will affect them.

Spider parameters#

The point of spider templates is to let you create virtual spiders from them, each behaving differently based on predefined parameters.

To expose parameters to the Scrapy Cloud UI so that they can be defined when creating a virtual spider, add a parameter specification to your template spiders using scrapy-spider-metadata:

from pydantic import BaseModel
from scrapy import Spider
from scrapy_spider_metadata import Args

class MyParams(BaseModel):
    foo: str

class MySpider(Args[MyParams], Spider):
    ...
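Under the hood, scrapy-spider-metadata validates the spider arguments against the parameter model. A minimal, pydantic-only sketch of that validation step (using the MyParams model from the example above):

```python
from pydantic import BaseModel, ValidationError

class MyParams(BaseModel):
    foo: str

# Scrapy Cloud passes the values entered in the UI as spider arguments;
# scrapy-spider-metadata validates them against the model, roughly like this:
params = MyParams.model_validate({"foo": "bar"})
print(params.foo)  # bar

# A missing required parameter is rejected:
try:
    MyParams.model_validate({})
except ValidationError:
    print("foo is required")
```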

Parameter types#

Scrapy Cloud supports the following parameter types:

  • bool

  • int, float (with gt, lt, ge, and le numeric constraint support)

  • str (with string constraint support)

    Scrapy Cloud also supports defining a placeholder through json_schema_extra:

    from pydantic import BaseModel, Field
    
    class MyParams(BaseModel):
        url: str = Field(
            json_schema_extra={
                "placeholder": "https://books.toscrape.com",
            },
        )
    
  • str + Enum

    Define enumMeta in json_schema_extra to give your enumeration choices an optional title and description:

    from enum import Enum
    
    from pydantic import BaseModel, Field
    
    class Foo(str, Enum):
        bar = "bar"
        baz = "baz"
    
    class MyParams(BaseModel):
        foo: Foo = Field(
            json_schema_extra={
                "enumMeta": {
                    Foo.bar: {
                        "title": "Bar",
                        "description": "Bar description.",
                    },
                    Foo.baz: {
                        "title": "Baz",
                        "description": "Baz description.",
                    },
                },
            },
        )
    

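As a sketch of how these constraints surface, pydantic renders them into the JSON schema that Scrapy Cloud reads. The field name and bounds below are illustrative:

```python
from pydantic import BaseModel, Field

class MyParams(BaseModel):
    # Illustrative int parameter constrained to the range 1..100.
    max_pages: int = Field(default=10, ge=1, le=100)

# ge/le become "minimum"/"maximum" in the generated JSON schema:
field_schema = MyParams.model_json_schema()["properties"]["max_pages"]
print(field_schema)
```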
Widgets#

Scrapy Cloud also supports a few special UI widgets that you can enable through the widget key of json_schema_extra, for example:

from pydantic import BaseModel, Field

class MyParams(BaseModel):
    foo: int = Field(
        json_schema_extra={
            "widget": "widget-id",
        },
    )

The following widgets are supported:

  • custom-attrs, to specify a custom attributes schema.

  • request-limit, to specify a maximum number of requests.

  • textarea, for multi-line text input.
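For instance, a multi-line text parameter could be declared as follows (the field name is illustrative); entries in json_schema_extra are merged into the field's JSON schema, which is how the widget key reaches the UI:

```python
from pydantic import BaseModel, Field

class MyParams(BaseModel):
    notes: str = Field(
        default="",
        json_schema_extra={"widget": "textarea"},  # multi-line input in the UI
    )

# The widget key appears in the generated JSON schema for the field:
print(MyParams.model_json_schema()["properties"]["notes"]["widget"])
```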

Parameter groups#

Scrapy Cloud also supports defining two or more optional parameters so that filling exactly one of them is required:

from pydantic import BaseModel, ConfigDict, Field

class MyParams(BaseModel):
    model_config = ConfigDict(
        json_schema_extra={
            "groups": [
                {
                    "id": "a-or-b",
                    "title": "A or B",
                    "description": "Fill A or B.",
                    "widget": "exclusive",
                },
            ],
        },
    )
    a: str = Field(
        default="",
        json_schema_extra={
            "group": "a-or-b",
            "exclusiveRequired": True,
        },
    )
    b: str = Field(
        default="",
        json_schema_extra={
            "group": "a-or-b",
            "exclusiveRequired": True,
        },
    )

Examples#

For more examples, see the source code of zyte_spider_templates.params.