Scrapy Cloud API#

Scrapy Cloud provides an HTTP API for interacting with your spiders, jobs and scraped data.

Getting started#

Authentication#

You’ll need to authenticate using your API key.

There are two ways to authenticate:

HTTP Basic:

$ curl -u APIKEY: https://storage.scrapinghub.com/foo

URL Parameter:

$ curl https://storage.scrapinghub.com/foo?apikey=APIKEY

Example#

Running a spider is simple:

$ curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER

Where APIKEY is your API key, PROJECT is the spider’s project ID, and SPIDER is the name of the spider you want to run.

It’s possible to override Scrapy settings for a job:

$ curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER \
    -d job_settings='{"LOG_LEVEL": "DEBUG"}'

job_settings should be a valid JSON and will be merged with project and spider settings provided for given spider.

API endpoints#

app.scrapinghub.com#

storage.scrapinghub.com#

Python client#

You can use the python-scrapinghub library to interact with Scrapy Cloud API. Check the documentation for installation instructions and usage examples.

Pagination#

You can paginate the results for the majority of the APIs using a number of parameters. The pagination parameters differ depending on the target host for a given endpoint.

app.scrapinghub.com#

Parameter

Description

count

Number of results per page.

offset

Offset to retrieve specific records.

storage.scrapinghub.com#

Parameter

Description

count

Number of results per page.

index

Offset to retrieve specific records. Multiple values supported.

start

Skip results before the given one. See a note about format below.

startafter

Return results after the given one. See a note about format below.

Note

The parameters naming inconsistency is caused by historical reasons and will be fixed in the coming platform updates.

Note

While index parameter is just a short <entity_id> (ex: index=4), start and startafter parameters should have the full form <project_id>/<spider_id>/<job_id>/<entity_id> (ex: start=1/2/3/4, startafter=1/2/3/3).

Result formats#

There are two ways to specify the format of results: Using the Accept header, or using the format parameter.

The Accept header supports the following values:

  • application/x-jsonlines

  • application/json

  • application/xml

  • text/plain

  • text/csv

The format parameter supports the following values:

  • json

  • jl

  • xml

  • csv

  • text

XML-RPC data types are used for XML output.

CSV parameters#

Parameter

Description

Required

fields

Comma delimited list of fields to include, in order from left to right.

Yes

include_headers

When set to ‘1’ or ‘Y’, show header names in first row.

No

sep

Separator character.

No

quote

Quote character.

No

escape

Escape character.

No

lineend

Line end string.

No

When using CSV, you will need to specify the fields parameter to indiciate required fields and their order. Example:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?format=csv&fields=id,name&include_headers=1"

Headers#

gzip compression is supported. A client can specify that gzip responses can be handled using the accept-encoding: gzip request header. content-encoding: gzip header must be present in the response to signal the gzip content encoding.

You can use the saveas request parameter to specify a filename for browser downloads. For example, specifying ?saveas=foo.json will cause a header of Content-Disposition: Attachment; filename=foo.json to be returned.

Meta parameters#

You can use the meta parameter to return metadata for the record in addition to its core data.

The following values are available:

Parameter

Description

_key

The item key in the format :project_id/:spider_id/:job_id/:item_no.

_ts

Timestamp in milliseconds for when the item was added.

Example:

$ curl "https://storage.scrapinghub.com/items/53/34/7?meta=_key&meta=_ts"
{"_key":"1111111/1/1/0","_ts":1342078473363, ... }

Note

If the data contains fields with the same name as the requested fields, they will both appear in the result.