Items API
Note
Even though these APIs support writing, they are most often used for reading. The crawlers running on Scrapinghub cloud are the ones that write to these endpoints. However, both operations are documented here for completeness.
The Items API lets you interact with the items stored in the hubstorage backend for your projects. For example, you can download all the items for the job 53/34/7 through:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7
Note
Most of the features provided by the API are also available through the python-scrapinghub client library.
Item object

| Field | Description |
|---|---|
| _type | The item definition. |
| _template | The template matched against. Portia only. |
| _cached_page_id | Cached page ID. Used to identify the scraped page in storage. |
Scraped fields will be top level alongside the internal fields listed above.
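As an illustration, the internal fields can be separated from the scraped ones by filtering on the leading underscore. A minimal sketch in Python; the item values below are made up:

```python
# A downloaded item: internal fields start with an underscore,
# everything else is scraped data. The values here are made up.
item = {
    "_type": "ProductItem",
    "_template": "527abc",
    "_cached_page_id": "deadbeef",
    "name": "Example product",
    "price": "9.99",
}

internal = {k: v for k, v in item.items() if k.startswith("_")}
scraped = {k: v for k, v in item.items() if not k.startswith("_")}
```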
items/:project_id[/:spider_id][/:job_id][/:item_no][/:field_name]

Retrieve or insert items for a project, spider, or job, where item_no is the index of the item.

| Parameter | Description | Required |
|---|---|---|
| format | Results format. See Result formats. | No |
| meta | Meta keys to show. | No |
| nodata | If set, no data is returned other than the requested meta fields. | No |
Note
Pagination and meta parameters are supported; see Pagination and Meta parameters.
| Header | Description |
|---|---|
| Content-Range | Can be used to specify a start index when inserting items. |
| Method | Description | Supported parameters |
|---|---|---|
| GET | Retrieve items for a given project, spider, or job. | format, meta, nodata |
| POST | Insert items for a given job. | N/A |
Note
Always use the pagination parameters (start, startafter and count) to limit the number of items in the response, to prevent timeouts and other performance issues. See the pagination examples below for more details.
Examples
Retrieve all items from a given job
HTTP:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7
Retrieve the first item from a given job
HTTP:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/0
Retrieve values from a single field
HTTP:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/1/fieldname
Here 1 is the item_no of the item for which the field value is retrieved.
Retrieve all items from a given spider
HTTP:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34
Retrieve all items from a given project
HTTP:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/
[Pagination] Retrieve first N items from a given job
HTTP:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7?count=10
[Pagination] Retrieve N items from a given job starting from the given item
HTTP:
$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?count=10&start=53/34/7/20"
[Pagination] Retrieve N items from a given job starting after the given item
HTTP:
$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?count=10&startafter=53/34/7/19"
[Pagination] Retrieve a few items from a given job by their IDs
HTTP:
$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?index=5&index=6"
Get meta field from items
To get only metadata from items, pass the nodata=1 parameter along with the meta field that you want to get.
HTTP:
$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/1/7?meta=_key&nodata=1"
{"_key":"53/1/7/0"}
{"_key":"53/1/7/1"}
{"_key":"53/1/7/2"}
Get items in a specific format
Check the available formats in the Result formats section of the API Overview.
JSON:
$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?meta=_key&nodata=1" -H "Accept: application/json"
[{"_key":"28144/1/1/0"},{"_key":"28144/1/1/1"},{"_key":"28144/1/1/2"}, ...]
JSON Lines:
$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?meta=_key&nodata=1" -H "Accept: application/x-jsonlines"
{"_key":"28144/1/1/0"}
{"_key":"28144/1/1/1"}
{"_key":"28144/1/1/2"}
...
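On the client side the two formats are decoded differently: application/json is a single array, while application/x-jsonlines is one object per line. A small sketch, with the response bodies hard-coded stand-ins rather than fetched over HTTP:

```python
import json

# Hard-coded stand-ins for the two response bodies shown above.
json_body = '[{"_key":"28144/1/1/0"},{"_key":"28144/1/1/1"}]'
jsonlines_body = '{"_key":"28144/1/1/0"}\n{"_key":"28144/1/1/1"}\n'

# application/json: a single JSON array holding every item.
items_from_json = json.loads(json_body)

# application/x-jsonlines: one JSON object per non-empty line.
items_from_jl = [json.loads(line)
                 for line in jsonlines_body.splitlines() if line]
```

JSON Lines can be decoded incrementally, line by line, which makes it the more practical choice for large jobs.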
Add items to a job via POST
Add the items stored in the file items.jl (JSON Lines format) to the job 53/34/7:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7 -X POST -T items.jl
Use the Content-Range header to specify a start index:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7 -X POST -T items.jl -H "content-range: items 500-/*"
The API returns 200 only if the data was successfully stored. There's no limit on the total amount of data you can send, but an HTTP 413 response is returned if any single item is over 1 MB.
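Producing the items.jl payload is plain JSON Lines serialization: one compact JSON object per line. A sketch in Python (the items and filename are illustrative; the upload itself is done with curl as shown above):

```python
import json

# Illustrative items to upload.
items = [
    {"name": "Example product", "price": "9.99"},
    {"name": "Another product", "price": "19.99"},
]

# One JSON object per line, each terminated by a newline.
payload = "".join(json.dumps(item) + "\n" for item in items)

with open("items.jl", "w") as f:
    f.write(payload)
```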
items/:project_id/:spider_id/:job_id/stats

Retrieve the item stats for a given job.

| Field | Description |
|---|---|
| counts[field] | The number of times the field was scraped. |
| totals.input_bytes | The total size of all items in bytes. |
| totals.input_values | The total number of items. |
| Parameter | Description | Required |
|---|---|---|
| all | Include hidden fields in results. | No |
| Method | Description | Supported parameters |
|---|---|---|
| GET | Retrieve item stats for the specified job. | all |
Example
Get the stats from a given job
HTTP:
$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/stats
{"counts":{"field1":9350,"field2":514},"totals":{"input_bytes":14390294,"input_values":10000}}
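The stats response lends itself to quick post-processing; for example, computing each field's fill rate and the average item size. A sketch over the response shown above:

```python
# The stats payload returned for the example job above.
stats = {
    "counts": {"field1": 9350, "field2": 514},
    "totals": {"input_bytes": 14390294, "input_values": 10000},
}

total_items = stats["totals"]["input_values"]

# Fraction of items in which each field appeared.
fill_rates = {field: count / total_items
              for field, count in stats["counts"].items()}

# Average serialized item size, in bytes.
avg_item_size = stats["totals"]["input_bytes"] / total_items
```

A low fill rate for a field is often the quickest signal that a spider's selector has partially broken.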