Items API

Note

Although these APIs support writing, they are most often used for reading. Writes normally come from the crawlers running on Scrapinghub cloud. Both operations are documented here for completeness.

The Items API lets you interact with the items stored in the hubstorage backend for your projects. For example, you can download all the items for the job '53/34/7' through:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7
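The same request can be made from Python. The following is a minimal sketch using only the standard library, not an official client. The names build_request and fetch_items are hypothetical. It assumes that curl's "-u APIKEY:" maps to HTTP Basic authentication with the API key as the username and an empty password, and it asks for JSON Lines through the Accept header so that each response line can be decoded on its own.

```python
import base64
import json
import urllib.request

BASE_URL = "https://storage.scrapinghub.com/items"


def build_request(job_key, api_key):
    """Build a GET request for all items of a job, e.g. job_key='53/34/7'."""
    # Basic auth: API key as the username, empty password (same as "-u APIKEY:").
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{BASE_URL}/{job_key}",
        headers={
            "Authorization": f"Basic {token}",
            # Ask for one JSON object per line.
            "Accept": "application/x-jsonlines",
        },
    )


def fetch_items(job_key, api_key):
    """Yield items one at a time (performs a network request when iterated)."""
    with urllib.request.urlopen(build_request(job_key, api_key)) as response:
        for line in response:
            if line.strip():
                yield json.loads(line)
```

Calling fetch_items("53/34/7", "APIKEY") returns a generator, so no request is sent until the caller starts iterating.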

Note

Most of the features provided by the API are also available through the python-scrapinghub client library.

Item object

Field              Description
_type              The item definition.
_template          The template matched against. Portia only.
_cached_page_id    Cached page ID. Used to identify the scraped page in storage.

Scraped fields will be top level alongside the internal fields listed above.
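Because internal and scraped fields share one level, client code often needs to tell them apart. The following is a small sketch of one way to do that. The function name split_item and the sample item are hypothetical, and the set of internal fields contains only the three names listed above.

```python
# Internal fields listed above; everything else is treated as scraped data.
INTERNAL_FIELDS = {"_type", "_template", "_cached_page_id"}


def split_item(item):
    """Return (internal, scraped) dicts for one item object."""
    internal = {k: v for k, v in item.items() if k in INTERNAL_FIELDS}
    scraped = {k: v for k, v in item.items() if k not in INTERNAL_FIELDS}
    return internal, scraped


# Hypothetical item, for illustration only.
sample = {"_type": "product", "name": "Widget", "price": "9.99"}
internal, scraped = split_item(sample)
```

Here internal holds only the _type entry, and scraped holds name and price.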

items/:project_id[/:spider_id][/:job_id][/:item_no][/:field_name]

Retrieve or insert items for a project, spider, or job. item_no is the zero-based index of the item within the job.
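The optional path segments nest from left to right. As an illustration, the sketch below builds the URL from whichever segments are given. The helper name items_url is hypothetical, and rejecting a path with a gap (for example an item_no without a job_id) is an assumption drawn from the path pattern rather than documented behaviour.

```python
BASE_URL = "https://storage.scrapinghub.com/items"


def items_url(project_id, spider_id=None, job_id=None, item_no=None, field_name=None):
    """Build an items URL from the leading segments that are provided."""
    segments = [project_id, spider_id, job_id, item_no, field_name]
    path = []
    for segment in segments:
        if segment is None:
            break
        path.append(str(segment))
    # Anything supplied after the first missing segment would leave a gap.
    if any(segment is not None for segment in segments[len(path):]):
        raise ValueError("path segments must be given without gaps")
    return f"{BASE_URL}/{'/'.join(path)}"
```

For example, items_url(53, 34, 7, 0) addresses the first item of job 53/34/7, and items_url(53, 34) addresses all items of spider 34 in project 53.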

Parameter    Description                                                         Required
format       Results format. See Result formats.                                 No
meta         Meta keys to show.                                                  No
nodata       If set, no data will be returned other than specified meta keys.    No

Note

Pagination and meta parameters are also supported. See Pagination and Meta parameters.

Header           Description
Content-Range    Can be used to specify a start index when inserting items.

Method    Description                                            Supported parameters
GET       Retrieve items for a given project, spider, or job.    format, meta, nodata
POST      Insert items for a given job.                          N/A

Note

Always use the pagination parameters (start, startafter, and count) to limit the number of items in a response. Unbounded requests can time out or cause other performance problems. See the pagination examples below for details.

Examples

Retrieve all items from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7

Retrieve the first item from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/0

Retrieve values from a single field

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/1/fieldname

Here 1 is the item_no (index) of the item whose field value is retrieved.

Retrieve all items from a given spider

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34

Retrieve all items from a given project

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/

[Pagination] Retrieve first N items from a given job

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?count=10"

[Pagination] Retrieve N items from a given job starting from the given item

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?count=10&start=53/34/7/20"

[Pagination] Retrieve N items from a given job starting after the given item

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?count=10&startafter=53/34/7/19"

[Pagination] Retrieve a few items from a given job by their index

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?index=5&index=6"
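The count and startafter parameters can be combined into a paging loop that never asks for more than one page at a time. The following is a minimal sketch, not a reference implementation. The names iter_items and fetch_page are hypothetical: fetch_page is a caller-supplied function that performs the HTTP request for a URL and returns the decoded items of that page as a list of dicts. The sketch also assumes that requesting meta=_key adds a _key value (such as "53/34/7/19") to every returned item, which is then passed back as startafter.

```python
from urllib.parse import urlencode

BASE_URL = "https://storage.scrapinghub.com/items"


def iter_items(job_key, fetch_page, page_size=100):
    """Yield all items of a job, requesting at most page_size items per call."""
    startafter = None
    while True:
        params = [("count", page_size), ("meta", "_key")]
        if startafter is not None:
            params.append(("startafter", startafter))
        # Keep "/" unescaped so the key reads the same as in the curl examples.
        url = f"{BASE_URL}/{job_key}?{urlencode(params, safe='/')}"
        page = fetch_page(url)
        if not page:
            return
        yield from page
        # A short page means there is nothing left to fetch.
        if len(page) < page_size:
            return
        # Assumption: each item carries its key because meta=_key was requested.
        startafter = page[-1]["_key"]
```

Passing the fetch function in keeps the paging logic independent of the HTTP library and makes it easy to exercise with canned data.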

Get meta field from items

To get only metadata from items, pass the nodata=1 parameter along with the meta field that you want to get.

HTTP:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/1/7?meta=_key&nodata=1"
{"_key":"53/1/7/0"}
{"_key":"53/1/7/1"}
{"_key":"53/1/7/2"}

Get items in a specific format

Check the available formats in the Result formats section of the API Overview.

JSON:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?meta=_key&nodata=1" -H "Accept: application/json"
[{"_key":"28144/1/1/0"},{"_key":"28144/1/1/1"},{"_key":"28144/1/1/2"}, ...]

JSON Lines:

$ curl -u APIKEY: "https://storage.scrapinghub.com/items/53/34/7?meta=_key&nodata=1" -H "Accept: application/x-jsonlines"
{"_key":"28144/1/1/0"}
{"_key":"28144/1/1/1"}
{"_key":"28144/1/1/2"}
...
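The two formats differ only in framing: JSON returns a single array, while JSON Lines returns one object per line. A client that supports both can decode them as sketched below. The function name parse_items is hypothetical, and the accept argument is simply the Accept header value that was sent with the request.

```python
import json


def parse_items(body, accept):
    """Decode a response body according to the Accept value that was requested."""
    if accept == "application/x-jsonlines":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in body.splitlines() if line.strip()]
    if accept == "application/json":
        # A single JSON array containing every item.
        return json.loads(body)
    raise ValueError(f"unsupported format: {accept}")
```

JSON Lines is usually the better fit for large jobs because each line can be decoded as it arrives, whereas a JSON array must be read in full first.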

Add items to a job via POST

Add the items stored in the file items.jl (JSON lines format) to the job 53/34/7:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7 -X POST -T items.jl

Use the Content-Range header to specify a start index:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7 -X POST -T items.jl -H "content-range: items 500-/*"

The API returns 200 only if the data was stored successfully. There is no limit on the total amount of data you can send, but an HTTP 413 response is returned if any single item is larger than 1M.
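Checking item sizes before uploading avoids a rejected request. The sketch below serializes items to JSON Lines, refuses any item over the limit, and formats the Content-Range value used in the example above. The names encode_items and content_range are hypothetical, and reading "1M" as 1024 * 1024 bytes is an assumption.

```python
import json

# Assumption: the "1M" per-item limit is interpreted as 1024 * 1024 bytes.
MAX_ITEM_BYTES = 1024 * 1024


def encode_items(items):
    """Serialize items as JSON Lines bytes, rejecting any oversized item."""
    lines = []
    for index, item in enumerate(items):
        line = json.dumps(item, separators=(",", ":")).encode("utf-8")
        if len(line) > MAX_ITEM_BYTES:
            raise ValueError(f"item {index} is {len(line)} bytes, over the per-item limit")
        lines.append(line)
    if not lines:
        return b""
    return b"\n".join(lines) + b"\n"


def content_range(start_index):
    """Format the Content-Range header value for a given start index."""
    return f"items {start_index}-/*"
```

The bytes returned by encode_items can be sent as the POST body, with content_range(500) producing the same header value as the curl example.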

items/:project_id/:spider_id/:job_id/stats

Retrieve the item stats for a given job.

Field                  Description
counts[field]          The number of times the field was scraped.
totals.input_bytes     The total size of all items in bytes.
totals.input_values    The total number of items.

Parameter    Description                          Required
all          Include hidden fields in results.    No

Method    Description                                   Supported parameters
GET       Retrieve item stats for the specified job.    all

Example

Get the stats from a given job

HTTP:

$ curl -u APIKEY: https://storage.scrapinghub.com/items/53/34/7/stats
{"counts":{"field1":9350,"field2":514},"totals":{"input_bytes":14390294,"input_values":10000}}
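One use of these stats is to spot fields that are missing from many items. The sketch below divides each field count by the total number of items. The function name field_coverage is hypothetical, and treating the ratio as a coverage fraction assumes a field is counted at most once per item.

```python
import json


def field_coverage(stats):
    """Return {field: fraction of items in which the field was scraped}."""
    total = stats["totals"]["input_values"]
    if total == 0:
        return {}
    return {field: count / total for field, count in stats["counts"].items()}


# The sample response shown above.
stats = json.loads(
    '{"counts":{"field1":9350,"field2":514},'
    '"totals":{"input_bytes":14390294,"input_values":10000}}'
)
coverage = field_coverage(stats)
```

With the sample response, field1 appears in 9350 of 10000 items and field2 in 514 of 10000, so a low ratio such as the one for field2 is a candidate for a closer look at the spider.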