Collections API#
Collections are key-value stores for an arbitrarily large number of records. They are especially useful for storing information produced or used by multiple scraping jobs.
Note
The frontier API is best suited to store queues of URLs to be processed by scraping jobs.
Quickstart#
A collection is identified by a project id, a type, and a name.
A record can be any JSON dictionary. Records are identified by a _key field.
In the following examples, we use project id 78, the regular storage type s, and the collection named my_collection.
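The addressing scheme above (project id, type, name, and optionally a record key) can be sketched as a small URL builder. This is an illustration only: collection_url is a hypothetical helper, not part of any client library, and percent-encoding each path segment is an assumption about how unusual keys should be handled.

```python
from urllib.parse import quote

BASE_URL = "https://storage.scrapinghub.com/collections"


def collection_url(project_id, ctype, name, key=None):
    """Build the URL of a collection, or of one record when key is given.

    Hypothetical helper for illustration; escaping each path segment
    is an assumption, not documented behaviour.
    """
    parts = [str(project_id), ctype, name]
    if key is not None:
        parts.append(key)
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)


print(collection_url(78, "s", "my_collection"))
print(collection_url(78, "s", "my_collection", "foo"))
```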
Note
Avoid using multiple collections with the same name and different types, like /s/my_collection and /cs/my_collection. During operations on an entire collection, like renaming or deleting, Hubstorage will treat homonyms as a single entity and rename or delete both.
Create/Update a record:#
$ curl -u $APIKEY: -X POST -d '{"_key": "foo", "value": "bar"}' \
https://storage.scrapinghub.com/collections/78/s/my_collection
Access a record:#
$ curl -u $APIKEY: -X GET \
https://storage.scrapinghub.com/collections/78/s/my_collection/foo
Delete a record:#
$ curl -u $APIKEY: -X DELETE \
https://storage.scrapinghub.com/collections/78/s/my_collection/foo
List records:#
$ curl -u $APIKEY: -X GET \
https://storage.scrapinghub.com/collections/78/s/my_collection
Create/Update multiple records:#
The jsonline format is used by default (JSON objects separated by newlines):
$ curl -u $APIKEY: -X POST -d $'{"_key": "foo", "value": "bar"}\n{"_key": "goo", "value": "baz"}' \
https://storage.scrapinghub.com/collections/78/s/my_collection
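As a rough sketch, the request body for a multi-record write could be produced from Python dictionaries like this. The helper name to_jsonlines is made up for the example, and rejecting records without a _key field is an assumption based on the statement that records are identified by that field.

```python
import json


def to_jsonlines(records):
    """Serialize records as one JSON object per line.

    Hypothetical helper; the _key check is an assumption drawn from
    the rule that every record is identified by its _key field.
    """
    lines = []
    for record in records:
        if "_key" not in record:
            raise ValueError("every record needs a _key field")
        lines.append(json.dumps(record))
    return "\n".join(lines)


body = to_jsonlines([
    {"_key": "foo", "value": "bar"},
    {"_key": "goo", "value": "baz"},
])
print(body)
```

The resulting string is what the curl example passes to -d.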
Details#
The following collection types are available:
Type | Full name | Hubstorage method | Description |
---|---|---|---|
s | store | new_store | Basic set store |
cs | cached store | new_cached_store | Items expire after a month |
vs | versioned store | new_versioned_store | Up to 3 copies of each item will be retained |
vcs | versioned cache store | new_versioned_cached_store | Multiple copies are retained, and each one expires after a month |
Records are JSON objects, with the following constraints:
- Their serialized size can’t be larger than 1 MB;
- JavaScript’s inf values are not supported;
- Floating-point numbers can’t be larger than 2^64 - 1.
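A client could check these constraints before writing. The sketch below is one possible approach, not the server's actual validation: record_problems is a hypothetical helper, and it assumes that 1 MB means 1024 * 1024 bytes of UTF-8 encoded JSON.

```python
import json
import math

MAX_RECORD_BYTES = 1024 * 1024  # assumption: "1 MB" read as 1 MiB
MAX_FLOAT = 2 ** 64 - 1


def _walk(value):
    """Yield a value and, recursively, everything nested inside it."""
    yield value
    if isinstance(value, dict):
        for child in value.values():
            yield from _walk(child)
    elif isinstance(value, list):
        for child in value:
            yield from _walk(child)


def record_problems(record):
    """Return a list of constraint violations (empty if none found)."""
    problems = []
    for value in _walk(record):
        if isinstance(value, float):
            if math.isinf(value):
                problems.append("infinite value")
            elif value > MAX_FLOAT:
                problems.append("float larger than 2^64 - 1")
    size = len(json.dumps(record).encode("utf-8"))
    if size > MAX_RECORD_BYTES:
        problems.append("serialized size over limit")
    return problems


print(record_problems({"_key": "foo", "value": "bar"}))
print(record_problems({"_key": "foo", "value": float("inf")}))
```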
API#
collections/:project_id/list#
List all collections.
$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/list
{"type":"s","name":"my_collection"}
{"type":"s","name":"my_collection_2"}
{"type":"cs","name":"my_other_collection"}
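The listing is returned as one JSON object per line, so it is straightforward to parse. The sketch below also flags collection names that appear under more than one type, which the note in the Quickstart warns against. Both function names are hypothetical, and the sample text adds an invented fourth entry to trigger the check.

```python
import json
from collections import defaultdict


def parse_listing(text):
    """Parse the line-delimited JSON returned by the list endpoint."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_homonyms(entries):
    """Map each name used by several collection types to those types."""
    types_by_name = defaultdict(set)
    for entry in entries:
        types_by_name[entry["name"]].add(entry["type"])
    return {
        name: sorted(types)
        for name, types in types_by_name.items()
        if len(types) > 1
    }


# Sample response; the last line is invented to show a homonym.
listing = """\
{"type":"s","name":"my_collection"}
{"type":"s","name":"my_collection_2"}
{"type":"cs","name":"my_other_collection"}
{"type":"cs","name":"my_collection"}
"""

entries = parse_listing(listing)
print(find_homonyms(entries))
```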
collections/:project_id/:type/:collection#
Read, write or remove items in a collection.
Parameter | Description | Required |
---|---|---|
key | Read items with a specified key. Multiple values are supported. | No |
prefix | Read items with a specified key prefix. | No |
prefixcount | Maximum number of values to return per prefix. | No |
startts | UNIX timestamp at which to begin results, in milliseconds. | No |
endts | UNIX timestamp at which to end results, in milliseconds. | No |
Method | Description | Supported parameters |
---|---|---|
GET | Read items from the specified collection. | key, prefix, prefixcount, startts, endts |
POST | Write items to the specified collection. | |
DELETE | Delete items from the specified collection. | key, prefix, prefixcount, startts, endts |
Note
Pagination and meta parameters are supported, see Pagination and Meta parameters.
GET examples:
$ curl -u $APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?key=foo1&key=foo2"
{"value":"bar1"}
{"value":"bar2"}
$ curl -u $APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?prefix=f"
{"value":"bar"}
$ curl -u $APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?startts=1402699941000&endts=1403039369570"
{"value":"bar"}
Prefix filters, unlike other filters, use indexes and should be used when possible. You can use the prefixcount parameter to limit the number of values returned for each prefix.
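The query strings for key and prefix filters can be assembled with the standard library. This is a minimal sketch: read_query is a hypothetical helper, and it only covers the key, prefix and prefixcount parameters from the table above.

```python
from urllib.parse import urlencode


def read_query(keys=(), prefix=None, prefixcount=None):
    """Build a query string for reading items by key or by prefix.

    Hypothetical helper; repeated key parameters follow the
    "multiple values are supported" rule for the key filter.
    """
    params = [("key", key) for key in keys]
    if prefix is not None:
        params.append(("prefix", prefix))
    if prefixcount is not None:
        params.append(("prefixcount", prefixcount))
    return urlencode(params)


print(read_query(keys=["foo1", "foo2"]))
print(read_query(prefix="f", prefixcount=10))
```

Appending either result after a ? to the collection URL gives requests of the same shape as the GET examples.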
A common pattern is to download changes within a certain time period. You can use the startts and endts parameters to select records within a certain time window.
The current timestamp can be retrieved like so:
$ curl https://storage.scrapinghub.com/system/ts
1403039369570
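Because both filters take millisecond timestamps, a window such as "the last 24 hours" is simple arithmetic on the value returned above. The following sketch assumes the timestamp has already been fetched; window_ending_at is a hypothetical helper.

```python
from urllib.parse import urlencode

MS_PER_HOUR = 3600 * 1000


def window_ending_at(now_ms, hours):
    """Return startts/endts covering the given number of hours up to now_ms."""
    return {"startts": now_ms - hours * MS_PER_HOUR, "endts": now_ms}


# 1403039369570 is the sample timestamp shown above.
params = window_ending_at(1403039369570, 24)
print(urlencode(params))
```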
Note
Timestamp filters may perform poorly when selecting a small number of records from a large collection.
collections/:project_id/:type/:collection/count#
Count the number of items in a collection.
$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/count
{"count":972,"scanned":972}
If the collection is large, the result may contain a nextstart field that is used for pagination; see Pagination.
collections/:project_id/:type/:collection/:item#
Read, write or delete an individual item.
Method | Description |
---|---|
GET | Read the item with the given key |
POST | Write the item with the given key |
DELETE | Delete the item with the given key |
$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/foo
{"value":"bar"}
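The curl option -u $APIKEY: sends the API key as the HTTP Basic username with an empty password. As an illustration, the same request can be prepared with Python's standard library. The helper item_request is hypothetical, the key "APIKEY" is a placeholder, and the request is only built here, not sent.

```python
import base64
import urllib.request


def item_request(api_key, url, method="GET", body=None):
    """Prepare a request authenticated like `curl -u $APIKEY:`.

    Hypothetical helper: the API key is the Basic auth username
    and the password is left empty.
    """
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    headers = {"Authorization": f"Basic {token}"}
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers, method=method)


req = item_request(
    "APIKEY",  # placeholder, replace with a real key
    "https://storage.scrapinghub.com/collections/78/s/my_collection/foo",
)
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen would perform the call.
```

Passing method="DELETE", or method="POST" with a JSON body, covers the other two rows of the table.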
collections/:project_id/:type/:collection/:item/value#
Read an individual item value.
$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/foo/value
bar
collections/:project_id/:type/:collection/deleted#
POST a list of item keys to this endpoint to delete them.
Note
This endpoint is designed to delete a large number of non-consecutive items. To delete consecutive items, use DELETE-based endpoints, which are faster.
$ curl -u $APIKEY: -X POST -d '"foo"' -d '"bar"' \
https://storage.scrapinghub.com/collections/78/s/my_collection/deleted
collections/:project_id/delete?name=:collection#
Delete an entire collection immediately.
$ curl -u $APIKEY: -X POST "https://storage.scrapinghub.com/collections/78/delete?name=my_collection"
collections/:project_id/rename?name=:collection&new_name=:new_name#
Rename a collection and move all its items immediately.
$ curl -u $APIKEY: -X POST "https://storage.scrapinghub.com/collections/78/rename?name=my_collection&new_name=my_collection_renamed"