Frontier API#
The Hub Crawl Frontier (HCF) stores the pages already visited and the outstanding requests still to be made. It can be thought of as persistent shared storage for a crawl scheduler.
Web pages are identified by a fingerprint. This can be the URL of the page, but a crawler may use any other string (e.g., a hash of POST parameters, if it processes POST requests), so there is no requirement for the fingerprint to be a valid URL.
A project can have many frontiers and each frontier is broken down into slots. A separate priority queue is maintained per slot. This means that requests from each slot can be prioritized separately and crawled at different rates and at different times.
Arbitrary data can be stored both in the crawl queue and alongside the set of fingerprints.
A typical example is to use the URL as the fingerprint and the hostname as the slot. The crawler should ensure that each host is crawled by only one process at any given time so that politeness can be maintained.
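For illustration, here is a minimal Python sketch of that convention, deriving the fingerprint from the URL path and the slot from the hostname. The helper name is our own, not part of the API.

Python:
from urllib.parse import urlsplit

def fingerprint_and_slot(url):
    # Use the URL path as the fingerprint and the hostname as the
    # slot, so that all requests for a host share one priority queue.
    parts = urlsplit(url)
    return parts.path or "/", parts.hostname

# ("/some/path.html", "example.com")
print(fingerprint_and_slot("http://example.com/some/path.html"))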
Note
Most of the features provided by the API are also available through the python-scrapinghub client library.
Batch object#
| Field | Description |
|---|---|
| id | Batch ID. |
| requests | An array of request objects. |
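As the /q examples later in this document show, batches are returned as JSON lines, with each request in a batch serialized as a [fingerprint, qdata] pair. A short Python sketch of parsing one such line (the line itself is copied from the example below):

Python:
import json

line = '{"id":"00013967d8af7b0001","requests":[["/",null]]}'
batch = json.loads(line)
print(batch["id"])            # 00013967d8af7b0001
for fp, qdata in batch["requests"]:
    print(fp, qdata)          # / None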
Request object#
| Field | Description | Required |
|---|---|---|
| fp | Request fingerprint. | Yes |
| qdata | Data to be stored along with the fingerprint in the request queue. | No |
| fdata | Data to be stored along with the fingerprint in the fingerprint set. | No |
| p | Priority: lower priority numbers are returned first. Defaults to 0. | No |
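Request objects are posted as newline-delimited JSON, one object per line, as the curl examples below illustrate. A minimal Python sketch of building such a payload (field values are taken from those examples):

Python:
import json

to_enqueue = [
    {"fp": "/"},
    {"fp": "page1.html", "p": 1, "qdata": {"depth": 1}},
]

# One JSON object per line (newline-delimited JSON).
payload = "\n".join(json.dumps(r) for r in to_enqueue)
print(payload)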
/hcf/:project_id/:frontier/s/:slot#
A successful POST returns the following field:

| Field | Description |
|---|---|
| newcount | The number of new requests that have been added. |

| Method | Description | Supported parameters |
|---|---|---|
| POST | Enqueues a request in the specified slot. | fp, qdata, fdata, p |
| DELETE | Deletes the specified slot. | |
POST examples#
Add a request to the frontier
HTTP:
$ curl -u API_KEY: -d '{"fp":"/some/path.html"}' \
https://storage.scrapinghub.com/hcf/78/test/s/example.com
{"newcount":1}
Add requests with additional parameters
By setting each request's priority to its crawl depth, the website can be traversed in breadth-first order from the starting URL.
HTTP:
$ curl -u API_KEY: -d $'{"fp":"/"}\n{"fp":"page1.html", "p": 1, "qdata": {"depth": 1}}' \
https://storage.scrapinghub.com/hcf/78/test/s/example.com
{"newcount":2}
DELETE example#
The example below deletes the slot example.com from the frontier.
HTTP:
$ curl -u API_KEY: -X DELETE https://storage.scrapinghub.com/hcf/78/test/s/example.com/
/hcf/:project_id/:frontier/s/:slot/q#
Retrieve requests for a given slot.
| Parameter | Description | Required |
|---|---|---|
| mincount | The minimum number of requests to retrieve. | No |
HTTP:
$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/s/example.com/q
{"id":"00013967d8af7b0001","requests":[["/",null]]}
{"id":"01013967d8af7e0001","requests":[["page1.html",{"depth":1}]]}
/hcf/:project_id/:frontier/s/:slot/q/deleted#
Delete a batch of requests.
Once a batch has been processed, clients should indicate that the batch is completed so that it will be removed and no longer returned when new batches are requested.
This can be achieved by posting the IDs of the completed batches:
$ curl -u API_KEY: -d '"00013967d8af7b0001"' https://storage.scrapinghub.com/hcf/78/test/s/example.com/q/deleted
You can specify the IDs as arrays or as single values. As in the previous examples, multiple lines of input are accepted.
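Putting the two queue endpoints together, a typical consumer reads a batch, processes its requests, and then acknowledges the batch ID so it is not returned again. A minimal sketch using the Python requests library (URL and authentication follow the examples above; the processing step is a stub):

Python:
import json
import requests

API_KEY = "..."  # placeholder: your API key
SLOT_URL = "https://storage.scrapinghub.com/hcf/78/test/s/example.com"

def process(fp, qdata):
    ...  # stub: fetch and handle the page identified by fp

def consume_queue():
    # Each line of the /q response is one batch.
    resp = requests.get(SLOT_URL + "/q", auth=(API_KEY, ""))
    for line in resp.text.splitlines():
        batch = json.loads(line)
        for fp, qdata in batch["requests"]:
            process(fp, qdata)
        # Acknowledge the completed batch.
        requests.post(SLOT_URL + "/q/deleted",
                      data=json.dumps(batch["id"]),
                      auth=(API_KEY, ""))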
/hcf/:project_id/:frontier/s/:slot/f#
Retrieve fingerprints for a given slot.
Example#
HTTP:
$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/s/example.com/f
{"fp":"/"}
{"fp":"page1.html"}
Results are ordered lexicographically by fingerprint value.
/hcf/:project_id/list#
Lists the frontiers for a given project.
Example#
HTTP:
$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/list
["test"]
/hcf/:project_id/:frontier/list#
Lists the slots for a given frontier.
Example#
HTTP:
$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/list
["example.com"]