Frontier API

The Hub Crawl Frontier (HCF) stores the pages that have been visited and the requests that are still outstanding. It can be thought of as persistent shared storage for a crawl scheduler.

Web pages are identified by a fingerprint. This can be the URL of the page, but crawlers may use any other string (e.g. a hash of the POST parameters, if they process POST requests), so there is no requirement for the fingerprint to be a valid URL.

A project can have many frontiers and each frontier is broken down into slots. A separate priority queue is maintained per slot. This means that requests from each slot can be prioritized separately and crawled at different rates and at different times.
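To illustrate the idea of one priority queue per slot, here is a toy in-memory model in Python. It is only a sketch for reasoning about ordering and is not how the service is implemented; the class name `LocalFrontier` and its methods are hypothetical, and keeping insertion order for equal priorities is an assumption made for the example.

```python
import heapq
from collections import defaultdict


class LocalFrontier:
    """Toy in-memory model (hypothetical): one priority queue per slot."""

    def __init__(self):
        self._queues = defaultdict(list)
        # Tie-breaker so equal priorities come out in insertion order
        # (an assumption of this sketch, not documented HCF behaviour).
        self._counter = 0

    def add(self, slot, fp, priority=0):
        """Queue a fingerprint in the given slot."""
        heapq.heappush(self._queues[slot], (priority, self._counter, fp))
        self._counter += 1

    def pop(self, slot):
        """Return the fingerprint with the lowest priority number in the slot."""
        _priority, _order, fp = heapq.heappop(self._queues[slot])
        return fp


if __name__ == "__main__":
    frontier = LocalFrontier()
    frontier.add("example.com", "page1.html", priority=1)
    frontier.add("example.com", "/", priority=0)
    frontier.add("other.org", "/x", priority=5)
    # Each slot is drained independently of the others.
    print(frontier.pop("example.com"))
    print(frontier.pop("other.org"))
```

Because every slot has its own heap, a high priority number in one slot never delays requests queued in another.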

Arbitrary data can be stored in both the crawl queue and with the set of fingerprints.

A typical example would be to use the URL as a fingerprint and the hostname as a slot. The crawler should ensure that each host is only crawled from one process at any given time so that politeness can be maintained.
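As a minimal sketch of this convention, the Python helper below derives a slot and a fingerprint from a URL using only the standard library. The function name `slot_and_fingerprint` is hypothetical and not part of the HCF API.

```python
from urllib.parse import urlsplit


def slot_and_fingerprint(url):
    """Return (slot, fingerprint): the hostname as slot, the URL as fingerprint."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"URL has no hostname: {url!r}")
    # urlsplit lowercases the hostname, so "Example.com" and "example.com"
    # end up in the same slot.
    return parts.hostname, url


if __name__ == "__main__":
    print(slot_and_fingerprint("https://Example.com/some/path.html"))
```

Routing every URL of a host to the same slot makes it straightforward to assign each slot to a single crawler process.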

Note

Most of the features provided by the API are also available through the python-scrapinghub client library.

Batch object

Field     Description
id        Batch ID.
requests  An array of request objects.

Request object

Field  Description                                                           Required
fp     Request fingerprint.                                                  Yes
qdata  Data to be stored along with the fingerprint in the request queue.    No
fdata  Data to be stored along with the fingerprint in the fingerprint set.  No
p      Priority: lower priority numbers are returned first. Defaults to 0.   No

/hcf/:project_id/:frontier/s/:slot

Field     Description
newcount  The number of new requests that have been added.

Method  Description                                Supported parameters
POST    Enqueues a request in the specified slot.  fp, qdata, fdata, p
DELETE  Deletes the specified slot.

POST examples

Add a request to the frontier

HTTP:

$ curl -u API_KEY: -d '{"fp":"/some/path.html"}'  \
    https://storage.scrapinghub.com/hcf/78/test/s/example.com
{"newcount":1}

Add requests with additional parameters

Several requests can be sent in one call, one JSON object per line. By setting each request's priority to its depth, the website can be traversed in breadth-first order from the starting URL.

HTTP:

$ curl -u API_KEY: -d $'{"fp":"/"}\n{"fp":"page1.html", "p": 1, "qdata": {"depth": 1}}' \
    https://storage.scrapinghub.com/hcf/78/test/s/example.com
{"newcount":2}
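The request body used in the breadth-first example above can also be generated programmatically. The following Python sketch serializes (fingerprint, depth) pairs as one JSON object per line, using the depth as the priority; the function name `build_request_lines` is hypothetical, and the sketch only builds the body without sending it.

```python
import json


def build_request_lines(pages):
    """Serialize (fingerprint, depth) pairs as one JSON object per line."""
    lines = []
    for fp, depth in pages:
        request = {"fp": fp}
        if depth > 0:
            # Priority equals depth, so shallower pages are returned first.
            request["p"] = depth
            request["qdata"] = {"depth": depth}
        lines.append(json.dumps(request))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_request_lines([("/", 0), ("page1.html", 1)]))
```

The resulting string can be passed as the POST body to the slot endpoint, for example through curl's -d option.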

DELETE example

The example below deletes the slot example.com from the frontier.

HTTP:

$ curl -u API_KEY: -X DELETE https://storage.scrapinghub.com/hcf/78/test/s/example.com/

/hcf/:project_id/:frontier/s/:slot/q

Retrieve requests for a given slot.

Parameter  Description                                  Required
mincount   The minimum number of requests to retrieve.  No

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/s/example.com/q
{"id":"00013967d8af7b0001","requests":[["/",null]]}
{"id":"01013967d8af7e0001","requests":[["page1.html",{"depth":1}]]}
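A client has to split this response into batches before processing it. The Python sketch below parses output shaped like the example above, assuming one batch object per line and each request encoded as a two-element array of fingerprint and qdata; the function name `parse_batches` is hypothetical.

```python
import json


def parse_batches(response_text):
    """Parse a newline-delimited JSON response into (batch_id, requests) pairs."""
    batches = []
    for line in response_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        batch = json.loads(line)
        # Each request is assumed to be a two-element array: [fingerprint, qdata].
        requests = [(fp, qdata) for fp, qdata in batch["requests"]]
        batches.append((batch["id"], requests))
    return batches


if __name__ == "__main__":
    sample = (
        '{"id":"00013967d8af7b0001","requests":[["/",null]]}\n'
        '{"id":"01013967d8af7e0001","requests":[["page1.html",{"depth":1}]]}\n'
    )
    for batch_id, requests in parse_batches(sample):
        print(batch_id, requests)
```

Keeping the batch ID next to its requests makes it easy to report the batch as completed once all of its requests have been processed.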

/hcf/:project_id/:frontier/s/:slot/q/deleted

Delete a batch of requests.

Once a batch has been processed, clients should indicate that the batch is completed so that it will be removed and no longer returned when new batches are requested.

This can be achieved by posting the IDs of the completed batches:

$ curl -u API_KEY: -d '"00013967d8af7b0001"' https://storage.scrapinghub.com/hcf/78/test/s/example.com/q/deleted

You can specify the IDs as arrays or as single values. As with the previous examples, multiple lines of input are accepted.
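As a small sketch of the array form, the Python helper below groups completed batch IDs into JSON arrays, one array per line. The function name `build_delete_body` and the `per_line` chunk size are hypothetical choices for the example, not limits defined by the API, and the sketch only builds the body without sending it.

```python
import json


def build_delete_body(batch_ids, per_line=100):
    """Serialize batch IDs as JSON arrays, one array per line."""
    ids = list(batch_ids)
    # Split the IDs into chunks so that no single line grows without bound.
    chunks = [ids[i:i + per_line] for i in range(0, len(ids), per_line)]
    return "\n".join(json.dumps(chunk) for chunk in chunks)


if __name__ == "__main__":
    print(build_delete_body(["00013967d8af7b0001", "01013967d8af7e0001"]))
```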

/hcf/:project_id/:frontier/s/:slot/f

Retrieve fingerprints for a given slot.

Example

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/s/example.com/f
{"fp":"/"}
{"fp":"page1.html"}

Results are ordered lexicographically by fingerprint value.

/hcf/:project_id/list

Lists the frontiers for a given project.

Example

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/list
["test"]

/hcf/:project_id/:frontier/list

Lists the slots for a given frontier.

Example

HTTP:

$ curl -u API_KEY: https://storage.scrapinghub.com/hcf/78/test/list
["example.com"]