Zyte Automatic Extraction will be discontinued on April 30th, 2024. It is replaced by Zyte API. See Migrating from Automatic Extraction to Zyte API.


The sections below cover our recommended ways to integrate Automatic Extraction.


In all of the examples below, replace the string ‘[api key]’ with your unique API key.

Using cURL

Here is an example of how to query the Automatic Extraction API for the product page type using cURL:

curl --verbose \
    --user [api key]: \
    --header 'Content-Type: application/json' \
    --data '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]' \
    --max-time 605 \
    --compressed \
    https://autoextract.scrapinghub.com/v1/extract

Using requests Python library

Here is a simple example of how to query Automatic Extraction in Python with the requests library. However, we recommend using the zyte-autoextract client instead.

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'pageType': 'product'}])
results = response.json()

Command line/Python integration

If you want to query Automatic Extraction from the command line or from Python, consider the zyte-autoextract client library, which makes using the API easier.

This package provides a command-line utility, an asyncio-based library, and a simple synchronous wrapper.

Here is an example of how to query the product page type using the client from the command line:

python -m autoextract \
    urls.txt \
    --api-key [api key] \
    --page-type product \
    --output res.jl

where urls.txt is a text file with one URL to query per line, and res.jl is the output JSON-lines file where results will be written.
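Each line of res.jl holds one JSON document. As a minimal sketch of consuming that output (the stand-in file written here is purely illustrative; in practice res.jl is produced by the command above):

```python
import json

# Illustrative stand-in for res.jl; normally this file is produced
# by the autoextract command shown above.
with open('res.jl', 'w') as f:
    f.write(json.dumps({'query': {'userQuery': {'pageType': 'product'}}}) + '\n')

# Read the JSON-lines output back: one JSON document per line.
with open('res.jl') as f:
    results = [json.loads(line) for line in f]

print(len(results))
```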

If you prefer to use Python, then synchronously querying for the product page type is as simple as follows:

from autoextract.sync import request_raw

query = [{
    'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'pageType': 'product'
}]
results = request_raw(query, api_key='[api key]')

where request_raw returns the results as a list of dictionaries.
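Since the response is a plain list of dictionaries, splitting successful extractions from failures is straightforward. A sketch with hand-made data (the shapes mirror API responses, where each item carries either a "product" payload or an "error" message, but the values here are made up):

```python
# Hand-made results shaped like AutoExtract responses: each item has
# either a "product" payload or an "error" message (values are made up).
results = [
    {'product': {'name': 'A Light in the Attic'}},
    {'error': 'Downloader error: http404'},
]

products = [r['product'] for r in results if 'product' in r]
errors = [r['error'] for r in results if 'error' in r]

print(len(products), len(errors))
```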

It is also possible to query Automatic Extraction asynchronously using an asyncio event loop:

from autoextract.aio import request_raw

async def foo():
    query = [{
        'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'pageType': 'product'
    }]
    results = await request_raw(query, api_key='[api key]')
    # ...

If you have many URLs to send, you might want to send them with some level of concurrency to achieve a good throughput. The following example shows how to do so:

import asyncio
from autoextract.aio import request_parallel_as_completed, create_session, \
    RequestError
from autoextract import ProductRequest

async def extract_from(urls):
    requests = [ProductRequest(url) for url in urls]
    responses = []
    async with create_session() as session:
        for res_iter in request_parallel_as_completed(
                requests, api_key='[api key]', session=session):
            try:
                responses.extend(await res_iter)
            except RequestError as e:
                # Do something with the error
                raise
    return responses

urls = ["http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"]
responses = asyncio.run(extract_from(urls))
# ...

More detailed information about installation, usage, and the package in general can be found in the zyte-autoextract documentation.

Scrapy integration

If you want to integrate Automatic Extraction queries into your Scrapy spider, consider scrapy-autoextract. It lets you consume the Automatic Extraction API through a Scrapy middleware or through Page Object providers.
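As a rough sketch of the middleware-based setup (the setting and meta key names below are taken from the scrapy-autoextract README; treat them as assumptions and verify them against the library's documentation):

```python
# settings.py -- enable the AutoExtract downloader middleware
# (names assumed from the scrapy-autoextract README; verify before use)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}
AUTOEXTRACT_USER = '[api key]'       # Automatic Extraction API key
AUTOEXTRACT_PAGE_TYPE = 'product'    # page type to request

# In a spider, a request opts in via its meta, and the extracted data
# comes back in response.meta, e.g.:
#   yield scrapy.Request(url, meta={'autoextract': {'enabled': True}})
#   data = response.meta['autoextract']
```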

To learn more about the library, please check scrapy-autoextract documentation.

Node.js integration

Here is an example of how to use Automatic Extraction in JavaScript with Node.js:

const https = require('https');

const data = JSON.stringify([{
    'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'pageType': 'product',
}]);

const options = {
    host: 'autoextract.scrapinghub.com',
    path: '/v1/extract',
    method: 'POST',
    headers: {
      'Authorization': 'Basic ' + Buffer.from('[api key]:').toString('base64'),
      'Content-Type': 'application/json',
      'Content-Length': data.length
    }
};

const req = https.request(options, res => {
    console.log(`statusCode: ${res.statusCode}`)
    res.on('data', d => {
        process.stdout.write(d)
    })
})

req.on('error', error => {
    console.error(error)
})

req.write(data)
req.end()

PHP integration

Here is an example of how to use Automatic Extraction in PHP with the cURL extension:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://autoextract.scrapinghub.com/v1/extract');
curl_setopt($ch, CURLOPT_USERPWD, '[api key]:');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 605000);
curl_setopt($ch, CURLOPT_POSTFIELDS, '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
// $output contains the result
$output = curl_exec($ch);
curl_close($ch);

Java integration

Here is an example of how to use Automatic Extraction in Java, requesting a single product extraction. Place it in a Main.java file:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Base64;

public class Main {

    private static final String API_KEY_ENVIRONMENT_VARIABLE_NAME = "AUTOEXTRACT_API_KEY";
    private static final String AUTOEXTRACT_PAGE_TYPE = "product";
    private static final String AUTOEXTRACT_URL = "https://autoextract.scrapinghub.com/v1/extract";

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.err.println("No URL specified");
            return;
        }

        try {
            URL url = new URL(args[0]);

            String apiKey = System.getenv(API_KEY_ENVIRONMENT_VARIABLE_NAME);

            if (apiKey == null) {
                System.err.println(String.format(
                    "No API key specified, please set environment variable %s",
                    API_KEY_ENVIRONMENT_VARIABLE_NAME));
                return;
            }

            String extractedData = fetchExtractedData(url, apiKey);
            System.out.println(extractedData);
        } catch (MalformedURLException e) {
            System.err.println("Invalid URL");
        } catch (IOException e) {
            System.err.println(String.format("Something went wrong: %s", e.getMessage()));
        }
    }

    private static String fetchExtractedData(URL url, String apiKey) throws IOException {
        URL autoExtractUrl = new URL(AUTOEXTRACT_URL);
        HttpURLConnection connection = (HttpURLConnection) autoExtractUrl.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setRequestProperty("Authorization",
            String.format("Basic %s", Base64.getEncoder().encodeToString(String.format("%s:", apiKey).getBytes())));
        connection.setRequestProperty("Content-Type", "application/json");
        String payload = String.format(
            "[{\"url\": \"%s\", \"pageType\": \"%s\"}]", url, AUTOEXTRACT_PAGE_TYPE);
        connection.setRequestProperty("Content-Length", Integer.toString(payload.length()));
        try (OutputStream out = connection.getOutputStream()) {
            out.write(payload.getBytes());
        }

        StringBuffer response = new StringBuffer();
        BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String inLine;

        while ((inLine = in.readLine()) != null) {
            response.append(inLine);
        }
        in.close();

        return response.toString();
    }
}

This expects an API key in the AUTOEXTRACT_API_KEY environment variable. Example usage:

$ javac Main.java
$ AUTOEXTRACT_API_KEY="your-key-here" java Main 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'