Integrations¶
Our recommendations for integrating Automatic Extraction:
use scrapinghub-autoextract if you want to extract multiple URLs from the command line;
use scrapinghub-autoextract if you want to use the Automatic Extraction API from Python;
use scrapy-autoextract if you want to use the Automatic Extraction API from a Scrapy spider;
check out the sample code in Node.js if you want to query the Automatic Extraction API in JavaScript with Node.js;
check out the sample code in PHP if you want to query the Automatic Extraction API in PHP with cURL;
check out the sample code in Java if you want to query the Automatic Extraction API in Java.
Note
In all of the examples, you will need to replace the string ‘[api key]’ with your unique key.
Using cURL¶
Here is an example of how to query the Automatic Extraction API for the product page type using cURL:
curl --verbose \
--user [api key]: \
--header 'Content-Type: application/json' \
--data '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]' \
--max-time 605 \
--compressed \
https://autoextract.scrapinghub.com/v1/extract
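Note the trailing colon in the --user argument: the request uses HTTP Basic authentication with the API key as the username and an empty password. The generous --max-time value leaves room for extractions that take long on the server side.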
Using the requests Python library¶
Here is a simple example of how to query Automatic Extraction from Python with the requests library. However, we recommend using the scrapinghub-autoextract client for this.
import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    # HTTP Basic auth: the API key is the username, the password is empty
    auth=('[api key]', ''),
    json=[{'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
           'pageType': 'product'}])
results = response.json()
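Since extraction can take a while, and authentication problems are common during setup, it may be worth adding a request timeout and failing fast on HTTP errors. A small optional variation on the snippet above:
response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
           'pageType': 'product'}],
    timeout=605)  # allow for long extractions, mirroring --max-time in the cURL example
response.raise_for_status()  # e.g. raises on 401 if the API key is wrong
results = response.json()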
Command line/Python integration¶
If you want to query Automatic Extraction from the command line or from Python, consider the scrapinghub-autoextract client library, which makes using the API easier. The package provides a command-line utility, an asyncio-based library, and a simple synchronous wrapper.
Here is an example of how to query the product page type using the client from the command line:
python -m autoextract \
urls.txt \
--api-key [api key] \
--page-type product \
--output res.jl
where urls.txt is a text file with the URLs to query, one per line, and res.jl is the output JSON Lines file where the results will be written.
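For example, a hypothetical urls.txt could look like this:
http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html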
If you prefer to use Python, then synchronously querying for the product page type is as simple as follows:
from autoextract.sync import request_raw

query = [{
    'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'pageType': 'product'
}]
results = request_raw(query, api_key='[api key]')
where request_raw returns the results as a list of dictionaries.
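For example, assuming the response layout described in the API documentation, where each item holds the extracted data under a key named after the requested page type, you could read fields like this (the 'name' field is illustrative):
for item in results:
    product = item.get('product')  # data is keyed by the requested pageType
    if product:
        print(product.get('name'))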
It is also possible to query Automatic Extraction asynchronously, using an asyncio event loop:
from autoextract.aio import request_raw

async def foo():
    query = [{
        'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'pageType': 'product'
    }]
    results = await request_raw(query)
    # ...
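To actually run the coroutine above, you can use the standard asyncio entry point (Python 3.7+):
import asyncio

asyncio.run(foo())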
More detailed information about usage, installation, and the package in general can be found in the scrapinghub-autoextract documentation.
Scrapy integration¶
If you want to integrate Automatic Extraction queries into your Scrapy spider, consider scrapy-autoextract. It lets you consume the Automatic Extraction API through a Scrapy middleware or through Page Object providers.
To learn more about the library, please check the scrapy-autoextract documentation.
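As a rough sketch of the middleware route, the settings below follow the scrapy-autoextract README; treat the exact names as assumptions and verify them against that documentation:
# settings.py (sketch; verify names against the scrapy-autoextract docs)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}
AUTOEXTRACT_USER = '[api key]'      # Automatic Extraction API key
AUTOEXTRACT_PAGE_TYPE = 'product'   # page type to request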
Node.js integration¶
Here is an example of how to use Automatic Extraction in JavaScript with Node.js:
const https = require('https');

const data = JSON.stringify([{
  'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'pageType': 'product',
}]);

const options = {
  host: 'autoextract.scrapinghub.com',
  path: '/v1/extract',
  headers: {
    // HTTP Basic auth: the API key is the username, the password is empty
    'Authorization': 'Basic ' + Buffer.from('[api key]:').toString('base64'),
    'Content-Type': 'application/json',
    // Use the byte length rather than the string length, in case the
    // payload ever contains multi-byte characters
    'Content-Length': Buffer.byteLength(data),
  },
  method: 'POST',
};

const req = https.request(options, res => {
  console.log(`statusCode: ${res.statusCode}`);
  res.on('data', d => {
    process.stdout.write(d);
  });
});

req.on('error', error => {
  console.error(error);
});

req.write(data);
req.end();
PHP integration¶
Here is an example of how to use Automatic Extraction in PHP with the cURL library:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://autoextract.scrapinghub.com/v1/extract');
curl_setopt($ch, CURLOPT_USERPWD, '[api key]:');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 605000);
curl_setopt($ch, CURLOPT_POSTFIELDS, '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
// $output contains the result
$output = curl_exec($ch);
curl_close($ch);
?>
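The result in $output is a JSON string; pass it through json_decode($output, true) to turn it into a PHP array.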
Java integration¶
Here is an example of how to use Automatic Extraction in Java, requesting a single product extraction, to be placed in a Main.java file:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Base64;

public class Main {
    private static final String API_KEY_ENVIRONMENT_VARIABLE_NAME = "AUTOEXTRACT_API_KEY";
    private static final String AUTOEXTRACT_PAGE_TYPE = "product";
    private static final String AUTOEXTRACT_URL = "https://autoextract.scrapinghub.com/v1/extract";

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("No URL specified");
            System.exit(1);
        }
        try {
            URL url = new URL(args[0]);
            String apiKey = System.getenv(API_KEY_ENVIRONMENT_VARIABLE_NAME);
            if (apiKey == null) {
                System.err.println(String.format(
                        "No API key specified, please set environment variable %s",
                        API_KEY_ENVIRONMENT_VARIABLE_NAME));
                System.exit(1);
            }
            String extractedData = fetchExtractedData(url, apiKey);
            System.out.println(extractedData);
        } catch (MalformedURLException e) {
            System.err.println("Invalid URL");
            System.exit(1);
        } catch (IOException e) {
            System.err.println(String.format("Something went wrong: %s", e.getMessage()));
            System.exit(1);
        }
    }

    private static String fetchExtractedData(URL url, String apiKey) throws IOException {
        URL autoExtractUrl = new URL(AUTOEXTRACT_URL);
        HttpURLConnection connection = (HttpURLConnection) autoExtractUrl.openConnection();
        connection.setRequestMethod("POST");
        // HTTP Basic auth: the API key is the username, the password is empty.
        connection.setRequestProperty(
                "Authorization",
                String.format("Basic %s", Base64.getEncoder().encodeToString(String.format("%s:", apiKey).getBytes())));
        connection.setRequestProperty("Content-Type", "application/json");
        String payload = String.format(
                "[{\"url\": \"%s\", \"pageType\": \"%s\"}]", url, AUTOEXTRACT_PAGE_TYPE);
        connection.setDoOutput(true);
        // Use the byte length of the payload rather than its character count.
        connection.setRequestProperty("Content-Length", Integer.toString(payload.getBytes().length));
        connection.getOutputStream().write(payload.getBytes());
        StringBuilder response = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String inLine;
        while ((inLine = in.readLine()) != null) {
            response.append(inLine);
        }
        in.close();
        return response.toString();
    }
}
This needs an API key in the AUTOEXTRACT_API_KEY environment variable; example usage:
$ javac Main.java
$ AUTOEXTRACT_API_KEY="your-key-here" java Main 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'