Warning

Zyte Automatic Extraction will be discontinued starting April 30th, 2024. It is replaced by Zyte API. See Migrating from Automatic Extraction to Zyte API.

Integrations#

Our recommendations for integrating Automatic Extraction:

Note

In all of the examples, replace the string ‘[api key]’ with your unique API key.

Using cURL#

Here is an example of how to query the Automatic Extraction API for the product page type using cURL:

curl --verbose \
    --user [api key]: \
    --header 'Content-Type: application/json' \
    --data '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]' \
    --max-time 605 \
    --compressed \
    https://autoextract.scrapinghub.com/v1/extract

Using requests Python library#

Here is a simple example of how to query Automatic Extraction in Python with the requests library. However, we recommend using the zyte-autoextract client for this.

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'pageType': 'product'}])
results = response.json()
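
Request-level problems (for example, an invalid API key) surface through the HTTP status code, so it can help to fail fast before parsing the body. A minimal sketch using requests’ built-in check, placed before the response.json() call above:

response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses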

Command line/Python integration#

If you want to query Automatic Extraction from the command line or in Python, consider the zyte-autoextract client library, which makes using the API easier.

This package provides a command-line utility, an asyncio-based library, and a simple synchronous wrapper.

Here is an example of how to query the product page type using the client from the command line:

python -m autoextract \
    urls.txt \
    --api-key [api key] \
    --page-type product \
    --output res.jl

where urls.txt is a text file with one URL to query per line, and res.jl is the output JSON Lines file where the results will be written.
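
For instance, a two-line urls.txt (using the book pages queried elsewhere on this page) would look like this:

http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html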

If you prefer to use Python, then synchronously querying for the product page type is as simple as follows:

from autoextract.sync import request_raw

query = [{
    'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'pageType': 'product'
}]
results = request_raw(query, api_key='[api key]')

where request_raw returns the results as a list of dictionaries.
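
For illustration, here is a minimal sketch of reading the extracted fields from the results above; the ‘product’ key matches the requested page type, and ‘name’ and ‘price’ are fields from the product schema (this assumes the query succeeded):

for result in results:
    product = result['product']  # extracted item for the requested page type
    print(product.get('name'), product.get('price'))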

It is also possible to query Automatic Extraction asynchronously using the asyncio event loop:

from autoextract.aio import request_raw

async def foo():
    query = [{
        'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'pageType': 'product'
    }]
    results = await request_raw(query, api_key='[api key]')
    # ...

If you have many URLs to send, you might want to send them with some level of concurrency to achieve good throughput. The following example shows how to do so:

import asyncio
from autoextract.aio import (
    request_parallel_as_completed, create_session, RequestError,
)
from autoextract import ProductRequest

async def extract_from(urls):
    requests = [ProductRequest(url) for url in urls]
    responses = []
    async with create_session() as session:
        for res_iter in request_parallel_as_completed(
                requests,
                api_key='[api key]',
                n_conn=15,
                session=session):
            try:
                responses.extend(await res_iter)
            except RequestError as e:
                # Do something with the error
                ...
    return responses

urls = ["http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"]
responses = asyncio.run(extract_from(urls))
# ...

More detailed information about usage, installation, and the package in general can be found in the zyte-autoextract documentation.

Scrapy integration#

If you want to integrate querying Automatic Extraction into your Scrapy spider, consider scrapy-autoextract. It lets you consume the Automatic Extraction API through a Scrapy downloader middleware or through Page Object providers, as sketched below.
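
For example, enabling the middleware amounts to a few lines in the project settings. A minimal settings.py sketch; the middleware path, priority, and setting names below are assumptions based on the scrapy-autoextract README, so verify them against the documentation linked below:

# settings.py (sketch; names are assumptions, see the scrapy-autoextract docs)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}
AUTOEXTRACT_USER = '[api key]'     # your Automatic Extraction API key
AUTOEXTRACT_PAGE_TYPE = 'product'  # page type to extract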

To learn more about the library, please check scrapy-autoextract documentation.

Node.js integration#

Here is an example of how to use Automatic Extraction in JavaScript with Node.js:

const https = require('https');

const data = JSON.stringify([{
    'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'pageType': 'product',
}]);
const options = {
    host: 'autoextract.scrapinghub.com',
    path: '/v1/extract',
    headers: {
      'Authorization': 'Basic ' + Buffer.from('[api key]:').toString('base64'),
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(data)  // byte length, not character count
    },
    method: 'POST',
};
const req = https.request(options, res => {
    console.log(`statusCode: ${res.statusCode}`)
    res.on('data', d => {
        process.stdout.write(d)
    })
});
req.on('error', error => {
    console.error(error)
});
req.write(data);
req.end();

PHP integration#

Here is an example of how to use Automatic Extraction in PHP with the cURL library:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://autoextract.scrapinghub.com/v1/extract');
curl_setopt($ch, CURLOPT_USERPWD, '[api key]:');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 605000);
curl_setopt($ch, CURLOPT_POSTFIELDS, '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
// $output contains the result
$output = curl_exec($ch);
curl_close($ch);
?>

Java integration#

Here is an example of how to use Automatic Extraction in Java, requesting a single product extraction, to be placed in a Main.java file:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Main {

    private static final String API_KEY_ENVIRONMENT_VARIABLE_NAME = "AUTOEXTRACT_API_KEY";

    private static final String AUTOEXTRACT_PAGE_TYPE = "product";
    private static final String AUTOEXTRACT_URL = "https://autoextract.scrapinghub.com/v1/extract";

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.err.println("No URL specified");
            System.exit(1);
        }

        try {
            URL url = new URL(args[0]);

            String apiKey = System.getenv(API_KEY_ENVIRONMENT_VARIABLE_NAME);

            if (apiKey == null) {
                System.err.println(String.format(
                    "No API key specified, please set environment variable %s",
                    API_KEY_ENVIRONMENT_VARIABLE_NAME));
                System.exit(1);
            }

            String extractedData = fetchExtractedData(url, apiKey);

            System.out.println(extractedData);
        } catch (MalformedURLException e) {
            System.err.println("Invalid URL");
            System.exit(1);
        } catch (IOException e) {
            System.err.println(String.format("Something went wrong: %s", e.getMessage()));
            System.exit(1);
        }
    }

    private static String fetchExtractedData(URL url, String apiKey) throws IOException {
        URL autoExtractUrl = new URL(AUTOEXTRACT_URL);
        HttpURLConnection connection = (HttpURLConnection) autoExtractUrl.openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty(
            "Authorization",
            String.format("Basic %s", Base64.getEncoder().encodeToString(String.format("%s:", apiKey).getBytes())));
        connection.setRequestProperty("Content-Type", "application/json");
        String payload = String.format(
            "[{\"url\": \"%s\", \"pageType\": \"%s\"}]", url, AUTOEXTRACT_PAGE_TYPE);
        byte[] payloadBytes = payload.getBytes(StandardCharsets.UTF_8);
        connection.setDoOutput(true);
        // Content-Length must be the byte length of the payload, not the character count.
        connection.setRequestProperty("Content-Length", Integer.toString(payloadBytes.length));
        connection.getOutputStream().write(payloadBytes);

        StringBuilder response = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String inLine;

        while ((inLine = in.readLine()) != null) {
            response.append(inLine);
        }
        in.close();

        return response.toString();
    }

}

The program reads the API key from the AUTOEXTRACT_API_KEY environment variable; example usage:

$ javac Main.java
$ AUTOEXTRACT_API_KEY="your-key-here" java Main 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'