Article List Extraction (beta)

Request example

If you request an article list extraction and the extraction succeeds, the articleList field is available in the query result:

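from autoextract.sync import request_raw

# One query per page; pageType selects article list extraction.
query = [{
    'url': 'http://example.com/blog/?p=3',
    'pageType': 'articleList',
}]
results = request_raw(query, api_key='[api key]')
# The first result corresponds to the first (and only) query.
print(results[0]['articleList'])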

Available fields

Top-level

The following fields are available for articleList:

url: string, required

URL of the page from which the articles were extracted.

articles: list of dictionaries

List of articles. Individual fields are described below.

Individual articles

Each article inside articles has the following fields:

headline: string

Article headline or title.

datePublished: string

Publication date. ISO-formatted with a ‘T’ separator; may contain a timezone.

datePublishedRaw: string

Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.
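
As a minimal sketch of consuming datePublished, the helper below parses the ISO-formatted string with the Python standard library. The function name parse_date_published is hypothetical, and the handling of a trailing "Z" is an assumption about how a timezone might be written; the description above only says a timezone may be present.

```python
from datetime import datetime


def parse_date_published(value):
    """Parse a datePublished string into a datetime (naive or timezone-aware)."""
    # Assumption: a UTC timezone might be written as a trailing "Z".
    # Older Python versions do not accept that in fromisoformat(),
    # so it is normalized to an explicit offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


naive = parse_date_published("2019-06-19T00:00:00")
aware = parse_date_published("2019-06-19T00:00:00+02:00")

print(naive.tzinfo is None)  # True: no timezone in the input
print(aware.utcoffset())     # 2:00:00
```

Because the timezone is optional, code that compares dates across articles should decide up front how to treat naive values; mixing naive and aware datetimes in a comparison raises a TypeError in Python.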

author: string

Author (or authors) of the article.

authorsList: list of strings

All authors of the article split into separate strings. For example,

  • if author is "Alice and Bob", authorsList would be ["Alice", "Bob"],

  • if author is "Alice Johnes" (a single author), authorsList would be ["Alice Johnes"].
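
One way to read the two author fields together is sketched below. The helper name article_authors is hypothetical, and the fallback behavior is an illustrative choice rather than part of the API: it prefers authorsList and, when only author is present, returns that string as a single entry without trying to split it.

```python
def article_authors(article):
    """Return the authors of an article dictionary as a list of strings."""
    authors = article.get("authorsList")
    if authors:
        return list(authors)
    # Fall back to the combined author string, kept as one entry;
    # no attempt is made to split it into individual names.
    author = article.get("author")
    return [author] if author else []


print(article_authors({"author": "Alice and Bob", "authorsList": ["Alice", "Bob"]}))
print(article_authors({"author": "Alice Johnes"}))
print(article_authors({}))
```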

inLanguage: string

Language of the article, as an IETF BCP 47 language tag.

mainImage: string

A URL or data URL value of the main image of the article. All URLs are absolute.

images: list of strings

A list of URL or data URL values of all images of the article (may include the main image). All URLs are absolute.

articleBody: string

Text of the article as it appears on the list page.

probability: float

Probability that this is a single article.
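
The probability value can be used to discard low-confidence items. The sketch below keeps only articles at or above a cutoff; the function name confident_articles and the 0.9 threshold are illustrative assumptions, not recommended values.

```python
def confident_articles(articles, threshold=0.9):
    """Keep only the articles whose probability is at least `threshold`."""
    return [a for a in articles if a["probability"] >= threshold]


articles = [
    {"headline": "Article headline", "probability": 0.95},
    {"headline": "Headline of another article", "probability": 0.82},
]

kept = confident_articles(articles)
print([a["headline"] for a in kept])  # ['Article headline']
```

A suitable threshold depends on whether missing an article or keeping a false positive is the more costly mistake for a given use case.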

url: string

URL of the full article page.

All fields are optional, except for probability. Fields without a valid value (null or empty array) are excluded from extraction results.
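
Because fields without a value are left out of the result entirely, client code should not assume a key exists. A minimal sketch, with a hypothetical helper name (summarize) and illustrative default values, might look like this:

```python
def summarize(article):
    """Build a small summary that tolerates missing optional fields."""
    return {
        # Optional fields: use .get() so an absent key does not raise KeyError.
        "headline": article.get("headline", "(no headline)"),
        "url": article.get("url"),
        "images": article.get("images", []),
        # probability is the only field that is always present.
        "probability": article["probability"],
    }


summary = summarize({"probability": 0.82, "url": "https://www.example.com/article?id=25"})
print(summary)
```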

Response example

Below is an example response with all article list fields present:

[
  {
    "articleList": {
      "url": "http://www.example.com",
      "articles": [
        {
          "headline": "Article headline",
          "datePublished": "2019-06-19T00:00:00",
          "datePublishedRaw": "June 19, 2019",
          "author": "Article author",
          "authorsList": [
            "Article author"
          ],
          "inLanguage": "en",
          "mainImage": "http://www.example.com/image.png",
          "images": [
            "http://example.com/image.png"
          ],
          "articleBody": "Article body ...",
          "probability": 0.95,
          "url": "https://www.example.com/article?id=23"
        },
        {
          "headline": "Headline of another article",
          "probability": 0.82,
          "url": "https://www.example.com/article?id=25"
        }
      ]
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "www.example.com",
      "userQuery": {
        "pageType": "articleList",
        "url": "http://www.example.com/"
      }
    }
  }
]
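
As a sketch of how such a response could be processed, the code below walks a trimmed copy of the example above and collects the headline and URL of every article. The function name list_article_links is hypothetical, and skipping results that lack articleList is an assumption about how unsuccessful extractions are best handled on the client side.

```python
import json

# Trimmed copy of the example response, kept as valid JSON.
raw = """
[
  {
    "articleList": {
      "url": "http://www.example.com",
      "articles": [
        {"headline": "Article headline", "probability": 0.95,
         "url": "https://www.example.com/article?id=23"},
        {"headline": "Headline of another article", "probability": 0.82,
         "url": "https://www.example.com/article?id=25"}
      ]
    },
    "query": {"userQuery": {"pageType": "articleList",
                            "url": "http://www.example.com/"}}
  }
]
"""


def list_article_links(results):
    """Collect (headline, url) pairs from every result that has an article list."""
    links = []
    for result in results:
        article_list = result.get("articleList")
        if article_list is None:
            continue  # no article list was extracted for this query
        for article in article_list.get("articles", []):
            links.append((article.get("headline"), article.get("url")))
    return links


links = list_article_links(json.loads(raw))
for headline, url in links:
    print(headline, url)
```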