Article List Extraction (beta)

Article list extraction supports pages that contain multiple articles, usually as links or short snippets. Examples of such pages are main or category pages of news sites, main pages of blogs showing multiple posts, and other pages with multiple articles. Each extracted article usually contains a link, which can be sent to Article Extraction to get more detailed information.

The article list page type is especially useful for implementing crawling: extracted article URLs are sent to the article API, and extracted pagination URLs are sent to the article list API.

This supports use-cases such as news and media monitoring, analytics, brand monitoring, mentions, sentiment analysis and many others.

A related page type is Article Extraction, which supports single-article pages.

Request example

If you request an article list extraction and the extraction succeeds, the articleList field will be available in the query result:

from autoextract.sync import request_raw

query = [{
    'url': 'http://example.com/blog/?p=3',
    'pageType': 'articleList'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['articleList'])

Pagination fields, when available, allow fetching subsequent pages, for example:

# example continued from above
if results[0]['articleList'].get('paginationNext'):
    query = [{
        'url': results[0]['articleList']['paginationNext']['url'],
        'pageType': 'articleList'
    }]
    results_next = request_raw(query, api_key='[api key]')
    print(results_next[0]['articleList'])
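
The single pagination step above can be extended into a small crawl loop that collects article URLs (for the article API) while feeding pagination URLs back to the article list API. The following is a minimal sketch, not a reference implementation: crawl_article_lists, its fetch parameter, and the max_pages limit are hypothetical names introduced here for illustration. fetch is assumed to be any callable that takes a query list and returns the raw result list in the shape shown above.

```python
from collections import deque


def crawl_article_lists(start_url, fetch, max_pages=10):
    """Follow paginationNext links and collect article URLs.

    `fetch` is a hypothetical callable: it receives a query list and
    returns the raw result list (for example, a wrapper around request_raw).
    """
    pending = deque([start_url])
    seen_pages = set()
    article_urls = []

    while pending and len(seen_pages) < max_pages:
        page_url = pending.popleft()
        if page_url in seen_pages:
            continue
        seen_pages.add(page_url)

        results = fetch([{'url': page_url, 'pageType': 'articleList'}])
        # articleList may be missing if the extraction did not succeed.
        article_list = results[0].get('articleList') or {}

        # Article URLs are candidates for the article API.
        for article in article_list.get('articles', []):
            url = article.get('url')
            if url and url not in article_urls:
                article_urls.append(url)

        # Pagination URLs go back to the article list API.
        next_page = article_list.get('paginationNext')
        if next_page:
            pending.append(next_page['url'])

    return article_urls
```

With the client from the request example, fetch could be something like lambda query: request_raw(query, api_key='[api key]'). The max_pages limit is there only to keep the sketch from following pagination indefinitely.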

Available fields

Top-level

The following fields are available for articleList:

url: string, required

URL of the page from which the articles were extracted.

articles: list of dictionaries

List of articles. Individual fields are described below.

paginationNext: dictionary

If pagination is present on an article list page, the paginationNext dictionary contains information about the link to the next page. Fields:

  • url is the URL of the next page. It is a required field.

  • text is the text corresponding to the link as it appears on site. Optional.

Example:

{"url": "http://example.com/foo?p=3", "text": "3"}

paginationPrevious: dictionary

If pagination is present on an article list page, the paginationPrevious dictionary contains information about the link to the previous page. Fields:

  • url is the URL of the previous page. It is a required field.

  • text is the text corresponding to the link as it appears on site. Optional.

Example:

{"url": "http://example.com/foo?p=1", "text": "Prev"}

Individual articles

Each article inside articles has the following fields:

headline: string

Article headline or title.

datePublished: string

Publication date. ISO-formatted with ‘T’ separator, may contain a timezone.

datePublishedRaw: string

Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.

author: string

Author (or authors) of the article.

authorsList: list of strings

All authors of the article split into separate strings. For example,

  • if author is "Alice and Bob", authorsList would be ["Alice", "Bob"],

  • if author is "Alice Johnes" (a single author), authorsList would be ["Alice Johnes"].

inLanguage: string

Language of the article, as an IETF BCP 47 language tag.

mainImage: string

A URL or data URL value of the main image of the article. All URLs are absolute.

images: list of strings

A list of URL or data URL values of all images of the article (may include the main image). All URLs are absolute.

articleBody: string

Text of the article as it appears on the list page.

probability: float

Probability that this is a single article.

url: string

URL of the full article page.

All fields are optional, except for probability. Fields without a valid value (null or empty array) are excluded from extraction results.
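
Because only probability is guaranteed to be present, code that consumes the articles list should read every other field defensively. The sketch below shows one way to do that while dropping low-probability entries. select_articles and the 0.5 cutoff are hypothetical choices made for illustration, not values recommended by the API.

```python
def select_articles(article_list, min_probability=0.5):
    """Return (headline, url) pairs for articles at or above a cutoff.

    `min_probability` is an arbitrary illustrative default.
    """
    selected = []
    for article in article_list.get('articles', []):
        # probability is the only field that is always present.
        if article['probability'] < min_probability:
            continue
        # url is optional; without it there is nothing to send onward.
        url = article.get('url')
        if url is None:
            continue
        # headline is optional too, so it may be None in the result.
        selected.append((article.get('headline'), url))
    return selected
```

Using .get() for optional fields avoids a KeyError when a field was excluded from the extraction results.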

Response example

Below is an example response with all article list fields present:

[
  {
    "articleList": {
      "url": "http://www.example.com/article-list-page-3",
      "paginationNext": {
        "text": "Next Page",
        "url": "http://example.com/article-list-page-4"
      },
      "paginationPrevious": {
        "text": "Previous Page",
        "url": "http://example.com/article-list-page-2"
      },
      "articles": [
        {
          "headline": "Article headline",
          "datePublished": "2019-06-19T00:00:00",
          "datePublishedRaw": "June 19, 2019",
          "author": "Article author",
          "authorsList": [
            "Article author"
          ],
          "inLanguage": "en",
          "mainImage": "http://www.example.com/image.png",
          "images": [
            "http://example.com/image.png"
          ],
          "articleBody": "Article body ...",
          "probability": 0.95,
          "url": "https://www.example.com/article?id=23"
        },
        {
          "headline": "Headline of another article",
          "probability": 0.82,
          "url": "https://www.example.com/article?id=25"
        }
      ]
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "www.example.com",
      "userQuery": {
        "pageType": "articleList",
        "url": "http://www.example.com/article-list-page-3"
      }
    }
  }
]
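
The datePublished values in a response like the one above are ISO-formatted strings that may or may not carry a timezone, so comparing or sorting them is easier after parsing. The following sketch uses only the Python standard library. parse_date_published is a hypothetical helper, and treating timezone-less values as UTC is an assumption made here for illustration; the API does not state which timezone such values are in.

```python
from datetime import datetime, timezone


def parse_date_published(value, assume_tz=timezone.utc):
    """Parse a datePublished string into a timezone-aware datetime."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    if value.endswith('Z'):
        value = value[:-1] + '+00:00'
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # No timezone in the extracted value: attach the assumed one.
        parsed = parsed.replace(tzinfo=assume_tz)
    return parsed


published = parse_date_published('2019-06-19T00:00:00')
print(published.isoformat())
```

Since datePublished is optional, check that the field is present before calling the helper.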