Article List Extraction
Article list extraction supports pages that contain multiple articles, usually as links or short snippets. Examples include the main or category pages of news sites and the main pages of blogs showing multiple posts. Each extracted article usually contains a link, which can be sent to Article Extraction to get more detailed information.
The article list page type is especially useful for implementing crawling: extracted article URLs are sent to the article API, and extracted pagination URLs are sent to the article list API.
This supports use cases such as news and media monitoring, analytics, brand monitoring, mentions, sentiment analysis and many others.
A related page type is Article Extraction, which supports single-article pages.
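The crawling pattern described above can be sketched as a small helper that turns one articleList result into follow-up queries. This is an illustration only: the name `follow_up_queries` is hypothetical, the `article` value for `pageType` is an assumption based on the Article Extraction page type, and only `paginationNext` is followed.

```python
def follow_up_queries(article_list):
    """Build follow-up queries from one articleList result (hypothetical helper)."""
    queries = []
    # Each extracted article that has a URL goes to article extraction.
    for article in article_list.get('articles', []):
        if article.get('url'):
            # 'article' as the pageType value is an assumption.
            queries.append({'url': article['url'], 'pageType': 'article'})
    # The next list page, when present, goes back to article list extraction.
    next_page = article_list.get('paginationNext')
    if next_page:
        queries.append({'url': next_page['url'], 'pageType': 'articleList'})
    return queries
```

Articles without a `url` are skipped, since there is nothing to send to article extraction for them.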
Request example
If you request an article list extraction and the extraction succeeds, the articleList field will be available in the query result:
from autoextract.sync import request_raw

query = [{
    'url': 'http://example.com/blog/?p=3',
    'pageType': 'articleList'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['articleList'])
Pagination fields, when available, allow fetching subsequent pages, for example:
# example continued from above
if results[0]['articleList'].get('paginationNext'):
    query = [{
        'url': results[0]['articleList']['paginationNext']['url'],
        'pageType': 'articleList'
    }]
    results_next = request_raw(query, api_key='[api key]')
    print(results_next[0]['articleList'])
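Following pagination across several pages can be written as a loop. The sketch below is not part of the client library: `iter_article_lists` and its `fetch` and `max_pages` parameters are hypothetical, and `fetch` is assumed to be any callable that takes a query list and returns results in the shape shown above.

```python
def iter_article_lists(start_url, fetch, max_pages=5):
    """Yield articleList results, following paginationNext links."""
    url = start_url
    seen = set()
    for _ in range(max_pages):
        if url is None or url in seen:
            break  # no next page, or pagination points back to a visited page
        seen.add(url)
        results = fetch([{'url': url, 'pageType': 'articleList'}])
        article_list = results[0].get('articleList')
        if not article_list:
            break  # extraction did not return an article list
        yield article_list
        next_page = article_list.get('paginationNext')
        url = next_page['url'] if next_page else None
```

With the client from the example above, `fetch` could be `lambda q: request_raw(q, api_key='[api key]')`. The `max_pages` cap and the `seen` set keep the loop bounded even if a site's pagination links form a cycle.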
Available fields
Top-level
The following fields are available for articleList:
url
: string, required. URL of a page where articles were extracted.
articles
: list of dictionaries. List of articles. Individual fields are described below.
paginationNext
: dictionary. If pagination on an article list page is present, then the paginationNext dictionary contains information about the link to the next page. Fields:
    url is the URL of the next page. It is a required field.
    text is the text corresponding to the link as it appears on site. Optional.
  Example:
    {"url": "http://example.com/foo?p=3", "text": "3"}
paginationPrevious
: dictionary. If pagination on an article list page is present, then the paginationPrevious dictionary contains information about the link to the previous page. Fields:
    url is the URL of the previous page. It is a required field.
    text is the text corresponding to the link as it appears on site. Optional.
  Example:
    {"url": "http://example.com/foo?p=1", "text": "Prev"}
Individual articles
Each article inside articles has the following fields:
headline
: string. Article headline or title.
datePublished
: string. Publication date. ISO-formatted with a "T" separator; may contain a timezone.
datePublishedRaw
: string. Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.
author
: string. Author (or authors) of the article.
authorsList
: list of strings. All authors of the article split into separate strings. For example, if author is "Alice and Bob", authorsList would be ["Alice", "Bob"]; if author is "Alice Johnes" (a single author), authorsList would be ["Alice Johnes"].
inLanguage
: string. Language of the article, as an IETF BCP 47 language tag.
mainImage
: string. A URL or data URL value of the main image of the article. All URLs are absolute.
images
: list of strings. A list of URL or data URL values of all images of the article (may include the main image). All URLs are absolute.
articleBody
: string. Text of the article as it appears on the list page.
probability
: float. Probability that this is a single article.
url
: string. URL of the full article page.
All fields are optional, except for probability.
Fields without a valid value (null or empty array) are excluded from extraction results.
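Because only probability is guaranteed and empty fields are omitted rather than returned as null, client code should avoid indexing optional keys directly. A minimal sketch, where `summarize_articles` and the `min_probability` threshold of 0.5 are illustrative choices and not part of the API:

```python
def summarize_articles(article_list, min_probability=0.5):
    """Return (headline, url, authors) tuples for likely articles."""
    rows = []
    for article in article_list.get('articles', []):
        # probability is the only field that is always present.
        if article['probability'] < min_probability:
            continue
        rows.append((
            article.get('headline'),         # None when omitted
            article.get('url'),              # None when omitted
            article.get('authorsList', []),  # empty list when omitted
        ))
    return rows
```

Using `.get()` with a default keeps the code working when a field such as authorsList is absent from a given article.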
Response example
Below is an example response with all article list fields present:
[
    {
        "articleList": {
            "url": "http://www.example.com/article-list-page-3",
            "paginationNext": {
                "text": "Next Page",
                "url": "http://example.com/article-list-page-4"
            },
            "paginationPrevious": {
                "text": "Previous Page",
                "url": "http://example.com/article-list-page-2"
            },
            "articles": [
                {
                    "headline": "Article headline",
                    "datePublished": "2019-06-19T00:00:00",
                    "datePublishedRaw": "June 19, 2019",
                    "author": "Article author",
                    "authorsList": [
                        "Article author"
                    ],
                    "inLanguage": "en",
                    "mainImage": "http://www.example.com/image.png",
                    "images": [
                        "http://example.com/image.png"
                    ],
                    "articleBody": "Article body ...",
                    "probability": 0.95,
                    "url": "https://www.example.com/article?id=23"
                },
                {
                    "headline": "Headline of another article",
                    "probability": 0.82,
                    "url": "https://www.example.com/article?id=25"
                }
            ]
        },
        "webPage": {
            "inLanguages": [
                {"code": "en"},
                {"code": "es"}
            ]
        },
        "query": {
            "id": "1564747029122-9e02a1868d70b7a3",
            "domain": "www.example.com",
            "userQuery": {
                "pageType": "articleList",
                "url": "http://www.example.com/article-list-page-3"
            }
        }
    }
]
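A response like the one above can be processed with the Python standard library alone. The sketch below assumes datePublished is in a form that `datetime.fromisoformat` accepts, as the sample value is; `published_dates` is a hypothetical helper name. Values with a trailing "Z" are only accepted by `fromisoformat` from Python 3.11 onward.

```python
import json
from datetime import datetime

def published_dates(response_text):
    """Map article URL to parsed datePublished for articles that have both."""
    dates = {}
    for result in json.loads(response_text):
        article_list = result.get('articleList', {})
        for article in article_list.get('articles', []):
            url = article.get('url')
            raw = article.get('datePublished')
            # Both fields are optional, so skip articles missing either one.
            if url and raw:
                dates[url] = datetime.fromisoformat(raw)
    return dates
```

Applied to the example response, only the first article would appear in the result, because the second one has no datePublished field.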