Article List Extraction (beta)
Request example
If you requested an article list extraction, and the extraction succeeds, then the articleList field will be available in the query result:
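from autoextract.sync import request_raw

# One query per page; pageType selects article list extraction
query = [{
    'url': 'http://example.com/blog/?p=3',
    'pageType': 'articleList',
}]
results = request_raw(query, api_key='[api key]')
# The first query result carries the articleList field on success
print(results[0]['articleList'])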
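Because articleList is only present when the extraction succeeds, it can help to check for the key before using it. The following is a minimal sketch, assuming each query result has already been decoded into a plain dictionary; the helper name `get_article_list` and the sample data are hypothetical and not part of the autoextract package:

```python
def get_article_list(result):
    """Return the 'articleList' dict from one query result, or None if absent."""
    # Hypothetical helper: 'result' is assumed to be a single decoded
    # query result (a dict), not the whole list of results.
    return result.get('articleList')


# Illustrative data only, not real API output.
succeeded = {'articleList': {'url': 'http://example.com/blog/?p=3', 'articles': []}}
without_list = {'query': {'userQuery': {'pageType': 'articleList'}}}

for result in (succeeded, without_list):
    article_list = get_article_list(result)
    if article_list is None:
        print('no article list in this result')
    else:
        print(len(article_list.get('articles', [])), 'articles found')
```

Returning None rather than raising keeps the calling code free to decide how to handle a result that has no article list.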
Available fields
Top-level
The following fields are available for articleList:
url
: string, required. URL of a page where articles were extracted.
articles
: list of dictionaries. List of articles. Individual fields are described below.
Individual articles
Each article inside articles has the following fields:
headline
: string. Article headline or title.
datePublished
: string. Publication date. ISO-formatted with ‘T’ separator, may contain a timezone.
datePublishedRaw
: string. Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.
author
: string. Author (or authors) of the article.
authorsList
: list of strings. All authors of the article split into separate strings. For example, if author is "Alice and Bob", authorsList would be ["Alice", "Bob"]; if author is "Alice Johnes" (a single author), authorsList would be ["Alice Johnes"].
inLanguage
: string. Language of the article, as an IETF BCP 47 language tag.
mainImage
: string. A URL or data URL value of the main image of the article. All URLs are absolute.
images
: list of strings. A list of URL or data URL values of all images of the article (may include the main image). All URLs are absolute.
articleBody
: string. Text of the article as it appears on the list page.
probability
: float. Probability that this is a single article.
url
: string. URL of the full article page.
All fields are optional, except for probability.
Fields without a valid value (null or empty array) are excluded from extraction results.
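Since every article field except probability may be missing, code that reads articles should not assume a key exists. The sketch below shows one way to do that with `dict.get`; the function name `summarize_articles`, the `min_probability` threshold of 0.5, and the sample data are assumptions made for illustration, not recommendations from the API:

```python
def summarize_articles(article_list, min_probability=0.5):
    """Return (headline, url, authors) tuples for articles above a threshold."""
    summaries = []
    for article in article_list.get('articles', []):
        # 'probability' is the only article field documented as always present.
        if article['probability'] < min_probability:
            continue
        summaries.append((
            article.get('headline'),         # None when the field was excluded
            article.get('url'),              # None when the field was excluded
            article.get('authorsList', []),  # empty list when no authors
        ))
    return summaries


# Illustrative data only, shaped like the fields described above.
sample = {
    'url': 'http://www.example.com',
    'articles': [
        {'headline': 'Article headline', 'authorsList': ['Article author'],
         'probability': 0.95, 'url': 'https://www.example.com/article?id=23'},
        {'headline': 'Headline of another article', 'probability': 0.82,
         'url': 'https://www.example.com/article?id=25'},
        {'probability': 0.1},
    ],
}

for headline, url, authors in summarize_articles(sample):
    print(headline, url, authors)
```

The third sample article is skipped because its probability falls below the chosen threshold; a different cutoff may suit other sites better.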
Response example
Below is an example response with all article list fields present:
[
    {
        "articleList": {
            "url": "http://www.example.com",
            "articles": [
                {
                    "headline": "Article headline",
                    "datePublished": "2019-06-19T00:00:00",
                    "datePublishedRaw": "June 19, 2019",
                    "author": "Article author",
                    "authorsList": [
                        "Article author"
                    ],
                    "inLanguage": "en",
                    "mainImage": "http://www.example.com/image.png",
                    "images": [
                        "http://example.com/image.png"
                    ],
                    "articleBody": "Article body ...",
                    "probability": 0.95,
                    "url": "https://www.example.com/article?id=23"
                },
                {
                    "headline": "Headline of another article",
                    "probability": 0.82,
                    "url": "https://www.example.com/article?id=25"
                }
            ]
        },
        "webPage": {
            "inLanguages": [
                {"code": "en"},
                {"code": "es"}
            ]
        },
        "query": {
            "id": "1564747029122-9e02a1868d70b7a3",
            "domain": "www.example.com",
            "userQuery": {
                "pageType": "articleList",
                "url": "http://www.example.com/"
            }
        }
    }
]
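The datePublished values in a response like the one above are ISO-formatted strings, so they can be turned into datetime objects with the standard library. This is a minimal sketch, assuming Python 3.7 or later; the helper name `parse_date_published` is hypothetical. Note that `datetime.fromisoformat` accepts offsets such as `+02:00`, while a trailing `Z` is only accepted from Python 3.11 on:

```python
from datetime import datetime


def parse_date_published(article):
    """Return datePublished as a datetime, or None if the field is absent."""
    value = article.get('datePublished')
    if value is None:
        # The field is optional and is excluded when no valid date was found.
        return None
    return datetime.fromisoformat(value)


# Illustrative articles shaped like the response example.
naive = parse_date_published({'datePublished': '2019-06-19T00:00:00'})
aware = parse_date_published({'datePublished': '2019-06-19T00:00:00+02:00'})
missing = parse_date_published({'headline': 'Headline of another article'})

print(naive, aware, missing)
```

Values without a timezone produce naive datetime objects, so comparing them with timezone-aware ones raises a TypeError; normalizing to one convention first avoids that.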