Zyte Data article list schema v1.0

Standard Article List Schema (1.0)

The standard article list schema used in the Zyte offering. It covers the typical set of attributes present in article listings published online.

Responses

Response Schema: application/json
url
required
string (URL)

The main URL of the article list.

The URL of the final response, after any redirects.

Required attribute.

If there is no article list data on the page, or the page was not reached, the returned item still contains the "url" field, the "metadata" field with a timestamp in "dateDownloaded", and any other available data points.
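Client code can check for the fallback case described above before processing an item. The following is a minimal sketch, assuming the item has already been decoded from JSON into a Python dict and that the array of articles is stored under a key named "articles". That key name is not shown in this section, so treat it as an assumption; `has_article_data` is a hypothetical helper, not part of any Zyte library.

```python
from typing import Any


def has_article_data(item: dict[str, Any], list_key: str = "articles") -> bool:
    """Return True if the item carries at least one article entry.

    `list_key` is an assumed key name for the article array.
    """
    articles = item.get(list_key)
    return isinstance(articles, list) and len(articles) > 0


# Hypothetical item for a page with no article list data: only "url" and
# "metadata" (with "dateDownloaded") are present.
empty_item = {
    "url": "https://example.com/news",
    "metadata": {"dateDownloaded": "2024-01-01T00:00:00Z"},
}
```

An item such as `empty_item` is still a valid result; a consumer would typically record it and move on rather than treat it as an error.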

canonicalUrl
string

The canonical form of the URL, selected by the website.

Array of objects[ items ]

List of article details found on the page.

The order of the articles reflects their position on the page.

Array
articleBody
string

Clean text of the article, including sub-headings, with newline separators.

Format:

  • trimmed (no whitespace at the beginning or the end of the body string),

  • line breaks included,

  • no length limit,

  • no normalization of Unicode characters.
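Because the body is delivered without Unicode normalization and with line breaks intact, consumers that compare or index text may want to post-process it. The sketch below uses only the Python standard library; the helper names `normalize_body` and `split_lines` are hypothetical, and the choice of NFC as the default form is an assumption, not something the schema prescribes.

```python
import unicodedata


def normalize_body(article_body: str, form: str = "NFC") -> str:
    """Apply a Unicode normalization form on the consumer side."""
    return unicodedata.normalize(form, article_body)


def split_lines(article_body: str) -> list[str]:
    """Split the body on newline separators and drop empty lines."""
    return [line for line in article_body.split("\n") if line.strip()]


def is_trimmed(article_body: str) -> bool:
    """Check the documented guarantee: no leading or trailing whitespace."""
    return article_body == article_body.strip()
```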

Array of objects[ items ]

All authors of the article.

Array
email
string

The email of the author.

url
string (URL)

The URL to the author's details page.

name
string

Full name of the author.

nameRaw
string

Text from which this author was extracted.

datePublished
string

Publication date of the article.

Format: ISO 8601, either "YYYY-MM-DDThh:mm:ssZ" or "YYYY-MM-DDThh:mm:ss±zz:zz".

With timezone, if available.

If the actual publication date is not found, the value of "dateModified" is used instead.
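Both documented date shapes can be parsed with the Python standard library. This is a minimal sketch; `parse_iso8601` is a hypothetical helper name. The "Z" suffix is rewritten to an explicit offset because `datetime.fromisoformat` only accepts a trailing "Z" from Python 3.11 onward.

```python
from datetime import datetime, timedelta, timezone


def parse_iso8601(value: str) -> datetime:
    """Parse "YYYY-MM-DDThh:mm:ssZ" or "YYYY-MM-DDThh:mm:ss±zz:zz".

    A value without any timezone part yields a naive datetime, which
    matches the "with timezone, if available" wording of the schema.
    """
    if value.endswith("Z"):
        # Rewrite the UTC designator as an explicit offset for older Pythons.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


# Hypothetical values in the two documented formats.
utc_date = parse_iso8601("2024-03-05T14:30:00Z")
offset_date = parse_iso8601("2024-03-05T14:30:00+02:00")
```

Keeping "datePublishedRaw" alongside the parsed value is useful when the normalized date looks wrong, since it shows the text the date was derived from.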

datePublishedRaw
string

Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.

headline
string

Article headline or title.

inLanguage
string

Language of the article, as an ISO 639-1 language code. The article language can differ from the overall language of the web page.

object (Image)

The details of the main image of the article.

url
required
string (URL)

The URL of the image.

Array of objects (Image) [ items ]

A list of URL values of all images of the article.

Array
url
required
string (URL)

The URL of the image.

probability
number [ 0 .. 1 ]

The probability that the page is an article page.

url
required
string (URL)

The main URL of the article page.

Array of objects or objects[ items ]

The list of breadcrumbs with URL and optional category name.

Array
Any of
string

Breadcrumb name or category name.

string (URL)

Breadcrumb URL.

object

Details of the next page URL, if available.

url
string (URL)

The URL of the pagination link.

text
string

The text of the pagination link.

pageNumber
integer

Current page number, if displayed explicitly on the list page.

Numeration starts with 1.
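The pagination fields above are enough to walk a listing page by page. The sketch below only shows how the next URL could be read from one item. It assumes the pagination object is stored under a key named "paginationNext"; that name is not shown in this section and is an assumption, as is the helper name `next_page_url`.

```python
from typing import Any, Optional


def next_page_url(item: dict[str, Any], key: str = "paginationNext") -> Optional[str]:
    """Return the URL of the next page, or None if no link was extracted.

    `key` is an assumed name for the pagination object.
    """
    pagination = item.get(key)
    if isinstance(pagination, dict):
        return pagination.get("url")
    return None


# Hypothetical item for the first page of a listing.
first_page = {
    "url": "https://example.com/news",
    "pageNumber": 1,
    "paginationNext": {"url": "https://example.com/news?page=2", "text": "Next"},
}
```

A crawler would request the returned URL and repeat until the helper returns None. Since "pageNumber" is only present when the page displays it explicitly, it is safer not to rely on it as a loop counter.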

object

Metadata about the data extraction process.

dateDownloaded
string

The timestamp at which the article list data was downloaded.

Timezone: UTC.

Format: ISO 8601, "YYYY-MM-DDThh:mm:ssZ".
