Zyte Data article schema v1.0

Standard Article Schema (1.0)

The standard article schema used in the Zyte offering. It covers the typical set of attributes present in articles published online.

Responses

Response Schema: application/json
headline
string

Article headline or title.

datePublished
string

Publication date of the article.

Format: ISO 8601, either "YYYY-MM-DDThh:mm:ssZ" or "YYYY-MM-DDThh:mm:ss±zz:zz".

With timezone, if available.

If the actual publication date is not found, the value of "dateModified" is used instead.
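
As a minimal sketch of consuming this field, the following Python parses both documented forms into a `datetime`. The helper name `parse_schema_datetime` is hypothetical, and returning a naive `datetime` when the value carries no timezone is an assumption based on the "if available" wording above.

```python
from datetime import datetime
from typing import Optional


def parse_schema_datetime(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp such as datePublished or dateModified."""
    if not value:
        return None
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset for older interpreters.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    # Values without an offset come back as naive datetimes (assumption:
    # the timezone was simply not available on the page).
    return datetime.fromisoformat(value)
```

The same helper applies to "dateModified", which uses the same format.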

datePublishedRaw
string

Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website.

dateModified
string

The date when the article was most recently modified.

Format: ISO 8601, either "YYYY-MM-DDThh:mm:ssZ" or "YYYY-MM-DDThh:mm:ss±zz:zz".

With timezone, if available.

dateModifiedRaw
string

Same date as "dateModified" but before parsing/normalization, i.e. as it appears on the website.

Array of objects[ items ]

All authors of the article.

Array
email
string

The email of the author.

url
string (URL)

The URL to the author's details page.

name
string

Full name of the author.

nameRaw
string

Text from which this author was extracted.

Array of objects or objects[ items ]

The list of breadcrumbs with URL and optional category name.

Array
Any of
string

Breadcrumb name or category name.

string (URL)

Breadcrumb URL.

inLanguage
string

Language of the article, as an ISO 639-1 language code. The article language may differ from the overall language of the web page.

object (Image)

The details of the main image of the article.

url
required
string (URL)

A URL of an image

Array of objects (Image) [ items ]

A list of URL values of all images of the article.

Array
url
required
string (URL)

A URL of an image

description
string

A short summary of the article. It can be either human-provided (if available), or auto-generated.

articleBody
string

Clean text of the article, including sub-headings, with newline separators.

Format:

  • trimmed (no whitespace at the beginning or the end of the body string),

  • line breaks included,

  • no length limit,

  • no normalization of Unicode characters.
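
The format rules above can be illustrated with a short Python sketch. The function names `body_paragraphs` and `is_trimmed` are hypothetical. The schema does not state whether blocks are separated by one or several line breaks, so the sketch simply drops empty lines and handles either case.

```python
from typing import List


def body_paragraphs(article_body: str) -> List[str]:
    """Split articleBody on line breaks, dropping empty lines."""
    return [line.strip() for line in article_body.splitlines() if line.strip()]


def is_trimmed(article_body: str) -> bool:
    """Check the documented rule: no leading or trailing whitespace."""
    return article_body == article_body.strip()
```

Because Unicode characters are not normalized, a consumer that compares or deduplicates bodies may want to normalize them itself (for example with `unicodedata.normalize`) before doing so.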

articleBodyHtml
string

Simplified and standardized HTML of the article, including sub-headings, image captions and embedded content (videos, tweets, etc.).

Format: an HTML string normalized consistently by an internal algorithm.

Array of objects[ items ]

A list of all videos inside the article body.

Array
url
required
string (URL)

URL of a media item

Array of objects[ items ]

A list of all audio items inside the article body.

Array
url
required
string (URL)

URL of a media item

canonicalUrl
string (URL)

The canonical form of the URL, selected by the website.

url
required
string (URL)

The main URL of the article page.

The URL of the final response, after any redirects.

Required attribute.

If there is no article data on the page, or the page could not be reached, the returned "empty" item still contains the url field and the metadata field with dateDownloaded.
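
A consumer may want to tell such "empty" items apart from real articles. The sketch below is one way to do that in Python. The function name `is_empty_item` is hypothetical, and it assumes the metadata object is exposed under the key "metadata", as the description above suggests.

```python
from typing import Any, Dict

# Assumption: these are the only keys an "empty" item is expected to carry.
ALWAYS_PRESENT = {"url", "metadata"}


def is_empty_item(item: Dict[str, Any]) -> bool:
    """Return True when the item has no data beyond url and metadata."""
    return not any(
        value not in (None, "", [], {})
        for key, value in item.items()
        if key not in ALWAYS_PRESENT
    )
```

Attributes that are present but hold `None`, an empty string, an empty list, or an empty object are treated as missing, which is a design choice of this sketch and not something the schema specifies.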

object

Contains metadata about the data extraction process.

dateDownloaded
string

The timestamp at which the article data was downloaded.

Timezone: UTC.

Format: ISO 8601, "YYYY-MM-DDThh:mm:ssZ".

probability
number [ 0 .. 1 ]

The probability that the page is an article page.
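
The probability value lends itself to filtering out pages that are unlikely to be articles. The following Python is a sketch only: the function names are hypothetical, the 0.5 default threshold is an arbitrary illustrative value, and the lookup assumes that probability sits inside the metadata object, falling back to the top level of the item in case it is laid out differently.

```python
from typing import Any, Dict, Iterable, List, Optional


def article_probability(item: Dict[str, Any]) -> Optional[float]:
    """Read the probability, or return None when it is absent."""
    # Assumption: probability is nested under "metadata"; the top-level
    # lookup is a fallback, not something the schema text confirms.
    metadata = item.get("metadata") or {}
    value = metadata.get("probability", item.get("probability"))
    return float(value) if value is not None else None


def likely_articles(
    items: Iterable[Dict[str, Any]], threshold: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep items whose probability is known and at least `threshold`."""
    kept = []
    for item in items:
        probability = article_probability(item)
        if probability is not None and probability >= threshold:
            kept.append(item)
    return kept
```

Items without a probability are dropped, not kept, so that unknown pages do not pass the filter by default.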
