Unified Schema
The Unified Schema project aims to provide a standard definition for the different types of data (products, articles, reviews, jobs, etc.) extracted from websites.
Note: All fields in AutoExtract have exactly the same definition in the Unified Schema. We aim to maintain backward compatibility when adding new fields. We also try our best to adhere to schema.org, diverging only when there is a reasonable benefit in doing so.
Article
Response Schema: application/json
url required | string <uri> (URL) ^http[s]{0,1}\:.* Article page URL |
articleBody | string Text of the article, including sub-headings, with newline separators |
articleBodyHtml | string Simplified HTML of the article, including sub-headings, image captions and embedded content (videos, tweets, etc.) |
articleBodyRaw | string HTML of the article body as seen in the source page |
audioUrls | Array of strings <uri> (URL) A list of URLs of all audio inside the article body |
authors | Array of objects Author of the article |
breadcrumbs | Array of objects or objects Article breadcrumbs |
canonicalUrl | string <uri> (URL) ^http[s]{0,1}\:.* Canonical URL of the article page |
dateModified | string (date or datetime format) The date when the article was most recently modified |
dateModifiedRaw | string The most recent modification date of the article, before parsing |
datePublished | string (date or datetime format) Publication date in ISO format |
datePublishedRaw | string Publication date before parsing, as it appears on the website |
description | string A short summary of the article, human-provided if available, or auto-generated |
headline | string Article headline or title |
images | Array of strings <uri> (URL) URLs of the images in the article |
mainImage | string <uri> (URL) ^http[s]{0,1}\:.* A URL or data URL value of the main image of the article |
videoUrls | Array of strings <uri> (URL) A list of URLs of all videos inside the article body |
Response samples
- 200
{
  "headline": "string",
  "datePublished": null,
  "datePublishedRaw": "string",
  "dateModified": null,
  "dateModifiedRaw": "string",
  "authors": [
    {
      "name": "string"
    }
  ],
  "breadcrumbs": [
    { }
  ],
  "description": "string",
  "articleBody": "string",
  "articleBodyHtml": "string",
  "articleBodyRaw": "string"
}
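As a rough illustration of how a consumer might read an Article record, the sketch below checks the one required field (url) and parses the ISO-format datePublished value. The helper names parse_iso and summarize_article are hypothetical and not part of AutoExtract, and the exact date formats the service emits are an assumption here, so unparseable values are simply mapped to None.

```python
from datetime import datetime
from typing import Optional


def parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO date or datetime string; return None for null or unparseable input."""
    if not value:
        return None
    try:
        # Older Python versions reject a trailing "Z", so normalise it to an offset.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def summarize_article(article: dict) -> dict:
    """Pick a few Article fields, tolerating missing optional ones (hypothetical helper)."""
    if "url" not in article:
        raise ValueError("'url' is the only required Article field")
    return {
        "url": article["url"],
        "headline": article.get("headline"),
        "published": parse_iso(article.get("datePublished")),
        # Each author is an object; only the "name" key is shown in the sample.
        "author_names": [a["name"] for a in article.get("authors") or [] if "name" in a],
    }
```

Because every field except url is optional, reading with .get() keeps the consumer working when a page yields only a partial record.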
Comment
Response Schema: application/json
pageUrl required | string <uri> (URL) ^http[s]{0,1}\:.* URL of the page from which the comment is extracted (in case the page URL and the comment URL differ) |
author | object Author of the comment |
dateModified | string (date or datetime format) The date when the comment was most recently modified |
dateModifiedRaw | string The most recent modification date of the comment, before parsing |
datePublished | string (date or datetime format) Publication date in ISO format |
datePublishedRaw | string Publication date before parsing, as it appears on the website |
downvoteCount | number The number of downvotes this comment received |
edited | boolean Whether comment was edited |
identifier | string Identifier of the comment; this is the value that “parentIdentifier” refers to |
locationCreated | object (Postal Address) The location where the comment was created |
parentIdentifier | string The parent of the comment |
replyCount | number The number of answers this comment has received. |
text | string Text (body) of the comment |
textHtml | string Cleaned up HTML of the comment body |
textRaw | string HTML of the comment body |
upvoteCount | number The number of upvotes this comment received |
url | string <uri> (URL) ^http[s]{0,1}\:.* URL of the comment |
Response samples
- 200
{
  "author": {
    "name": "string"
  },
  "dateModified": null,
  "dateModifiedRaw": "string",
  "datePublished": null,
  "datePublishedRaw": "string",
  "downvoteCount": 0,
  "edited": true,
  "identifier": "string",
  "locationCreated": {
    "postalCode": "94043",
    "streetAddress": "1600 Amphitheatre Pkwy",
    "addressCountry": [
      "United States of America",
      "US"
    ],
    "addressLocality": "Mountain View",
    "addressRegion": "California",
    "raw": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, United States of America"
  },
  "parentIdentifier": "string",
  "replyCount": 0,
  "text": "string",
  "textHtml": "string",
  "textRaw": "string",
  "upvoteCount": 0
}
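The identifier and parentIdentifier fields make it possible to rebuild reply threads from a flat list of Comment records. The following is a minimal sketch of one way to do that; build_threads is a hypothetical helper, and treating a comment whose parent is missing from the list as a top-level comment is an assumption rather than documented behaviour.

```python
from collections import defaultdict


def build_threads(comments: list) -> list:
    """Nest Comment records under their parents using identifier / parentIdentifier."""
    known_ids = {c["identifier"] for c in comments if c.get("identifier")}
    children = defaultdict(list)
    roots = []
    for comment in comments:
        parent = comment.get("parentIdentifier")
        if parent and parent in known_ids:
            children[parent].append(comment)
        else:
            # No parent, or a parent that was not extracted: treat as top level.
            roots.append(comment)

    def attach(comment: dict) -> dict:
        replies = children.get(comment.get("identifier"), [])
        return {"comment": comment, "replies": [attach(r) for r in replies]}

    return [attach(c) for c in roots]
```

Each node in the result keeps the original record under "comment", so fields such as text, upvoteCount and replyCount remain available while walking the tree.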
Job Posting
Response Schema: application/json
url required | string <uri> (URL) ^http[s]{0,1}\:.* Job Posting page URL |
baseSalary | object (Monetary Amount) The base salary of the job or of an employee |
datePosted | string (date or datetime format) Publication date for the job posting |
datePostedRaw | string Publication date for the job posting, before parsing |
description | string A description of the job posting |
employmentType | string Type of employment (e.g. full-time, part-time, contract, temporary, seasonal, internship) |
hiringOrganization | object (Organization) Organization offering the job position |
jobLocation | object (Postal Address) A (typically single) geographic location associated with the job position |
title | string The title of the job |
validThrough | string (date or datetime format) The date after which the job posting is no longer valid |
validThroughRaw | string The date after which the job posting is no longer valid, before parsing |
Response samples
- 200
{
  "jobLocation": {
    "postalCode": "94043",
    "streetAddress": "1600 Amphitheatre Pkwy",
    "addressCountry": [
      "United States of America",
      "US"
    ],
    "addressLocality": "Mountain View",
    "addressRegion": "California",
    "raw": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, United States of America"
  },
  "description": "string",
  "title": "string",
  "datePosted": null,
  "datePostedRaw": "string",
  "validThrough": null,
  "validThroughRaw": "string",
  "employmentType": "string",
  "hiringOrganization": {
    "name": "string",
    "location": {
      "postalCode": "94043",
      "streetAddress": "1600 Amphitheatre Pkwy",
      "addressCountry": [
        "United States of America",
        "US"
      ],
      "addressLocality": "Mountain View",
      "addressRegion": "California",
      "raw": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, United States of America"
    },
    "telephone": "string",
    "email": "string",
    "raw": "string"
  },
  "baseSalary": {
    "value": 0,
    "minValue": 0,
    "maxValue": 0,
    "currency": "string",
    "raw": "string"
  }
}
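The baseSalary object in the sample carries value, minValue, maxValue, currency and raw. The sketch below shows one possible way to turn it into a display string; format_salary is a hypothetical helper, and the precedence it uses (range first, then single value, then the raw text) is an assumption, since the schema does not state how these keys relate to each other.

```python
from typing import Optional


def format_salary(base_salary: Optional[dict]) -> Optional[str]:
    """Render a baseSalary object as text, falling back to the raw string."""
    if not base_salary:
        return None
    currency = base_salary.get("currency") or ""
    low = base_salary.get("minValue")
    high = base_salary.get("maxValue")
    value = base_salary.get("value")
    if low is not None and high is not None:
        amount = f"{low}-{high}"
    elif value is not None:
        amount = f"{value}"
    else:
        # Nothing numeric was extracted; use the text as it appeared on the page.
        return base_salary.get("raw")
    return f"{amount} {currency}".strip()
```

For example, format_salary({"minValue": 50000, "maxValue": 70000, "currency": "USD"}) yields "50000-70000 USD".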
Product
Response Schema: application/json
url required | string <uri> (URL) ^http[s]{0,1}\:.* The URL of the product |
additionalProperty | Array of objects (A generic name:value field) This name-value pair field holds information pertaining to product specific features that have no matching property in the Product schema. |
aggregateRating | object or object or object The overall rating, based on a collection of reviews or ratings |
brand | string The brand associated with the product
No brand is returned |
breadcrumbs | Array of objects or objects A list of breadcrumbs with optional name and URL. |
color | string The color of the product |
depth | object or object or object (A generic quantitative value) The depth of the product |
description | string A description of the product |
gtin | Array of objects Standardized GTIN product identifier which is unique for a product across different sellers. |
height | object or object or object (A generic quantitative value) The height of the product |
images | Array of strings <uri> (URL) A list of URL or data URL values of all images of the product (may include the main image). |
madeIn | string The city or country where the product has been manufactured. The website should explicitly carry wording to disambiguate this from product location |
mainImage | string <uri> (URL) ^http[s]{0,1}\:.* A URL or data URL value of the main image of the product. |
manufacturer | string The manufacturer company of the product. The difference between brand and manufacturer is difficult to establish, so this field should only be included when the manufacturer appears explicitly on the website; otherwise, the brand field is preferred over manufacturer |
mpn | string The Manufacturer Part Number (MPN) of the product. The product would have the same MPN across different e-commerce websites. |
name | string The name of the product |
nutrition | Array of objects Nutritional information about the product |
offers | Array of objects (Offer) This field contains rich information pertaining to all the buying options offered on a product. |
productionDate | string (date or datetime format) The date of production of the item |
productionDateRaw | string The date of production of the item as it appears on the website |
rankings | Array of objects (Ranking) Position of the product across different ranks |
ratingHistogram | Array of objects Distribution of ratings across the entire rating scale |
relatedProducts | Array of objects This field captures all products that are recommended by the website while browsing the product of interest.
Related products can thus be used to gauge customer buying behaviour, sponsored products as well as best sellers in the same category.
The |
releaseDate | string (date or datetime format) Date on which the product was released or listed on the website in ISO 8601 date format |
releaseDateRaw | string Date on which the product was released or listed on the website |
reviews | Array of objects (Review) Product Reviews |
size | string Denotes the size of the product. Pertinent to products such as garments, shoes, accessories etc |
sku | string The Stock Keeping Unit (SKU) i.e. a merchant-specific identifier for the product |
variants | Array of objects This field returns a list of variants of the product. Each variant has the same schema as the Product schema defined in this table. |
volume | object or object or object (A generic quantitative value) The volume of the product |
weight | object or object or object (A generic quantitative value) The weight of the product |
width | object or object or object (A generic quantitative value) The width of the product |
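Two of the Product fields are lists that usually need a little post-processing: additionalProperty (generic name-value pairs) and offers (buying options). The sketch below is one possible way to handle them; properties_as_dict and cheapest_offer are hypothetical helpers. It assumes that each offer's price is a numeric string and that all offers share one currency, neither of which the schema guarantees.

```python
from typing import Optional


def properties_as_dict(product: dict) -> dict:
    """Collapse additionalProperty name-value pairs into a plain dict."""
    return {
        prop["name"]: prop["value"]
        for prop in product.get("additionalProperty") or []
        if "name" in prop and "value" in prop
    }


def cheapest_offer(product: dict) -> Optional[dict]:
    """Return the offer with the lowest parseable price, or None if there is none."""
    priced = []
    for offer in product.get("offers") or []:
        try:
            priced.append((float(offer["price"]), offer))
        except (KeyError, TypeError, ValueError):
            # Skip offers without a price or with a non-numeric price string.
            continue
    if not priced:
        return None
    return min(priced, key=lambda pair: pair[0])[1]
```

If a later name repeats an earlier one in additionalProperty, the dict keeps the last value; a consumer that needs every pair should iterate over the list directly.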
Response samples
- 200
{
  "aggregateRating": {
    "ratingValue": 4,
    "bestRating": 5,
    "reviewCount": 24
  },
  "ratingHistogram": [
    {
      "ratingValue": "5",
      "ratingPercentage": 61
    },
    {
      "ratingValue": "4",
      "ratingPercentage": 12
    },
    {
      "ratingValue": "3",
      "ratingPercentage": 6
    },
    {
      "ratingValue": "2",
      "ratingPercentage": 5
    },
    {
      "ratingValue": "1",
      "ratingPercentage": 16
    }
  ],
  "brand": "Samsung",
  "breadcrumbs": [],
  "description": "string",
  "madeIn": "Vietnam",
  "name": "string",
  "offers": [
    {
      "availability": "InStock",
      "currency": "$",
      "itemCondition": {
        "description": "Used - Very Good",
        "type": "used"
      },
      "price": "129.99",
      "seller": {
        "aggregateRating": {
          "bestRating": 5,
          "reviewCount": 479
        },
        "identifier": "A8K32FFKI51FKN",
        "name": "Merch Store",
        "shippingInfo": {
          "description": "Arrives between September 3-18.",
          "maxDays": "30",
          "minDays": "15"
        }
      }
    }
  ],
  "additionalProperty": [
    {
      "name": "batteries",
      "value": "1 Lithium ion batteries required. (included)"
    },
    {
      "name": "Item model number",
      "value": "SM-A105G/DS"
    }
  ],
  "sku": "A123DK9823",
  "mpn": "string",
  "gtin": [
    {
      "type": "isbn13",
      "value": "9781933624341"
    }
  ],
  "manufacturer": "string",
  "productionDate": null,
  "productionDateRaw": "string",
  "size": "XL",
  "height": { },
  "width": { },
  "depth": { },
  "weight": { },
  "volume": { },
  "url"