Product List Extraction#
Product list extraction supports pages that contain multiple products, for example a list or grid of products in a specific category on an e-commerce website. Usually each extracted product contains a link to a detail page, which can be sent to Product Extraction to get more detailed information.
The product list page type is especially useful for implementing crawling, where extracted product URLs are sent to the product API and extracted pagination URLs are sent to the product list API. Product list extraction can also be used to get basic information about products on a website with fewer requests, because product attributes are extracted directly from the product list page, without making individual product requests.
This supports use-cases such as price monitoring, product intelligence, product analytics and many others.
The related page type is Product Extraction, which supports single product pages.
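The crawling pattern described above can be sketched as a small routing function. This is a minimal illustration, not part of the API: the helper name build_followup_queries is hypothetical, and the field names (products, url, paginationNext) are the ones documented under Available fields.

```python
def build_followup_queries(product_list):
    """Turn one extracted productList into the next batch of queries.

    Hypothetical helper: product URLs are routed to single-product
    extraction, the next-page URL back to product list extraction.
    """
    queries = []
    for product in product_list.get('products', []):
        url = product.get('url')  # optional field, may be absent
        if url:
            queries.append({'url': url, 'pageType': 'product'})
    next_page = product_list.get('paginationNext')
    if next_page:
        # url is a required field inside paginationNext
        queries.append({'url': next_page['url'], 'pageType': 'productList'})
    return queries


sample = {
    'url': 'http://example.com/list?p=1',
    'products': [
        {'name': 'A', 'url': 'http://example.com/a', 'probability': 0.9},
        {'name': 'B', 'probability': 0.8},  # no product URL was extracted
    ],
    'paginationNext': {'url': 'http://example.com/list?p=2'},
}
followups = build_followup_queries(sample)
print(followups)
```

The returned list has the same shape as the query passed to the API, so it could be sent as the next request batch.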
Request example#
If you requested a product list extraction, and the extraction succeeds, then the productList field will be available in the query result:
from autoextract.sync import request_raw

query = [{
    'url': 'http://books.toscrape.com/',
    'pageType': 'productList'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['productList'])
Pagination fields, when available, allow fetching subsequent pages, for example:
# example continued from above
if results[0]['productList'].get('paginationNext'):
    query = [{
        'url': results[0]['productList']['paginationNext']['url'],
        'pageType': 'productList'
    }]
    results_next = request_raw(query, api_key='[api key]')
    print(results_next[0]['productList'])
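Following the next-page link repeatedly can be written as a loop. The sketch below is an illustration under stated assumptions: iter_product_list_pages, the fetch parameter and the max_pages limit are hypothetical names, and a canned fake_fetch stands in for the API so that the example runs without network access.

```python
def iter_product_list_pages(start_url, fetch, max_pages=10):
    """Yield productList dictionaries, following paginationNext links.

    fetch is any callable that takes a query list and returns a result
    list in the same shape as the request example above.
    """
    url = start_url
    seen = set()  # guards against pagination loops
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        results = fetch([{'url': url, 'pageType': 'productList'}])
        product_list = results[0].get('productList')
        if not product_list:
            break  # no productList in the result, stop crawling
        yield product_list
        next_page = product_list.get('paginationNext') or {}
        url = next_page.get('url')


# Canned responses used instead of real API calls.
PAGES = {
    'http://example.com/p1': {
        'url': 'http://example.com/p1',
        'products': [],
        'paginationNext': {'url': 'http://example.com/p2'},
    },
    'http://example.com/p2': {'url': 'http://example.com/p2', 'products': []},
}


def fake_fetch(query):
    return [{'productList': PAGES[q['url']]} for q in query]


visited = [page['url']
           for page in iter_product_list_pages('http://example.com/p1', fake_fetch)]
print(visited)
```

With the client from the request example, fetch could be a thin wrapper such as lambda query: request_raw(query, api_key='[api key]').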
Available fields#
Top-level#
The following fields are available for productList:
url: string
    URL of the page where products were extracted.
products: list of dictionaries
    List of products. Individual fields are described below (see Individual products).
breadcrumbs: list of dictionaries with name and link optional string fields
    A list of breadcrumbs (a specific navigation element) with optional name and URL. Example:
    [ {"name": "Foo", "link": "http://example.com/foo"}, {"name": "Bar", "link": "http://example.com/foo/bar"} ]
paginationNext: dictionary
    If pagination on a product list page is present, then the paginationNext dictionary contains information about the link to the next page. Fields:
    - url is the URL of the next page. It is a required field.
    - text is the text corresponding to the link as it appears on the site. Optional.
    Example:
    {"url": "http://example.com/foo?p=3", "text": "3"}
paginationPrevious: dictionary
    If pagination on a product list page is present, then the paginationPrevious dictionary contains information about the link to the previous page. Fields:
    - url is the URL of the previous page. It is a required field.
    - text is the text corresponding to the link as it appears on the site. Optional.
    Example:
    {"url": "http://example.com/foo?p=1", "text": "Prev"}
The url field is required.
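Because only url is required at the top level, code that reads the other fields has to tolerate their absence. A minimal sketch, assuming a hypothetical helper named summarize_page and reusing the breadcrumb and pagination examples from above:

```python
def summarize_page(product_list):
    """Collect the top-level fields into a flat summary dictionary."""
    crumbs = product_list.get('breadcrumbs', [])
    # name is optional in each breadcrumb, so skip entries without it
    names = [crumb['name'] for crumb in crumbs if crumb.get('name')]
    next_page = product_list.get('paginationNext') or {}
    previous_page = product_list.get('paginationPrevious') or {}
    return {
        'url': product_list['url'],  # the only required field
        'category': ' > '.join(names),
        'product_count': len(product_list.get('products', [])),
        'next_url': next_page.get('url'),
        'previous_url': previous_page.get('url'),
    }


sample = {
    'url': 'http://example.com/foo/bar?p=2',
    'breadcrumbs': [
        {'name': 'Foo', 'link': 'http://example.com/foo'},
        {'name': 'Bar', 'link': 'http://example.com/foo/bar'},
        {'link': 'http://example.com/unnamed'},  # breadcrumb without a name
    ],
    'paginationNext': {'url': 'http://example.com/foo?p=3', 'text': '3'},
}
summary = summarize_page(sample)
print(summary)
```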
Individual products#
Each product inside the products field has the following fields:
name: string
    The name of the product.
offers: list of dictionaries
    Product offers. Each offer may contain price, currency, regularPrice and availability string fields. All fields are optional, but currency is present only if price is also present.
    - price is a string with a valid number (a dot is used as the decimal separator). It is the price a customer has to pay after discounts or special offers.
    - currency is the currency as given on the website, without extra normalization (for example, both “$” and “USD” are possible currencies). It is present only if price is also present.
    - regularPrice is the price before any discount or special offer. It is present only when price is different from regularPrice.
    - availability is the product availability, as a string. Allowed values:
        "InStock" - includes limited availability, presale, preorder, and in-store only.
        "OutOfStock" - includes discontinued and sold out.
    Example:
    [ { "price": "42", "regularPrice": "45.00", "currency": "USD", "availability": "InStock" } ]
sku: string
    Stock Keeping Unit identifier for the product assigned by the seller.
brand: string
    Brand or manufacturer of the product.
mainImage: string
    A URL or data URL value of the main image of the product.
images: list of strings
    A list of URL or data URL values of all images of the product (may include the main image).
description: string
    Description of the product.
descriptionHtml: string
    Simplified HTML of the description, including sub-headings, image captions and embedded content.
aggregateRating: dictionary
    Aggregate information about the product rating and reviews.
    - ratingValue is the average rating value, as a float.
    - bestRating is the best possible rating value, as a float.
    - reviewCount is the number of reviews or ratings for the product, as an int.
    Example - 4.5 out of 5, based on 12 reviews:
    { "ratingValue": 4.5, "bestRating": 5, "reviewCount": 12 }
    All fields are optional, but one of reviewCount or ratingValue must be present.
probability: float
    Probability that the extracted item is a single product listing.
url: string
    URL of the main product page for this product listing.
    To get full information about the product you might make an AutoExtract request to this URL with pageType “product” (see Product Extraction).
All fields are optional, except for probability.
Fields without a valid value (null or empty array) are excluded from extraction results.
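Since prices arrive as strings and most fields may be missing, a consumer typically converts and guards before doing arithmetic. The sketch below is one possible approach, not a prescribed one: parse_offer and confident_products are hypothetical helpers, the 0.5 threshold is an arbitrary illustrative value, and a missing regularPrice is treated as equal to price, following the rule that regularPrice is present only when it differs.

```python
from decimal import Decimal


def parse_offer(offer):
    """Convert the string prices of one offer into Decimal values."""
    price = Decimal(offer['price']) if 'price' in offer else None
    # regularPrice is present only when it differs from price
    regular = Decimal(offer['regularPrice']) if 'regularPrice' in offer else price
    discount_percent = None
    if price is not None and regular:  # also skips a zero regular price
        discount_percent = ((regular - price) / regular * 100).quantize(Decimal('0.1'))
    return {
        'price': price,
        'regular_price': regular,
        'discount_percent': discount_percent,
        'currency': offer.get('currency'),  # not normalized: "$" or "USD"
        'availability': offer.get('availability'),
    }


def confident_products(product_list, threshold=0.5):
    """Keep products whose probability (a required field) meets the threshold."""
    return [product for product in product_list.get('products', [])
            if product['probability'] >= threshold]


offer = parse_offer({'price': '42', 'regularPrice': '45.00',
                     'currency': 'USD', 'availability': 'InStock'})
print(offer['price'], offer['regular_price'], offer['discount_percent'])

kept = confident_products({'url': 'http://example.com/list', 'products': [
    {'name': 'Likely product', 'probability': 0.95},
    {'name': 'Probably a banner', 'probability': 0.1},
]})
print([product['name'] for product in kept])
```

Decimal is used instead of float so that prices such as "45.00" keep their exact decimal value.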
Response example#
Below is an example response with all product list fields present:
[
    {
        "productList": {
            "url": "http://example.com/product-list-page-3",
            "breadcrumbs": [
                {
                    "name": "Home",
                    "link": "http://example.com"
                }
            ],
            "paginationNext": {
                "text": "Next Page",
                "url": "http://example.com/product-list-page-4"
            },
            "paginationPrevious": {
                "text": "Previous Page",
                "url": "http://example.com/product-list-page-2"
            },
            "products": [
                {
                    "name": "Product 1",
                    "url": "http://example.com/product1",
                    "offers": [
                        {
                            "price": "42",
                            "currency": "USD",
                            "availability": "InStock",
                            "regularPrice": "60"
                        }
                    ],
                    "sku": "product sku",
                    "brand": "product1 brand",
                    "mainImage": "http://example.com/image.png",
                    "images": [
                        "http://example.com/image.png"
                    ],
                    "description": "product1 description",
                    "descriptionHtml": "<article>HTML description for Product1 ...",
                    "aggregateRating": {
                        "ratingValue": 4.5,
                        "bestRating": 5.0,
                        "reviewCount": 31
                    },
                    "probability": 0.95
                },
                {
                    "name": "Product 2",
                    "url": "http://example.com/product2",
                    "offers": [
                        {
                            "price": "72",
                            "currency": "USD",
                            "availability": "OutOfStock"
                        }
                    ],
                    "sku": "product2 sku",
                    "brand": "product2 brand",
                    "mainImage": "http://example.com/image2.png",
                    "images": [
                        "http://example.com/image2.png"
                    ],
                    "description": "product2 description",
                    "descriptionHtml": "<article>HTML description for Product2 ...",
                    "aggregateRating": {
                        "ratingValue": 1.5,
                        "bestRating": 5.0,
                        "reviewCount": 85
                    },
                    "probability": 0.90
                }
            ]
        },
        "webPage": {
            "inLanguages": [
                {"code": "en"},
                {"code": "es"}
            ]
        },
        "query": {
            "id": "1564747029122-9e02a1868d70b7a2",
            "domain": "example.com",
            "userQuery": {
                "pageType": "productList",
                "url": "https://example.com/product-list-page-3"
            }
        },
        "algorithmVersion": "20.8.1"
    }
]
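For use cases such as price monitoring, a response like the one above is often flattened into one row per product. The following is a rough sketch under stated assumptions: product_rows, to_csv and the FIELDS column list are hypothetical, and only the first offer of each product is used, which is a simplification.

```python
import csv
import io

FIELDS = ['name', 'url', 'price', 'currency', 'availability']


def product_rows(results):
    """Yield one flat dictionary per product found in the results."""
    for result in results:
        product_list = result.get('productList') or {}
        for product in product_list.get('products', []):
            # Simplification: only the first offer is considered.
            offers = product.get('offers') or [{}]
            offer = offers[0]
            yield {
                'name': product.get('name'),
                'url': product.get('url'),
                'price': offer.get('price'),
                'currency': offer.get('currency'),
                'availability': offer.get('availability'),
            }


def to_csv(results):
    """Render the product rows as CSV text; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    writer.writerows(product_rows(results))
    return buffer.getvalue()


results = [{'productList': {
    'url': 'http://example.com/product-list-page-3',
    'products': [
        {'name': 'Product 1', 'url': 'http://example.com/product1',
         'offers': [{'price': '42', 'currency': 'USD',
                     'availability': 'InStock', 'regularPrice': '60'}],
         'probability': 0.95},
        {'name': 'Product 2', 'probability': 0.90},  # optional fields excluded
    ],
}}]
csv_text = to_csv(results)
print(csv_text)
```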