Review Extraction#
Review extraction supports pages that contain multiple reviews of a product or service, both on specialized review sites and on ecommerce sites. Extracted fields include top-level pagination links and, for each individual review, the review body, publication date and rating.
Sending pagination URLs to the review extraction API makes it possible to retrieve all reviews when they span multiple pages.
A related page type is Product Extraction, which supports single product pages.
Request example#
If you request review extraction and the extraction succeeds, the reviews field will be available in the query result:
from autoextract.sync import request_raw

query = [{
    'url': 'https://example.com/reviews',
    'pageType': 'reviews'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['reviews'])
Pagination fields, when available, allow fetching subsequent pages, for example:
# example continued from above
if results[0]['reviews'].get('paginationNext'):
    query = [{
        'url': results[0]['reviews']['paginationNext']['url'],
        'pageType': 'reviews'
    }]
    results_next = request_raw(query, api_key='[api key]')
    print(results_next[0]['reviews'])
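To collect reviews from every page, the pagination step above can be repeated until paginationNext is absent. The following is a minimal sketch under stated assumptions, not part of the AutoExtract client: collect_all_reviews, fetch_page and max_pages are hypothetical names introduced here, and fetch_page is assumed to wrap request_raw as in the example above. A dictionary of fake pages stands in for the API so that the loop can be run without network access.

```python
def collect_all_reviews(start_url, fetch_page, max_pages=50):
    """Follow paginationNext links and return the reviews of all visited pages.

    fetch_page (hypothetical) takes a URL and returns the 'reviews'
    dictionary of the extraction result, or None if nothing was extracted.
    """
    all_reviews = []
    seen_urls = set()
    url = start_url
    # Stop on a missing link, a repeated URL, or the page limit.
    while url and url not in seen_urls and len(seen_urls) < max_pages:
        seen_urls.add(url)
        page = fetch_page(url)
        if not page:
            break
        # The list of reviews may be missing if none were extracted.
        all_reviews.extend(page.get('reviews', []))
        # paginationNext is missing when there is no next page.
        url = (page.get('paginationNext') or {}).get('url')
    return all_reviews


# Stand-in for the real API call, used only to demonstrate the loop.
# With the client from the example above, fetch_page could instead be:
#   lambda url: request_raw([{'url': url, 'pageType': 'reviews'}],
#                           api_key='[api key]')[0].get('reviews')
fake_pages = {
    'https://example.com/reviews': {
        'reviews': [{'name': 'First', 'probability': 0.9}],
        'paginationNext': {'url': 'https://example.com/reviews?p=2'},
    },
    'https://example.com/reviews?p=2': {
        'reviews': [{'name': 'Second', 'probability': 0.8}],
    },
}
reviews = collect_all_reviews('https://example.com/reviews', fake_pages.get)
print([r['name'] for r in reviews])
```

The set of visited URLs and the max_pages limit are defensive choices for sites whose pagination links loop back on themselves.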
Available fields#
Top-level#
The following fields are available for reviews:
url
: string, required. URL of the page from which the reviews were extracted.
reviews
: list of dictionaries. List of reviews. Individual fields are described below.
paginationNext
: dictionary. If pagination is present on a reviews page, the paginationNext dictionary contains information about the link to the next page. Fields:
url is the URL of the next page. It is a required field.
text is the text corresponding to the link as it appears on the site. Optional.
Example:
{"url": "http://example.com/foo?p=3", "text": "3"}
paginationPrevious
: dictionary. If pagination is present on a reviews page, the paginationPrevious dictionary contains information about the link to the previous page. Fields:
url is the URL of the previous page. It is a required field.
text is the text corresponding to the link as it appears on the site. Optional.
Example:
{"url": "http://example.com/foo?p=1", "text": "Prev"}
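Since only url is required at the top level, code that reads the pagination dictionaries has to tolerate their absence. The sketch below shows one way to do that; summarize_page and the sample first_page dictionary are hypothetical and exist only for illustration.

```python
def summarize_page(reviews_page):
    """Return the top-level fields of a 'reviews' result with safe defaults."""
    next_link = reviews_page.get('paginationNext') or {}
    previous_link = reviews_page.get('paginationPrevious') or {}
    return {
        'url': reviews_page['url'],  # required, so it is indexed directly
        'review_count': len(reviews_page.get('reviews', [])),
        'next_url': next_link.get('url'),
        'next_text': next_link.get('text'),  # link text is optional
        'previous_url': previous_link.get('url'),
    }


# Made-up first page: it has a next link but no previous link.
first_page = {
    'url': 'http://example.com/foo?p=1',
    'reviews': [{'probability': 0.9}],
    'paginationNext': {'url': 'http://example.com/foo?p=2', 'text': '2'},
}
summary = summarize_page(first_page)
print(summary['next_url'], summary['previous_url'])
```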
Individual reviews#
Each review inside the reviews field has the following fields available:
name
: String. Title, header or name of the review.
reviewBody
: String. Text of the review, with newline separators.
reviewRating
: Dictionary. Information about the rating of the review. Fields:
ratingValue is a number representing the rating given by the reviewer.
bestRating is the best possible rating, if known.
For example, "3 out of 5" would look like this:
{"ratingValue": 3, "bestRating": 5}
datePublished
: String. Publication date. ISO-formatted with a "T" separator; may contain a timezone.
datePublishedRaw
: String. Same date as datePublished, but before parsing/normalisation, i.e. as it appears on the website.
votedHelpful
: Integer. Number of votes that consider this review helpful (upvotes).
votedUnhelpful
: Integer. Number of votes that consider this review unhelpful (downvotes).
isVerified
: Boolean. Whether the reviewer has been verified as a genuine person or a real owner / buyer of the reviewed item.
probability
: Float. Probability that this is a review.
All fields are optional, except for probability.
Fields without a valid value (null or empty array) are excluded from extraction results.
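Because every field except probability may be missing, consumer code usually needs defaults. The following sketch, assuming only the field names documented above, turns one review into a flat record: it computes a relative score from ratingValue and bestRating when both are known, and parses datePublished with the standard library. The function name normalize_review and the output keys are hypothetical.

```python
from datetime import datetime


def normalize_review(review):
    """Flatten one extracted review, tolerating missing optional fields."""
    rating = review.get('reviewRating') or {}
    value = rating.get('ratingValue')
    best = rating.get('bestRating')
    # A relative score needs both numbers; bestRating is only set "if known".
    score = value / best if value is not None and best else None

    published = review.get('datePublished')
    return {
        'title': review.get('name'),
        'body': review.get('reviewBody', ''),
        'score': score,
        # fromisoformat accepts the "T" separator and "+HH:MM" offsets;
        # a trailing "Z" is only accepted from Python 3.11 onwards.
        'published': datetime.fromisoformat(published) if published else None,
        'helpful_votes': review.get('votedHelpful'),  # None means unknown
        'probability': review['probability'],  # the only required field
    }


full = normalize_review({
    'name': 'A great tool!',
    'reviewRating': {'ratingValue': 3, 'bestRating': 5},
    'datePublished': '2020-01-30T00:00:00',
    'probability': 0.95,
})
minimal = normalize_review({'probability': 0.7})
print(full['score'], full['published'].year, minimal['score'])
```

Missing vote counts are left as None instead of 0, since an absent field means the value was not extracted, not that nobody voted.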
Response example#
Below is an example response with all review fields present:
[
  {
    "reviews": {
      "url": "https://example.com/review-3",
      "paginationNext": {
        "text": "Next Page",
        "url": "http://example.com/review-4"
      },
      "paginationPrevious": {
        "text": "Previous Page",
        "url": "http://example.com/review-2"
      },
      "reviews": [
        {
          "name": "A great tool!",
          "reviewBody": "AutoExtract is a great tool for review extraction",
          "reviewRating": {
            "ratingValue": 5.0,
            "bestRating": 5.0
          },
          "datePublished": "2020-01-30T00:00:00",
          "datePublishedRaw": "Jan 30, 2020",
          "votedHelpful": 12,
          "votedUnhelpful": 1,
          "isVerified": true,
          "probability": 0.95
        },
        {
          "name": "Another review",
          "probability": 0.95
        }
      ]
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "example.com",
      "userQuery": {
        "pageType": "reviews",
        "url": "https://example.com/review-3"
      }
    },
    "algorithmVersion": "20.8.1"
  }
]
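A response shaped like the one above can be post-processed with plain dictionary access. As an illustration, the sketch below keeps only the reviews whose probability reaches a threshold. The function confident_reviews and the default threshold of 0.9 are assumptions made for this example, not recommendations from the API, and the sample response is a trimmed, made-up variant of the one above.

```python
def confident_reviews(results, threshold=0.9):
    """Return reviews from all results whose probability is >= threshold."""
    kept = []
    for result in results:
        # The 'reviews' key may be missing if extraction did not succeed.
        page = result.get('reviews') or {}
        for review in page.get('reviews', []):
            if review['probability'] >= threshold:
                kept.append(review)
    return kept


# Trimmed, made-up response with one confident and one doubtful review.
response = [
    {
        'reviews': {
            'url': 'https://example.com/review-3',
            'reviews': [
                {'name': 'A great tool!', 'probability': 0.95},
                {'name': 'Possibly not a review', 'probability': 0.4},
            ],
        },
        'query': {'userQuery': {'pageType': 'reviews',
                                'url': 'https://example.com/review-3'}},
    }
]
kept = confident_reviews(response)
print([review['name'] for review in kept])
```

A suitable threshold depends on how costly false positives are for the application, so it is best tuned against a sample of real pages.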