Review Extraction (beta)

Review extraction supports pages that contain multiple reviews of a product or service, on both specialized review sites and e-commerce sites. Extracted fields include top-level pagination links and, for each individual review, the review body, publication date and rating.

Sending the pagination URLs back to the review extraction API makes it possible to retrieve all reviews when they span multiple pages.

A related page type is Product Extraction, which supports single product pages.

Request example

If you request review extraction and the extraction succeeds, the reviews field will be available in the query result:

from autoextract.sync import request_raw

query = [{
    'url': 'https://example.com/reviews',
    'pageType': 'reviews'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['reviews'])

Pagination fields, when available, allow fetching subsequent pages, for example:

# example continued from above
if results[0]['reviews'].get('paginationNext'):
    query = [{
        'url': results[0]['reviews']['paginationNext']['url'],
        'pageType': 'reviews'
    }]
    results_next = request_raw(query, api_key='[api key]')
    print(results_next[0]['reviews'])

Available fields

Top-level

The following fields are available for reviews:

url: string, required

URL of the page from which the reviews were extracted.

reviews: list of dictionaries

List of reviews. Individual fields are described below.

paginationNext: dictionary

If pagination is present on a reviews page, the paginationNext dictionary contains information about the link to the next page. Fields:

  • url is the URL of the next page. It is a required field.

  • text is the text of the link as it appears on the site. Optional.

Example:

{"url": "http://example.com/foo?p=3", "text": "3"}

paginationPrevious: dictionary

If pagination is present on a reviews page, the paginationPrevious dictionary contains information about the link to the previous page. Fields:

  • url is the URL of the previous page. It is a required field.

  • text is the text of the link as it appears on the site. Optional.

Example:

{"url": "http://example.com/foo?p=1", "text": "Prev"}

Individual reviews

Each review inside the reviews field has the following fields available:

name: String

Title, header or name of the review.

reviewBody: String

Text of the review, with newline separators.

reviewRating: Dictionary

Information about the rating of the review. Fields:

  • ratingValue is a number representing the rating given by the reviewer.

  • bestRating is the best possible rating, if known.

For example, “3 out of 5” would look like this:

{"ratingValue": 3, "bestRating": 5}
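
Because sites use different rating scales, it can help to express a rating as a fraction of the best possible rating before comparing reviews. The helper below is a minimal sketch; normalized_rating is a hypothetical name, and the plain ratio assumes the scale starts at zero, which the extracted fields do not confirm.

```python
def normalized_rating(review):
    """Return ratingValue as a fraction of bestRating, or None.

    None is returned when reviewRating, ratingValue or bestRating is
    missing (or bestRating is zero), since these fields are optional.
    """
    rating = review.get('reviewRating') or {}
    value = rating.get('ratingValue')
    best = rating.get('bestRating')
    if value is None or not best:
        return None
    return value / best


review = {'reviewRating': {'ratingValue': 3, 'bestRating': 5}}
print(normalized_rating(review))  # prints 0.6
```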

datePublished: String

Publication date in ISO format with a ‘T’ separator. It may include a timezone.

datePublishedRaw: String

Same date as datePublished, but before parsing/normalisation, i.e. as it appears on the website.
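
Since datePublished is ISO-formatted, it can usually be parsed with the Python standard library. This is a sketch under assumptions: parse_date_published is a hypothetical helper, and the handling of a trailing 'Z' is a guess at how a UTC timestamp might be written, not something the field description guarantees.

```python
from datetime import datetime


def parse_date_published(review):
    """Parse datePublished into a datetime, or return None if absent.

    The result is timezone-aware only when the string carries an offset.
    """
    value = review.get('datePublished')
    if not value:
        return None
    # Before Python 3.11, fromisoformat rejects a trailing 'Z', so it is
    # rewritten here as an explicit UTC offset.
    if value.endswith('Z'):
        value = value[:-1] + '+00:00'
    return datetime.fromisoformat(value)
```

When parsing fails or the field is missing, datePublishedRaw still holds the date as it appeared on the website.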

votedHelpful: Integer

Number of votes marking this review as helpful (upvotes).

votedUnhelpful: Integer

Number of votes marking this review as unhelpful (downvotes).

isVerified: Boolean

Whether the reviewer has been verified as a real person or as a genuine owner or buyer of the reviewed item.

probability: Float

Probability that this is a review.

All fields are optional, except for probability.
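
Because probability is always present, it can be used to drop low-confidence items. A minimal sketch follows; confident_reviews is a hypothetical name and the 0.5 default threshold is an arbitrary assumption, not a recommended value.

```python
def confident_reviews(reviews, threshold=0.5):
    """Keep only reviews whose probability is at least `threshold`."""
    # probability is the one required field, so direct indexing is safe.
    return [r for r in reviews if r['probability'] >= threshold]
```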

Fields without a valid value (null or empty array) are excluded from extraction results.
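
Since missing fields are omitted rather than set to null, code that reads a review should not assume a key exists. The sketch below flattens a review into a row with a fixed set of columns, using None for anything absent; review_to_row and REVIEW_FIELDS are hypothetical names chosen for illustration.

```python
# Scalar review fields described above; reviewRating is handled separately.
REVIEW_FIELDS = (
    'name', 'reviewBody', 'datePublished', 'datePublishedRaw',
    'votedHelpful', 'votedUnhelpful', 'isVerified', 'probability',
)


def review_to_row(review):
    """Return a flat dict with every column present, None when missing."""
    row = {field: review.get(field) for field in REVIEW_FIELDS}
    rating = review.get('reviewRating') or {}
    row['ratingValue'] = rating.get('ratingValue')
    row['bestRating'] = rating.get('bestRating')
    return row
```

A fixed column set like this is convenient when writing reviews to CSV or a database table.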

Response example

Below is an example response with all review fields present:

[
  {
    "reviews": {
      "url": "https://example.com/review-3",
      "paginationNext": {
        "text": "Next Page",
        "url": "http://example.com/review-4"
      },
      "paginationPrevious": {
        "text": "Previous Page",
        "url": "http://example.com/review-2"
      },
      "reviews": [
        {
          "name": "A great tool!",
          "reviewBody": "AutoExtract is a great tool for review extraction",
          "reviewRating": {
            "ratingValue": 5.0,
            "bestRating": 5.0
          },
          "datePublished": "2020-01-30T00:00:00",
          "datePublishedRaw": "Jan 30, 2020",
          "votedHelpful": 12,
          "votedUnhelpful": 1,
          "isVerified": true,
          "probability": 0.95
        },
        {
          "name": "Another review",
          "probability": 0.95
        }
      ]
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "example.com",
      "userQuery": {
        "pageType": "reviews",
        "url": "https://example.com/review-3"
      }
    },
    "algorithmVersion": "20.8.1"
  }
]
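
To illustrate how the pieces of such a response fit together, the sketch below pulls a few values out of one element of the response list. The summarize_result name and the keys of the returned dictionary are hypothetical; only the input keys come from the response example above.

```python
def summarize_result(result):
    """Summarize one element of the response list."""
    page = result.get('reviews') or {}
    web_page = result.get('webPage') or {}
    return {
        'url': page.get('url'),
        'review_count': len(page.get('reviews', [])),
        # None when the page has no link to a next page.
        'next_url': (page.get('paginationNext') or {}).get('url'),
        'languages': [lang['code'] for lang in web_page.get('inLanguages', [])],
    }
```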