Custom attributes extraction#

Overview#

In addition to extracting standard data types, such as product or article, Zyte API can extract user-defined attributes from any unstructured web page. Zyte custom attributes extraction uses a Large Language Model (LLM) operated by Zyte, which receives a user-defined schema together with text extracted from the web page and extracts structured data according to that schema.

When custom attributes extraction is requested, a standard extraction field must also be specified (e.g. product). This field determines which part of the web page is passed to the LLM for custom attributes extraction: for example, when a web page is a product page, only the product information is passed, ignoring other parts of the page such as the menu or footer, which makes extraction cheaper and more accurate. Any of the standard extraction fields can be used, except for serp.

The schema is passed in the customAttributes request field, and additional options can be customized in the customAttributesOptions field.

Extracted values are available in the customAttributes.values field in the response.

Here is an example body of a request to Zyte API which performs custom attributes extraction, adding “summary” and “article_sentiment” attributes:

{
  "url": "https://www.zyte.com/blog/intercept-network-patterns-within-zyte-api/",
  "article": true,
  "customAttributes": {
    "summary": {
      "type": "string",
      "description": "A two sentence article summary"
    },
    "article_sentiment": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"]
    }
  }
}

And here is an example response body, with “article” and “metadata” values omitted:

{
  "url": "https://www.zyte.com/blog/intercept-network-patterns-within-zyte-api/",
  "statusCode": 200,
  "article": {

  },
  "customAttributes": {
    "values": {
      "summary": "The Zyte API now allows developers to intercept network patterns, enabling better web scraping and bypassing challenges posed by modern websites with dynamic content and anti-bot measures. This feature allows for enhanced ban-handling strategies and more efficient scraping.",
      "article_sentiment": "positive"
    },
    "metadata": {

    }
  }
}

Refer to examples of making Zyte API requests with different languages and libraries.
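For a quick illustration, here is a minimal Python sketch of the request above using the requests library. It assumes the standard Zyte API endpoint and HTTP Basic authentication with your API key as the username; YOUR_API_KEY is a placeholder.

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),  # Zyte API key as the username, empty password
    json={
        "url": "https://www.zyte.com/blog/intercept-network-patterns-within-zyte-api/",
        "article": True,
        "customAttributes": {
            "summary": {
                "type": "string",
                "description": "A two sentence article summary",
            },
            "article_sentiment": {
                "type": "string",
                "enum": ["positive", "negative", "neutral"],
            },
        },
    },
)
values = api_response.json()["customAttributes"]["values"]
print(values["summary"], values["article_sentiment"])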

Method of extraction#

customAttributesOptions.method selects the method of custom attributes extraction:

  • “generate” (default) generates the extracted data with the help of a generative Large Language Model (LLM). It is the most powerful and versatile extraction method, but also the most expensive one, with a variable per-request cost.

  • “extract” locates the requested data in the web page with the help of a non-generative LLM. It only supports a subset of the schema (only the string, integer and number types), and can’t perform generative tasks such as summarization or data transformation. It is, however, much cheaper than the generative method and has a fixed per-request cost. An example request selecting this method is shown below.
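For example, a request body selecting the extractive method could look like this (the URL is a placeholder, and the schema sticks to attribute types the extractive method supports):

{
  "url": "https://example.com/product/123",
  "product": true,
  "customAttributes": {
    "memory_capacity": {
      "type": "string"
    }
  },
  "customAttributesOptions": {
    "method": "extract"
  }
}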

Schema for the generative method#

The schema of the custom attributes is passed in the customAttributes request field, and is a subset of the OpenAPI specification, using JSON syntax. Here is an example custom attributes schema, showcasing the main features and good practices with the default “generate” method:

{
  "pockets": {
    "type": "integer",
    "description": "how many pockets the piece of clothing has"
  },
  "has_reflective_elements": {
    "type": "boolean",
    "description": "does the piece of clothing have reflective elements?"
  },
  "pattern_orientation": {
    "type": "string",
    "description": "if the piece of clothing has a pattern, the orientation of this pattern",
    "enum": ["horizontal", "vertical", "diagonal"]
  },
  "materials": {
    "type": "array",
    "description": "the materials the product is made of",
    "items": {"type": "string"}
  },
  "materials_details": {
    "type": "array",
    "description": "information about the materials the product is made of",
    "items": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "description": "the name of the material"
        },
        "percentage": {
          "type": "number",
          "description": "the percentage of the material in the product"
        }
      }
    }
  },
  "price": {
    "type": "object",
    "properties": {
      "regular": {
        "type": "number",
        "description": "the regular price of the product. This is, without any discount"
      },
      "discounted": {
        "type": "number",
        "description": "the current price of the product, with the discount"
      },
      "unit": {
        "type": "string",
        "description": "the currency code of the price, usually given as a 3-letter code, e.g. USD, EUR, GBP, etc."
      }
    }
  }
}

An example output which may be produced for this schema would be:

{
  "pockets": 3,
  "has_reflective_elements": false,
  "materials": ["cotton", "polyester", "elastane"],
  "materials_details": [
    {"name": "cotton", "percentage": 70},
    {"name": "polyester", "percentage": 25},
    {"name": "elastane"}
  ],
  "price": {
    "regular": 100,
    "discounted": 99,
    "unit": "EUR"
  }
}

Note that "pattern_orientation" is missing from the response, as well as "percentage" for one of the materials: this is due to all attributes being implicitly nullable, so if an attribute can not be extracted, it will not be returned.

The returned value is guaranteed to conform to the requested schema.

The following attribute data types are supported:

  • string

  • boolean

  • number

  • integer

  • array of any data type except for array

  • object with string, boolean, number and integer sub-fields

When the type is string, number or integer, an enum can also be indicated, and the extracted value for that attribute will always be one of these options, or empty when it cannot be extracted - see "pattern_orientation" in the example above. This is especially useful in data analysis use cases, where one might need to split the dataset into pre-defined groups.

Generative attributes#

Custom attributes don’t need to be restricted to extracting data verbatim as it appears on the website. They can also perform operations on the data that are only possible with generative extraction. The examples below take advantage of this capability.

Normalization#

A custom attribute can be normalized during extraction when the normalization is specified in the description, usually as an explicit format or via an example.

This is especially useful for later parsing, e.g. for visualization or data analysis. For example:

{
  "datetime_posted": {
    "type": "string",
    "description": "the date when the article was created, in the following format: YYYY/MM/DD"
  }
}

Example output:

{
  "datetime_posted": "2021/12/30"
}
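Since the format is fixed by the description, the value can then be parsed reliably in post-processing. A minimal Python sketch, assuming values is the customAttributes.values dictionary from the response:

from datetime import datetime

# Parse the normalized YYYY/MM/DD string into a datetime object.
posted = datetime.strptime(values["datetime_posted"], "%Y/%m/%d")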

Summarization#

Sometimes an attribute that is a summary rather than the whole text can be useful, for example to save on the tokens needed to generate the result, or when a simplified version of the page content is needed.

Example schema:

{
  "summary": {
    "type": "string",
    "description": "a brief summary of the article. Max 2 phrases. Explain it as a third person, e.g. start like this: The article.."
  }
}

Example output:

{
  "summary": "The article describes the scenic beauty and vast adventure opportunities of the Grand Canyon National Park, highlighting its colorful landscapes and the meandering Colorado River. It provides practical information for visitors, such as entrance fees, lodging options, and tips for hiking and rafting."
}

Translation#

Extraction and translation can be done on the fly: just specify the conditions and/or details in the description.

Example text:

[...]
Couleurs du produit disponibles: jaune, rouge
[...]

Example schema:

{
  "colors": {
    "type": "array",
    "description": "the available colors of the product. Translate to English if needed.",
    "items": {"type": "string"}
  }
}

Output:

{
  "colors": ["yellow", "red"]
}

When there are several attributes in the schema, instructions like this may also be applied to later attributes. If you want a different behavior for those attributes, it’s recommended to specify it explicitly in each description, for example:

{
  "colors": {
    "type": "array",
    "description": "the available colors of the product. Translate to English if needed.",
    "items": {"type": "string"}
  },
  "materials": {
    "type": "array",
    "description": "the materials the product is made of. Extract as they appear on the page, without translating them.",
    "items": {"type": "string"}
  }
}

Explanation#

We can make the LLM analyze and explain the page content (or other details) before doing the actual extraction, either in the same attribute or in a separate explanatory attribute placed before the one that holds the final extraction.

This is especially useful to force the LLM to develop its reasoning before doing the actual extraction, which has been shown to improve the final answer. For example:

{
  "explain is a toy": {
    "type": "string",
    "description": "analyze the content of the page and detailedly explain it, explaining if it is a single product page and if the product is a toy or not."
  },
  "is a toy": {
    "type": "boolean",
    "description": "whether the product is a toy or not"
  }
}

Example output:

{
  "explain is a toy": "The content of the page is a product page for \"Roasted & Salted Plantain Chips\", which is a type of snack food. It includes details such as brand, price, ratings, ingredients, and product description. It does not mention anything about toys or games, so it is not a single product page for a toy.",
  "is a toy": false
}

Overall, we’d expect the extraction of the “is a toy” custom attribute in the example above to be more accurate if we use the “explain is a toy” before it, especially in hard or ambiguous cases (e.g. the product is a manual for a toy).

Note

Since this kind of explanation needs to generate a fair amount of tokens, it can considerably increase the extraction cost.

Also, it is important that the attribute that does the explanation/analysis (“explain is a toy”) comes before the one where the final extraction is made (“is a toy”).

Other tips and tricks#

Avoiding mathematical transformations#

We recommend doing a simple extraction when possible, and then applying your rules or transformations as a post-processing step.

For example, imagine you want to extract the height of a product, but always in inches. However, you’re scraping a lot of product pages, and some websites might display the height of the product in cm, m, ft, etc. One option is to explicitly ask in the schema to “transform to unit X if found in unit Y”. The LLM generally has the capacity to do this conversion internally, but we cannot ensure the result will always be correct, and it overcomplicates extraction for the LLM.

Example Text:

Vacuum cleaner Turbo master 2000
Price: 200 $
Specified height by the manufacturer is 1.2 meters

Example Schema:

{
  "height": {
    "type": "number",
    "description": "height of the product, in inches. Transform it to inches if found in other metric"
  }
}

Extraction result:

{
  "height": 47.24
}

The result is correct. However, when the schema is bigger (i.e. there are more custom attributes to extract), the LLM’s attention is spread more thinly, and it has a higher chance of failing these internal conversions, which cannot be verified. For this reason, we recommend writing a schema that lets the LLM extract the desired data verbatim from the page, with the fields necessary to do your own transformation in your favorite programming language. This extraction is easier for the LLM and less likely to produce incorrect values.

Example schema:

{
  "product_height": {
    "type": "object",
    "description": "info about the height of the product",
    "properties": {
      "value": {
        "type": "number",
        "description": "the value of the height"
      },
      "unit_normalized": {
        "type": "string",
        "description": "the normalized unit of measurement for the height",
        "enum": ["cm", "m", "in", "ft", "mm", "other"]
      }
    }
  }
}

Extraction result:

{
  "product_height": {
    "value": 1.2,
    "unit_normalized": "m"
  }
}

And then do the necessary transformation. For example, using Python:

if values["product_height"]["unit_normalized"] == "m":
   return values["product_height"]["value"] * 39.37  # meters to inches
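A slightly more complete version of the same helper, covering every unit from the enum above (the conversion table and the handling of missing or "other" units are illustrative):

INCHES_PER_UNIT = {"mm": 0.03937, "cm": 0.3937, "m": 39.37, "in": 1, "ft": 12}

def height_in_inches(values):
    height = values.get("product_height") or {}
    factor = INCHES_PER_UNIT.get(height.get("unit_normalized"))
    if factor is None:  # unit is missing or "other": nothing to convert
        return None
    return height["value"] * factor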

Reducing the number of attributes#

A lower number of attributes generally means better extraction quality: the easier a problem is, the better the LLM will be at solving it. The more attributes there are in the schema, the harder the overall extraction becomes for the LLM, and the more it tends to miss details in the descriptions or in the web page.

Generally, the fewer the attributes, the more the LLM can focus on each of them, and the better the extraction quality will be. This is especially important when some attributes are already hard or complex on their own (e.g. arrays of objects).

Schema for the extractive method#

The main use case for the extractive method is the extraction of simple, not-too-large attributes that do not require any transformation, such as memory capacity or screen resolution for product.

When you select the “extract” custom attributes extraction method and your schema contains attributes that the extractive method does not support, e.g. objects, arrays, or booleans, those attributes are ignored during extraction.

When creating a schema for the extractive method, we recommend starting without attribute descriptions. If an attribute name alone is not enough to reach the desired quality, write a description for that attribute, formulated as a question that is detailed enough to stand on its own, without assuming that the attribute name will be available as context for that question.

For example, you might start with the following:

{
  "number_of_pockets": {
    "type": "integer"
  }
}

And then make it more specific:

{
  "pockets": {
    "type": "integer",
    "description": "What is the number of pockets in this garment?"
  }
}

But we don’t recommend having an incomplete description that relies on the attribute name, or a description that is not a question:

{
  "pockets": {
    "type": "integer",
    "description": "number of them in this garment"
  }
}

When extracting values with units, we recommend extracting the whole value as one attribute instead of splitting the value and the unit. For example, do this:

{
  "memory_capacity": {
    "type": "string"
  }
}

instead of this, which is less likely to work well:

{
  "memory_capacity_value": {
    "type": "integer"
  },
  "memory_capacity_unit": {
    "type": "string"
  }
}
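The value and the unit can then be separated in post-processing. A minimal Python sketch, assuming extracted strings such as "512 GB" (the function name and the expected format are illustrative):

import re

def parse_memory_capacity(raw):
    # Split a string such as "512 GB" into a numeric value and a unit.
    match = re.match(r"\s*([\d.]+)\s*(\w+)", raw)
    if match is None:
        return None, None
    return float(match.group(1)), match.group(2)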