Choose an extraction source

Zyte API automatic extraction can work with HTTP requests or with browser requests. This choice can significantly affect output quality, response time and cost.

Currently, Zyte API uses browser requests by default in most cases, which usually provides the highest output quality. However, HTTP requests almost always have a lower response time and cost, so you should use them where their output quality is good enough for your needs.

In this chapter of the Zyte AI spiders tutorial you will explore the Extraction Source parameter.

Create an HTTP spider

Now, create a new spider that uses HTTP for automatic extraction:

  1. Select Create spider on the left-hand sidebar.

    ../../../_images/create-spider.png
  2. Click the Select button on the E-commerce box.

  3. Fill the fields that show up as follows:

    1. On the Name field, type “books.toscrape.com / Mystery category (HTTP)”.

    2. On the Inputs / URL field, enter the same URL as before: https://books.toscrape.com/catalogue/category/books/mystery_3/index.html.

    3. On the Extraction Source field, select httpResponseBody.

      ../../../_images/http.png
  4. On the bottom-right corner, click Save and run.
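If you also call Zyte API directly, the Extraction Source field appears to correspond to the extractFrom key of productOptions in a product extraction request. The following is a minimal sketch under that assumption: build_product_request is a hypothetical helper name, the example URL is a placeholder, and the commented-out lines only hint at how the body might be sent with the requests library.

```python
import json

# Extraction Source values named in this tutorial.
EXTRACTION_SOURCES = ("httpResponseBody", "browserHtml")


def build_product_request(url, extract_from=None):
    """Build a request body for Zyte API product extraction.

    Hypothetical helper for illustration. When extract_from is None,
    productOptions is omitted so that Zyte API uses its default source.
    """
    if extract_from is not None and extract_from not in EXTRACTION_SOURCES:
        raise ValueError(f"unknown extraction source: {extract_from!r}")
    body = {"url": url, "product": True}
    if extract_from is not None:
        body["productOptions"] = {"extractFrom": extract_from}
    return body


if __name__ == "__main__":
    # Placeholder URL; replace it with a real product page.
    body = build_product_request("https://example.com/product", "httpResponseBody")
    print(json.dumps(body, indent=2))
    # Sending the request (assumes the requests library and your own API key):
    # import requests
    # response = requests.post(
    #     "https://api.zyte.com/v1/extract", auth=("YOUR_API_KEY", ""), json=body
    # )
```

Leaving extract_from unset mirrors leaving the Extraction Source field at its default value in the spider form.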

Compare output quality

You have run 2 jobs, from 2 spiders with only 1 difference: their Extraction Source parameter. Now, compare those jobs to find any differences in their output quality.

On the left-hand sidebar, select Jobs / Dashboard:

../../../_images/jobs-dash.png

And look at your 2 jobs:

../../../_images/jobs.png

First, check coverage: how many of the target items were extracted? The job dashboard shows that coverage is 100% in both jobs: both found all 32 items.

Next, compare the items from each job and look for data mismatches: fields with different values in each job, or fields present in only one job. You can either open the Items page of each job, or download the items and inspect them locally with a text editor.

Tip

It is normal for items to be in a different order. For technical reasons, crawling does not happen in a fixed order. So, pick one item from a job, then find an item in the other job with the same url, and finally compare all other fields of those 2 items.
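If you download the items, the matching-by-url comparison described in the tip can also be scripted. The following is a minimal sketch, assuming each job was downloaded as a JSON Lines file; load_jsonl, compare_items and the file names in the final comment are hypothetical names chosen for illustration.

```python
import json

MISSING = "<missing>"  # Placeholder shown when a field is absent from an item.


def load_jsonl(path):
    """Load items from a JSON Lines file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def compare_items(first_items, second_items):
    """Map each shared url to the fields whose values differ.

    Each difference is a (value_in_first, value_in_second) tuple.
    Items whose url appears in only one job are ignored here.
    """
    first = {item["url"]: item for item in first_items}
    second = {item["url"]: item for item in second_items}
    report = {}
    for url in sorted(first.keys() & second.keys()):
        a, b = first[url], second[url]
        diffs = {
            field: (a.get(field, MISSING), b.get(field, MISSING))
            for field in sorted(a.keys() | b.keys())
            if a.get(field, MISSING) != b.get(field, MISSING)
        }
        if diffs:
            report[url] = diffs
    return report


# Example call with hypothetical file names:
# report = compare_items(load_jsonl("browser.jsonl"), load_jsonl("http.jsonl"))
```

Because items are indexed by url before they are compared, the order in which each job produced them does not matter.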

Spoiler warning — these are the differences:

  • The 1st, browser-based job has 2 extra fields: aggregateRating and brand.

    However:

    • The brand value is wrong. Books to Scrape is the name of the website, not the brand.

    • The aggregateRating value is incomplete. It is missing the ratingValue and bestRating values.

  • Neither job extracts the right sku value. For example, for the book whose title slug is tastes-like-fear-di-marnie-rome-3, the right sku is 742, but:

    The 1st job extracts the UPC (2d1e337aaf341858) as sku.

    The 2nd, HTTP-based job extracts the right value with a bad prefix (3_742), most likely because it takes the sku from the URL and mistakenly includes the number at the end of the book title slug.
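To illustrate how a prefix like 3_ could slip into a sku taken from a URL, here is a purely illustrative sketch. It does not describe how Zyte's AI model works. The URL segment is assembled from the slug and sku quoted above, so its exact layout is an assumption, and both function names are hypothetical.

```python
import re

# Assumed last path segment of the product URL: title slug, underscore, sku.
SEGMENT = "tastes-like-fear-di-marnie-rome-3_742"


def naive_sku(segment):
    """Take the whole trailing run of digits and underscores.

    This over-matches when the title slug itself ends in a number.
    """
    match = re.search(r"[\d_]+$", segment)
    return match.group(0) if match else None


def strict_sku(segment):
    """Take only the digits that follow the final underscore."""
    match = re.search(r"_(\d+)$", segment)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(naive_sku(SEGMENT))   # 3_742
    print(strict_sku(SEGMENT))  # 742
```

The two patterns differ only in where they start matching, which is enough to turn 742 into 3_742 for a title that ends in a digit.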

Once you have compared the output quality of 2 jobs from spiders that differ only in their Extraction Source, you can choose a value for that parameter based on your needs.

For https://books.toscrape.com, if the aggregateRating.reviewCount field is very important to you, browserHtml may be the right choice. Otherwise, httpResponseBody is the way to go.

However, you have more options:

  • You can open a support ticket to report any extraction issue you detect.

    Specify an example URL, the extracted data type (e.g. product) and the extraction source you used (default, httpResponseBody, browserHtml), as well as the expected and actual field values.

    Our AI models support quick fixes in some scenarios, so our support team may be able to address some of your issues right away. We will take any issue that we cannot address right away into account for future iterations of our AI models.

  • You can use custom code to manually address any extraction issue yourself. You will learn how to do this in a later chapter.

Compare response times and costs

Now that you have compared the output quality of both spiders, it is time to compare response times and costs.

httpResponseBody will almost always be faster and cheaper, but see for yourself: select the burger menu on the top-left corner, expand the Zyte API menu, select Stats, open the Requests tab, and select today in the Date filter.

../../../_images/stats.png

Now compare the Response time and Cost of the requests that show product under Requested features (i.e. browserHtml) with those of the requests that show product (from httpResponseBody):

../../../_images/stats-rows.png

As you can see, for https://books.toscrape.com, httpResponseBody requests are more than 4 times cheaper and can be about 10 times faster.
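If you want to compute such ratios from the numbers on your own Stats page, a small helper is enough. The following is a minimal sketch; speedup_and_savings is a hypothetical name, and the figures in the example are placeholders, not real Zyte API prices or response times.

```python
def speedup_and_savings(browser, http):
    """Return how many times cheaper and faster the HTTP requests are.

    Each argument is a dict with "cost" and "response_time" keys, in
    whatever units the Stats page shows, as long as both use the same.
    """
    return {
        "times_cheaper": browser["cost"] / http["cost"],
        "times_faster": browser["response_time"] / http["response_time"],
    }


if __name__ == "__main__":
    # Placeholder values only; copy the real ones from the Requests tab.
    browser_stats = {"cost": 4.0, "response_time": 10.0}
    http_stats = {"cost": 1.0, "response_time": 1.0}
    print(speedup_and_savings(browser_stats, http_stats))
```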

Next steps

You now know how Extraction Source can impact output quality, response time and cost, and how to compare those to make a choice based on your needs.

In the next chapter you will learn to extend AI spiders with custom code.