AI spiders#
AI spiders get you structured data from any e-commerce, article or job posting website, as well as from Google Search results.
Use AI spiders however you prefer:
Start a Scrapy project with AI spiders (covered in the AI spiders tutorial).
Add AI spiders to an existing Scrapy project (covered in the web scraping tutorial).
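For example, in an existing Scrapy project, one common way to make the AI spiders available is to point Scrapy at the spider classes shipped in the zyte-spider-templates package. The snippet below is a minimal sketch; it assumes zyte-spider-templates is installed and that the project is otherwise configured for Zyte API (the tutorials above cover the full setup), and the module name myproject is hypothetical.

```python
# settings.py (sketch): expose the AI spider classes to "scrapy crawl".
# Assumes the zyte-spider-templates package is installed and the project is
# otherwise configured for Zyte API (see the tutorials referenced above).
SPIDER_MODULES = [
    "myproject.spiders",              # your own spiders (hypothetical module name)
    "zyte_spider_templates.spiders",  # AI spider templates
]
```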
Spider list#
The following AI spiders are currently available: E-commerce, Google Search Results, Article and Job posting.
E-commerce#
The E-commerce spider gets product data from e-commerce websites.
It supports the regular input parameters (including search queries), many crawl strategies, geolocation, max requests, extraction source, and custom attributes.
You can also use the Extract field to choose what your spider extracts: visit the detail webpage of every product and extract product data, getting as much data per product as possible, or skip detail webpages and extract productList data instead, minimizing costs by drastically lowering the number of requests.
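For instance, a run of the E-commerce spider from Python could look like the sketch below. The argument names (url, extract) are assumed to mirror the input fields described above; check your zyte-spider-templates version for the exact names, and note that the project settings must already be configured for Zyte API.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import EcommerceSpider

# Run from within a Scrapy project already configured for Zyte API.
process = CrawlerProcess(get_project_settings())
process.crawl(
    EcommerceSpider,
    url="https://books.toscrape.com",  # example start URL
    extract="productList",  # assumed value; use "product" to visit every product detail page
)
process.start()
```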
Google Search Results#
The Google Search Results spider gets search results from Google.
Set a target Domain (defaults to google.com), a list of Search Queries (1 query per line), and the Max Pages to crawl per search query (defaults to 1). You may use query operators in your queries.
It returns serp data by default. Use Follow And Extract to instead have the spider follow the result URLs and extract other structured data from them, choosing from: article, articleList, forumThread, jobPosting, product, and productList.
It also supports geolocation (IP Country) and max requests.
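As a sketch, a run of the Google Search Results spider might look as follows. The argument names are assumed to mirror the Domain, Search Queries and Max Pages fields described above; check your zyte-spider-templates version for the exact names.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import GoogleSearchSpider

process = CrawlerProcess(get_project_settings())
process.crawl(
    GoogleSearchSpider,
    domain="google.com",                             # target Google domain
    search_queries="web scraping\nscrapy tutorial",  # 1 query per line
    max_pages=2,                                     # pages to crawl per query
)
process.start()
```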
Article#
The Article spider gets article data from websites.
It supports regular input parameters, basic crawl strategies, geolocation, max requests (total and per seed), and extraction source.
It also supports incremental crawls: Enable Incremental to only return new articles in subsequent runs of your spider. Your spider automatically creates a collection and uses it to keep track of seen articles. Setting a custom Incremental Collection Name is recommended (only alphanumeric characters and underscores are allowed).
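For example, an incremental run of the Article spider could be configured as in the sketch below; the incremental and incremental_collection_name argument names are assumed to mirror the fields described above.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import ArticleSpider

process = CrawlerProcess(get_project_settings())
process.crawl(
    ArticleSpider,
    url="https://www.zyte.com/blog/",  # example start URL
    incremental=True,  # only return articles not seen in previous runs
    incremental_collection_name="zyte_blog_articles",  # alphanumeric characters and underscores only
)
process.start()
```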
Job posting#
The Job posting spider gets jobPosting data from websites.
It supports regular input parameters, basic crawl strategies, geolocation, max requests, extraction source, and custom attributes.
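For example, custom attributes could be requested on top of the standard jobPosting fields as in the sketch below. The custom_attrs_input argument name and the schema format are assumptions based on Zyte API custom attribute extraction; check the current documentation for the exact interface, and note that the start URL is hypothetical.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import JobPostingSpider

process = CrawlerProcess(get_project_settings())
process.crawl(
    JobPostingSpider,
    url="https://example.com/careers",  # hypothetical start URL
    # Assumed argument: a schema of extra attributes to extract per job posting
    # (a JSON string may be required instead of a dict, depending on the version).
    custom_attrs_input={
        "seniority": {"type": "string", "description": "Seniority level of the role"},
    },
)
process.start()
```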
Inputs#
The E-commerce, Article and Job posting spiders support 3 different ways to define their start URLs:
Use URL to set a single start URL, e.g. https://toscrape.com.
Use URLs to set multiple start URLs, 1 per line.
Use URLs file to set the URL to a plain-text file containing the list of URLs to use as start URLs, 1 per line, e.g. https://pastebin.com/raw/is25dWkC.
The E-commerce spider also lets you specify Search Queries (1 per line). If you do, your spider downloads each start URL, looks for a search form or search metadata in each response, and, when found, sends a search request per search query.
Note
If parsing a start URL for a search form or search metadata fails, that start URL is ignored. If parsing returns bad data, invalid search requests may be sent. You can always customize your project to improve search support for a given website, or set search URLs as start URLs instead of using the Search Queries field.
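As a sketch, the three start URL inputs and the Search Queries field could map to spider arguments like the following. The argument names (url, urls, urls_file, search_queries) are assumed to mirror the fields above; pass only one of the start URL options.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import EcommerceSpider

process = CrawlerProcess(get_project_settings())
process.crawl(
    EcommerceSpider,
    # Pick one of the following ways to define start URLs:
    # url="https://toscrape.com",                                # a single start URL
    # urls="https://toscrape.com\nhttps://books.toscrape.com",   # multiple URLs, 1 per line
    urls_file="https://pastebin.com/raw/is25dWkC",  # plain-text file of start URLs, 1 per line
    search_queries="laptop\nheadphones",            # E-commerce spider only, 1 query per line
)
process.start()
```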
Max requests#
To protect you from costly accidental runs while experimenting with Zyte spiders, all spiders are limited to 100 Zyte API requests by default. Remember to increase Max Requests as needed when creating a spider for a larger crawl.
When using the Article spider, you can also set Max Requests Per Seed, a maximum number of requests per start URL.
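For example, the limits could be raised for a larger crawl as in the sketch below; the argument names are assumed to mirror the Max Requests and Max Requests Per Seed fields described above.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import ArticleSpider

process = CrawlerProcess(get_project_settings())
process.crawl(
    ArticleSpider,
    url="https://www.zyte.com/blog/",  # example start URL
    max_requests=1000,          # raise the default limit of 100 Zyte API requests
    max_requests_per_seed=200,  # Article spider only: cap requests per start URL
)
process.start()
```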
Crawl strategies#
The E-commerce, Article and Job posting spiders support a subset of the following crawl strategies:
Automatic: If the start URL looks like a home page, use Full. Otherwise, use Navigation.
Full: Get all items from the entire website.
In addition to navigation data from the start URLs (e.g. productNavigation), this strategy may employ additional techniques to try to reach all relevant content from the target website.
Navigation: Extract navigation data from the start URLs (e.g. productNavigation), and use that navigation data recursively to find items.
Use this strategy to get items from a specific category and its subcategories. In some cases, it can work as a replacement for Full that requires fewer requests.
Pagination only may be a more reliable alternative in cases where there are no subcategories.
Pagination only: Extract navigation data from the start URLs (e.g. productNavigation), find new pages following pagination data only (e.g. productNavigation.nextPage), and get all items from those pages.
Use this strategy to get items from a specific category without going into its subcategories.
Direct URLs: Output items are extracted directly from the start URLs, no crawling involved.
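For example, to crawl a single category page and its pagination without going into subcategories, the Pagination only strategy could be selected as in the sketch below. The crawl_strategy argument and the pagination_only value are assumed to mirror the strategy names above; check your zyte-spider-templates version for the exact identifiers.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import EcommerceSpider

process = CrawlerProcess(get_project_settings())
process.crawl(
    EcommerceSpider,
    url="https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
    crawl_strategy="pagination_only",  # follow pagination only, skip subcategories
)
process.start()
```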
New spiders, improvements and deployment#
If you choose to deploy Zyte’s AI-Powered Spiders when creating a Scrapy Cloud project, the code of your project is automatically maintained by Zyte. As we release new AI spiders and improve existing ones, you will get those improvements automatically in your project.
If you ever deploy your own code to that project, or create a new project where you deploy your own code from the beginning, you gain control over the project code and Zyte stops automatic code upgrades.
When in control of your project code, you might want to monitor new releases of the zyte-spider-templates library or changes to the zyte-spider-templates-project project template, to decide when and how to upgrade your project.
Customization#
AI spiders are Scrapy-powered spider templates implemented in the open source zyte-spider-templates library. You can customize them completely.
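For example, a customization can be as small as subclassing one of the spider templates, as in the sketch below; from there you can override any attribute or method of the base class. The import path follows the library's public API, but check your installed version.

```python
# myproject/spiders/my_ecommerce.py (sketch; module name is hypothetical)
from zyte_spider_templates import EcommerceSpider


class MyEcommerceSpider(EcommerceSpider):
    """E-commerce AI spider with project-specific tweaks."""

    name = "my_ecommerce"

    # Override attributes, parameters or parsing callbacks of the base
    # class here to customize behavior for your target websites.
```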