# Zyte Documentation

How would you like to get public data from the Internet?

> ##### Write your own code
>
> New to web scraping? Learn here!
>
> Have some experience? Improve your stack with our solutions below.

> ##### [We do it for you](https://www.zyte.com/data-extraction/)
>
> We can design, build, monitor, and maintain custom web scraping
> solutions.
>
> [Talk to us!](https://www.zyte.com/data-extraction/)

At Zyte we have solutions for every web scraping need:

> ##### Antiban
>
> Get HTTP responses without bans with Zyte API.

> ##### Hosting
>
> Run your code on Scrapy Cloud.

> ##### Browser automation
>
> Get powerful browser automation with Zyte API.

> ##### Coding Agent Add-Ons
>
> Use Coding Agent Add-Ons to build better web scraping projects faster.

> ##### Automatic extraction
>
> Let AI do the parsing for you with Zyte API.

## Get started with web scraping

**Web scraping** is the download of data from websites in a structured format that you can process. Use cases include price intelligence, market analysis, competitor intelligence, vendor management, lead generation, investment research, and brand monitoring.

Below we expand on the steps and challenges of web scraping, and the solutions that we offer or recommend.

> ###### TIP
>
> For a more hands-on experience, see the tutorial.

### Steps in web scraping

Getting structured data from a website involves the following steps:

1. Building a list of **target URLs** from which you want to get structured data.

   You can build it manually or get it from an external source. However, you will often want to use web scraping to find all URLs of interest in a given website.

   This process, known as **web crawling**, is a web scraping process in itself, with the same steps (target URLs, download, parsing), where the target URL is often the homepage of the target website, and the output is usually the list of target URLs.

2. **Downloading** the webpages at those URLs.
   The main challenge at the download stage is avoiding bans. However, the complexity of this step can also be influenced by your output needs (e.g. screenshots) and parsing choices (e.g. browser automation).

3. **Parsing** those webpages to extract data of interest in a structured data format as output.

   Parsing can be a complex step, and it may involve downloading additional URLs, or using browser automation. In the long term, however, the main challenge of the parsing stage is dealing with breaking changes in the target website.

   To make parsing easier, use ai-code.

### Choosing a framework

Choosing the right technology to write your code is key for the long-term success of your web scraping project. To make your choice, you should consider aspects like development speed, performance, maintainability, and vendor lock-in.

At Zyte we use and maintain [Scrapy](https://scrapy.org), a popular open source web scraping framework written in Python. Scrapy is a powerful, extensible framework that favors writing maintainable code.

For most Zyte products and services we provide Scrapy plugins that make integration with Scrapy seamless.

### Avoiding bans

An increasing number of websites ban some of their traffic.

What you need to do to avoid bans depends on the target websites, and it can vary wildly. Proxy rotation is often necessary, but on top of that you may need extra logic, including cookie and session handling, browser-like JavaScript execution, and browser-like HTTP protocol handling.

While you can implement ban avoidance on your own by combining different services and tools, it can be time-consuming to implement, maintain, and scale.

To avoid bans, we provide Zyte API, an API that automatically avoids bans cost-efficiently.

### Saving time with browser automation

For websites that use JavaScript to load content, there are 2 approaches you can take:

- Reverse-engineering and recreating the JavaScript code that loads the content.
- Letting a browser automation tool run the JavaScript code.

Reverse-engineering usually requires more development time, but requires fewer resources once implemented, and can uncover useful, hidden data.

Browser automation usually saves development time, but requires additional resources, and can be hard to scale.

Zyte API provides browser automation features and, unlike regular browser automation tools, it:

- Scales easily, offsetting one of the main drawbacks of using browser automation tools.
- Supports special actions, high-level actions built with website-specific knowledge, such as searching or filling location data, to save you even more time.

### Taking screenshots

Sometimes you want a screenshot of the webpage from which you are extracting data.

Screenshots can be handy as a visual representation of the extracted data, but they can also be used, for example, to perform random quality checks where you compare the screenshot with the extracted data.

To take webpage screenshots at scale, use Zyte API.

### Running your code

When running your web scraping code, you usually want a system where you can easily start, schedule, monitor, and inspect your web scraping jobs, where they can run uninterrupted for as long as needed, and where you can run as many parallel jobs as you wish.

Scrapy Cloud is our solution for running web scraping code in the cloud.

### Avoiding breaking website changes

Websites change, and when they do they can break your parsing code.

Monitoring your web scraping solution for breaking website changes, and addressing those changes, can be very time-consuming, and it scales up as you target more websites.

One way to avoid this issue altogether is *not* to write parsing code to begin with. Instead, you can let Zyte API handle parsing for you.

Alternatively, you can make addressing those changes less time-consuming with ai-code.
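To make "letting Zyte API handle parsing" a little more concrete, here is a minimal sketch of what such a request could look like at the HTTP level. It only builds the request and does not send it. The endpoint URL, the Basic authentication scheme, the `build_extract_request` helper, and the example product URL are assumptions made for illustration; the `product` flag mirrors the one used later in the tutorial.

```python
import base64
import json
from urllib.request import Request

# Assumption: the Zyte API extraction endpoint.
ZYTE_API_URL = "https://api.zyte.com/v1/extract"


def build_extract_request(url: str, api_key: str, **features: bool) -> Request:
    """Build (but do not send) a Zyte API request.

    Hypothetical helper for illustration; it is not part of any Zyte library.
    """
    # The JSON body holds the target URL plus the requested features,
    # e.g. product=True to ask for automatically extracted product data.
    payload = {"url": url, **features}
    # Assumption: HTTP Basic auth, API key as user name, empty password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return Request(
        ZYTE_API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder URL and key, for illustration only.
request = build_extract_request(
    "https://example.com/some-product",
    "YOUR_ZYTE_API_KEY",
    product=True,
)
```

Sending the request (for example with `urllib.request.urlopen`) should, under the assumptions above, return a JSON document with a `product` key. In a Scrapy project you would normally not build requests by hand like this, and would rely on the scrapy-zyte-api plugin instead, as the tutorial does.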
## Web scraping tutorials

> ##### Web scraping tutorial
>
> Build a production-ready web-scraping project from scratch.

> ##### Tutorial for Zyte Web Data for Claude Code
>
> Use **Zyte Web Data for Claude Code** to generate better web scraping
> code faster.

> ##### Tutorial for Web Scraping Copilot
>
> Use **Web Scraping Copilot** to generate better web scraping code
> faster with **Visual Studio Code**.

## Web scraping tutorial

In this tutorial you will build a [production-ready web-scraping project](https://github.com/zytedata/web-scraping-tutorial-project) from scratch:

> ##### 1. Start a Scrapy project
>
> Install Python and Scrapy, create a Scrapy project, and write your
> first spider.

> ##### 2. Deploy and run on Scrapy Cloud
>
> Deploy your project to Scrapy Cloud, run a job, and download the
> results.

> ##### 3. Enable Zyte API to avoid bans
>
> Install scrapy-zyte-api, and configure your project to use it in
> transparent mode.

> ##### 4. Handle JavaScript content
>
> Reproduce JavaScript code with HTTP requests, or execute it with
> browser automation.

> ##### 5. Automate parsing
>
> Use automatic extraction to get structured data without writing parsing
> code.

If you want to learn more, check out our guides!

## Start a Scrapy project

To build your web scraping project, you will use [Scrapy](https://scrapy.org/), a popular open source web scraping framework written in [Python](https://www.python.org/) and maintained by Zyte.

### Set up your project

#### Claude Code

> ###### NOTE
>
> Uses claude.

1. Install Zyte Web Data for Claude Code.

2. Create a `web-scraping-tutorial` folder and start a **Claude Code** session from it:

   ```shell
   mkdir web-scraping-tutorial
   cd web-scraping-tutorial
   claude
   ```

3. Prompt **Claude Code** to:

   > Create a Scrapy project named `web-scraping-tutorial` in the
   > current folder.

#### Copilot

> ###### NOTE
>
> Uses copilot.

1. Install Web Scraping Copilot.

2.
   On the Web Scraping Copilot sidebar view, select **Start building › Create new project**.

3. On the **Create new Scrapy project** page, set the **Scrapy project name** to `web-scraping-tutorial`, select a projects folder, and click **Create**.

Your new `web-scraping-tutorial` workspace will be created and set up.

#### CLI

1. [Install Python](https://wiki.python.org/moin/BeginnersGuide/Download), version `3.10` or higher.

2. Create a `web-scraping-tutorial` folder and make it your working folder:

   ```bash
   mkdir web-scraping-tutorial
   cd web-scraping-tutorial
   ```

3. Create and activate a [Python virtual environment](https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments).

   #### Windows

   ```batch
   python3 -m venv .venv
   .venv\Scripts\activate.bat
   ```

   #### macOS, Linux

   ```bash
   python3 -m venv .venv
   . .venv/bin/activate
   ```

4. Install the latest version of Scrapy:

   ```bash
   pip install scrapy==2.14.2
   ```

5. Make `web-scraping-tutorial` a [Scrapy](https://scrapy.org/) project folder:

   ```bash
   scrapy startproject web_scraping_tutorial .
   ```

Your `web-scraping-tutorial` folder should now contain at least the following folders and files:

```text
web-scraping-tutorial/
├── .venv/
│   └── …
├── web_scraping_tutorial/
│   ├── spiders/
│   │   └── __init__.py
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   └── settings.py
└── scrapy.cfg
```

### Create your first spider

Now that you are all set up, you will write code to extract data from all books in the Mystery category of [books.toscrape.com](http://books.toscrape.com/).
Create a file at `web_scraping_tutorial/spiders/books_toscrape_com.py` with the following code:

```python
from scrapy import Spider


class BooksToScrapeComSpider(Spider):
    name = "books_toscrape_com"
    # Speed up crawls; safe here because toscrape.com is a test site.
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY": 0.01,
    }
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        # Follow pagination; responses come back to this same callback.
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        # Follow every book link; responses go to parse_book.
        book_links = response.css("article a")
        yield from response.follow_all(book_links, callback=self.parse_book)

    def parse_book(self, response):
        # Extract one record per book detail page.
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css(".price_color::text").re_first("£(.*)"),
            "url": response.url,
        }
```

In the code above:

- You define a [Scrapy spider class](https://docs.scrapy.org/en/latest/topics/spiders.html) named `books_toscrape_com`.

- You set custom values for `CONCURRENT_REQUESTS_PER_DOMAIN` and `DOWNLOAD_DELAY` to speed up crawls during the tutorial. [https://toscrape.com](https://toscrape.com) is a test site, so it is safe to do so.

- Your spider starts by sending a request for the Mystery category URL, [http://books.toscrape.com/catalogue/category/books/mystery_3/index.html](http://books.toscrape.com/catalogue/category/books/mystery_3/index.html) (`start_urls`), and parses the response with the default callback method: `parse`.

- The `parse` callback method:

  - Finds the link to the next page and, if found, yields a request for it, whose response will also be parsed by the `parse` callback method.

    As a result, the `parse` callback method eventually parses all pages of the Mystery category.

  - Finds links to book detail pages, and yields requests for them, whose responses will be parsed by the `parse_book` callback method.

    As a result, the `parse_book` callback method eventually parses all book detail pages from the Mystery category.
- The `parse_book` callback method extracts a record of book information with the book name, price, and URL.

> ###### TIP
>
> What if, instead of writing parsing code manually, you could use AI to
> generate it? See the tutorials of ai-code.

Now run your spider:

#### Claude Code

> ###### NOTE
>
> Uses claude.

In a separate terminal, *not* in your **Claude Code** session, run:

```bash
scrapy crawl books_toscrape_com -O books.csv
```

#### Copilot

> ###### NOTE
>
> Uses copilot.

1. Select **Web Scraping Copilot** on the sidebar.

2. Expand the **Spiders** view. Click the **Refresh** button if your spider is not listed.

3. Click the **Run Spider Locally** button of your spider.

4. Paste the following in the **Arguments** field:

   ```none
   -O books.csv
   ```

5. Click **Run Spider**.

#### CLI

```bash
scrapy crawl books_toscrape_com -O books.csv
```

Once execution finishes, the generated `books.csv` file will contain records for all books from the Mystery category of [books.toscrape.com](http://books.toscrape.com/) in [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) format. You can open `books.csv` with any spreadsheet app.

Continue to the next chapter to learn how you can easily deploy and run your web scraping project on the cloud.

## Deploy and run on Scrapy Cloud

You now have a working Scrapy project that you have been running locally. Running your code locally is fine during development, but for production you usually want something better.

You will now deploy and run your code on Scrapy Cloud, which you can do for free.

### Deploy to Scrapy Cloud

#### Claude Code

> ###### NOTE
>
> Uses claude.

1. Create a Scrapy Cloud project on the [Zyte dashboard](https://app.zyte.com/).

2. Once created, copy your Scrapy Cloud project ID from the browser URL bar.

   For example, if the URL is `https://app.zyte.com/p/000000/deploy?state=deploy`, `000000` is your Scrapy Cloud project ID.

3.
   Prompt **Claude Code** to:

   > Deploy to Scrapy Cloud project `000000`

   Replace `000000` with your actual project ID.

#### Copilot

> ###### NOTE
>
> Uses copilot.

1. Create a Scrapy Cloud project on the [Zyte dashboard](https://app.zyte.com/).

2. Back in **Visual Studio Code**, select **Web Scraping Copilot** on the sidebar.

3. On the **Spiders** view title, click the **Deploy to Scrapy Cloud** button.

   ![image](web-scraping/tutorials/main/images/cloud/deploy-button.png)

4. Complete the interactive Scrapy Cloud setup steps.

   ![image](web-scraping/tutorials/main/images/cloud/interactive-steps.png)

5. Add the following to `web-scraping-tutorial/scrapinghub.yml`:

   ```yaml
   stacks:
     default: scrapy:2.14-20260217
   ```

6. Click the **Deploy to Scrapy Cloud** button again, and confirm.

Once your Scrapy project has been deployed to your Scrapy Cloud project, you will see a `Run your spiders at: ` line in the output.

#### CLI

1. Create a Scrapy Cloud project on the [Zyte dashboard](https://app.zyte.com/).

2. Install the latest version of `shub`, the Scrapy Cloud command-line application:

   ```bash
   pip install --upgrade shub
   ```

3. Create a YAML file at `web-scraping-tutorial/scrapinghub.yml` with the following content:

   ```yaml
   stacks:
     default: scrapy:2.14-20260217
   ```

4. Copy your [Scrapy Cloud API key](https://app.zyte.com/o/settings/apikey) (*not* a Zyte API key) from the Zyte dashboard.

5. Run the following command and, when prompted, paste your API key and press `Enter`:

   ```bash
   shub login
   ```

6. On the [Zyte dashboard](https://app.zyte.com/), select your Scrapy Cloud project under **Scrapy Cloud Projects**, and copy your Scrapy Cloud project ID from the browser URL bar.

   For example, if the URL is `https://app.zyte.com/p/000000/jobs`, `000000` is your Scrapy Cloud project ID.

7. Make sure `web-scraping-tutorial` is your current working directory.

8.
   Run the following command, replacing `000000` with your actual project ID:

   ```bash
   shub deploy 000000
   ```

Your Scrapy project has now been deployed to your Scrapy Cloud project.

### Run a Scrapy Cloud job

Now that you have deployed your Scrapy project to your Scrapy Cloud project, it is time to run one of your spiders on Scrapy Cloud:

1. On the [Zyte dashboard](https://app.zyte.com/), select your Scrapy Cloud project under **Scrapy Cloud Projects**.

   ![](web-scraping/tutorials/main/images/cloud/select-project.png)

2. On the **Dashboard** page of your project, select **Run** on the top-right corner.

   ![](web-scraping/tutorials/main/images/cloud/run.png)

3. On the **Run** dialog box:

   1. Select the **Spiders** field and, from the spider list that appears, select your spider name.

   2. Select **Run**.

   ![image](web-scraping/tutorials/main/images/cloud/run-run.png)

   A new Scrapy Cloud job will appear in the **Running** job list:

   ![](web-scraping/tutorials/main/images/cloud/running.png)

   Once the job finishes, it will move to the **Completed** job list:

   ![](web-scraping/tutorials/main/images/cloud/completed.png)

4. Follow the link from the **Job** column, **1/1**.

   ![](web-scraping/tutorials/main/images/cloud/job-link.png)

5. On the job page, select the **Items** tab.

   ![](web-scraping/tutorials/main/images/cloud/items.png)

6. On the **Items** page, select **Export › CSV**.

   ![](web-scraping/tutorials/main/images/cloud/export-csv.png)

The downloaded file will have the same data as the `books.csv` file that you generated locally with your first spider.

Continue to the next chapter to learn how to avoid website bans.

## Enable Zyte API to avoid bans

Now that you have run your project in Scrapy Cloud, it is time to improve the project itself, starting with handling website bans.

Your target domain in this tutorial, [toscrape.com](http://toscrape.com/), does not ban traffic. However, when targeting other websites, sooner or later you will get bans.
You will now configure your web scraping code to use Zyte API to avoid bans on any website:

1. [Sign up for Zyte API](https://app.zyte.com/account/signup/zyteapi).

   You get $5 free for a month, and you should only need a fraction of that to complete this tutorial.

2. Set up your project to use Zyte API with your key:

   #### Claude Code

   > ###### NOTE
   >
   > Uses claude.

   Add `ZYTE_API_KEY = "YOUR_API_KEY"` to `web-scraping-tutorial/web_scraping_tutorial/settings.py`, and replace `YOUR_API_KEY` with [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access).

   #### Copilot

   > ###### NOTE
   >
   > Uses copilot.

   Remove `#` from the `#ZYTE_API_KEY = "YOUR_API_KEY"` line in `web-scraping-tutorial/web_scraping_tutorial/settings.py`, and replace `YOUR_API_KEY` with [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access).

   #### CLI

   1. Install the latest version of scrapy-zyte-api:

      ```bash
      pip install --upgrade scrapy-zyte-api
      ```

   2. Configure scrapy-zyte-api in transparent mode by adding the following code at the end of `web-scraping-tutorial/web_scraping_tutorial/settings.py`, replacing `YOUR_ZYTE_API_KEY` with [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access):

      ```python
      ZYTE_API_KEY = "YOUR_ZYTE_API_KEY"
      ```

      Then, edit the Addons section at the start of the same `settings.py` file:

      ```python
      ADDONS = {
          "scrapy_zyte_api.Addon": 500,
      }
      ```

3. Get `scrapy-zyte-api` installed when running in Scrapy Cloud:

   #### Claude Code

   > ###### NOTE
   >
   > Uses claude.

   Skip this step. **Zyte Web Data for Claude Code** handled this automatically when you deployed your project to Scrapy Cloud.

   #### Copilot

   > ###### NOTE
   >
   > Uses copilot.

   1. Create `web-scraping-tutorial/requirements.txt` with the following content:

      ```none
      scrapy-zyte-api
      ```

   2. Add the following to `web-scraping-tutorial/scrapinghub.yml`:

      ```yaml
      requirements:
        file: requirements.txt
      ```

   #### CLI

   1. Create `web-scraping-tutorial/requirements.txt` with the following content:

      ```none
      scrapy-zyte-api
      ```

   2.
      Add the following to `web-scraping-tutorial/scrapinghub.yml`:

      ```yaml
      requirements:
        file: requirements.txt
      ```

If you run your code again, it will work the same, except that requests will be sent through Zyte API, to avoid bans cost-efficiently.

Continue to the next chapter to learn about browser automation.

> ###### TIP
>
> - We closely monitor the success rate for the most popular websites, but
>   less popular websites may slip under our radar. If you ever find a
>   website for which Zyte API does not work as expected (e.g. gives you a
>   ban response or too many errors), you can [reach
>   out to our expert anti-ban team](https://support.zyte.com/support/tickets/new).
> - If you get an SSL error, install the Zyte CA certificate on
>   your system and try again.

## Handle JavaScript content

Now that you know how to handle bans, you will learn how to handle websites that load content dynamically using JavaScript.

You will first reproduce what the JavaScript code does with regular HTTP requests, then you will use browser automation to achieve the same, and finally you will interact with a page.

### Reproduce JavaScript requests

Your next target will be [http://quotes.toscrape.com/scroll](http://quotes.toscrape.com/scroll), from which you will extract 100 quotes.

However, the HTML code of that page contains no quote at all. All 100 quotes are loaded dynamically: the webpage uses JavaScript code to send requests to its own API. To get all 100 quotes, you will need to reproduce those requests.
Create a file at `web_scraping_tutorial/spiders/quotes_toscrape_com_scroll_api.py` with the following code:

```python
import json

from scrapy import Spider


class QuotesToScrapeComScrollAPISpider(Spider):
    name = "quotes_toscrape_com_scroll_api"
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY": 0.01,
    }
    # One API request per page of 10 quotes, pages 1 to 10.
    start_urls = [
        f"http://quotes.toscrape.com/api/quotes?page={n}" for n in range(1, 11)
    ]

    def parse(self, response):
        # Each API response is a JSON document with a "quotes" list.
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "author": quote["author"]["name"],
                "tags": quote["tags"],
                "text": quote["text"],
            }
```

The code above sends 10 requests to the API of [quotes.toscrape.com](http://quotes.toscrape.com/), reproducing what JavaScript code at [http://quotes.toscrape.com/scroll](http://quotes.toscrape.com/scroll) does, and then parses the JSON response to extract the desired data.

Now run your new `quotes_toscrape_com_scroll_api` spider with `-O quotes.csv`. After all 10 requests are processed, all 100 quotes can be found in `quotes.csv`.

When the information that you want to extract is not readily available in the response HTML, but is loaded from JavaScript, reproducing the JavaScript code manually, as you did above by sending those 10 requests, is one option. Next you will try a few alternative approaches.

### Use browser automation

You will now ask Zyte API to use browser automation to render the page contents and return browser HTML, instead of raw HTML, and you will get Zyte API to render all 100 quotes with a single Zyte API request.
Create a file at `web_scraping_tutorial/spiders/quotes_toscrape_com_scroll_browser.py` with the following code:

```python
from scrapy import Request, Spider


class QuotesToScrapeComScrollBrowserSpider(Spider):
    name = "quotes_toscrape_com_scroll_browser"

    async def start(self):
        yield Request(
            "http://quotes.toscrape.com/scroll",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "scrollBottom",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".text::text").get()[1:-1],
            }
```

The code above sends a single request to [http://quotes.toscrape.com/scroll](http://quotes.toscrape.com/scroll), but this request includes some metadata. That is why the `start` method is used instead of `start_urls`, since the latter does not allow defining request metadata.

The specified metadata indicates to Zyte API that you want the URL to be loaded in a web browser, that you want to execute the `scrollBottom` action, and that you want the HTML rendering of the webpage [DOM](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction) after that. The `scrollBottom` action keeps scrolling to the bottom of a webpage until that webpage stops loading additional content, so that you get all 100 quotes, and not only the first 10.

Now run your new `quotes_toscrape_com_scroll_browser` spider with `-O quotes.csv`. `quotes.csv` will have the same data as before, only that now it has been generated through browser rendering.

### Use network capture

What if you could have the best from both worlds, i.e. use browser rendering to avoid reverse engineering, and get the API responses and not only what is loaded into the DOM?

You will now ask Zyte API to use network capture to render the page contents and *capture* the API responses.
Create a file at `web_scraping_tutorial/spiders/quotes_toscrape_com_scroll_capture.py` with the following code:

```python
import json
from base64 import b64decode

from scrapy import Request, Spider


class QuotesToScrapeComScrollCaptureSpider(Spider):
    name = "quotes_toscrape_com_scroll_capture"

    async def start(self):
        yield Request(
            "http://quotes.toscrape.com/scroll",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "scrollBottom",
                        },
                    ],
                    "networkCapture": [
                        {
                            "filterType": "url",
                            "httpResponseBody": True,
                            "value": "/api/",
                            "matchType": "contains",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for capture in response.raw_api_response["networkCapture"]:
            # Captured response bodies are Base64-encoded.
            text = b64decode(capture["httpResponseBody"]).decode()
            data = json.loads(text)
            for quote in data["quotes"]:
                yield {
                    "author": quote["author"]["name"],
                    "tags": quote["tags"],
                    "text": quote["text"],
                }
```

The specified metadata indicates that you want to capture the body of any network response that contains `/api/` in its URL.

Now run your new `quotes_toscrape_com_scroll_capture` spider with `-O quotes.csv`. `quotes.csv` will have the same data as before, only that now it has been generated through network capture.

Which option is best (reproducing JavaScript code manually, using browser-rendered HTML, or using network captures) depends on each scenario. To choose the right option, you need to factor in website specificity, development time, run time, request count, request cost, etc.

### Use an action sequence

Sometimes, it can be really hard to reproduce JavaScript code manually, or the resulting code can break too easily, making the browser automation option a clear winner.

You will now extract a quote from [http://quotes.toscrape.com/search.aspx](http://quotes.toscrape.com/search.aspx) by interacting with the search form through browser actions.
Create a file at `web_scraping_tutorial/spiders/quotes_toscrape_com_search.py` with the following code:

```python
from scrapy import Request, Spider


class QuotesToScrapeComSearchSpider(Spider):
    name = "quotes_toscrape_com_search"

    async def start(self):
        yield Request(
            "http://quotes.toscrape.com/search.aspx",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#author"},
                            "values": ["Albert Einstein"],
                        },
                        {
                            "action": "waitForSelector",
                            "selector": {
                                "type": "css",
                                "value": "[value=\"world\"]",
                                "state": "attached",
                            },
                        },
                        {
                            "action": "select",
                            "selector": {"type": "css", "value": "#tag"},
                            "values": ["world"],
                        },
                        {
                            "action": "click",
                            "selector": {"type": "css", "value": "[type='submit']"},
                        },
                        {
                            "action": "waitForSelector",
                            "selector": {"type": "css", "value": ".quote"},
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
                "text": quote.css(".content::text").get()[1:-1],
            }
```

The code above sends a request that makes Zyte API load [http://quotes.toscrape.com/search.aspx](http://quotes.toscrape.com/search.aspx) and perform the following actions:

1. Select Albert Einstein as author.
2. Wait for the “world” tag to load.
3. Select the “world” tag.
4. Click the **Search** button.
5. Wait for a quote to load.

From the HTML rendering of the DOM after those actions are executed, your code extracts all displayed quotes.

Now run your new `quotes_toscrape_com_search` spider with `-O quotes.csv`. `quotes.csv` will have 1 quote from Albert Einstein about the world.

If you were to try and write alternative code that, instead of relying on the browser HTML feature from Zyte API, reproduces the underlying JavaScript code with regular requests, it may take you a while to build a working solution, and your solution may be more fragile, i.e. more likely to break with server code changes.
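Because the action sequence is plain data, it can also be generated. The sketch below wraps the five actions from the spider above in a function so that the author and tag become parameters. `build_search_actions` is a hypothetical helper written for this illustration, not part of Scrapy or scrapy-zyte-api, and it assumes the same selectors as the tutorial spider.

```python
def build_search_actions(author: str, tag: str) -> list[dict]:
    """Return the tutorial's browser actions for a given author and tag.

    Hypothetical helper for illustration only.
    """
    return [
        # 1. Pick the author in the author drop-down.
        {
            "action": "select",
            "selector": {"type": "css", "value": "#author"},
            "values": [author],
        },
        # 2. Wait until the option for the requested tag is in the DOM.
        {
            "action": "waitForSelector",
            "selector": {
                "type": "css",
                "value": f'[value="{tag}"]',
                "state": "attached",
            },
        },
        # 3. Pick the tag in the tag drop-down.
        {
            "action": "select",
            "selector": {"type": "css", "value": "#tag"},
            "values": [tag],
        },
        # 4. Submit the search form.
        {
            "action": "click",
            "selector": {"type": "css", "value": "[type='submit']"},
        },
        # 5. Wait for at least one quote to be rendered.
        {
            "action": "waitForSelector",
            "selector": {"type": "css", "value": ".quote"},
        },
    ]


actions = build_search_actions("Albert Einstein", "world")
```

In a spider, the returned list would be used as the `"actions"` value inside `zyte_api_automap`. This sketch does not escape quotes in `tag`, so it is only suitable for simple tag values like the one in the tutorial.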
Continue to the next chapter to learn how you can avoid the need to write and maintain parsing code in the first place.

## Automate parsing

Now that you are familiar with browser automation, it is time to learn about *parsing automation*.

> ###### TIP
>
> This page covers AI-powered parsing on every request. See also the
> tutorials of ai-code for an alternative approach where AI is used to
> generate parsing code instead. See also zapi-extract-vs-ai-code.

Your first spider parsed 3 fields from book webpages of [books.toscrape.com](http://books.toscrape.com/): `name`, `price`, `url`. When targeting other websites, there are 2 challenges you are going to face:

- You will probably want more fields.

  For example, our [automatic extraction product schema](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/product) has more than 25 fields.

  You need to write parsing logic for every combination of target field *and* target website. ai-code can speed up this work significantly, but they cannot make it go away.

- Websites change, and when they do they can break your parsing code.

  You need to monitor your web scraping project for breaking website changes, and update your parsing code accordingly when they occur.

These issues are time-consuming and scale up with additional fields and websites. To avoid them altogether, you can let Zyte API handle parsing for you.
Create a file at `web_scraping_tutorial/spiders/books_toscrape_com_extract.py` with the following code:

```python
from scrapy import Spider


class BooksToScrapeComExtractSpider(Spider):
    name = "books_toscrape_com_extract"
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY": 0.01,
    }
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        next_page_links = response.css(".next a")
        yield from response.follow_all(next_page_links)
        book_links = response.css("article a")
        for request in response.follow_all(book_links, callback=self.parse_book):
            # Ask Zyte API for automatically extracted product data.
            request.meta["zyte_api_automap"] = {"product": True}
            yield request

    def parse_book(self, response):
        # No parsing code: yield the product data as returned by Zyte API.
        yield response.raw_api_response["product"]
```

The code above is a modification of your first spider that uses automatic extraction, where:

- In requests for book URLs, at the end of the `parse` callback method, you include request metadata to have Zyte API give you structured data for an e-commerce product.
- The `parse_book` callback method yields the product data from the Zyte API response.

Now run your new `books_toscrape_com_extract` spider with `-O books.csv`. Your code will now extract many more fields from each book, all without you having to write a single line of parsing code.

> ###### NOTE
>
> zapi-extract requires you to specify the kind of data you
> want to extract.
>
> Your spider above uses [product](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/product) to request the data of a
> single e-commerce product, but automatic extraction supports many
> other types of data extraction.
>
> For example, if you need to extract a news article or a blog post, use the
> [article](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/article) data extraction type instead.

This concludes our web scraping tutorial.
The tutorial code is available [on GitHub](https://github.com/zytedata/web-scraping-tutorial-project). To learn more, check out our web scraping guides, our documentation for Zyte API and Scrapy Cloud, and the Scrapy documentation. You can also visit our [Support Center](https://support.zyte.com/support/home) or reach out to the wider [web scraping](https://discord.gg/GjB8dHCCJS) and [Scrapy](https://scrapy.org/community/) communities. ## Tutorial for Zyte Web Data for Claude Code With [Claude Code](https://code.claude.com/docs/en/overview) and Zyte Web Data for Claude Code, you can write better web scraping code faster: 1. Install Zyte Web Data for Claude Code. 2. Create a new folder and start a **Claude Code** session: ```shell mkdir claude-tutorial cd claude-tutorial claude ``` 3. Prompt **Claude Code** to: > Scrape books.toscrape.com **Claude Code** will take care of the rest, interacting with you only when needed. For example: - It will ask you how to get a detail page to analyze. You can, for example, choose to **Explore the site** to let Claude Code find a detail page on its own. - It will analyze detail pages and propose fields to extract with example data, and you can adjust the extracted data schema to fit your preferences. It will also later on give you the option to open a **browser review**, a local web app where you can review data extracted with the earlier schema and provide feedback to the model about it. ## Tutorial for Web Scraping Copilot With [GitHub Copilot](https://github.com/features/copilot) and Web Scraping Copilot you can write maintainable web scraping code using AI: > ###### WARNING > > The **GitHub Copilot Free** plan is *not* recommended for > AI-assisted web scraping. See codegen-requirements for details. > ##### 1. Set up a project > > Prepare Visual Studio Code and a Scrapy project with all prerequisites. > ##### 2. Generate parsing code > > Use AI to generate parsing code for a webpage. > ##### 3. 
Generate crawling code
>
> Use AI to generate crawling code for a website.

## Set up an AI web scraping project

Set up a project for AI-assisted web scraping:

1. Install Web Scraping Copilot.
2. On the Web Scraping Copilot sidebar view, select **Start building › Create new project**.
3. On the **Create new Scrapy project** page, set the **Scrapy project name** to `copilot-tutorial`, select a projects folder, and click **Create**.

Your new `copilot-tutorial` workspace will be created and set up with the following folders and files:

```
copilot-tutorial/
├── .venv/
│   └── …
├── copilot_tutorial/
│   ├── pages/
│   │   └── __init__.py
│   ├── spiders/
│   │   └── __init__.py
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   └── settings.py
└── scrapy.cfg
```

You now have everything you need to start generating parsing code with AI.

## Generate parsing code with AI

Now that your project is ready, you will use AI to generate code to parse book webpages from [https://books.toscrape.com](https://books.toscrape.com).

### 1. Generate an item class

First, you need to define the type of data that you want to parse from each book page.

> ###### TIP
>
> Select a capable model in the chat view, e.g. **GPT-5** or similar. **GPT-5 mini** is OK if you prefer a non-premium model. GPT-4.1 is problematic for web scraping.

Ask the AI to:

> Define a dataclass item called Book with title, price and url fields. Make them optional and of type str | None.

The AI should edit `copilot-tutorial/copilot_tutorial/items.py` to add:

`copilot-tutorial/copilot_tutorial/items.py`

```python
from dataclasses import dataclass


@dataclass
class Book:
    url: str | None = None
    title: str | None = None
    price: str | None = None
```

> ###### TIP
>
> You can use any item type supported by Scrapy; `dataclass` is one of many options.
>
> You can also use a pre-made item type from zyte-common-items, like `zyte_common_items.Product`, instead of writing your own item type from scratch.
### 2. Generate parsing code Select **Web Scraping Copilot › Page Objects › Generate Parsing Code with AI**: ![image](_static/copilot/generate-0.1.0.png) The chat view will open with the **WebScraping** agent, a prompt will be sent, and the AI will start assisting. It should: 1. Ask you for some **input**. It usually detects the right item type to use and the right path to save your page objects (more on them later), but it always needs you to specify **example target URLs**. You are generating a page object for book detail pages, so choose a few such URLs and share them in chat. For example: [https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) [https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html](https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html) [https://books.toscrape.com/catalogue/soumission_998/index.html](https://books.toscrape.com/catalogue/soumission_998/index.html) 2. Create `copilot-tutorial/copilot_tutorial/pages/books_toscrape_com.py` with something like: ```python from copilot_tutorial.items import Book from web_poet import Returns, WebPage, field, handle_urls @handle_urls("books.toscrape.com") class BooksToscrapeComBookPage(WebPage, Returns[Book]): pass ``` > ###### NOTE > > This is a page object class. It > defines how to extract a given type of data > (e.g. `Book`) from a given URL pattern (e.g. > the `books.toscrape.com` domain). 3. Generate tests for the target example URLs. > ###### NOTE > > web-poet tests are example inputs and > expected outputs for a page object class. They can also assert that a > given input should raise an expected exception. You can use them to > test your code, and the AI can use them to generate the right parsing > code. 4. Populate test expectations. 5. Generate parsing code for your new page object class. 6. 
Run the generated tests to check that the generated parsing code extracts the expected data. By the end, you should have a working page object class that can extract book data from any book URL from [https://books.toscrape.com](https://books.toscrape.com). ### 3. Create a spider Now that you have a working page object, it is time to implement a Scrapy spider that uses it. Create the following file: `copilot-tutorial/copilot_tutorial/spiders/books.py` ```python from scrapy import Request, Spider from copilot_tutorial.items import Book class BookSpider(Spider): name = "book" url: str async def start(self): yield Request(self.url, callback=self.parse_book) async def parse_book(self, _, book: Book): yield book ``` The spider expects a `url` argument, which you can pass to a spider with the `-a url=` syntax. When a request targets the `parse_book` callback, scrapy-poet sees the `Book` type hint and injects a `book` parameter built with your page object class. Your spider can now extract book data from any book details page from [https://books.toscrape.com](https://books.toscrape.com). For example, try running your `book` spider with the following arguments: ```bash -a url=https://books.toscrape.com/catalogue/soumission_998/index.html -o books.jsonl ``` It will generate a `books.jsonl` file with the following JSON object: ```json { "url": "https://books.toscrape.com/catalogue/soumission_998/index.html", "title": "Soumission", "price": "50.10" } ``` You can also repeat step 2 for other book stores, and this spider will also work for them, no need to have separate spiders per website. Continue to the next chapter to use AI to generate *crawling* code, to be able to write a spider that can crawl an entire book store, and not just a single book URL. 
## Generate crawling code with AI Now that you have generated parsing code with AI, you will use AI to generate book URL discovery code by parsing navigation webpages (homepage, categories) from [https://books.toscrape.com](https://books.toscrape.com). ### 1. Generate navigation code This time around, you can try generating an item class and a page object with a single prompt: > Now I want you to create a new item type, BookNavigation, for navigation > data from the homepage and categories with the following optional fields: > url, book_urls, next_page_url. > Then create a page object for that item type and books.toscrape.com. The workflow will be the same as before. The generated item should look something like: `copilot-tutorial/copilot_tutorial/items.py` ```python @dataclass class BookNavigation: url: str | None = None book_urls: list[str] | None = None next_page_url: str | None = None ``` As input URLs, use some variety. For example: [https://books.toscrape.com/index.html](https://books.toscrape.com/index.html) [https://books.toscrape.com/catalogue/category/books/paranormal_24/index.html](https://books.toscrape.com/catalogue/category/books/paranormal_24/index.html) [https://books.toscrape.com/catalogue/category/books/mystery_3/index.html](https://books.toscrape.com/catalogue/category/books/mystery_3/index.html) [https://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html](https://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html) [https://books.toscrape.com/catalogue/category/books/historical-fiction_4/page-2.html](https://books.toscrape.com/catalogue/category/books/historical-fiction_4/page-2.html) ### 2. 
Create a crawling spider

Finally, add a new spider to `copilot-tutorial/copilot_tutorial/spiders/books.py` that uses the new `BookNavigation` item to implement crawling:

```python
from copilot_tutorial.items import BookNavigation


class BookNavigationSpider(Spider):
    name = "books"
    url: str

    async def start(self):
        yield Request(self.url, callback=self.parse_navigation)

    async def parse_navigation(self, response, navigation: BookNavigation):
        # Follow pagination first, then every book found on this page.
        if navigation.next_page_url:
            yield response.follow(navigation.next_page_url, callback=self.parse_navigation)
        for url in navigation.book_urls or []:
            yield response.follow(url, callback=self.parse_book)

    async def parse_book(self, _, book: Book):
        yield book
```

Your new spider expects a navigation page as its `url` argument, and can follow pagination and extract all relevant books.

Before you run it, however, you should add the following at the end of `copilot-tutorial/copilot_tutorial/settings.py`:

`copilot-tutorial/copilot_tutorial/settings.py`

```python
DOWNLOAD_SLOTS = {
    "books.toscrape.com": {
        "delay": 0.01,
        "concurrency": 16,
    },
}
```

By default, Scrapy rate-limits requests. For [https://books.toscrape.com](https://books.toscrape.com), however, it is safe to use a higher concurrency and a lower delay, which makes the spider run much faster.

Now run the `books` spider with the following **Arguments**:

```bash
-a url=https://books.toscrape.com/catalogue/category/books/mystery_3/index.html -o books.jsonl
```

It will add the 32 books from the [Mystery category](https://books.toscrape.com/catalogue/category/books/mystery_3/index.html) to `books.jsonl`.

### Next steps

Congratulations! You have successfully used AI to generate maintainable web scraping code.

This concludes our **Web Scraping Copilot** tutorial. If you are wondering what to do next, consider enabling Zyte API to avoid bans. See the Zyte API part of the web scraping tutorial.
## Web scraping guides > ##### Exporting scraped data > > Learn to download or export your scraped data however and wherever you > like. ## Exporting scraped data How would you like to get your scraped data? > ##### Scrapy Cloud > > Download from Scrapy Cloud. > ##### File storage > > Use Scrapy to export to a file storage service, like Amazon S3 or > Google Cloud Storage. > ##### Item storage > > Use Scrapy to export to a database, message queue, indexer, or similar > service. ### File storage Choose to which file storage service you wish to export using [Scrapy](https://scrapy.org): > ##### Amazon S3 > ##### Azure Storage > ##### Dropbox > ##### FTP servers > ##### Google Cloud Storage > ##### Google Drive > ##### Google Sheets > ##### SFTP servers File storage exporting with Scrapy also provides [many options](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options), including: [batching](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_EXPORT_BATCH_ITEM_COUNT), [field customization](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_EXPORT_FIELDS), [item filtering](https://docs.scrapy.org/en/latest/topics/feed-exports.html#item-filter), [compression](https://docs.scrapy.org/en/latest/topics/feed-exports.html#post-processing). You can also create your own [Scrapy storage backend](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-backends). Check the [code of existing storage backends](https://github.com/scrapy/scrapy/blob/721df895f9ea9d8073c13fbd2f75a6fbdc75ffc7/scrapy/extensions/feedexport.py#L258-L281) to learn more. 
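To illustrate how some of the feed options listed above combine, here is a minimal `settings.py` sketch with a single feed that uses batching, field customization, and compression. The output path and the field names (`url`, `title`, `price`) are assumptions made for the example; adapt them to your own items and storage location.

```python
# settings.py (sketch): one feed that combines several feed options.
FEEDS = {
    # When batching is enabled, the URI needs a batch placeholder such as
    # %(batch_id)d so that each batch is written to a different file.
    "exports/books-%(batch_id)d.jsonl.gz": {
        "format": "jsonlines",
        # Batching: start a new file every 1000 items.
        "batch_item_count": 1000,
        # Field customization: export only these fields, in this order.
        "fields": ["url", "title", "price"],
        # Compression: gzip each file through a post-processing plugin.
        "postprocessing": ["scrapy.extensions.postprocessing.GzipPlugin"],
    },
}
```

The same options should work when the key is the URI of a remote file storage service instead of a local path.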
### Item storage Choose to which item storage service you wish to export using [Scrapy](https://scrapy.org): > ##### Google BigQuery You can also [create a custom Scrapy item pipeline](https://docs.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline) to implement item-based storage, for example using an existing Python asyncio client library for a [database](https://github.com/timofurrer/awesome-asyncio#database-drivers) or a [message queue](https://github.com/timofurrer/awesome-asyncio#message-queues) service. ## Exporting to Amazon S3 with Scrapy To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to [Amazon S3](https://aws.amazon.com/pm/serv-s3/): 1. Install [boto3](https://github.com/boto/boto3): ```bash pip install boto3 ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none boto3 ``` 2. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet. The value of `FEEDS` must be a JSON object (`{}`). If you have `FEEDS` already defined with key-value pairs, you can keep those if you want — `FEEDS` supports exporting data to multiple file storage service locations. 
   To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file:

   settings.py

   ```python
   FEEDS = {}
   ```

   To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable:

   spiders/myspider.py

   ```python
   class MySpider:
       custom_settings = {
           "FEEDS": {},
       }
   ```

3. Add the following key-value pair to `FEEDS`:

   ```python
   {
       "s3://<bucket>/<path>": {
           "format": "<format>"
       }
   }
   ```

   Where:

   - `<bucket>` is your [bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html) name, e.g. `mybucket`.

   - `<path>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`.

     The path can include [placeholders](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-uri-parameters) that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.

     > ###### WARNING
     >
     > Any pre-existing file in the specified path will be overwritten. [Amazon S3 does not support appending to a file](https://stackoverflow.com/a/41783997).

   - `<format>` is the desired [output file format](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats).

     Possible values include: `csv`, `json`, `jsonlines`, `xml`. You can also [implement support for more formats](https://docs.scrapy.org/en/latest/topics/exporters.html).

     > ###### WARNING
     >
     > If you export in CSV format, and in your spider code you yield items as Python dictionaries, only the fields present on the first yielded item are exported for all items.
> > One solution is to [customize output fields](https://docs.scrapy.org/en/latest/topics/exporters.html#scrapy.exporters.BaseItemExporter.fields_to_export) through the `fields` [feed > option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or > through the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all > fields to export. > > You can alternatively yield something other than a Python dictionary that > supports declaring all possible fields, such as an [Item object](https://docs.scrapy.org/en/latest/topics/items.html#item-objects) or an > [attrs object](https://docs.scrapy.org/en/latest/topics/items.html#attr-s-objects). 4. Define the [AWS_ACCESS_KEY_ID](https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-AWS_ACCESS_KEY_ID) and [AWS_SECRET_ACCESS_KEY](https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-AWS_SECRET_ACCESS_KEY) Scrapy settings with your [access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html): settings.py ```python AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE" AWS_SECRET_ACCESS_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" ``` You can alternatively define the [AWS_SESSION_TOKEN](https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-AWS_SESSION_TOKEN) setting to configure access with [temporary security credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#temporary-access-keys). [Additional settings](https://docs.scrapy.org/en/latest/topics/feed-exports.html#s3) exist to define a target region, a custom access-control list, or a custom endpoint. Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Amazon S3 location. 
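Putting the steps above together, a complete configuration might look like the following sketch. The bucket name, path, and field names are example values, and reading the credentials from environment variables is one possible design choice rather than a requirement. Note the trailing `s` in `%(time)s` and `%(name)s`, which the placeholder syntax requires.

```python
# settings.py (sketch): export scraped items to Amazon S3 as CSV.
import os

FEEDS = {
    # "mybucket" and the path are example values. %(name)s is replaced by
    # the spider name and %(time)s by a timestamp, so each run writes to
    # a new file instead of overwriting the previous one.
    "s3://mybucket/scraped/%(name)s-%(time)s.csv": {
        "format": "csv",
        # Declare the CSV columns explicitly, so that the output does not
        # depend on which fields the first yielded item happens to have.
        "fields": ["url", "title", "price"],
    },
}

# Read the access key from the environment instead of hardcoding it.
# Unset variables yield None, which is also the default of these settings.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
```

Keeping secrets out of `settings.py` makes it safer to commit the file to version control; on Scrapy Cloud you can define the same settings through the project settings instead.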
## Exporting to Azure Storage with Scrapy To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to [Azure Storage](https://learn.microsoft.com/en-us/azure/storage/): 1. You need Python 3.8 or higher. If you are using Scrapy Cloud, make sure you are using [stack](https://support.zyte.com/support/solutions/articles/22000200402-changing-the-deploy-environment-with-scrapy-cloud-stacks) `scrapy:1.7-py38` or higher. Using the latest stack (`scrapy:2.14-20260217`) is generally recommended. 2. Install [scrapy-feedexporter-azure-storage](https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage): ```bash pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none scrapy-feedexporter-azure-storage @ git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage ``` 3. In your `settings.py` file, define `FEED_STORAGES` as follows: settings.py ```python FEED_STORAGES = { "azure": "scrapy_azure_exporter.AzureFeedStorage", } ``` If the setting already exists in your `settings.py` file, modify the existing setting to add the key-value pair above, instead of re-defining the setting. 4. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet. The value of `FEEDS` must be a JSON object (`{}`). If you have `FEEDS` already defined with key-value pairs, you can keep those if you want — `FEEDS` supports exporting data to multiple file storage service locations. 
   To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file:

   settings.py

   ```python
   FEEDS = {}
   ```

   To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable:

   spiders/myspider.py

   ```python
   class MySpider:
       custom_settings = {
           "FEEDS": {},
       }
   ```

5. Add the following key-value pair to `FEEDS`:

   ```python
   {
       "azure://<account>.blob.core.windows.net/<container>/<path>": {
           "format": "<format>"
       }
   }
   ```

   Where:

   - `<account>` is the name of your [storage account](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction#storage-accounts), e.g. `myaccount`.

   - `<container>` is the name of your [container](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction#containers), e.g. `mycontainer`.

   - `<path>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`.

     The path can include [placeholders](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-uri-parameters) that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.

   - `<format>` is the desired [output file format](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats).

     Possible values include: `csv`, `json`, `jsonlines`, `xml`. You can also [implement support for more formats](https://docs.scrapy.org/en/latest/topics/exporters.html).

     > ###### WARNING
     >
     > If you export in CSV format, and in your spider code you yield items as Python dictionaries, only the fields present on the first yielded item are exported for all items.
     >
     > One solution is to [customize output fields](https://docs.scrapy.org/en/latest/topics/exporters.html#scrapy.exporters.BaseItemExporter.fields_to_export) through the `fields` [feed option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or through the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all fields to export.
     >
     > You can alternatively yield something other than a Python dictionary that supports declaring all possible fields, such as an [Item object](https://docs.scrapy.org/en/latest/topics/items.html#item-objects) or an [attrs object](https://docs.scrapy.org/en/latest/topics/items.html#attr-s-objects).

6. Define the `AZURE_ACCOUNT_URL` and `AZURE_ACCOUNT_KEY` settings with your [credentials](https://learn.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python#types-of-credentials):

   settings.py

   ```python
   AZURE_ACCOUNT_URL = "https://<account>.blob.core.windows.net"
   AZURE_ACCOUNT_KEY = "<account key>"
   ```

   You can alternatively set the `AZURE_CONNECTION_STRING` setting to a [connection string](https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string):

   settings.py

   ```python
   AZURE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=xxxx;AccountKey=xxxx;EndpointSuffix=core.windows.net"
   ```

   Or, if you have an [account URL that includes a SAS token](https://learn.microsoft.com/en-us/azure/ai-services/translator/document-translation/how-to-guides/create-sas-tokens), use the `AZURE_ACCOUNT_URL_WITH_SAS_TOKEN` setting instead:

   settings.py

   ```python
   AZURE_ACCOUNT_URL_WITH_SAS_TOKEN = "https://my.blob.core.windows.net/source-en/source-english.docx?sv=2019-12-12&st=2021-01-26T18%3A30%3A20Z&se=2021-02-05T18%3A30%3A00Z&sr=c&sp=rl&sig=d7PZKyQsIeE6xb%2B1M4Yb56I%2FEEKoNIF65D%2Fs0IFsYcE%3D"
   ```

Running your spider
now, locally or on Scrapy Cloud, will export your scraped data to the configured Azure Storage location. ## Exporting to Dropbox with Scrapy To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to [Dropbox](https://www.dropbox.com/): 1. Install [scrapy-feedexporter-dropbox](https://github.com/scrapy-plugins/scrapy-feedexporter-dropbox): ```bash pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-dropbox ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none scrapy-feedexporter-dropbox @ git+https://github.com/scrapy-plugins/scrapy-feedexporter-dropbox ``` 2. In your `settings.py` file, define `FEED_STORAGES` as follows: settings.py ```python FEED_STORAGES = { "dropbox": "scrapy_dropbox.DropboxFeedStorage", } ``` If the setting already exists in your `settings.py` file, modify the existing setting to add the key-value pair above, instead of re-defining the setting. 3. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet. The value of `FEEDS` must be a JSON object (`{}`). If you have `FEEDS` already defined with key-value pairs, you can keep those if you want — `FEEDS` supports exporting data to multiple file storage service locations. 
   To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file:

   settings.py

   ```python
   FEEDS = {}
   ```

   To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable:

   spiders/myspider.py

   ```python
   class MySpider:
       custom_settings = {
           "FEEDS": {},
       }
   ```

4. Add the following key-value pair to `FEEDS`:

   ```python
   {
       "dropbox://<path>": {
           "format": "<format>"
       }
   }
   ```

   Where:

   - `<path>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`.

     The path can include [placeholders](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-uri-parameters) that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.

   - `<format>` is the desired [output file format](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats).

     Possible values include: `csv`, `json`, `jsonlines`, `xml`. You can also [implement support for more formats](https://docs.scrapy.org/en/latest/topics/exporters.html).

     > ###### WARNING
     >
     > If you export in CSV format, and in your spider code you yield items as Python dictionaries, only the fields present on the first yielded item are exported for all items.
     >
     > One solution is to [customize output fields](https://docs.scrapy.org/en/latest/topics/exporters.html#scrapy.exporters.BaseItemExporter.fields_to_export) through the `fields` [feed option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or through the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all fields to export.
     >
     > You can alternatively yield something other than a Python dictionary that supports declaring all possible fields, such as an [Item object](https://docs.scrapy.org/en/latest/topics/items.html#item-objects) or an [attrs object](https://docs.scrapy.org/en/latest/topics/items.html#attr-s-objects).

5. Define the `DROPBOX_API_TOKEN` setting with your [access token](https://dropbox.tech/developers/generate-an-access-token-for-your-own-account):

   settings.py

   ```python
   DROPBOX_API_TOKEN = "<access token>"
   ```

Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Dropbox location.

## Exporting to an FTP server with Scrapy

> ###### NOTE
>
> Not to be confused with [SFTP](https://en.wikipedia.org/wiki/SSH_File_Transfer_Protocol) (see the guide on exporting to SFTP servers) or [FTPS](https://en.wikipedia.org/wiki/FTPS).

To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to an [FTP server](https://en.wikipedia.org/wiki/File_Transfer_Protocol):

1. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet.

   The value of `FEEDS` must be a JSON object (`{}`).

   If you have `FEEDS` already defined with key-value pairs, you can keep those if you want, because `FEEDS` supports exporting data to multiple file storage service locations.
   To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file:

   settings.py

   ```python
   FEEDS = {}
   ```

   To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable:

   spiders/myspider.py

   ```python
   class MySpider:
       custom_settings = {
           "FEEDS": {},
       }
   ```

2. Add the following key-value pair to `FEEDS`:

   ```python
   {
       "ftp://<user>:<password>@<host>/<path>": {
           "format": "<format>"
       }
   }
   ```

   Where:

   - `<user>` and `<password>` are your credentials for the FTP server, [percent-encoded](https://en.wikipedia.org/wiki/Percent-encoding).

   - `<host>` is the FTP server host, e.g. `ftp.example.com` or `203.0.113.123`.

   - `<path>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`.

     The path can include [placeholders](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-uri-parameters) that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.

   - `<format>` is the desired [output file format](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats).

     Possible values include: `csv`, `json`, `jsonlines`, `xml`. You can also [implement support for more formats](https://docs.scrapy.org/en/latest/topics/exporters.html).

     > ###### WARNING
     >
     > If you export in CSV format, and in your spider code you yield items as Python dictionaries, only the fields present on the first yielded item are exported for all items.
> > One solution is to [customize output fields](https://docs.scrapy.org/en/latest/topics/exporters.html#scrapy.exporters.BaseItemExporter.fields_to_export) through the `fields` [feed > option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or > through the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all > fields to export. > > You can alternatively yield something other than a Python dictionary that > supports declaring all possible fields, such as an [Item object](https://docs.scrapy.org/en/latest/topics/items.html#item-objects) or an > [attrs object](https://docs.scrapy.org/en/latest/topics/items.html#attr-s-objects). Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured FTP server location. ## Exporting to Google Cloud Storage with Scrapy To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to [Google Cloud Storage](https://cloud.google.com/storage): 1. Install [google-cloud-storage](https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python): ```bash pip install google-cloud-storage ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none google-cloud-storage ``` 2. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet. The value of `FEEDS` must be a JSON object (`{}`). If you have `FEEDS` already defined with key-value pairs, you can keep those if you want — `FEEDS` supports exporting data to multiple file storage service locations. 
   To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file:

   settings.py

   ```python
   FEEDS = {}
   ```

   To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable:

   spiders/myspider.py

   ```python
   class MySpider:
       custom_settings = {
           "FEEDS": {},
       }
   ```

3. Add the following key-value pair to `FEEDS`:

   ```python
   {
       "gs://<bucket>/<path>": {
           "format": "<format>"
       }
   }
   ```

   Where:

   - `<bucket>` is your [bucket](https://cloud.google.com/storage/docs/buckets) name, e.g. `mybucket`.

   - `<path>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`.

     The path can include [placeholders](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-uri-parameters) that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.

     > ###### WARNING
     >
     > Any pre-existing file in the specified path will be overwritten. [Google Cloud Storage does not support appending to a file](https://cloud.google.com/storage/docs/objects#immutability).

   - `<format>` is the desired [output file format](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats).

     Possible values include: `csv`, `json`, `jsonlines`, `xml`. You can also [implement support for more formats](https://docs.scrapy.org/en/latest/topics/exporters.html).

     > ###### WARNING
     >
     > If you export in CSV format, and in your spider code you yield items as Python dictionaries, only the fields present on the first yielded item are exported for all items.
> > One solution is to [customize output fields](https://docs.scrapy.org/en/latest/topics/exporters.html#scrapy.exporters.BaseItemExporter.fields_to_export) through the `fields` [feed > option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or > through the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all > fields to export. > > You can alternatively yield something other than a Python dictionary that > supports declaring all possible fields, such as an [Item object](https://docs.scrapy.org/en/latest/topics/items.html#item-objects) or an > [attrs object](https://docs.scrapy.org/en/latest/topics/items.html#attr-s-objects). 4. [Configure credential provision to ADC](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to). Also define the [GCS_PROJECT_ID](https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-GCS_PROJECT_ID) Scrapy setting with your [project ID](https://cloud.google.com/resource-manager/docs/creating-managing-projects): settings.py ```python GCS_PROJECT_ID = "myproject" ``` [Additional settings](https://docs.scrapy.org/en/latest/topics/feed-exports.html#google-cloud-storage-gcs) exist to define, for example, a custom access-control list. Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Google Cloud Storage location. ## Exporting to Google Drive with Scrapy To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to [Google Drive](https://www.google.com/drive/): 1. You need Python 3.8 or higher. If you are using Scrapy Cloud, make sure you are using [stack](https://support.zyte.com/support/solutions/articles/22000200402-changing-the-deploy-environment-with-scrapy-cloud-stacks) `scrapy:1.7-py38` or higher. 
Using the latest stack (`scrapy:2.14-20260217`) is generally recommended. 2. Install [scrapy-feedexporter-google-drive](https://github.com/scrapy-plugins/scrapy-feedexporter-google-drive): ```bash pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-google-drive ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none scrapy-feedexporter-google-drive @ git+https://github.com/scrapy-plugins/scrapy-feedexporter-google-drive ``` 3. In your `settings.py` file, define `FEED_STORAGES` as follows: settings.py ```python FEED_STORAGES = { "gdrive": "scrapy_gdrive_exporter.gdrive_exporter.GoogleDriveFeedStorage", } ``` If the setting already exists in your `settings.py` file, modify the existing setting to add the key-value pair above, instead of re-defining the setting. 4. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet. The value of `FEEDS` must be a JSON object (`{}`). If you have `FEEDS` already defined with key-value pairs, you can keep those if you want — `FEEDS` supports exporting data to multiple file storage service locations. To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file: settings.py ```python FEEDS = {} ``` To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable: spiders/myspider.py ```python class MySpider: custom_settings = { "FEEDS": {}, } ``` 5. 
   Add the following key-value pair to `FEEDS`:

   ```python
   {
       "gdrive://drive.google.com/<folder-id>/<path>": {
           "format": "<format>"
       }
   }
   ```

   Where:

   - `<folder-id>` is the ID of the target root folder, e.g. `1uWBpSBe3CvF8u21qTrzDqjZ6uexample`.

     > ###### TIP
     >
     > When you are inside a folder in Google Drive, the URL ends with the folder ID, e.g.
     > `https://drive.google.com/drive/folders/1uWBpSBe3CvF8u21qTrzDqjZ6uexample`.

   - `<path>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`. The path can include [placeholders](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-uri-parameters) that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.

     > ###### NOTE
     >
     > [scrapy-feedexporter-google-drive](https://github.com/scrapy-plugins/scrapy-feedexporter-google-drive) does not support
     > overwriting or appending to files; it can only create a new file
     > every time.

   - `<format>` is the desired [output file format](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats). Possible values include: `csv`, `json`, `jsonlines`, `xml`. You can also [implement support for more formats](https://docs.scrapy.org/en/latest/topics/exporters.html).

     > ###### WARNING
     >
     > If you export in CSV format, and in your spider code you yield
     > items as Python dictionaries, only the fields present on the first yielded
     > item are exported for all items.
     >
     > One solution is to [customize output fields](https://docs.scrapy.org/en/latest/topics/exporters.html#scrapy.exporters.BaseItemExporter.fields_to_export) through the `fields` [feed
     > option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or
     > through the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all
     > fields to export.
> > You can alternatively yield something other than a Python dictionary that > supports declaring all possible fields, such as an [Item object](https://docs.scrapy.org/en/latest/topics/items.html#item-objects) or an > [attrs object](https://docs.scrapy.org/en/latest/topics/items.html#attr-s-objects). 6. Define the `GDRIVE_SERVICE_ACCOUNT_CREDENTIALS_JSON` setting as a Python string containing your [service account credentials](https://developers.google.com/identity/protocols/oauth2/service-account) in JSON format: settings.py ```python GDRIVE_SERVICE_ACCOUNT_CREDENTIALS_JSON = '{ "type": "service_account", "project_id": "myproject", "private_key_id": "…", "private_key": "…", "client_email": "…@email.iam.gserviceaccount.com", "client_id": "…", "auth_uri": "…", "token_uri": "…", "auth_provider_x509_cert_url": "…", "client_x509_cert_url": "…" }' ``` Make sure you give your service account write access on the target folder. You can do that by sharing the folder with the email of the service account (`client_email` in the JSON above). Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Google Drive location. ## Exporting to Google Sheets with Scrapy To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to [Google Sheets](https://www.google.com/sheets/about/): 1. You need Python 3.8 or higher. If you are using Scrapy Cloud, make sure you are using [stack](https://support.zyte.com/support/solutions/articles/22000200402-changing-the-deploy-environment-with-scrapy-cloud-stacks) `scrapy:1.7-py38` or higher. Using the latest stack (`scrapy:2.14-20260217`) is generally recommended. 2. 
Install [scrapy-feedexporter-google-sheets](https://github.com/scrapy-plugins/scrapy-feedexporter-google-sheets): ```bash pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-google-sheets ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none scrapy_google_sheets_exporter @ git+https://github.com/scrapy-plugins/scrapy-feedexporter-google-sheets ``` 3. In your `settings.py` file, define `FEED_STORAGES` as follows: settings.py ```python FEED_STORAGES = { "gsheets": "scrapy_google_sheets_exporter.gsheets_exporter.GoogleSheetsFeedStorage", } ``` If the setting already exists in your `settings.py` file, modify the existing setting to add the key-value pair above, instead of re-defining the setting. 4. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet. The value of `FEEDS` must be a JSON object (`{}`). If you have `FEEDS` already defined with key-value pairs, you can keep those if you want — `FEEDS` supports exporting data to multiple file storage service locations. To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file: settings.py ```python FEEDS = {} ``` To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable: spiders/myspider.py ```python class MySpider: custom_settings = { "FEEDS": {}, } ``` 5. 
   Add the following key-value pair to `FEEDS`:

   ```python
   {
       "gsheets://docs.google.com/spreadsheets/d/<spreadsheet-id>/edit#gid=<worksheet-id>": {
           "format": "csv"
       }
   }
   ```

   Where:

   - You can find the right values for `<spreadsheet-id>` and `<worksheet-id>` in the URL when you are looking at the target worksheet, e.g. `https://docs.google.com/spreadsheets/d/1fWJgq5yuOdeN3YnkBZiTD0VhB1MLzBNomz0s9YwBREo/edit#gid=1261678709`.

     > ###### NOTE
     >
     > If `/edit#gid=<worksheet-id>` is omitted, the first
     > worksheet is used.

   To append to an existing worksheet, you should also:

   - Use the `fields` [feed option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all fields to export, in the expected order.

   - Set `item_export_kwargs.include_headers_line` to `False`, so that the header row is not written again.

   For example:

   ```python
   {
       "gsheets://docs.google.com/spreadsheets/d/<spreadsheet-id>/edit#gid=<worksheet-id>": {
           "format": "csv",
           "fields": ["field1", "field2"],
           "item_export_kwargs": {"include_headers_line": False}
       }
   }
   ```

6. Define the `GOOGLE_CREDENTIALS` setting as a Python dictionary containing your [service account credentials](https://developers.google.com/identity/protocols/oauth2/service-account) in JSON format:

   settings.py
   ```python
   GOOGLE_CREDENTIALS = {
       "type": "service_account",
       "project_id": "myproject",
       "private_key_id": "…",
       "private_key": "…",
       "client_email": "…@email.iam.gserviceaccount.com",
       "client_id": "…",
       "auth_uri": "…",
       "token_uri": "…",
       "auth_provider_x509_cert_url": "…",
       "client_x509_cert_url": "…"
   }
   ```

   Make sure you give your service account write access on the target spreadsheet. You can do that by sharing the spreadsheet with the email of the service account (`client_email` in the JSON above).

Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Google Sheets worksheet.
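
As a minimal sketch of the append configuration described above, the helper below assembles the `FEEDS` entry from a spreadsheet ID and a worksheet ID. The function name `gsheets_feed` and its parameters are hypothetical and not part of scrapy-feedexporter-google-sheets; only the URI layout and the feed options are taken from the steps above, and the IDs are the ones from the example URL.

```python
def gsheets_feed(spreadsheet_id, worksheet_id=None, fields=None, append=False):
    """Build a FEEDS entry for the gsheets:// storage (hypothetical helper)."""
    uri = f"gsheets://docs.google.com/spreadsheets/d/{spreadsheet_id}"
    if worksheet_id is not None:
        # Without the /edit#gid=... suffix, the first worksheet is used.
        uri += f"/edit#gid={worksheet_id}"
    options = {"format": "csv"}
    if fields:
        # Declare every exported field explicitly, in the expected order.
        options["fields"] = list(fields)
    if append:
        # Do not write the header row when appending to an existing worksheet.
        options["item_export_kwargs"] = {"include_headers_line": False}
    return {uri: options}


# IDs taken from the example worksheet URL above.
FEEDS = gsheets_feed(
    "1fWJgq5yuOdeN3YnkBZiTD0VhB1MLzBNomz0s9YwBREo",
    worksheet_id="1261678709",
    fields=["field1", "field2"],
    append=True,
)
```

The resulting dictionary can be assigned to `FEEDS` in `settings.py` or placed under the `"FEEDS"` key of `custom_settings`.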
## Exporting to an SFTP server with Scrapy > ###### NOTE > > Not to be confused with [FTP](https://en.wikipedia.org/wiki/File_Transfer_Protocol) (see ftp) > or [FTPS](https://en.wikipedia.org/wiki/FTPS). To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to an [SFTP server](https://en.wikipedia.org/wiki/SSH_File_Transfer_Protocol): 1. Install [scrapy-feedexporter-sftp](https://github.com/scrapy-plugins/scrapy-feedexporter-sftp): ```bash pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-sftp ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none scrapy-feedexporter-sftp @ git+https://github.com/scrapy-plugins/scrapy-feedexporter-sftp ``` 2. In your `settings.py` file, define `FEED_STORAGES` as follows: settings.py ```python FEED_STORAGES = { "sftp": "scrapy_feedexporter_sftp.SFTPFeedStorage", } ``` If the setting already exists in your `settings.py` file, modify the existing setting to add the key-value pair above, instead of re-defining the setting. 3. Add a [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS) setting to your project or spider, if not added yet. The value of `FEEDS` must be a JSON object (`{}`). If you have `FEEDS` already defined with key-value pairs, you can keep those if you want — `FEEDS` supports exporting data to multiple file storage service locations. 
   To add `FEEDS` to a project, define it in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud) or add it to your `settings.py` file:

   settings.py
   ```python
   FEEDS = {}
   ```

   To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the **Settings** tab) or add it to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings) method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable:

   spiders/myspider.py
   ```python
   class MySpider:
       custom_settings = {
           "FEEDS": {},
       }
   ```

4. Add the following key-value pair to `FEEDS`:

   ```python
   {
       "sftp://<username>:<password>@<host>/<path>": {
           "format": "<format>"
       }
   }
   ```

   Where:

   - `<username>` and `<password>` are your credentials for the SFTP server, [percent-encoded](https://en.wikipedia.org/wiki/Percent-encoding).

   - `<host>` is the SFTP server host, e.g. `sftp.example.com` or `203.0.113.123`.

   - `<path>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`. The path can include [placeholders](https://docs.scrapy.org/en/latest/topics/feed-exports.html#storage-uri-parameters) that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.

   - `<format>` is the desired [output file format](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats). Possible values include: `csv`, `json`, `jsonlines`, `xml`. You can also [implement support for more formats](https://docs.scrapy.org/en/latest/topics/exporters.html).

     > ###### WARNING
     >
     > If you export in CSV format, and in your spider code you yield
     > items as Python dictionaries, only the fields present on the first yielded
     > item are exported for all items.
> > One solution is to [customize output fields](https://docs.scrapy.org/en/latest/topics/exporters.html#scrapy.exporters.BaseItemExporter.fields_to_export) through the `fields` [feed > option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options) of [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) or > through the [FEED_EXPORT_FIELDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields) Scrapy setting to explicitly indicate all > fields to export. > > You can alternatively yield something other than a Python dictionary that > supports declaring all possible fields, such as an [Item object](https://docs.scrapy.org/en/latest/topics/items.html#item-objects) or an > [attrs object](https://docs.scrapy.org/en/latest/topics/items.html#attr-s-objects). Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured SFTP server location. ## Exporting to Google BigQuery with Scrapy To configure a [Scrapy](https://scrapy.org) project or spider to export scraped data to [Google BigQuery](https://cloud.google.com/bigquery/): 1. You need Python 3.7 or higher and Scrapy 2.4 or higher. If you are using Scrapy Cloud, make sure you are using [stack](https://support.zyte.com/support/solutions/articles/22000200402-changing-the-deploy-environment-with-scrapy-cloud-stacks) `scrapy:2.4` or higher. Using the latest stack (`scrapy:2.14-20260217`) is generally recommended. 2. Install [scrapy-bigquery](https://github.com/8W9aG/scrapy-bigquery): ```bash pip install scrapy-bigquery ``` If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file: ```none scrapy-bigquery ``` 3. Define the `BIGQUERY_DATASET` and `BIGQUERY_TABLE` Scrapy settings to point to the target table. 
   For example:

   settings.py
   ```python
   BIGQUERY_DATASET = "my-dataset"
   BIGQUERY_TABLE = "my-table"
   ```

   [Additional settings](https://github.com/8W9aG/scrapy-bigquery#bigquery_add_scraped_time-optional) are available.

   > ###### TIP
   >
   > To add Scrapy settings to a project, define them in your [Scrapy
   > Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud)
   > or add them to your `settings.py` file.
   >
   > settings.py
   > ```python
   > MY_SETTING = ...
   > ```
   >
   > To add settings to a spider, define them in your Scrapy Cloud
   > spider-specific settings (open a spider in Scrapy Cloud and select the
   > **Settings** tab) or add them to your spider code with the [update_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.update_settings)
   > method or the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class variable:
   >
   > spiders/myspider.py
   > ```python
   > class MySpider:
   >     custom_settings = {
   >         "MY_SETTING": ...,
   >     }
   > ```

4. Define the `BIGQUERY_SERVICE_ACCOUNT` setting as a string with your [service account credentials](https://developers.google.com/identity/protocols/oauth2/service-account) in base64-encoded JSON format:

   settings.py
   ```python
   BIGQUERY_SERVICE_ACCOUNT = "eyJ0eX=="
   ```

   You can use the following command to generate the required value from your service account JSON file:

   ```shell
   cat service-account.json | jq . -c | base64
   ```

   Make sure you give your service account write access on the target table. You can do that by sharing the table with the email of the service account (`client_email` in the service account JSON).

Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Google BigQuery table.
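
If `jq` is not available, the value can also be produced in Python. The following is a minimal sketch, assuming that only the decoded JSON matters to scrapy-bigquery; the helper name `encode_service_account` is hypothetical, and the sample credentials are made up and incomplete.

```python
import base64
import json


def encode_service_account(raw_json: str) -> str:
    """Return the credentials as compact JSON, Base64-encoded on one line."""
    # Re-serialize without extra whitespace, similar to what `jq . -c` prints.
    compact = json.dumps(json.loads(raw_json), separators=(",", ":"))
    return base64.b64encode(compact.encode("utf-8")).decode("ascii")


# Made-up, incomplete credentials, for illustration only.
sample = '{\n  "type": "service_account",\n  "project_id": "myproject"\n}'
BIGQUERY_SERVICE_ACCOUNT = encode_service_account(sample)
```

With a real key file, pass the text read from `service-account.json` instead of `sample`. Unlike the shell pipeline, this sketch does not encode a trailing newline and never wraps the output across lines, so the result may not be byte-identical, but it decodes to equivalent JSON.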
## Get started with Zyte API Zyte API is a [web scraping API](https://www.zyte.com/zyte-web-scraping-api/) that avoids bans, enables browser automation, enables automatic extraction, and much more, all cost-efficiently. ### Get started > ##### Sign up > > Sign up now and get $5 free for a month. > > [Try for free!](https://app.zyte.com/account/signup/zyteapi) > ##### Follow the tutorial > > Complete the web scraping tutorial covering Zyte API. ### Learn more > ##### Usage > > Learn to use Zyte API. > ##### Reference > > See the complete API reference. > ##### Proxy mode > > Use Zyte API as a proxy. > ##### Migrate > > Migrate your existing web scraping code. > ##### Zyte IDE > > Write browser scripts and build and debug requests interactively. ## Zyte API usage documentation ### Initial setup How would you prefer to use Zyte API? > ##### Scrapy > > Use scrapy-zyte-api (tutorial). > ##### Python > > Use [python-zyte-api](http://python-zyte-api.readthedocs.io/). > ##### HTTP clients > > `POST` to `https://api.zyte.com/v1/extract` with your [Zyte API key](https://app.zyte.com/o/zyte-api/api-access) and parameters: > > ```shell > curl \ > --user YOUR_ZYTE_API_KEY: \ > --header 'Content-Type: application/json' \ > --data '{"url": "https://toscrape.com", "httpResponseBody": true}' \ > --compressed \ > https://api.zyte.com/v1/extract > ``` > ##### Proxy mode > > Use `https://api.zyte.com:8011` as your proxy endpoint, with your [Zyte API key](https://app.zyte.com/o/zyte-api/api-access) and proxy headers: > > ```shell > curl \ > --proxy api.zyte.com:8011 \ > --proxy-user YOUR_ZYTE_API_KEY: \ > --compressed \ > https://toscrape.com > ``` > ###### TIP > > Learn about the different features of the > HTTP API and the proxy mode > before you choose one. > ###### TIP > > Got an SSL error? Install our CA certificate. ### Basic usage What do you want to do with Zyte API? 
> ##### HTTP requests > > Send low-level HTTP requests, with custom method, headers and body, > opt-out redirection following, device emulation, and more. > > HTTP > ##### Browser automation > > Get browser-rendered HTML, take screenshots, interact with pages, > capture background requests, and more. > > Browser > ##### Automatic extraction > > Get structured data from single pages or entire websites, and > enrich it with custom LLM prompts. > > Extraction > ##### Search API > > Search Google with a typed interface and get back structured > organic results — no URL construction needed. > > Search API #### Additional features Customize your Zyte API requests further to get what you want: > ##### Geolocation > > Choose a location of origin for your request. > ##### IP type > > Choose the type of IP address used by your request. > ##### Cookies > > Get and set cookies to reproduce requests and maintain sessions. > ##### Sessions > > Use the same IP address, cookie jar, network stack, etc. on multiple > requests. ### Advanced topics > ##### Proxy mode > > Use Zyte API as a proxy. > ##### Rate limits > > Requests-over-time and concurrency limits. > ##### Optimization > > Make the most out of Zyte API. > ##### Error handling > > Error response handling. > ##### API reference > > Complete API reference documentation. > ##### Stats API > > Check your Zyte API usage details. ## Zyte API HTTP requests To send HTTP requests through Zyte API, without browser rendering, set the [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody) request field to `true`, and read the [Base64](https://en.wikipedia.org/wiki/Base64)-encoded response body from the [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) response field. #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com"},
    {"httpResponseBody", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody);
```

#### CLI client

input.jsonl
```json
{"url": "https://toscrape.com", "httpResponseBody": true}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    > output.html
```

#### curl

input.json
```json
{
    "url": "https://toscrape.com",
    "httpResponseBody": true
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    > output.html
```

#### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import
java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of("url", "https://toscrape.com", "httpResponseBody", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 
'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com', httpResponseBody: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) }) ```

#### PHP

```php
<?php

// Assumes the Guzzle HTTP client (guzzlehttp/guzzle) is installed and autoloaded.
$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://toscrape.com',
        'httpResponseBody' => true,
    ],
]);
$data = json_decode($response->getBody());
$http_response_body = base64_decode($data->httpResponseBody);
```

#### Proxy mode

With the proxy mode, you always get a response body.

```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com \
    > output.html
```

#### Python

```python
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "httpResponseBody": True,
    },
)
http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"])
```

#### Python client

```python
import asyncio
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://toscrape.com",
            "httpResponseBody": True,
        }
    )
    http_response_body = b64decode(api_response["httpResponseBody"]).decode()
    print(http_response_body)


asyncio.run(main())
```

#### Scrapy

In transparent mode, when you target a text resource (e.g.
HTML, JSON), regular Scrapy requests work out of the box: ```python from scrapy import Spider class ToScrapeSpider(Spider): name = "toscrape_com" start_urls = ["https://toscrape.com"] def parse(self, response): http_response_text: str = response.text ``` While regular Scrapy requests also work for binary responses at the moment, they may stop working in future versions of scrapy-zyte-api, so passing [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody) is recommended when targeting binary resources: ```python from scrapy import Request, Spider class ToScrapeSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={ "zyte_api_automap": { "httpResponseBody": True, }, }, ) def parse(self, response): http_response_body: bytes = response.body ``` Output (first 5 lines): ```html Scraping Sandbox ``` For HTTP requests, Zyte API also supports: - HTTP request attributes for method, body, and headers. - Redirection, device emulation. - Geolocation, IP type, cookies, sessions, response headers, and metadata. > ###### TIP > > HTTP responses do not reflect HTML content rendered by a web browser > that executes JavaScript code. To get browser HTML, use a browser request. > See also zapi-raw-vs-browser. ### Request method HTTP requests use the `GET` [HTTP method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) by default. Use the [httpRequestMethod](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpRequestMethod) field to set a different HTTP method. > ###### TIP > > When using `POST`, `PUT` or similar, you probably want to also > set a request body. #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://httpbin.org/anything"},
    {"httpResponseBody", true},
    {"httpRequestMethod", "POST"}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody);

var responseData = JsonDocument.Parse(httpResponseBody);
var method = responseData.RootElement.GetProperty("method").ToString();
```

#### CLI client

input.jsonl
```json
{"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST"}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    | jq .method
```

#### curl

input.json
```json
{
    "url": "https://httpbin.org/anything",
    "httpResponseBody": true,
    "httpRequestMethod": "POST"
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    | jq .method
```

#### Java ```java import
com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "httpResponseBody", true, "httpRequestMethod", "POST"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = 
JsonParser.parseString(httpResponseBody).getAsJsonObject(); String method = data.get("method").getAsString(); System.out.println(method); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/anything', httpResponseBody: true, httpRequestMethod: 'POST' }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const method = JSON.parse(httpResponseBody).method }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'httpResponseBody' => true, 'httpRequestMethod' => 'POST', ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $method = json_decode($http_response_body)->method; ``` #### Proxy mode With the proxy mode, the request method from your requests is used automatically. 
```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -X POST \ https://httpbin.org/anything \ | jq .method ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/anything", "httpResponseBody": True, "httpRequestMethod": "POST", }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) method = json.loads(http_response_body)["method"] ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/anything", "httpResponseBody": True, "httpRequestMethod": "POST", } ) http_response_body: bytes = b64decode(api_response["httpResponseBody"]) method = json.loads(http_response_body)["method"] print(method) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/anything", method="POST", ) def parse(self, response): method = json.loads(response.text)["method"] ``` Output: ```json "POST" ``` ### Request body To include a body in your request, use one of the following fields: - [httpRequestText](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpRequestText), for UTF-8-encoded text. - [httpRequestBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpRequestBody), for anything else. It supports binary data as well, so the value must be [Base64](https://en.wikipedia.org/wiki/Base64)-encoded. #### `httpRequestText` example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
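The value of `httpRequestText` is a string, so a JSON payload has to be serialized before it goes into the request. That is why the examples below contain escaped quotes. The following is a minimal sketch in Python, standard library only, of building such a request. The `build_text_request` helper name is hypothetical and not part of any Zyte library; it only assembles the parameters and sends nothing.

```python
import json


def build_text_request(url: str, payload: dict) -> dict:
    """Build Zyte API parameters that send `payload` as a JSON text body.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    return {
        "url": url,
        "httpResponseBody": True,
        "httpRequestMethod": "POST",
        # Serialize the payload first: httpRequestText expects a string,
        # not a nested JSON object.
        "httpRequestText": json.dumps(payload),
    }


request = build_text_request("https://httpbin.org/anything", {"foo": "bar"})
# Serializing the whole request escapes the inner quotes.
print(json.dumps(request))
```

The resulting dictionary can be passed as the JSON body of a Zyte API request, as the Python examples in this section do.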
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/anything"}, {"httpResponseBody", true}, {"httpRequestMethod", "POST"}, {"httpRequestText", "{\"foo\": \"bar\"}"} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var requestBody = responseData.RootElement.GetProperty("data").ToString(); Console.WriteLine(requestBody); ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestText": "{\"foo\": \"bar\"}"} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .data ``` #### curl input.json ```json { "url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestText": "{\"foo\": \"bar\"}" } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: 
application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .data ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "httpResponseBody", true, "httpRequestMethod", "POST", "httpRequestText", "{\"foo\": \"bar\"}"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = 
jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); String body = data.get("data").getAsString(); System.out.println(body); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/anything', httpResponseBody: true, httpRequestMethod: 'POST', httpRequestText: '{"foo": "bar"}' }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const body = JSON.parse(httpResponseBody).data console.log(body) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'httpResponseBody' => true, 'httpRequestMethod' => 'POST', 'httpRequestText' => '{"foo": "bar"}', ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $body = json_decode($http_response_body)->data; echo $body.PHP_EOL; ``` #### Proxy mode With the proxy mode, the request body from your requests is used automatically, be it plain text or binary. 
```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -X POST \ -H "Content-Type: application/json" \ --data '{"foo": "bar"}' \ https://httpbin.org/anything \ | jq .data ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/anything", "httpResponseBody": True, "httpRequestMethod": "POST", "httpRequestText": '{"foo": "bar"}', }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) body: str = json.loads(http_response_body)["data"] print(body) ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/anything", "httpResponseBody": True, "httpRequestMethod": "POST", "httpRequestText": '{"foo": "bar"}', } ) http_response_body = b64decode(api_response["httpResponseBody"]) body = json.loads(http_response_body)["data"] print(body) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/anything", method="POST", body='{"foo": "bar"}', ) def parse(self, response): body = json.loads(response.body)["data"] print(body) ``` Output: ```json {"foo": "bar"} ``` #### `httpRequestBody` example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
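The examples below send `Zm9v`, which is the Base64 encoding of the bytes `foo`. The following is a minimal sketch in Python, standard library only, of producing a value suitable for `httpRequestBody` from arbitrary bytes. The `encode_request_body` name is hypothetical and not part of any Zyte library.

```python
from base64 import b64decode, b64encode


def encode_request_body(raw: bytes) -> str:
    """Return raw bytes as a Base64 ASCII string (hypothetical helper)."""
    return b64encode(raw).decode("ascii")


encoded = encode_request_body(b"foo")
print(encoded)
# Decoding the value gives back the original bytes.
print(b64decode(encoded))
```

The same helper works for binary payloads that are not valid UTF-8, which is the case `httpRequestBody` exists for.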
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/anything"}, {"httpResponseBody", true}, {"httpRequestMethod", "POST"}, {"httpRequestBody", "Zm9v"} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var requestBody = responseData.RootElement.GetProperty("data").ToString(); Console.WriteLine(requestBody); ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestBody": "Zm9v"} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .data ``` #### curl input.json ```json { "url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestBody": "Zm9v" } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ 
--compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .data ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "httpResponseBody", true, "httpRequestMethod", "POST", "httpRequestBody", "Zm9v"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] 
httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); String body = data.get("data").getAsString(); System.out.println(body); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/anything', httpResponseBody: true, httpRequestMethod: 'POST', httpRequestBody: 'Zm9v' }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const body = JSON.parse(httpResponseBody).data console.log(body) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'httpResponseBody' => true, 'httpRequestMethod' => 'POST', 'httpRequestBody' => 'Zm9v', ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $body = json_decode($http_response_body)->data; echo $body.PHP_EOL; ``` #### Proxy mode With the proxy mode, the request body from your requests is used automatically, be it plain text or binary. 
```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -X POST \ -H "Content-Type: application/octet-stream" \ --data foo \ https://httpbin.org/anything \ | jq .data ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/anything", "httpResponseBody": True, "httpRequestMethod": "POST", "httpRequestBody": "Zm9v", }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) body: str = json.loads(http_response_body)["data"] print(body) ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/anything", "httpResponseBody": True, "httpRequestMethod": "POST", "httpRequestBody": "Zm9v", } ) http_response_body: bytes = b64decode(api_response["httpResponseBody"]) body = json.loads(http_response_body)["data"] print(body) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/anything", method="POST", body=b"foo", ) def parse(self, response): body = json.loads(response.body)["data"] print(body) ``` Output: ```none foo ``` ### Request headers In HTTP requests, use [customHttpRequestHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customHttpRequestHeaders) to set request headers. You can set any header except `Cookie` (see zapi-cookies). > ###### TIP > > You can also set headers like `Accept`, `Accept-Encoding`, > `Accept-Language` or `User-Agent`, but it is usually best to let Zyte > API set those headers; it will use values consistent with the network stack > and other request parameters (e.g. 
device, > geolocation). #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/anything"}, {"httpResponseBody", true}, { "customHttpRequestHeaders", new List<Dictionary<string, string>>() { new Dictionary<string, string>() { {"name", "Accept-Language"}, {"value", "fa"} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var headerEnumerator = responseData.RootElement.GetProperty("headers").EnumerateObject(); var headers = new Dictionary<string, string>(); while (headerEnumerator.MoveNext()) { headers.Add( headerEnumerator.Current.Name.ToString(), headerEnumerator.Current.Value.ToString() ); } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/anything", "httpResponseBody": true, "customHttpRequestHeaders": [{"name": "Accept-Language", "value": "fa"}]} ``` ```shell zyte-api 
input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq .headers ``` #### curl input.json ```json { "url": "https://httpbin.org/anything", "httpResponseBody": true, "customHttpRequestHeaders": [ { "name": "Accept-Language", "value": "fa" } ] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq .headers ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, String> customHttpRequestHeader = ImmutableMap.of("name", "Accept-Language", "value", "fa"); Map<String, Object> parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "httpResponseBody", true, "customHttpRequestHeaders", Collections.singletonList(customHttpRequestHeader)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); 
request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); JsonObject headers = data.get("headers").getAsJsonObject(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(headers)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/anything', httpResponseBody: true, customHttpRequestHeaders: [ { name: 'Accept-Language', value: 'fa' } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const headers = JSON.parse(httpResponseBody).headers }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'httpResponseBody' => true, 'customHttpRequestHeaders' => [ [ 'name' => 'Accept-Language', 'value' => 'fa', ], ], ], ]); $api = json_decode($response->getBody()); $http_response_body = 
base64_decode($api->httpResponseBody); $data = json_decode($http_response_body); $headers = $data->headers; ``` #### Proxy mode With the proxy mode, the request headers from your requests are used automatically. ```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -H "Accept-Language: fa" \ https://httpbin.org/anything \ | jq .headers ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/anything", "httpResponseBody": True, "customHttpRequestHeaders": [ { "name": "Accept-Language", "value": "fa", }, ], }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) headers = json.loads(http_response_body)["headers"] ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/anything", "httpResponseBody": True, "customHttpRequestHeaders": [ { "name": "Accept-Language", "value": "fa", }, ], } ) http_response_body: bytes = b64decode(api_response["httpResponseBody"]) headers = json.loads(http_response_body)["headers"] print(json.dumps(headers, indent=2)) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/anything", headers={"Accept-Language": "fa"}, ) def parse(self, response): headers = json.loads(response.text)["headers"] ``` Output (first 5 lines): ```json { "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7", "Accept-Encoding": "gzip, deflate, br", "Accept-Language": "fa", "Host": "httpbin.org", ``` ### Redirection HTTP requests follow [HTTP 
redirection](https://developer.mozilla.org/en-US/docs/Web/HTTP/Redirections) by default. Set [followRedirect](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/followRedirect) to `false` to disable that behavior. > ###### NOTE > > Redirection works differently in browser requests. ### Device emulation In HTTP requests, use [device](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/device) to set the type of device to emulate for your request, either `desktop` (default) or `mobile`. This option exists because some websites return different content depending on the type of device used to access them. > ###### NOTE > > In a request where you set [device](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/device) to `mobile`, you > cannot use [sessionContextParameters.actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters.actions). #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/user-agent"}, {"httpResponseBody", true}, {"device", "mobile"} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var headerEnumerator = responseData.RootElement.EnumerateObject(); while (headerEnumerator.MoveNext()) { if (headerEnumerator.Current.Name.ToString() == "user-agent") { Console.WriteLine(headerEnumerator.Current.Value.ToString()); } } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/user-agent", "httpResponseBody": true, "device": "mobile"} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output '.["user-agent"]' ``` #### curl input.json ```json { "url": "https://httpbin.org/user-agent", "httpResponseBody": true, "device": "mobile" } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: 
application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output '.["user-agent"]' ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters = ImmutableMap.of( "url", "https://httpbin.org/user-agent", "httpResponseBody", true, "device", "mobile"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = 
jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); String userAgent = data.get("user-agent").getAsString(); System.out.println(userAgent); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/user-agent', httpResponseBody: true, device: 'mobile' }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) console.log(JSON.parse(httpResponseBody)['user-agent']) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/user-agent', 'httpResponseBody' => true, 'device' => 'mobile', ], ]); $api = json_decode($response->getBody()); $http_response_body = base64_decode($api->httpResponseBody); $data = json_decode($http_response_body); echo $data->{'user-agent'}.PHP_EOL; ``` #### Proxy mode With the proxy mode, use the `Zyte-Device` header. 
```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -H "Zyte-Device: mobile" \ https://httpbin.org/user-agent \ | jq --raw-output '.["user-agent"]' ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/user-agent", "httpResponseBody": True, "device": "mobile", }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) user_agent = json.loads(http_response_body)["user-agent"] print(user_agent) ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/user-agent", "httpResponseBody": True, "device": "mobile", } ) http_response_body: bytes = b64decode(api_response["httpResponseBody"]) user_agent = json.loads(http_response_body)["user-agent"] print(user_agent) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/user-agent", meta={ "zyte_api_automap": { "device": "mobile", } }, ) def parse(self, response): user_agent = json.loads(response.text)["user-agent"] print(user_agent) ``` Example output (may vary): ```none Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 ``` ### Submitting HTML forms While it may be easier to submit HTML forms using a browser request with actions, it is also possible to reproduce form-submission requests with HTTP requests. Reproducing an HTML form request usually requires: - Setting the right value of [httpRequestMethod](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpRequestMethod), often `POST`. 
- Setting the `Content-Type` header to `application/x-www-form-urlencoded` through [customHttpRequestHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customHttpRequestHeaders).
- Setting the right payload, i.e. key-value pairs set by the form.

  For `GET` requests, that means setting those key-value pairs in the URL query string.

  For `POST` requests, that means encoding those key-value pairs as a query string (without the starting `?`) and using that as [httpRequestText](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpRequestText) or [httpRequestBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpRequestBody).

> ###### TIP
>
> Your key-value pairs may need to include hidden form fields, often
> used for [CSRF tokens](https://en.wikipedia.org/wiki/Cross-site_request_forgery) or to keep
> the state of stateful pages (e.g. ASP.NET’s `__VIEWSTATE` field).

#### Example

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

In [https://quotes.toscrape.com/search.aspx](https://quotes.toscrape.com/search.aspx) you get an HTML form that could be stripped down to:

```html
```

When you select an **Author** (e.g. Albert Einstein), a form request is sent, and the **Tag** options fill up. To reproduce that:

#### C#

```cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using System.Xml.XPath;
using System.Web;
using HtmlAgilityPack;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

// First request: download the search page to read its hidden __VIEWSTATE field.
var input1 = new Dictionary<string, object>(){
    {"url", "https://quotes.toscrape.com/search.aspx"},
    {"httpResponseBody", true}
};
var inputJson1 = JsonSerializer.Serialize(input1);
var content1 = new StringContent(inputJson1, Encoding.UTF8, "application/json");

HttpResponseMessage response1 = await client.PostAsync("https://api.zyte.com/v1/extract", content1);
var body1 = await response1.Content.ReadAsByteArrayAsync();
var data1 = JsonDocument.Parse(body1);
var base64HttpResponseBody1 = data1.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBodyBytes1 = System.Convert.FromBase64String(base64HttpResponseBody1);
var httpResponseBody1 = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes1);
var htmlDocument1 = new HtmlDocument();
htmlDocument1.LoadHtml(httpResponseBody1);
var navigator1 = htmlDocument1.CreateNavigator();
var nodeIterator = (XPathNodeIterator)navigator1.Evaluate("//*[@name='__VIEWSTATE']/@value");
nodeIterator.MoveNext();
var viewState = nodeIterator.Current.ToString();

// Second request: reproduce the form submission as a URL-encoded POST body.
var httpRequestTextParameters = new Dictionary<string, string>
{
    { "author", "Albert Einstein" },
    { "tag", "----------" },
    { "__VIEWSTATE", viewState}
};
var httpRequestText = string.Join("&", httpRequestTextParameters.Select(kvp => $"{HttpUtility.UrlEncode(kvp.Key)}={HttpUtility.UrlEncode(kvp.Value)}"));

var input2 = new Dictionary<string, object>(){
    {"url", "https://quotes.toscrape.com/filter.aspx"},
    {"httpResponseBody", true},
    {"httpRequestMethod", "POST"},
    {
        "customHttpRequestHeaders",
        new List<Dictionary<string, string>>()
        {
            new Dictionary<string, string>()
            {
                {"name", "Content-Type"},
                {"value", "application/x-www-form-urlencoded"}
            }
        }
    },
    {"httpRequestText", httpRequestText}
};
var inputJson2 = JsonSerializer.Serialize(input2);
var content2 = new StringContent(inputJson2, Encoding.UTF8, "application/json");

HttpResponseMessage response2 = await client.PostAsync("https://api.zyte.com/v1/extract", content2);
var body2 = await response2.Content.ReadAsByteArrayAsync();
var data2 = JsonDocument.Parse(body2);
var base64HttpResponseBody2 = data2.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBodyBytes2 = System.Convert.FromBase64String(base64HttpResponseBody2);
var httpResponseBody2 = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes2);
var htmlDocument2 = new HtmlDocument();
htmlDocument2.LoadHtml(httpResponseBody2);
var navigator2 = htmlDocument2.CreateNavigator();
var nodeIterator2 = (XPathNodeIterator)navigator2.Evaluate("//*[@name='tag']//option");
int tagCount = 0;
while (nodeIterator2.MoveNext())
{
    tagCount++;
}
Console.WriteLine($"{tagCount}");
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.entity.UrlEncodedFormEntity;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.NameValuePair;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;
import org.apache.hc.core5.http.message.BasicNameValuePair;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    // First request: download the search page to read its hidden __VIEWSTATE field.
    Map<String, Object> parameters1 =
        ImmutableMap.of("url", "https://quotes.toscrape.com/search.aspx", "httpResponseBody", true);
    String requestBody1 = new Gson().toJson(parameters1);

    HttpPost request1 = new HttpPost("https://api.zyte.com/v1/extract");
    request1.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request1.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request1.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request1.setEntity(new StringEntity(requestBody1));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request1,
        (response1) -> {
          HttpEntity httpEntity1 = response1.getEntity();
          String httpApiResponse1 = EntityUtils.toString(httpEntity1, StandardCharsets.UTF_8);
          JsonObject httpJsonObject1 = JsonParser.parseString(httpApiResponse1).getAsJsonObject();
          String base64HttpResponseBody1 = httpJsonObject1.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes1 = Base64.getDecoder().decode(base64HttpResponseBody1);
          String httpResponseBody1 = new String(httpResponseBodyBytes1, StandardCharsets.UTF_8);
          Document document1 = Jsoup.parse(httpResponseBody1);
          String viewState = document1.select("[name='__VIEWSTATE']").attr("value");

          // Second request: reproduce the form submission as a URL-encoded POST body.
          Map<String, String> params =
              ImmutableMap.of(
                  "author", "Albert Einstein", "tag", "----------", "__VIEWSTATE", viewState);
          List<NameValuePair> formParams = new ArrayList<>();
          for (Map.Entry<String, String> entry : params.entrySet()) {
            formParams.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
          }
          UrlEncodedFormEntity entity =
              new UrlEncodedFormEntity(formParams, StandardCharsets.UTF_8);
          String httpRequestText = EntityUtils.toString(entity);

          Map<String, String> customHttpRequestHeader =
              ImmutableMap.of("name", "Content-Type", "value", "application/x-www-form-urlencoded");
          Map<String, Object> parameters2 =
              ImmutableMap.of(
                  "url",
                  "https://quotes.toscrape.com/filter.aspx",
                  "httpResponseBody",
                  true,
                  "httpRequestMethod",
                  "POST",
                  "customHttpRequestHeaders",
                  Collections.singletonList(customHttpRequestHeader),
                  "httpRequestText",
                  httpRequestText);
          String requestBody2 = new Gson().toJson(parameters2);

          HttpPost request2 = new HttpPost("https://api.zyte.com/v1/extract");
          request2.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
          request2.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
          request2.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
          request2.setEntity(new StringEntity(requestBody2));

          client.execute(
              request2,
              (response2) -> {
                HttpEntity httpEntity2 = response2.getEntity();
                String httpApiResponse2 = EntityUtils.toString(httpEntity2, StandardCharsets.UTF_8);
                JsonObject httpJsonObject2 =
                    JsonParser.parseString(httpApiResponse2).getAsJsonObject();
                String base64HttpResponseBody2 =
                    httpJsonObject2.get("httpResponseBody").getAsString();
                byte[] httpResponseBodyBytes2 = Base64.getDecoder().decode(base64HttpResponseBody2);
                String httpResponseBody2 =
                    new String(httpResponseBodyBytes2, StandardCharsets.UTF_8);
                Document document2 = Jsoup.parse(httpResponseBody2);
                Elements tags = document2.select("select[name='tag'] option");
                System.out.println(tags.size());
                return null;
              });

          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```
#### JS

```js
const axios = require('axios')
const cheerio = require('cheerio')
const querystring = require('querystring')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://quotes.toscrape.com/search.aspx',
    httpResponseBody: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
  const $ = cheerio.load(httpResponseBody)
  const viewState = $('[name="__VIEWSTATE"]').get(0).attribs.value
  const httpRequestText = querystring.stringify(
    {
      author: 'Albert Einstein',
      tag: '----------',
      __VIEWSTATE: viewState
    }
  )
  axios.post(
    'https://api.zyte.com/v1/extract',
    {
      url: 'https://quotes.toscrape.com/filter.aspx',
      httpResponseBody: true,
      httpRequestMethod: 'POST',
      customHttpRequestHeaders: [
        {
          name: 'Content-Type',
          value: 'application/x-www-form-urlencoded'
        }
      ],
      httpRequestText
    },
    {
      auth: { username: 'YOUR_ZYTE_API_KEY' }
    }
  ).then((response) => {
    const httpResponseBody = Buffer.from(
      response.data.httpResponseBody,
      'base64'
    )
    const $ = cheerio.load(httpResponseBody)
    console.log($('select[name="tag"] option').length)
  })
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response_1 = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://quotes.toscrape.com/search.aspx',
        'httpResponseBody' => true,
    ],
]);
$data = json_decode($response_1->getBody());
$http_response_body = base64_decode($data->httpResponseBody);
$doc = new DOMDocument();
$doc->loadHTML($http_response_body);
$xpath_1 = new DOMXPath($doc);
$view_state = $xpath_1->query('//*[@name="__VIEWSTATE"]/@value')->item(0)->nodeValue;
$http_request_text = http_build_query(
    [
        'author' => 'Albert Einstein',
        'tag' => '----------',
        '__VIEWSTATE' => $view_state,
    ]
);
$response_2 = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' =>
'https://quotes.toscrape.com/filter.aspx', 'httpResponseBody' => true, 'httpRequestMethod' => 'POST', 'customHttpRequestHeaders' => [ [ 'name' => 'Content-Type', 'value' => 'application/x-www-form-urlencoded', ], ], 'httpRequestText' => $http_request_text, ], ]); $data = json_decode($response_2->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $doc->loadHTML($http_response_body); $xpath_2 = new DOMXPath($doc); $tags = $xpath_2->query('//*[@name="tag"]/option'); echo count($tags).PHP_EOL; ``` #### Python Install form2request, which makes it easier to handle HTML forms in Python. Then: ```python from base64 import b64decode from form2request import form2request from parsel import Selector import requests api_response_1 = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://quotes.toscrape.com/search.aspx", "httpResponseBody": True, }, ) api_response_1_data = api_response_1.json() http_response_body_1 = b64decode(api_response_1_data["httpResponseBody"]) selector_1 = Selector(body=http_response_body_1, base_url=api_response_1_data["url"]) form = selector_1.css("form") request = form2request(form, {"author": "Albert Einstein"}, click=False) api_response_2 = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": request.url, "httpRequestMethod": request.method, "customHttpRequestHeaders": [ {"name": k, "value": v} for k, v in request.headers ], "httpRequestText": request.body.decode(), "httpResponseBody": True, }, ) http_response_body_2 = b64decode(api_response_2.json()["httpResponseBody"]) selector_2 = Selector(body=http_response_body_2) print(len(selector_2.css("select[name='tag'] option"))) ``` #### Python client Install form2request, which makes it easier to handle HTML forms in Python. 
Then: ```python import asyncio from base64 import b64decode from form2request import form2request from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response_1 = await client.get( { "url": "https://quotes.toscrape.com/search.aspx", "httpResponseBody": True, } ) http_response_body_1 = b64decode(api_response_1["httpResponseBody"]) selector_1 = Selector(body=http_response_body_1, base_url=api_response_1["url"]) form = selector_1.css("form") request = form2request(form, {"author": "Albert Einstein"}, click=False) api_response_2 = await client.get( { "url": request.url, "httpRequestMethod": request.method, "customHttpRequestHeaders": [ {"name": k, "value": v} for k, v in request.headers ], "httpRequestText": request.body.decode(), "httpResponseBody": True, } ) http_response_body_2 = b64decode(api_response_2["httpResponseBody"]) selector_2 = Selector(body=http_response_body_2) print(len(selector_2.css("select[name='tag'] option"))) asyncio.run(main()) ``` #### Scrapy Install form2request, which makes it easier to handle HTML forms in Scrapy. Then, use it and let transparent mode take care of the rest: ```python from form2request import form2request from scrapy import Spider class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" start_urls = ["https://quotes.toscrape.com/search.aspx"] def parse(self, response): form = response.css("form") request = form2request(form, {"author": "Albert Einstein"}, click=False) yield request.to_scrapy(callback=self.parse_tags) def parse_tags(self, response): print(len(response.css("select[name='tag'] option"))) ``` Output (number of **Tag** options): ```json 25 ``` ### Decoding HTML HTML extracted as a response body needs to be decoded. HTML content can be encoded with one of many character encodings, and you must determine the character encoding used so that you can decode that HTML content accordingly. 
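As a rough illustration of what determining the character encoding involves, the sketch below picks an encoding from a byte order mark, then from the `Content-Type` header, then from a `<meta>` declaration near the start of the body. It is a deliberately simplified approximation and not the HTML standard's algorithm; the function names `charset_from_content_type` and `guess_html_encoding`, the 1024-byte scan window, and the UTF-8 fallback are assumptions made for this example.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()


def guess_html_encoding(body: bytes, content_type: Optional[str] = None) -> str:
    """Simplified guess of the encoding of an HTML body (hypothetical helper)."""
    # 1. A UTF-8 byte order mark takes precedence over everything else.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # 2. Next, trust the charset declared in the Content-Type response header.
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    # 3. Otherwise, look for a <meta> charset declaration near the start.
    match = re.search(rb"<meta[^>]+charset=[\"']?([\w-]+)", body[:1024], re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Fall back to UTF-8 (an assumption of this sketch).
    return "utf-8"


if __name__ == "__main__":
    html = '<html><head><meta charset="windows-1252"></head><body>café</body></html>'
    raw = html.encode("cp1252")
    encoding = guess_html_encoding(raw)
    print(encoding)
    print(raw.decode(encoding))
```

With a Zyte API response, `body` would be the Base64-decoded `httpResponseBody` and `content_type` the value of the `Content-Type` entry in `httpResponseHeaders`, if you requested response headers.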
The best way to determine the encoding of HTML content is to follow the [encoding sniffing algorithm](https://html.spec.whatwg.org/#determining-the-character-encoding) defined in the HTML standard.

In addition to the HTML content, the HTML encoding sniffing algorithm takes into account any character encoding provided in the optional `charset` parameter of media types declared in the `Content-Type` response header. If you follow the HTML encoding sniffing algorithm, make sure you get the response headers in addition to the response body.

#### Example

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### curl

Use [file](https://www.darwinsys.com/file/) to find the character encoding of a previously-downloaded response based solely on its body (i.e. not following the HTML encoding sniffing algorithm).

```shell
file --mime-encoding output.html
```

#### JS

Use [content-type-parser](https://www.npmjs.com/package/content-type-parser), [html-encoding-sniffer](https://www.npmjs.com/package/html-encoding-sniffer) and [whatwg-encoding](https://www.npmjs.com/package/whatwg-encoding):

```js
const contentTypeParser = require('content-type-parser')
const htmlEncodingSniffer = require('html-encoding-sniffer')
const whatwgEncoding = require('whatwg-encoding')

// …

const httpResponseHeaders = response.data.httpResponseHeaders
let contentTypeCharset
httpResponseHeaders.forEach(function (item) {
  if (item.name.toLowerCase() === 'content-type') {
    contentTypeCharset = contentTypeParser(item.value).get('charset')
  }
})
const httpResponseBody = Buffer.from(response.data.httpResponseBody, 'base64')
const encoding = htmlEncodingSniffer(httpResponseBody, {
  transportLayerEncodingLabel: contentTypeCharset
})
const html = whatwgEncoding.decode(httpResponseBody, encoding)
```

#### Python

[web-poet](https://web-poet.readthedocs.io/en/stable/index.html) provides a response wrapper that automatically decodes the response body
following an encoding sniffing algorithm similar to the one defined in the HTML standard.

Provided that you have extracted a response with both body and headers, and you have Base64-decoded the response body, you can decode the HTML bytes as follows:

```python
from web_poet import HttpResponse

# …

headers = tuple(
    (item['name'], item['value'])
    for item in http_response_headers
)
response = HttpResponse(
    url='https://example.com',
    body=http_response_body,
    status=200,
    headers=headers,
)
html = response.text
```

#### Scrapy

In transparent mode, regular Scrapy requests targeting HTML resources decode them by default. See zapi-text.

### HTML and browser HTML

HTML found in [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) is usually different from HTML found in [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml) (browser HTML):

- [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) does not reflect changes that a webpage makes at run time using JavaScript, such as loading content from additional URLs, or moving or reformatting content within the webpage.
- [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml) includes a normalization of the HTML from the underlying HTTP response, which web browsers perform according to the HTML5 specification.

  So the content of HTML and browser HTML could be different even when there is no JavaScript involved.

  Parsing HTML from [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) with libraries that do not implement HTML5 parsing, such as [lxml.html](https://lxml.de/lxmlhtml.html) (used by [Scrapy](https://scrapy.org/) by default), results in a different tree structure.
  With an HTML5-compatible parser the resulting tree structure would be the same, provided JavaScript does not cause any other difference.

Because of these differences, switching between these HTML inputs can break your existing parsing code and require changes, such as updating XPath or CSS selectors.

## Zyte API browser automation

You can use browser automation through Zyte API to get browser-rendered HTML, screenshots, or both.

For browser requests, Zyte API also supports:

- Actions, network capture, request headers, redirection, and toggling JavaScript.
- Geolocation, IP type, cookies, sessions, redirection, response headers, and metadata.

Unlike HTTP requests, browser requests do not support:

- An HTTP request method, body, or header other than Referer.

  > ###### NOTE
  >
  > This only affects the initial request. During a browser request,
  > as a result of redirection, JavaScript, or actions, additional requests may be sent with no
  > limitation on method, body or headers, and may be captured.

- Returning non-HTML response data, other than a screenshot.

All browser request features are also available for automatic extraction requests that use a browser request as extraction source.

### Browser HTML

Browser HTML is the HTML representation of the [Document Object Model](https://en.wikipedia.org/wiki/Document_Object_Model) (DOM) of a webpage after it has been rendered in a browser.

To get browser HTML, set the [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/browserHtml) request field to `true`.

The [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml) response field is the browser HTML as a string.

> ###### NOTE
>
> By default, [iframes](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) in [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml) are empty.
> Set [includeIframes](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/includeIframes) to `true` to embed iframe content in
> `browserHtml`.
>
> To access content from the [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM), check out the corresponding
> example under zapi-actions.

See also zapi-raw-vs-browser.

#### Example

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com"},
    {"browserHtml", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var browserHtml = data.RootElement.GetProperty("browserHtml").ToString();
```

#### CLI client

input.jsonl

```json
{"url": "https://toscrape.com", "browserHtml": true}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .browserHtml
```

#### curl

input.json

```json
{
    "url": "https://toscrape.com",
    "browserHtml": true
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of("url", "https://toscrape.com", "browserHtml", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); System.out.println(browserHtml); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = 
Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://toscrape.com',
    browserHtml: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const browserHtml = response.data.browserHtml
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://toscrape.com',
        'browserHtml' => true,
    ],
]);
$api = json_decode($response->getBody());
$browser_html = $api->browserHtml;
```

#### Proxy mode

```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -H "Zyte-Browser-Html: true" \
    https://toscrape.com
```

#### Python

```python
import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "browserHtml": True,
    },
)
browser_html: str = api_response.json()["browserHtml"]
```

#### Python client

```python
import asyncio

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://toscrape.com",
            "browserHtml": True,
        }
    )
    print(api_response["browserHtml"])


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    async def start(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                },
            },
        )

    def parse(self, response):
        browser_html: str = response.text
```

Output (first 5 lines):

```html
Scraping Sandbox
```

### Screenshot

To get a webpage screenshot in browser requests, set the [screenshot](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/screenshot) request field to `true`.
The [screenshot](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/screenshot) response field is the [Base64](https://en.wikipedia.org/wiki/Base64)-encoded screenshot file data.

#### Example

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com"},
    {"screenshot", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64Screenshot = data.RootElement.GetProperty("screenshot").ToString();
var screenshot = System.Convert.FromBase64String(base64Screenshot);
```

#### CLI client

input.jsonl

```json
{"url": "https://toscrape.com", "screenshot": true}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .screenshot \
    | base64 --decode \
    > screenshot.jpg
```

#### curl

input.json

```json
{
    "url": "https://toscrape.com",
    "screenshot": true
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .screenshot \
    | base64 \
--decode \ > screenshot.jpg ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.FileOutputStream; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of("url", "https://toscrape.com", "screenshot", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64Screenshot = jsonObject.get("screenshot").getAsString(); byte[] screenshot = Base64.getDecoder().decode(base64Screenshot); try (FileOutputStream fos = new FileOutputStream("screenshot.jpg")) { fos.write(screenshot); } return 
null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com', screenshot: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const screenshot = Buffer.from(response.data.screenshot, 'base64') }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://toscrape.com', 'screenshot' => true, ], ]); $api = json_decode($response->getBody()); $screenshot = base64_decode($api->screenshot); ``` #### Python ```python from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com", "screenshot": True, }, ) screenshot: bytes = b64decode(api_response.json()["screenshot"]) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://toscrape.com", "screenshot": True, } ) screenshot = b64decode(api_response["screenshot"]) with open("screenshot.jpg", "wb") as f: f.write(screenshot) asyncio.run(main()) ``` #### Scrapy ```python from base64 import b64decode from scrapy import Request, Spider class ToScrapeComSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={ "zyte_api_automap": { "screenshot": True, }, }, ) def parse(self, response): screenshot: bytes = b64decode(response.raw_api_response["screenshot"]) ``` Output: ![](zyte-api/usage/code-examples/output/screenshot.jpg) ### Actions In browser requests, use the
[actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/actions) request field to define a sequence of browser actions to perform before output generation. > ###### SEE ALSO > > Web scraping tutorial (tutorial-actions). #### Example: scrollBottom > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://quotes.toscrape.com/scroll"}, {"browserHtml", true}, { "actions", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"action", "scrollBottom"} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var quoteCount = (double)navigator.Evaluate("count(//*[@class=\"quote\"])"); ``` #### CLI client input.jsonl ```json {"url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{"action": "scrollBottom"}]} ``` ```shell zyte-api input.jsonl \ |
jq --raw-output .browserHtml \ | xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null ``` #### curl input.json ```json { "url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [ { "action": "scrollBottom" } ] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml \ | xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map action = ImmutableMap.of("action", "scrollBottom"); Map parameters = ImmutableMap.of( "url", "https://quotes.toscrape.com/scroll", "browserHtml", true, "actions", Collections.singletonList(action)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, 
deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); Document document = Jsoup.parse(browserHtml); int quoteCount = document.select(".quote").size(); System.out.println(quoteCount); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://quotes.toscrape.com/scroll', browserHtml: true, actions: [ { action: 'scrollBottom' } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const browserHtml = response.data.browserHtml const $ = cheerio.load(browserHtml) const quoteCount = $('.quote').length }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://quotes.toscrape.com/scroll', 'browserHtml' => true, 'actions' => [ ['action' => 'scrollBottom'], ], ], ]); $data = json_decode($response->getBody()); $doc = new DOMDocument(); $doc->loadHTML($data->browserHtml); $xpath = new DOMXPath($doc); $quote_count = $xpath->query("//*[@class='quote']")->count(); ``` #### Python ```python import requests from parsel import Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "actions": [ { "action":
"scrollBottom", }, ], }, ) browser_html = api_response.json()["browserHtml"] quote_count = len(Selector(browser_html).css(".quote")) ``` #### Python client ```python import asyncio from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, ) browser_html = api_response["browserHtml"] quote_count = len(Selector(browser_html).css(".quote")) print(quote_count) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" async def start(self): yield Request( "https://quotes.toscrape.com/scroll", meta={ "zyte_api_automap": { "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, }, ) def parse(self, response): quote_count = len(response.css(".quote")) ``` Output: ```none 100 ``` #### Example: Read from the [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM) > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. To get content from the [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM), use the `evaluate` action to create an invisible DOM element, which you will get in [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml), and fill it with the desired content from the shadow DOM. > ###### TIP > > If your `evaluate` action does not work as expected, check the > [actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/actions) response field for errors. 
The following example code shows how to access the shadow DOM paragraph from [a shadow DOM example in CodePen](https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=) using the `evaluate` action with the following `source`: ```js const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') // Hide, in case you also want to take a screenshot. div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) ``` #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view="}, {"browserHtml", true}, { "actions", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"action", "evaluate"}, {"source", @" const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) "} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await
client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator.Evaluate("//*[@id=\"shadow-root-content\"]/text()"); nodeIterator.MoveNext(); var shadowText = nodeIterator.Current.ToString(); Console.WriteLine(shadowText); ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map actions = ImmutableMap.of( "action", "evaluate", "source", "const div = document.createElement('div')\n" + "div.setAttribute('id', 'shadow-root-content')\n" + "div.style.display = 'none'\n" + "const iframe = document.getElementById('result')\n" + "div.innerText = iframe\n" + " .contentWindow.document\n" + " .getElementById('shadow-root')\n" + " .shadowRoot.querySelector('p').textContent\n" + 
"document.body.appendChild(div)"); Map parameters = ImmutableMap.of( "url", "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", "browserHtml", true, "actions", Collections.singletonList(actions)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); Document document = Jsoup.parse(browserHtml); String shadowText = document.select("#shadow-root-content").text(); System.out.println(shadowText); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=', browserHtml: true, actions: [ { action: 'evaluate', source: ` const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) ` } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const browserHtml = response.data.browserHtml const $ = 
cheerio.load(browserHtml) const shadowText = $('#shadow-root-content').text() console.log(shadowText) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=', 'browserHtml' => true, 'actions' => [ [ 'action' => 'evaluate', 'source' => " const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) ", ], ], ], ]); $data = json_decode($response->getBody()); $doc = new DOMDocument(); $doc->loadHTML($data->browserHtml); $xpath = new DOMXPath($doc); $shadow_text = $xpath->query("//*[@id='shadow-root-content']")->item(0)->textContent; echo $shadow_text.PHP_EOL; ``` #### Python ```python import requests from parsel import Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", "browserHtml": True, "actions": [ { "action": "evaluate", "source": """ const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) """, }, ], }, ) browser_html = api_response.json()["browserHtml"] shadow_text = Selector(browser_html).css("#shadow-root-content::text").get() print(shadow_text) ``` #### Python client ```python import asyncio from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url":
"https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", "browserHtml": True, "actions": [ { "action": "evaluate", "source": """ const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) """, }, ], }, ) browser_html = api_response["browserHtml"] shadow_text = Selector(browser_html).css("#shadow-root-content::text").get() print(shadow_text) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class CodePenSpider(Spider): name = "codepen" async def start(self): yield Request( "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", meta={ "zyte_api_automap": { "browserHtml": True, "actions": [ { "action": "evaluate", "source": """ const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) """, }, ], }, }, ) def parse(self, response): shadow_text = response.css("#shadow-root-content::text").get() print(shadow_text) ``` Output: ```none Shadow Paragraph ``` #### Action types Zyte API supports 3 types of browser actions: - **Generic actions** work on every website. They allow you to type text into input fields, emulate mouse input, and wait for events or time. - **Special actions** expose functionality that requires specific knowledge of the target website, such as using their search box or filling a form. They are only available for certain websites. To find out if an action is available for a given website, send a test request using that action. If the action is not supported, you will get an error API response indicating so. 
- Browser scripts. #### Action limits You are free to use as many browser actions as you wish, but total browser execution time is limited to 60 seconds. If your actions are still running by that time, the ongoing action is interrupted, follow-up actions are not executed at all, and you get your requested output (browser HTML, screenshot) as it was rendered at that time. The Zyte API response includes an `actions` key that provides details about action execution, including `elapsedTime`, `error`, and `status` fields to help you debug your actions, e.g. to find out which actions were executed successfully and which actions were not. #### Action selectors Browser actions that interact with a webpage element all have a `selector` key that allows you to define how to find the target webpage element. You must define a query to find the target webpage element in the `selector.value` field. You must specify the language of your query in the `selector.type` field, which supports the following values: CSS Selector (`css`), XPath 1.0 (`xpath`). For information about these query languages, see [Learning CSS and XPath](https://parsel.readthedocs.io/en/latest/usage.html#learning-css-and-xpath). Note that selectors cannot interact with [iframes](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) or with the [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM); only the `evaluate` action and browser scripts can. #### Wait actions You can use the following browser actions to introduce wait times in your browser action sequences or in your browser scripts: `waitForSelector`, `waitForRequest`, `waitForResponse`, and `waitForTimeout`. Whenever you need to wait for something to happen on a webpage, you should consider using `waitForSelector` first. It waits for an element matching a given selector.
By default, it waits for a matching *visible* element, but you can change `selector.state` to `attached`, to wait for an element to exist regardless of visibility, or to `hidden`, to wait for a matching *invisible* element. > ###### TIP > > For a usage example of `waitForSelector`, see the web scraping > tutorial. `waitForRequest` and `waitForResponse` wait for a request to be sent or for a response to be received, filtering by URL pattern. `waitForTimeout` pauses your sequence of actions or your browser script for the specified amount of time. Because action run time is limited, you should avoid using this type of action when an alternative waiting action can replace it. However, this action can be necessary for certain scenarios, such as following organic website-access patterns. ### Network capture In browser requests, use the [networkCapture](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/networkCapture) request field to define filters to capture network responses received during browser rendering (including action execution). #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://quotes.toscrape.com/scroll"}, {"browserHtml", true}, { "networkCapture", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"filterType", "url"}, {"httpResponseBody", true}, {"value", "/api/"}, {"matchType", "contains"} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var apiBody = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(apiBody); var captureEnumerator = data.RootElement.GetProperty("networkCapture").EnumerateArray(); captureEnumerator.MoveNext(); var capture = captureEnumerator.Current; var base64Body = capture.GetProperty("httpResponseBody").ToString(); var body = System.Convert.FromBase64String(base64Body); var captureData = JsonDocument.Parse(body); var quoteEnumerator = captureData.RootElement.GetProperty("quotes").EnumerateArray(); quoteEnumerator.MoveNext(); var quote = quoteEnumerator.Current; var authorEnumerator = quote.GetProperty("author").EnumerateObject(); while (authorEnumerator.MoveNext()) { if (authorEnumerator.Current.Name.ToString() == "name") { Console.WriteLine(authorEnumerator.Current.Value.ToString()); break; } } ``` #### CLI client input.jsonl ```json {"url":
"https://quotes.toscrape.com/scroll", "browserHtml": true, "networkCapture": [{"filterType": "url", "httpResponseBody": true, "value": "/api/", "matchType": "contains"}]} ``` ```shell zyte-api input.jsonl \ | jq --raw-output ".networkCapture[0].httpResponseBody" \ | base64 --decode \ | jq --raw-output ".quotes[0].author.name" ``` #### curl input.json ```json { "url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "networkCapture": [ { "filterType": "url", "httpResponseBody": true, "value": "/api/", "matchType": "contains" } ] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output ".networkCapture[0].httpResponseBody" \ | base64 --decode \ | jq --raw-output ".quotes[0].author.name" ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonArray; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map filter = ImmutableMap.of( "filterType", "url", "httpResponseBody", true, "value", "/api/", "matchType", "contains"); Map parameters = ImmutableMap.of( "url", 
"https://quotes.toscrape.com/scroll", "browserHtml", true, "networkCapture", Collections.singletonList(filter)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); JsonArray captures = jsonObject.get("networkCapture").getAsJsonArray(); JsonObject capture = captures.get(0).getAsJsonObject(); byte[] bodyBytes = Base64.getDecoder().decode(capture.get("httpResponseBody").getAsString()); String body = new String(bodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(body).getAsJsonObject(); JsonObject quote = data.get("quotes").getAsJsonArray().get(0).getAsJsonObject(); String authorName = quote.get("author").getAsJsonObject().get("name").getAsString(); System.out.println(authorName); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://quotes.toscrape.com/scroll', browserHtml: true, networkCapture: [ { filterType: 'url', httpResponseBody: true, value: '/api/', matchType: 'contains' } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const capture = response.data.networkCapture[0] const data = JSON.parse(Buffer.from(capture.httpResponseBody, 'base64')) 
console.log(data.quotes[0].author.name) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://quotes.toscrape.com/scroll', 'browserHtml' => true, 'networkCapture' => [ [ 'filterType' => 'url', 'httpResponseBody' => true, 'value' => '/api/', 'matchType' => 'contains', ], ], ], ]); $api_response = json_decode($response->getBody()); $capture = $api_response->networkCapture[0]; $data = json_decode(base64_decode($capture->httpResponseBody)); echo $data->quotes[0]->author->name.PHP_EOL; ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "networkCapture": [ { "filterType": "url", "httpResponseBody": True, "value": "/api/", "matchType": "contains", }, ], }, ) capture = api_response.json()["networkCapture"][0] data = json.loads(b64decode(capture["httpResponseBody"]).decode()) print(data["quotes"][0]["author"]["name"]) ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "networkCapture": [ { "filterType": "url", "httpResponseBody": True, "value": "/api/", "matchType": "contains", }, ], }, ) capture = api_response["networkCapture"][0] data = json.loads(b64decode(capture["httpResponseBody"]).decode()) print(data["quotes"][0]["author"]["name"]) asyncio.run(main()) ``` #### Scrapy ```python import json from base64 import b64decode from scrapy import Request, Spider class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" async def start(self): yield Request( "https://quotes.toscrape.com/scroll", meta={ "zyte_api_automap": {
"browserHtml": True, "networkCapture": [ { "filterType": "url", "httpResponseBody": True, "value": "/api/", "matchType": "contains", }, ], }, }, ) def parse(self, response): capture = response.raw_api_response["networkCapture"][0] data = json.loads(b64decode(capture["httpResponseBody"]).decode()) print(data["quotes"][0]["author"]["name"]) ``` Output: ```none Albert Einstein ``` See also tutorial-network-capture in the web scraping tutorial. ### Request headers In browser requests, use the [requestHeaders.referer](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestHeaders.referer) request field to set the [Referer header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer). #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary(){ {"url", "https://httpbin.org/anything"}, {"browserHtml", true}, { "requestHeaders", new Dictionary() { {"referer", "https://example.org/"} } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = 
JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator.Evaluate("//text()"); nodeIterator.MoveNext(); var responseJson = nodeIterator.Current.ToString(); var responseData = JsonDocument.Parse(responseJson); var headerEnumerator = responseData.RootElement.GetProperty("headers").EnumerateObject(); var headers = new Dictionary<string, string>(); while (headerEnumerator.MoveNext()) { headers.Add( headerEnumerator.Current.Name.ToString(), headerEnumerator.Current.Value.ToString() ); } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/anything", "browserHtml": true, "requestHeaders": {"referer": "https://example.org/"}} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .browserHtml \ | xmllint --html --xpath '//text()' - 2> /dev/null \ | jq .headers ``` #### curl input.json ```json { "url": "https://httpbin.org/anything", "browserHtml": true, "requestHeaders": { "referer": "https://example.org/" } } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml \ | xmllint --html --xpath '//text()' - 2> /dev/null \ | jq .headers ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import 
org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, String> requestHeaders = ImmutableMap.of("referer", "https://example.org/"); Map<String, Object> parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "browserHtml", true, "requestHeaders", requestHeaders); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); Document document = Jsoup.parse(browserHtml); JsonObject data = JsonParser.parseString(document.text()).getAsJsonObject(); JsonObject headers = data.get("headers").getAsJsonObject(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(headers)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') axios.post( 'https://api.zyte.com/v1/extract', { url: 
'https://httpbin.org/anything', browserHtml: true, requestHeaders: { referer: 'https://example.org/' } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const $ = cheerio.load(response.data.browserHtml) const data = JSON.parse($.text()) const headers = data.headers }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'browserHtml' => true, 'requestHeaders' => [ 'referer' => 'https://example.org/', ], ], ]); $api = json_decode($response->getBody()); $doc = new DOMDocument(); $doc->loadHTML($api->browserHtml); $data = json_decode($doc->textContent); $headers = $data->headers; ``` #### Python ```python import json import requests from parsel import Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/anything", "browserHtml": True, "requestHeaders": { "referer": "https://example.org/", }, }, ) browser_html = api_response.json()["browserHtml"] selector = Selector(browser_html) response_json = selector.xpath("//text()").get() response_data = json.loads(response_json) headers = response_data["headers"] ``` #### Python client ```python import asyncio import json from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/anything", "browserHtml": True, "requestHeaders": { "referer": "https://example.org/", }, } ) browser_html = api_response["browserHtml"] selector = Selector(browser_html) response_json = selector.xpath("//text()").get() response_data = json.loads(response_json) print(json.dumps(response_data["headers"], indent=2)) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield 
Request( "https://httpbin.org/anything", headers={"Referer": "https://example.org/"}, meta={ "zyte_api_automap": { "browserHtml": True, }, }, ) def parse(self, response): response_json = response.xpath("//text()").get() response_data = json.loads(response_json) headers = response_data["headers"] ``` Output (`"Referer"` line): ```json "Referer": "https://example.org/", ``` At the moment, only the [Referer header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer) can be overridden this way. If you need to override additional headers, use HTTP requests with their [customHttpRequestHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customHttpRequestHeaders) request field instead. ### Redirection Browser requests always follow [HTTP redirection](https://developer.mozilla.org/en-US/docs/Web/HTTP/Redirections) and other URL changes triggered during browser rendering, e.g. by HTML or by JavaScript. > ###### TIP > > HTTP requests can be configured not to follow redirection. ### JavaScript Browser requests have JavaScript execution enabled by default for most websites. For some websites, however, JavaScript execution is disabled by default because that helps to avoid bans or to automate extraction. Use the [javascript](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/javascript) request field to explicitly enable or disable JavaScript execution on a browser request. #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://www.whatismybrowser.com/detect/is-javascript-enabled"}, {"browserHtml", true}, {"javascript", false} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator.Evaluate("//*[@id=\"detected_value\"]/text()"); nodeIterator.MoveNext(); var isJavaScriptEnabled = nodeIterator.Current.ToString(); ``` #### CLI client input.jsonl ```json {"url": "https://www.whatismybrowser.com/detect/is-javascript-enabled", "browserHtml": true, "javascript": false} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .browserHtml \ | xmllint --html --xpath '//*[@id="detected_value"]/text()' - 2> /dev/null ``` #### curl input.json ```json { "url": "https://www.whatismybrowser.com/detect/is-javascript-enabled", "browserHtml": true, "javascript": false } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: 
\ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml \ | xmllint --html --xpath '//*[@id="detected_value"]/text()' - 2> /dev/null ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters = ImmutableMap.of( "url", "https://www.whatismybrowser.com/detect/is-javascript-enabled", "browserHtml", true, "javascript", false); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = 
JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); Document document = Jsoup.parse(browserHtml); String isJavaScriptEnabled = document.select("#detected_value").text(); System.out.println(isJavaScriptEnabled); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://www.whatismybrowser.com/detect/is-javascript-enabled', browserHtml: true, javascript: false }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const $ = cheerio.load(response.data.browserHtml) const isJavaScriptEnabled = $('#detected_value').text() }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://www.whatismybrowser.com/detect/is-javascript-enabled', 'browserHtml' => true, 'javascript' => false, ], ]); $api = json_decode($response->getBody()); $doc = new DOMDocument(); $doc->loadHTML($api->browserHtml); $xpath = new DOMXPath($doc); $is_javascript_enabled = $xpath->query("//*[@id='detected_value']")->item(0)->textContent; ``` #### Python ```python import requests from parsel import Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://www.whatismybrowser.com/detect/is-javascript-enabled", "browserHtml": True, "javascript": False, }, ) browser_html = api_response.json()["browserHtml"] selector = Selector(browser_html) is_javascript_enabled: str = selector.css("#detected_value::text").get() ``` #### Python client ```python import asyncio from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = 
AsyncZyteAPI() api_response = await client.get( { "url": "https://www.whatismybrowser.com/detect/is-javascript-enabled", "browserHtml": True, "javascript": False, } ) browser_html = api_response["browserHtml"] selector = Selector(browser_html) is_javascript_enabled = selector.css("#detected_value::text").get() print(is_javascript_enabled) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class WhatIsMyBrowserComSpider(Spider): name = "whatismybrowser_com" async def start(self): yield Request( "https://www.whatismybrowser.com/detect/is-javascript-enabled", meta={ "zyte_api_automap": { "browserHtml": True, "javascript": False, }, }, ) def parse(self, response): is_javascript_enabled: str = response.css("#detected_value::text").get() ``` Output: ```none No ``` ## Zyte API automatic extraction **Automatic extraction** gets you structured data from web data. Automatic extraction supports AI-powered extraction of e-commerce, article and job posting data from any website, as well as **non-AI extraction** of Google Search results. You can use Zyte API requests to get structured data from webpages. ### Structured data types In a Zyte API request, enable any of the following fields to get matching structured data: > ###### NOTE > > You can only enable 1 of these fields per Zyte API request. 
> ##### E-commerce > [product](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/product) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/product)) ai > [productList](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/productList) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/productList)) ai > [productNavigation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/productNavigation) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/productNavigation)) ai > ##### Articles > [article](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/article) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/article)) ai > [articleList](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/articleList) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/articleList)) ai > [articleNavigation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/articleNavigation) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/articleNavigation)) ai > [forumThread](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/forumThread) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/forumThread)) ai > ##### Job postings > [jobPosting](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/jobPosting) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/jobPosting)) ai > [jobPostingNavigation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/jobPostingNavigation) 
([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/jobPostingNavigation)) ai > ##### Generic > [pageContent](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/pageContent) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/pageContent)) ai > ##### Google Search > > [serp](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/serp) ([output](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/serp)) non-ai #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"}, {"product", true} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var product = data.RootElement.GetProperty("product").ToString(); Console.WriteLine(product); ``` #### CLI client input.jsonl ```json {"url": 
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .product ``` #### curl input.json ```json { "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .product ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient 
client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); JsonObject product = jsonObject.get("product").getAsJsonObject(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(product)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', product: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const product = response.data.product console.log(product) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'product' => true, ], ]); $data = json_decode($response->getBody()); $product = json_encode($data->product); echo $product.PHP_EOL; ``` #### Python ```python import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": ( "https://books.toscrape.com/catalogue" "/a-light-in-the-attic_1000/index.html" ), "product": True, }, ) product = api_response.json()["product"] print(product) ``` #### Python client ```python import asyncio import json from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": ( "https://books.toscrape.com/catalogue" "/a-light-in-the-attic_1000/index.html" ), "product": True, } ) product = api_response["product"] 
print(json.dumps(product, indent=2, ensure_ascii=False)) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class BooksToScrapeComSpider(Spider): name = "books_toscrape_com" async def start(self): yield Request( ( "https://books.toscrape.com/catalogue" "/a-light-in-the-attic_1000/index.html" ), meta={ "zyte_api_automap": { "product": True, }, }, ) def parse(self, response): product = response.raw_api_response["product"] print(product) ``` Output (first 5 lines): ```json { "name": "A Light in the Attic", "price": "51.77", "currency": "GBP", "currencyRaw": "£", ``` ### AI-powered extraction Automatic extraction uses AI-powered extraction for the following structured data types: [product](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/product), [productList](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/productList), [productNavigation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/productNavigation), [article](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/article), [articleList](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/articleList), [articleNavigation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/articleNavigation), [forumThread](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/forumThread), [jobPosting](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/jobPosting), [jobPostingNavigation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/jobPostingNavigation), [pageContent](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/pageContent). 
AI-powered extraction also supports LLM-based extraction of custom attributes, as well as: geolocation, IP type, cookies, sessions, redirection, response headers, and metadata, plus additional features depending on your extraction source.

#### Extraction source

Use the corresponding `extractFrom` option, e.g. [productOptions.extractFrom](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/productOptions.extractFrom) when extracting a [product](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/product), to indicate which sources to use for automatic extraction:

- `httpResponseBody` extracts from [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody). It is usually faster and cheaper.
- `browserHtmlOnly` extracts from [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml). It typically improves quality over `httpResponseBody` on JavaScript-heavy web pages.
- `browserHtml` extracts from both [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml) and visual features of the rendered web page. It typically improves quality over `browserHtmlOnly`, but is not as robust in case of rendering issues.

If not specified, `browserHtml` is currently used by default for AI extraction, while `httpResponseBody` is used by default for non-AI extraction. In the future, the default value may depend on the target website.

Automatic extraction using an HTTP request (`httpResponseBody`) supports HTTP request attributes for method, body, and headers.

Automatic extraction using a browser request (`browserHtmlOnly` or `browserHtml`) supports browser HTML, screenshots, some request headers, actions, network capture, and toggling JavaScript. The limitations of browser requests also apply in this case.
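As a minimal sketch of how the `extractFrom` option fits into a request body, the helper below builds the JSON payload for a given data type and extraction source. `build_extraction_request` is a hypothetical name, not part of any Zyte library, and the sketch assumes that the options field is always named by appending `Options` to the data type, as in `productOptions`; check the API reference for the data type you use.

```python
import json

# Extraction sources named in the text above.
EXTRACTION_SOURCES = {"httpResponseBody", "browserHtmlOnly", "browserHtml"}


def build_extraction_request(url, data_type, extract_from=None):
    """Build a Zyte API request body for automatic extraction.

    Hypothetical helper for illustration; not part of any Zyte library.
    """
    body = {"url": url, data_type: True}
    if extract_from is not None:
        if extract_from not in EXTRACTION_SOURCES:
            raise ValueError(f"Unknown extraction source: {extract_from!r}")
        # Assumption: the options field is "<data type>Options",
        # e.g. "productOptions" for "product".
        body[f"{data_type}Options"] = {"extractFrom": extract_from}
    return body


request_body = build_extraction_request(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "product",
    extract_from="httpResponseBody",
)
print(json.dumps(request_body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to `https://api.zyte.com/v1/extract`, in the same way as in the examples above. Leaving `extract_from` unset keeps the default extraction source.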
#### Model pinning

The AI models of AI-powered extraction are retrained regularly, usually a few times per year. While new model versions aim to improve overall accuracy, they may become less accurate for specific fields of specific websites.

For certain data types, we provide an option to pin a specific model version, which allows you to postpone an update to the latest model. To pin a model, use the corresponding `model` option, e.g. [productOptions.model](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/productOptions.model) when extracting a [product](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/product).

Model versions remain available for at least 1 year after their release. For example, a product model version `"2024-02-01"` would remain available at least until the 1st of February 2025. When we decide to remove a model version, we announce its end-of-life date by email to its users at least 3 months in advance, and we list that date in the table below.

| Data type | Model name | Description |
|-------------|--------------|-----------------------|
| product | 2024-02-01 | |
| product | 2024-09-16 | Default product model |

## Custom attributes extraction

When you use AI-powered extraction, you can also extract arbitrary additional attributes that you define yourself: **custom attributes**.

The extraction of custom attributes uses a Large Language Model (LLM) operated by Zyte. The LLM receives a schema that you define, together with text extracted from the target webpage, and extracts structured data according to your schema.

When you request custom attributes extraction, you must also specify a standard extraction field (e.g. [product](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/product)). This determines which part of the web page is passed to the LLM for custom attributes extraction, e.g.
when a web page is a product, we’re only going to pass the product information, ignoring other parts of the page, such as menu or footer, which makes extraction cheaper and more accurate. Any of the standard extraction fields can be used, except for [serp](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/serp). The schema is passed in the [customAttributes](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributes) request field, and additional options can be customized in the [customAttributesOptions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributesOptions) field. Extracted values are available in the [customAttributes.values](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/customAttributes.values) field in the response. Here is an example body of a request to Zyte API which performs custom attributes extraction, adding “summary” and “article_sentiment” attributes: ```json { "url": "https://www.zyte.com/blog/intercept-network-patterns-within-zyte-api/", "article": true, "customAttributes": { "summary": { "type": "string", "description": "A two sentence article summary" }, "article_sentiment": { "type": "string", "enum": ["positive", "negative", "neutral"] } } } ``` And here is an example response body, with “article” and “metadata” values omitted: ```json { "url": "https://www.zyte.com/blog/intercept-network-patterns-within-zyte-api/", "statusCode": 200, "article": { }, "customAttributes": { "values": { "summary": "The Zyte API now allows developers to intercept network patterns, enabling better web scraping and bypassing challenges posed by modern websites with dynamic content and anti-bot measures. 
This feature allows for enhanced ban-handling strategies and more efficient scraping.", "article_sentiment": "positive" }, "metadata": { } } } ``` Refer to examples of making Zyte API requests with different languages and libraries. ### Method of extraction [customAttributesOptions.method](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributesOptions.method) lets you select the method of custom attribute extraction: * “generate” (default) generates extracted data with the help of a generative Large Language Model (LLM). It is the most powerful and versatile extraction method, but also the most expensive one, with a variable per-request cost. * “extract” locates extracted data in the requested web page with the help of a non-generative LLM. It only supports a subset of the schema (only string, integer and number types), and cannot perform generative tasks such as summarization or data transformation. It is, however, much cheaper than the generative method and has a fixed per-request cost. ### Schema for the generative method The schema of the custom attributes is passed in the [customAttributes](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributes) request field, and is a subset of the OpenAPI specification, using JSON syntax. Here is an example custom attributes schema, showcasing the main features and good practices with the default “generate” method: ```json { "pockets": { "type": "integer", "description": "how many pockets the piece of clothing has" }, "has_reflective_elements": { "type": "boolean", "description": "does the piece of clothing have reflective elements?" 
}, "pattern_orientation": { "type": "string", "description": "if the piece of clothing has a pattern, the orientation of this pattern", "enum": ["horizontal", "vertical", "diagonal"] }, "materials": { "type": "array", "description": "the materials the product is made of", "items": {"type": "string"} }, "materials_details": { "type": "array", "description": "information about the materials the product is made of", "items": { "type": "object", "properties": { "name": { "type": "string", "description": "the name of the material" }, "percentage": { "type": "number", "description": "the percentage of the material in the product" } } } }, "price": { "type": "object", "properties": { "regular": { "type": "number", "description": "the regular price of the product. That is, without any discount" }, "discounted": { "type": "number", "description": "the current price of the product, with the discount" }, "unit": { "type": "string", "description": "the currency code of the price, usually given as a 3-letter code, e.g. USD, EUR, GBP, etc." } } } } ``` An example output which may be produced for this schema would be: ```json { "pockets": 3, "has_reflective_elements": false, "materials": ["cotton", "polyester", "elastane"], "materials_details": [ {"name": "cotton", "percentage": 70}, {"name": "polyester", "percentage": 25}, {"name": "elastane"} ], "price": { "regular": 100, "discounted": 99, "unit": "EUR" } } ``` Note that `"pattern_orientation"` is missing from the response, as well as `"percentage"` for one of the materials: all attributes are implicitly nullable, so if an attribute cannot be extracted, it is not returned. The returned value is guaranteed to conform to the requested schema. 
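Because every attribute is implicitly nullable, code that consumes custom attribute values should not assume that a key is present. The sketch below illustrates one way to handle that, using the example output above as hard-coded input and a hypothetical `describe_materials` helper that reads optional keys with `dict.get()`.

```python
# Example output from above, as it could appear under
# customAttributes.values in a response.
values = {
    "pockets": 3,
    "has_reflective_elements": False,
    "materials": ["cotton", "polyester", "elastane"],
    "materials_details": [
        {"name": "cotton", "percentage": 70},
        {"name": "polyester", "percentage": 25},
        {"name": "elastane"},
    ],
    "price": {"regular": 100, "discounted": 99, "unit": "EUR"},
}


def describe_materials(values):
    """Return one line per material, tolerating missing keys.

    Hypothetical helper for illustration only.
    """
    lines = []
    for detail in values.get("materials_details", []):
        name = detail.get("name", "unknown")
        percentage = detail.get("percentage")
        if percentage is None:
            lines.append(f"{name}: percentage not extracted")
        else:
            lines.append(f"{name}: {percentage}%")
    return lines


# "pattern_orientation" was not extracted, so .get() returns None
# instead of raising KeyError.
pattern_orientation = values.get("pattern_orientation")
print(pattern_orientation)
for line in describe_materials(values):
    print(line)
```

Using `.get()` throughout keeps the consuming code working whether or not a given attribute could be extracted from a particular page.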
The following attribute data types are supported: - `string` - `boolean` - `number` - `integer` - `array` of any data type except for `array` - `object` with `string`, `boolean`, `number` and `integer` sub-fields When the type is `string`, `number` or `integer`, an `enum` can also be indicated, and the extraction value for that attribute will always be one of these options, or empty when it cannot be extracted - see the `"pattern_orientation"` in the example above. This is especially useful in data analysis use cases, where one might need to split the dataset into pre-defined groups. #### Generative attributes Custom Attributes don’t need to be restricted to extract data as it appears on the web site verbatim. They can be used for different operations on data that can only be achieved with generative extraction. You can find some examples below that take advantage of this aspect. ##### Normalization A custom attribute can be extracted following some data normalization, when specified in the description, usually in some explicit format or via an example. This is especially useful for later parsing, e.g. for visualization or data analysis, for example: ```json { "datetime_posted": { "type": "string", "description": "the date when the article was created, in the following format: YYYY/MM/DD" } } ``` Example output: ```json { "datetime_posted": "2021/12/30" } ``` ##### Summarization Sometimes, attributes that are summaries rather than the whole text can be useful, especially to save tokens needed to generate them, or when some simplification of the content of the page is needed. Example schema: ```json { "summary": { "type": "string", "description": "a brief summary of the article. Max 2 phrases. Explain it as a third person, e.g. start like this: The article.." 
} } ``` Example output: ```json { "summary": "The article describes the scenic beauty and vast adventure opportunities of the Grand Canyon National Park, highlighting its colorful landscapes and the meandering Colorado River. It provides practical information for visitors, such as entrance fees, lodging options, and tips for hiking and rafting." } ``` ##### Translation An extract-and-translate can be done on the fly. Just specify the conditions and/or details in the description. Example text: ``` [...] Couleurs du produit disponibles: jaune, rouge [...] ``` Example schema: ```json { "colors": { "type": "array", "description": "the available colors of the product. Translate to English if needed.", "items": {"type": "string"} } } ``` Output: ```json { "colors": ["yellow", "red"] } ``` When there are several attributes in the schema, these kinds of specifications made to custom attributes may apply to later attributes, so in these cases, if you want different behavior, it’s recommended to specify so in the description of each attribute, for example: ```json { "colors": { "type": "array", "description": "the available colors of the product. Translate to English if needed.", "items": {"type": "string"} }, "materials": { "type": "array", "description": "the materials the product is made of. Extract as they appear on the page, without translating them.", "items": {"type": "string"} } } ``` ##### Explanation We can make the LLM perform an analysis and explain the page content (or other details) before doing the actual extraction, in the same attribute or in another attribute after the explanatory one. 
This is especially useful to force the LLM to develop a “logic” before doing the actual extraction, which has been [demonstrated](https://arxiv.org/abs/2201.11903) to improve the final answer, for example: ```json { "explain is a toy": { "type": "string", "description": "analyze the content of the page and detailedly explain it, explaining if it is a single product page and if the product is a toy or not." }, "is a toy": { "type": "boolean", "description": "whether the product is a toy or not" } } ``` Example output: ```json { "explain is a toy": "The content of the page is a product page for \"Roasted & Salted Plantain Chips\", which is a type of snack food. It includes details such as brand, price, ratings, ingredients, and product description. It does not mention anything about toys or games, so it is not a single product page for a toy.", "is a toy": false } ``` Overall, we’d expect the extraction of the “is a toy” custom attribute in the example above to be more accurate if we use the “explain is a toy” before it, especially in hard or ambiguous cases (e.g. the product is a manual for a toy). > ###### NOTE > > Since these kinds of explanations need to generate a fair amount of > tokens, they can considerably increase the extraction cost. > > Also, it is important that the attribute that does the explanation/analysis > (“explain is a toy”) comes before the final one where the final extraction > is made (“is a toy”). #### Other tips and tricks ##### Avoiding mathematical transformations We recommend doing a simple extraction when possible, and then applying your rules or transformations as a post-processing step. For example, imagine you want to extract the height of a product, but always in inches. However, you’re scraping a lot of product pages and some web sites might display the height of the product in cm, m, ft, etc. One option is to explicitly ask in the schema to “transform to X metric if found in Y metric”.
The LLM generally has the capacity to do this conversion internally, but we cannot ensure the result will always be correct, and it will overcomplicate extraction for the LLM. Example text: ``` Vacuum cleaner Turbo master 2000 Price: 200 $ Specified height by the manufacturer is 1.2 meters ``` Example schema: ```json { "height": { "type": "number", "description": "height of the product, in inches. Transform it to inches if found in other metric" } } ``` Extraction result: ```json { "height": 47.24 } ``` The result is correct. However, when the schema is bigger (i.e. there are more custom attributes to extract), the LLM’s attention is spread more thinly, and it is more likely to get these internal conversions wrong. Such conversions also cannot be verified against the page. For this reason, we recommend writing a schema that allows the LLM to extract the desired data verbatim from the page, with the necessary fields to do your own transformation in your favorite programming language. This extraction is easier for the LLM and has a lower chance of being extracted incorrectly. Example schema: ```json { "product_height": { "type": "object", "description": "info about the height of the product", "properties": { "value": { "type": "number", "description": "the value of the height" }, "unit_normalized": { "type": "string", "description": "the normalized unit of measurement for the height", "enum": ["cm", "m", "in", "ft", "mm", "other"] } } } } ``` Extraction result: ```json { "product_height": { "value": 1.2, "unit_normalized": "m" } } ``` And then do the necessary transformation. For example, using Python:
```python
def height_in_inches(values):
    height = values["product_height"]
    if height["unit_normalized"] == "m":
        return height["value"] * 39.37  # meters to inches
```
##### Reducing the number of attributes A lower number of attributes generally means better extraction quality. The easier it is to solve a problem, the better the LLM will be at solving that problem.
For that reason, the more attributes there are in the schema, the harder the overall extraction will be for the LLM. With many attributes, the LLM tends to miss some details in the descriptions or in the web page. Generally, the fewer the attributes, the more the LLM can focus on those, and the better the extraction quality will be. This is especially important when some attributes are already hard or complex on their own (e.g. array of objects). ### Schema for the extractive method The main use case for the extractive method is the extraction of simple, not-too-large attributes that do not require any transformation, such as memory capacity or screen resolution for [product](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/product). When you select the “extract” custom attributes extraction method, and your schema contains attributes that are not supported by the extractive method, e.g. objects, lists, or booleans, those attributes are ignored during extraction. When creating a schema with the extractive method, we recommend starting without attribute descriptions. If an attribute name alone is not enough to reach the desired quality, we recommend writing a description for that attribute, but formulating it as a question, and describing it in detail without assuming that the attribute name will be implicit as context for that question. For example, you might start with the following: ```json { "number_of_pockets": { "type": "integer" } } ``` And then make it more specific: ```json { "pockets": { "type": "integer", "description": "What is the number of pockets in this garment?"
} } ``` But we don’t recommend having an incomplete description that relies on the attribute name, or a description that is not a question: ```json { "pockets": { "type": "integer", "description": "number of them in this garment" } } ``` When extracting values with units, we recommend extracting the whole value as one attribute, instead of splitting the value and the unit. For example, do this: ```json { "memory_capacity": { "type": "string" } } ``` instead of this, which is less likely to work well: ```json { "memory_capacity_value": { "type": "integer" }, "memory_capacity_unit": { "type": "string" } } ``` ## Zyte API shared features Learn here about Zyte API features that you can use with HTTP requests, browser requests, and automatic extraction: geolocation, IP type, cookies, sessions, response headers, and metadata. ### Geolocation The geographical point of origin of a request in terms of IP address can influence the response content. Some websites adjust the language or currency based on the country of origin. Some websites only allow traffic from specific countries. By default, Zyte API uses the most fitting geolocation based on the target website. You can override the country of origin used for a given request with the [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) request field. > ###### NOTE > > Zyte API provides 2 sets of geolocations, standard and extended, > listed in the reference documentation of [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation). > > Setting [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) explicitly on a request using an > extended geolocation, instead of letting Zyte API choose the right > geolocation based on the target website, affects request cost.
#### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "http://ip-api.com/json"}, {"httpResponseBody", true}, {"geolocation", "AU"} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var countryCode = responseData.RootElement.GetProperty("countryCode").ToString(); ``` #### CLI client input.jsonl ```json {"url": "http://ip-api.com/json", "httpResponseBody": true, "geolocation": "AU"} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq .countryCode ``` #### curl input.json ```json { "url": "http://ip-api.com/json", "httpResponseBody": true, "geolocation": "AU" } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ 
https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq .countryCode ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "http://ip-api.com/json", "httpResponseBody", true, "geolocation", "AU"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = 
Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); String countryCode = data.get("countryCode").getAsString(); System.out.println(countryCode); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'http://ip-api.com/json', httpResponseBody: true, geolocation: 'AU' }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const data = JSON.parse(httpResponseBody) const countryCode = data.countryCode }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'http://ip-api.com/json', 'httpResponseBody' => true, 'geolocation' => 'AU', ], ]); $api = json_decode($response->getBody()); $http_response_body = base64_decode($api->httpResponseBody); $data = json_decode($http_response_body); $country_code = $data->countryCode; ``` #### Proxy mode With the proxy mode, use the `Zyte-Geolocation` header.
```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -H "Zyte-Geolocation: US" \ http://ip-api.com/json \ | jq .countryCode ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "http://ip-api.com/json", "httpResponseBody": True, "geolocation": "AU", }, ) http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"]) response_data = json.loads(http_response_body) country_code = response_data["countryCode"] ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "http://ip-api.com/json", "httpResponseBody": True, "geolocation": "AU", } ) http_response_body: bytes = b64decode(api_response["httpResponseBody"]) response_data = json.loads(http_response_body) print(response_data["countryCode"]) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class IPAPIComSpider(Spider): name = "ip_api_com" async def start(self): yield Request( "http://ip-api.com/json", meta={ "zyte_api_automap": { "geolocation": "AU", }, }, ) def parse(self, response): response_data = json.loads(response.body) country_code = response_data["countryCode"] ``` Output: ```none AU ``` ### IP type IP addresses can be categorized in one of the following types: - **Data center** IP addresses are server-hosted IP addresses provided by web hosting providers, ISPs, etc. - **Residential** IP addresses are IP addresses provided by end-user devices with explicit user consent for bandwidth sharing. > ###### SEE ALSO > > zapi-permissions-control The type of IP address of a request can influence the response content. 
Some websites return different content depending on the IP type, or only allow requests from residential IP addresses. By default, Zyte API uses the most fitting IP type based on the target website. You can override the IP type used for a given request by setting the [ipType](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/ipType) request field to either `datacenter` or `residential`. > ###### WARNING > > Setting [ipType](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/ipType) explicitly to `residential`, > instead of letting Zyte API choose the right IP type based on the target > website, requires completing our KYC procedure and affects > request cost. #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); string[] ipTypes = { "datacenter", "residential" }; for (int i = 0; i < ipTypes.Length; i++) { var input = new Dictionary<string, object>(){ {"url", "https://www.whatismyisp.com/"}, {"httpResponseBody", true}, {"ipType", ipTypes[i]} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await 
response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes = System.Convert.FromBase64String(base64HttpResponseBody); var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(httpResponseBody); var navigator = htmlDocument.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator.Evaluate("//h1/span/text()"); nodeIterator.MoveNext(); var isp = nodeIterator.Current.ToString(); Console.WriteLine(isp); } ``` #### CLI client input.jsonl ```json {"url": "https://www.whatismyisp.com/", "httpResponseBody": true, "ipType": "datacenter"} {"url": "https://www.whatismyisp.com/", "httpResponseBody": true, "ipType": "residential"} ``` ```shell zyte-api input.jsonl 2> /dev/null \ | xargs -d\\n -n 1 \ bash -c " jq --raw-output .httpResponseBody <<< \"\$0\" \ | base64 --decode \ | xmllint --html --xpath 'string(//h1/span/text())' --noblanks - 2> /dev/null " ``` #### curl input.jsonl ```json {"url": "https://www.whatismyisp.com/", "httpResponseBody": true, "ipType": "datacenter"} {"url": "https://www.whatismyisp.com/", "httpResponseBody": true, "ipType": "residential"} ``` ```shell cat input.jsonl \ | xargs -P 2 -d\\n -n 1 \ bash -c " curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data \"\$0\" \ --compressed \ https://api.zyte.com/v1/extract \ 2> /dev/null \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | xmllint --html --xpath 'string(//h1/span/text())' --noblanks - 2> /dev/null " ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import 
org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { String[] ipTypes = {"datacenter", "residential"}; for (String ipType : ipTypes) { Map parameters = ImmutableMap.of( "url", "https://www.whatismyisp.com/", "httpResponseBody", true, "ipType", ipType); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); Document document = Jsoup.parse(httpResponseBody); String logout = document.select("h1 > span:first-of-type").text(); System.out.println(logout); return null; }); } } 
private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') const ipTypes = ['datacenter', 'residential'] for (const ipType of ipTypes) { axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://www.whatismyisp.com/', httpResponseBody: true, ipType }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const $ = cheerio.load(httpResponseBody) const logout = $('h1 > span:first-of-type').text() console.log(logout) }) } ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); foreach (['datacenter', 'residential'] as $ip_type) { $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://www.whatismyisp.com/', 'httpResponseBody' => true, 'ipType' => $ip_type, ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $doc = new DOMDocument(); $doc->loadHTML($http_response_body); $xpath = new DOMXPath($doc); $logout = $xpath->query('//h1/span/text()')->item(0)->nodeValue; echo $logout.PHP_EOL; } ``` #### Proxy mode With the proxy mode, use the `Zyte-IPType` header.
```shell for ip_type in datacenter residential do curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --header "Zyte-IPType: $ip_type" \ --compressed \ https://www.whatismyisp.com/ \ 2> /dev/null \ | xmllint --html --xpath 'string(//h1/span/text())' --noblanks - 2> /dev/null done ``` #### Python ```python from base64 import b64decode import requests from parsel import Selector for ip_type in ("datacenter", "residential"): api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://www.whatismyisp.com/", "httpResponseBody": True, "ipType": ip_type, }, ) http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"]) http_response_body = http_response_body_bytes.decode() logout = Selector(http_response_body).css("h1 > span::text").get() print(logout) ``` #### Python client ```python import asyncio from base64 import b64decode from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() for ip_type in ("datacenter", "residential"): api_response = await client.get( { "url": "https://www.whatismyisp.com/", "httpResponseBody": True, "ipType": ip_type, }, ) http_response_body_bytes = b64decode(api_response["httpResponseBody"]) http_response_body = http_response_body_bytes.decode() logout = Selector(http_response_body).css("h1 > span::text").get() print(logout) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class WhatIsMyIspComSpider(Spider): name = "whatismyisp_com" async def start(self): for ip_type in ("datacenter", "residential"): yield Request( "https://www.whatismyisp.com/", meta={ "zyte_api_automap": { "ipType": ip_type, }, }, ) def parse(self, response): print(response.css("h1 > span::text").get()) ``` Output: ```none [A web hosting company] [An Internet service provider] ``` ### Cookies Some websites use [cookies]() to track sessions and user preferences like language, address, etc. 
Use the [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies) and [responseCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/responseCookies) request fields to set and get cookies. See **Example 1** below. A common usage pattern with cookies is to send a browser request with the [responseCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/responseCookies) request field set to `true` to a webpage that requires a browser to generate a valid session cookie, and then copy the [responseCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/responseCookies) response field value into the [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies) request field of follow-up HTTP requests. This allows using sessions on websites as long as the target website only checks for the cookie presence, which is often the case (if not, use sessions). See **Example 2** below. If you do not set request cookies, Zyte API may set some request cookies anyway to minimize bans. If you do not want that, set the [cookieManagement](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/cookieManagement) request field to `"discard"`; [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies) will still be used if defined. #### Example 1: Set a cookie and get it back > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
The following code example sends a cookie to [httpbin.org](https://httpbin.org) and prints the cookies that [httpbin.org](https://httpbin.org) reports to have received: #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/cookies"}, {"httpResponseBody", true}, { "requestCookies", new List<Dictionary<string, string>>() { new Dictionary<string, string>() { {"name", "foo"}, {"value", "bar"}, {"domain", "httpbin.org"} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var result = System.Text.Encoding.UTF8.GetString(httpResponseBody); Console.WriteLine(result); ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/cookies", "httpResponseBody": true, "requestCookies": [{"name": "foo", "value": "bar", "domain": "httpbin.org"}]} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### curl input.json ```json { "url": "https://httpbin.org/cookies", "httpResponseBody": true, "requestCookies": [ { 
"name": "foo", "value": "bar", "domain": "httpbin.org" } ] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map cookies = ImmutableMap.of("name", "foo", "value", "bar", "domain", "httpbin.org"); Map parameters = ImmutableMap.of( "url", "https://httpbin.org/cookies", "httpResponseBody", true, "requestCookies", Collections.singletonList(cookies)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = 
response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/cookies', httpResponseBody: true, requestCookies: [ { name: 'foo', value: 'bar', domain: 'httpbin.org' } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) console.log(httpResponseBody.toString()) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/cookies', 'httpResponseBody' => true, 'requestCookies' => [ [ 'name' => 'foo', 'value' => 'bar', 'domain' => 'httpbin.org', ], ], ], ]); $api = json_decode($response->getBody()); $http_response_body = base64_decode($api->httpResponseBody); echo $http_response_body; ``` #### Proxy mode In proxy mode, the `Cookie` header of your requests is used automatically to set cookies for the target URL domain. > ###### NOTE > > Setting cookies for additional domains is not supported.
```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -H "Cookie: foo=bar" \ https://httpbin.org/cookies ``` #### Python ```python from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/cookies", "httpResponseBody": True, "requestCookies": [ { "name": "foo", "value": "bar", "domain": "httpbin.org", }, ], }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) print(http_response_body.decode()) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/cookies", "httpResponseBody": True, "requestCookies": [ { "name": "foo", "value": "bar", "domain": "httpbin.org", }, ], } ) http_response_body = b64decode(api_response["httpResponseBody"]).decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/cookies", meta={ "zyte_api_automap": { "requestCookies": [ { "name": "foo", "value": "bar", "domain": "httpbin.org", }, ], }, }, ) def parse(self, response): print(response.text) ``` Output: ```json { "cookies": { "foo": "bar" } } ``` #### Example 2: Reuse browser cookies in HTTP requests Send a browser request to the home page of a website, and use its response cookies as request cookies in an HTTP request to a different URL of that website. 
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var browserInput = new Dictionary<string, object>(){ {"url", "https://toscrape.com/"}, {"browserHtml", true}, {"responseCookies", true} }; var browserInputJson = JsonSerializer.Serialize(browserInput); var browserContent = new StringContent(browserInputJson, Encoding.UTF8, "application/json"); HttpResponseMessage browserResponse = await client.PostAsync("https://api.zyte.com/v1/extract", browserContent); var browserResponseBody = await browserResponse.Content.ReadAsByteArrayAsync(); var browserData = JsonDocument.Parse(browserResponseBody); var httpInput = new Dictionary<string, object>(){ {"url", "https://books.toscrape.com/"}, {"httpResponseBody", true}, {"requestCookies", browserData.RootElement.GetProperty("responseCookies")} }; var httpInputJson = JsonSerializer.Serialize(httpInput); var httpContent = new StringContent(httpInputJson, Encoding.UTF8, "application/json"); HttpResponseMessage httpResponse = await client.PostAsync("https://api.zyte.com/v1/extract", httpContent); var httpResponseBody = await httpResponse.Content.ReadAsByteArrayAsync(); var httpData = JsonDocument.Parse(httpResponseBody); var base64HttpResponseBodyField = httpData.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyField = System.Convert.FromBase64String(base64HttpResponseBodyField); var result = System.Text.Encoding.UTF8.GetString(httpResponseBodyField); Console.WriteLine(result); ```
#### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map browserParameters = ImmutableMap.of( "url", "https://toscrape.com/", "browserHtml", true, "responseCookies", true); String browserRequestBody = new Gson().toJson(browserParameters); HttpPost browserRequest = new HttpPost("https://api.zyte.com/v1/extract"); browserRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); browserRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); browserRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); browserRequest.setEntity(new StringEntity(browserRequestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( browserRequest, browserResponse -> { HttpEntity browserEntity = browserResponse.getEntity(); String browserApiResponse = EntityUtils.toString(browserEntity, StandardCharsets.UTF_8); JsonObject browserJsonObject = JsonParser.parseString(browserApiResponse).getAsJsonObject(); Map httpParameters = ImmutableMap.of( "url", "https://books.toscrape.com/", "httpResponseBody", true, "requestCookies", 
browserJsonObject.get("responseCookies")); String httpRequestBody = new Gson().toJson(httpParameters); HttpPost httpRequest = new HttpPost("https://api.zyte.com/v1/extract"); httpRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); httpRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); httpRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); httpRequest.setEntity(new StringEntity(httpRequestBody)); client.execute( httpRequest, httpResponse -> { HttpEntity httpEntity = httpResponse.getEntity(); String httpApiResponse = EntityUtils.toString(httpEntity, StandardCharsets.UTF_8); JsonObject httpJsonObject = JsonParser.parseString(httpApiResponse).getAsJsonObject(); String base64HttpResponseBody = httpJsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com/', browserHtml: true, responseCookies: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((browserResponse) => { axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://books.toscrape.com/', httpResponseBody: true, requestCookies: browserResponse.data.responseCookies }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((httpResponse) => { const httpResponseBody = Buffer.from( httpResponse.data.httpResponseBody, 'base64' ) console.log(httpResponseBody.toString()) }) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $browser_response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json'
=> [ 'url' => 'https://toscrape.com/', 'browserHtml' => true, 'responseCookies' => true, ], ]); $browser_data = json_decode($browser_response->getBody()); $http_response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://books.toscrape.com/', 'httpResponseBody' => true, 'requestCookies' => $browser_data->responseCookies, ], ]); $http_data = json_decode($http_response->getBody()); $http_response_body = base64_decode($http_data->httpResponseBody); echo $http_response_body; ``` #### Python ```python from base64 import b64decode import requests browser_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com/", "browserHtml": True, "responseCookies": True, }, ) http_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://books.toscrape.com/", "httpResponseBody": True, "requestCookies": browser_response.json()["responseCookies"], }, ) http_response_body = b64decode(http_response.json()["httpResponseBody"]) print(http_response_body.decode()) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() browser_response = await client.get( { "url": "https://toscrape.com/", "browserHtml": True, "responseCookies": True, } ) http_response = await client.get( { "url": "https://books.toscrape.com/", "httpResponseBody": True, "requestCookies": browser_response["responseCookies"], } ) http_response_body = b64decode(http_response["httpResponseBody"]).decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class ToScrapeComSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com/", callback=self.parse_browser, meta={ "zyte_api_automap": { 
"browserHtml": True, "responseCookies": True, }, }, ) def parse_browser(self, response): yield response.follow( "https://books.toscrape.com/", callback=self.parse_http, meta={ "zyte_api_automap": { "requestCookies": response.raw_api_response["responseCookies"], }, }, ) def parse_http(self, response): print(response.text) ``` ### Sessions In web scraping, a session is a set of request conditions (IP address, cookie jar, network stack, etc.) that, when shared by two or more requests, make those requests *seem* part of an organic web browsing session. For some websites, reusing cookies can be enough to maintain a session. But on other websites, sessions get invalidated when their requests do not share the same IP address, network stack, etc. Zyte API supports 2 different ways to define request sessions: - Client-managed sessions give you full control over session management. - Server-managed sessions let Zyte API handle session management for you. > ###### NOTE > > Sessions do *not* offer browser persistence. When two browser > requests use the same session, they are not actually using > the same browser tab, window, process or machine. > ###### TIP > > scrapy-zyte-api also implements an > alternative session management API, > similar to that of server-managed sessions, but built on top of client-managed > sessions. Zyte API sessions can be specially useful for: - Crawling stateful parts of websites, like multi-page forms, pagination or scrolling, where the time limit of actions can be a problem. > ###### NOTE > > Sessions do not maintain browser state, they only make it *seem* > so to target websites. In other words, when you send a 2nd request with > the same session, your request does not use the same browser instance > as the 1st request. > > Maintaining browser state between requests is a *planned* feature. - Optimizing scenarios where you need to set initial, session conditions (language, country, currency, address, etc.) shared by many follow-up requests. 
For example: - If you have multiple browser requests that all share a set of initial actions for basic session setup, such as using the `setLocation` action or similar, sessions can get you faster responses and give you extra run time for other actions. - If you have multiple HTTP requests that need cookies from an earlier browser request, and you need those follow-up requests to be sent with the same session as the browser request, sessions can give you that. #### Client-managed sessions To create a client-managed session, when sending a request, set [session.id](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/session.id) to a [version 4 UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random)). When sending follow-up requests with the same session ID, the created session will be reused, i.e. all requests will share the same IP address, network stack, cookie jar, etc. Compared to server-managed sessions, client-managed sessions offer a lower-level API that lets you do more but also requires you to do more. For example: - You control the number of sessions being used. You decide how many sessions you want to use at a given time, you create those sessions, you rotate your pool of sessions among your requests, and you create new sessions as old sessions expire. - You can stop using a specific session, e.g. if you can tell from a response that the target website invalidated the session. See [session](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/session) for details. #### Example 1: Same-session requests use the same IP address > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var sessionId = Guid.NewGuid().ToString(); for (int i = 0; i < 2; i++) { var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/ip"}, {"httpResponseBody", true}, { "session", new Dictionary<string, string>() { {"id", sessionId} } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes = System.Convert.FromBase64String(base64HttpResponseBody); var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes); var responseData = JsonDocument.Parse(httpResponseBody); var ipAddress = responseData.RootElement.GetProperty("origin").ToString(); Console.WriteLine(ipAddress); } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/ip", "httpResponseBody": true, "session": {"id": "e07843b4-fd72-4a02-82b4-3376c6ceba92"}} {"url": "https://httpbin.org/ip", "httpResponseBody": true, "session": {"id": "e07843b4-fd72-4a02-82b4-3376c6ceba92"}} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .origin ``` #### 
curl input.json ```json { "url": "https://httpbin.org/ip", "httpResponseBody": true, "session": { "id": "e07843b4-fd72-4a02-82b4-3376c6ceba92" } } ``` ```shell for i in {1..2} do curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .origin done ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import java.util.UUID; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { String sessionId = UUID.randomUUID().toString(); CloseableHttpClient client = HttpClients.createDefault(); for (int i = 0; i < 2; i++) { Map session = ImmutableMap.of("id", sessionId); Map parameters = ImmutableMap.of( "url", "https://httpbin.org/ip", "httpResponseBody", true, "session", session); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, 
buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); String body = data.get("origin").getAsString(); System.out.println(body); return null; }); } } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const crypto = require('crypto') const sessionId = String(crypto.randomUUID()) axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/ip', httpResponseBody: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const body = JSON.parse(httpResponseBody).origin console.log(body) axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/ip', httpResponseBody: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const body = JSON.parse(httpResponseBody).origin console.log(body) }) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); /* Random version 4 UUID used as the session ID. */ $bytes = random_bytes(16); $bytes[6] = chr(ord($bytes[6]) & 0x0f | 0x40); $bytes[8] = chr(ord($bytes[8]) & 0x3f | 0x80); $session_id = vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4)); for ($i = 0; $i < 2; $i++) { $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'httpResponseBody' => true, 'session' => ['id' => 
$session_id], ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $body = json_decode($http_response_body)->origin; echo $body.PHP_EOL; } ``` #### Proxy mode With the proxy mode, use the `Zyte-Session-ID` header. ```shell for i in {1..2} do curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --header 'Zyte-Session-ID: e07843b4-fd72-4a02-82b4-3376c6ceba92' \ --compressed \ https://httpbin.org/ip \ | jq --raw-output .origin done ``` #### Python ```python import json from base64 import b64decode from uuid import uuid4 import requests session_id = str(uuid4()) for _ in range(2): api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/ip", "httpResponseBody": True, "session": {"id": session_id}, }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) body: str = json.loads(http_response_body)["origin"] print(body) ``` #### Python client ```python import asyncio import json from base64 import b64decode from uuid import uuid4 from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() session_id = str(uuid4()) for i in range(2): api_response = await client.get( { "url": "https://httpbin.org/ip", "httpResponseBody": True, "session": {"id": session_id}, }, ) http_response_body = b64decode(api_response["httpResponseBody"]).decode() data = json.loads(http_response_body) print(data["origin"]) asyncio.run(main()) ``` #### Scrapy > ###### TIP > > scrapy-zyte-api also provides its own session management > API, similar to that of > server-managed sessions, but > built on top of client-managed sessions. 
```python import json from uuid import uuid4 from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): session_id = str(uuid4()) yield Request( "https://httpbin.org/ip", cb_kwargs={"session_id": session_id}, meta={"zyte_api_automap": {"session": {"id": session_id}}}, ) def parse(self, response, session_id): print(json.loads(response.body)["origin"]) yield Request( "https://httpbin.org/ip", meta={"zyte_api_automap": {"session": {"id": session_id}}}, dont_filter=True, callback=self.parse2, ) def parse2(self, response): print(json.loads(response.body)["origin"]) ``` Output: ```none 203.0.113.122 203.0.113.122 ``` #### Example 2: Reuse browser cookies in HTTP requests Start a session with a browser request to the home page of a website, and reuse that session for an HTTP request to a different URL of that website. #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var sessionId = Guid.NewGuid().ToString(); var browserInput = new Dictionary<string, object>(){ {"url", "https://toscrape.com/"}, {"browserHtml", true}, { "session", new Dictionary<string, string>() { {"id", sessionId} } } }; var browserInputJson = JsonSerializer.Serialize(browserInput); var browserContent = new StringContent(browserInputJson, Encoding.UTF8, "application/json"); await client.PostAsync("https://api.zyte.com/v1/extract", browserContent); var httpInput = new Dictionary<string, object>(){ {"url", "https://books.toscrape.com/"}, 
{"httpResponseBody", true}, { "session", new Dictionary<string, string>() { {"id", sessionId} } } }; var httpInputJson = JsonSerializer.Serialize(httpInput); var httpContent = new StringContent(httpInputJson, Encoding.UTF8, "application/json"); HttpResponseMessage httpResponse = await client.PostAsync("https://api.zyte.com/v1/extract", httpContent); var httpResponseBody = await httpResponse.Content.ReadAsByteArrayAsync(); var httpData = JsonDocument.Parse(httpResponseBody); var base64HttpResponseBodyField = httpData.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyField = System.Convert.FromBase64String(base64HttpResponseBodyField); var result = System.Text.Encoding.UTF8.GetString(httpResponseBodyField); Console.WriteLine(result); ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import java.util.UUID; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { String sessionId = UUID.randomUUID().toString(); Map session = ImmutableMap.of("id", sessionId); Map browserParameters = ImmutableMap.of("url", "https://toscrape.com/", "browserHtml", true, "session", session); String browserRequestBody = new Gson().toJson(browserParameters); 
HttpPost browserRequest = new HttpPost("https://api.zyte.com/v1/extract"); browserRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); browserRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); browserRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); browserRequest.setEntity(new StringEntity(browserRequestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( browserRequest, browserResponse -> { Map httpParameters = ImmutableMap.of( "url", "https://books.toscrape.com/", "httpResponseBody", true, "session", session); String httpRequestBody = new Gson().toJson(httpParameters); HttpPost httpRequest = new HttpPost("https://api.zyte.com/v1/extract"); httpRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); httpRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); httpRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); httpRequest.setEntity(new StringEntity(httpRequestBody)); client.execute( httpRequest, httpResponse -> { HttpEntity httpEntity = httpResponse.getEntity(); String httpApiResponse = EntityUtils.toString(httpEntity, StandardCharsets.UTF_8); JsonObject httpJsonObject = JsonParser.parseString(httpApiResponse).getAsJsonObject(); String base64HttpResponseBody = httpJsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const crypto = require('crypto') const sessionId = String(crypto.randomUUID()) axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com/', 
browserHtml: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((browserResponse) => { axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://books.toscrape.com/', httpResponseBody: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((httpResponse) => { const httpResponseBody = Buffer.from( httpResponse.data.httpResponseBody, 'base64' ) console.log(httpResponseBody.toString()) }) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); /* Random version 4 UUID used as the session ID. */ $bytes = random_bytes(16); $bytes[6] = chr(ord($bytes[6]) & 0x0f | 0x40); $bytes[8] = chr(ord($bytes[8]) & 0x3f | 0x80); $session_id = vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4)); $browser_response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://toscrape.com/', 'browserHtml' => true, 'session' => ['id' => $session_id], ], ]); $http_response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://books.toscrape.com/', 'httpResponseBody' => true, 'session' => ['id' => $session_id], ], ]); $http_data = json_decode($http_response->getBody()); $http_response_body = base64_decode($http_data->httpResponseBody); echo $http_response_body; ``` #### Python ```python from base64 import b64decode from uuid import uuid4 import requests session_id = str(uuid4()) browser_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com/", "browserHtml": True, "session": {"id": session_id}, }, ) http_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://books.toscrape.com/", "httpResponseBody": True, "session": {"id": session_id}, }, ) http_response_body = b64decode(http_response.json()["httpResponseBody"]) print(http_response_body.decode()) ``` #### Python client ```python import asyncio from base64 import b64decode from uuid import uuid4 from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() session_id = str(uuid4()) 
browser_response = await client.get( { "url": "https://toscrape.com/", "browserHtml": True, "session": {"id": session_id}, } ) http_response = await client.get( { "url": "https://books.toscrape.com/", "httpResponseBody": True, "session": {"id": session_id}, } ) http_response_body = b64decode(http_response["httpResponseBody"]).decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy ```python from uuid import uuid4 from scrapy import Request, Spider class ToScrapeComSpider(Spider): name = "toscrape_com" async def start(self): session_id = str(uuid4()) yield Request( "https://toscrape.com/", callback=self.parse_browser, cb_kwargs={"session_id": session_id}, meta={ "zyte_api_automap": { "browserHtml": True, "session": {"id": session_id}, }, }, ) def parse_browser(self, response, session_id): yield response.follow( "https://books.toscrape.com/", callback=self.parse_http, meta={ "zyte_api_automap": { "session": {"id": session_id}, }, }, ) def parse_http(self, response): print(response.text) ``` #### Server-managed sessions > ###### WARNING > > Pricing-wise, requests that do not > reuse a previous session and use > [sessionContextParameters.actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters.actions) count as browser requests, > including action costs. > ###### NOTE > > The proxy mode does not support > server-managed sessions. **Session contexts** let you request a server-managed session and define prerequisites for it. To assign a session context to a request: - Set [sessionContext](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContext) to an arbitrary array of name-value pair objects that uniquely identify your session context. - Set in [sessionContextParameters](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters) your session prerequisites. 
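As a minimal sketch of the two steps above, the following Python snippet assembles a request body that assigns a session context to a request. The helper name `build_session_context_request` is hypothetical, and the `id`/`cookies` pair and the `goto` action URL are only illustrative values; the snippet builds and prints the payload without sending it.

```python
import json


def build_session_context_request(url, context, actions):
    """Assemble a request body that asks for a server-managed session.

    Hypothetical helper, not part of any Zyte library: `context` is a dict of
    arbitrary names and values that identify the session context, and
    `actions` is a list of action objects used as session prerequisites.
    """
    return {
        "url": url,
        "httpResponseBody": True,
        # sessionContext is an array of name-value pair objects.
        "sessionContext": [
            {"name": name, "value": value} for name, value in context.items()
        ],
        # sessionContextParameters carries the session prerequisites.
        "sessionContextParameters": {"actions": actions},
    }


payload = build_session_context_request(
    url="http://httpbin.org/cookies",
    context={"id": "cookies"},
    actions=[{"action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar"}],
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to `https://api.zyte.com/v1/extract`, in the same way as the request bodies in the other Python examples on this page.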
> ###### TIP
>
> Before using [sessionContextParameters.actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters.actions),
> make sure your actions work on the target
> website, e.g. send a test browser request
> with those actions, and check their outcome in the
> [actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/actions) response field.
>
> `setLocation` is a good example of an action commonly used in
> [sessionContextParameters.actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters.actions) that is not available
> on every website. If you want to use it for a website for which the
> action is not yet available, please [reach out to us](https://support.zyte.com/support/tickets/new).

Every request that you send with the same value in [sessionContext](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContext) will use a session that was initialized with [sessionContextParameters](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters). All those requests should also always include the [sessionContextParameters](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters) request field with the same value.

Zyte API handles creation, reuse, and deletion of sessions requested through [sessionContext](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContext), meaning:

- When you send requests with the same [sessionContext](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContext), they may use the same session, or use separate sessions that were each initialized with the same [sessionContextParameters](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters).
- You cannot invalidate a specific session. You *can* change the value of [sessionContext](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContext), even if [sessionContextParameters](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters) remains the same, so that your requests will not reuse sessions created with the previous value of [sessionContext](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContext), but you would be invalidating all sessions, not a single one. If you need to be able to invalidate specific sessions, e.g. based on response content, consider using client-managed sessions or scrapy-zyte-api’s session management API instead. #### Example 1: Set a cookie on all sessions > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "http://httpbin.org/cookies"}, {"httpResponseBody", true}, { "sessionContext", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"name", "id"}, {"value", "cookies"} } } }, { "sessionContextParameters", new Dictionary<string, object>() { { "actions", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"action", "goto"}, {"url", "http://httpbin.org/cookies/set/foo/bar"}, } } } } } }; var
inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes = System.Convert.FromBase64String(base64HttpResponseBody); var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes); Console.WriteLine(httpResponseBody); ``` #### CLI client input.jsonl ```json {"url": "http://httpbin.org/cookies", "httpResponseBody": true, "sessionContext": [{"name": "id", "value": "cookies"}], "sessionContextParameters": {"actions": [{"action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar"}]}} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### curl input.json ```json { "url": "http://httpbin.org/cookies", "httpResponseBody": true, "sessionContext": [ { "name": "id", "value": "cookies" } ], "sessionContextParameters": { "actions": [ { "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar" } ] } } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### Java ```java import com.google.common.collect.ImmutableList; import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import 
org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "http://httpbin.org/cookies", "httpResponseBody", true, "sessionContext", ImmutableList.of(ImmutableMap.of("name", "id", "value", "cookies")), "sessionContextParameters", ImmutableMap.of( "actions", ImmutableList.of( ImmutableMap.of( "action", "goto", "url", "http://httpbin.org/cookies/set/foo/bar")))); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = 
require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'http://httpbin.org/cookies', httpResponseBody: true, sessionContext: [ { name: 'id', value: 'cookies' } ], sessionContextParameters: { actions: [ { action: 'goto', url: 'http://httpbin.org/cookies/set/foo/bar' } ] } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) console.log(httpResponseBody.toString()) }) ``` #### PHP ```php request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'http://httpbin.org/cookies', 'httpResponseBody' => true, 'sessionContext' => [ [ 'name' => 'id', 'value' => 'cookies', ], ], 'sessionContextParameters' => [ 'actions' => [ [ 'action' => 'goto', 'url' => 'http://httpbin.org/cookies/set/foo/bar', ], ], ], ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); echo $http_response_body.PHP_EOL; ``` #### Python ```python from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "http://httpbin.org/cookies", "httpResponseBody": True, "sessionContext": [ { "name": "id", "value": "cookies", }, ], "sessionContextParameters": { "actions": [ { "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar", }, ], }, }, ) http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"]) http_response_body = http_response_body_bytes.decode() print(http_response_body) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "http://httpbin.org/cookies", "httpResponseBody": True, "sessionContext": [ { "name": "id", "value": "cookies", }, ], "sessionContextParameters": { "actions": [ 
{ "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar", }, ], }, }, ) http_response_body_bytes = b64decode(api_response["httpResponseBody"]) http_response_body = http_response_body_bytes.decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy > ###### TIP > > scrapy-zyte-api also provides its own session management > API, similar to that of > server-managed sessions, but > built on top of client-managed sessions. ```python from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "http://httpbin.org/cookies", meta={ "zyte_api_automap": { "sessionContext": [ { "name": "id", "value": "cookies", }, ], "sessionContextParameters": { "actions": [ { "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar", }, ], }, }, }, ) def parse(self, response): print(response.text) ``` Output: ```json { "cookies": { "foo": "bar" } } ``` #### Example 2: Start sessions on a browser, use them in HTTP requests Set a no-op action in [sessionContextParameters](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters) to force sessions to start with a browser request, but use HTTP requests. 
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://toscrape.com/"}, {"httpResponseBody", true}, { "sessionContext", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"name", "id"}, {"value", "browser"} } } }, { "sessionContextParameters", new Dictionary<string, object>() { { "actions", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"action", "waitForTimeout"}, {"timeout", 0}, } } } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes = System.Convert.FromBase64String(base64HttpResponseBody); var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes); Console.WriteLine(httpResponseBody); ``` #### CLI client input.jsonl ```json {"url": "https://toscrape.com/", "httpResponseBody": true, "sessionContext": [{"name": "id", "value": "browser"}], "sessionContextParameters": {"actions": [{"action": "waitForTimeout", "timeout": 0}]}} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### curl input.json ```json {
"url": "https://toscrape.com/", "httpResponseBody": true, "sessionContext": [ { "name": "id", "value": "browser" } ], "sessionContextParameters": { "actions": [ { "action": "waitForTimeout", "timeout": 0 } ] } } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### Java ```java import com.google.common.collect.ImmutableList; import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://toscrape.com/", "httpResponseBody", true, "sessionContext", ImmutableList.of(ImmutableMap.of("name", "id", "value", "browser")), "sessionContextParameters", ImmutableMap.of( "actions", ImmutableList.of(ImmutableMap.of("action", "waitForTimeout", "timeout", 0)))); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); 
request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com/', httpResponseBody: true, sessionContext: [ { name: 'id', value: 'browser' } ], sessionContextParameters: { actions: [ { action: 'waitForTimeout', timeout: 0 } ] } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) console.log(httpResponseBody.toString()) }) ``` #### PHP ```php request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://toscrape.com/', 'httpResponseBody' => true, 'sessionContext' => [ [ 'name' => 'id', 'value' => 'browser', ], ], 'sessionContextParameters' => [ 'actions' => [ [ 'action' => 'waitForTimeout', 'timeout' => 0, ], ], ], ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); echo $http_response_body.PHP_EOL; ``` #### Python ```python from base64 import b64decode import requests 
api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com/", "httpResponseBody": True, "sessionContext": [{"name": "id", "value": "browser"}], "sessionContextParameters": { "actions": [ { "action": "waitForTimeout", "timeout": 0, }, ], }, }, ) http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"]) http_response_body = http_response_body_bytes.decode() print(http_response_body) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() http_response = await client.get( { "url": "https://toscrape.com/", "httpResponseBody": True, "sessionContext": [{"name": "id", "value": "browser"}], "sessionContextParameters": { "actions": [ { "action": "waitForTimeout", "timeout": 0, }, ], }, } ) http_response_body = b64decode(http_response["httpResponseBody"]).decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://toscrape.com/", meta={ "zyte_api_automap": { "sessionContext": [ { "name": "id", "value": "browser", }, ], "sessionContextParameters": { "actions": [ { "action": "waitForTimeout", "timeout": 0, }, ], }, }, }, ) def parse(self, response): print(response.text) ``` #### Session IP addresses Requests using the same session will normally share the same IP address. This may not be the case, though, in the following scenarios: - If Zyte API is using a device residential IP address for a session, and that IP address expires, new requests using the same session will get a different IP address. The new IP address will be in the same country as the original IP address. 
- When using client-managed sessions, if you send 2 or more requests in parallel with the same session ID, and the session does not exist already, each request may get a different IP address. You should create sessions with a single request and, once you get a response, you can send as many parallel requests as you want with that session. While requests in the same session are almost guaranteed to use the same IP address, requests from different sessions are not guaranteed to have different IP addresses, although they often will. #### Session cookie jars Requests using the same session share the same cookie jar. Cookies from the target websites received by session requests will be stored in the session cookie jar, and affect follow-up session requests. > ###### NOTE > > While you can use [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies) and > [responseCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/responseCookies) on requests using a session, those > parameters only affect the specific request where they are set, they do not > affect the session cookie jar. You cannot manually set cookies on the > session cookie jar or read the contents of the session cookie jar. ### Response headers Set the [httpResponseHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseHeaders) request field to `true` to get HTTP response headers in the [httpResponseHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseHeaders) response field. #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
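The `httpResponseHeaders` response field is an array of objects with `name` and `value` keys, as the output at the end of this section shows. A minimal sketch, assuming that shape, of grouping header values by lowercased name so that repeated headers are kept; `group_headers` is a hypothetical helper, and the `Vary` entries in the sample are made up for illustration:

```python
from collections import defaultdict


def group_headers(http_response_headers):
    """Group header values by lowercased header name, keeping duplicates."""
    grouped = defaultdict(list)
    for header in http_response_headers:
        grouped[header["name"].lower()].append(header["value"])
    return dict(grouped)


# Illustrative sample in the shape of the httpResponseHeaders response field.
sample = [
    {"name": "date", "value": "Fri, 25 Aug 2023 07:08:05 GMT"},
    {"name": "Vary", "value": "Accept-Encoding"},
    {"name": "vary", "value": "Origin"},
]
headers = group_headers(sample)
print(headers["vary"])  # ['Accept-Encoding', 'Origin']
```

Keeping a list per name avoids silently dropping values when a website sends the same header more than once, which a plain name-to-value dictionary would do.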
#### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://toscrape.com"}, {"httpResponseHeaders", true} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var headerEnumerator = data.RootElement.GetProperty("httpResponseHeaders").EnumerateArray(); var headers = new Dictionary<string, string>(); while (headerEnumerator.MoveNext()) { headers.Add( headerEnumerator.Current.GetProperty("name").ToString(), headerEnumerator.Current.GetProperty("value").ToString() ); } ``` #### CLI client input.jsonl ```json {"url": "https://toscrape.com", "httpResponseHeaders": true} ``` ```shell zyte-api input.jsonl \ | jq .httpResponseHeaders ``` #### curl input.json ```json { "url": "https://toscrape.com", "httpResponseHeaders": true } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq .httpResponseHeaders ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonArray; import com.google.gson.JsonObject; import
com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://toscrape.com", "browserHtml", true, "httpResponseHeaders", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); JsonArray httpResponseHeaders = jsonObject.get("httpResponseHeaders").getAsJsonArray(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(httpResponseHeaders)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### 
JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com', httpResponseHeaders: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseHeaders = response.data.httpResponseHeaders }) ``` #### PHP ```php request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://toscrape.com', 'httpResponseHeaders' => true, ], ]); $api = json_decode($response->getBody()); $http_response_headers = $api->httpResponseHeaders; ``` #### Proxy mode With the proxy mode, response headers are always included in the HTTP response, no need to ask for them explicitly. #### Python ```python import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com", "httpResponseHeaders": True, }, ) http_response_headers = api_response.json()["httpResponseHeaders"] ``` #### Python client ```python import asyncio import json from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://toscrape.com", "httpResponseHeaders": True, } ) http_response_headers = api_response["httpResponseHeaders"] print(json.dumps(http_response_headers, indent=2)) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class ToScrapeComSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={ "zyte_api_automap": { "httpResponseBody": False, "httpResponseHeaders": True, }, }, ) def parse(self, response): headers = response.headers ``` > ###### NOTE > > In transparent mode, > [httpResponseHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseHeaders) is sent by default for > httpResponseBody requests, but sending it > explicitly is still recommended, as future versions 
of > scrapy-zyte-api may stop sending it > by default. Output (first 5 lines): ```json [ { "name": "date", "value": "Fri, 25 Aug 2023 07:08:05 GMT" }, ``` > ###### NOTE > > Reading cookies from `Set-Cookie` response headers is not > recommended: those headers only contain the cookies set by the final > response, and do not account for cookies set during redirection or during browser rendering. Use [responseCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/responseCookies) instead, as > described in zapi-cookies. ### Metadata Set the [echoData](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/echoData) request field to an arbitrary value to get that value back verbatim in the [echoData](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/echoData) response field. #### Example > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Linq; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; var inputData = new List<List<object>>() { new List<object>(){"https://toscrape.com", 1}, new List<object>(){"https://books.toscrape.com", 2}, new List<object>(){"https://quotes.toscrape.com", 3}, }; var output = new List<HttpResponseMessage>(); var handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All, MaxConnectionsPerServer = 15 }; var client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var responseTasks = new List<Task<HttpResponseMessage>>(); foreach (var entry in inputData) { var input = new Dictionary<string, object>(){ {"url", entry[0]}, {"browserHtml", true}, {"echoData", entry[1]} }; var
inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); var responseTask = client.PostAsync("https://api.zyte.com/v1/extract", content); responseTasks.Add(responseTask); } while (responseTasks.Any()) { var responseTask = await Task.WhenAny(responseTasks); responseTasks.Remove(responseTask); var response = await responseTask; output.Add(response); } ``` #### CLI client input.jsonl ```json {"url": "https://toscrape.com", "browserHtml": true, "echoData": 1} {"url": "https://books.toscrape.com", "browserHtml": true, "echoData": 2} {"url": "https://quotes.toscrape.com", "browserHtml": true, "echoData": 3} ``` ```shell zyte-api --n-conn 15 input.jsonl -o output.jsonl ``` #### curl input.jsonl ```json {"url": "https://toscrape.com", "browserHtml": true, "echoData": 1} {"url": "https://books.toscrape.com", "browserHtml": true, "echoData": 2} {"url": "https://quotes.toscrape.com", "browserHtml": true, "echoData": 3} ``` ```shell cat input.jsonl \ | xargs -P 15 -d\\n -n 1 \ bash -c " curl \ --user $ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data \"\$0\" \ --compressed \ https://api.zyte.com/v1/extract \ | jq .echoData \ | awk '{print \$1}' \ >> output.jsonl " ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import java.io.IOException; import java.util.ArrayList; import java.util.Base64; import java.util.List; import java.util.Map; import java.util.concurrent.ExecutionException; import java.util.concurrent.Future; import org.apache.hc.client5.http.async.methods.SimpleHttpRequest; import org.apache.hc.client5.http.async.methods.SimpleHttpResponse; import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient; import org.apache.hc.client5.http.impl.async.HttpAsyncClients; import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager; import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder; 
import org.apache.hc.client5.http.ssl.ClientTlsStrategyBuilder;
import org.apache.hc.core5.concurrent.FutureCallback;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.nio.ssl.TlsStrategy;
class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";
  public static void main(final String[] args)
      throws ExecutionException, InterruptedException, IOException, ParseException {
    // Each entry pairs a target URL with the value to send as echoData.
    Object[][] input = {
      {"https://toscrape.com", 1},
      {"https://books.toscrape.com", 2},
      {"https://quotes.toscrape.com", 3}
    };
    List<Future<SimpleHttpResponse>> futures = new ArrayList<>();
    List<String> output = new ArrayList<>();
    int concurrency = 15;
    // https://issues.apache.org/jira/browse/HTTPCLIENT-2219
    final TlsStrategy tlsStrategy = ClientTlsStrategyBuilder.create().useSystemProperties().build();
    PoolingAsyncClientConnectionManager connectionManager =
        PoolingAsyncClientConnectionManagerBuilder.create().setTlsStrategy(tlsStrategy).build();
    connectionManager.setMaxTotal(concurrency);
    connectionManager.setDefaultMaxPerRoute(concurrency);
    CloseableHttpAsyncClient client =
        HttpAsyncClients.custom().setConnectionManager(connectionManager).build();
    try {
      client.start();
      for (int i = 0; i < input.length; i++) {
        Map<String, Object> parameters =
            ImmutableMap.of("url", input[i][0], "browserHtml", true, "echoData", input[i][1]);
        String requestBody = new Gson().toJson(parameters);
        SimpleHttpRequest request =
            new SimpleHttpRequest("POST", "https://api.zyte.com/v1/extract");
        request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
        request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
        request.setBody(requestBody, ContentType.APPLICATION_JSON);
        final Future<SimpleHttpResponse> future =
            client.execute(
                request,
                new FutureCallback<SimpleHttpResponse>() {
                  public void completed(final SimpleHttpResponse response) {
                    String apiResponse = response.getBodyText();
                    output.add(apiResponse);
                  }
                  public void failed(final Exception ex) {}
                  public void cancelled() {}
                });
        futures.add(future);
      }
      for (int i = 0; i < futures.size(); i++) {
        futures.get(i).get();
      }
    } finally {
      client.close();
    }
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const { ConcurrencyManager } = require('axios-concurrency')
const axios = require('axios')

const urls = [
  ['https://toscrape.com', 1],
  ['https://books.toscrape.com', 2],
  ['https://quotes.toscrape.com', 3]
]
const output = []

const client = axios.create()
ConcurrencyManager(client, 15)

Promise.all(
  urls.map((input) =>
    client.post(
      'https://api.zyte.com/v1/extract',
      { url: input[0], browserHtml: true, echoData: input[1] },
      { auth: { username: 'YOUR_ZYTE_API_KEY' } }
    ).then((response) => output.push(response.data))
  )
)
```

#### PHP

```php
<?php

$input = [
    ['https://toscrape.com', 1],
    ['https://books.toscrape.com', 2],
    ['https://quotes.toscrape.com', 3],
];
$output = [];
$promises = [];
$client = new GuzzleHttp\Client();

foreach ($input as $url_and_index) {
    $options = [
        'auth' => ['YOUR_ZYTE_API_KEY', ''],
        'headers' => ['Accept-Encoding' => 'gzip'],
        'json' => [
            'url' => $url_and_index[0],
            'browserHtml' => true,
            'echoData' => $url_and_index[1],
        ],
    ];
    $request = new \GuzzleHttp\Psr7\Request('POST', 'https://api.zyte.com/v1/extract');
    global $promises;
    $promises[] = $client->sendAsync($request, $options)->then(function ($response) {
        global $output;
        $output[] = json_decode($response->getBody());
    });
}

foreach ($promises as $promise) {
    $promise->wait();
}
```

#### Proxy mode

With the proxy mode you cannot set request metadata.
#### Python

```python
import asyncio

import aiohttp

input_data = [
    ("https://toscrape.com", 1),
    ("https://books.toscrape.com", 2),
    ("https://quotes.toscrape.com", 3),
]
output = []


async def extract(client, url, index):
    response = await client.post(
        "https://api.zyte.com/v1/extract",
        json={"url": url, "browserHtml": True, "echoData": index},
        auth=aiohttp.BasicAuth("YOUR_ZYTE_API_KEY"),
    )
    output.append(await response.json())


async def main():
    connector = aiohttp.TCPConnector(limit_per_host=15)
    async with aiohttp.ClientSession(connector=connector) as client:
        await asyncio.gather(
            *[extract(client, url, index) for url, index in input_data]
        )


asyncio.run(main())
```

#### Python client

```python
import asyncio
import json

from zyte_api import AsyncZyteAPI

input_data = [
    ("https://toscrape.com", 1),
    ("https://books.toscrape.com", 2),
    ("https://quotes.toscrape.com", 3),
]


async def main():
    client = AsyncZyteAPI(n_conn=15)
    queries = [
        {"url": url, "browserHtml": True, "echoData": index}
        for url, index in input_data
    ]
    async with client.session() as session:
        for future in session.iter(queries):
            response = await future
            print(json.dumps(response))


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider

input_data = [
    ("https://toscrape.com", 1),
    ("https://books.toscrape.com", 2),
    ("https://quotes.toscrape.com", 3),
]


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    custom_settings = {
        "CONCURRENT_REQUESTS": 15,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 15,
    }

    async def start(self):
        for url, index in input_data:
            yield Request(
                url,
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                        "echoData": index,
                    },
                },
            )

    def parse(self, response):
        yield {
            "index": response.raw_api_response["echoData"],
            "html": response.text,
        }
```

Alternatively, you can use Scrapy’s `Request.cb_kwargs` directly for a similar purpose:

```python
    async def start(self):
        for url, index in input_data:
            yield Request(
                url,
                cb_kwargs={"index": index},
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                    },
                },
            )

    def parse(self, response, index):
        yield {
            "index": index,
            "html": response.text,
        }
```

Output:

```json
{"url": "https://quotes.toscrape.com/", "statusCode": 200, "browserHtml": "\n\t\n\tQuotes to Scrape\n \n \n\n\n
... (rest of the Quotes to Scrape page HTML truncated) ...
\n \n\n", "echoData": 3} {"url": "https://books.toscrape.com/", "statusCode": 200, "browserHtml": "\n \n All products | Books to Scrape - Sandbox\n\n\n \n \n \n \n \n\n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n\n\n \n \n\n \n\n \n \n \n\n \n \n\n \n \n \n \n \n
... (rest of the Books to Scrape page HTML truncated) ...
\n\n\n \n \n \n \n \n \n \n \n\n\n \n \n \n \n \n \n \n\n \n \n\n\n \n \n \n\n \n\n\n \n \n\n \n \n \n \n\n", "echoData": 2} {"url": "https://toscrape.com/", "statusCode": 200, "browserHtml": "\n \n Scraping Sandbox\n \n \n \n \n
... (rest of the Web Scraping Sandbox page HTML truncated) ...
\n \n\n", "echoData": 1}
```

### Permissions control

By default, Zyte API may use different techniques to avoid bans. You can [request changes](https://support.zyte.com/support/tickets/new) to which features Zyte API can use for your account:

- **CAPTCHA management** (default: enabled)

- **Device residential IPs** (default: enabled)

  Disabling this limits your requests to data center IPs, ensuring transparency to website owners. However, it also disables the use of extended geolocations.

  If you only want to disable device residential IPs for *some* requests, set [ipType](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/ipType) to `datacenter` on those requests instead.

> ###### NOTE
>
> Disabling either option, or both, may increase the rate of ban
> responses for some websites.

## Zyte API proxy mode

To use Zyte API as a proxy, use the `api.zyte.com:8011` endpoint, with your [Zyte API key](https://app.zyte.com/o/zyte-api/api-access) and proxy headers:

#### C#

```cs
using System;
using System.Net;
using System.Net.Http;

var proxy = new WebProxy("http://api.zyte.com:8011", true);
proxy.Credentials = new NetworkCredential("YOUR_ZYTE_API_KEY", "");

var httpClientHandler = new HttpClientHandler
{
    Proxy = proxy,
};

var client = new HttpClient(handler: httpClientHandler, disposeHandler: true);
var message = new HttpRequestMessage(HttpMethod.Get, "https://toscrape.com");
var response = client.Send(message);
var body = await response.Content.ReadAsStringAsync();
Console.WriteLine(body);
```

#### curl

```bash
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com
```

#### Java

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hc.client5.http.auth.AuthCache;
import org.apache.hc.client5.http.auth.AuthScope;
import org.apache.hc.client5.http.auth.CredentialsProvider;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.auth.BasicAuthCache;
import org.apache.hc.client5.http.impl.auth.BasicScheme;
import org.apache.hc.client5.http.impl.auth.CredentialsProviderBuilder;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.routing.DefaultProxyRoutePlanner;
import org.apache.hc.client5.http.protocol.HttpClientContext;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHost;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;

class Example {
  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    HttpHost proxy = new HttpHost("api.zyte.com", 8011);
    DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);

    CredentialsProvider credentialsProvider =
        CredentialsProviderBuilder.create()
            .add(new AuthScope(proxy), "YOUR_ZYTE_API_KEY", "".toCharArray())
            .build();

    AuthCache authCache = new BasicAuthCache();
    BasicScheme basicAuth = new BasicScheme();
    authCache.put(proxy, basicAuth);
    HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(credentialsProvider);
    context.setAuthCache(authCache);

    CloseableHttpClient client =
        HttpClients.custom()
            .setRoutePlanner(routePlanner)
            .setDefaultCredentialsProvider(credentialsProvider)
            .build();

    HttpGet request = new HttpGet("https://toscrape.com");
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String httpResponseBody = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          System.out.println(httpResponseBody);
          return null;
        });
  }
}
```

#### JS

```js
const axios = require('axios')

axios
  .get(
    'https://toscrape.com',
    {
      proxy: {
        protocol: 'http',
        host: 'api.zyte.com',
        port: 8011,
        auth: {
          username: 'YOUR_ZYTE_API_KEY',
          password: ''
        }
      }
    }
  )
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://toscrape.com', [
    'proxy' => 'http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011',
]);
$http_response_body = (string) $response->getBody();
fwrite(STDOUT, $http_response_body);
```

#### Python

> ###### NOTE
>
> You need to install and configure our CA certificate for
> the requests library.

```python
import requests

response = requests.get(
    "https://toscrape.com",
    proxies={
        scheme: "http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Ruby

```ruby
# frozen_string_literal: true

require 'net/http'

url = URI('https://toscrape.com/')
proxy_host = 'api.zyte.com'
proxy_port = '8011'

http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port, 'YOUR_ZYTE_API_KEY', '')
http.use_ssl = true

r = http.start do |h|
  h.request(Net::HTTP::Get.new(url))
end

puts r.body
```

#### Scrapy

When using [scrapy-zyte-smartproxy](https://github.com/scrapy-plugins/scrapy-zyte-smartproxy), set the `ZYTE_SMARTPROXY_URL` setting to `"http://api.zyte.com:8011"` and the `ZYTE_SMARTPROXY_APIKEY` setting to [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access) for Zyte API.

> ###### NOTE
>
> **Important**: Use your **Zyte API key** here, not a Scrapy Cloud API key. Make sure you get this from the Zyte API access page.

Then you can continue using Scrapy as usual and all requests will be proxied through Zyte API automatically.

```python
from scrapy import Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        print(response.text)
```

### Key differences

The proxy mode makes it easier to migrate existing code that uses a proxy service.
However, the proxy mode and the HTTP API have some key differences:

| Feature                 | HTTP API     | Proxy mode         |
|-------------------------|--------------|--------------------|
| Parameter definition    | Request body | Request headers    |
| Browser HTML            | Yes          | Yes (new!)         |
| Screenshots             | Yes          | No                 |
| Browser actions         | Yes          | No                 |
| Network capture         | Yes          | No                 |
| Disable JS on browser   | Yes          | No                 |
| Automatic extraction    | Yes          | No                 |
| Server-managed sessions | Yes          | No                 |
| Echo data               | Yes          | No                 |
| Overhead                | Some         | Minimum            |
| Cookie definition       | Multi-domain | Target domain only |

#### Overhead

With HTTP requests, the HTTP API adds some overhead to responses, mainly because [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) is base64-encoded. This increases network traffic and latency, and requires base64-decoding on your end.

In contrast, with proxy mode the only overhead is some additional response headers.

> ###### SEE ALSO
>
> zapi-optimize

#### Cookie definition

With proxy mode, you can only set cookies for the domain of the target URL; you cannot manually set cookies for additional domains that may be reached through redirection.

### Request headers

The following headers allow changing how a request is sent through Zyte API in proxy mode.

#### Zyte-Browser-Html

Sets browserHtml.

This is not compatible with `Zyte-Disable-Follow-Redirect`.

#### Example

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
#### curl

```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -H "Zyte-Browser-Html: true" \
    https://toscrape.com
```

#### C#

```cs
using System;
using System.Net;
using System.Net.Http;

var proxy = new WebProxy("http://api.zyte.com:8011", true);
proxy.Credentials = new NetworkCredential("YOUR_ZYTE_API_KEY", "");

var httpClientHandler = new HttpClientHandler
{
    Proxy = proxy,
};

var client = new HttpClient(handler: httpClientHandler, disposeHandler: true);
client.DefaultRequestHeaders.Add("Zyte-Browser-Html", "true");
var message = new HttpRequestMessage(HttpMethod.Get, "https://toscrape.com");
var response = client.Send(message);
var body = await response.Content.ReadAsStringAsync();
Console.WriteLine(body);
```

#### Java

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hc.client5.http.auth.AuthCache;
import org.apache.hc.client5.http.auth.AuthScope;
import org.apache.hc.client5.http.auth.CredentialsProvider;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.auth.BasicAuthCache;
import org.apache.hc.client5.http.impl.auth.BasicScheme;
import org.apache.hc.client5.http.impl.auth.CredentialsProviderBuilder;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.routing.DefaultProxyRoutePlanner;
import org.apache.hc.client5.http.protocol.HttpClientContext;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHost;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;

class Example {
  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    HttpHost proxy = new HttpHost("api.zyte.com", 8011);
    DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);

    CredentialsProvider credentialsProvider =
        CredentialsProviderBuilder.create()
            .add(new AuthScope(proxy), "YOUR_ZYTE_API_KEY", "".toCharArray())
            .build();

    AuthCache authCache = new BasicAuthCache();
    BasicScheme basicAuth = new BasicScheme();
    authCache.put(proxy, basicAuth);
    HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(credentialsProvider);
    context.setAuthCache(authCache);

    CloseableHttpClient client =
        HttpClients.custom()
            .setRoutePlanner(routePlanner)
            .setDefaultCredentialsProvider(credentialsProvider)
            .build();

    HttpGet request = new HttpGet("https://toscrape.com");
    request.setHeader("Zyte-Browser-Html", "true");
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String httpResponseBody = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          System.out.println(httpResponseBody);
          return null;
        });
  }
}
```

#### JS

```js
const axios = require('axios')

axios
  .get(
    'https://toscrape.com',
    {
      headers: {
        'Zyte-Browser-Html': 'true'
      },
      proxy: {
        protocol: 'http',
        host: 'api.zyte.com',
        port: 8011,
        auth: {
          username: 'YOUR_ZYTE_API_KEY',
          password: ''
        }
      }
    }
  )
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://toscrape.com', [
    'headers' => [
        'Zyte-Browser-Html' => 'true',
    ],
    'proxy' => 'http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011',
]);
$http_response_body = (string) $response->getBody();
fwrite(STDOUT, $http_response_body);
```

#### Python

```python
import requests

response = requests.get(
    "https://toscrape.com",
    headers={
        "Zyte-Browser-Html": "true",
    },
    proxies={
        scheme: "http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Ruby

```ruby
# frozen_string_literal: true

require 'net/http'

url = URI('https://toscrape.com/')
proxy_host = 'api.zyte.com'
proxy_port = '8011'

http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port, 'YOUR_ZYTE_API_KEY', '')
http.use_ssl = true

request = Net::HTTP::Get.new(url)
request['Zyte-Browser-Html'] = 'true'

r = http.start do |h|
  h.request(request)
end

puts r.body
```

#### Scrapy

```python
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    async def start(self):
        yield Request("https://toscrape.com", headers={"Zyte-Browser-Html": "true"})

    def parse(self, response):
        print(response.text)
```

Output (first 5 lines):

```html
Scraping Sandbox
```

#### Zyte-Client

May be used to report to Zyte the software being used to access Zyte API.

It should be formatted with the syntax of the [User-Agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) header, e.g. `curl/1.2.3`.

#### Zyte-Cookie-Management

Sets cookieManagement.

#### Zyte-Device

Sets device emulation.

#### Zyte-Disable-Follow-Redirect

When set to `true`, disables redirect following, which is enabled by default.

#### Zyte-Geolocation

Sets a geolocation.

#### Zyte-IPType

Sets ipType.

#### Zyte-JobId

Sets the ID of the Scrapy Cloud job that is sending the request.

[scrapy-zyte-smartproxy](https://github.com/scrapy-plugins/scrapy-zyte-smartproxy) sets this header automatically when used from a Scrapy Cloud job.

#### Zyte-Override-Headers

Zyte API automatically sends some request headers for ban avoidance. Custom headers from your request will override most automatic headers, but not these:

- `Accept`
- `Accept-Encoding`
- `User-Agent`

To override any of these 3 headers, set `Zyte-Override-Headers` to a comma-separated list of names of headers to override, e.g. `Zyte-Override-Headers: Accept,Accept-Encoding`.

> ###### WARNING
>
> Overriding headers can break Zyte API ban avoidance.

#### Zyte-Session-ID

Sets [session.id](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/session.id) for a client-managed session.
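As a rough sketch of how the request headers above can be assembled in client code, the helper below adds `Zyte-Override-Headers` for whichever of the three protected headers the caller sets, and optionally a `Zyte-Session-ID`. The function name, its signature, and the use of a random UUID as session ID are assumptions made for illustration; they are not part of any Zyte library, so check the `session.id` reference for the accepted ID format.

```python
import uuid

# Headers that, per the Zyte-Override-Headers section, are only overridden
# when they are explicitly listed in that header.
PROTECTED_HEADERS = ("Accept", "Accept-Encoding", "User-Agent")


def build_proxy_headers(custom_headers, session_id=None):
    """Return a copy of custom_headers with proxy-mode headers added.

    Hypothetical helper, for illustration only.
    """
    headers = dict(custom_headers)
    # HTTP header names are case-insensitive, so compare in lowercase.
    provided = {name.lower() for name in headers}
    to_override = [name for name in PROTECTED_HEADERS if name.lower() in provided]
    if to_override:
        headers["Zyte-Override-Headers"] = ",".join(to_override)
    if session_id is not None:
        headers["Zyte-Session-ID"] = str(session_id)
    return headers


headers = build_proxy_headers(
    {"User-Agent": "my-crawler/1.0", "Accept": "text/html"},
    session_id=uuid.uuid4(),  # assumption: a UUID is an acceptable session ID
)
print(headers["Zyte-Override-Headers"])  # prints "Accept,User-Agent"
```

The resulting dictionary can be passed as the `headers` argument of a proxy-mode request, as in the Python example for `Zyte-Browser-Html`.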
#### Zyte-Tags

Sets the [tags](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/tags) dictionary in the request.

The header is a JSON object, such as `Zyte-Tags: {"foo": "bar", "baz": null, "435":"true"}`. The value must be valid ASCII and under 512 bytes long.

#### Invalid request headers

The following headers are not allowed, and any request with one or more of them will result in an HTTP 400 response:

- `Client-IP`
- `Cluster-Client-IP`
- `Forwarded-For`
- `True-Client-IP`
- `Via`
- `X-Client-IP`
- `X-Forwarded`
- `X-Forwarded-For`
- `X-Forwarded-Host`
- `X-Host`
- `X-Original-URL`
- `X-Originating-IP`
- `X-ProxyUser-IO`
- `X-ProxyUser-IP`
- `X-Remote-Addr`
- `X-Remote-IP`

### Response headers

Responses include some headers injected by Zyte API.

Note that the response body of unsuccessful responses is always the actual JSON response from the HTTP API, which provides error details.

#### Zyte-Error-Title

A short summary of the problem type. It is written in English for engineers, is usually not suited for non-technical stakeholders, and is not localized.

It matches the `title` JSON field of the error response.

#### Zyte-Error-Type

A URI reference that uniquely identifies the problem type, only in the context of the provided API.

Contrary to what RFC 7807 recommends, it is not expected to be dereferenceable (pointing to human-readable documentation), nor to be globally unique for the problem type.

It matches the `type` JSON field of the error response.

#### Zyte-Request-ID

A unique identifier of the request.

When reporting an issue about the outcome of a request to us, please include the value of this response header when possible.

### HTTPS proxy

> ###### TIP
>
> The main endpoint works for both HTTP and HTTPS URLs;
> you do not need an HTTPS proxy interface to access HTTPS URLs.

You can use the `api.zyte.com:8014` endpoint for an HTTPS proxy interface, provided your tech stack supports HTTPS proxies and you have installed our CA certificate:

#### curl

```bash
curl \
    --proxy https://api.zyte.com:8014 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com
```

#### JS

```js
const HttpsProxyAgent = require('https-proxy-agent')

const httpsAgent = new HttpsProxyAgent.HttpsProxyAgent('https://YOUR_ZYTE_API_KEY:@api.zyte.com:8014')
const axiosDefaultConfig = { httpsAgent }
const axios = require('axios').create(axiosDefaultConfig)

axios
  .get('https://toscrape.com')
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### Python

```python
import requests

response = requests.get(
    "https://toscrape.com",
    proxies={
        scheme: "https://YOUR_ZYTE_API_KEY:@api.zyte.com:8014"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

### Use with browser automation tools

The proxy mode is not optimized for use in combination with browser automation tools. Please consider using Zyte API’s browser automation features instead. See zapi-browser-automation.

## Zyte API rate limits

Zyte API limits the number of requests per minute (RPM). Each API key has a rate limit that can be increased. Additional rate limits may apply.

Rate-limited requests receive a rate-limiting response at no cost and can be retried.

### API key rate limit

- Enterprise plans: Custom
- Standard plans: 3000 RPM

To increase your rate limit, see Increasing rate limits below.

### Other rate limits

Additional rate limits may apply:

- **Website limits**: Each website has its own rate limit to prevent issues on target sites.
- **Account-website limits**: Each account has per-website limits to ensure fair access.
- **Temporary limits**: Applied during high demand to ensure platform stability.
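Since rate-limited requests can be retried at no cost, client code usually wraps each request in a retry loop. The sketch below shows one possible approach, exponential backoff with jitter; it is not a Zyte-recommended algorithm. It assumes that a rate-limiting response is signalled by HTTP status 429, and `send`, `max_attempts`, and the delay values are hypothetical names and defaults chosen for illustration.

```python
import random
import time

RATE_LIMIT_STATUS = 429  # assumption: status code of a rate-limiting response


def send_with_retries(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it returns a response that is not rate-limited.

    send is any zero-argument callable returning (status_code, body).
    The last response is returned if every attempt is rate-limited.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != RATE_LIMIT_STATUS:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter, capped at max_delay seconds.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    return status, body


# Simulated server: two rate-limiting responses, then a success.
responses = iter([(429, None), (429, None), (200, "ok")])
status, body = send_with_retries(lambda: next(responses), sleep=lambda s: None)
print(status, body)  # prints "200 ok"
```

In real code, `send` would perform the actual Zyte API request and return its status code and body; the `sleep` parameter exists only so that the waiting can be replaced in tests.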
> ###### SEE ALSO
>
> stats-rate-limiting at stats-api

### Increasing rate limits

**Temporary increases**

For short-term needs (single job, specific dates), [open a support ticket](https://support.zyte.com/support/tickets/new) at least 24 hours in advance with:

- Target [API key](https://app.zyte.com/o/zyte-api/api-access)
- Desired RPM
- Target websites
- Start and end dates

**Permanent increases**

- Standard plans: [Contact sales](https://www.zyte.com/zyte-web-scraping-api/#form) to upgrade to an Enterprise plan
- Enterprise plans: Contact your account manager

### Concurrency

Rate limits are based on requests per minute, not concurrent requests. Your maximum concurrency depends on:

- Your RPM limit
- Average response time of target sites
- Request parameters (e.g. browser rendering, extraction)

**Estimation formula:**

> max concurrency ≈ RPM limit ÷ 60 × avg response time (seconds)

**Examples:**

| RPM limit | Avg. response time | Max. concurrency |
|-----------|--------------------|------------------|
| 3000      | 0.2 s              | 10               |
| 3000      | 2 s                | 100              |
| 3000      | 20 s               | 1000             |

## Optimizing Zyte API usage

Here are some tips to optimize your use of Zyte API:

- Send multiple requests in parallel.
- For real-time scenarios, where a single request is sent at a time, there are a few tips you can follow to improve response times.
- When targeting multiple websites at once, sort requests to spread the load.

### Sending multiple requests in parallel

A Zyte API request can take tens of seconds to process. The response time depends on the target website and features used. For example, if you use a browser request, it is common to get a response in 10-30 seconds.

Because of that, if you send requests sequentially, throughput can be quite low: only a few responses per minute. To increase the throughput, send many requests in parallel, as shown in the example below.

The optimum number of parallel requests depends on your API key rate limit and on your target websites.
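The estimation formula from the Concurrency section translates directly into code. The helper below is a minimal sketch with a hypothetical name; it rounds the result up, on the assumption that you want a whole number of parallel requests that is sufficient to reach the rate limit.

```python
import math


def estimate_max_concurrency(rpm_limit, avg_response_seconds):
    """Estimate how many parallel requests are needed to reach rpm_limit.

    Implements: max concurrency ≈ RPM limit ÷ 60 × avg response time (seconds).
    """
    requests_per_second = rpm_limit / 60
    estimate = requests_per_second * avg_response_seconds
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(estimate, 9))


# Reproduces the rows of the examples table for a 3000 RPM limit.
for seconds in (0.2, 2, 20):
    print(seconds, estimate_max_concurrency(3000, seconds))
```

The estimate is only a starting point: actual response times vary per website and per request parameters, so measure them for your own workload.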
For example, if your rate limit is 3000 requests per minute, and the average response time you observe for your websites is 2 seconds, then to reach your rate limit you may set the number of parallel requests to 100 (`ceil(3000/60*2)`).

If too many requests are being processed in parallel, you will get many rate-limiting responses, which you can retry. To maximize efficiency, please use a number of parallel requests that minimizes the number of rate-limiting responses. However, a small percentage of rate-limiting responses is normal and expected if you want to get close to your API key rate limit.

For some websites, increasing parallel requests will slow down their responses and/or increase the ratio of unsuccessful responses. Zyte API does its best to prevent these issues, but if you notice this happening to you, please consider decreasing your parallel requests.

#### Example

> ###### NOTE
>
> Install and configure code example requirements and the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

var urls = new string[2];
urls[0] = "https://books.toscrape.com/catalogue/page-1.html";
urls[1] = "https://books.toscrape.com/catalogue/page-2.html";
var output = new List<HttpResponseMessage>();

var handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All,
    MaxConnectionsPerServer = 15
};
var client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

// Start all requests; the handler caps concurrent connections at 15.
var responseTasks = new List<Task<HttpResponseMessage>>();
foreach (var url in urls)
{
    var input = new Dictionary<string, object>(){
        {"url", url},
        {"browserHtml", true}
    };
    var inputJson = JsonSerializer.Serialize(input);
    var content = new StringContent(inputJson, Encoding.UTF8, "application/json");
    var responseTask = client.PostAsync("https://api.zyte.com/v1/extract", content);
    responseTasks.Add(responseTask);
}

// Collect responses as they complete.
while (responseTasks.Any())
{
    var responseTask = await Task.WhenAny(responseTasks);
    responseTasks.Remove(responseTask);
    var response = await responseTask;
    output.Add(response);
}
```

#### CLI client

input.jsonl

```json
{"url": "https://books.toscrape.com/catalogue/page-1.html", "browserHtml": true}
{"url": "https://books.toscrape.com/catalogue/page-2.html", "browserHtml": true}
```

```shell
zyte-api --n-conn 15 input.jsonl -o output.jsonl
```

#### curl

input.jsonl

```json
{"url": "https://books.toscrape.com/catalogue/page-1.html", "browserHtml": true}
{"url": "https://books.toscrape.com/catalogue/page-2.html", "browserHtml": true}
```

```shell
cat input.jsonl \
| xargs -P 15 -d\\n -n 1 \
bash -c "
    curl \
        --user YOUR_ZYTE_API_KEY: \
        --header 'Content-Type: application/json' \
        --data \"\$0\" \
        --compressed \
        https://api.zyte.com/v1/extract \
    | awk '{print \$1}' \
    >> output.jsonl
"
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
import org.apache.hc.client5.http.ssl.ClientTlsStrategyBuilder;
import org.apache.hc.core5.concurrent.FutureCallback;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.nio.ssl.TlsStrategy;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws ExecutionException, InterruptedException, IOException, ParseException {

    String[] urls = {
      "https://books.toscrape.com/catalogue/page-1.html",
      "https://books.toscrape.com/catalogue/page-2.html"
    };
    List<Future<SimpleHttpResponse>> futures = new ArrayList<Future<SimpleHttpResponse>>();
    List<String> output = new ArrayList<String>();

    int concurrency = 15;

    // https://issues.apache.org/jira/browse/HTTPCLIENT-2219
    final TlsStrategy tlsStrategy =
        ClientTlsStrategyBuilder.create().useSystemProperties().build();

    PoolingAsyncClientConnectionManager connectionManager =
        PoolingAsyncClientConnectionManagerBuilder.create().setTlsStrategy(tlsStrategy).build();
    connectionManager.setMaxTotal(concurrency);
    connectionManager.setDefaultMaxPerRoute(concurrency);

    CloseableHttpAsyncClient client =
        HttpAsyncClients.custom().setConnectionManager(connectionManager).build();
    try {
      client.start();
      for (int i = 0; i < urls.length; i++) {
        Map<String, Object> parameters = ImmutableMap.of("url", urls[i], "browserHtml", true);
        String requestBody = new Gson().toJson(parameters);

        SimpleHttpRequest request =
            new SimpleHttpRequest("POST", "https://api.zyte.com/v1/extract");
        request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
        request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
        request.setBody(requestBody, ContentType.APPLICATION_JSON);

        final Future<SimpleHttpResponse> future =
            client.execute(
                request,
                new FutureCallback<SimpleHttpResponse>() {
                  public void completed(final SimpleHttpResponse response) {
                    String apiResponse = response.getBodyText();
                    JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
                    String browserHtml = jsonObject.get("browserHtml").getAsString();
                    output.add(browserHtml);
                  }

                  public void failed(final Exception ex) {}

                  public void cancelled() {}
                });
        futures.add(future);
      }
      // Wait for every request to finish before closing the client.
      for (int i = 0; i < futures.size(); i++) {
        futures.get(i).get();
      }
    } finally {
      client.close();
    }
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const { ConcurrencyManager } = require('axios-concurrency')
const axios = require('axios')

const urls = [
  'https://books.toscrape.com/catalogue/page-1.html',
  'https://books.toscrape.com/catalogue/page-2.html'
]
const output = []

const client = axios.create()
ConcurrencyManager(client, 15)

Promise.all(
  urls.map((url) =>
    client.post(
      'https://api.zyte.com/v1/extract',
      { url, browserHtml: true },
      {
        auth: { username: 'YOUR_ZYTE_API_KEY' }
      }
    ).then((response) => output.push(response.data))
  )
)
```

#### PHP

```php
<?php

$urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
];
$output = [];
$promises = [];

$client = new \GuzzleHttp\Client();

foreach ($urls as $url) {
    $options = [
        'auth' => ['YOUR_ZYTE_API_KEY', ''],
        'headers' => ['Accept-Encoding' => 'gzip'],
        'json' => [
            'url' => $url,
            'browserHtml' => true,
        ],
    ];
    $request = new \GuzzleHttp\Psr7\Request('POST', 'https://api.zyte.com/v1/extract');
    global $promises;
    $promises[] = $client->sendAsync($request, $options)->then(function ($response) {
        global $output;
        $output[] = json_decode($response->getBody());
    });
}

foreach ($promises as $promise) {
    $promise->wait();
}
```

#### Python

```python
import asyncio

import aiohttp

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
]
output = []


async def extract(client, url):
    response = await client.post(
        "https://api.zyte.com/v1/extract",
        json={"url": url, "browserHtml": True},
        auth=aiohttp.BasicAuth("YOUR_ZYTE_API_KEY"),
    )
    output.append(await response.json())


async def main():
    # Cap concurrent connections to the Zyte API host at 15.
    connector = aiohttp.TCPConnector(limit_per_host=15)
    async with aiohttp.ClientSession(connector=connector) as client:
        await asyncio.gather(*[extract(client, url) for url in urls])


asyncio.run(main())
```

#### Python client

```python
import asyncio

from zyte_api import AsyncZyteAPI

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
]


async def main():
    client = AsyncZyteAPI(n_conn=15)
    queries = [{"url": url, "browserHtml": True} for url in urls]
    async with client.session() as session:
        for future in session.iter(queries):
            response = await future
            print(response)


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
]


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    custom_settings = {
        "CONCURRENT_REQUESTS": 15,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 15,
    }

    async def start(self):
        for url in urls:
            yield Request(
                url,
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                    },
                },
            )

    def parse(self, response):
        yield {
            "url": response.url,
            "browserHtml": response.text,
        }
```

Output:

```json
{"url": "https://books.toscrape.com/catalogue/page-1.html", "statusCode": 200, "browserHtml": "…", "echoData": "https://books.toscrape.com/catalogue/page-1.html"}
{"url": "https://books.toscrape.com/catalogue/page-2.html", "statusCode": 200, "browserHtml": "…", "echoData": "https://books.toscrape.com/catalogue/page-2.html"}
```

### Improving response times

There are a few things you can try to improve the response time of your Zyte API requests:

- Consider using HTTP requests instead of browser requests where possible.

  If you are only using browser requests to avoid bans, try using HTTP requests with a session context instead.

- When using [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/browserHtml), consider disabling [javascript](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/javascript) where possible.
- When using wait actions, avoid using the `waitForTimeout` action where possible. `waitForSelector` is best where feasible. Otherwise, `waitForResponse` or `waitForRequest` may work.
- The proxy mode offers lower overhead than the HTTP API, so it might be worth considering if its feature differences are not a problem for your use case.

### Sorting requests

If you target multiple websites, consider sorting your requests to spread the load. That is, if you target websites A, B, and C, do not send requests in AAABBBCCC order; send them in ABCABCABC order instead.

If you use Scrapy, on top of sorting your start requests as described, you can change your `SCHEDULER_PRIORITY_QUEUE` to `"scrapy.pqueues.DownloaderAwarePriorityQueue"`.

## Zyte API error handling

While using Zyte API, you may get the following types of responses:

- Successful responses
- Rate-limiting responses
- Unsuccessful responses

### Successful responses

Zyte API sends a successful response, i.e. a response with an HTTP status code of 200, when that response provides the requested data, ban-free.

A Zyte API response is considered successful even in the following scenarios:

- The response from the target website is a bad response for a reason *other* than a ban.
- Some browser actions have failed.
- The webpage content does not match the specified automatic extraction property.
- The webpage content does not match what you get with an HTTP client program or library like `curl`.

#### Bad website responses

When a website sends a response with an HTTP status code other than 200, and that response is not the result of a ban, Zyte API sends that response to you.

For example, if you send a request to [https://toscrape.com/not-found](https://toscrape.com/not-found), you get a successful response from Zyte API, where the value of the [statusCode](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/statusCode) response field is `404`.

#### Browser action failures

Browser action failures, e.g. timeouts, or bad responses received during action execution, e.g. after clicking a button, do not cause Zyte API to send an unsuccessful response.

Zyte API returns your requested output (e.g. browser HTML, screenshot) the way it was after all actions were executed or the time to run actions ran out, and the Zyte API response includes an [actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/actions) field with details about the outcome of each action.

#### Automatic extraction mismatches

Mismatches between the webpage content and the specified automatic extraction request field do not cause Zyte API to send an unsuccessful response.

Zyte API returns your requested output, including the `metadata.probability` field that indicates the probability that the specified automatic extraction property matches the webpage content.

#### HTTP client mismatches

Some websites might send a browser a different response than they send to another type of HTTP client, like `curl`.

Zyte API aims to provide the same responses that a browser would get, i.e. those responses you can see in the browser developer tools for network monitoring.

If you specifically want the same response that a specific non-browser HTTP client gets, you can try setting the User-Agent header accordingly. However, some websites can tell browsers from non-browser HTTP clients, and since Zyte API aims to behave like a browser, getting the same response as a non-browser HTTP client might not be possible.

### Rate-limiting responses

> ###### NOTE
>
> You are not charged for rate-limiting responses.

> ###### TIP
>
> scrapy-zyte-api and python-zyte-api handle rate limiting automatically.

Zyte API may send a response with an HTTP status code of 429 or 503 for rate-limiting purposes.

The right way to handle any rate-limiting response is to retry its request as many times as needed until you get a non-rate-limiting response.

Rate-limiting responses are sent in the following scenarios:

- You have exceeded your API key rate limit.

  ```json
  {"status": 429, "type": "/limits/over-user-limit"}
  ```

  When you use Zyte API efficiently, getting a small percentage of rate-limiting responses due to exceeding your API key rate limit is expected and normal.

- The global rate limit for the target website has been exceeded.

  ```json
  {"status": 429, "type": "/limits/over-domain-limit"}
  ```

- You have exceeded your account rate limit for the target website.

  ```json
  {"status": 429, "type": "/limits/over-org-domain-limit"}
  ```

- Zyte API automatic extraction is overloaded.

  ```json
  {"status": 503, "type": "/extractor/over-global-limit"}
  ```

- Zyte API is overloaded.

  ```json
  {"status": 503, "type": "/limits/over-global-limit"}
  ```

> ###### SEE ALSO
>
> stats-rate-limiting at stats-api

### Unsuccessful responses

> ###### NOTE
>
> You are not charged for unsuccessful responses.

Zyte API sends an unsuccessful response, i.e. a response with an HTTP status code of 400 or higher that is not a rate-limiting response, when Zyte API cannot provide the requested data.

Zyte API sends unsuccessful responses in the following scenarios:

- There has been a download error: a ban response, a permanent download error or a service error.
- Your request is invalid.
- Your account has been suspended.

#### Ban responses

> ###### TIP
>
> By default, scrapy-zyte-api and python-zyte-api automatically retry ban responses up to 3 times before giving up.

Zyte API sends an HTTP 520 response when a temporary error, usually a ban that could not be avoided in a timely fashion, prevents downloading the requested URL.

```json
{"status": 520, "type": "/download/temporary-error"}
```

On certain websites, it is normal to get these responses sometimes. When you do, retry your request until you get a successful response.

We closely monitor the success rate for the most popular websites, but less popular websites may slip under our radar. If you get this response too often, follow the recommendations under "Maximizing your success rate" to rule out issues in your request parameters. If that does not help, [reach out to our expert anti-ban team](https://support.zyte.com/support/tickets/new).

#### Permanent download errors

Zyte API sends an HTTP 521 response when a permanent error prevents downloading the requested URL.

```json
{"status": 521, "type": "/download/internal-error"}
```

You can wait for us to address the issue, or [ask to be notified when the issue is resolved](https://support.zyte.com/support/tickets/new).

> ###### TIP
>
> For some websites, Zyte API may sometimes accidentally flag some ban responses as permanent download errors. If sending the same Zyte API request multiple times returns an HTTP 521 error only *sometimes*, you might want to treat HTTP 521 errors as HTTP 520 errors for the target website, i.e. retry them automatically, until we resolve your issue report.

#### Service errors

If Zyte API sends an HTTP 500 response, it means that the request took too long or that there was an unexpected issue in Zyte API.

```json
{"status": 500, "type": "/server/timed-out"}
```

```json
{"status": 500, "type": "/server/internal"}
```

If the issue persists, feel free to [ask to be notified when the issue is resolved](https://support.zyte.com/support/tickets/new).

#### Invalid requests

Zyte API may send a response with an HTTP status code of 400, 401, 421, 422 or 451 if there is an error in your request, including:

- You are using invalid parameters or parameter values.

  ```json
  {"status": 400, "type": "/request/invalid"}
  ```

- Your request body is invalid JSON.

  ```json
  {"status": 400, "type": "/request/invalid-json"}
  ```

- Your API key is not properly specified, e.g. missing or malformed.

  ```json
  {"status": 401, "type": "/auth/not-valid"}
  ```

- Your API key is unknown, e.g. it might be the wrong API key.

  ```json
  {"status": 401, "type": "/auth/key-not-found"}
  ```

- The domain you are trying to download is unreachable. Check the domain name and verify that the domain is valid before retrying.

  ```json
  {"status": 421, "type": "/website/domain-unreachable"}
  ```

- You are using incompatible parameters, such as mixing [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/browserHtml) and [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody).

  ```json
  {"status": 422, "type": "/request/unprocessable"}
  ```

- You are targeting a domain that Zyte API does not allow.

  ```json
  {"status": 451, "type": "/download/domain-forbidden"}
  ```

#### Account suspension

Zyte API sends an HTTP 403 response if your Zyte API account is suspended.

```json
{"status": 403, "type": "/auth/account-suspended"}
```

Causes of account suspension include:

- Reaching the end of your trial.

  Setting a spending limit lifts your account suspension immediately.

- Reaching your spending limit.

  Increasing your spending limit lifts your account suspension immediately.

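
The status codes described in the sections above can be turned into a simple retry decision. The following is a minimal sketch, not an official client implementation: the names `classify`, `should_retry` and `max_ban_retries` are hypothetical, and the default cap of 3 simply mirrors the default that the tip above attributes to scrapy-zyte-api and python-zyte-api.

```python
RATE_LIMIT_STATUSES = {429, 503}  # rate-limiting responses
TEMPORARY_ERROR_STATUS = 520      # ban response (temporary download error)


def classify(status: int) -> str:
    """Map a Zyte API HTTP status code to the response types on this page."""
    if status == 200:
        return "successful"
    if status in RATE_LIMIT_STATUSES:
        return "rate-limiting"
    if status >= 400:
        return "unsuccessful"
    return "unknown"


def should_retry(status: int, retries_so_far: int, max_ban_retries: int = 3) -> bool:
    """Decide whether to send the same request again.

    Rate-limiting responses are always retried. Ban responses (520) are
    retried up to max_ban_retries times, an assumed cap. Everything else,
    such as invalid requests or account suspension, is not retried.
    """
    if classify(status) == "rate-limiting":
        return True
    if status == TEMPORARY_ERROR_STATUS:
        return retries_so_far < max_ban_retries
    return False


for code in (200, 429, 503, 520, 401):
    print(code, classify(code), should_retry(code, retries_so_far=0))
```

If HTTP 521 responses turn out to be intermittent for a given website, the same check could be extended to treat 521 like 520 for that website.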
### Retrying requests

> ###### TIP
>
> scrapy-zyte-api and python-zyte-api handle retries for
> rate limiting and ban responses automatically.

You should automatically retry requests that get a rate-limiting or ban response.

When retrying requests automatically, please use an exponential backoff algorithm: wait for some random time before every retry, and use an exponentially longer time the more retries you have used for any given request.

For rate-limiting responses, you should retry forever, but use generous retry times. For unsuccessful responses, you can use lower retry times, but you should cap the number of retries per request, to prevent an infinite loop from causing your code to hang.

These are some example ranges of random wait times for different scenarios:

| Retry | Rate-limiting responses | Unsuccessful responses |
|-------|-------------------------|------------------------|
| 1st   | 20-40 seconds           | 3-9 seconds            |
| 2nd   | 20-40 seconds           | 3-11 seconds           |
| 3rd   | 30-38 seconds           | 3-15 seconds           |
| 4th   | 30-46 seconds           | 3-23 seconds           |
| 5th   | 30-62 seconds           | 3-39 seconds           |
| 6th   | 30-94 seconds           | 3-62 seconds           |
| 7th   | 30-158 seconds          | 3-62 seconds           |
| 8th   | 30-286 seconds          | 3-62 seconds           |
| 9th   | 30-542 seconds          | 3-62 seconds           |
| 10th+ | 30-630 seconds          | 3-62 seconds           |

### Ban handling

A banned response is a response from a website that is different from the response anyone would get in a browser.

Zyte API handles banned responses automatically and transparently where possible, so that you never get a banned response.

For a given request, if Zyte API cannot avoid a ban in a reasonable time, Zyte API sends you a ban response, for which you are not charged. You can then retry your request as many times as needed until Zyte API succeeds.

We monitor and proactively work on improving the success rate and response times of Zyte API for the most popular websites, but less popular websites may slip under our radar.
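
As a rough illustration of the retry guidance in "Retrying requests", the sketch below retries rate-limiting responses indefinitely and ban responses up to a cap, picking a random wait from the example ranges in the table. It is only a sketch under stated assumptions: `send` is a hypothetical callable that performs one request and returns its HTTP status code, `request_with_retries` and `wait_range` are names invented here, and treating 429 and 503 as the rate-limiting statuses is an assumption based on the responses that carry a `Retry-After` header in the reference documentation.

```python
import random
import time
from typing import Callable, Sequence, Tuple

Range = Tuple[float, float]

# Example wait ranges in seconds, copied from the table in "Retrying requests".
RATE_LIMIT_RANGES: Sequence[Range] = (
    (20, 40), (20, 40), (30, 38), (30, 46), (30, 62),
    (30, 94), (30, 158), (30, 286), (30, 542), (30, 630),
)
UNSUCCESSFUL_RANGES: Sequence[Range] = (
    (3, 9), (3, 11), (3, 15), (3, 23), (3, 39), (3, 62),
)

RATE_LIMIT_STATUSES = frozenset({429, 503})  # assumption, see lead-in
BAN_STATUSES = frozenset({520})


def wait_range(retry: int, ranges: Sequence[Range]) -> Range:
    """Return the (low, high) range for a 1-based retry number.

    Retries past the end of the table reuse its last row.
    """
    return ranges[min(retry, len(ranges)) - 1]


def request_with_retries(
    send: Callable[[], int],
    max_ban_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status that should not be retried."""
    rate_limit_retries = 0
    ban_retries = 0
    while True:
        status = send()
        if status in RATE_LIMIT_STATUSES:
            # Rate limiting: retry forever, with generous waits.
            rate_limit_retries += 1
            low, high = wait_range(rate_limit_retries, RATE_LIMIT_RANGES)
        elif status in BAN_STATUSES and ban_retries < max_ban_retries:
            # Ban response: shorter waits, capped number of retries.
            ban_retries += 1
            low, high = wait_range(ban_retries, UNSUCCESSFUL_RANGES)
        else:
            return status
        sleep(random.uniform(low, high))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In production code you would likely also honor the `Retry-After` header when the response provides one.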
If you encounter too many bans, please [reach out to our expert anti-ban team](https://support.zyte.com/support/tickets/new).

If you ever get a successful Zyte API response that you believe is the result of a ban, please [report it to our expert anti-ban team](https://support.zyte.com/support/tickets/new).

Zyte API uses many different techniques to avoid bans. However, Zyte API does *not* log into websites automatically. Zyte API cannot automatically get you data that is *always* locked behind a user login.

> ###### SEE ALSO
>
> zapi-permissions-control

#### Maximizing your success rate

Some request parameters can lower the success rate of Zyte API on some websites. To maximize your success rate, i.e. minimize the rate of ban responses that you get:

- **Ensure your URL is valid**

  Make sure your URL works when you open it in a browser in incognito mode. Mind that some URLs may stop working after their website changes.

  Also ensure that [query string](https://en.wikipedia.org/wiki/Query_string) parameters, parameter order and values match what you get when you access that webpage manually from a browser.

  For complex URLs, you can alternatively use a browser request with actions to get to the target URL from a simpler URL.

- **Do not set a Referer header**

  It is often best not to set any value for the `Referer` request header, unless you are building an API request that expects it.

  If you set the header in the past because it improved your success rate, but your success rate has since dropped, see if removing the header makes a difference.

- **Set headers for API requests**

  When targeting API endpoints, set the right request headers and cookies, i.e. those your browser sets when sending the same request.

  Mind that some of those values, such as session cookies or [CSRF](https://en.wikipedia.org/wiki/Cross-site_request_forgery) tokens, might expire with time or need to be read from responses to earlier requests.
You may also need sessions to maximize your success rate in request chains. Alternatively, consider using a browser request instead. You can use actions if needed to trigger specific API requests, and either read the result from the browser HTML, if the API response data is loaded onto the webpage, or read the actual API response with network capture. ## Zyte API reference documentation This is the complete reference documentation of the HTTP API of Zyte API. For topic-based usage documentation, see zapi-usage. All requests require [basic authentication](https://datatracker.ietf.org/doc/html/rfc7617#section-2). Use your [Zyte API key](https://app.zyte.com/o/zyte-api/api-access) as username, and no password. For example, if your Zyte API key is `foo`, base64-encode `foo:` as `Zm9vOg==` and send the `Authorization` header with value `Basic Zm9vOg==`: ```none Authorization: Basic Zm9vOg== ``` ```yaml openapi: 3.0.3 info: title: Web Data Extraction API version: 1.0.0 description: A single API for web scraping contact: name: Zyte (Formerly Scrapinghub) url: https://www.zyte.com servers: - url: https://api.zyte.com/v1 description: Zyte Extraction API Production server security: - BasicAuth: [] paths: /extract: post: operationId: extract summary: Process a single URL, return the result description: | Process a single URL, return the result. This endpoint blocks until the result is ready. It is intended for short-running operations. At least one of the following request fields must be set to true: - browserHtml - httpResponseBody - httpResponseHeaders - screenshot - An automatic extraction request field: - article - articleList - articleNavigation - forumThread - jobPosting - jobPostingNavigation - pageContent - product - productList - productNavigation - serp All automatic extraction data types support performing extraction using either a browser request or an HTTP request. Choose which using the corresponding `extractFrom` option, e.g. 
productOptions.extractFrom when extracting a product. When no option is specified, currently automatic extraction defaults to using a browser request, except for serp, where an HTTP request is used by default instead. In the future, however, the default value may depend on the target website. When automatic extraction uses a browser request, it can be combined with any fields compatible with browserHtml, e.g. screenshot. When automatic extraction uses an HTTP request, it can be combined with any fields compatible with httpResponseBody. serp cannot be combined with any other fields besides serpOptions and url. You cannot combine multiple automatic extraction request fields (e.g. product and productList) on the same request. You cannot combine httpResponseBody with a request field that is exclusive of browser requests (e.g. httpResponseBody and browserHtml). httpResponseHeaders can be requested alone or with any other valid combination of request fields except for serp. The request body size limit is 5MiB. 
requestBody: required: true description: An extraction request body content: application/json: schema: $ref: '#/components/schemas/ExtractRequest' examples: DownloadHttp: summary: Retrieve raw HTTP content from a page value: url: https://example.com httpResponseBody: true DownloadCustomHttpRequestHeaders: summary: Retrieve raw HTTP content from a page, using custom HTTP headers value: url: https://example.com httpResponseBody: true customHttpRequestHeaders: - name: X-APOLLO-OPERATION-NAME value: nearByNodes DownloadHttpPostWithHttpRequestBody: summary: Retrieve raw HTTP content from a page using a POST request value: url: https://example.com httpResponseBody: true httpRequestMethod: POST httpRequestBody: WyJCV0kiLCAiRkxMIl0= customHttpRequestHeaders: - name: Content-Type value: application/json DownloadHttpPostWithHttpRequestText: summary: Retrieve raw HTTP content from a page using a POST request value: url: https://example.com httpResponseBody: true httpRequestMethod: POST httpRequestText: '{"name":"John Doe","email":"johndoe@example.com","age":32,"address":{"street":"123 Main St","city":"Anytown","state":"CA","zip":"12345"},"phone_numbers":["+1 555 555 1212","+1 555 555 1313"]}' customHttpRequestHeaders: - name: Content-Type value: application/json DownloadHttpHeaders: summary: Retrieve HTTP headers from a page value: url: https://example.com httpResponseHeaders: true DownloadHtml: summary: Open a page in a browser, return HTML value: url: https://example.com browserHtml: true DownloadHtmlWithRequestHeaders: summary: Open a page in a browser and return HTML, setting a proper Referer header value: url: https://example.com browserHtml: true requestHeaders: referer: https://www.google.com DownloadScreenshot: summary: | Open a page in a browser and return a JPEG screenshot of the content visible on the browser window value: url: https://example.com screenshot: true DownloadScreenshotFullPagePng: summary: | Open a page in a browser and return a
full-page PNG screenshot value: url: https://example.com screenshot: true screenshotOptions: fullPage: true format: png DownloadHtmlEchoData: summary: echoData and jobId example description: | Open a page in a browser, return HTML. Pass the echoData and jobId fields through - they'll be returned unchanged in the output. value: url: https://example.com/foo browserHtml: true echoData: seedUrl: https://example.com foo: bar jobId: 123/234/12 DownloadHtmlActions: summary: actions example description: | 1. Open the target page in a browser 2. Type "Zyte" in the search box 3. Click the Search button 4. Wait for the results page to load value: url: https://example.com/search browserHtml: true actions: - action: type selector: value: '#searchbox' type: css text: Zyte - action: click selector: value: '#searchbtn' type: css ExtractProduct: summary: Extract Product information description: | Extract Product information from a page: price, name, etc. value: url: https://example.com/foo product: true ExtractProductWithHtml: summary: Extract Product information, as well as browser HTML description: | Extract Product information, as well as browser HTML. Make a request from Spanish geolocation. value: url: https://example.com/foo product: true browserHtml: true geolocation: ES ExtractProductRaw: summary: Extract Product information using an HTTP request description: | Extract Product information using an HTTP request. value: url: https://example.com/foo product: true productOptions: extractFrom: httpResponseBody ExtractProductRawWithBody: summary: Extract Product information, as well as httpResponseBody description: | Extract Product information using an HTTP request, as well as httpResponseBody and httpResponseHeaders. Make a request from Spain. 
value: url: https://example.com/foo product: true productOptions: extractFrom: httpResponseBody httpResponseBody: true httpResponseHeaders: true geolocation: ES ExtractArticleCustomAttributes: summary: Extract Custom Attributes along with Article information description: | Extract Custom Attributes along with Article information. value: url: https://example.com/foo article: true customAttributes: summary: type: string description: A two sentence article summary article_sentiment: type: string enum: - positive - negative - neutral responses: '200': description: Successful response. Contains the output requested. content: application/json: schema: $ref: '#/components/schemas/Response200' '400': description: | Malformed request. See the error details to identify the exact problem in the request. content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Unrecognized field: value: type: /request/invalid title: Bad Request status: 400 detail: >- Unrecognized field "foo" Invalid JSON value: value: type: /request/invalid-json title: Invalid JSON status: 400 detail: >- The submitted request body is not a valid JSON. Location: line 2, column 26. Details: Unrecognized token 'False': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false') '401': headers: WWW-Authenticate: schema: type: string description: | Authentication problem. See the error details. content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Invalid authentication data: value: type: /auth/not-valid title: Authentication Info Invalid status: 401 detail: >- No valid authentication info found in the request. Check the documentation for the correct authentication schema. Invalid API key: value: type: /auth/key-not-found title: Authentication Key Not Found status: 401 detail: >- The authentication key is not valid or can't be matched. '403': description: | Your account is suspended or not allowed to make the request. 
content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Account suspended: value: type: /auth/account-suspended title: Account Suspended status: 403 detail: >- Account is suspended, check billing details. '421': description: | The request failed and should not be retried content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Domain unreachable: value: type: /website/domain-unreachable title: Domain Unreachable status: 421 detail: >- The domain is invalid or unreachable. Please check the domain name and try again. Verify the domain name and ensure it is registered and valid before restarting the crawl. '422': description: | The request couldn't be processed. Check the details. content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Incompatible parameters: value: type: /request/unprocessable title: Unprocessable Request status: 422 detail: >- Incompatible parameters were found in the request. Check details '429': headers: Retry-After: schema: type: integer format: int32 minimum: 0 description: | Too many requests, see the details. content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Domain limit: value: type: /limits/over-domain-limit title: Over Domain Requests Limit status: 429 detail: >- Too many requests to a specific domain. Retry in N seconds from 'Retry-After' header. User limit: value: type: /limits/over-user-limit title: Over User Requests Limit status: 429 detail: >- Too many requests. Retry in N seconds from 'Retry-After' header. Organisation limit: value: type: /limits/over-org-domain-limit title: Over Organisation Requests limit for the requested domain status: 429 detail: >- Too many requests to a specific domain. Retry in N seconds from 'Retry-After' header. '451': description: | Extraction for the domain is forbidden.
content: application/problem+json: schema: $ref: '#/components/schemas/ForbiddenDomainProblem' examples: Domain forbidden: value: type: /download/domain-forbidden title: Domain Forbidden status: 451 detail: >- Extraction for the domain is forbidden. blockedDomain: blocked-domain.example '500': description: | Request timeout or internal server error. content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Internal error: value: type: /server/internal title: Internal Server Error status: 500 detail: >- The server encountered an internal error. Please contact support or wait for us to resolve the issue. Timeout: value: type: /server/timed-out title: Request Timed Out status: 500 detail: >- The request took too long and timed out. Try it again. Contact support if it fails consistently. '503': description: | System overload. See the details. headers: Retry-After: schema: type: integer format: int32 minimum: 0 content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Global request limit: value: type: /limits/over-global-limit title: Global Requests Limit Reached status: 503 detail: >- Too many requests to the service. Retry in N seconds from 'Retry-After' header. '520': description: | A downloading error, possibly requiring user action. headers: Retry-After: schema: type: integer format: int32 minimum: 0 content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Website ban: value: type: /download/temporary-error title: Website Ban status: 520 detail: >- Zyte API could not get a ban-free response in a reasonable time. See https://docs.zyte.com/zyte-api/usage/errors.html#ban-responses '521': description: | Permanent downloading error. 
content: application/problem+json: schema: $ref: '#/components/schemas/Problem' examples: Internal download error: value: type: /download/internal-error title: Internal Downloading Error status: 521 detail: >- Server encountered a problem while downloading. Check request and contact support. default: description: | Error. Check the code and problem object for additional information. Note: The client should be ready for the absence of a problem object. In this case, the HTTP status code should be used. content: application/problem+json: schema: $ref: '#/components/schemas/Problem' components: securitySchemes: BasicAuth: type: http scheme: basic schemas: HTTPHeader: type: object description: A header name and value. required: - name - value properties: name: type: string description: The name of the header minLength: 1 value: type: string description: The value of the header example: name: Content-Type value: text/html; charset=utf-8 ExtractRequest: type: object required: - url properties: url: description: | An absolute URL to extract data from. The host name must be a domain name, it cannot be an IP address. example: https://example.com/item-page type: string maxLength: 8192 requestHeaders: $ref: '#/components/schemas/RequestHeaders' tags: type: object nullable: true description: | Assign arbitrary key-value pairs to the request that you can use for filtering in the [Stats API](/zyte-api/usage/stats.md). Keys must be strings. Values must be strings or `null`. For example: `{"tags": {"foo": "bar", "baz": null}}`. additionalProperties: type: string ipType: description: | [Type of IP address](/zyte-api/usage/features.md) from which the request should be sent. If not specified, Zyte API will use an IP type that, for the target website, does not cause bans or unexpected response data. If you believe Zyte API is using the wrong default IP type for a website, please [reach out to our expert anti-ban team](https://support.zyte.com/support/tickets/new). 
[See an example](/zyte-api/usage/features.md). type: string enum: - datacenter - residential httpRequestMethod: description: | Request [HTTP method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). Can only be used in combination with httpResponseBody. [See an example](/zyte-api/usage/http.md). See also: httpRequestText, httpRequestBody, customHttpRequestHeaders, httpResponseHeaders. type: string enum: - GET - POST - PUT - DELETE - OPTIONS - TRACE - PATCH - HEAD httpRequestBody: description: | [Base64](https://en.wikipedia.org/wiki/Base64)-encoded data to send as request body. Can only be used in combination with httpResponseBody. It usually needs to be used in combination with httpRequestMethod. If you only need to send UTF-8-encoded text, use httpRequestText instead to skip Base64-encoding. Note that you cannot combine both fields on the same request. [See an example](/zyte-api/usage/http.md). See also: customHttpRequestHeaders. type: string format: byte maxLength: 400000 httpRequestText: description: | UTF-8 text to send as request body. Can only be used in combination with httpResponseBody. It usually needs to be used in combination with httpRequestMethod. If you need to send a binary or non-UTF-8 request body, use httpRequestBody instead. Note that you cannot combine both fields on the same request. [See an example](/zyte-api/usage/http.md). See also: customHttpRequestHeaders. type: string minLength: 1 maxLength: 400000 example: '{"name":"John Doe","email":"johndoe@example.com","age":32,"address":{"street":"123 Main St","city":"Anytown","state":"CA","zip":"12345"},"phone_numbers":["+1 555 555 1212","+1 555 555 1313"]}' customHttpRequestHeaders: description: | HTTP request headers. Can only be used in combination with httpResponseBody. To set headers with other outputs, see requestHeaders. 
Setting HTTP request headers has some caveats: - Zyte API sends some headers automatically for [ban avoidance](/zyte-api/usage/errors.md), and may silently override or drop some of your custom headers for that purpose. However, your custom headers may override those automatic headers, and in doing so they can break the ban avoidance capabilities of Zyte API, as some websites may ban based on the presence, values, or order of certain headers. - You cannot set the `Cookie` header. Use requestCookies instead. - If you set multiple headers with the same name, only the last header value will be sent. To overcome this limitation, [join the header values with a comma into a single header value](https://stackoverflow.com/a/4371395). For example, replace `"customHttpRequestHeaders": [{"name": "foo", "value": "bar"}, {"name": "foo", "value": "baz"}]` with `"customHttpRequestHeaders": [{"name": "foo", "value": "bar,baz"}]`. [See an example](/zyte-api/usage/http.md). See also: httpRequestMethod, httpRequestText, httpRequestBody, httpResponseHeaders. type: array maxItems: 200 items: $ref: '#/components/schemas/CustomHttpRequestHeader' httpResponseBody: description: | Set to `true` to get the HTTP response body in the httpResponseBody response field. This field is not compatible with [browser automation](/zyte-api/usage/browser.md). [See an example](/zyte-api/usage/http.md). See also: httpRequestMethod, httpRequestText, httpRequestBody, customHttpRequestHeaders. type: boolean default: false httpResponseHeaders: description: | Set to `true` to get the HTTP response headers in the httpResponseHeaders response field. [See an example](/zyte-api/usage/features.md). See also: customHttpRequestHeaders, requestHeaders. type: boolean default: false browserHtml: description: | Set to `true` to get the [browser HTML](/zyte-api/usage/browser.md) in the browserHtml response field. This field is not compatible with [HTTP requests](/zyte-api/usage/http.md). 
If you use actions, the browser HTML is generated *after* action execution has finished or timed out. By default, [iframes](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) are empty. See includeIframes. To access content from the [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM), check out the corresponding example in the [actions documentation](/zyte-api/usage/browser.md). [See an example](/zyte-api/usage/browser.md). See also: screenshot, requestHeaders. type: boolean default: false screenshot: description: | Set to `true` to get a page screenshot in the screenshot response field. This field is not compatible with [HTTP requests](/zyte-api/usage/http.md). To adjust the screenshot contents you can use screenshotOptions and viewport. If you use actions, the screenshot is generated *after* action execution has finished or timed out. [See an example](/zyte-api/usage/browser.md). See also: browserHtml, requestHeaders. type: boolean default: false screenshotOptions: $ref: '#/components/schemas/ScreenshotOptions' article: description: | Set to `true` to get article data in the article response field. The target page should only contain a single article, such as a blog post or a news article. For pages with multiple articles consider using articleList instead. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set articleOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), articleNavigation, browserHtml, screenshot, requestHeaders. default: false type: boolean articleOptions: description: | Additional options for article extraction. $ref: '#/components/schemas/ExtractionOptions' articleList: description: | Set to `true` to get article list data in the articleList response field. 
The target page should contain multiple articles, usually as links or short snippets. Examples of such pages are main or category pages of news sites, main pages of blogs showing multiple posts, and other pages with multiple articles. Article list data is especially useful to get basic information about articles on a website, like a headline and a link to the article details, using a smaller number of requests, when article attributes are extracted directly from an article list page, without making individual article requests. To implement article crawling from article list pages, use articleNavigation, which also enables navigation through pagination links. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set articleListOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), browserHtml, screenshot, requestHeaders. type: boolean default: false articleListOptions: description: | Additional options for articleList extraction. $ref: '#/components/schemas/ExtractionOptions' articleNavigation: description: | Set to `true` to get article navigation data in the articleNavigation response field. The target page should contain multiple articles and/or subcategories that can be followed. Article navigation data is especially useful for implementing article crawling, i.e. following links to article pages, as well as to subcategories and pagination that can in turn link to more article pages. Article navigation data can also be used to get basic information of articles and subcategories on a website, obtaining the URLs and link names of the articles and subcategories, without making individual requests for those articles. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set articleNavigationOptions.extractFrom to `"httpResponseBody"`.
If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), article, articleList, browserHtml, screenshot, requestHeaders. type: boolean default: false articleNavigationOptions: description: | Additional options for articleNavigation extraction. $ref: '#/components/schemas/ExtractionOptions' forumThread: description: | Set to `true` to get forum thread data in the forumThread response field. The target page should be an individual forum thread page on a forum website. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set forumThreadOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), article, browserHtml, screenshot, requestHeaders. type: boolean default: false forumThreadOptions: description: | Additional options for forumThread extraction. $ref: '#/components/schemas/ExtractionOptions' jobPosting: description: | Set to `true` to get job posting data in the jobPosting response field. The target page should be an individual job posting page on a company website or on a job website. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set jobPostingOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), browserHtml, screenshot, requestHeaders. type: boolean default: false jobPostingOptions: description: | Additional options for jobPosting extraction. $ref: '#/components/schemas/ExtractionOptions' jobPostingNavigation: description: | Set to `true` to get job posting navigation data in the jobPostingNavigation response field.
The target page should contain multiple job postings and/or subcategories that can be followed. Job posting navigation data is especially useful for implementing job posting crawling, i.e. following links to job posting pages, as well as pagination that can in turn link to more job posting pages. Job posting navigation data can also be used to get basic information of job postings on a website, obtaining the URLs and link names of the job postings, without making individual requests for them. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set jobPostingNavigationOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), jobPosting, browserHtml, screenshot, requestHeaders. type: boolean default: false jobPostingNavigationOptions: description: | Additional options for jobPostingNavigation extraction. $ref: '#/components/schemas/ExtractionOptions' pageContent: description: | Set to `true` to get page content data in the pageContent response field. The target page can contain any type of data. Page content data is especially useful for understanding the layout and hierarchy of information on a page, enabling advanced processing such as content extraction, user experience analysis, and automated page summarization. Page content data can also be used to capture the main content intended for users, along with auxiliary navigation components such as headers, footers, sidebars, and pagination controls. This makes it possible to distinguish core content from supporting links used for site-wide navigation. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set pageContentOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. 
See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), browserHtml, screenshot, requestHeaders. type: boolean default: false pageContentOptions: description: | Additional options for pageContent extraction. $ref: '#/components/schemas/ExtractionOptions' product: description: | Set to `true` to get product data in the product response field. The target page should only contain a single product. For pages with multiple products consider using productList instead. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set productOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. [See an example](/zyte-api/usage/extract.md). See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), productNavigation, browserHtml, screenshot, requestHeaders. type: boolean default: false productOptions: description: | Additional options for product extraction. type: object properties: extractFrom: $ref: '#/components/schemas/ExtractionOptions/properties/extractFrom' model: type: string enum: - '2024-02-01' - '2024-09-16' description: | Model version to use for product extraction. If not specified, the "2024-09-16" version is used. Available product models: - "2024-02-01" - "2024-09-16" See [Model pinning](/zyte-api/usage/extract/index.md). productList: description: | Set to `true` to get product list data in the productList response field. The target page should contain a list or a grid of products. Product list data is especially useful to get basic information about products on a website using a smaller number of requests, when product attributes are extracted directly from a product list page, without making individual product requests. To implement product crawling from product list pages, use productNavigation, which also enables navigation through pagination links. 
To combine this field with [HTTP requests](/zyte-api/usage/http.md), set productListOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), browserHtml, screenshot, requestHeaders. type: boolean default: false productListOptions: description: | Additional options for productList extraction. $ref: '#/components/schemas/ExtractionOptions' productNavigation: description: | Set to `true` to get product navigation data in the productNavigation response field. The target page should contain multiple products and/or subcategories that can be followed. Product navigation data is especially useful for implementing product crawling, i.e. following links to product pages, as well as to subcategories and pagination that can in turn link to more product pages. Product navigation data can also be used to get basic information of products and subcategories on a website, obtaining the URLs and link names of the products and subcategories, without making individual requests for those products. To combine this field with [HTTP requests](/zyte-api/usage/http.md), set productNavigationOptions.extractFrom to `"httpResponseBody"`. If you use actions, data extraction happens *after* action execution has finished or timed out. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md), product, productList, browserHtml, screenshot, requestHeaders. type: boolean default: false productNavigationOptions: description: | Additional options for productNavigation extraction. $ref: '#/components/schemas/ExtractionOptions' customAttributes: type: object description: | Schema of the custom attributes to extract. This is a subset of the OpenAPI specification, using JSON syntax. 
Zyte custom attributes extraction uses a Large Language Model (LLM) operated by Zyte to obtain any structured data specified by this schema from any unstructured web page. This allows extraction similar to that of standard schemas, such as article or product, but much more flexibly. When this field is specified, the customAttributes.values field in the response contains the extracted data. When custom attributes extraction is requested, a standard extraction field must also be specified (e.g. product). This determines the part of the web page that is passed to the LLM for custom attributes extraction. For example, when a web page is a product page, only the product information is passed, ignoring other parts of the page, such as the menu or footer, which makes extraction cheaper and more accurate. [See detailed documentation](/zyte-api/usage/custom-attributes.md). Additionally, to see a request example, scroll up to the right-hand sidebar **Request samples**, and select “Extract Custom Attributes along with Article information” under **Example**. nullable: true additionalProperties: $ref: '#/components/schemas/CustomAttribute' maxProperties: 20 customAttributesOptions: type: object description: Additional options for custom attributes extraction. properties: method: type: string description: | Method to use for custom attributes extraction: * "generate" (default) generates extracted data with the help of a generative Large Language Model (LLM). It is the most powerful and versatile extraction method, but also the most expensive one, with [variable per-request cost](/zyte-api/pricing.md). * "extract" locates extracted data in the requested web page with the help of a non-generative LLM. It only supports a subset of the schema (only string, integer and number types), and can't perform generative tasks such as summarization or data transformation.
It is however much cheaper compared to the generative method and has a [fixed per-request cost](/zyte-api/pricing.md). enum: - generate - extract default: generate maxInputTokens: type: integer minimum: 1 description: | Limit on the number of input tokens for custom attribute extraction with the "generate" method. This includes the schema as well, but not our internal fixed prompt with the LLM instruction. When the number of tokens for schema and page text is above the specified maxInputTokens, we truncate the page text to fit in maxInputTokens. This may result in quality degradation or data not extracted from the page because it was truncated. Tokens are words or word pieces, for example ``{"price": "2.00 $"}`` is 9 tokens: ``{"``, ``price``, ``":``, `` "``, ``2``, ``.``, ``00``, `` $``, ``"}``. maxOutputTokens: type: integer minimum: 1 description: | Limit on the number of output tokens for extracted custom attributes with the "generate" method. This field can be set to limit the extraction cost, but may result in quality degradation. See an example of token counting in the maxInputTokens field above. geolocation: $ref: '#/components/schemas/CountryCode' javascript: description: | Forces JavaScript execution on a [browser request](/zyte-api/usage/browser.md) to be enabled (`true`) or disabled (`false`). By default Zyte API enables or disables JavaScript execution for a request depending on which option makes it easier to avoid bans. Use this request field to override that choice. Passing this request field when requesting automatic extraction ( product, article, etc.) may impact the quality of the returned data, as it might override the optimal value for automatic extraction. This field is not compatible with [HTTP requests](/zyte-api/usage/http.md). [See an example](/zyte-api/usage/browser.md). 
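As a rough illustration of the compatibility rule above, the following sketch builds a request body that overrides the default JavaScript choice on a browser request. `build_browser_request` is a hypothetical helper written for this example, not part of any Zyte client library, and the URL is a placeholder.

```python
import json
from typing import Optional


def build_browser_request(url: str, javascript: Optional[bool] = None,
                          http_response_body: bool = False) -> dict:
    """Build a request body (hypothetical helper, for illustration only)."""
    if javascript is not None and http_response_body:
        # The javascript field is documented as incompatible with HTTP requests.
        raise ValueError("javascript cannot be combined with httpResponseBody")
    body = {"url": url}
    if http_response_body:
        body["httpResponseBody"] = True
    else:
        body["browserHtml"] = True
    if javascript is not None:
        # Only send the field when overriding the default choice.
        body["javascript"] = javascript
    return body


payload = build_browser_request("https://example.com/", javascript=False)
print(json.dumps(payload))
```

Leaving `javascript` out of the body keeps the default behavior, which is usually the safer choice when automatic extraction is also requested.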
type: boolean actions: $ref: '#/components/schemas/ActionSequence' jobId: description: | ID of the [Scrapy Cloud](/scrapy-cloud/get-started.md) job from which this request has been sent, to be returned in the jobId response field. This field is meant to help with request tracking. [scrapy-zyte-api](https://scrapy-zyte-api.readthedocs.io/en/latest/index.html) fills this request field automatically. [See an example](/zyte-api/usage/features.md). See also: echoData. type: string maxLength: 100 example: example-job-1 echoData: description: | This field is returned in the echoData response field, verbatim. This field can be useful, for example, to keep track of the original request order when [sending multiple requests in parallel](/zyte-api/usage/optimize.md). The request can be rejected if the data is too big. [See an example](/zyte-api/usage/features.md). See also: jobId. viewport: $ref: '#/components/schemas/Viewport' followRedirect: description: | Whether to follow [HTTP redirection](https://developer.mozilla.org/en-US/docs/Web/HTTP/Redirections) or not. Only supported in [HTTP requests](/zyte-api/usage/http.md); [browser requests always follow redirection](/zyte-api/usage/browser.md). type: boolean sessionContext: $ref: '#/components/schemas/SessionContext' sessionContextParameters: $ref: '#/components/schemas/SessionContextParameters' session: $ref: '#/components/schemas/Session' networkCapture: $ref: '#/components/schemas/NetworkCaptureFilterSequence' device: description: | Type of device to emulate during your request. A desktop device is emulated by default. Can only be used in combination with httpResponseBody. type: string enum: - desktop - mobile cookieManagement: description: | Cookie management method. It determines how to handle user cookies, defined through requestCookies, and automatic cookies, i.e. cookies automatically generated by Zyte API. `auto` (default) uses user cookies if defined, or automatic cookies otherwise.
`discard` uses user cookies if defined, or no cookies otherwise. enum: - auto - discard default: auto requestCookies: type: array description: | A list of cookies to be sent with a request. You can use the contents of the responseCookies response field as a value for this request field. [See an example](/zyte-api/usage/features.md). items: $ref: '#/components/schemas/Cookie' maxItems: 100 responseCookies: description: | Set to `true` to get the list of cookies set during a request in the responseCookies response field. [See an example](/zyte-api/usage/features.md). See also: requestCookies. type: boolean default: false serp: type: boolean description: | Set to `true` to get the data of a search engine results page (SERP) in the serp response field. The target URL should be a search URL that belongs to a [Google domain](https://www.google.com/supported_domains). Currently, you cannot combine this field with any other request fields besides serpOptions and url. See also: [List of all automatic extraction request fields](/zyte-api/usage/extract.md). serpOptions: $ref: '#/components/schemas/SerpOptions' includeIframes: type: boolean description: | Whether to add the content of [iframes](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) into browserHtml. Note that iframes are visible in screenshots even if this is set to `false`. See also: browserHtml. default: false additionalProperties: false Response200: required: - url properties: url: type: string description: | URL the data was extracted from. Could be different from the input URL in case of [redirection](/zyte-api/usage/http.md). See also: statusCode. example: https://example.com/item-page/ statusCode: type: integer description: | The HTTP status code retrieved from the target page. If [redirection is followed](/zyte-api/usage/http.md), this is the status code of the response *after* redirection. See also: url. 
example: 200 httpResponseBody: description: | [Base64-encoded](https://en.wikipedia.org/wiki/Base64) HTTP response body. To get this response field, set the httpResponseBody request field to `true`. Unlike browserHtml, this field supports binary response bodies, such as image files or PDF files. This is the reason why this field is Base64-encoded, JSON does not support binary data. [See an example](/zyte-api/usage/http.md). type: string format: byte httpResponseHeaders: description: | HTTP response headers. To get this response field, set the httpResponseHeaders request field to `true`. The `Content-Encoding` header value (e.g. `gzip`, `br`, etc.) should not be used to decompress httpResponseBody, Zyte API already decompresses the body of compressed responses. The `Set-Cookie` header value, when present, contains the header value received from the main HTTP response. These cookies could have changed later on, e.g. during browser rendering. Usually you will want to ignore this header in favor of responseCookies, which provides the *final* cookies. [See an example](/zyte-api/usage/features.md). type: array items: $ref: '#/components/schemas/HTTPHeader' browserHtml: description: | [Browser HTML](/zyte-api/usage/browser.md). To get this response field, set the browserHtml request field to `true`. Browser HTML does not include the contents of [iframes](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) or the [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM). [See an example](/zyte-api/usage/browser.md). type: string example: Downloaded data. session: $ref: '#/components/schemas/Session' screenshot: description: | [Base64-encoded](https://en.wikipedia.org/wiki/Base64) page screenshot file data. To get this response field, set the screenshot request field to `true`. screenshotOptions.format determines the file format of the screenshot data. [See an example](/zyte-api/usage/browser.md). 
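A minimal sketch of decoding a Base64-encoded response field such as screenshot or httpResponseBody. The response dictionary below is a stand-in built locally for illustration; it is not real API output, and `decode_binary_field` is a hypothetical helper.

```python
import base64


def decode_binary_field(response: dict, field: str) -> bytes:
    """Decode a Base64-encoded response field such as screenshot."""
    return base64.b64decode(response[field])


# Stand-in for a parsed JSON response. The payload is only the 8-byte
# PNG signature, not a real screenshot.
fake_response = {
    "screenshot": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
}
image_bytes = decode_binary_field(fake_response, "screenshot")
```

The decoded bytes can then be written to a file opened in binary mode, with an extension that matches screenshotOptions.format.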
type: string format: byte article: allOf: - $ref: '#/components/schemas/Article' - description: | Article data. To get this response field, set the article request field to `true`. articleList: allOf: - $ref: '#/components/schemas/ArticleList' - description: | Article list data. To get this response field, set the articleList request field to `true`. articleNavigation: allOf: - $ref: '#/components/schemas/ArticleNavigation' - description: | Article navigation data. To get this response field, set the articleNavigation request field to `true`. forumThread: allOf: - $ref: '#/components/schemas/ForumThread' - description: | Forum thread data. To get this response field, set the forumThread request field to `true`. jobPosting: allOf: - $ref: '#/components/schemas/JobPosting' - description: | Job posting data. To get this response field, set the jobPosting request field to `true`. jobPostingNavigation: allOf: - $ref: '#/components/schemas/JobPostingNavigation' - description: | Job posting navigation data. To get this response field, set the jobPostingNavigation request field to `true`. pageContent: allOf: - $ref: '#/components/schemas/PageContent' - description: | Page content data. To get this response field, set the pageContent request field to `true`. product: allOf: - $ref: '#/components/schemas/Product' - description: | Product data. To get this response field, set the product request field to `true`. productList: allOf: - $ref: '#/components/schemas/ProductList' - description: | Product list data. To get this response field, set the productList request field to `true`. productNavigation: allOf: - $ref: '#/components/schemas/ProductNavigation' - description: | Product navigation data. To get this response field, set the productNavigation request field to `true`. customAttributes: type: object properties: values: type: object additionalProperties: true description: | Values of extracted custom attributes, extracted according to the requested customAttributes schema. 
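The sketch below shows one way to pair a customAttributes schema with a standard extraction field and to read customAttributes.values from a response. The attribute name `material`, its description, the helper names, and the sample response are assumptions made for illustration; only the 20-attribute limit and the need for a standard extraction field come from this reference.

```python
def build_custom_attributes_request(url: str, schema: dict,
                                    standard_field: str = "product") -> dict:
    """Build a request body (hypothetical helper, for illustration only)."""
    if len(schema) > 20:
        # The customAttributes request field allows at most 20 properties.
        raise ValueError("customAttributes supports at most 20 attributes")
    # A standard extraction field must accompany customAttributes.
    return {"url": url, standard_field: True, "customAttributes": schema}


def read_custom_values(response: dict) -> dict:
    """Return customAttributes.values, or an empty dict if absent."""
    return response.get("customAttributes", {}).get("values", {})


# Hypothetical schema with a single string attribute.
schema = {"material": {"type": "string", "description": "Main material"}}
request_body = build_custom_attributes_request("https://example.com/item", schema)

# Stand-in for a parsed JSON response, not real API output.
fake_response = {"customAttributes": {"values": {"material": "cotton"}}}
values = read_custom_values(fake_response)
```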
metadata: type: object properties: inputTokens: type: integer description: | Total number of used input tokens, excluding our internal fixed prompt with the LLM instruction, when using the "generate" method. outputTokens: type: integer description: | Total number of used output tokens, when using the "generate" method. textInputTokens: type: integer description: | Total number of input tokens used for the text of the web page, excluding the schema and our internal fixed prompt with the LLM instruction, when using the "generate" method. Already included in the customAttributes.metadata.inputTokens field. textInputTokensBeforeTruncation: type: integer description: | textInputTokens before the text was truncated to fit into the input limits, either set via customAttributesOptions.maxInputTokens or due to the model limitation returned in customAttributes.metadata.maxInputTokens, when using the "generate" method. maxInputTokens: type: integer description: | Maximum number of allowed input tokens for the model, when using the "generate" method. excludedPIIAttributes: type: array items: type: string description: | A list of all attributes dropped from the output due to a risk of PII (Personally Identifiable Information) extraction. error: type: string description: | * The ``extraction/unparsable-response`` error is given when the LLM response could not be parsed or recovered. If this error happens, we suggest simplifying the task or reducing the number of attributes. * The ``extraction/schema-size-exceeded`` error is given when the schema did not fit into the input limits, leaving no space for the input text, and therefore the LLM could not be used. If this error happens, we suggest either making the schema smaller (fewer attributes and/or shorter descriptions), or increasing customAttributesOptions.maxInputTokens. echoData: description: | Arbitrary data set on the echoData request field. [See an example](/zyte-api/usage/features.md). 
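To illustrate the request-tracking use case, here is a minimal sketch that tags each request with its position through echoData and then restores the original order of responses that arrived out of order. The `index` key, the helper name, and the sample responses are assumptions for this example, not part of the API.

```python
def restore_request_order(responses: list) -> list:
    """Sort responses by the index that was sent in echoData."""
    return sorted(responses, key=lambda response: response["echoData"]["index"])


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
# Each request carries its position in echoData ("index" is a made-up key).
requests_to_send = [
    {"url": url, "httpResponseBody": True, "echoData": {"index": position}}
    for position, url in enumerate(urls)
]

# Stand-in responses arriving out of order, as can happen with parallel requests.
fake_responses = [
    {"url": urls[2], "echoData": {"index": 2}},
    {"url": urls[0], "echoData": {"index": 0}},
    {"url": urls[1], "echoData": {"index": 1}},
]
ordered = restore_request_order(fake_responses)
```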
type: object jobId: description: | [Scrapy Cloud](/scrapy-cloud/get-started.md) job ID set on the jobId request field. [See an example](/zyte-api/usage/features.md). type: string maxLength: 100 example: example-job-1 actions: description: | Debug information about the execution of the action sequence set in the actions request field. Action order in the response always matches that of the request. type: array items: $ref: '#/components/schemas/ActionResult' responseCookies: description: | List of cookies set during the request. To get this response field, set the responseCookies request field to `true`. [See an example](/zyte-api/usage/features.md). See also: requestCookies. type: array items: $ref: '#/components/schemas/Cookie' networkCapture: type: array description: | Responses captured by filters specified in the networkCapture request parameter. items: $ref: '#/components/schemas/CapturedResponse' serp: $ref: '#/components/schemas/SearchResultsPage' Problem: type: object properties: type: type: string format: uri-reference description: | A URI reference that uniquely identifies the problem type, only in the context of the provided API. Opposed to the specification in RFC-7807, it is neither recommended to be dereferenceable and point to human-readable documentation nor globally unique for the problem type. default: about:blank example: /problem/connection-error title: type: string description: > A short summary of the problem type. Written in English and readable for engineers, usually not suited for non-technical stakeholders, and not localized. example: Service Unavailable status: type: integer format: int32 description: > The HTTP status code generated by Zyte API for this occurrence of the problem. minimum: 100 maximum: 600 exclusiveMaximum: true example: 503 detail: type: string description: > A human-readable explanation specific to this occurrence of the problem that is helpful to locate the source of the problem and gives advice on how to proceed. 
Written in English and readable for engineers, usually not suited for non-technical stakeholders, and not localized. example: Connection to database timed out ForbiddenDomainProblem: allOf: - $ref: '#/components/schemas/Problem' properties: blockedDomain: type: string description: > The domain for which extraction cannot be performed. example: forbiddendomain.com SessionContext: description: | User-defined name-value pairs to [request a server-managed session](/zyte-api/usage/features.md) initialized with sessionContextParameters. For every subsequent request with the same session context, Zyte API will either reuse an available session created for the same session context or create a new session using sessionContextParameters. Server-managed sessions expire after 4 hours or 3 ban responses. If you are targeting websites that silently expire their sessions before the 4-hour mark, i.e. they revert the effects of your sessionContextParameters but requests continue working as expected otherwise, consider using [client-managed sessions](/zyte-api/usage/features.md) for greater session control. [See an example](/zyte-api/usage/features.md). See also: requestCookies, responseCookies. type: array items: type: object maxItems: 10 properties: name: type: string description: Name of the context identifier. minLength: 1 maxLength: 30 nullable: false value: type: string description: Value of the context identifier. minLength: 1 maxLength: 100 nullable: false required: - name - value SessionContextParameters: description: | Parameters to create a server-managed session for a given sessionContext. [See an example](/zyte-api/usage/features.md). See also: actions. type: object properties: actions: $ref: '#/components/schemas/SessionContextActionSequence' ActionResult: description: | Returns detailed information about the elapsed time and errors for a particular action.
type: object properties: action: description: The type of action submitted type: string example: waitForSelector elapsedTime: description: Elapsed time in seconds type: number status: description: | Status of execution of a particular action: * success - When the action finishes execution successfully without any errors * continued - When the action fails, but the execution of the action sequence is continued * returned - When the action fails and stops execution * notExecuted - When a prior action has failed, so the current action is not executed type: string enum: - success - continued - returned - notExecuted example: success error: description: Detailed information about the underlying error. type: string example: Request timeout while waiting for selector '#form-input' interactionLogs: description: | Messages logged with `console.log()` from [browser scripts](/zyte-api/ide/index.md). type: array items: $ref: '#/components/schemas/InteractionLogEntry' required: - action - elapsedTime - status InteractionLogEntry: description: Interaction log entry type: object properties: time: description: Time in ISO 8601 format type: string level: description: The log level type: string enum: - debug - info - warning - error - warn message: description: The log message type: string ActionTimeout: description: Maximum wait time in seconds. type: number minimum: 0.0 default: 5.0 maximum: 15.0 UrlPattern: description: | A string to compare with a URL according to `urlMatchingOptions`. type: string example: - https://example.com/api - /api/store/fulfilment PatternMatchingOptions: description: | How to compare a user-defined string with a target string: - `contains` matches if the user-defined string is a substring of the target string. - `exact` matches if the user-defined string is an exact match of the target string. - `startsWith` matches if the target string starts with the user-defined string.
- `endsWith` matches if the target string ends with the user-defined string. Comparisons are case-sensitive. Regular expressions and wildcard characters are not supported. type: string enum: - startsWith - endsWith - contains - exact default: contains ActionSelector: description: | A CSS or XPath selector to search for an element. properties: type: description: The type of selector - CSS or XPath type: string enum: - css - xpath value: type: string minLength: 1 maxLength: 500 state: description: | State can be any of the following values, and defaults to visible: * 'visible' - The element has a non-empty bounding box and no visibility:hidden. Note that an element without content or with display:none has an empty bounding box, and is not considered visible. * 'hidden' - The element is either detached from the DOM, or has an empty bounding box or visibility:hidden. This is the opposite of the 'visible' option. * 'attached' - The element is present in the DOM; it can be visible or hidden type: string enum: - attached - visible - hidden default: visible required: - type - value onError: description: | Handle errors encountered while executing a particular action. * continue - When a particular action fails, the action sequence continues, executing the next actions * return - When a particular action fails, the action sequence stops, not executing any more actions When an action sequence finishes prematurely, the service will return the entire response body up until the point of execution. type: string enum: - continue - return default: return ActionSequence: description: | Sequence of browser actions to execute. Select an action below to see its API reference. When using actions, you get the actions response field with debug information about action execution. [See an example](/zyte-api/usage/browser.md).
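As a minimal sketch, the following builds a one-step action sequence and lists the entries of the actions response field that did not succeed. The selector value, the helper name, and the sample results are assumptions for illustration; the results are built locally and are not real API output.

```python
# A single click action that lets the sequence continue if it fails.
actions = [
    {
        "action": "click",
        "selector": {"type": "css", "value": "#main"},
        "onError": "continue",
    },
]
request_body = {"url": "https://example.com/", "browserHtml": True, "actions": actions}


def unsuccessful_actions(results: list) -> list:
    """Return the action results whose status is not 'success'."""
    return [result for result in results if result["status"] != "success"]


# Stand-in for the actions response field.
fake_results = [
    {"action": "click", "elapsedTime": 0.2, "status": "continued",
     "error": "Element not found"},
    {"action": "waitForSelector", "elapsedTime": 1.0, "status": "success"},
]
problems = unsuccessful_actions(fake_results)
```

Because action order in the response matches that of the request, the position of each entry in `problems` can be mapped back to the action that produced it.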
type: array items: oneOf: - $ref: '#/components/schemas/click' - $ref: '#/components/schemas/doubleClick' - $ref: '#/components/schemas/evaluate' - $ref: '#/components/schemas/goto' - $ref: '#/components/schemas/hide' - $ref: '#/components/schemas/hover' - $ref: '#/components/schemas/interaction' - $ref: '#/components/schemas/keyPress' - $ref: '#/components/schemas/reload' - $ref: '#/components/schemas/scrollBottom' - $ref: '#/components/schemas/scrollTo' - $ref: '#/components/schemas/searchKeyword' - $ref: '#/components/schemas/select' - $ref: '#/components/schemas/setLocation' - $ref: '#/components/schemas/type' - $ref: '#/components/schemas/waitForNavigation' - $ref: '#/components/schemas/waitForRequest' - $ref: '#/components/schemas/waitForResponse' - $ref: '#/components/schemas/waitForSelector' - $ref: '#/components/schemas/waitForTimeout' Action: description: Action to perform. type: object properties: onError: $ref: '#/components/schemas/onError' example: action: click selector: type: css value: '#main' GoToOptions: description: Used to customise navigation options type: object properties: waitUntil: description: | When to consider navigation succeeded, defaults to load. Events can be either: * load - consider navigation to be finished when the load event is fired. * networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms * domcontentloaded - consider navigation to be finished when the DOMContentLoaded event is fired. type: string enum: - load - networkidle0 - domcontentloaded default: load timeout: description: Maximum navigation time in seconds, defaults to 30 seconds. Pass 0 to disable timeout. type: integer default: 30 minimum: 0 ScreenshotOptions: description: | Options for the screenshot taken when the screenshot request field is `true`. type: object properties: format: description: | File format. JPEG screenshots are taken with a quality of 75%. 
type: string enum: - png - jpeg default: jpeg fullPage: description: | When `true`, the screenshot features the full page. When `false`, it features only what is visible on the browser window (viewport). Full page screenshots: - Are only available in JPEG format. - Have a minimum resolution of 1920x1080, i.e. for pages smaller than 1920x1080, the screenshot looks the same regardless of the value of `fullPage`. - Any image exceeding 5000 (width) x 10000 (height) pixels will be clipped to those dimensions. type: boolean default: false ExtractionOptions: description: | Options for automatic extraction. type: object properties: extractFrom: type: string enum: - httpResponseBody - browserHtml - browserHtmlOnly description: | [Extraction source](/zyte-api/usage/extract/index.md). `httpResponseBody` extracts from httpResponseBody. It is usually faster and cheaper. `browserHtmlOnly` extracts from browserHtml. It typically improves quality over `httpResponseBody` on JavaScript-heavy web pages. `browserHtml` extracts from both browserHtml and visual features of the rendered web page. It typically improves quality over `browserHtmlOnly`, but is not as robust in case of rendering issues. If not specified, `browserHtml` is currently used by default for [AI extraction](/zyte-api/usage/extract/index.md), while `httpResponseBody` is used by default for [non-AI extraction](/zyte-api/usage/extract/index.md). In the future, the default value may depend on the target website. RequestHeaders: description: | HTTP request headers. Can only be used in a [browser request](/zyte-api/usage/browser.md). For [HTTP requests](/zyte-api/usage/http.md), see customHttpRequestHeaders. At the moment it only supports the [Referer header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer). [See an example](/zyte-api/usage/browser.md). properties: referer: description: | [Referer header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer). 
type: string example: https://www.google.com/ CustomHttpRequestHeader: properties: name: type: string maxLength: 200 example: X-APOLLO-OPERATION-NAME value: type: string maxLength: 7000 example: nearByNodes Session: description: | Parameters to create or reuse a [client-managed session](/zyte-api/usage/features.md). If `id` does not match one of your running sessions, a new session is created with that session ID. Otherwise, the matching running session is reused. Client-managed sessions may expire due to any of the following: - 15 minutes (900 seconds) have passed since the session was created. - 2 minutes (120 seconds) have passed since the session was last used. - Requests using this session got banned 3 times in a row. For 5-10 minutes after a session expires, Zyte API keeps track of the expired session and does not allow re-using it. After that time, attempts to reuse the session will instead create a new session. [See an example](/zyte-api/usage/features.md). example: id: ab837d21-f848-42b2-8e88-47ea9d84bad0 properties: id: description: | User-defined session ID. It must be a [version 4 UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random)), i.e. a randomly-generated UUID. type: string type: object Cookie: type: object properties: name: type: string maxLength: 4085 description: Cookie name value: type: string maxLength: 4085 description: Cookie value domain: type: string maxLength: 253 description: Domain the cookie belongs to path: type: string description: Path the cookie belongs to expires: type: integer format: int64 description: Unix time in seconds.
httpOnly: type: boolean secure: type: boolean sameSite: type: string enum: - Strict - Lax - Extended - None required: - name - value - domain PostalAddress: description: Postal address to be set type: object properties: addressCountry: description: The country code in [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) type: string example: US addressRegion: description: The region in which the address is. This value is specific to the website. type: string example: California streetAddress: description: The street address. type: string postalCode: description: The postal code. type: string NetworkCaptureFilterSequence: type: array maxItems: 10 description: | Filters to capture browser network responses. HTTP responses received during browser rendering (including action execution) will be returned in the networkCapture response field if they match any of the filters defined here. You can capture up to 10 responses, provided the sum of their bodies does not exceed 5 MiB. If they do exceed that limit, only the first captured responses within the limit are returned. [See an example](/zyte-api/usage/browser.md). items: $ref: '#/components/schemas/NetworkCaptureFilter' NetworkCaptureFilter: type: object # Note: In the rendered docs, this description only appears in the # networkCapture.filter response field, so the wording is tailored for # that. description: | Filter defined in the networkCapture request field that matched the captured response. properties: filterType: type: string enum: - url - resourceType httpResponseBody: type: boolean default: false description: | Set to `true` to get the body of the captured response in the [networkCapture[].httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/networkCapture.httpResponseBody) response field. 
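A minimal sketch of building a URL-based network capture filter that also requests the response body. It checks the documented limits on the client side before sending. `url_filter` is a hypothetical helper, and the URL fragment `/api/` is a placeholder.

```python
MATCH_TYPES = {"startsWith", "endsWith", "contains", "exact"}


def url_filter(value: str, match_type: str = "contains",
               capture_body: bool = False) -> dict:
    """Build a URL filter (hypothetical helper, for illustration only)."""
    if match_type not in MATCH_TYPES:
        raise ValueError(f"unsupported matchType: {match_type}")
    if not 3 <= len(value) <= 8192:
        # The filter value must be between 3 and 8192 characters long.
        raise ValueError("value must be 3 to 8192 characters long")
    return {
        "filterType": "url",
        "value": value,
        "matchType": match_type,
        "httpResponseBody": capture_body,
    }


request_body = {
    "url": "https://example.com/",
    "browserHtml": True,
    # Up to 10 filters are allowed in networkCapture.
    "networkCapture": [url_filter("/api/", capture_body=True)],
}
```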
required: - filterType discriminator: propertyName: filterType mapping: url: '#/components/schemas/UrlFilter' resourceType: '#/components/schemas/ResourceTypeFilter' UrlFilter: description: An object specifying how to capture responses by matching URL allOf: - $ref: '#/components/schemas/NetworkCaptureFilter' - properties: value: type: string minLength: 3 maxLength: 8192 description: | A string to compare with the URL of network responses according to `matchType`. matchType: $ref: '#/components/schemas/PatternMatchingOptions' - required: - value - matchType ResourceTypeFilter: description: An object specifying how to capture responses by resource type allOf: - $ref: '#/components/schemas/NetworkCaptureFilter' - properties: resourceType: type: string enum: - document - xhr description: | A resource type for a network response to match: - `document` is the source HTML document, which might change during browser rendering or through actions. - `xhr` is a response obtained using [XMLHttpRequest](http://devdoc.net/web/developer.mozilla.org/en-US/docs/XMLHttpRequest.1.html). - required: - resourceType CapturedResponse: type: object properties: interceptionStatus: type: object description: | Exit status of the network capture. If `interceptionStatus.status` is `error`, `httpResponseBody` is not delivered. Possible causes of error include all matching responses exceeding the maximum total body size of 5 MiB. properties: status: type: string enum: - success - error error: type: string description: | Error message. This field is only present if `interceptionStatus.status` is `error`. statusCode: type: integer description: HTTP status code of the captured response. httpResponseBody: type: string format: byte description: | [Base64](https://en.wikipedia.org/wiki/Base64)-encoded body of the captured response. 
To get this response field, set the [networkCapture[].httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/networkCapture.httpResponseBody) request field to `true`. url: type: string format: uri description: Captured response URL. headers: type: object description: Captured response headers. filter: $ref: '#/components/schemas/NetworkCaptureFilter' request: type: object description: Captured request that got the captured response. properties: url: type: string description: URL of the captured request. headers: type: object description: Headers of the captured request. method: type: string description: HTTP method of the captured request. body: type: string description: Body of the captured request, if any. Viewport: type: object description: | [Browser viewport](https://developer.mozilla.org/en-US/docs/Glossary/Viewport). properties: width: type: integer description: Viewport width, in pixels. default: 1920 minimum: 320 maximum: 5120 height: type: integer description: Viewport height, in pixels. default: 1080 minimum: 360 maximum: 4096 CountryCode: description: | [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) code of a country from which the request should be sent, i.e. the request [geolocation](/zyte-api/usage/features.md). If not specified, Zyte API will use a geolocation that, for the target website, does not cause bans or unexpected locale changes in the response data, such as the wrong language, currency, date format, time zone, etc. If you believe Zyte API is using the wrong default geolocation for a website, please [reach out to our expert anti-ban team](https://support.zyte.com/support/tickets/new). For some websites, however, you might want to set a custom geolocation. For example, you may be interested in visiting the same URL from different locations. Zyte API provides 2 sets of geolocations. 
Standard geolocations are `AU`, `BE`, `BR`, `CA`, `CN`, `DE`, `ES`, `FR`, `GB`, `IN`, `IT`, `JP`, `KR`, `MX`, `NL`, `PL`, `RU`, `TR`, `US`, and `ZA`. All other geolocations are [extended geolocations](/zyte-api/usage/features.md). [See an example](/zyte-api/usage/features.md). example: US type: string enum: - AW - AF - AO - AI - AX - AL - AD - AE - AR - AM - AS - AQ - TF - AG - AU - AT - AZ - BI - BE - BJ - BQ - BF - BD - BG - BH - BS - BA - BL - BY - BZ - BM - BO - BR - BB - BN - BT - BV - BW - CF - CA - CC - CH - CL - CN - CI - CM - CD - CG - CK - CO - KM - CV - CR - CU - CW - CX - KY - CY - CZ - DE - DJ - DM - DK - DO - DZ - EC - EG - ER - EH - ES - EE - ET - FI - FJ - FK - FR - FO - FM - GA - GB - GE - GG - GH - GI - GN - GP - GM - GW - GQ - GR - GD - GL - GT - GF - GU - GY - HK - HM - HN - HR - HT - HU - ID - IM - IN - IO - IE - IR - IQ - IS - IL - IT - JM - JE - JO - JP - KZ - KE - KG - KH - KI - KN - KR - KW - LA - LB - LR - LY - LC - LI - LK - LS - LT - LU - LV - MO - MF - MA - MC - MD - MG - MV - MX - MH - MK - ML - MT - MM - ME - MN - MP - MZ - MR - MS - MQ - MU - MW - MY - YT - NA - NC - NE - NF - NG - NI - NU - NL - NO - NP - NR - NZ - OM - PK - PA - PN - PE - PH - PW - PG - PL - PR - KP - PT - PY - PS - PF - QA - RE - RO - RU - RW - SA - SD - SN - SG - GS - SH - SJ - SB - SL - SV - SM - SO - PM - RS - SS - ST - SR - SK - SI - SE - SZ - SX - SC - SY - TC - TD - TG - TH - TJ - TK - TM - TL - TO - TT - TN - TR - TV - TW - TZ - UG - UA - UM - UY - US - UZ - VA - VC - VE - VG - VI - VN - VU - WF - WS - YE - ZA - ZM - ZW OrganicResult: type: object properties: description: type: string description: Result excerpt. example: > Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently- ... name: type: string description: Result title. 
example: squid-cache.org url: $ref: '#/components/schemas/OrganicResultURL' rank: type: integer example: 1 description: | Result position among organic results in the search page. The first result of a search page is always 1, regardless of the value of serp.pageNumber. Metadata: type: object description: Metadata. properties: displayedQuery: type: string description: Search query as seen in the web page. example: squid proxy searchedQuery: type: string description: Search query as specified in the input URL. example: squid proxy totalOrganicResults: type: integer format: int64 description: | Total number of organic results reported by the search engine. minimum: 0 example: 10000 dateDownloaded: type: string description: | The timestamp at which the data was downloaded. Timezone: UTC. Format: ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ" example: '2024-02-29T13:01:54Z' SearchResultsPage: type: object description: | Search engine results page data. To get this response field, set the serp request field to `true`. properties: organicResults: type: array description: | List of search results excluding paid results. items: $ref: '#/components/schemas/OrganicResult' url: $ref: '#/components/schemas/SearchURL' pageNumber: type: integer description: Page number. minimum: 1 metadata: $ref: '#/components/schemas/Metadata' product: $ref: '#/components/schemas/Product' OrganicResultURL: type: string pattern: ^https?://[\S]+$ description: Result URL. example: https://www.squid-cache.org/ additionalProperties: false SearchURL: type: string pattern: ^https?://[\S]+$ description: | Search URL. Should match url. example: https://www.google.pl/search?q=squid+proxy additionalProperties: false URL: type: string pattern: ^https?://[\S]+$ additionalProperties: false SerpOptions: type: object description: Options for SERP extraction. properties: extractFrom: type: string enum: - browserHtml - httpResponseBody description: | Input to use for extraction, either httpResponseBody or browserHtml. 
If not specified, `httpResponseBody` is currently used by default. In the future, the default value may depend on the target website. click: allOf: - properties: action: enum: - click description: Click on an element. selector: $ref: '#/components/schemas/ActionSelector' button: description: Mouse button to click type: string enum: - left - right - middle default: left delay: description: Time to wait between mousedown and mouseup, in seconds. type: number minimum: 0 maximum: 3 default: 0 waitForNavigationTimeout: description: | Maximum waiting time in seconds for the navigation event during the click action. If navigation happens within the defined duration, then waiting is halted and the next action is executed after the new page is loaded. If the page loading does not finish, then the next action ends with an error, and following actions may not be executed, depending on the onError property. If no navigation happens within the defined duration, then the next action is executed. type: number minimum: 0 maximum: 20 default: 0 required: - selector - action - $ref: '#/components/schemas/Action' doubleClick: allOf: - properties: action: enum: - doubleClick description: Double click on an element. selector: $ref: '#/components/schemas/ActionSelector' required: - selector - action - $ref: '#/components/schemas/Action' evaluate: allOf: - properties: action: enum: - evaluate description: | Run JavaScript code in the page context. This is a very powerful action. Use cases include: - Sending an API request from the page context, and writing the response somewhere in the DOM, so that the [browser HTML](/zyte-api/usage/browser.md) output includes it. source: description: JavaScript code to run. type: string maxLength: 2000 required: - source - action - $ref: '#/components/schemas/Action' goto: allOf: - properties: action: enum: - goto description: | Navigate to a new page. This action waits until the page load event is fired, with a default timeout of 30 seconds.
url: description: URL to navigate page to. The url should include scheme type: string options: $ref: '#/components/schemas/GoToOptions' required: - url - action - $ref: '#/components/schemas/Action' hide: allOf: - properties: action: enum: - hide description: Hide an element. selector: $ref: '#/components/schemas/ActionSelector' required: - action - selector - $ref: '#/components/schemas/Action' hover: allOf: - properties: action: enum: - hover description: | Hover over a visible element. Elements that are either hidden or not present will cause the action to exit with an error. selector: $ref: '#/components/schemas/ActionSelector' required: - selector - action - $ref: '#/components/schemas/Action' interaction: allOf: - properties: action: enum: - interaction description: | Execute a [browser script](//zyte-api/ide/index.md). id: description: Script identifier type: string args: description: Input arguments type: object required: - id - action - $ref: '#/components/schemas/Action' keyPress: allOf: - properties: action: enum: - keyPress description: Press a key on the keyboard. key: type: string maxLength: 14 description: | Key to press. A single character or special key from the [list of supported keys](/zyte-api/ide/api/index.md). Key names are case-sensitive. Only one key can be executed at a time. Key combinations are not supported. required: - key - action - $ref: '#/components/schemas/Action' reload: allOf: - properties: action: enum: - reload description: | Reload the page. This action waits until page load event is fired with a default timeout of 30 seconds. options: $ref: '#/components/schemas/GoToOptions' required: - action - $ref: '#/components/schemas/Action' scrollBottom: allOf: - properties: action: enum: - scrollBottom description: | Continuously scroll down the page while it keeps loading more content. 
The action halts if any of the following conditions are met: - the timeout or the total browser execution time is reached - the page does not load any new content for the duration of maxScrollDelay - maxPageHeight or maxScrollCount have been reached timeout: description: Maximum wait time, in seconds. type: number minimum: 0.0 default: 15.0 maximum: 30.0 maxScrollDelay: description: | The maximum amount of time to wait for each scroll to complete, in seconds. If the page does not load any content during this time, the action is deemed to have been completed. type: number default: 1.5 minimum: 0.5 maximum: 10 maxPageHeight: description: Maximum height (in pixels) until which the browser keeps scrolling down the page type: integer maxScrollCount: description: | The maximum number of scrolls to perform. If the page does not yield any fresh content, then the action will finish execution before maxScrollCount is reached. type: integer scrollStep: description: | The number of pixels for each scroll. It can be used for gradual scrolling. If it is specified, maxScrollDelay is used as a fixed waiting time instead of waiting for new content. type: integer default: 0 minimum: 0 required: - action - $ref: '#/components/schemas/Action' scrollTo: allOf: - properties: action: enum: - scrollTo description: | Scroll the window to a particular place in the document. To set the target location, use one (and only one) of the following: - `top` and `left`, to set the target coordinates in pixels. - `selector`, to target the center of an HTML element. top: description: Specifies the number of pixels along the Y axis to scroll the window. type: integer left: description: Specifies the number of pixels along the X axis to scroll the window. type: integer default: 0 selector: allOf: - $ref: '#/components/schemas/ActionSelector' - description: If passed, scrolls to the specified selector instead of scrolling to the specified coordinates within the page.
If the selector is not found, no scroll is performed. If more than one element matches the selector, it scrolls to the first one. required: - action - $ref: '#/components/schemas/Action' searchKeyword: allOf: - properties: action: enum: - searchKeyword description: | Perform keyword search on the page. This action uses website-specific knowledge to find and use a search box. It may not work on some websites. If that’s the case, please [reach out to us](https://support.zyte.com/support/tickets/new). If there is no search box on a page, an error is returned. keyword: description: The keyword to be searched for type: string required: - keyword - action - $ref: '#/components/schemas/Action' select: allOf: - properties: action: enum: - select description: | Pick single or multiple values from a `<select>` element. selector: $ref: '#/components/schemas/ActionSelector' values: description: | If the `<select>` has the multiple attribute, all values are considered, otherwise only the first one is taken into account. type: array items: type: string required: - selector - values - action - $ref: '#/components/schemas/Action' setLocation: allOf: - properties: action: enum: - setLocation description: | Configure a physical address on the website. This action uses website-specific knowledge to find and fill a location form. It may not work on some websites. If that’s the case, please [reach out to us](https://support.zyte.com/support/tickets/new). address: $ref: '#/components/schemas/PostalAddress' required: - action - $ref: '#/components/schemas/Action' type: allOf: - properties: action: enum: - type description: Type text into an element. selector: $ref: '#/components/schemas/ActionSelector' text: description: | Text to type into a focused element. To press a special key, use the `keyPress` action instead. type: string delay: description: Time to wait between key presses, in seconds.
type: number minimum: 0 default: 0 required: - selector - text - action - $ref: '#/components/schemas/Action' waitForNavigation: allOf: - properties: action: enum: - waitForNavigation description: | Wait until the page navigates to a new URL or reloads. If `waitForNavigation` is the first action, the specified options will be applied to the initial navigation. Use it if the default timeout of 30 seconds or the default `waitUntil` value (`load`) is not sufficient for the initial navigation. Mind, however, that using `waitForNavigation` as the first action has an important drawback: any error with the initial navigation will nonetheless result in a successful API response, as any other [browser action failure](/zyte-api/usage/errors.md). timeout: type: number maximum: 45.0 minimum: 31.0 default: 31.0 description: | Maximum wait time, in seconds. waitUntil: default: load description: | When to consider that navigation succeeded: - `load` - [load event](https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event), default. - `domcontentloaded` - [DOMContentLoaded event](https://developer.mozilla.org/en-US/docs/Web/API/Window/DOMContentLoaded_event). - `networkidle0` - no ongoing network connections for at least 0.5 seconds. type: string enum: - load - domcontentloaded - networkidle0 required: - timeout - action nullable: false - $ref: '#/components/schemas/Action' waitForRequest: allOf: - properties: action: enum: - waitForRequest description: Wait until the request to a specific URL has been sent. 
urlPattern: $ref: '#/components/schemas/UrlPattern' urlMatchingOptions: $ref: '#/components/schemas/PatternMatchingOptions' timeout: $ref: '#/components/schemas/ActionTimeout' example: # To wait for a request to https://example.org/store/api/ref=sspa_dk_left_sx_aax_0 - urlPattern: https://example.org/store/api # To wait for a request to https://cdn123.example.org/api/store?q=1234 - urlPattern: api/store urlMatchingOptions: contains # To wait for a request to https://example.org/afsk123/ref=sspa_dk_left_sx_aax_0 - urlPattern: https://example.org/ urlMatchingOptions: startsWith required: - urlPattern - action - $ref: '#/components/schemas/Action' waitForResponse: allOf: - properties: action: enum: - waitForResponse description: Wait until the response from a specific URL has been received. urlPattern: $ref: '#/components/schemas/UrlPattern' urlMatchingOptions: $ref: '#/components/schemas/PatternMatchingOptions' timeout: $ref: '#/components/schemas/ActionTimeout' example: # To wait for a response from https://cdn123.example.org/store/api?q=1234 - urlPattern: /store/api urlMatchingOptions: contains # To wait for a response from https://example.org/store/ref=sspa_dk_left_sx_aax_0 - urlPattern: https://example.org/store/ urlMatchingOptions: startsWith required: - urlPattern - action - $ref: '#/components/schemas/Action' waitForSelector: allOf: - properties: action: enum: - waitForSelector description: | Wait for the selector to appear. If at the moment of calling the method the selector already exists, the action will return immediately. Also, the action will return immediately after the first matching selector appears. For a usage example, see the [web scraping tutorial](/web-scraping/tutorial/js.md). 
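As a rough client-side illustration of the `waitForRequest` and `waitForResponse` examples above, the following Python sketch builds action objects with the documented `action`, `urlPattern`, and `urlMatchingOptions` keys and serializes them to JSON. The helper name `wait_action` and the constant `WAIT_ACTIONS` are hypothetical, and sending the list under an `actions` key next to `url` and `browserHtml` is an assumption about the request body, not something this part of the schema states.

```python
import json

# Action names come from the schema above; the helper itself is illustrative.
WAIT_ACTIONS = {"waitForRequest", "waitForResponse"}


def wait_action(action, url_pattern, url_matching_options=None):
    """Build a wait action dict (hypothetical helper, not part of any SDK)."""
    if action not in WAIT_ACTIONS:
        raise ValueError(f"unsupported wait action: {action!r}")
    if not url_pattern:
        raise ValueError("urlPattern is required")
    payload = {"action": action, "urlPattern": url_pattern}
    # urlMatchingOptions is not in the required list, so omit it when unset.
    if url_matching_options is not None:
        payload["urlMatchingOptions"] = url_matching_options
    return payload


actions = [
    wait_action("waitForRequest", "api/store", "contains"),
    wait_action("waitForResponse", "https://example.org/store/", "startsWith"),
]

# Assumed request layout: the action list travels under an "actions" key.
request_body = {
    "url": "https://example.org/store/",
    "browserHtml": True,
    "actions": actions,
}
print(json.dumps(request_body, indent=2))
```

Building the dictionaries through one small function keeps optional keys out of the payload when they are not set, which makes it easier to compare the result with the schema's own examples.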
selector: $ref: '#/components/schemas/ActionSelector' timeout: $ref: '#/components/schemas/ActionTimeout' required: - selector - action - $ref: '#/components/schemas/Action' waitForTimeout: allOf: - properties: action: enum: - waitForTimeout description: | Pause script execution for the given number of seconds before continuing. If the value of timeout is greater than the remaining browser execution time, then this action ends with an error. timeout: $ref: '#/components/schemas/ActionTimeout' required: - action - $ref: '#/components/schemas/Action' SessionContextActionSequence: description: | Actions to run to initialize a server-managed session for a given sessionContext. type: array items: oneOf: - $ref: '#/components/schemas/click' - $ref: '#/components/schemas/doubleClick' - $ref: '#/components/schemas/evaluate' - $ref: '#/components/schemas/goto' - $ref: '#/components/schemas/hide' - $ref: '#/components/schemas/hover' - $ref: '#/components/schemas/interaction' - $ref: '#/components/schemas/keyPress' - $ref: '#/components/schemas/reload' - $ref: '#/components/schemas/scrollBottom' - $ref: '#/components/schemas/scrollTo' - $ref: '#/components/schemas/searchKeyword' - $ref: '#/components/schemas/select' - $ref: '#/components/schemas/setLocation' - $ref: '#/components/schemas/type' - $ref: '#/components/schemas/waitForNavigation' - $ref: '#/components/schemas/waitForRequest' - $ref: '#/components/schemas/waitForResponse' - $ref: '#/components/schemas/waitForSelector' - $ref: '#/components/schemas/waitForTimeout' CustomAttribute: type: object properties: description: type: string maxLength: 300 type: type: string enum: - boolean - string - number - integer - array - object discriminator: propertyName: type mapping: boolean: '#/components/schemas/CustomAttributeBoolean' string: '#/components/schemas/CustomAttributeString' number: '#/components/schemas/CustomAttributeNumber' integer: '#/components/schemas/CustomAttributeInteger' array:
'#/components/schemas/CustomAttributeArray' object: '#/components/schemas/CustomAttributeObject' CustomAttributeBoolean: allOf: - $ref: '#/components/schemas/CustomAttribute' type: object required: - type CustomAttributeString: allOf: - $ref: '#/components/schemas/CustomAttribute' - properties: enum: type: array minItems: 2 maxItems: 100 items: type: string maxLength: 50 minLength: 1 format: type: string enum: - html - uri - html-text - xpath required: - type CustomAttributeNumber: allOf: - $ref: '#/components/schemas/CustomAttribute' - properties: enum: type: array minItems: 2 maxItems: 10 items: type: number type: object required: - type CustomAttributeInteger: allOf: - $ref: '#/components/schemas/CustomAttribute' - properties: enum: type: array minItems: 2 maxItems: 10 items: type: integer type: object required: - type CustomAttributeArray: allOf: - $ref: '#/components/schemas/CustomAttribute' properties: items: $ref: '#/components/schemas/CustomAttribute' type: object required: - type - items CustomAttributeObject: allOf: - $ref: '#/components/schemas/CustomAttribute' properties: properties: type: object additionalProperties: $ref: '#/components/schemas/CustomAttribute' type: object required: - type - properties Article: type: object properties: headline: description: Article headline or title. type: string example: Article headline articleBody: description: | Clean text of the article, including sub-headings, with newline separators. type: string example: Article body ... articleBodyHtml: description: | Simplified and standardized HTML of the article body, including sub-headings, image captions and embedded content (videos, tweets, etc.). type: string example:
Article body ... ...
description: description: | A short summary of the article. It can be either human-provided (if available), or auto-generated. type: string example: Article summary datePublished: description: | Publication date. ISO-formatted with 'T' separator, may contain a timezone. If the actual publication date is not found, "dateModified" value is taken. type: string example: '2019-06-19T00:00:00' datePublishedRaw: description: | Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website. type: string example: June 19, 2019 dateModified: description: | The date when the article was most recently modified. ISO-formatted with 'T' separator, may contain a timezone. type: string example: '2019-06-21T00:00:00' dateModifiedRaw: description: | Same date as "dateModified", but before parsing/normalization, i.e. as it appears on the website. type: string example: June 21, 2019 authors: description: Authors of the article. type: array items: $ref: '#/components/schemas/Author' example: - name: Alice nameRaw: Alice and Bob - name: Bob nameRaw: Alice and Bob inLanguage: description: | Language of the article, as an ISO 639-1 language code. Example: "en". Sometimes article language is not the same as the web page overall language; to get the detected web page languages, see "webPageInfo". type: string example: en breadcrumbs: description: | A list of breadcrumbs (a specific navigation element) with optional `name` and `url`. example: - name: Home url: https://example.com/ - name: Cell Phones url: https://example.com/cell-phones - name: Cell Phones & Accessories type: array items: $ref: '#/components/schemas/Breadcrumb' mainImage: $ref: '#/components/schemas/Image' description: The main image of the item. images: description: All images of the item (may include the main image). type: array items: $ref: '#/components/schemas/Image' videos: description: A list of all videos inside the article body. 
type: array items: type: object properties: url: description: Absolute URL of the video. type: string example: https://example.com/video.mp4 required: - url audios: description: A list of all audios inside the article body. type: array items: type: object properties: url: description: Absolute URL of the audio. type: string example: https://example.com/audio.mp3 required: - url url: description: URL of a page where this article was extracted. type: string example: https://example.com/article/ canonicalUrl: description: Canonical URL of the article, if available. type: string example: https://example.com/article metadata: $ref: '#/components/schemas/Metadata__metadata' required: - url - metadata Author: description: Author of the article. type: object properties: name: description: Full name of the author, e.g. "Alice". type: string nameRaw: description: Text from which this author name was extracted, e.g. "Alice and Bob". type: string example: name: Alice nameRaw: Alice and Bob required: - name Breadcrumb: description: | Breadcrumb item (a specific navigation element) with optional `name` and `url`. example: name: Home url: https://example.com/ type: object properties: name: description: Text of the breadcrumb, as it appears on the website. type: string url: description: Absolute URL of the breadcrumb. type: string Image: description: Image. type: object properties: url: description: URL of an image. type: string example: http://example.com/item-1/image1.jpeg required: - url Metadata__metadata: description: Extracted item metadata for single-item data types. type: object properties: probability: description: | Probability that extracted item is of requested data type. It is closer to 0 in case this page does not contain requested data type. For example, when single product extraction is requested with "product: true", but a page does not contain a product, probability would be close to 0. 
If an item of requested type can be extracted from a page, then probability is closer to 1. Recommended probability threshold is 0.5, but we will return extracted data even if probability is very low. type: number minimum: 0.0 maximum: 1.0 example: 0.87 dateDownloaded: $ref: '#/components/schemas/DateDownloaded' required: - probability - dateDownloaded DateDownloaded: description: | The timestamp at which the data was downloaded. Timezone: UTC. Format: ISO 8601 format: "YYYY-MM-DDThh:mm:ssZ" type: string example: '2019-06-19T08:27:43Z' ArticleList: type: object properties: articles: description: List of articles available on this page. type: array items: type: object properties: url: description: | URL of a detailed article page. Pass this URL with "article: true" in the request to extract detailed information about the article. type: string example: https://example.com/articles/1/ headline: description: Article headline or title. type: string example: Article headline articleBody: description: | Text of the article as it appears on the list page, including sub-headings, with newline separators. type: string example: Article body ... datePublished: description: | Publication date. ISO-formatted with 'T' separator, may contain a timezone. type: string example: '2019-06-19T00:00:00' datePublishedRaw: description: | Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website. type: string example: June 19, 2019 authors: description: Authors of the article. type: array items: $ref: '#/components/schemas/Author' example: - name: Alice nameRaw: Alice and Bob - name: Bob nameRaw: Alice and Bob inLanguage: description: | Language of the article, as an ISO 639-1 language code. Example: "en". Sometimes article language is not the same as the web page overall language; to get the detected web page languages, see "webPageInfo". type: string example: en mainImage: $ref: '#/components/schemas/Image' description: The main image of the item. 
images: description: All images of the item (may include the main image). type: array items: $ref: '#/components/schemas/Image' metadata: $ref: '#/components/schemas/MetadataListItem' required: - metadata url: description: URL of a page where this article list was extracted. type: string example: https://example.com/articles/ metadata: $ref: '#/components/schemas/MetadataList' required: - url - metadata MetadataListItem: description: Item-level metadata for list data types. properties: probability: description: | Probability that extracted item in a list is a valid item. Items which are unlikely to be valid are not returned, so normally no extra thresholding is needed for list items. This probability is not calibrated. type: number minimum: 0.0 maximum: 1.0 example: 0.34 required: - probability MetadataList: description: Top-level metadata for list data types. properties: dateDownloaded: $ref: '#/components/schemas/DateDownloaded' required: - dateDownloaded ArticleNavigation: type: object properties: nextPage: $ref: '#/components/schemas/PaginationNext' pageNumber: $ref: '#/components/schemas/PageNumber' items: description: List of articles available on this page. type: array items: type: object properties: url: description: | URL of a detailed article page. Pass this URL with "article: true" in the request to extract detailed information about the article. type: string example: https://example.com/articles/1/ name: description: The name of the article or article link text. type: string example: Article name datePublished: description: | Publication date. ISO-formatted with 'T' separator, may contain a timezone. type: string example: '2019-06-19T00:00:00' datePublishedRaw: description: | Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website. type: string example: June 19, 2019 metadata: $ref: '#/components/schemas/MetadataListItem' required: - url - metadata url: description: URL of a page containing the list of articles. 
type: string example: https://example.com/articles/ metadata: $ref: '#/components/schemas/MetadataList' required: - url - metadata PaginationNext: description: A link to the next page in the list. type: object properties: url: description: URL of the next page in the list. type: string example: http://example.com/foo?p=3 name: description: Text of the link to the next page, if available. type: string example: '3' required: - url PageNumber: description: Integer describing the current page number. Starts at 1. type: integer example: 2 ForumThread: type: object properties: topic: description: Topic that is discussed on the page. type: object properties: name: description: Name of the topic. type: string example: How do you cook rice? required: - name posts: description: List of posts available on this page, including the first or top post. type: array items: type: object properties: text: description: | Text of the post. type: string example: Cooking rice is a hobby of mine. Here is how I cook it. datePublished: description: | Publication date. ISO-formatted with 'T' separator, may contain a timezone. type: string example: '2019-06-19T00:00:00' datePublishedRaw: description: | Same date as "datePublished", but before parsing/normalization, i.e. as it appears on the website. type: string example: June 19, 2019 reactions: description: Details of reactions to this post. type: object properties: likes: description: | Number of up-votes or likes/stars received by the post. type: integer minimum: 0 example: 3 replies: description: | Number of replies received by the post. type: integer minimum: 0 example: 2 metadata: $ref: '#/components/schemas/MetadataListItem' required: - metadata url: description: URL of a page where this forum post list was extracted. type: string example: https://example.com/forum/thread/1/ metadata: $ref: '#/components/schemas/MetadataList' required: - url - metadata JobPosting: type: object properties: jobTitle: description: The title of the job. 
type: string example: Regional Manager datePublished: description: | Publication date of the job posting. ISO-formatted with 'T' separator, may contain a timezone. type: string example: '2019-06-19T00:00:00' datePublishedRaw: description: | Same date as 'datePublished', but before parsing/normalization, i.e. as it appears on the website. type: string example: 19 June 2019 validThrough: description: | The date after which the job posting is not valid, e.g. the end of an offer. ISO-formatted with ‘T’ separator, may contain a timezone. type: string example: '2019-08-20T00:00:00' description: description: | A description of the job posting including sub-headings, with newline separators. type: string example: Job Description ... descriptionHtml: description: | Simplified HTML of the description, including sub-headings, image captions and embedded content. type: string example:
HTML for Job Description ... employmentType: description: | Type of employment (e.g. full-time, part-time, contract, temporary, seasonal, internship). type: string example: Full-time hiringOrganization: description: Information about the organization offering the job position. type: object properties: name: description: Name of the organization. type: string example: ACME Corp. required: - name baseSalary: description: | The base salary of the job or of an employee in the proposed role. type: object properties: raw: description: Salary amount as it appears on the website. example: $53,251 a year type: string valueMax: description: | The maximum value of the base salary as a number string. In case of only one value given for the salary instead of a range, valueMax is used to represent it. example: '53251.0' type: string currency: description: | Currency associated with the salary amount. ISO 4217 standard. type: string example: USD currencyRaw: description: Currency associated with the salary amount, without normalization. type: string example: $ jobLocation: description: | A (typically single) geographic location associated with the job position. type: object properties: raw: description: Job location as it appears on the website. type: string example: West New York, NJ 07093 required: - raw url: description: URL of a page where this job posting was extracted. type: string example: https://example.com/job metadata: $ref: '#/components/schemas/Metadata__metadata' required: - url - metadata JobPostingNavigation: type: object properties: nextPage: $ref: '#/components/schemas/PaginationNext' pageNumber: $ref: '#/components/schemas/PageNumber' items: description: List of job postings available on this page. type: array items: type: object properties: url: description: | URL of a detailed job posting page. Pass this URL with "jobPosting: true" in the request to extract detailed information about the job posting. 
type: string example: https://example.com/jobs/1/ name: description: The name of the job posting or job posting link text. type: string example: Job posting name metadata: $ref: '#/components/schemas/MetadataListItem' required: - metadata - url url: description: URL of a page. type: string example: https://example.com/jobs/ metadata: $ref: '#/components/schemas/MetadataList' required: - url - metadata PageContent: type: object properties: breadcrumbs: description: | A list of breadcrumbs (a specific navigation element). example: - name: Home url: https://example.com/ - name: Category url: https://example.com/category - name: Subcategory type: array items: $ref: '#/components/schemas/Breadcrumb' headline: description: A page headline. type: string example: Example page headline title: description: A page title extracted from the `<title>` tag of the page. type: string example: Example page title itemMain: description: | Text of the primary content of the page. It does not include navigation elements (headers, footers, sidebars or pagination links). type: string example: Example content snippet showing part of the page’s main text… itemMainXPath: description: | XPath for `itemMain`. It is an XPath 1.0 expression that points to the smallest HTML element that contains all of `itemMain`. The expression may only work with an HTML5-compliant parser. type: string example: //*[@id='homepage-container']/*[1] navigationHeader: description: | Navigation items from the header. They are typically for site-wide navigation, not page-specific. type: array items: type: object properties: url: description: URL. type: string example: https://example.com/category/ name: description: Name. type: string example: Category name required: - url navigationFooter: description: | Navigation items from the footer. They are typically for site-wide navigation, not page-specific. type: array items: type: object properties: url: description: URL.
type: string example: https://example.com/policy/ name: description: Name. type: string example: Privacy Policy required: - url navigationSidebar: description: | Navigation items from the sidebars. They are typically for site-wide navigation, not page-specific. type: array items: type: object properties: url: description: URL. type: string example: https://example.com/sidebar-link/ name: description: Name. type: string example: Sidebar link required: - url pagination: description: | Pagination items. Items to navigate content pages, either relative to the current page (e.g. current, next, previous) or absolute (e.g. first, last, specific page number). type: array items: type: object properties: url: description: URL. type: string example: https://example.com/?page=2 name: description: Name. type: string example: Next required: - url nextPage: $ref: '#/components/schemas/PaginationNext' url: description: URL of the page. type: string example: https://example.com/example-page/ canonicalUrl: description: Canonical URL of the page, if available. type: string example: https://example.com/canonical-url-page/ metadata: $ref: '#/components/schemas/Metadata__metadata' required: - url - metadata Product: type: object required: - url - metadata properties: name: $ref: '#/components/schemas/Name' price: $ref: '#/components/schemas/Price' currency: $ref: '#/components/schemas/Currency' currencyRaw: $ref: '#/components/schemas/CurrencyRaw' regularPrice: $ref: '#/components/schemas/RegularPrice' availability: $ref: '#/components/schemas/Availability' sku: $ref: '#/components/schemas/Sku' mpn: $ref: '#/components/schemas/Mpn' gtin: description: | Standardized GTIN product identifier which is unique for a product across different sellers. type: array items: $ref: '#/components/schemas/Gtin' brand: description: | Brand or manufacturer of the product. type: object properties: name: description: Name of the brand. 
type: string example: Product brand required: - name breadcrumbs: description: | A list of breadcrumbs (a specific navigation element) with optional `name` and `url`. example: - name: Home url: https://example.com/ - name: Cell Phones url: https://example.com/cell-phones - name: Cell Phones & Accessories type: array items: $ref: '#/components/schemas/Breadcrumb' mainImage: $ref: '#/components/schemas/Image' description: The main image of the item. images: description: All images of the item (may include the main image). type: array items: $ref: '#/components/schemas/Image' description: description: Description of the product. type: string example: product description descriptionHtml: description: > Simplified HTML of the description, including sub-headings, image captions and embedded content. type: string example: <article>HTML description for Product ... aggregateRating: description: | The overall rating, based on a collection of reviews or ratings. ![](https://docs.zyte.com/_static/images/schemas/rating.png) type: object properties: ratingValue: description: The average rating value. type: number example: 4.0 bestRating: description: The highest value allowed in this rating system. type: number example: 5.0 reviewCount: description: The total number of reviews or ratings for the product. type: integer minimum: 0 example: 24 color: $ref: '#/components/schemas/Color' size: $ref: '#/components/schemas/Size' weight: $ref: '#/components/schemas/Weight' material: description: | The materials from which the product is made. Contains all product materials on the page. type: string example: Metal, Plastic style: $ref: '#/components/schemas/Style' additionalProperties: description: | A list of properties or characteristics. * name field contains the property name, * value field contains the property value. 
![](https://docs.zyte.com/_static/images/schemas/product_info.png) type: array items: $ref: '#/components/schemas/AdditionalProperty' features: description: | A list of features of the Product. The features of a Product can be found generally on the product page arranged in a list, which is usually bulleted. type: array items: type: string example: - Multi-System Compatible - HD Ready 1366 x 768 LED Panel - REFRESH RATE 100Hz PQI url: $ref: '#/components/schemas/Url' canonicalUrl: $ref: '#/components/schemas/CanonicalUrl' metadata: $ref: '#/components/schemas/Metadata__metadata' variants: description: | Array of product variants, using the same Product schema. Represents extra information available about the variants of a product. All variants are included into this array, including the variant shown on the page. If some field in this array is empty, it means that either the value is the same as in the top-level product, or that extraction API did not manage to extract it. type: array items: type: object properties: name: $ref: '#/components/schemas/Name' price: $ref: '#/components/schemas/Price' currency: $ref: '#/components/schemas/Currency' currencyRaw: $ref: '#/components/schemas/CurrencyRaw' regularPrice: $ref: '#/components/schemas/RegularPrice' availability: $ref: '#/components/schemas/Availability' sku: $ref: '#/components/schemas/Sku' mpn: $ref: '#/components/schemas/Mpn' gtin: description: | Standardized GTIN product identifier which is unique for a product across different sellers. type: array items: $ref: '#/components/schemas/Gtin' mainImage: $ref: '#/components/schemas/Image' description: The main image of the item. images: description: All images of the item (may include the main image). type: array items: $ref: '#/components/schemas/Image' color: $ref: '#/components/schemas/Color' size: $ref: '#/components/schemas/Size' style: $ref: '#/components/schemas/Style' additionalProperties: description: | A list of properties or characteristics. 
* name field contains the property name, * value field contains the property value. ![](https://docs.zyte.com/_static/images/schemas/product_info.png) type: array items: $ref: '#/components/schemas/AdditionalProperty' url: $ref: '#/components/schemas/Url' canonicalUrl: $ref: '#/components/schemas/CanonicalUrl' Name: description: The name of the product. type: string example: Product name Price: description: > The price at which the product is being offered. If there is only one price associated with the offer, it is returned in this field. type: string pattern: ^[0-9]+(\.[0-9]+)?$ example: '149' Currency: description: > The ISO 4217 standard of the currency in which the price is in. type: string pattern: ^[A-Z]{3}$ example: USD CurrencyRaw: description: > The currency as given on the website, without extra normalization (for example, both "$" and "USD" are possible currencies). type: string example: $ RegularPrice: description: > The price before any discount or special offer. type: string pattern: ^[0-9]+(\.[0-9]+)?$ example: '199.00' Availability: description: > Availability, as a string. Allowed values: * `"InStock"` - includes limited availability, presale, preorder, and in-store only. * `"OutOfStock"` - includes discontinued and sold out. example: InStock type: string enum: - InStock - OutOfStock Sku: description: | The Stock Keeping Unit (SKU), i.e. a merchant-specific identifier for the product - identifier assigned by the seller. ![](https://docs.zyte.com/_static/images/schemas/sku.png) example: A123DK9823 type: string Mpn: description: The Manufacturer Part Number (MPN) of the product. It is issued by the manufacturer, and is the same across different e-commerce websites. type: string example: code-123 Gtin: type: object description: > Standardized GTIN product identifier which is unique for a product across different sellers. 
example: type: isbn13 value: 9781933624341 properties: type: description: | `gtin14` corresponds to former names *EAN/UCC-14*, *SCC-14*, *DUN-14*, *UPC Case Code*, *UPC Shipping Container Code*. `gtin13` also includes the *jan* (Japanese Article Number). enum: - gtin8 - gtin13 - gtin14 - isbn10 - isbn13 - ismn - issn - upc type: string value: description: The GTIN value as a string. type: string required: - type - value Color: description: Color of the product. type: string example: Red Size: description: | A standardized size of a product, specified through a simple textual string (for example "XL", "32Wx34L"). A single product dimension (height, width) is not considered as the size. type: string example: XL Weight: type: object properties: value: description: | A weight value expressed as a floating point number. type: number example: 120.0 unit: description: | A normalized unit of weight, like kilogram / ounce / pound and others. type: string example: kilogram rawUnit: description: | A unit of weight without normalization - how it was extracted from the page. The normalized version of rawUnit is in the 'unit' attribute. type: string example: kg Style: description: | Style of the product. It can also be referred to as pattern/finish on the product page. Example values: "Polka dots", "Striped", "Nickel finish with Translucent glass", etc. type: string example: Striped AdditionalProperty: description: | Additional property or characteristic. * name field contains the property name, * value field contains the property value. ![](https://docs.zyte.com/_static/images/schemas/product_info.png) example: name: batteries value: 1 Lithium ion batteries required. (included) type: object properties: name: description: Property name. type: string value: description: Property value. type: string required: - name Url: description: URL of a page where this product was extracted.
type: string example: https://example.com/product/ CanonicalUrl: description: Canonical URL of the product, if available. type: string example: https://example.com/product/ ProductList: type: object properties: breadcrumbs: description: | A list of breadcrumbs (a specific navigation element) with optional `name` and `url`. example: - name: Home url: https://example.com/ - name: Cell Phones url: https://example.com/cell-phones - name: Cell Phones & Accessories type: array items: $ref: '#/components/schemas/Breadcrumb' products: description: List of products available on this page. type: array items: type: object properties: url: description: | URL of a detailed product page. Pass this URL with "product: true" in the request to extract detailed information about the product. type: string example: https://example.com/products/1/ name: description: The name of the product. type: string example: Product name price: $ref: '#/components/schemas/Price' currencyRaw: $ref: '#/components/schemas/CurrencyRaw' currency: $ref: '#/components/schemas/Currency' regularPrice: $ref: '#/components/schemas/RegularPrice' mainImage: $ref: '#/components/schemas/Image' description: The main image of the item. metadata: $ref: '#/components/schemas/MetadataListItem' required: - metadata url: description: URL of a page where this product list was extracted. type: string example: https://example.com/products/ metadata: $ref: '#/components/schemas/MetadataList' categoryName: description: Name of the category in which the listed products are. type: string example: Sports & Outdoors required: - url - metadata ProductNavigation: type: object properties: categoryName: description: Name of the category in which the listed products are found. type: string example: Sports & Outdoors nextPage: $ref: '#/components/schemas/PaginationNext' pageNumber: $ref: '#/components/schemas/PageNumber' items: description: List of products available on this page. 
type: array items: type: object properties: url: description: | URL of a detailed product page. Pass this URL with "product: true" in the request to extract detailed information about the product. type: string example: https://example.com/products/1/ name: description: The name of the product or product link text. type: string example: Product name metadata: $ref: '#/components/schemas/MetadataListItem' required: - metadata - url subCategories: description: List of subcategory links found on this page. type: array items: type: object properties: url: description: | URL of the subcategory. type: string example: https://example.com/category/1/ name: description: The name of the subcategory or subcategory link text. type: string example: Category name metadata: $ref: '#/components/schemas/MetadataListItem' required: - metadata - url url: description: URL of a page. type: string example: https://example.com/products/ metadata: $ref: '#/components/schemas/MetadataList' required: - url - metadata
```

## Zyte API stats API

> ###### TIP
>
> For the reference documentation of the HTTP API of Zyte API itself,
> see zapi-reference.

The [Zyte dashboard](https://app.zyte.com) has a [Stats](https://app.zyte.com/o/stats/usage) page that lets you monitor different aspects of your Zyte API requests, including cost, response time, and features used.

Zyte API also offers an HTTP API to query statistics about your Zyte API requests.

### Authentication

All requests require [basic authentication](https://datatracker.ietf.org/doc/html/rfc7617#section-2), with your [Zyte dashboard API key](https://app.zyte.com/o/settings) (not your Zyte API key) as username, and no password.

For example, if your API key is `foo`, you base64-encode `foo:` as `Zm9vOg==` and send the `Authorization` header with value `Basic Zm9vOg==`.

```none
Authorization: Basic Zm9vOg==
```

![](zyte-api/usage/images/account_settings.png)

### Basic usage

The most basic request only requires an organization ID.
To find your organization ID, open the [Zyte dashboard](https://app.zyte.com) and copy your organization ID from the browser address bar. For example, if the URL is `https://app.zyte.com/o/000000`, `000000` is your organization ID.

```bash
curl \
    --user YOUR_STATS_API_KEY: \
    --compressed \
    "https://zyte-api-stats.zyte.com/api/stats?organization_id=000000"
```

```json
{
  "page": 1,
  "page_size": 500,
  "results": [
    {
      "cost_microusd_avg": "1335.10",
      "cost_microusd_p80": "2040.00",
      "cost_microusd_total": "584773.00",
      "organization_id": 000000,
      "request_count": 438,
      "response_time_sec_avg": "5.49",
      "response_time_sec_p80": "6.40",
      "status_codes": [
        { "code": null, "count": 3 },
        { "code": 200, "count": 432 },
        { "code": 404, "count": 3 }
      ]
    }
  ],
  "total_result_count": 1
}
```

### Rate limiting

The stats API has a rate limit of 20 requests per minute. Requests above that limit get a 429 response.

### Grafana dashboard

Follow the steps below to replicate the [Grafana](https://grafana.com/grafana/) dashboard shown below, which visualizes data from the stats API.

1. First, install the [Infinity](https://grafana.com/grafana/plugins/yesoreyeram-infinity-datasource/) plugin on your Grafana instance.
2. Add the newly installed data source into the [Data Sources](https://grafana.com/docs/plugins/yesoreyeram-infinity-datasource/latest/setup/configuration/) section and configure it to fetch data from [https://zyte-api-stats.zyte.com](https://zyte-api-stats.zyte.com) with your Stats dashboard API key.

   ![](zyte-api/usage/images/infinity_auth.png)![](zyte-api/usage/images/infinity_url.png)
3. [Import the dashboard](https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/import-dashboards/) from the file [stats_api_demo.json](https://docs.zyte.com/_static/stats_api_demo.json).
4. Paste your organization ID into the “organization_id” field as shown in the screenshot below.
![](zyte-api/usage/images/grafana_screenshot.png)

### Google Looker Studio dashboard

Follow the steps below to replicate the [Google Looker Studio](https://lookerstudio.google.com/) dashboard shown below, which visualizes data from the stats API.

1. First, connect to the [Zyte API Stats Connector](https://datastudio.google.com/datasources/create?connectorId=AKfycbyh_6V56h157XDWT97HQZEyc_cQlE7bdL633-9AfUWaOWwTkHJ-UyRTMvmUVzunKm_JYQ&authuser=0).
2. When asked for the API key, provide your Stats dashboard API key (not your Zyte API key).
3. Check all of the “Allow … to be modified in reports.” checkboxes.
4. Paste your organization ID into the “organization_id” parameter.
5. Click the “Connect”, “Allow”, “Create report” and “Create report” buttons.

![](zyte-api/usage/images/lookerstudio_screenshot.png)

### Reference

```yaml
components: schemas: HTTPError: properties: detail: type: object message: type: string type: object StatsResponse: properties: page: minimum: 1 type: integer page_size: maximum: 500 minimum: 1 type: integer results: items: $ref: '#/components/schemas/StatsResult' type: array total_result_count: minimum: 1 type: integer required: - page - page_size - total_result_count type: object StatsResult: properties: cost_microusd_avg: minimum: '0.00' type: number cost_microusd_p80: minimum: '0.00' type: number cost_microusd_total: minimum: '0.00' type: number day: format: date-time type: string domain: maxLength: 256 minLength: 0 type: string domain_health: description: "Domain health information. Returned only when `include_domain_health=true` and `groupby_domain=true`.\n\nIt aims to show detailed stats from your Top 100 most requested domains in the last 7 days. Domains not recently used or not within the Top 100 domains will have a `null` value. These stats are not real-time; they are calculated once every 3 hours."
type: object properties: global_avg_success_rate_24h: type: string global_avg_success_rate_7d: type: string my_avg_price_microusd_24h: type: string my_avg_price_microusd_7d: type: string my_avg_response_time_24h: type: string my_avg_response_time_7d: type: string my_requests_24h: type: integer my_requests_7d: type: integer my_success_rate_24h: type: string my_success_rate_7d: type: string status: type: string enum: - healthy - possible_misconfiguration - issue_under_investigation - possible_performance_issue total_spent_microusd_24h: type: string total_spent_microusd_7d: type: string total_successful_requests_24h: type: integer total_successful_requests_7d: type: integer hour: format: date-time type: string month: format: date-time type: string organization_id: type: integer request_count: minimum: 1 type: integer response_time_sec_avg: minimum: '0.00' type: number response_time_sec_p80: minimum: '0.00' type: number status_codes: items: additionalProperties: minimum: 0 nullable: true type: integer type: object type: array year: format: date-time type: string required: - cost_microusd_avg - cost_microusd_p80 - cost_microusd_total - organization_id - request_count - response_time_sec_avg - response_time_sec_p80 type: object ValidationError: properties: detail: properties: <location>: properties: <field_name>: items: type: string type: array type: object type: object message: type: string type: object securitySchemes: BasicAuth: scheme: basic type: http info: title: APIFlask version: 0.1.0 openapi: 3.0.3 paths: /api/stats: get: parameters: - in: query name: organization_id required: true schema: type: integer - in: query name: page required: false schema: default: 1 minimum: 1 type: integer - in: query name: page_size required: false schema: default: 500 maximum: 500 minimum: 1 type: integer - description: "The start date and time in\n[ISO 8601-1](https://en.wikipedia.org/wiki/ISO_8601) format (e.g. `2024-09-10T00:00:00Z`).\n\nIt defaults to 7 days in the past." 
in: query name: start_time required: false schema: format: date-time type: string - description: "The end date and time in\n[ISO 8601-1](https://en.wikipedia.org/wiki/ISO_8601) format (e.g. `2024-09-17T00:00:00Z`).\n\nIt defaults to the current date and time." in: query name: end_time required: false schema: format: date-time type: string - in: query name: domains required: false schema: maxLength: 64 minLength: 0 - in: query name: apikey_labels required: false schema: maxLength: 64 minLength: 0 - in: query name: response_codes required: false schema: maxLength: 64 minLength: 0 - in: query name: requested_features required: false schema: enum: - actions - browserHtml - fileDownload - httpResponseBody - networkCapture - screenshot - sessionContext - extendedGeolocation - in: query name: extraction_type required: false schema: enum: - article - articleList - articleNavigation - forumThread - jobPosting - jobPostingNavigation - pageContent - product - productList - productNavigation - serp - in: query name: extraction_from required: false schema: enum: - httpResponseBody - browserHtml - description: "Filter requests by\n[tags](/zyte-api/usage/reference.md).\n \nIt must be a comma-separated list of values, where each value can be:\n \n- A key-value pair separated by a colon, i.e. ``<tag>:<value>``, to include only\nrequests where the specified tag has the specified value.\n- A tag, to include only requests where the specified tag exists.\n\nOnly requests that match *all* the specified tag filters will be\nincluded in the results." in: query name: tags required: false schema: maxLength: 64 minLength: 0 - in: query name: groupby_time required: false schema: default: enum: - hour - day - month - year - nullable: true type: string - description: "Group results by domain.\n\nWhen set to `true`, the response will include a `domain` field for each result." 
in: query name: groupby_domain required: false schema: default: false type: boolean - description: "Include domain health information in the response.\n\nRequires `groupby_domain=true`. If `include_domain_health=true` is specified without `groupby_domain=true`, a validation error will be returned." in: query name: include_domain_health required: false schema: default: false type: boolean responses: '200': content: application/json: schema: $ref: '#/components/schemas/StatsResponse' description: Successful response '401': content: application/json: schema: $ref: '#/components/schemas/HTTPError' description: Authentication error '422': content: application/json: schema: $ref: '#/components/schemas/ValidationError' description: Validation error security: - BasicAuth: [] summary: Stats servers: - name: Production Server url: https://zyte-api-stats.zyte.com ``` ## Zyte Search API The **Search API** provides a typed interface for search engine queries. Send a keyword and a domain, get back structured organic results — no URL construction or HTML parsing needed on the client side. `POST` to `https://api.zyte.com/v1/search` with your [Zyte API key](https://app.zyte.com/o/zyte-api/api-access): ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data '{ "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"] }' \ https://api.zyte.com/v1/search ``` > ##### Quickstart > > Get your first search results in minutes. > > Get started > ##### Request parameters > > Full reference for `domain`, `query`, `include`, > `maxResults`, and `queryParameters`. > > Parameters > ##### Response schema > > Response fields including `organicResults`, `html`, > `fetchedAt`, and `meta`. > > Response > ##### Geo-targeting > > Target specific countries, languages, and search domains. > > Geo-targeting ## Quickstart This guide shows how to make your first Search API request and get structured organic results back. 
### Prerequisites You need a [Zyte API key](https://app.zyte.com/o/zyte-api/api-access). ### Basic request Send a `POST` request to `https://api.zyte.com/v1/search` with `domain` and `query`. Use `include` to control what you get back. #### curl input.json ```json { "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/search \ | jq .organicResults ``` #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"domain", "search.engine.com"}, {"query", "web scraping tools"}, {"include", new[] {"organic"}} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/search", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var organicResults = data.RootElement.GetProperty("organicResults").ToString(); Console.WriteLine(organicResults); ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonArray; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import 
java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.List; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters = ImmutableMap.of( "domain", "search.engine.com", "query", "web scraping tools", "include", List.of("organic")); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/search"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); JsonArray organicResults = jsonObject.get("organicResults").getAsJsonArray(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(organicResults)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### 
JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/search', { domain: 'search.engine.com', query: 'web scraping tools', include: ['organic'] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const organicResults = response.data.organicResults console.log(organicResults) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/search', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'domain' => 'search.engine.com', 'query' => 'web scraping tools', 'include' => ['organic'], ], ]); $data = json_decode($response->getBody()); $organicResults = json_encode($data->organicResults); echo $organicResults.PHP_EOL; ``` #### Python ```python import requests api_response = requests.post( "https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], }, ) organic_results = api_response.json()["organicResults"] print(organic_results) ``` #### Python client ```python import requests api_response = requests.post( "https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], }, ) organic_results = api_response.json()["organicResults"] print(organic_results) ``` The response contains a structured `organicResults` array: ```json { "status": "success", "url": "https://www.example-engine.com/search?q=web+scraping+tools", "fetchedAt": "2026-05-11T09:36:57Z", "meta": { "requestedAt": "2026-05-11T09:36:39Z" }, "organicResults": [ { "rank": 1, "title": "Zyte - Web Scraping API", "url": "https://www.zyte.com/", "snippet": "The leading web scraping platform...", "displayedUrl": "zyte.com" } ] } ``` ### Getting raw HTML Use `include: ["html"]` to get the raw rendered HTML instead of parsed results. 
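As a minimal Python sketch of an HTML-only request, the snippet below builds the request with the standard library and reads the `html` field from the response. The helper names `build_search_request` and `fetch_html` are hypothetical, introduced only for illustration; the endpoint, the basic authentication scheme, and the `html` response field are taken from the examples on this page.

```python
import base64
import json
import urllib.request

SEARCH_URL = "https://api.zyte.com/v1/search"


def build_search_request(api_key, domain, query, include):
    """Build (without sending) a POST request for the Search API.

    Hypothetical helper, not part of any Zyte library.
    """
    body = json.dumps(
        {"domain": domain, "query": query, "include": include}
    ).encode("utf-8")
    # Basic authentication: API key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_html(api_key, domain, query):
    """Send an HTML-only request and return the raw HTML string."""
    request = build_search_request(api_key, domain, query, ["html"])
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["html"]


# Example call (performs a real, billable request):
# html = fetch_html("YOUR_ZYTE_API_KEY", "search.engine.com", "web scraping tools")
```

Keeping request construction separate from sending makes it possible to inspect the body and headers without making a network call.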
You can also request both at once: ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data '{ "domain": "search.engine.com", "query": "web scraping tools", "include": ["html", "organic"] }' \ https://api.zyte.com/v1/search ``` ### More results Set `maxResults` to get up to 100 results in a single call. The platform fetches multiple pages automatically and returns them in one `organicResults` array: #### curl input.json ```json { "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], "maxResults": 100 } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/search \ | jq .organicResults ``` #### C# ```cs var input = new Dictionary<string, object>(){ {"domain", "search.engine.com"}, {"query", "web scraping tools"}, {"include", new[] {"organic"}}, {"maxResults", 100} }; ``` #### Java ```java Map<String, Object> parameters = ImmutableMap.of( "domain", "search.engine.com", "query", "web scraping tools", "include", List.of("organic"), "maxResults", 100); ``` #### JS ```js axios.post( 'https://api.zyte.com/v1/search', { domain: 'search.engine.com', query: 'web scraping tools', include: ['organic'], maxResults: 100 }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { console.log(response.data.organicResults) }) ``` #### PHP ```php $response = $client->request('POST', 'https://api.zyte.com/v1/search', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'json' => [ 'domain' => 'search.engine.com', 'query' => 'web scraping tools', 'include' => ['organic'], 'maxResults' => 100, ], ]); ``` #### Python ```python api_response = requests.post( "https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], "maxResults": 100, }, ) organic_results = api_response.json()["organicResults"] ``` #### Python client ```python 
api_response = requests.post( "https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], "maxResults": 100, }, ) organic_results = api_response.json()["organicResults"] ``` ### Geo-targeting Pass `queryParameters` to target a specific country and language: #### curl input.json ```json { "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], "queryParameters": { "style": "engineSpecific", "gl": "us", "hl": "en" } } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/search \ | jq .organicResults ``` #### C# ```cs var input = new Dictionary<string, object>(){ {"domain", "search.engine.com"}, {"query", "web scraping tools"}, {"include", new[] {"organic"}}, {"queryParameters", new Dictionary<string, object>(){ {"style", "engineSpecific"}, {"gl", "us"}, {"hl", "en"} }} }; ``` #### Java ```java Map<String, Object> parameters = ImmutableMap.of( "domain", "search.engine.com", "query", "web scraping tools", "include", List.of("organic"), "queryParameters", ImmutableMap.of( "style", "engineSpecific", "gl", "us", "hl", "en")); ``` #### JS ```js axios.post( 'https://api.zyte.com/v1/search', { domain: 'search.engine.com', query: 'web scraping tools', include: ['organic'], queryParameters: { style: 'engineSpecific', gl: 'us', hl: 'en' } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { console.log(response.data.organicResults) }) ``` #### PHP ```php $response = $client->request('POST', 'https://api.zyte.com/v1/search', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'json' => [ 'domain' => 'search.engine.com', 'query' => 'web scraping tools', 'include' => ['organic'], 'queryParameters' => [ 'style' => 'engineSpecific', 'gl' => 'us', 'hl' => 'en', ], ], ]); ``` #### Python ```python api_response = requests.post( 
"https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], "queryParameters": { "style": "engineSpecific", "gl": "us", "hl": "en", }, }, ) organic_results = api_response.json()["organicResults"] ``` #### Python client ```python api_response = requests.post( "https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["organic"], "queryParameters": { "style": "engineSpecific", "gl": "us", "hl": "en", }, }, ) organic_results = api_response.json()["organicResults"] ``` For city-level targeting, add a `uule` value. Generate one from a city name using the `uule_grabber` Python library: ```python import uule_grabber uule_grabber.uule("Chicago, USA") # w+CAIQ... ``` ### AI Overview Add `"aiOverview"` to `include` to trigger full browser rendering. The AI Overview block will be present in the raw `html` field: #### curl input.json ```json { "domain": "search.engine.com", "query": "web scraping tools", "include": ["aiOverview", "organic", "html"] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/search ``` #### C# ```cs var input = new Dictionary<string, object>(){ {"domain", "search.engine.com"}, {"query", "web scraping tools"}, {"include", new[] {"aiOverview", "organic", "html"}} }; ``` #### Java ```java Map<String, Object> parameters = ImmutableMap.of( "domain", "search.engine.com", "query", "web scraping tools", "include", List.of("aiOverview", "organic", "html")); ``` #### JS ```js axios.post( 'https://api.zyte.com/v1/search', { domain: 'search.engine.com', query: 'web scraping tools', include: ['aiOverview', 'organic', 'html'] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const { organicResults, html } = response.data console.log(organicResults) }) ``` #### 
PHP ```php $response = $client->request('POST', 'https://api.zyte.com/v1/search', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'json' => [ 'domain' => 'search.engine.com', 'query' => 'web scraping tools', 'include' => ['aiOverview', 'organic', 'html'], ], ]); ``` #### Python ```python api_response = requests.post( "https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["aiOverview", "organic", "html"], }, ) data = api_response.json() organic_results = data["organicResults"] html = data["html"] # parse AI Overview from here ``` #### Python client ```python api_response = requests.post( "https://api.zyte.com/v1/search", auth=("YOUR_ZYTE_API_KEY", ""), json={ "domain": "search.engine.com", "query": "web scraping tools", "include": ["aiOverview", "organic", "html"], }, ) data = api_response.json() organic_results = data["organicResults"] html = data["html"] # parse AI Overview from here ``` > ###### NOTE > > Parsed `aiOverview` extraction is coming in a future release. For now, > the AI Overview block is available in the raw `html` field. ### Next steps - request — full parameter reference - response — response field details - geo — geo-targeting by country, language, and domain ## Request parameters Send a `POST` request to `https://api.zyte.com/v1/search`. | Field | Type | Required | Description | |-------------------|----------|------------|--------------------------------------------------------------------------------------------------------------------------| | `domain` | string | Yes | A supported search domain, e.g. `google.com`, `google.co.uk`, `google.de`. Unsupported domains return a 400 error. | | `query` | string | Yes | The search keywords (1-2048 characters). URL-encoded automatically. | | `include` | string[] | No | What to return: `"html"` (raw HTML), `"organic"` (parsed results), `"aiOverview"` (coming soon). Defaults to `["html"]`. 
| | `maxResults` | integer | No | Number of organic results: 10, 20, 30 … 100. Default: `10`. Request weight = `max(1, maxResults / 10)`. | | `queryParameters` | object | No | Additional engine-specific query parameters. See below. | ### queryParameters Controls geo-targeting and other search modifiers. Two styles are supported via the `style` field. #### Engine-specific style Pass engine-native parameters directly. Set `style` to `"engineSpecific"`. ```json { "domain": "search.engine.com", "query": "web scraping tools", "queryParameters": { "style": "engineSpecific", "gl": "us", "hl": "en" } } ``` Supported engine-specific fields: | Field | Example | Purpose | |---------|---------------|-------------------------------------------| | `gl` | `"us"` | Geographic limit (country) | | `hl` | `"en"` | Interface language | | `cr` | `"countryUS"` | Country restrict | | `lr` | `"lang_en"` | Language restrict | | `safe` | `"active"` | SafeSearch (`"active"` or `"off"`) | | `nfpr` | `1` | Disable autocorrect | | `uule` | `"w+CAI..."` | Encoded location for city-level targeting | #### Generic style A portable, engine-agnostic interface. Set `style` to `"generic"`. ```json { "domain": "search.engine.com", "query": "web scraping tools", "queryParameters": { "style": "generic", "geolocation": "US", "locale": "en-US" } } ``` | Field | Example | Maps to | |---------------|-----------|------------| | `geolocation` | `"US"` | `gl=US` | | `locale` | `"en-US"` | `hl=en-US` | ### Error responses | HTTP | Type | When | |--------|---------------------------------|-----------------------------------| | 400 | `/request/domain-not-supported` | Domain is not supported | | 400 | `/request/max-results-invalid` | `maxResults` is not a valid value | | 401 | `/auth/key-not-found` | Missing or invalid API key | | 429 | `/limits/over-search-limit` | Rate limit exceeded | ## Response schema All responses are JSON. The fields returned depend on what you request via `include`. 
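Because `html` and `organicResults` are only returned when requested, client code should not assume either key exists. The following is a minimal sketch, not an official client: `summarize_search_response` is a hypothetical helper name, and the sample dictionary is made-up data shaped like the fields documented in this section.

```python
from typing import Any


def summarize_search_response(data: dict[str, Any]) -> dict[str, Any]:
    """Summarize a parsed response without assuming optional fields exist."""
    # "organicResults" is only present when "organic" was in `include`.
    organic = data.get("organicResults")
    # "html" is only present when "html" was in `include`.
    html = data.get("html")
    return {
        "status": data.get("status"),
        "has_html": html is not None,
        "organic_count": len(organic) if organic is not None else None,
    }


# Made-up sample: a response to a request that included only "organic".
sample = {
    "status": "success",
    "url": "https://www.example-engine.com/search?q=web+scraping+tools",
    "organicResults": [
        {"rank": 1, "title": "Zyte - Web Scraping API", "url": "https://www.zyte.com/"},
    ],
}

summary = summarize_search_response(sample)
print(summary)
```

Using `dict.get` rather than direct indexing keeps the same code path usable whether a request asked for `html`, `organic`, or both.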
### Top-level fields | Field | Always present | Description | |------------------|------------------|----------------------------------------------------------------------------------------------------------------------| | `status` | Yes | `"success"` on a successful response. | | `url` | Yes | The search URL that was fetched. | | `fetchedAt` | Yes | ISO-8601 timestamp of when the page was fetched. | | `meta` | Yes | Request metadata. Includes `requestedAt`, and geo fields if `queryParameters` were passed (`geolocation`, `locale`). | | `html` | When requested | Raw rendered HTML of the SERP page. Returned when `"html"` is in `include`. | | `organicResults` | When requested | Array of parsed organic results. Returned when `"organic"` is in `include`. | ### organicResults Each item in the `organicResults` array has the following fields: | Field | Type | Always present | Description | |----------------|---------|------------------|-------------------------------------------------------| | `rank` | integer | Yes | Position in results, starting at 1. | | `title` | string | Yes | Result title as shown on the SERP. | | `url` | string | Yes | Link to the result page. | | `snippet` | string | No | Description text shown on the SERP. | | `displayedUrl` | string | No | URL as displayed on the SERP (may differ from `url`). 
| ### Example response ```json { "status": "success", "url": "https://www.example-engine.com/search?q=web+scraping+tools", "fetchedAt": "2026-05-11T09:36:57Z", "meta": { "requestedAt": "2026-05-11T09:36:39Z", "geolocation": "US", "locale": "en-US" }, "organicResults": [ { "rank": 1, "title": "Zyte - Web Scraping API", "url": "https://www.zyte.com/", "snippet": "The leading web scraping platform...", "displayedUrl": "zyte.com" }, { "rank": 2, "title": "ScraperAPI", "url": "https://www.scraperapi.com/", "snippet": "Scale data collection with a simple API.", "displayedUrl": "scraperapi.com" } ] } ``` ## Geo-targeting The Search API supports two levels of geo-targeting: country/language via `queryParameters`, and domain selection via `domain`. ### Country and language Pass `queryParameters` to control the country and language of results. #### Engine-specific (recommended) Pass engine-native parameters using `style: "engineSpecific"`: ```json { "domain": "search.engine.com", "query": "pizza restaurants", "queryParameters": { "style": "engineSpecific", "gl": "us", "hl": "en" } } ``` #### Generic Use the portable `geolocation` and `locale` fields: ```json { "domain": "search.engine.com", "query": "pizza restaurants", "queryParameters": { "style": "generic", "geolocation": "US", "locale": "en-US" } } ``` ### Targeting a regional domain To get results from a country-specific domain, change the `domain` field. The platform fetches from that domain directly: ```json { "domain": "search.engine.com", "query": "pizza restaurants", "queryParameters": { "style": "engineSpecific", "gl": "de", "hl": "de" } } ``` Supported domains include regional variants such as `google.com`, `google.co.uk`, `google.de`, `google.fr`, `google.co.jp`, `google.com.br`. Unsupported domains return a 400 error. 
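As a rough illustration of switching the `domain` field per country, the sketch below builds request bodies from a small lookup table. The `build_regional_payload` helper and the `REGIONAL_TARGETS` table are hypothetical names introduced here; the domains come from the list of regional variants above, while pairing each with matching `gl` and `hl` values is an assumption you should adjust to your needs.

```python
import json

# Domains are taken from the supported regional variants listed above.
# The gl/hl pairings are illustrative assumptions, not documented defaults.
REGIONAL_TARGETS = {
    "de": {"domain": "google.de", "gl": "de", "hl": "de"},
    "fr": {"domain": "google.fr", "gl": "fr", "hl": "fr"},
}


def build_regional_payload(query: str, country: str) -> dict:
    """Build a request body (hypothetical helper) for a regional domain."""
    target = REGIONAL_TARGETS[country]
    return {
        "domain": target["domain"],
        "query": query,
        "include": ["organic"],
        "queryParameters": {
            "style": "engineSpecific",
            "gl": target["gl"],
            "hl": target["hl"],
        },
    }


payload = build_regional_payload("pizza restaurants", "de")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a `POST` request to `https://api.zyte.com/v1/search`, in the same way as the Python examples earlier in this document.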
### City-level targeting For city-level precision, pass a `uule` value in `queryParameters`: ```json { "domain": "search.engine.com", "query": "pizza restaurants", "queryParameters": { "style": "engineSpecific", "uule": "w+CAIQICINY2hpY2Fnbywgsuited" } } ``` > ###### NOTE > > The `uule` parameter uses the canonical `w+` format. Generate a > value from a city name using the `uule_grabber` Python library: > > ```python > import uule_grabber > uule_grabber.uule("Chicago, USA") # w+CAIQ... > ``` ## Code examples The Zyte API documentation features code examples for many different technologies. You can find those examples at the end of relevant topics in pages like zapi-http, zapi-browser, zapi-extract or zapi-shared-features, or find them all below. > ###### TIP > > The right-hand sidebar of the Zyte API reference contains additional examples of Zyte API parameters. ### Requirements Select a technology tab below to learn how to install and configure the requirements to run code examples for that technology: #### C# **C#** code examples use C# 9.0. To run **C#** code examples, install: - [.NET SDK](https://dotnet.microsoft.com/en-us/download) 5.x or later - [Html Agility Pack](https://www.nuget.org/packages/HtmlAgilityPack/), for HTML parsing #### CLI client **CLI client** code examples feature the command-line interface of [python-zyte-api](http://python-zyte-api.readthedocs.io/), the official Python client of Zyte API, along with other command-line tools. To run **CLI client** code examples, install: - [python-zyte-api](http://python-zyte-api.readthedocs.io/), for requests. Requires [installing Python](https://wiki.python.org/moin/BeginnersGuide/Download) first. - [jq](https://stedolan.github.io/jq/download/), for JSON parsing. - [base64](https://www.gnu.org/software/coreutils/manual/html_node/base64-invocation.html#base64-invocation), for base64 encoding and decoding. 
- On **Windows**, you can [use chocolatey to install GNU Core Utilities](https://community.chocolatey.org/packages/gnuwin32-coreutils.install), which includes a `base64` command-line application. - **macOS** comes with a `base64` command-line application pre-installed. - Most **Linux** distributions come with [GNU Core Utilities](https://www.gnu.org/software/coreutils/) pre-installed, or make it easy to install. GNU Core Utilities includes a `base64` command-line application. - [xmllint](https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home), for HTML parsing. - On **Windows**, [install libxml2](https://www.zlatkovic.com/projects/libxml/index.html), which provides `xmllint`. - **macOS** comes with `xmllint` pre-installed. - Most **Linux** distributions make it easy to install [libxml2](https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/README.md), which provides `xmllint`. - [xargs](https://www.gnu.org/software/findutils/manual/html_node/find_html/Invoking-xargs.html), for parallelization. - On **Windows**, you can [use chocolatey to install GNU findutils](https://community.chocolatey.org/packages/findutils), which includes a `xargs` command-line application. - **macOS** comes with a `xargs` command-line application pre-installed. - Most **Linux** distributions come with [GNU findutils](https://www.gnu.org/software/findutils/) pre-installed, or make it easy to install. GNU findutils includes a `xargs` command-line application. #### curl **curl** code examples feature [curl](https://everything.curl.dev/) and other command-line tools. To run **curl** code examples, install: - [curl](https://everything.curl.dev/), for requests. > ###### NOTE > > curl comes pre-installed in many operating systems. - [jq](https://stedolan.github.io/jq/download/), for JSON parsing. - [base64](https://www.gnu.org/software/coreutils/manual/html_node/base64-invocation.html#base64-invocation), for base64 encoding and decoding. 
- On **Windows**, you can [use chocolatey to install GNU Core Utilities](https://community.chocolatey.org/packages/gnuwin32-coreutils.install), which includes a `base64` command-line application. - **macOS** comes with a `base64` command-line application pre-installed. - Most **Linux** distributions come with [GNU Core Utilities](https://www.gnu.org/software/coreutils/) pre-installed, or make it easy to install. GNU Core Utilities includes a `base64` command-line application. - [xmllint](https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home), for HTML parsing. - On **Windows**, [install libxml2](https://www.zlatkovic.com/projects/libxml/index.html), which provides `xmllint`. - **macOS** comes with `xmllint` pre-installed. - Most **Linux** distributions make it easy to install [libxml2](https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/README.md), which provides `xmllint`. - [xargs](https://www.gnu.org/software/findutils/manual/html_node/find_html/Invoking-xargs.html), for parallelization. - On **Windows**, you can [use chocolatey to install GNU findutils](https://community.chocolatey.org/packages/findutils), which includes a `xargs` command-line application. - **macOS** comes with a `xargs` command-line application pre-installed. - Most **Linux** distributions come with [GNU findutils](https://www.gnu.org/software/findutils/) pre-installed, or make it easy to install. GNU findutils includes a `xargs` command-line application. #### Java **Java** code examples use Java SE 8. To run **Java** code examples, install: - [JDK 8u202](https://www.oracle.com/en_us/java/technologies/javase/javase8-archive-downloads.html) or [later](https://www.oracle.com/en_us/java/technologies/downloads/). - [HttpClient 5.1 from Apache HttpComponents](https://hc.apache.org/httpcomponents-client-5.1.x/) - [Gson 2.9.1](https://github.com/google/gson). #### JS **JS** code examples use JavaScript. To run **JS** code examples, install: - [Node.js](https://nodejs.org/en/). 
- [axios](https://github.com/axios/axios), for requests. - [cheerio](https://github.com/cheeriojs/cheerio), for HTML parsing. - [https-proxy-agent](https://www.npmjs.com/package/https-proxy-agent), for proxy mode. #### PHP **PHP** code examples use PHP 7.4. To run **PHP** code examples, install: - [PHP](https://www.php.net/manual/en/install.php) 7.4. - [Guzzle](https://docs.guzzlephp.org/en/stable/index.html), for requests. - The `dom` [extension](https://www.php.net/manual/en/install.pecl.php), for HTML parsing. #### Proxy mode **Proxy mode** code examples use curl with Zyte API as a proxy. See the **curl** tab for code example requirement details. See zapi-proxy to learn how to use Zyte API as a proxy with other technologies. #### Python **Python** code examples use Python 3. To run **Python** code examples, install: - [Python](https://wiki.python.org/moin/BeginnersGuide/Download). - [Requests](https://docs.python-requests.org/), for single requests. - [aiohttp](https://docs.aiohttp.org/en/stable/index.html), for concurrent requests. - [Parsel](https://parsel.readthedocs.io/en/latest/), for HTML parsing. #### Python client **Python client** code examples feature the asyncio API of [python-zyte-api](http://python-zyte-api.readthedocs.io/), the official Python client of Zyte API. To run **Python client** code examples, install: - [Python](https://wiki.python.org/moin/BeginnersGuide/Download). - [python-zyte-api](http://python-zyte-api.readthedocs.io/), for requests. - [Parsel](https://parsel.readthedocs.io/en/latest/) for HTML parsing. #### Ruby **Ruby** code examples use [Ruby 3.x](https://www.ruby-lang.org/en/). #### Scrapy **Scrapy** code examples feature [Scrapy](https://docs.scrapy.org/en/latest/) with the scrapy-zyte-api plugin configured in transparent mode. To run **Scrapy** code examples, install: - [Scrapy](https://docs.scrapy.org/en/latest/). - scrapy-zyte-api. After installing scrapy-zyte-api, you must also configure it in your Scrapy project. 
If you configure it enabling its components separately instead of enabling the add-on, you also need to set `ZYTE_API_TRANSPARENT_MODE` to `True`. > ###### TIP > > The web scraping tutorial covers installing > and configuring the requirements for **Scrapy** code examples. ### All examples #### Running the `scrollBottom` action > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://quotes.toscrape.com/scroll"}, {"browserHtml", true}, { "actions", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"action", "scrollBottom"} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var quoteCount = (double)navigator.Evaluate("count(//*[@class=\"quote\"])"); ``` #### CLI client input.jsonl ```json {"url": 
"https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{"action": "scrollBottom"}]} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .browserHtml \ | xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null ``` #### curl input.json ```json { "url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [ { "action": "scrollBottom" } ] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml \ | xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> action = ImmutableMap.of("action", "scrollBottom"); Map<String, Object> parameters = ImmutableMap.of( "url", "https://quotes.toscrape.com/scroll", "browserHtml", true, "actions", Collections.singletonList(action)); String requestBody = new Gson().toJson(parameters); HttpPost request = new 
HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); Document document = Jsoup.parse(browserHtml); int quoteCount = document.select(".quote").size(); System.out.println(quoteCount); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://quotes.toscrape.com/scroll', browserHtml: true, actions: [ { action: 'scrollBottom' } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const browserHtml = response.data.browserHtml const $ = cheerio.load(browserHtml) const quoteCount = $('.quote').length }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://quotes.toscrape.com/scroll', 'browserHtml' => true, 'actions' => [ ['action' => 'scrollBottom'], ], ], ]); $data = json_decode($response->getBody()); $doc = new DOMDocument(); $doc->loadHTML($data->browserHtml); $xpath = new DOMXPath($doc); $quote_count = $xpath->query("//*[@class='quote']")->count(); ``` #### Python ```python import 
requests from parsel import Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, ) browser_html = api_response.json()["browserHtml"] quote_count = len(Selector(browser_html).css(".quote")) ``` #### Python client ```python import asyncio from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, ) browser_html = api_response["browserHtml"] quote_count = len(Selector(browser_html).css(".quote")) print(quote_count) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" async def start(self): yield Request( "https://quotes.toscrape.com/scroll", meta={ "zyte_api_automap": { "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, }, ) def parse(self, response): quote_count = len(response.css(".quote")) ``` Output: ```none 100 ``` #### Getting an HTTP response body > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://toscrape.com"}, {"httpResponseBody", true} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); ``` #### CLI client input.jsonl ```json {"url": "https://toscrape.com", "httpResponseBody": true} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ > output.html ``` #### curl input.json ```json { "url": "https://toscrape.com", "httpResponseBody": true } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ > output.html ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import 
java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters = ImmutableMap.of("url", "https://toscrape.com", "httpResponseBody", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = 
require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com', httpResponseBody: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://toscrape.com', 'httpResponseBody' => true, ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); ``` #### Proxy mode With the proxy mode, you always get a response body. ```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ https://toscrape.com \ > output.html ``` #### Python ```python from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com", "httpResponseBody": True, }, ) http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"]) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://toscrape.com", "httpResponseBody": True, } ) http_response_body = b64decode(api_response["httpResponseBody"]).decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy In transparent mode, when you target a text resource (e.g. 
HTML, JSON), regular Scrapy requests work out of the box: ```python from scrapy import Spider class ToScrapeSpider(Spider): name = "toscrape_com" start_urls = ["https://toscrape.com"] def parse(self, response): http_response_text: str = response.text ``` While regular Scrapy requests also work for binary responses at the moment, they may stop working in future versions of scrapy-zyte-api, so passing [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody) is recommended when targeting binary resources: ```python from scrapy import Request, Spider class ToScrapeSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={ "zyte_api_automap": { "httpResponseBody": True, }, }, ) def parse(self, response): http_response_body: bytes = response.body ``` Output (first 5 lines): ```html <!DOCTYPE html> <html lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <title>Scraping Sandbox</title> ``` #### Setting a `Referer` header in a browser request > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/anything"}, {"browserHtml", true}, { "requestHeaders", new Dictionary<string, object>() { {"referer", "https://example.org/"} } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator.Evaluate("//text()"); nodeIterator.MoveNext(); var responseJson = nodeIterator.Current.ToString(); var responseData = JsonDocument.Parse(responseJson); var headerEnumerator = responseData.RootElement.GetProperty("headers").EnumerateObject(); var headers = new Dictionary<string, string>(); while (headerEnumerator.MoveNext()) { headers.Add( headerEnumerator.Current.Name.ToString(), headerEnumerator.Current.Value.ToString() ); } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/anything", "browserHtml": true, "requestHeaders": {"referer": "https://example.org/"}} ``` ```shell zyte-api 
input.jsonl \ | jq --raw-output .browserHtml \ | xmllint --html --xpath '//text()' - 2> /dev/null \ | jq .headers ``` #### curl input.json ```json { "url": "https://httpbin.org/anything", "browserHtml": true, "requestHeaders": { "referer": "https://example.org/" } } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml \ | xmllint --html --xpath '//text()' - 2> /dev/null \ | jq .headers ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map requestHeaders = ImmutableMap.of("referer", "https://example.org/"); Map parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "browserHtml", true, "requestHeaders", requestHeaders); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); 
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String browserHtml = jsonObject.get("browserHtml").getAsString();
          Document document = Jsoup.parse(browserHtml);
          JsonObject data = JsonParser.parseString(document.text()).getAsJsonObject();
          JsonObject headers = data.get("headers").getAsJsonObject();
          Gson gson = new GsonBuilder().setPrettyPrinting().create();
          System.out.println(gson.toJson(headers));
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')
const cheerio = require('cheerio')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://httpbin.org/anything',
    browserHtml: true,
    requestHeaders: {
      referer: 'https://example.org/'
    }
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const $ = cheerio.load(response.data.browserHtml)
  const data = JSON.parse($.text())
  const headers = data.headers
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://httpbin.org/anything',
        'browserHtml' => true,
        'requestHeaders' => [
            'referer' => 'https://example.org/',
        ],
    ],
]);
$api = json_decode($response->getBody());
$doc = new DOMDocument();
$doc->loadHTML($api->browserHtml);
$data = json_decode($doc->textContent);
$headers = $data->headers;
```

#### Python

```python
import json

import requests
from parsel import
Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/anything", "browserHtml": True, "requestHeaders": { "referer": "https://example.org/", }, }, ) browser_html = api_response.json()["browserHtml"] selector = Selector(browser_html) response_json = selector.xpath("//text()").get() response_data = json.loads(response_json) headers = response_data["headers"] ``` #### Python client ```python import asyncio import json from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/anything", "browserHtml": True, "requestHeaders": { "referer": "https://example.org/", }, } ) browser_html = api_response["browserHtml"] selector = Selector(browser_html) response_json = selector.xpath("//text()").get() response_data = json.loads(response_json) print(json.dumps(response_data["headers"], indent=2)) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/anything", headers={"Referer": "https://example.org/"}, meta={ "zyte_api_automap": { "browserHtml": True, }, }, ) def parse(self, response): response_json = response.xpath("//text()").get() response_data = json.loads(response_json) headers = response_data["headers"] ``` Output (`"Referer"` line): ```json "Referer": "https://example.org/", ``` #### Getting browser HTML > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary(){ {"url", "https://toscrape.com"}, {"browserHtml", true} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); ``` #### CLI client input.jsonl ```json {"url": "https://toscrape.com", "browserHtml": true} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .browserHtml ``` #### curl input.json ```json { "url": "https://toscrape.com", "browserHtml": true } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import 
org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of("url", "https://toscrape.com", "browserHtml", true);
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String browserHtml = jsonObject.get("browserHtml").getAsString();
          System.out.println(browserHtml);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://toscrape.com',
    browserHtml: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const browserHtml = response.data.browserHtml
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
'json' => [ 'url' => 'https://toscrape.com', 'browserHtml' => true, ], ]); $api = json_decode($response->getBody()); $browser_html = $api->browserHtml; ``` #### Proxy mode ```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -H "Zyte-Browser-Html: true" \ https://toscrape.com ``` #### Python ```python import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com", "browserHtml": True, }, ) browser_html: str = api_response.json()["browserHtml"] ``` #### Python client ```python import asyncio from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://toscrape.com", "browserHtml": True, } ) print(api_response["browserHtml"]) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class ToScrapeSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={ "zyte_api_automap": { "browserHtml": True, }, }, ) def parse(self, response): browser_html: str = response.text ``` Output (first 5 lines): ```html Scraping Sandbox ``` #### Reusing browser cookies on HTTP requests Send a browser request to the home page of a website, and use its response cookies as request cookies in an HTTP request to a different URL of that website. 
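The hand-off between the two requests can be sketched without any network access: the `responseCookies` value of the browser response becomes the `requestCookies` value of the HTTP request. In this Python sketch, the `build_http_request` helper and the sample cookie are hypothetical and only illustrate the shape of the data.

```python
def build_http_request(browser_api_response: dict, url: str) -> dict:
    """Build Zyte API parameters for an HTTP request that reuses browser cookies."""
    return {
        "url": url,
        "httpResponseBody": True,
        # Cookies returned by the browser request are passed on unchanged.
        "requestCookies": browser_api_response["responseCookies"],
    }


# Hypothetical, trimmed-down browser response; real responses carry more fields.
sample_browser_response = {
    "url": "https://toscrape.com/",
    "browserHtml": "<html>...</html>",
    "responseCookies": [
        {"name": "session", "value": "abc123", "domain": "toscrape.com"},
    ],
}

http_request = build_http_request(
    sample_browser_response, "https://books.toscrape.com/"
)
print(http_request["requestCookies"][0]["name"])
```

Keeping the payload construction in a pure function like this makes the cookie hand-off easy to unit-test separately from the code that sends the requests.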
#### C#

```cs
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

// Browser request to the home page, asking for its response cookies.
var browserInput = new Dictionary<string, object>(){
    {"url", "https://toscrape.com/"},
    {"browserHtml", true},
    {"responseCookies", true}
};
var browserInputJson = JsonSerializer.Serialize(browserInput);
var browserContent = new StringContent(browserInputJson, Encoding.UTF8, "application/json");

HttpResponseMessage browserResponse = await client.PostAsync("https://api.zyte.com/v1/extract", browserContent);
var browserResponseBody = await browserResponse.Content.ReadAsByteArrayAsync();
var browserData = JsonDocument.Parse(browserResponseBody);

// HTTP request to a different URL of the same website, reusing those cookies.
var httpInput = new Dictionary<string, object>(){
    {"url", "https://books.toscrape.com/"},
    {"httpResponseBody", true},
    {"requestCookies", browserData.RootElement.GetProperty("responseCookies")}
};
var httpInputJson = JsonSerializer.Serialize(httpInput);
var httpContent = new StringContent(httpInputJson, Encoding.UTF8, "application/json");

HttpResponseMessage httpResponse = await client.PostAsync("https://api.zyte.com/v1/extract", httpContent);
var httpResponseBody = await httpResponse.Content.ReadAsByteArrayAsync();
var httpData = JsonDocument.Parse(httpResponseBody);

var base64HttpResponseBodyField = httpData.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBodyField = System.Convert.FromBase64String(base64HttpResponseBodyField);
var result = System.Text.Encoding.UTF8.GetString(httpResponseBodyField);

Console.WriteLine(result);
```
#### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map browserParameters = ImmutableMap.of( "url", "https://toscrape.com/", "browserHtml", true, "responseCookies", true); String browserRequestBody = new Gson().toJson(browserParameters); HttpPost browserRequest = new HttpPost("https://api.zyte.com/v1/extract"); browserRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); browserRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); browserRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); browserRequest.setEntity(new StringEntity(browserRequestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( browserRequest, browserResponse -> { HttpEntity browserEntity = browserResponse.getEntity(); String browserApiResponse = EntityUtils.toString(browserEntity, StandardCharsets.UTF_8); JsonObject browserJsonObject = JsonParser.parseString(browserApiResponse).getAsJsonObject(); Map httpParameters = ImmutableMap.of( "url", "https://books.toscrape.com/", "httpResponseBody", true, "requestCookies", 
                  browserJsonObject.get("responseCookies"));
          String httpRequestBody = new Gson().toJson(httpParameters);

          HttpPost httpRequest = new HttpPost("https://api.zyte.com/v1/extract");
          httpRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
          httpRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
          httpRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
          httpRequest.setEntity(new StringEntity(httpRequestBody));

          client.execute(
              httpRequest,
              httpResponse -> {
                HttpEntity httpEntity = httpResponse.getEntity();
                String httpApiResponse = EntityUtils.toString(httpEntity, StandardCharsets.UTF_8);
                JsonObject httpJsonObject = JsonParser.parseString(httpApiResponse).getAsJsonObject();
                String base64HttpResponseBody = httpJsonObject.get("httpResponseBody").getAsString();
                byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
                String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
                System.out.println(httpResponseBody);
                return null;
              });
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://toscrape.com/',
    browserHtml: true,
    responseCookies: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((browserResponse) => {
  axios.post(
    'https://api.zyte.com/v1/extract',
    {
      url: 'https://books.toscrape.com/',
      httpResponseBody: true,
      requestCookies: browserResponse.data.responseCookies
    },
    {
      auth: { username: 'YOUR_ZYTE_API_KEY' }
    }
  ).then((httpResponse) => {
    const httpResponseBody = Buffer.from(
      httpResponse.data.httpResponseBody,
      'base64'
    )
    console.log(httpResponseBody.toString())
  })
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$browser_response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json'
=> [ 'url' => 'https://toscrape.com/', 'browserHtml' => true, 'responseCookies' => true, ], ]); $browser_data = json_decode($browser_response->getBody()); $http_response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://books.toscrape.com/', 'httpResponseBody' => true, 'requestCookies' => $browser_data->responseCookies, ], ]); $http_data = json_decode($http_response->getBody()); $http_response_body = base64_decode($http_data->httpResponseBody); echo $http_response_body; ``` #### Python ```python from base64 import b64decode import requests browser_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com/", "browserHtml": True, "responseCookies": True, }, ) http_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://books.toscrape.com/", "httpResponseBody": True, "requestCookies": browser_response.json()["responseCookies"], }, ) http_response_body = b64decode(http_response.json()["httpResponseBody"]) print(http_response_body.decode()) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() browser_response = await client.get( { "url": "https://toscrape.com/", "browserHtml": True, "responseCookies": True, } ) http_response = await client.get( { "url": "https://books.toscrape.com/", "httpResponseBody": True, "requestCookies": browser_response["responseCookies"], } ) http_response_body = b64decode(http_response["httpResponseBody"]).decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class ToScrapeComSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com/", callback=self.parse_browser, meta={ "zyte_api_automap": { 
"browserHtml": True, "responseCookies": True, }, }, ) def parse_browser(self, response): yield response.follow( "https://books.toscrape.com/", callback=self.parse_http, meta={ "zyte_api_automap": { "requestCookies": response.raw_api_response["responseCookies"], }, }, ) def parse_http(self, response): print(response.text) ``` #### Setting a geolocation > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary(){ {"url", "http://ip-api.com/json"}, {"httpResponseBody", true}, {"geolocation", "AU"} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var countryCode = responseData.RootElement.GetProperty("countryCode").ToString(); ``` #### CLI client input.jsonl ```json {"url": "http://ip-api.com/json", "httpResponseBody": true, "geolocation": "AU"} ``` ```shell zyte-api input.jsonl \ | jq 
--raw-output .httpResponseBody \ | base64 --decode \ | jq .countryCode ``` #### curl input.json ```json { "url": "http://ip-api.com/json", "httpResponseBody": true, "geolocation": "AU" } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq .countryCode ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "http://ip-api.com/json", "httpResponseBody", true, "geolocation", "AU"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); 
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
          String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
          JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject();
          String countryCode = data.get("countryCode").getAsString();
          System.out.println(countryCode);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'http://ip-api.com/json',
    httpResponseBody: true,
    geolocation: 'AU'
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
  const data = JSON.parse(httpResponseBody)
  const countryCode = data.countryCode
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'http://ip-api.com/json',
        'httpResponseBody' => true,
        'geolocation' => 'AU',
    ],
]);
$api = json_decode($response->getBody());
$http_response_body = base64_decode($api->httpResponseBody);
$data = json_decode($http_response_body);
$country_code = $data->countryCode;
```

#### Proxy mode

In proxy mode, use the `Zyte-Geolocation` header.
```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -H "Zyte-Geolocation: AU" \
    http://ip-api.com/json \
| jq .countryCode
```

#### Python

```python
import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "http://ip-api.com/json",
        "httpResponseBody": True,
        "geolocation": "AU",
    },
)
http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"])
response_data = json.loads(http_response_body)
country_code = response_data["countryCode"]
```

#### Python client

```python
import asyncio
import json
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "http://ip-api.com/json",
            "httpResponseBody": True,
            "geolocation": "AU",
        }
    )
    http_response_body: bytes = b64decode(api_response["httpResponseBody"])
    response_data = json.loads(http_response_body)
    print(response_data["countryCode"])


asyncio.run(main())
```

#### Scrapy

```python
import json

from scrapy import Request, Spider


class IPAPIComSpider(Spider):
    name = "ip_api_com"

    async def start(self):
        yield Request(
            "http://ip-api.com/json",
            meta={
                "zyte_api_automap": {
                    "geolocation": "AU",
                },
            },
        )

    def parse(self, response):
        response_data = json.loads(response.body)
        country_code = response_data["countryCode"]
```

Output:

```none
AU
```

#### Making an HTTP request seem like it comes from a mobile device

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
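The examples below all end with the same decoding step: `httpResponseBody` arrives Base64-encoded, and for `https://httpbin.org/user-agent` the decoded bytes are a JSON object with a `user-agent` key. Here is a minimal offline sketch of that step in Python. `SAMPLE_API_RESPONSE` is built locally for illustration and `user_agent_from_api_response` is a hypothetical helper, not part of any Zyte library.

```python
import json
from base64 import b64decode, b64encode


def user_agent_from_api_response(api_response: dict) -> str:
    """Decode httpResponseBody and return the echoed User-Agent string."""
    body: bytes = b64decode(api_response["httpResponseBody"])
    return json.loads(body)["user-agent"]


# Hypothetical response, encoded here so that the sketch runs offline.
_sample_body = json.dumps(
    {"user-agent": "Mozilla/5.0 (Linux; Android 10; K)"}
).encode()
SAMPLE_API_RESPONSE = {
    "httpResponseBody": b64encode(_sample_body).decode("ascii"),
}

print(user_agent_from_api_response(SAMPLE_API_RESPONSE))
```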
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary(){ {"url", "https://httpbin.org/user-agent"}, {"httpResponseBody", true}, {"device", "mobile"} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var headerEnumerator = responseData.RootElement.EnumerateObject(); while (headerEnumerator.MoveNext()) { if (headerEnumerator.Current.Name.ToString() == "user-agent") { Console.WriteLine(headerEnumerator.Current.Value.ToString()); } } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/user-agent", "httpResponseBody": true, "device": "mobile"} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output '.["user-agent"]' ``` #### curl input.json ```json { "url": "https://httpbin.org/user-agent", "httpResponseBody": true, "device": "mobile" } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: 
application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output '.["user-agent"]' ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://httpbin.org/user-agent", "httpResponseBody", true, "device", "mobile"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = 
              jsonObject.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
          String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
          JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject();
          String userAgent = data.get("user-agent").getAsString();
          System.out.println(userAgent);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://httpbin.org/user-agent',
    httpResponseBody: true,
    device: 'mobile'
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
  console.log(JSON.parse(httpResponseBody)['user-agent'])
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://httpbin.org/user-agent',
        'httpResponseBody' => true,
        'device' => 'mobile',
    ],
]);
$api = json_decode($response->getBody());
$http_response_body = base64_decode($api->httpResponseBody);
$data = json_decode($http_response_body);
echo $data->{'user-agent'}.PHP_EOL;
```

#### Proxy mode

In proxy mode, use the `Zyte-Device` header.
```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -H "Zyte-Device: mobile" \
    https://httpbin.org/user-agent \
| jq --raw-output '.["user-agent"]'
```

#### Python

```python
import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://httpbin.org/user-agent",
        "httpResponseBody": True,
        "device": "mobile",
    },
)
http_response_body = b64decode(api_response.json()["httpResponseBody"])
user_agent = json.loads(http_response_body)["user-agent"]
print(user_agent)
```

#### Python client

```python
import asyncio
import json
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://httpbin.org/user-agent",
            "httpResponseBody": True,
            "device": "mobile",
        }
    )
    http_response_body: bytes = b64decode(api_response["httpResponseBody"])
    user_agent = json.loads(http_response_body)["user-agent"]
    print(user_agent)


asyncio.run(main())
```

#### Scrapy

```python
import json

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = "httpbin_org"

    async def start(self):
        yield Request(
            "https://httpbin.org/user-agent",
            meta={
                "zyte_api_automap": {
                    "device": "mobile",
                }
            },
        )

    def parse(self, response):
        user_agent = json.loads(response.text)["user-agent"]
        print(user_agent)
```

Example output (may vary):

```none
Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36
```

#### Getting structured data from a product details page of an e-commerce website

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"}, {"product", true} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var product = data.RootElement.GetProperty("product").ToString(); Console.WriteLine(product); ``` #### CLI client input.jsonl ```json {"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .product ``` #### curl input.json ```json { "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .product ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import 
java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters = ImmutableMap.of( "url", "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); JsonObject product = jsonObject.get("product").getAsJsonObject(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(product)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 
'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', product: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const product = response.data.product console.log(product) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'product' => true, ], ]); $data = json_decode($response->getBody()); $product = json_encode($data->product); echo $product.PHP_EOL; ``` #### Python ```python import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": ( "https://books.toscrape.com/catalogue" "/a-light-in-the-attic_1000/index.html" ), "product": True, }, ) product = api_response.json()["product"] print(product) ``` #### Python client ```python import asyncio import json from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": ( "https://books.toscrape.com/catalogue" "/a-light-in-the-attic_1000/index.html" ), "product": True, } ) product = api_response["product"] print(json.dumps(product, indent=2, ensure_ascii=False)) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class BooksToScrapeComSpider(Spider): name = "books_toscrape_com" async def start(self): yield Request( ( "https://books.toscrape.com/catalogue" "/a-light-in-the-attic_1000/index.html" ), meta={ "zyte_api_automap": { "product": True, }, }, ) def parse(self, response): product = response.raw_api_response["product"] print(product) ``` Output (first 5 lines): ```json { "name": "A Light in the Attic", "price": "51.77", "currency": "GBP", "currencyRaw": "£", ``` #### Submitting an HTML form with an HTTP request > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the 
example below. In [https://quotes.toscrape.com/search.aspx](https://quotes.toscrape.com/search.aspx) you get an HTML form that could be stripped down to: ```html
``` When you select an **Author** (e.g. Albert Einstein), a form request is sent, and the **Tag** options fill up. To reproduce that: #### C# ```cs using System; using System.Collections.Generic; using System.Linq; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using System.Web; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input1 = new Dictionary<string, object>(){ {"url", "https://quotes.toscrape.com/search.aspx"}, {"httpResponseBody", true} }; var inputJson1 = JsonSerializer.Serialize(input1); var content1 = new StringContent(inputJson1, Encoding.UTF8, "application/json"); HttpResponseMessage response1 = await client.PostAsync("https://api.zyte.com/v1/extract", content1); var body1 = await response1.Content.ReadAsByteArrayAsync(); var data1 = JsonDocument.Parse(body1); var base64HttpResponseBody1 = data1.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes1 = System.Convert.FromBase64String(base64HttpResponseBody1); var httpResponseBody1 = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes1); var htmlDocument1 = new HtmlDocument(); htmlDocument1.LoadHtml(httpResponseBody1); var navigator1 = htmlDocument1.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator1.Evaluate("//*[@name='__VIEWSTATE']/@value"); nodeIterator.MoveNext(); var viewState = nodeIterator.Current.ToString(); var httpRequestTextParameters = new Dictionary<string, string> { { "author", "Albert Einstein" }, { "tag", "----------" }, { "__VIEWSTATE", viewState} }; var 
httpRequestText = string.Join("&", httpRequestTextParameters.Select(kvp => $"{HttpUtility.UrlEncode(kvp.Key)}={HttpUtility.UrlEncode(kvp.Value)}")); var input2 = new Dictionary<string, object>(){ {"url", "https://quotes.toscrape.com/filter.aspx"}, {"httpResponseBody", true}, {"httpRequestMethod", "POST"}, { "customHttpRequestHeaders", new List<Dictionary<string, string>>() { new Dictionary<string, string>() { {"name", "Content-Type"}, {"value", "application/x-www-form-urlencoded"} } } }, {"httpRequestText", httpRequestText} }; var inputJson2 = JsonSerializer.Serialize(input2); var content2 = new StringContent(inputJson2, Encoding.UTF8, "application/json"); HttpResponseMessage response2 = await client.PostAsync("https://api.zyte.com/v1/extract", content2); var body2 = await response2.Content.ReadAsByteArrayAsync(); var data2 = JsonDocument.Parse(body2); var base64HttpResponseBody2 = data2.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes2 = System.Convert.FromBase64String(base64HttpResponseBody2); var httpResponseBody2 = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes2); var htmlDocument2 = new HtmlDocument(); htmlDocument2.LoadHtml(httpResponseBody2); var navigator2 = htmlDocument2.CreateNavigator(); var nodeIterator2 = (XPathNodeIterator)navigator2.Evaluate("//*[@name='tag']//option"); int tagCount = 0; while (nodeIterator2.MoveNext()) { tagCount++; } Console.WriteLine($"{tagCount}"); ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.ArrayList; import java.util.Base64; import java.util.Collections; import java.util.List; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.entity.UrlEncodedFormEntity; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import 
org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.NameValuePair; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.apache.hc.core5.http.message.BasicNameValuePair; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.select.Elements; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, Object> parameters1 = ImmutableMap.of("url", "https://quotes.toscrape.com/search.aspx", "httpResponseBody", true); String requestBody1 = new Gson().toJson(parameters1); HttpPost request1 = new HttpPost("https://api.zyte.com/v1/extract"); request1.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request1.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request1.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request1.setEntity(new StringEntity(requestBody1)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request1, (response1) -> { HttpEntity httpEntity1 = response1.getEntity(); String httpApiResponse1 = EntityUtils.toString(httpEntity1, StandardCharsets.UTF_8); JsonObject httpJsonObject1 = JsonParser.parseString(httpApiResponse1).getAsJsonObject(); String base64HttpResponseBody1 = httpJsonObject1.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes1 = Base64.getDecoder().decode(base64HttpResponseBody1); String httpResponseBody1 = new String(httpResponseBodyBytes1, StandardCharsets.UTF_8); Document document1 = Jsoup.parse(httpResponseBody1); String viewState = document1.select("[name='__VIEWSTATE']").attr("value"); Map<String, String> params = ImmutableMap.of( "author", "Albert Einstein", "tag", 
"----------", "__VIEWSTATE", viewState); List<NameValuePair> formParams = new ArrayList<>(); for (Map.Entry<String, String> entry : params.entrySet()) { formParams.add(new BasicNameValuePair(entry.getKey(), entry.getValue())); } UrlEncodedFormEntity entity = new UrlEncodedFormEntity(formParams, StandardCharsets.UTF_8); String httpRequestText = EntityUtils.toString(entity); Map<String, String> customHttpRequestHeader = ImmutableMap.of("name", "Content-Type", "value", "application/x-www-form-urlencoded"); Map<String, Object> parameters2 = ImmutableMap.of( "url", "https://quotes.toscrape.com/filter.aspx", "httpResponseBody", true, "httpRequestMethod", "POST", "customHttpRequestHeaders", Collections.singletonList(customHttpRequestHeader), "httpRequestText", httpRequestText); String requestBody2 = new Gson().toJson(parameters2); HttpPost request2 = new HttpPost("https://api.zyte.com/v1/extract"); request2.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request2.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request2.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request2.setEntity(new StringEntity(requestBody2)); client.execute( request2, (response2) -> { HttpEntity httpEntity2 = response2.getEntity(); String httpApiResponse2 = EntityUtils.toString(httpEntity2, StandardCharsets.UTF_8); JsonObject httpJsonObject2 = JsonParser.parseString(httpApiResponse2).getAsJsonObject(); String base64HttpResponseBody2 = httpJsonObject2.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes2 = Base64.getDecoder().decode(base64HttpResponseBody2); String httpResponseBody2 = new String(httpResponseBodyBytes2, StandardCharsets.UTF_8); Document document2 = Jsoup.parse(httpResponseBody2); Elements tags = document2.select("select[name='tag'] option"); System.out.println(tags.size()); return null; }); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` 
#### JS ```js const axios = require('axios') const cheerio = require('cheerio') const querystring = require('querystring') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://quotes.toscrape.com/search.aspx', httpResponseBody: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const $ = cheerio.load(httpResponseBody) const viewState = $('[name="__VIEWSTATE"]').get(0).attribs.value const httpRequestText = querystring.stringify( { author: 'Albert Einstein', tag: '----------', __VIEWSTATE: viewState } ) axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://quotes.toscrape.com/filter.aspx', httpResponseBody: true, httpRequestMethod: 'POST', customHttpRequestHeaders: [ { name: 'Content-Type', value: 'application/x-www-form-urlencoded' } ], httpRequestText }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const $ = cheerio.load(httpResponseBody) console.log($('select[name="tag"] option').length) }) }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response_1 = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://quotes.toscrape.com/search.aspx', 'httpResponseBody' => true, ], ]); $data = json_decode($response_1->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $doc = new DOMDocument(); $doc->loadHTML($http_response_body); $xpath_1 = new DOMXPath($doc); $view_state = $xpath_1->query('//*[@name="__VIEWSTATE"]/@value')->item(0)->nodeValue; $http_request_text = http_build_query( [ 'author' => 'Albert Einstein', 'tag' => '----------', '__VIEWSTATE' => $view_state, ] ); $response_2 = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 
'https://quotes.toscrape.com/filter.aspx', 'httpResponseBody' => true, 'httpRequestMethod' => 'POST', 'customHttpRequestHeaders' => [ [ 'name' => 'Content-Type', 'value' => 'application/x-www-form-urlencoded', ], ], 'httpRequestText' => $http_request_text, ], ]); $data = json_decode($response_2->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $doc->loadHTML($http_response_body); $xpath_2 = new DOMXPath($doc); $tags = $xpath_2->query('//*[@name="tag"]/option'); echo count($tags).PHP_EOL; ``` #### Python Install form2request, which makes it easier to handle HTML forms in Python. Then: ```python from base64 import b64decode from form2request import form2request from parsel import Selector import requests api_response_1 = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://quotes.toscrape.com/search.aspx", "httpResponseBody": True, }, ) api_response_1_data = api_response_1.json() http_response_body_1 = b64decode(api_response_1_data["httpResponseBody"]) selector_1 = Selector(body=http_response_body_1, base_url=api_response_1_data["url"]) form = selector_1.css("form") request = form2request(form, {"author": "Albert Einstein"}, click=False) api_response_2 = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": request.url, "httpRequestMethod": request.method, "customHttpRequestHeaders": [ {"name": k, "value": v} for k, v in request.headers ], "httpRequestText": request.body.decode(), "httpResponseBody": True, }, ) http_response_body_2 = b64decode(api_response_2.json()["httpResponseBody"]) selector_2 = Selector(body=http_response_body_2) print(len(selector_2.css("select[name='tag'] option"))) ``` #### Python client Install form2request, which makes it easier to handle HTML forms in Python. 
Then: ```python import asyncio from base64 import b64decode from form2request import form2request from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response_1 = await client.get( { "url": "https://quotes.toscrape.com/search.aspx", "httpResponseBody": True, } ) http_response_body_1 = b64decode(api_response_1["httpResponseBody"]) selector_1 = Selector(body=http_response_body_1, base_url=api_response_1["url"]) form = selector_1.css("form") request = form2request(form, {"author": "Albert Einstein"}, click=False) api_response_2 = await client.get( { "url": request.url, "httpRequestMethod": request.method, "customHttpRequestHeaders": [ {"name": k, "value": v} for k, v in request.headers ], "httpRequestText": request.body.decode(), "httpResponseBody": True, } ) http_response_body_2 = b64decode(api_response_2["httpResponseBody"]) selector_2 = Selector(body=http_response_body_2) print(len(selector_2.css("select[name='tag'] option"))) asyncio.run(main()) ``` #### Scrapy Install form2request, which makes it easier to handle HTML forms in Scrapy. Then, use it and let transparent mode take care of the rest: ```python from form2request import form2request from scrapy import Spider class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" start_urls = ["https://quotes.toscrape.com/search.aspx"] def parse(self, response): form = response.css("form") request = form2request(form, {"author": "Albert Einstein"}, click=False) yield request.to_scrapy(callback=self.parse_tags) def parse_tags(self, response): print(len(response.css("select[name='tag'] option"))) ``` Output (number of **Tag** options): ```json 25 ``` #### Decoding HTML from an HTTP response body (i.e. from bytes to text) > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. 
#### curl Use [file](https://www.darwinsys.com/file/) to guess the character encoding of a previously downloaded response based solely on its body (i.e. not following the HTML encoding sniffing algorithm). ```shell file --mime-encoding output.html ``` #### JS Use [content-type-parser](https://www.npmjs.com/package/content-type-parser), [html-encoding-sniffer](https://www.npmjs.com/package/html-encoding-sniffer) and [whatwg-encoding](https://www.npmjs.com/package/whatwg-encoding): ```js const contentTypeParser = require('content-type-parser') const htmlEncodingSniffer = require('html-encoding-sniffer') const whatwgEncoding = require('whatwg-encoding') // … const httpResponseHeaders = response.data.httpResponseHeaders let contentTypeCharset httpResponseHeaders.forEach(function (item) { if (item.name.toLowerCase() === 'content-type') { contentTypeCharset = contentTypeParser(item.value).get('charset') } }) const httpResponseBody = Buffer.from(response.data.httpResponseBody, 'base64') const encoding = htmlEncodingSniffer(httpResponseBody, { transportLayerEncodingLabel: contentTypeCharset }) const html = whatwgEncoding.decode(httpResponseBody, encoding) ``` #### Python [web-poet](https://web-poet.readthedocs.io/en/stable/index.html) provides a response wrapper that automatically decodes the response body following an encoding sniffing algorithm similar to the one defined in the HTML standard. Provided that you have extracted a response with both body and headers, and you have Base64-decoded the response body, you can decode the HTML bytes as follows: ```python from web_poet import HttpResponse # … headers = tuple( (item['name'], item['value']) for item in http_response_headers ) response = HttpResponse( url='https://example.com', body=http_response_body, status=200, headers=headers, ) html = response.text ``` #### Scrapy In transparent mode, regular Scrapy requests targeting HTML resources decode them by default. See zapi-text. 
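To make the idea of encoding sniffing in this section more concrete, here is a much-simplified sketch that uses only the Python standard library. It is not the algorithm from the HTML standard and not what web-poet implements: it only checks a byte order mark, then the `charset` of a `Content-Type` value, then a `<meta>` declaration near the start of the body, and finally falls back to UTF-8. The function names `guess_encoding` and `decode_html` are hypothetical, and labels are resolved with Python codecs rather than the WHATWG encoding labels, so results can differ from a browser.

```python
import codecs
import re
from email.message import Message
from typing import Optional

# Byte order marks checked first, since they take precedence over other hints.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)
# Rough match for <meta charset=...> and <meta ... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.I
)


def _known(label: str) -> Optional[str]:
    """Return the label if Python has a codec for it, else None."""
    try:
        codecs.lookup(label)
    except LookupError:
        return None
    return label


def guess_encoding(
    body: bytes, content_type: Optional[str] = None, default: str = "utf-8"
) -> str:
    """Guess the encoding from a BOM, the Content-Type value, or a meta tag."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    if content_type:
        message = Message()
        message["Content-Type"] = content_type
        charset = message.get_content_charset()
        if charset and _known(charset):
            return charset
    match = _META_CHARSET.search(body[:1024])
    if match:
        label = match.group(1).decode("ascii")
        if _known(label):
            return label
    return default


def decode_html(body: bytes, content_type: Optional[str] = None) -> str:
    """Decode HTML bytes to text, dropping a leading BOM character if present."""
    text = body.decode(guess_encoding(body, content_type), errors="replace")
    return text[1:] if text.startswith("\ufeff") else text


sample = '<meta charset="windows-1252"><p>café</p>'.encode("cp1252")
print(decode_html(sample))
```

For production use, the libraries named in this section remain the safer choice, because they follow the standard sniffing rules much more closely.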
#### Setting arbitrary headers in HTTP requests > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://httpbin.org/anything"}, {"httpResponseBody", true}, { "customHttpRequestHeaders", new List<Dictionary<string, string>>() { new Dictionary<string, string>() { {"name", "Accept-Language"}, {"value", "fa"} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var headerEnumerator = responseData.RootElement.GetProperty("headers").EnumerateObject(); var headers = new Dictionary<string, string>(); while (headerEnumerator.MoveNext()) { headers.Add( headerEnumerator.Current.Name.ToString(), headerEnumerator.Current.Value.ToString() ); } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/anything", "httpResponseBody": true, "customHttpRequestHeaders": [{"name": "Accept-Language", "value": "fa"}]} ``` ```shell 
zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq .headers ``` #### curl input.json ```json { "url": "https://httpbin.org/anything", "httpResponseBody": true, "customHttpRequestHeaders": [ { "name": "Accept-Language", "value": "fa" } ] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq .headers ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.GsonBuilder; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map<String, String> customHttpRequestHeader = ImmutableMap.of("name", "Accept-Language", "value", "fa"); Map<String, Object> parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "httpResponseBody", true, "customHttpRequestHeaders", Collections.singletonList(customHttpRequestHeader)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); 
request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); JsonObject headers = data.get("headers").getAsJsonObject(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(headers)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/anything', httpResponseBody: true, customHttpRequestHeaders: [ { name: 'Accept-Language', value: 'fa' } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const headers = JSON.parse(httpResponseBody).headers }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'httpResponseBody' => true, 'customHttpRequestHeaders' => [ [ 'name' => 'Accept-Language', 'value' => 'fa', ], ], ], ]); $api = json_decode($response->getBody()); $http_response_body = 
base64_decode($api->httpResponseBody); $data = json_decode($http_response_body); $headers = $data->headers; ``` #### Proxy mode With the proxy mode, the request headers from your requests are used automatically. ```shell curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --compressed \ -H "Accept-Language: fa" \ https://httpbin.org/anything \ | jq .headers ``` #### Python ```python import json from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/anything", "httpResponseBody": True, "customHttpRequestHeaders": [ { "name": "Accept-Language", "value": "fa", }, ], }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) headers = json.loads(http_response_body)["headers"] ``` #### Python client ```python import asyncio import json from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://httpbin.org/anything", "httpResponseBody": True, "customHttpRequestHeaders": [ { "name": "Accept-Language", "value": "fa", }, ], } ) http_response_body: bytes = b64decode(api_response["httpResponseBody"]) headers = json.loads(http_response_body)["headers"] print(json.dumps(headers, indent=2)) asyncio.run(main()) ``` #### Scrapy ```python import json from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "https://httpbin.org/anything", headers={"Accept-Language": "fa"}, ) def parse(self, response): headers = json.loads(response.text)["headers"] ``` Output (first 5 lines): ```json { "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7", "Accept-Encoding": "gzip, deflate, br", "Accept-Language": "fa", "Host": "httpbin.org", ``` #### Forcing data center IPs or device 
residential IPs > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); string[] ipTypes = { "datacenter", "residential" }; for (int i = 0; i < ipTypes.Length; i++) { var input = new Dictionary<string, object>(){ {"url", "https://www.whatismyisp.com/"}, {"httpResponseBody", true}, {"ipType", ipTypes[i]} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes = System.Convert.FromBase64String(base64HttpResponseBody); var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(httpResponseBody); var navigator = htmlDocument.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator.Evaluate("//h1/span/text()"); nodeIterator.MoveNext(); var isp = nodeIterator.Current.ToString(); Console.WriteLine(isp); } ``` #### CLI client input.jsonl ```json {"url": "https://www.whatismyisp.com/", 
"httpResponseBody": true, "ipType": "datacenter"} {"url": "https://www.whatismyisp.com/", "httpResponseBody": true, "ipType": "residential"} ``` ```shell zyte-api input.jsonl 2> /dev/null \ | xargs -d\\n -n 1 \ bash -c " jq --raw-output .httpResponseBody <<< \"\$0\" \ | base64 --decode \ | xmllint --html --xpath 'string(//h1/span/text())' --noblanks - 2> /dev/null " ``` #### curl input.jsonl ```json {"url": "https://www.whatismyisp.com/", "httpResponseBody": true, "ipType": "datacenter"} {"url": "https://www.whatismyisp.com/", "httpResponseBody": true, "ipType": "residential"} ``` ```shell cat input.jsonl \ | xargs -P 2 -d\\n -n 1 \ bash -c " curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data \"\$0\" \ --compressed \ https://api.zyte.com/v1/extract \ 2> /dev/null \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | xmllint --html --xpath 'string(//h1/span/text())' --noblanks - 2> /dev/null " ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { String[] ipTypes = {"datacenter", 
"residential"}; for (String ipType : ipTypes) { Map parameters = ImmutableMap.of( "url", "https://www.whatismyisp.com/", "httpResponseBody", true, "ipType", ipType); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); Document document = Jsoup.parse(httpResponseBody); String logout = document.select("h1 > span:first-of-type").text(); System.out.println(logout); return null; }); } } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') const ipTypes = ['datacenter', 'residential'] for (const ipType of ipTypes) { axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://www.whatismyisp.com/', httpResponseBody: true, ipType }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const $ = cheerio.load(httpResponseBody) const logout = $('h1 > span:first-of-type').text() console.log(logout) }) } ``` #### PHP ```php 
<?php

$client = new GuzzleHttp\Client();
foreach (['datacenter', 'residential'] as $ip_type) {
    $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
        'auth' => ['YOUR_ZYTE_API_KEY', ''],
        'headers' => ['Accept-Encoding' => 'gzip'],
        'json' => [
            'url' => 'https://www.whatismyisp.com/',
            'httpResponseBody' => true,
            'ipType' => $ip_type,
        ],
    ]);
    $data = json_decode($response->getBody());
    $http_response_body = base64_decode($data->httpResponseBody);
    $doc = new DOMDocument();
    $doc->loadHTML($http_response_body);
    $xpath = new DOMXPath($doc);
    $logout = $xpath->query('//h1/span/text()')->item(0)->nodeValue;
    echo $logout.PHP_EOL;
}
```

#### Proxy mode

In proxy mode, use the `Zyte-IPType` header.

```shell
for ip_type in datacenter residential
do
    curl \
        --proxy api.zyte.com:8011 \
        --proxy-user YOUR_ZYTE_API_KEY: \
        --header "Zyte-IPType: $ip_type" \
        --compressed \
        https://www.whatismyisp.com/ \
        2> /dev/null \
    | xmllint --html --xpath 'string(//h1/span/text())' --noblanks - 2> /dev/null
done
```

#### Python

```python
from base64 import b64decode

import requests
from parsel import Selector

for ip_type in ("datacenter", "residential"):
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=("YOUR_ZYTE_API_KEY", ""),
        json={
            "url": "https://www.whatismyisp.com/",
            "httpResponseBody": True,
            "ipType": ip_type,
        },
    )
    http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"])
    http_response_body = http_response_body_bytes.decode()
    logout = Selector(http_response_body).css("h1 > span::text").get()
    print(logout)
```

#### Python client

```python
import asyncio
from base64 import b64decode

from parsel import Selector
from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    for ip_type in ("datacenter", "residential"):
        api_response = await client.get(
            {
                "url": "https://www.whatismyisp.com/",
                "httpResponseBody": True,
                "ipType": ip_type,
            },
        )
        http_response_body_bytes = b64decode(api_response["httpResponseBody"])
        http_response_body = http_response_body_bytes.decode()
        logout = Selector(http_response_body).css("h1 > span::text").get()
        print(logout)


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider


class WhatIsMyIspComSpider(Spider):
    name = "whatismyisp_com"

    async def start(self):
        for ip_type in ("datacenter", "residential"):
            yield Request(
                "https://www.whatismyisp.com/",
                meta={
                    "zyte_api_automap": {
                        "ipType": ip_type,
                    },
                },
            )

    def parse(self, response):
        print(response.css("h1 > span::text").get())
```

Output:

```none
[A web hosting company]
[An Internet service provider]
```

#### Disabling JavaScript in a browser request

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using System.Xml.XPath;
using HtmlAgilityPack;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://www.whatismybrowser.com/detect/is-javascript-enabled"},
    {"browserHtml", true},
    {"javascript", false}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var browserHtml = data.RootElement.GetProperty("browserHtml").ToString();
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(browserHtml);
var navigator = htmlDocument.CreateNavigator();
var nodeIterator =
    (XPathNodeIterator)navigator.Evaluate("//*[@id=\"detected_value\"]/text()");
nodeIterator.MoveNext();
var isJavaScriptEnabled = nodeIterator.Current.ToString();
```

#### CLI client

input.jsonl

```json
{"url": "https://www.whatismybrowser.com/detect/is-javascript-enabled", "browserHtml": true, "javascript": false}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .browserHtml \
    | xmllint --html --xpath '//*[@id="detected_value"]/text()' - 2> /dev/null
```

#### curl

input.json

```json
{
    "url": "https://www.whatismybrowser.com/detect/is-javascript-enabled",
    "browserHtml": true,
    "javascript": false
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .browserHtml \
    | xmllint --html --xpath '//*[@id="detected_value"]/text()' - 2> /dev/null
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url",
            "https://www.whatismybrowser.com/detect/is-javascript-enabled",
            "browserHtml",
            true,
            "javascript",
            false);
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String browserHtml = jsonObject.get("browserHtml").getAsString();
          Document document = Jsoup.parse(browserHtml);
          String isJavaScriptEnabled = document.select("#detected_value").text();
          System.out.println(isJavaScriptEnabled);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')
const cheerio = require('cheerio')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://www.whatismybrowser.com/detect/is-javascript-enabled',
    browserHtml: true,
    javascript: false
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const $ = cheerio.load(response.data.browserHtml)
  const isJavaScriptEnabled = $('#detected_value').text()
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://www.whatismybrowser.com/detect/is-javascript-enabled',
        'browserHtml' => true,
        'javascript' => false,
    ],
]);
$api = json_decode($response->getBody());
$doc = new DOMDocument();
$doc->loadHTML($api->browserHtml);
$xpath = new DOMXPath($doc);
$is_javascript_enabled =
    $xpath->query("//*[@id='detected_value']")->item(0)->textContent;
```

#### Python

```python
import requests
from parsel import Selector

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://www.whatismybrowser.com/detect/is-javascript-enabled",
        "browserHtml": True,
        "javascript": False,
    },
)
browser_html = api_response.json()["browserHtml"]
selector = Selector(browser_html)
is_javascript_enabled: str = selector.css("#detected_value::text").get()
```

#### Python client

```python
import asyncio

from parsel import Selector
from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://www.whatismybrowser.com/detect/is-javascript-enabled",
            "browserHtml": True,
            "javascript": False,
        }
    )
    browser_html = api_response["browserHtml"]
    selector = Selector(browser_html)
    is_javascript_enabled = selector.css("#detected_value::text").get()
    print(is_javascript_enabled)


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider


class WhatIsMyBrowserComSpider(Spider):
    name = "whatismybrowser_com"

    async def start(self):
        yield Request(
            "https://www.whatismybrowser.com/detect/is-javascript-enabled",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "javascript": False,
                },
            },
        )

    def parse(self, response):
        is_javascript_enabled: str = response.css("#detected_value::text").get()
```

Output:

```none
No
```

#### Appending arbitrary metadata to a request

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
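
The `echoData` value that you send with a request comes back unchanged in the matching response, so you can use it to pair responses with their inputs even when concurrent requests finish out of order. The following is a minimal, offline sketch of that pairing step only. It assumes the responses have already been parsed into dictionaries; the helper name `index_by_echo_data` and the stand-in responses are hypothetical and not part of any Zyte library.

```python
def index_by_echo_data(responses):
    """Map each response's echoData value to the response itself."""
    return {response["echoData"]: response for response in responses}


# (URL, marker) pairs; the marker is what would be sent as echoData.
input_data = [
    ("https://toscrape.com", 1),
    ("https://books.toscrape.com", 2),
    ("https://quotes.toscrape.com", 3),
]

# Stand-ins for parsed API responses that arrived out of order.
responses = [
    {"url": "https://quotes.toscrape.com/", "statusCode": 200, "echoData": 3},
    {"url": "https://toscrape.com/", "statusCode": 200, "echoData": 1},
    {"url": "https://books.toscrape.com/", "statusCode": 200, "echoData": 2},
]

# Restore the input order by looking each marker up in the index.
by_marker = index_by_echo_data(responses)
ordered = [by_marker[marker] for _, marker in input_data]
print([response["url"] for response in ordered])
```

This only works if every marker is unique within a batch; if two requests share the same `echoData` value, the later response overwrites the earlier one in the index.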
#### C#

```cs
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

var inputData = new List<List<object>>()
{
    new List<object>(){"https://toscrape.com", 1},
    new List<object>(){"https://books.toscrape.com", 2},
    new List<object>(){"https://quotes.toscrape.com", 3},
};
var output = new List<HttpResponseMessage>();

var handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All,
    MaxConnectionsPerServer = 15
};
var client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var responseTasks = new List<Task<HttpResponseMessage>>();

foreach (var entry in inputData)
{
    var input = new Dictionary<string, object>(){
        {"url", entry[0]},
        {"browserHtml", true},
        {"echoData", entry[1]}
    };
    var inputJson = JsonSerializer.Serialize(input);
    var content = new StringContent(inputJson, Encoding.UTF8, "application/json");
    var responseTask = client.PostAsync("https://api.zyte.com/v1/extract", content);
    responseTasks.Add(responseTask);
}

while (responseTasks.Any())
{
    var responseTask = await Task.WhenAny(responseTasks);
    responseTasks.Remove(responseTask);
    var response = await responseTask;
    output.Add(response);
}
```

#### CLI client

input.jsonl

```json
{"url": "https://toscrape.com", "browserHtml": true, "echoData": 1}
{"url": "https://books.toscrape.com", "browserHtml": true, "echoData": 2}
{"url": "https://quotes.toscrape.com", "browserHtml": true, "echoData": 3}
```

```shell
zyte-api --n-conn 15 input.jsonl -o output.jsonl
```

#### curl

input.jsonl

```json
{"url": "https://toscrape.com", "browserHtml": true, "echoData": 1}
{"url": "https://books.toscrape.com", "browserHtml": true, "echoData": 2}
{"url": "https://quotes.toscrape.com", "browserHtml": true, "echoData": 3}
```

```shell
cat input.jsonl \
| xargs -P 15 -d\\n -n 1 \
bash -c "
    curl \
        --user YOUR_ZYTE_API_KEY: \
        --header 'Content-Type: application/json' \
        --data \"\$0\" \
        --compressed \
        https://api.zyte.com/v1/extract \
    | jq .echoData \
    | awk '{print \$1}' \
    >> output.jsonl
"
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
import org.apache.hc.client5.http.ssl.ClientTlsStrategyBuilder;
import org.apache.hc.core5.concurrent.FutureCallback;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.nio.ssl.TlsStrategy;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws ExecutionException, InterruptedException, IOException, ParseException {

    Object[][] input = {
      {"https://toscrape.com", 1},
      {"https://books.toscrape.com", 2},
      {"https://quotes.toscrape.com", 3}
    };
    List<Future<SimpleHttpResponse>> futures = new ArrayList<Future<SimpleHttpResponse>>();
    List<String> output = new ArrayList<String>();

    int concurrency = 15;

    // https://issues.apache.org/jira/browse/HTTPCLIENT-2219
    final TlsStrategy tlsStrategy = ClientTlsStrategyBuilder.create().useSystemProperties().build();

    PoolingAsyncClientConnectionManager connectionManager =
        PoolingAsyncClientConnectionManagerBuilder.create().setTlsStrategy(tlsStrategy).build();
    connectionManager.setMaxTotal(concurrency);
    connectionManager.setDefaultMaxPerRoute(concurrency);

    CloseableHttpAsyncClient client =
        HttpAsyncClients.custom().setConnectionManager(connectionManager).build();
    try {
      client.start();
      for (int i = 0; i < input.length; i++) {
        Map<String, Object> parameters =
            ImmutableMap.of("url", input[i][0], "browserHtml", true, "echoData", input[i][1]);
        String requestBody = new Gson().toJson(parameters);

        SimpleHttpRequest request =
            new SimpleHttpRequest("POST", "https://api.zyte.com/v1/extract");
        request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
        request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
        request.setBody(requestBody, ContentType.APPLICATION_JSON);

        final Future<SimpleHttpResponse> future =
            client.execute(
                request,
                new FutureCallback<SimpleHttpResponse>() {
                  public void completed(final SimpleHttpResponse response) {
                    String apiResponse = response.getBodyText();
                    output.add(apiResponse);
                  }

                  public void failed(final Exception ex) {}

                  public void cancelled() {}
                });
        futures.add(future);
      }
      for (int i = 0; i < futures.size(); i++) {
        futures.get(i).get();
      }
    } finally {
      client.close();
    }
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const { ConcurrencyManager } = require('axios-concurrency')
const axios = require('axios')

const urls = [
  ['https://toscrape.com', 1],
  ['https://books.toscrape.com', 2],
  ['https://quotes.toscrape.com', 3]
]
const output = []

const client = axios.create()
ConcurrencyManager(client, 15)

Promise.all(
  urls.map((input) =>
    client.post(
      'https://api.zyte.com/v1/extract',
      { url: input[0], browserHtml: true, echoData: input[1] },
      {
        auth: { username: 'YOUR_ZYTE_API_KEY' }
      }
    ).then((response) => output.push(response.data))
  )
)
```

#### PHP

```php
<?php

$input = [
    ['https://toscrape.com', 1],
    ['https://books.toscrape.com', 2],
    ['https://quotes.toscrape.com', 3],
];
$output = [];
$promises = [];

$client = new GuzzleHttp\Client();

foreach ($input as $url_and_index) {
    $options = [
        'auth' => ['YOUR_ZYTE_API_KEY', ''],
        'headers' => ['Accept-Encoding' =>
            'gzip'],
        'json' => [
            'url' => $url_and_index[0],
            'browserHtml' => true,
            'echoData' => $url_and_index[1],
        ],
    ];
    $request = new \GuzzleHttp\Psr7\Request('POST', 'https://api.zyte.com/v1/extract');
    global $promises;
    $promises[] = $client->sendAsync($request, $options)->then(function ($response) {
        global $output;
        $output[] = json_decode($response->getBody());
    });
}

foreach ($promises as $promise) {
    $promise->wait();
}
```

#### Proxy mode

In proxy mode, you cannot set request metadata.

#### Python

```python
import asyncio

import aiohttp

input_data = [
    ("https://toscrape.com", 1),
    ("https://books.toscrape.com", 2),
    ("https://quotes.toscrape.com", 3),
]
output = []


async def extract(client, url, index):
    response = await client.post(
        "https://api.zyte.com/v1/extract",
        json={"url": url, "browserHtml": True, "echoData": index},
        auth=aiohttp.BasicAuth("YOUR_ZYTE_API_KEY"),
    )
    output.append(await response.json())


async def main():
    connector = aiohttp.TCPConnector(limit_per_host=15)
    async with aiohttp.ClientSession(connector=connector) as client:
        await asyncio.gather(
            *[extract(client, url, index) for url, index in input_data]
        )


asyncio.run(main())
```

#### Python client

```python
import asyncio
import json

from zyte_api import AsyncZyteAPI

input_data = [
    ("https://toscrape.com", 1),
    ("https://books.toscrape.com", 2),
    ("https://quotes.toscrape.com", 3),
]


async def main():
    client = AsyncZyteAPI(n_conn=15)
    queries = [
        {"url": url, "browserHtml": True, "echoData": index}
        for url, index in input_data
    ]
    async with client.session() as session:
        for future in session.iter(queries):
            response = await future
            print(json.dumps(response))


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider

input_data = [
    ("https://toscrape.com", 1),
    ("https://books.toscrape.com", 2),
    ("https://quotes.toscrape.com", 3),
]


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    custom_settings = {
        "CONCURRENT_REQUESTS": 15,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 15,
    }

    async def start(self):
        for url, index in input_data:
            yield Request(
                url,
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                        "echoData": index,
                    },
                },
            )

    def parse(self, response):
        yield {
            "index": response.raw_api_response["echoData"],
            "html": response.text,
        }
```

Alternatively, you can use Scrapy’s `Request.cb_kwargs` directly for a similar purpose:

```python
    async def start(self):
        for url, index in input_data:
            yield Request(
                url,
                cb_kwargs={"index": index},
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                    },
                },
            )

    def parse(self, response, index):
        yield {
            "index": index,
            "html": response.text,
        }
```

Output:

```json
{"url": "https://quotes.toscrape.com/", "statusCode": 200, "browserHtml": "\n\t\n\tQuotes to Scrape\n \n \n\n\n
…", "echoData": 3}
{"url": "https://books.toscrape.com/", "statusCode": 200, "browserHtml": "\n \n All products | Books to Scrape - Sandbox\n…", "echoData": 2}
{"url": "https://toscrape.com/", "statusCode": 200, "browserHtml": "\n \n Scraping Sandbox\n…
\n \n\n", "echoData": 1}
```

#### Sending a POST request

> ###### TIP
>
> For a more complete example featuring a request body and headers,
> see the HTML form example.

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://httpbin.org/anything"},
    {"httpResponseBody", true},
    {"httpRequestMethod", "POST"}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody);

var responseData = JsonDocument.Parse(httpResponseBody);
var method = responseData.RootElement.GetProperty("method").ToString();
```

#### CLI client

input.jsonl

```json
{"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST"}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    | jq .method
```

#### curl

input.json

```json
{
    "url": "https://httpbin.org/anything",
    "httpResponseBody": true,
    "httpRequestMethod": "POST"
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    | jq .method
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url",
            "https://httpbin.org/anything",
            "httpResponseBody",
            true,
            "httpRequestMethod",
            "POST");
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject =
              JsonParser.parseString(apiResponse).getAsJsonObject();
          String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
          String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
          JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject();
          String method = data.get("method").getAsString();
          System.out.println(method);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://httpbin.org/anything',
    httpResponseBody: true,
    httpRequestMethod: 'POST'
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
  const method = JSON.parse(httpResponseBody).method
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://httpbin.org/anything',
        'httpResponseBody' => true,
        'httpRequestMethod' => 'POST',
    ],
]);
$data = json_decode($response->getBody());
$http_response_body = base64_decode($data->httpResponseBody);
$method = json_decode($http_response_body)->method;
```

#### Proxy mode

In proxy mode, the HTTP method of your own request is used automatically.
```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -X POST \
    https://httpbin.org/anything \
    | jq .method
```

#### Python

```python
import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://httpbin.org/anything",
        "httpResponseBody": True,
        "httpRequestMethod": "POST",
    },
)
http_response_body = b64decode(api_response.json()["httpResponseBody"])
method = json.loads(http_response_body)["method"]
```

#### Python client

```python
import asyncio
import json
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://httpbin.org/anything",
            "httpResponseBody": True,
            "httpRequestMethod": "POST",
        }
    )
    http_response_body: bytes = b64decode(api_response["httpResponseBody"])
    method = json.loads(http_response_body)["method"]
    print(method)


asyncio.run(main())
```

#### Scrapy

```python
import json

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = "httpbin_org"

    async def start(self):
        yield Request(
            "https://httpbin.org/anything",
            method="POST",
        )

    def parse(self, response):
        method = json.loads(response.text)["method"]
```

Output:

```json
"POST"
```

#### Using network capture to intercept background requests sent during browser rendering

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
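Independently of the language-specific examples in this section, the final step is always the same: each entry in `networkCapture` carries a Base64-encoded `httpResponseBody` that is decoded and then parsed. The following is a minimal offline sketch of that step, assuming the captured bodies are JSON. The helper name `decode_captures` and the hand-built `sample_response` are hypothetical; they stand in for a real API response and are not part of any Zyte library.

```python
import json
from base64 import b64decode, b64encode


def decode_captures(api_response):
    """Return the parsed JSON body of every captured response that has one."""
    decoded = []
    for capture in api_response.get("networkCapture", []):
        body = capture.get("httpResponseBody")
        if body is None:
            # This entry carries no body, so there is nothing to decode.
            continue
        decoded.append(json.loads(b64decode(body)))
    return decoded


# Hypothetical stand-in for a real API response; it only contains the
# fields that the examples in this section read.
payload = {"quotes": [{"author": {"name": "Albert Einstein"}}]}
sample_response = {
    "networkCapture": [
        {"httpResponseBody": b64encode(json.dumps(payload).encode()).decode()},
    ],
}

captures = decode_captures(sample_response)
print(captures[0]["quotes"][0]["author"]["name"])
```

Looping over every entry, instead of reading only index 0, can help when a filter matches more than one background request.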
#### C#

```cs
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://quotes.toscrape.com/scroll"},
    {"browserHtml", true},
    {
        "networkCapture",
        new List<Dictionary<string, object>>()
        {
            new Dictionary<string, object>()
            {
                {"filterType", "url"},
                {"httpResponseBody", true},
                {"value", "/api/"},
                {"matchType", "contains"}
            }
        }
    }
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var apiBody = await response.Content.ReadAsByteArrayAsync();
var data = JsonDocument.Parse(apiBody);
var captureEnumerator = data.RootElement.GetProperty("networkCapture").EnumerateArray();
captureEnumerator.MoveNext();
var capture = captureEnumerator.Current;
var base64Body = capture.GetProperty("httpResponseBody").ToString();
var body = System.Convert.FromBase64String(base64Body);
var captureData = JsonDocument.Parse(body);
var quoteEnumerator = captureData.RootElement.GetProperty("quotes").EnumerateArray();
quoteEnumerator.MoveNext();
var quote = quoteEnumerator.Current;
var authorEnumerator = quote.GetProperty("author").EnumerateObject();
while (authorEnumerator.MoveNext())
{
    if (authorEnumerator.Current.Name.ToString() == "name")
    {
        Console.WriteLine(authorEnumerator.Current.Value.ToString());
        break;
    }
}
```

#### CLI client

input.jsonl

```json
{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "networkCapture": [{"filterType": "url", "httpResponseBody": true, "value": "/api/", "matchType": "contains"}]}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output ".networkCapture[0].httpResponseBody" \
    | base64 --decode \
    | jq --raw-output ".quotes[0].author.name"
```

#### curl

input.json

```json
{
    "url": "https://quotes.toscrape.com/scroll",
    "browserHtml": true,
    "networkCapture": [
        {
            "filterType": "url",
            "httpResponseBody": true,
            "value": "/api/",
            "matchType": "contains"
        }
    ]
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output ".networkCapture[0].httpResponseBody" \
    | base64 --decode \
    | jq --raw-output ".quotes[0].author.name"
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collections;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> filter =
        ImmutableMap.of(
            "filterType", "url",
            "httpResponseBody", true,
            "value", "/api/",
            "matchType", "contains");
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url", "https://quotes.toscrape.com/scroll",
            "browserHtml", true,
            "networkCapture", Collections.singletonList(filter));
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          JsonArray captures = jsonObject.get("networkCapture").getAsJsonArray();
          JsonObject capture = captures.get(0).getAsJsonObject();
          byte[] bodyBytes =
              Base64.getDecoder().decode(capture.get("httpResponseBody").getAsString());
          String body = new String(bodyBytes, StandardCharsets.UTF_8);
          JsonObject data = JsonParser.parseString(body).getAsJsonObject();
          JsonObject quote = data.get("quotes").getAsJsonArray().get(0).getAsJsonObject();
          String authorName = quote.get("author").getAsJsonObject().get("name").getAsString();
          System.out.println(authorName);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://quotes.toscrape.com/scroll',
    browserHtml: true,
    networkCapture: [
      {
        filterType: 'url',
        httpResponseBody: true,
        value: '/api/',
        matchType: 'contains'
      }
    ]
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const capture = response.data.networkCapture[0]
  const data = JSON.parse(Buffer.from(capture.httpResponseBody, 'base64'))
  console.log(data.quotes[0].author.name)
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://quotes.toscrape.com/scroll',
        'browserHtml' => true,
        'networkCapture' => [
            [
                'filterType' => 'url',
                'httpResponseBody' => true,
                'value' => '/api/',
                'matchType' => 'contains',
            ],
        ],
    ],
]);
$api_response = json_decode($response->getBody());
$capture = $api_response->networkCapture[0];
$data = json_decode(base64_decode($capture->httpResponseBody));
echo $data->quotes[0]->author->name.PHP_EOL;
```

#### Python

```python
import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://quotes.toscrape.com/scroll",
        "browserHtml": True,
        "networkCapture": [
            {
                "filterType": "url",
                "httpResponseBody": True,
                "value": "/api/",
                "matchType": "contains",
            },
        ],
    },
)
capture = api_response.json()["networkCapture"][0]
data = json.loads(b64decode(capture["httpResponseBody"]).decode())
print(data["quotes"][0]["author"]["name"])
```

#### Python client

```python
import asyncio
import json
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://quotes.toscrape.com/scroll",
            "browserHtml": True,
            "networkCapture": [
                {
                    "filterType": "url",
                    "httpResponseBody": True,
                    "value": "/api/",
                    "matchType": "contains",
                },
            ],
        },
    )
    capture = api_response["networkCapture"][0]
    data = json.loads(b64decode(capture["httpResponseBody"]).decode())
    print(data["quotes"][0]["author"]["name"])


asyncio.run(main())
```

#### Scrapy

```python
import json
from base64 import b64decode

from scrapy import Request, Spider


class QuotesToScrapeComSpider(Spider):
    name = "quotes_toscrape_com"

    async def start(self):
        yield Request(
            "https://quotes.toscrape.com/scroll",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "networkCapture": [
                        {
                            "filterType": "url",
                            "httpResponseBody": True,
                            "value": "/api/",
                            "matchType": "contains",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        capture = response.raw_api_response["networkCapture"][0]
        data = json.loads(b64decode(capture["httpResponseBody"]).decode())
        print(data["quotes"][0]["author"]["name"])
```

Output:

```none
Albert Einstein
```

#### Sending multiple requests in parallel

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

var urls = new string[2];
urls[0] = "https://books.toscrape.com/catalogue/page-1.html";
urls[1] = "https://books.toscrape.com/catalogue/page-2.html";
var output = new List<HttpResponseMessage>();

var handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All,
    MaxConnectionsPerServer = 15
};
var client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var responseTasks = new List<Task<HttpResponseMessage>>();
foreach (var url in urls)
{
    var input = new Dictionary<string, object>(){
        {"url", url},
        {"browserHtml", true}
    };
    var inputJson = JsonSerializer.Serialize(input);
    var content = new StringContent(inputJson, Encoding.UTF8, "application/json");
    var responseTask = client.PostAsync("https://api.zyte.com/v1/extract", content);
    responseTasks.Add(responseTask);
}

while (responseTasks.Any())
{
    var responseTask = await Task.WhenAny(responseTasks);
    responseTasks.Remove(responseTask);
    var response = await responseTask;
    output.Add(response);
}
```

#### CLI client

input.jsonl

```json
{"url": "https://books.toscrape.com/catalogue/page-1.html", "browserHtml": true}
{"url": "https://books.toscrape.com/catalogue/page-2.html", "browserHtml": true}
```

```shell
zyte-api --n-conn 15 input.jsonl -o output.jsonl
```

#### curl

input.jsonl

```json
{"url": "https://books.toscrape.com/catalogue/page-1.html", "browserHtml": true}
{"url": "https://books.toscrape.com/catalogue/page-2.html", "browserHtml": true}
```

```shell
cat input.jsonl \
| xargs -P 15 -d\\n -n 1 \
bash -c "
    curl \
        --user YOUR_ZYTE_API_KEY: \
        --header 'Content-Type: application/json' \
        --data \"\$0\" \
        --compressed \
        https://api.zyte.com/v1/extract \
    | awk '{print \$1}' \
    >> output.jsonl
"
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
import org.apache.hc.client5.http.ssl.ClientTlsStrategyBuilder;
import org.apache.hc.core5.concurrent.FutureCallback;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.nio.ssl.TlsStrategy;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws ExecutionException, InterruptedException, IOException, ParseException {

    String[] urls = {
      "https://books.toscrape.com/catalogue/page-1.html",
      "https://books.toscrape.com/catalogue/page-2.html"
    };
    List<Future<SimpleHttpResponse>> futures = new ArrayList<>();
    List<String> output = new ArrayList<>();

    int concurrency = 15;

    // https://issues.apache.org/jira/browse/HTTPCLIENT-2219
    final TlsStrategy tlsStrategy =
        ClientTlsStrategyBuilder.create().useSystemProperties().build();

    PoolingAsyncClientConnectionManager connectionManager =
        PoolingAsyncClientConnectionManagerBuilder.create().setTlsStrategy(tlsStrategy).build();
    connectionManager.setMaxTotal(concurrency);
    connectionManager.setDefaultMaxPerRoute(concurrency);

    CloseableHttpAsyncClient client =
        HttpAsyncClients.custom().setConnectionManager(connectionManager).build();
    try {
      client.start();
      for (int i = 0; i < urls.length; i++) {
        Map<String, Object> parameters = ImmutableMap.of("url", urls[i], "browserHtml", true);
        String requestBody = new Gson().toJson(parameters);

        SimpleHttpRequest request =
            new SimpleHttpRequest("POST", "https://api.zyte.com/v1/extract");
        request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
        request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
        request.setBody(requestBody, ContentType.APPLICATION_JSON);

        final Future<SimpleHttpResponse> future =
            client.execute(
                request,
                new FutureCallback<SimpleHttpResponse>() {
                  public void completed(final SimpleHttpResponse response) {
                    String apiResponse = response.getBodyText();
                    JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
                    String browserHtml = jsonObject.get("browserHtml").getAsString();
                    output.add(browserHtml);
                  }

                  public void failed(final Exception ex) {}

                  public void cancelled() {}
                });
        futures.add(future);
      }
      for (int i = 0; i < futures.size(); i++) {
        futures.get(i).get();
      }
    } finally {
      client.close();
    }
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const { ConcurrencyManager } = require('axios-concurrency')
const axios =
  require('axios')

const urls = [
  'https://books.toscrape.com/catalogue/page-1.html',
  'https://books.toscrape.com/catalogue/page-2.html'
]
const output = []

const client = axios.create()
ConcurrencyManager(client, 15)

Promise.all(
  urls.map((url) =>
    client.post(
      'https://api.zyte.com/v1/extract',
      { url, browserHtml: true },
      {
        auth: { username: 'YOUR_ZYTE_API_KEY' }
      }
    ).then((response) => output.push(response.data))
  )
)
```

#### PHP

```php
<?php

$urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
];
$output = [];
$promises = [];

$client = new GuzzleHttp\Client();

foreach ($urls as $url) {
    $options = [
        'auth' => ['YOUR_ZYTE_API_KEY', ''],
        'headers' => ['Accept-Encoding' => 'gzip'],
        'json' => [
            'url' => $url,
            'browserHtml' => true,
        ],
    ];
    $request = new \GuzzleHttp\Psr7\Request('POST', 'https://api.zyte.com/v1/extract');
    global $promises;
    $promises[] = $client->sendAsync($request, $options)->then(function ($response) {
        global $output;
        $output[] = json_decode($response->getBody());
    });
}

foreach ($promises as $promise) {
    $promise->wait();
}
```

#### Python

```python
import asyncio

import aiohttp

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
]
output = []


async def extract(client, url):
    response = await client.post(
        "https://api.zyte.com/v1/extract",
        json={"url": url, "browserHtml": True},
        auth=aiohttp.BasicAuth("YOUR_ZYTE_API_KEY"),
    )
    output.append(await response.json())


async def main():
    connector = aiohttp.TCPConnector(limit_per_host=15)
    async with aiohttp.ClientSession(connector=connector) as client:
        await asyncio.gather(*[extract(client, url) for url in urls])


asyncio.run(main())
```

#### Python client

```python
import asyncio

from zyte_api import AsyncZyteAPI

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
]


async def main():
    client = AsyncZyteAPI(n_conn=15)
    queries = [{"url": url, "browserHtml": True} for url in urls]
    async with client.session() as session:
        for future in session.iter(queries):
            response = await future
            print(response)


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
]


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    custom_settings = {
        "CONCURRENT_REQUESTS": 15,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 15,
    }

    async def start(self):
        for url in urls:
            yield Request(
                url,
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                    },
                },
            )

    def parse(self, response):
        yield {
            "url": response.url,
            "browserHtml": response.text,
        }
```

Output:

```json
{"url": "https://books.toscrape.com/catalogue/page-1.html", "statusCode": 200, "browserHtml": "\n \n All products | Books to Scrape - Sandbox\n\n\n \n \n \n \n \n\n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n\n\n \n \n\n \n\n \n \n \n\n \n \n\n \n \n \n \n \n
\n
\n
\n
Books to Scrape We love being scraped!\n
\n\n \n
\n
\n
\n\n \n \n
\n
\n \n
    \n
  • \n Home\n
  • \n
  • All products
  • \n
\n\n
\n\n \n\n
\n \n
\n

All products

\n
\n \n\n \n\n\n\n
\n\n
\n\n\n
\n \n
\n\n \n
\n \n
\n \n \n
\n\n \n \n \n 1000 results - showing 1 to 20.\n \n \n \n \n
\n \n
\n
Warning! This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
\n\n
\n
    \n \n
  1. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"A\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    A Light in the ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £51.77

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  2. \n \n
  3. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Tipping\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Tipping the Velvet

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £53.74

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  4. \n \n
  5. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Soumission\"\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Soumission

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £50.10

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  6. \n \n
  7. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Sharp\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Sharp Objects

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £47.82

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  8. \n \n
  9. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Sapiens:\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Sapiens: A Brief History ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £54.23

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  10. \n \n
  11. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Requiem Red

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £22.65

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  12. \n \n
  13. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Dirty Little Secrets ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £33.34

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  14. \n \n
  15. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Coming Woman: A ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £17.93

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  16. \n \n
  17. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Boys in the ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £22.60

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  18. \n \n
  19. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Black Maria

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £52.15

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  20. \n \n
  21. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Starving\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Starving Hearts (Triangular Trade ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £13.99

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  22. \n \n
  23. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Shakespeare's\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Shakespeare's Sonnets

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £20.66

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  24. \n \n
  25. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Set\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Set Me Free

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £17.46

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  26. \n \n
  27. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Scott\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Scott Pilgrim's Precious Little ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £52.29

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  28. \n \n
  29. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Rip\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Rip it Up and ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £35.02

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  30. \n \n
  31. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Our\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Our Band Could Be ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £57.25

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  32. \n \n
  33. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Olio\"\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Olio

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £23.88

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  34. \n \n
  35. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Mesaerion:\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Mesaerion: The Best Science ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £37.59

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  36. \n \n
  37. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Libertarianism\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Libertarianism for Beginners

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £51.33

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  38. \n \n
  39. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"It's\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    It's Only the Himalayas

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £45.17

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  40. \n \n
\n \n\n\n\n
\n
    \n \n
  • \n \n Page 1 of 50\n \n
  • \n \n
  • next
  • \n \n
\n
\n\n\n
\n
\n \n\n\n
\n\n
\n
\n
\n\n\n \n
\n \n \n \n
\n\n\n \n \n \n \n \n \n \n \n\n\n \n \n \n \n \n \n \n\n \n \n\n\n \n \n \n\n \n\n\n \n \n\n \n \n \n \n\n", "echoData": "https://books.toscrape.com/catalogue/page-1.html"} {"url": "https://books.toscrape.com/catalogue/page-2.html", "statusCode": 200, "browserHtml": "\n \n All products | Books to Scrape - Sandbox\n\n\n \n \n \n \n \n\n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n\n\n \n \n\n \n\n \n \n \n\n \n \n\n \n \n \n \n \n
\n
\n
\n
Books to Scrape We love being scraped!\n
\n\n \n
\n
\n
\n\n \n \n
\n
\n \n
    \n
  • \n Home\n
  • \n
  • All products
  • \n
\n\n
\n\n \n\n
\n \n
\n

All products

\n
\n \n\n \n\n\n\n
\n\n
\n\n\n
\n \n
\n\n \n
\n \n
\n \n \n
\n\n \n \n \n 1000 results - showing 21 to 40.\n \n \n \n \n
\n \n
\n
Warning! This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
\n\n
\n
    \n \n
  1. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"In\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    In Her Wake

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £12.84

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  2. \n \n
  3. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"How\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    How Music Works

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £37.32

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  4. \n \n
  5. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Foolproof\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Foolproof Preserving: A Guide ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £30.52

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  6. \n \n
  7. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Chase\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Chase Me (Paris Nights ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £25.27

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  8. \n \n
  9. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Black\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Black Dust

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £34.53

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  10. \n \n
  11. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Birdsong:\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Birdsong: A Story in ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £54.64

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  12. \n \n
  13. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"America's\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    America's Cradle of Quarterbacks: ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £22.50

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  14. \n \n
  15. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Aladdin\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Aladdin and His Wonderful ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £53.13

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  16. \n \n
  17. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Worlds\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Worlds Elsewhere: Journeys Around ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £40.30

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  18. \n \n
  19. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Wall\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Wall and Piece

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £44.18

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  20. \n \n
  21. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Four Agreements: A ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £17.66

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  22. \n \n
  23. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Five Love Languages: ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £31.05

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  24. \n \n
  25. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Elephant Tree

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £23.82

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  26. \n \n
  27. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"The\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    The Bear and the ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £36.89

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  28. \n \n
  29. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Sophie's\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Sophie's World

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £15.94

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  30. \n \n
  31. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Penny\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Penny Maybe

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £33.29

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  32. \n \n
  33. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Maude\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Maude (1883-1993):She Grew Up ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £18.02

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  34. \n \n
  35. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"In\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    In a Dark, Dark ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £19.63

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  36. \n \n
  37. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"Behind\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    Behind Closed Doors

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £52.22

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  38. \n \n
  39. \n\n\n\n\n\n\n
    \n \n
    \n \n \n \"You\n \n \n
    \n \n\n \n \n

    \n \n \n \n \n \n

    \n \n \n\n \n

    You can't bury them ...

    \n \n\n \n
    \n \n\n\n\n\n\n\n \n

    £33.63

    \n \n\n

    \n \n \n In stock\n \n

    \n\n \n \n\n\n\n\n\n\n \n
    \n \n
    \n\n\n \n
    \n \n
    \n\n
  40. \n \n
\n \n\n\n\n
\n
    \n \n
  • previous
  • \n \n
  • \n \n Page 2 of 50\n \n
  • \n \n
  • next
  • \n \n
\n
\n\n\n
\n
\n \n\n\n
\n\n
\n
\n
\n\n\n \n
\n \n \n \n
\n\n\n \n \n \n \n \n \n \n \n\n\n \n \n \n \n \n \n \n\n \n \n\n\n \n \n \n\n \n\n\n \n \n\n \n \n \n \n\n", "echoData": "https://books.toscrape.com/catalogue/page-2.html"}
```

#### Getting browser HTML in proxy mode

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### curl

```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -H "Zyte-Browser-Html: true" \
    https://toscrape.com
```

#### C#

```cs
using System;
using System.Net;
using System.Net.Http;

var proxy = new WebProxy("http://api.zyte.com:8011", true);
proxy.Credentials = new NetworkCredential("YOUR_ZYTE_API_KEY", "");

var httpClientHandler = new HttpClientHandler
{
    Proxy = proxy,
};

var client = new HttpClient(handler: httpClientHandler, disposeHandler: true);
client.DefaultRequestHeaders.Add("Zyte-Browser-Html", "true");
var message = new HttpRequestMessage(HttpMethod.Get, "https://toscrape.com");
var response = client.Send(message);
var body = await response.Content.ReadAsStringAsync();
Console.WriteLine(body);
```

#### Java

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hc.client5.http.auth.AuthCache;
import org.apache.hc.client5.http.auth.AuthScope;
import org.apache.hc.client5.http.auth.CredentialsProvider;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.auth.BasicAuthCache;
import org.apache.hc.client5.http.impl.auth.BasicScheme;
import org.apache.hc.client5.http.impl.auth.CredentialsProviderBuilder;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.routing.DefaultProxyRoutePlanner;
import org.apache.hc.client5.http.protocol.HttpClientContext;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHost;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;

class Example {
  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    HttpHost proxy = new HttpHost("api.zyte.com", 8011);
    DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);

    CredentialsProvider credentialsProvider =
        CredentialsProviderBuilder.create()
            .add(new AuthScope(proxy), "YOUR_ZYTE_API_KEY", "".toCharArray())
            .build();

    AuthCache authCache = new BasicAuthCache();
    BasicScheme basicAuth = new BasicScheme();
    authCache.put(proxy, basicAuth);

    HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(credentialsProvider);
    context.setAuthCache(authCache);

    CloseableHttpClient client =
        HttpClients.custom()
            .setRoutePlanner(routePlanner)
            .setDefaultCredentialsProvider(credentialsProvider)
            .build();

    HttpGet request = new HttpGet("https://toscrape.com");
    request.setHeader("Zyte-Browser-Html", "true");

    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String httpResponseBody = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          System.out.println(httpResponseBody);
          return null;
        });
  }
}
```

#### JS

```js
const axios = require('axios')

axios
  .get(
    'https://toscrape.com',
    {
      headers: {
        'Zyte-Browser-Html': 'true'
      },
      proxy: {
        protocol: 'http',
        host: 'api.zyte.com',
        port: 8011,
        auth: {
          username: 'YOUR_ZYTE_API_KEY',
          password: ''
        }
      }
    }
  )
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://toscrape.com', [
    'headers' => [
        'Zyte-Browser-Html' => 'true',
    ],
    'proxy' => 'http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011',
]);
$http_response_body = (string) $response->getBody();
fwrite(STDOUT, $http_response_body);
```

#### Python

```python
import requests

response = requests.get(
    "https://toscrape.com",
    headers={
        "Zyte-Browser-Html": "true",
    },
    proxies={
        scheme: "http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Ruby

```ruby
# frozen_string_literal: true

require 'net/http'

url = URI('https://toscrape.com/')
proxy_host = 'api.zyte.com'
proxy_port = '8011'

http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port, 'YOUR_ZYTE_API_KEY', '')
http.use_ssl = true

request = Net::HTTP::Get.new(url)
request['Zyte-Browser-Html'] = 'true'

r = http.start do |h|
  h.request(request)
end
puts r.body
```

#### Scrapy

```python
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    async def start(self):
        yield Request("https://toscrape.com", headers={"Zyte-Browser-Html": "true"})

    def parse(self, response):
        print(response.text)
```

Output (first 5 lines):

```html
Scraping Sandbox
```

#### Using proxy mode

#### C#

```cs
using System;
using System.Net;
using System.Net.Http;

var proxy = new WebProxy("http://api.zyte.com:8011", true);
proxy.Credentials = new NetworkCredential("YOUR_ZYTE_API_KEY", "");

var httpClientHandler = new HttpClientHandler
{
    Proxy = proxy,
};

var client = new HttpClient(handler: httpClientHandler, disposeHandler: true);
var message = new HttpRequestMessage(HttpMethod.Get, "https://toscrape.com");
var response = client.Send(message);
var body = await response.Content.ReadAsStringAsync();
Console.WriteLine(body);
```

#### curl

```bash
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com
```

#### Java

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hc.client5.http.auth.AuthCache;
import org.apache.hc.client5.http.auth.AuthScope;
import org.apache.hc.client5.http.auth.CredentialsProvider;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.auth.BasicAuthCache;
import org.apache.hc.client5.http.impl.auth.BasicScheme;
import org.apache.hc.client5.http.impl.auth.CredentialsProviderBuilder;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.routing.DefaultProxyRoutePlanner;
import org.apache.hc.client5.http.protocol.HttpClientContext;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHost;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;

class Example {
  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    HttpHost proxy = new HttpHost("api.zyte.com", 8011);
    DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);

    CredentialsProvider credentialsProvider =
        CredentialsProviderBuilder.create()
            .add(new AuthScope(proxy), "YOUR_ZYTE_API_KEY", "".toCharArray())
            .build();

    AuthCache authCache = new BasicAuthCache();
    BasicScheme basicAuth = new BasicScheme();
    authCache.put(proxy, basicAuth);

    HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(credentialsProvider);
    context.setAuthCache(authCache);

    CloseableHttpClient client =
        HttpClients.custom()
            .setRoutePlanner(routePlanner)
            .setDefaultCredentialsProvider(credentialsProvider)
            .build();

    HttpGet request = new HttpGet("https://toscrape.com");

    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String httpResponseBody = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          System.out.println(httpResponseBody);
          return null;
        });
  }
}
```

#### JS

```js
const axios = require('axios')

axios
  .get(
    'https://toscrape.com',
    {
      proxy: {
        protocol: 'http',
        host: 'api.zyte.com',
        port: 8011,
        auth: {
          username: 'YOUR_ZYTE_API_KEY',
          password: ''
        }
      }
    }
  )
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://toscrape.com', [
    'proxy' => 'http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011',
]);
$http_response_body = (string) $response->getBody();
fwrite(STDOUT, $http_response_body);
```

#### Python

> ###### NOTE
>
> You need to install and configure our CA certificate for
> the requests library.

```python
import requests

response = requests.get(
    "https://toscrape.com",
    proxies={
        scheme: "http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Ruby

```ruby
# frozen_string_literal: true

require 'net/http'

url = URI('https://toscrape.com/')
proxy_host = 'api.zyte.com'
proxy_port = '8011'

http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port, 'YOUR_ZYTE_API_KEY', '')
http.use_ssl = true

r = http.start do |h|
  h.request(Net::HTTP::Get.new(url))
end
puts r.body
```

#### Scrapy

When using [scrapy-zyte-smartproxy](https://github.com/scrapy-plugins/scrapy-zyte-smartproxy), set the `ZYTE_SMARTPROXY_URL` setting to `"http://api.zyte.com:8011"` and the `ZYTE_SMARTPROXY_APIKEY` setting to [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access).

> ###### NOTE
>
> **Important**: Use your **Zyte API key** here, not a Scrapy Cloud API key. Make sure you get it from the Zyte API access page.

You can then continue using Scrapy as usual, and all requests are proxied through Zyte API automatically.
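
As a rough sketch, the two settings named above could sit in a project's `settings.py` as follows. The middleware entry, its priority of `610`, and the `ZYTE_SMARTPROXY_ENABLED` flag are not mentioned in the text above; they are assumptions based on how the scrapy-zyte-smartproxy plugin is usually enabled, so check the plugin's README before relying on them.

```python
# settings.py (sketch). Only ZYTE_SMARTPROXY_URL and ZYTE_SMARTPROXY_APIKEY
# come from the text above; the remaining entries are assumptions.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_URL = "http://api.zyte.com:8011"
ZYTE_SMARTPROXY_APIKEY = "YOUR_ZYTE_API_KEY"  # Zyte API key, not a Scrapy Cloud key
```

With settings along these lines in place, a spider such as the following needs no proxy-specific code.
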
```python
from scrapy import Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        print(response.text)
```

#### Using the HTTPS endpoint of proxy mode

#### curl

```bash
curl \
    --proxy https://api.zyte.com:8014 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com
```

#### JS

```js
const HttpsProxyAgent = require('https-proxy-agent')

const httpsAgent = new HttpsProxyAgent.HttpsProxyAgent('https://YOUR_ZYTE_API_KEY:@api.zyte.com:8014')
const axiosDefaultConfig = { httpsAgent }
const axios = require('axios').create(axiosDefaultConfig)

axios
  .get('https://toscrape.com')
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### Python

```python
import requests

response = requests.get(
    "https://toscrape.com",
    proxies={
        scheme: "https://YOUR_ZYTE_API_KEY:@api.zyte.com:8014"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Sending arbitrary bytes in an HTTP request

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
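
The `httpRequestBody` value used in the examples of this section, `Zm9v`, is the Base64 encoding of the bytes `foo`. The following is a minimal sketch of producing that value from arbitrary bytes with the Python standard library; the helper name `encode_request_body` is hypothetical and not part of any Zyte library.

```python
from base64 import b64decode, b64encode


def encode_request_body(raw: bytes) -> str:
    """Return raw bytes as a Base64 string that can be embedded in JSON."""
    return b64encode(raw).decode("ascii")


# Request parameters mirroring the examples in this section.
payload = {
    "url": "https://httpbin.org/anything",
    "httpResponseBody": True,
    "httpRequestMethod": "POST",
    "httpRequestBody": encode_request_body(b"foo"),
}

print(payload["httpRequestBody"])  # Zm9v
# Decoding the value gives back the original bytes.
print(b64decode(payload["httpRequestBody"]))  # b'foo'
```
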
#### C#

```cs
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://httpbin.org/anything"},
    {"httpResponseBody", true},
    {"httpRequestMethod", "POST"},
    {"httpRequestBody", "Zm9v"}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody);

var responseData = JsonDocument.Parse(httpResponseBody);
var requestBody = responseData.RootElement.GetProperty("data").ToString();
Console.WriteLine(requestBody);
```

#### CLI client

input.jsonl

```json
{"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestBody": "Zm9v"}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    | jq --raw-output .data
```

#### curl

input.json

```json
{
    "url": "https://httpbin.org/anything",
    "httpResponseBody": true,
    "httpRequestMethod": "POST",
    "httpRequestBody": "Zm9v"
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    | jq --raw-output .data
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url", "https://httpbin.org/anything",
            "httpResponseBody", true,
            "httpRequestMethod", "POST",
            "httpRequestBody", "Zm9v");
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
          String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
          JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject();
          String body = data.get("data").getAsString();
          System.out.println(body);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://httpbin.org/anything',
    httpResponseBody: true,
    httpRequestMethod: 'POST',
    httpRequestBody: 'Zm9v'
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
  const body = JSON.parse(httpResponseBody).data
  console.log(body)
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://httpbin.org/anything',
        'httpResponseBody' => true,
        'httpRequestMethod' => 'POST',
        'httpRequestBody' => 'Zm9v',
    ],
]);
$data = json_decode($response->getBody());
$http_response_body = base64_decode($data->httpResponseBody);
$body = json_decode($http_response_body)->data;
echo $body.PHP_EOL;
```

#### Proxy mode

In proxy mode, the body of your request is forwarded automatically, whether it is plain text or binary.
```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -X POST \
    -H "Content-Type: application/octet-stream" \
    --data foo \
    https://httpbin.org/anything \
    | jq .data
```

#### Python

```python
import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://httpbin.org/anything",
        "httpResponseBody": True,
        "httpRequestMethod": "POST",
        "httpRequestBody": "Zm9v",
    },
)
http_response_body = b64decode(api_response.json()["httpResponseBody"])
body: str = json.loads(http_response_body)["data"]
print(body)
```

#### Python client

```python
import asyncio
import json
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://httpbin.org/anything",
            "httpResponseBody": True,
            "httpRequestMethod": "POST",
            "httpRequestBody": "Zm9v",
        }
    )
    http_response_body: bytes = b64decode(api_response["httpResponseBody"])
    body = json.loads(http_response_body)["data"]
    print(body)


asyncio.run(main())
```

#### Scrapy

```python
import json

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = "httpbin_org"

    async def start(self):
        yield Request(
            "https://httpbin.org/anything",
            method="POST",
            body=b"foo",
        )

    def parse(self, response):
        body = json.loads(response.body)["data"]
        print(body)
```

Output:

```none
foo
```

#### Sending cookies

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
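
Each entry of the `requestCookies` list used in this section is an object with `name`, `value`, and `domain` keys. As a small sketch, a plain name-to-value mapping can be expanded into that shape like this; the helper `build_request_cookies` is hypothetical and only illustrates the structure.

```python
from typing import Dict, List


def build_request_cookies(cookies: Dict[str, str], domain: str) -> List[Dict[str, str]]:
    """Expand a name -> value mapping into cookie objects for a single domain."""
    return [
        {"name": name, "value": value, "domain": domain}
        for name, value in cookies.items()
    ]


# Request parameters mirroring the examples in this section.
payload = {
    "url": "https://httpbin.org/cookies",
    "httpResponseBody": True,
    "requestCookies": build_request_cookies({"foo": "bar"}, "httpbin.org"),
}
print(payload["requestCookies"])
```
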
The following code example sends a cookie to [httpbin.org](https://httpbin.org) and prints the cookies that [httpbin.org](https://httpbin.org) reports having received:

#### C#

```cs
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://httpbin.org/cookies"},
    {"httpResponseBody", true},
    {
        "requestCookies",
        new List<Dictionary<string, string>>()
        {
            new Dictionary<string, string>()
            {
                {"name", "foo"},
                {"value", "bar"},
                {"domain", "httpbin.org"}
            }
        }
    }
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody);
var result = System.Text.Encoding.UTF8.GetString(httpResponseBody);
Console.WriteLine(result);
```

#### CLI client

input.jsonl

```json
{"url": "https://httpbin.org/cookies", "httpResponseBody": true, "requestCookies": [{"name": "foo", "value": "bar", "domain": "httpbin.org"}]}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .httpResponseBody \
    | base64 --decode
```

#### curl

input.json

```json
{
    "url": "https://httpbin.org/cookies",
    "httpResponseBody": true,
    "requestCookies": [
        {
            "name": "foo",
            "value": "bar",
            "domain": "httpbin.org"
        }
    ]
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .httpResponseBody \
    | base64 --decode
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collections;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, String> cookies =
        ImmutableMap.of("name", "foo", "value", "bar", "domain", "httpbin.org");
    Map<String, Object> parameters =
        ImmutableMap.of(
            "url", "https://httpbin.org/cookies",
            "httpResponseBody", true,
            "requestCookies", Collections.singletonList(cookies));
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
          String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
          System.out.println(httpResponseBody);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://httpbin.org/cookies',
    httpResponseBody: true,
    requestCookies: [
      {
        name: 'foo',
        value: 'bar',
        domain: 'httpbin.org'
      }
    ]
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
  console.log(httpResponseBody.toString())
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://httpbin.org/cookies',
        'httpResponseBody' => true,
        'requestCookies' => [
            [
                'name' => 'foo',
                'value' => 'bar',
                'domain' => 'httpbin.org',
            ],
        ],
    ],
]);
$api = json_decode($response->getBody());
$http_response_body = base64_decode($api->httpResponseBody);
echo $http_response_body;
```

#### Proxy mode

In proxy mode, the `Cookie` header of your request is used automatically to set cookies for the domain of the target URL.

> ###### NOTE
>
> Setting cookies for additional domains is not supported.
```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -H "Cookie: foo=bar" \
    https://httpbin.org/cookies
```

#### Python

```python
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://httpbin.org/cookies",
        "httpResponseBody": True,
        "requestCookies": [
            {
                "name": "foo",
                "value": "bar",
                "domain": "httpbin.org",
            },
        ],
    },
)
http_response_body = b64decode(api_response.json()["httpResponseBody"])
print(http_response_body.decode())
```

#### Python client

```python
import asyncio
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://httpbin.org/cookies",
            "httpResponseBody": True,
            "requestCookies": [
                {
                    "name": "foo",
                    "value": "bar",
                    "domain": "httpbin.org",
                },
            ],
        }
    )
    http_response_body = b64decode(api_response["httpResponseBody"]).decode()
    print(http_response_body)


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = "httpbin_org"

    async def start(self):
        yield Request(
            "https://httpbin.org/cookies",
            meta={
                "zyte_api_automap": {
                    "requestCookies": [
                        {
                            "name": "foo",
                            "value": "bar",
                            "domain": "httpbin.org",
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        print(response.text)
```

Output:

```json
{
  "cookies": {
    "foo": "bar"
  }
}
```

#### Sending text (Unicode) in an HTTP request

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
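
The `httpRequestText` value used in the examples of this section is itself a JSON document carried as a string, which is why it appears with escaped quotes inside the JSON input files. The following is a minimal sketch of building that string with `json.dumps` instead of escaping it by hand; the variable names are illustrative only.

```python
import json

# The text to send as the request body: a JSON document serialized to a string.
request_text = json.dumps({"foo": "bar"})

# Request parameters mirroring the examples in this section.
payload = {
    "url": "https://httpbin.org/anything",
    "httpResponseBody": True,
    "httpRequestMethod": "POST",
    "httpRequestText": request_text,
}

# Serializing the parameters escapes the inner quotes automatically.
serialized = json.dumps(payload)
print(request_text)
print(serialized)
```
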
#### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary(){ {"url", "https://httpbin.org/anything"}, {"httpResponseBody", true}, {"httpRequestMethod", "POST"}, {"httpRequestText", "{\"foo\": \"bar\"}"} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody); var responseData = JsonDocument.Parse(httpResponseBody); var requestBody = responseData.RootElement.GetProperty("data").ToString(); Console.WriteLine(requestBody); ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestText": "{\"foo\": \"bar\"}"} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .data ``` #### curl input.json ```json { "url": "https://httpbin.org/anything", "httpResponseBody": true, "httpRequestMethod": "POST", "httpRequestText": "{\"foo\": \"bar\"}" } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: 
application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .data ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://httpbin.org/anything", "httpResponseBody", true, "httpRequestMethod", "POST", "httpRequestText", "{\"foo\": \"bar\"}"); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = 
              jsonObject.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
          String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
          JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject();
          String body = data.get("data").getAsString();
          System.out.println(body);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://httpbin.org/anything',
    httpResponseBody: true,
    httpRequestMethod: 'POST',
    httpRequestText: '{"foo": "bar"}'
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
  const body = JSON.parse(httpResponseBody).data
  console.log(body)
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://httpbin.org/anything',
        'httpResponseBody' => true,
        'httpRequestMethod' => 'POST',
        'httpRequestText' => '{"foo": "bar"}',
    ],
]);
$data = json_decode($response->getBody());
$http_response_body = base64_decode($data->httpResponseBody);
$body = json_decode($http_response_body)->data;
echo $body.PHP_EOL;
```

#### Proxy mode

With the proxy mode, the body of your request is forwarded automatically, be it plain text or binary.
```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -X POST \
    -H "Content-Type: application/json" \
    --data '{"foo": "bar"}' \
    https://httpbin.org/anything \
| jq .data
```

#### Python

```python
import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://httpbin.org/anything",
        "httpResponseBody": True,
        "httpRequestMethod": "POST",
        "httpRequestText": '{"foo": "bar"}',
    },
)
http_response_body = b64decode(api_response.json()["httpResponseBody"])
body: str = json.loads(http_response_body)["data"]
print(body)
```

#### Python client

```python
import asyncio
import json
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://httpbin.org/anything",
            "httpResponseBody": True,
            "httpRequestMethod": "POST",
            "httpRequestText": '{"foo": "bar"}',
        }
    )
    http_response_body = b64decode(api_response["httpResponseBody"])
    body = json.loads(http_response_body)["data"]
    print(body)


asyncio.run(main())
```

#### Scrapy

```python
import json

from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = "httpbin_org"

    async def start(self):
        yield Request(
            "https://httpbin.org/anything",
            method="POST",
            body='{"foo": "bar"}',
        )

    def parse(self, response):
        body = json.loads(response.body)["data"]
        print(body)
```

Output:

```json
{"foo": "bar"}
```

#### Getting response headers

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.
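Whichever of the clients below you use, the `httpResponseHeaders` field comes back as a list of objects with `name` and `value` keys. A header name can appear more than once in an HTTP response (`Set-Cookie` is the usual case), so a lookup table keyed by name is safer when it maps each name to a list of values. The following is a minimal Python sketch of that idea, not part of Zyte API: `group_headers` is a hypothetical helper, and the two `Set-Cookie` entries in the sample are made up for illustration.

```python
from collections import defaultdict


def group_headers(http_response_headers):
    """Group header values by lowercased header name.

    Hypothetical helper: takes the list found under "httpResponseHeaders"
    and returns a dict that maps each name to the list of its values.
    """
    grouped = defaultdict(list)
    for header in http_response_headers:
        # HTTP header names are case-insensitive, so normalize them.
        grouped[header["name"].lower()].append(header["value"])
    return dict(grouped)


# Sample input in the shape of the httpResponseHeaders field.
# The Set-Cookie entries are invented for this sketch.
sample_headers = [
    {"name": "date", "value": "Fri, 25 Aug 2023 07:08:05 GMT"},
    {"name": "Set-Cookie", "value": "a=1"},
    {"name": "set-cookie", "value": "b=2"},
]

grouped = group_headers(sample_headers)
print(grouped["date"][0])
print(grouped["set-cookie"])
```

Keeping a list per name avoids silently dropping, or failing on, repeated headers, at the cost of indexing with `[0]` when a header is known to occur once.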
#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com"},
    {"httpResponseHeaders", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

// httpResponseHeaders is an array of objects with "name" and "value" keys.
var data = JsonDocument.Parse(body);
var headerEnumerator = data.RootElement.GetProperty("httpResponseHeaders").EnumerateArray();
var headers = new Dictionary<string, string>();
while (headerEnumerator.MoveNext())
{
    headers.Add(
        headerEnumerator.Current.GetProperty("name").ToString(),
        headerEnumerator.Current.GetProperty("value").ToString()
    );
}
```

#### CLI client

input.jsonl

```json
{"url": "https://toscrape.com", "httpResponseHeaders": true}
```

```shell
zyte-api input.jsonl \
    | jq .httpResponseHeaders
```

#### curl

input.json

```json
{
    "url": "https://toscrape.com",
    "httpResponseHeaders": true
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq .httpResponseHeaders
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import
com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://toscrape.com", "browserHtml", true, "httpResponseHeaders", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); JsonArray httpResponseHeaders = jsonObject.get("httpResponseHeaders").getAsJsonArray(); Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(httpResponseHeaders)); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### 
JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://toscrape.com',
    httpResponseHeaders: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseHeaders = response.data.httpResponseHeaders
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://toscrape.com',
        'httpResponseHeaders' => true,
    ],
]);
$api = json_decode($response->getBody());
$http_response_headers = $api->httpResponseHeaders;
```

#### Proxy mode

With the proxy mode, response headers are always included in the HTTP response, so there is no need to ask for them explicitly.

#### Python

```python
import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "httpResponseHeaders": True,
    },
)
http_response_headers = api_response.json()["httpResponseHeaders"]
```

#### Python client

```python
import asyncio
import json

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://toscrape.com",
            "httpResponseHeaders": True,
        }
    )
    http_response_headers = api_response["httpResponseHeaders"]
    print(json.dumps(http_response_headers, indent=2))


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider


class ToScrapeComSpider(Spider):
    name = "toscrape_com"

    async def start(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "httpResponseBody": False,
                    "httpResponseHeaders": True,
                },
            },
        )

    def parse(self, response):
        headers = response.headers
```

> ###### NOTE
>
> In transparent mode,
> [httpResponseHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseHeaders) is sent by default for
> httpResponseBody requests, but sending it
> explicitly is still recommended, as future versions of
> scrapy-zyte-api may stop sending it
> by default.

Output (first 5 lines):

```json
[
  {
    "name": "date",
    "value": "Fri, 25 Aug 2023 07:08:05 GMT"
  },
```

#### Taking a screenshot

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com"},
    {"screenshot", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

// The screenshot field is Base64-encoded image data.
var data = JsonDocument.Parse(body);
var base64Screenshot = data.RootElement.GetProperty("screenshot").ToString();
var screenshot = System.Convert.FromBase64String(base64Screenshot);
```

#### CLI client

input.jsonl

```json
{"url": "https://toscrape.com", "screenshot": true}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .screenshot \
    | base64 --decode \
    > screenshot.jpg
```

#### curl

input.json

```json
{
    "url": "https://toscrape.com",
    "screenshot": true
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .screenshot \
    | base64 --decode \
    > screenshot.jpg
```

#### Java
```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.FileOutputStream; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of("url", "https://toscrape.com", "screenshot", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64Screenshot = jsonObject.get("screenshot").getAsString(); byte[] screenshot = Base64.getDecoder().decode(base64Screenshot); try (FileOutputStream fos = new FileOutputStream("screenshot.jpg")) { fos.write(screenshot); } return null; }); } private static String 
      buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://toscrape.com',
    screenshot: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const screenshot = Buffer.from(response.data.screenshot, 'base64')
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://toscrape.com',
        'screenshot' => true,
    ],
]);
$api = json_decode($response->getBody());
$screenshot = base64_decode($api->screenshot);
```

#### Python

```python
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "screenshot": True,
    },
)
screenshot: bytes = b64decode(api_response.json()["screenshot"])
```

#### Python client

```python
import asyncio
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://toscrape.com",
            "screenshot": True,
        }
    )
    screenshot = b64decode(api_response["screenshot"])
    with open("screenshot.jpg", "wb") as f:
        f.write(screenshot)


asyncio.run(main())
```

#### Scrapy

```python
from base64 import b64decode

from scrapy import Request, Spider


class ToScrapeComSpider(Spider):
    name = "toscrape_com"

    async def start(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "screenshot": True,
                },
            },
        )

    def parse(self, response):
        screenshot: bytes = b64decode(response.raw_api_response["screenshot"])
```

Output:

![](zyte-api/usage/code-examples/output/screenshot.jpg)

#### Start a client-managed session with a browser request and reuse it in an HTTP request

Start a session with a browser request to the home page of a website, and reuse that session for an HTTP request to a different URL of that website.

#### C#

```cs
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

// The client chooses the session ID and sends it with every request of the session.
var sessionId = Guid.NewGuid().ToString();

// First request: a browser request that starts the session.
var browserInput = new Dictionary<string, object>(){
    {"url", "https://toscrape.com/"},
    {"browserHtml", true},
    {
        "session",
        new Dictionary<string, string>()
        {
            {"id", sessionId}
        }
    }
};
var browserInputJson = JsonSerializer.Serialize(browserInput);
var browserContent = new StringContent(browserInputJson, Encoding.UTF8, "application/json");
await client.PostAsync("https://api.zyte.com/v1/extract", browserContent);

// Second request: an HTTP request to a different URL that reuses the session.
var httpInput = new Dictionary<string, object>(){
    {"url", "https://books.toscrape.com/"},
    {"httpResponseBody", true},
    {
        "session",
        new Dictionary<string, string>()
        {
            {"id", sessionId}
        }
    }
};
var httpInputJson = JsonSerializer.Serialize(httpInput);
var httpContent = new StringContent(httpInputJson, Encoding.UTF8, "application/json");
HttpResponseMessage httpResponse = await client.PostAsync("https://api.zyte.com/v1/extract", httpContent);
var httpResponseBody = await httpResponse.Content.ReadAsByteArrayAsync();
var httpData = JsonDocument.Parse(httpResponseBody);
var base64HttpResponseBodyField = httpData.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBodyField = System.Convert.FromBase64String(base64HttpResponseBodyField);
var result = System.Text.Encoding.UTF8.GetString(httpResponseBodyField);
Console.WriteLine(result);
```

#### Java

```java
import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import java.util.UUID; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { String sessionId = UUID.randomUUID().toString(); Map session = ImmutableMap.of("id", sessionId); Map browserParameters = ImmutableMap.of("url", "https://toscrape.com/", "browserHtml", true, "session", session); String browserRequestBody = new Gson().toJson(browserParameters); HttpPost browserRequest = new HttpPost("https://api.zyte.com/v1/extract"); browserRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); browserRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); browserRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); browserRequest.setEntity(new StringEntity(browserRequestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( browserRequest, browserResponse -> { Map httpParameters = ImmutableMap.of( "url", "https://books.toscrape.com/", "httpResponseBody", true, "session", session); String httpRequestBody = new Gson().toJson(httpParameters); HttpPost httpRequest = new HttpPost("https://api.zyte.com/v1/extract"); 
httpRequest.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); httpRequest.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); httpRequest.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); httpRequest.setEntity(new StringEntity(httpRequestBody)); client.execute( httpRequest, httpResponse -> { HttpEntity httpEntity = httpResponse.getEntity(); String httpApiResponse = EntityUtils.toString(httpEntity, StandardCharsets.UTF_8); JsonObject httpJsonObject = JsonParser.parseString(httpApiResponse).getAsJsonObject(); String base64HttpResponseBody = httpJsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const crypto = require('crypto') const sessionId = String(crypto.randomUUID()) axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com/', browserHtml: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((browserResponse) => { axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://books.toscrape.com/', httpResponseBody: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((httpResponse) => { const httpResponseBody = Buffer.from( httpResponse.data.httpResponseBody, 'base64' ) console.log(httpResponseBody.toString()) }) }) ``` #### PHP ```php request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://toscrape.com/', 'browserHtml' => true, 'session' => ['id' => $session_id], ], ]); 
$http_response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://books.toscrape.com/', 'httpResponseBody' => true, 'session' => ['id' => $session_id], ], ]); $http_data = json_decode($http_response->getBody()); $http_response_body = base64_decode($http_data->httpResponseBody); echo $http_response_body; ``` #### Python ```python from base64 import b64decode from uuid import uuid4 import requests session_id = str(uuid4()) browser_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com/", "browserHtml": True, "session": {"id": session_id}, }, ) http_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://books.toscrape.com/", "httpResponseBody": True, "session": {"id": session_id}, }, ) http_response_body = b64decode(http_response.json()["httpResponseBody"]) print(http_response_body.decode()) ``` #### Python client ```python import asyncio from base64 import b64decode from uuid import uuid4 from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() session_id = str(uuid4()) browser_response = await client.get( { "url": "https://toscrape.com/", "browserHtml": True, "session": {"id": session_id}, } ) http_response = await client.get( { "url": "https://books.toscrape.com/", "httpResponseBody": True, "session": {"id": session_id}, } ) http_response_body = b64decode(http_response["httpResponseBody"]).decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy ```python from uuid import uuid4 from scrapy import Request, Spider class ToScrapeComSpider(Spider): name = "toscrape_com" async def start(self): session_id = str(uuid4()) yield Request( "https://toscrape.com/", callback=self.parse_browser, cb_kwargs={"session_id": session_id}, meta={ "zyte_api_automap": { "browserHtml": True, "session": 
{"id": session_id},
                },
            },
        )

    def parse_browser(self, response, session_id):
        yield response.follow(
            "https://books.toscrape.com/",
            callback=self.parse_http,
            meta={
                "zyte_api_automap": {
                    "session": {"id": session_id},
                },
            },
        )

    def parse_http(self, response):
        print(response.text)
```

#### Send HTTP requests with server-managed sessions started with browser requests

Set a no-op action in [sessionContextParameters](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/sessionContextParameters) to force sessions to start with a browser request, while the requests that use those sessions remain HTTP requests.

#### C#

```cs
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using System.Xml.XPath;
using HtmlAgilityPack;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com/"},
    {"httpResponseBody", true},
    {
        "sessionContext",
        new List<Dictionary<string, object>>()
        {
            new Dictionary<string, object>()
            {
                {"name", "id"},
                {"value", "browser"}
            }
        }
    },
    {
        "sessionContextParameters",
        new Dictionary<string, object>()
        {
            {
                "actions",
                new List<Dictionary<string, object>>()
                {
                    // A zero-length wait is a no-op browser action.
                    new Dictionary<string, object>()
                    {
                        {"action", "waitForTimeout"},
                        {"timeout", 0},
                    }
                }
            }
        }
    }
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody =
data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes = System.Convert.FromBase64String(base64HttpResponseBody); var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes); Console.WriteLine(httpResponseBody); ``` #### CLI client input.jsonl ```json {"url": "https://toscrape.com/", "httpResponseBody": true, "sessionContext": [{"name": "id", "value": "browser"}], "sessionContextParameters": {"actions": [{"action": "waitForTimeout", "timeout": 0}]}} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### curl input.json ```json { "url": "https://toscrape.com/", "httpResponseBody": true, "sessionContext": [ { "name": "id", "value": "browser" } ], "sessionContextParameters": { "actions": [ { "action": "waitForTimeout", "timeout": 0 } ] } } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### Java ```java import com.google.common.collect.ImmutableList; import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) 
throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "https://toscrape.com/", "httpResponseBody", true, "sessionContext", ImmutableList.of(ImmutableMap.of("name", "id", "value", "browser")), "sessionContextParameters", ImmutableMap.of( "actions", ImmutableList.of(ImmutableMap.of("action", "waitForTimeout", "timeout", 0)))); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com/', httpResponseBody: true, sessionContext: [ { name: 'id', value: 'browser' } ], sessionContextParameters: { actions: [ { action: 'waitForTimeout', timeout: 0 } ] } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) 
  console.log(httpResponseBody.toString())
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://toscrape.com/',
        'httpResponseBody' => true,
        'sessionContext' => [
            [
                'name' => 'id',
                'value' => 'browser',
            ],
        ],
        'sessionContextParameters' => [
            'actions' => [
                [
                    'action' => 'waitForTimeout',
                    'timeout' => 0,
                ],
            ],
        ],
    ],
]);
$data = json_decode($response->getBody());
$http_response_body = base64_decode($data->httpResponseBody);
echo $http_response_body.PHP_EOL;
```

#### Python

```python
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://toscrape.com/",
        "httpResponseBody": True,
        "sessionContext": [{"name": "id", "value": "browser"}],
        "sessionContextParameters": {
            "actions": [
                {
                    "action": "waitForTimeout",
                    "timeout": 0,
                },
            ],
        },
    },
)
http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"])
http_response_body = http_response_body_bytes.decode()
print(http_response_body)
```

#### Python client

```python
import asyncio
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    http_response = await client.get(
        {
            "url": "https://toscrape.com/",
            "httpResponseBody": True,
            "sessionContext": [{"name": "id", "value": "browser"}],
            "sessionContextParameters": {
                "actions": [
                    {
                        "action": "waitForTimeout",
                        "timeout": 0,
                    },
                ],
            },
        }
    )
    http_response_body = b64decode(http_response["httpResponseBody"]).decode()
    print(http_response_body)


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider


class HTTPBinOrgSpider(Spider):
    name = "httpbin_org"

    async def start(self):
        yield Request(
            "https://toscrape.com/",
            meta={
                "zyte_api_automap": {
                    "sessionContext": [
                        {
                            "name": "id",
                            "value": "browser",
                        },
                    ],
                    "sessionContextParameters": {
                        "actions": [
                            {
                                "action": "waitForTimeout",
                                "timeout": 0,
                            },
                        ],
                    },
                },
            },
        )

    def parse(self, response):
        print(response.text)
```

#### Send HTTP requests with server-managed sessions started with a browser action that visits a specific URL

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using System.Xml.XPath;
using HtmlAgilityPack;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "http://httpbin.org/cookies"},
    {"httpResponseBody", true},
    {
        "sessionContext",
        new List<Dictionary<string, object>>()
        {
            new Dictionary<string, object>()
            {
                {"name", "id"},
                {"value", "cookies"}
            }
        }
    },
    {
        "sessionContextParameters",
        new Dictionary<string, object>()
        {
            {
                "actions",
                new List<Dictionary<string, object>>()
                {
                    // The session starts by visiting a URL that sets a cookie.
                    new Dictionary<string, object>()
                    {
                        {"action", "goto"},
                        {"url", "http://httpbin.org/cookies/set/foo/bar"},
                    }
                }
            }
        }
    }
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBodyBytes = System.Convert.FromBase64String(base64HttpResponseBody);
var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes);
Console.WriteLine(httpResponseBody);
```
#### CLI client input.jsonl ```json {"url": "http://httpbin.org/cookies", "httpResponseBody": true, "sessionContext": [{"name": "id", "value": "cookies"}], "sessionContextParameters": {"actions": [{"action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar"}]}} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### curl input.json ```json { "url": "http://httpbin.org/cookies", "httpResponseBody": true, "sessionContext": [ { "name": "id", "value": "cookies" } ], "sessionContextParameters": { "actions": [ { "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar" } ] } } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` #### Java ```java import com.google.common.collect.ImmutableList; import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of( "url", "http://httpbin.org/cookies", "httpResponseBody", true, "sessionContext", 
ImmutableList.of(ImmutableMap.of("name", "id", "value", "cookies")), "sessionContextParameters", ImmutableMap.of( "actions", ImmutableList.of( ImmutableMap.of( "action", "goto", "url", "http://httpbin.org/cookies/set/foo/bar")))); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); System.out.println(httpResponseBody); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'http://httpbin.org/cookies', httpResponseBody: true, sessionContext: [ { name: 'id', value: 'cookies' } ], sessionContextParameters: { actions: [ { action: 'goto', url: 'http://httpbin.org/cookies/set/foo/bar' } ] } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) console.log(httpResponseBody.toString()) }) ``` #### PHP ```php request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => 
['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'http://httpbin.org/cookies', 'httpResponseBody' => true, 'sessionContext' => [ [ 'name' => 'id', 'value' => 'cookies', ], ], 'sessionContextParameters' => [ 'actions' => [ [ 'action' => 'goto', 'url' => 'http://httpbin.org/cookies/set/foo/bar', ], ], ], ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); echo $http_response_body.PHP_EOL; ``` #### Python ```python from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "http://httpbin.org/cookies", "httpResponseBody": True, "sessionContext": [ { "name": "id", "value": "cookies", }, ], "sessionContextParameters": { "actions": [ { "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar", }, ], }, }, ) http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"]) http_response_body = http_response_body_bytes.decode() print(http_response_body) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "http://httpbin.org/cookies", "httpResponseBody": True, "sessionContext": [ { "name": "id", "value": "cookies", }, ], "sessionContextParameters": { "actions": [ { "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar", }, ], }, }, ) http_response_body_bytes = b64decode(api_response["httpResponseBody"]) http_response_body = http_response_body_bytes.decode() print(http_response_body) asyncio.run(main()) ``` #### Scrapy > ###### TIP > > scrapy-zyte-api also provides its own session management > API, similar to that of > server-managed sessions, but > built on top of client-managed sessions. 
```python from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): yield Request( "http://httpbin.org/cookies", meta={ "zyte_api_automap": { "sessionContext": [ { "name": "id", "value": "cookies", }, ], "sessionContextParameters": { "actions": [ { "action": "goto", "url": "http://httpbin.org/cookies/set/foo/bar", }, ], }, }, }, ) def parse(self, response): print(response.text) ``` Output: ```json { "cookies": { "foo": "bar" } } ``` #### Send 2 consecutive requests through the same IP address using a client-managed session > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var sessionId = Guid.NewGuid().ToString(); for (int i = 0; i < 2; i++) { var input = new Dictionary(){ {"url", "https://httpbin.org/ip"}, {"httpResponseBody", true}, { "session", new Dictionary() { {"id", sessionId} } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString(); var httpResponseBodyBytes = 
System.Convert.FromBase64String(base64HttpResponseBody); var httpResponseBody = System.Text.Encoding.UTF8.GetString(httpResponseBodyBytes); var responseData = JsonDocument.Parse(httpResponseBody); var ipAddress = responseData.RootElement.GetProperty("origin").ToString(); Console.WriteLine(ipAddress); } ``` #### CLI client input.jsonl ```json {"url": "https://httpbin.org/ip", "httpResponseBody": true, "session": {"id": "e07843b4-fd72-4a02-82b4-3376c6ceba92"}} {"url": "https://httpbin.org/ip", "httpResponseBody": true, "session": {"id": "e07843b4-fd72-4a02-82b4-3376c6ceba92"}} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .origin ``` #### curl input.json ```json { "url": "https://httpbin.org/ip", "httpResponseBody": true, "session": { "id": "e07843b4-fd72-4a02-82b4-3376c6ceba92" } } ``` ```shell for i in {1..2} do curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode \ | jq --raw-output .origin done ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import java.util.UUID; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public 
static void main(final String[] args) throws InterruptedException, IOException, ParseException { String sessionId = UUID.randomUUID().toString(); CloseableHttpClient client = HttpClients.createDefault(); for (int i = 0; i < 2; i++) { Map session = ImmutableMap.of("id", sessionId); Map parameters = ImmutableMap.of( "url", "https://httpbin.org/ip", "httpResponseBody", true, "session", session); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString(); byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody); String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8); JsonObject data = JsonParser.parseString(httpResponseBody).getAsJsonObject(); String body = data.get("origin").getAsString(); System.out.println(body); return null; }); } } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const crypto = require('crypto') const sessionId = String(crypto.randomUUID()) axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/ip', httpResponseBody: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( 
response.data.httpResponseBody, 'base64' ) const body = JSON.parse(httpResponseBody).origin console.log(body) axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://httpbin.org/ip', httpResponseBody: true, session: { id: sessionId } }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const httpResponseBody = Buffer.from( response.data.httpResponseBody, 'base64' ) const body = JSON.parse(httpResponseBody).origin console.log(body) }) }) ``` #### PHP ```php request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://httpbin.org/anything', 'httpResponseBody' => true, 'session' => ['id' => $session_id], ], ]); $data = json_decode($response->getBody()); $http_response_body = base64_decode($data->httpResponseBody); $body = json_decode($http_response_body)->origin; echo $body.PHP_EOL; } ``` #### Proxy mode With the proxy mode, use the `Zyte-Session-ID` header. ```shell for i in {1..2} do curl \ --proxy api.zyte.com:8011 \ --proxy-user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --header 'Zyte-Session-ID: e07843b4-fd72-4a02-82b4-3376c6ceba92' \ --compressed \ https://httpbin.org/ip \ | jq --raw-output .origin done ``` #### Python ```python import json from base64 import b64decode from uuid import uuid4 import requests session_id = str(uuid4()) for _ in range(2): api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://httpbin.org/ip", "httpResponseBody": True, "session": {"id": session_id}, }, ) http_response_body = b64decode(api_response.json()["httpResponseBody"]) body: str = json.loads(http_response_body)["origin"] print(body) ``` #### Python client ```python import asyncio import json from base64 import b64decode from uuid import uuid4 from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() session_id = str(uuid4()) for i in range(2): 
api_response = await client.get( { "url": "https://httpbin.org/ip", "httpResponseBody": True, "session": {"id": session_id}, }, ) http_response_body = b64decode(api_response["httpResponseBody"]).decode() data = json.loads(http_response_body) print(data["origin"]) asyncio.run(main()) ``` #### Scrapy > ###### TIP > > scrapy-zyte-api also provides its own session management > API, similar to that of > server-managed sessions, but > built on top of client-managed sessions. ```python import json from uuid import uuid4 from scrapy import Request, Spider class HTTPBinOrgSpider(Spider): name = "httpbin_org" async def start(self): session_id = str(uuid4()) yield Request( "https://httpbin.org/ip", cb_kwargs={"session_id": session_id}, meta={"zyte_api_automap": {"session": {"id": session_id}}}, ) def parse(self, response, session_id): print(json.loads(response.body)["origin"]) yield Request( "https://httpbin.org/ip", meta={"zyte_api_automap": {"session": {"id": session_id}}}, dont_filter=True, callback=self.parse2, ) def parse2(self, response): print(json.loads(response.body)["origin"]) ``` Output: ```none 203.0.113.122 203.0.113.122 ``` #### Access a [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM) > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. To get content from the [shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM), use the `evaluate` action to create an invisible DOM element, which you will get in [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml), and fill it with the desired content from the shadow DOM. > ###### TIP > > If your `evaluate` action does not work as expected, check the > [actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/actions) response field for errors. 
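As a rough illustration of the tip above, the following Python sketch scans the `actions` field of an already-parsed Zyte API response (for example, the result of `api_response.json()`) and reports the actions that failed. It assumes that each entry of `actions` is an object with an `action` name and, on failure, an `error` message; those field names and the sample response are assumptions made for this sketch, so check the reference documentation of the `actions` response field before relying on them.

```python
from typing import Any


def find_action_errors(api_response: dict[str, Any]) -> list[tuple[int, str, str]]:
    """Return (index, action name, error message) for each action that reports an error.

    Assumption: failed entries of the "actions" response field carry an "error" key.
    """
    errors = []
    for index, result in enumerate(api_response.get("actions") or []):
        error = result.get("error")
        if error:
            errors.append((index, result.get("action", "unknown"), error))
    return errors


# Hypothetical response, trimmed to the fields that this sketch reads.
sample_response = {
    "browserHtml": "<html></html>",
    "actions": [
        {"action": "evaluate", "error": "TypeError: iframe is null"},
    ],
}

for index, name, error in find_action_errors(sample_response):
    print(f"Action #{index} ({name}) failed: {error}")
```

Running a check like this before parsing `browserHtml` can make it easier to tell an `evaluate` script that failed apart from a selector that simply did not match.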
The following example code shows how to access the shadow DOM paragraph from [a shadow DOM example in CodePen](https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=) using the `evaluate` action with the following `source`: ```js const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') // Hide, in case you also want to take a screenshot. div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) ``` #### C# ```cs using System; using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using System.Xml.XPath; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary(){ {"url", "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view="}, {"browserHtml", true}, { "actions", new List>() { new Dictionary() { {"action", "evaluate"}, {"source", @" const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) "} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await 
client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var nodeIterator = (XPathNodeIterator)navigator.Evaluate("//*[@id=\"shadow-root-content\"]/text()"); nodeIterator.MoveNext(); var shadowText = nodeIterator.Current.ToString(); Console.WriteLine(shadowText); ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map actions = ImmutableMap.of( "action", "evaluate", "source", "const div = document.createElement('div')\n" + "div.setAttribute('id', 'shadow-root-content')\n" + "div.style.display = 'none'\n" + "const iframe = document.getElementById('result')\n" + "div.innerText = iframe\n" + " .contentWindow.document\n" + " .getElementById('shadow-root')\n" + " .shadowRoot.querySelector('p').textContent\n" + 
"document.body.appendChild(div)"); Map parameters = ImmutableMap.of( "url", "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", "browserHtml", true, "actions", Collections.singletonList(actions)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); Document document = Jsoup.parse(browserHtml); String shadowText = document.select("#shadow-root-content").text(); System.out.println(shadowText); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=', browserHtml: true, actions: [ { action: 'evaluate', source: ` const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) ` } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const browserHtml = response.data.browserHtml const $ = 
cheerio.load(browserHtml) const shadowText = $('#shadow-root-content').text() console.log(shadowText) }) ``` #### PHP ```php request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=', 'browserHtml' => true, 'actions' => [ [ 'action' => 'evaluate', 'source' => " const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) ", ], ], ], ]); $data = json_decode($response->getBody()); $doc = new DOMDocument(); $doc->loadHTML($data->browserHtml); $xpath = new DOMXPath($doc); $shadow_text = $xpath->query("//*[@id='shadow-root-content']")->item(0)->textContent; echo $shadow_text.PHP_EOL; ``` #### Python ```python import requests from parsel import Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", "browserHtml": True, "actions": [ { "action": "evaluate", "source": """ const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) """, }, ], }, ) browser_html = api_response.json()["browserHtml"] shadow_text = Selector(browser_html).css("#shadow-root-content::text").get() print(shadow_text) ``` #### Python client ```python import asyncio from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": 
"https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", "browserHtml": True, "actions": [ { "action": "evaluate", "source": """ const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) """, }, ], }, ) browser_html = api_response["browserHtml"] shadow_text = Selector(browser_html).css("#shadow-root-content::text").get() print(shadow_text) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class CodePenSpider(Spider): name = "codepen" async def start(self): yield Request( "https://cdpn.io/TLadd/fullpage/PoGoQeV?anon=true&view=", meta={ "zyte_api_automap": { "browserHtml": True, "actions": [ { "action": "evaluate", "source": """ const div = document.createElement('div') div.setAttribute('id', 'shadow-root-content') div.style.display = 'none' const iframe = document.getElementById('result') div.innerText = iframe .contentWindow.document .getElementById('shadow-root') .shadowRoot.querySelector('p').textContent document.body.appendChild(div) """, }, ], }, }, ) def parse(self, response): shadow_text = response.css("#shadow-root-content::text").get() print(shadow_text) ``` Output: ```none Shadow Paragraph ``` ## Using the Zyte IDE ![image](zyte-api/ide/images/open-ide.png) To open the **Zyte IDE**, select **Zyte API › Zyte IDE** in the sidebar of the [Zyte dashboard](https://app.zyte.com/). The **Zyte IDE** lets you: - Write, debug, and deploy browser scripts, written using our [TypeScript](https://www.typescriptlang.org/) scripting API, to use as actions in browser requests. - Build Zyte API requests visually, and debug existing requests. ### Zyte IDE requirements To use the Zyte IDE you need a modern browser. You also need to enable third-party cookies on the `zyte.group` domain. 
If your browser blocks them, you will see an error like the following:

> Error loading webview: Error: Could not register service workers:
> NotSupportedError: Failed to register a ServiceWorker for scope …

### Browser script advantages

Browser scripts are written using the scripting API, a [TypeScript](https://www.typescriptlang.org/) API that exposes Zyte API actions.

The main advantage of browser scripts over action sequences is support for a non-linear flow: [TypeScript](https://www.typescriptlang.org/) allows using conditional statements, loops, and so on. For example, in a browser script you can check if an element is present in a webpage, and run different actions depending on that.

Browser scripts also allow accessing [iframe](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) elements through `Page.getIframe()`.

### `src/` and `dist/`

On the ![Explorer](zyte-api/ide/images/explorer.png) [Explorer](https://code.visualstudio.com/docs/getstarted/userinterface#_explorer) view of the Zyte IDE you get 2 folders:

- `src/` is the development folder, where you develop your browser scripts.

  Changes to this folder are saved on the Zyte IDE cloud storage system. Multiple developers can work on this folder at the same time, and see each other's work in real time.

- `dist/` is the deployment folder. It contains files built from the last version of `src/` that has been deployed to Zyte API.

  Never write anything in the `dist/` folder: all its contents are removed during deployment.

### Creating a new script

To create a new browser script:

1. Open the Zyte IDE.
2. Select ![Application Menu](zyte-api/ide/images/menu.png) (top-left corner) **› File › New File…**.
3. On the **Create New…** dialog, select **Smart Browser Interaction**.
4. On the **Use Case** dialog, select one of the following:
   - One of the special action interaction classes, to extend it in your new script.
   - **Others**, to create a script from scratch.
   - **Utils**, to create a file with utility code that you can reuse from browser scripts and other utility code files.
5. On the **Domain** dialog, enter the domain of the website for which you are writing the script (e.g. `toscrape.com`).

A TypeScript file is created in the `src/` folder based on your input data, and opens in a tab of the Zyte IDE.

For example, if you select the **Others** use case and the `toscrape.com` domain, a `src/Others/toscrape.com.ts` file is created with the following content:

```typescript
import { BaseInteraction, Page } from "smartbrowser-core-interactions/index.ts";

interface Args {
  // add your arguments here
  // arg1: string;
}

export default class Interaction extends BaseInteraction {
  domains = ["example.com"];

  async do(page: Page, args: Args): Promise<void> {
    // implement your logic here
  }
}
```

You can now use the scripting API to implement your browser script, and then debug and, eventually, deploy that browser script.

To get started, try out our example browser scripts!

### Debugging a script

After you create an interaction class, you can run it on a webpage from the Zyte IDE to see how it works:

1. Select ![Run Smart Browser Interaction](zyte-api/ide/images/run.png) (top-right corner).
2. On the **URL** dialog, enter a target URL (e.g. `https://toscrape.com`).
3. On the **Interaction Parameters(in JSON)** dialog, enter a JSON object of arguments for your interaction class. You can leave it empty to pass no parameters.

When you press Enter, the Zyte IDE view splits vertically, and a tab on the right-hand view loads a tool, **Smart Browser DevTools**, which runs your interaction against the specified URL, shows you the result in real time, and offers you debugging tools and data.

When your interaction finishes, an “execution finished” message pops up on the bottom-right corner.
Once your interaction class is working as expected, you can deploy it and use it as a browser action in your data extraction requests.

### Completing our Know Your Customer procedure

While all Zyte API features are available to every customer, the following features are disabled by default:

- Custom script deployment
- Setting `ipType` to `residential`

Unless you are on a free trial, you can enable these features by completing our [Know Your Customer](https://en.wikipedia.org/wiki/Know_your_customer) (KYC) procedure.

> ###### TIP
>
> If you are on an Enterprise plan, you have
> already completed our Know Your Customer (KYC) procedure.

To start your KYC application:

1. Open the Zyte IDE.
2. Select ![Zyte Smart Browser Devtools](zyte-api/ide/images/zyte.png) on the [Activity Bar](https://code.visualstudio.com/docs/getstarted/userinterface) (left-hand side).
3. On the **Zyte IDE** side view that opens, under **Deploy**, click the **Request Access** button.

   > ###### TIP
   >
   > If you see a **Deploy** button instead, you have probably already
   > passed our KYC procedure.
4. Fill in and submit the form that opens.

Once our legal team has reviewed your application, we will notify you of the outcome. If your application is approved, you will instantly get access to all previously disabled features.

### Deploying your changes

You can debug scripts from the Zyte IDE, but to use them as browser actions in your data extraction requests you must first deploy your changes to Zyte API.

To deploy all your changes to Zyte API:

1. Select ![Zyte Smart Browser Devtools](zyte-api/ide/images/zyte.png) on the [Activity Bar](https://code.visualstudio.com/docs/getstarted/userinterface) (left-hand side).
2. On the **Zyte IDE** side view that opens, click the **Deploy** button.

   > ###### TIP
   >
   > If you see a **Request Access** button instead, see
   > Completing our Know Your Customer procedure.

The deployment process empties the `dist/` folder, builds files from the `src/` folder into the `dist/` folder, and deploys the files in the `dist/` folder to Zyte API.
### Using a script with Zyte API

Once you have deployed a script, you can get the interaction ID as follows:

1. Select ![Zyte Smart Browser Devtools](zyte-api/ide/images/zyte.png) on the [Activity Bar](https://code.visualstudio.com/docs/getstarted/userinterface) (left-hand side).
2. On the **Zyte IDE** side view that opens, under **Interactions**, right-click on your interaction, and click **Copy Interaction ID**.

The interaction ID is copied to your clipboard.

You can then invoke your script as a browser action from your data extraction requests:

1. Set the `action` field to `"interaction"`.
2. Set the `id` field to the interaction ID from your clipboard.

For example, if your interaction ID is `Others-toscrape.com`, use:

```json
{
    "action": "interaction",
    "id": "Others-toscrape.com"
}
```

To pass arguments to your script, set the `args` field to an object. That object is passed to your script as the *args* parameter of `BaseInteraction.do()`. For example:

```json
{
    "action": "interaction",
    "id": "Others-toscrape.com",
    "args": {
        "foo": "bar"
    }
}
```

## Scripting API

This is the reference documentation of the scripting API, a [TypeScript](https://www.typescriptlang.org/) API that you can use to write browser scripts.

For usage examples, see our example browser scripts.

### Using the scripting API

The `smartbrowser-core-interactions` module provides classes and functions that you can use to implement browser scripts.

You can import those classes and functions from `smartbrowser-core-interactions/index.ts`. For example:

```typescript
import { BaseInteraction, Page } from "smartbrowser-core-interactions/index.ts";
```

Below you can find the reference documentation of the complete scripting API.

### BaseInteraction and Page

`BaseInteraction()` and `Page()` are the foundation of the scripting API.

#### *class* BaseInteraction()

Base class for browser scripts.
*abstract*

*exported from* `base_classes.Base`

#### BaseInteraction.do(page, args)

Entry point of a browser script.

* **Arguments:**
  * **page** (*Page*) – Current page.
  * **args** (*object*) – `args` action parameter passed through a Zyte API request, defaults to `{}`.
* **Returns:** **Promise** –

#### *class* Page()

The webpage currently loaded.

*interface*

*exported from* `api.page`

#### Page.click(selector)

Click the first element matching *selector*.

* **Arguments:**
  * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string.
* **Returns:** **Promise** –

#### Page.cookies()

Get the cookies associated with the current page.

* **Returns:** **Promise** –

#### Page.deleteCookie(deleteCookieRequest)

Remove a cookie from the set of cookies associated with the current page.

* **Arguments:**
  * **deleteCookieRequest** (*DeleteCookieRequest*)
* **Returns:** **Promise** –

#### Page.evaluate(source)

Execute JavaScript code within the page context.

* **Arguments:**
  * **source** (*string*) – String with the JavaScript source to execute in the page context.
* **Returns:** **Promise** – Promise that resolves to data returned by the evaluated code, if any.

#### Page.fetch(url, options)

Send a request for *url* within the current [browsing context](https://developer.mozilla.org/en-US/docs/Glossary/Browsing_context) and return a promise that resolves to a `FetchResponse()` object.

Note that the browsing context may limit what requests can be made, e.g. through [CORS](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS).

* **Arguments:**
  * **url** (*string*) – URL to fetch
  * **options** (*FetchOptions*) – Fetch options
* **Returns:** **Promise** – FetchResponse object

#### Page.getIframe(selector)

Get the first iframe matching *selector*.

* **Arguments:**
  * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string.
* **Returns:** **Promise** –

#### Page.goto(url, options)

Navigate to *url*.

By default, the returned promise resolves once *url* loads. See `GotoOptions.waitUntil` for other options.

* **Arguments:**
  * **url** (*string*) – Target URL
  * **options** (*GotoOptions*) – Navigation options
* **Returns:** **Promise** –

#### Page.hide(selector)

Hide all elements matching *selector*.

* **Arguments:**
  * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string.
* **Returns:** **Promise** –

#### Page.hover(selector)

Hover over the first element matching *selector*.

* **Arguments:**
  * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string.
* **Returns:** **Promise** –

#### Page.keyPress(key)

Press *key*.

Supported key IDs are:

| Key Group | Key | Key IDs |
|-------------|---------------------|-------------------------|
| Letters | A | `"a"`, `"A"`, `"KeyA"` |
| | B | `"b"`, `"B"`, `"KeyB"` |
| | C | `"c"`, `"C"`, `"KeyC"` |
| | D | `"d"`, `"D"`, `"KeyD"` |
| | E | `"e"`, `"E"`, `"KeyE"` |
| | F | `"f"`, `"F"`, `"KeyF"` |
| | G | `"g"`, `"G"`, `"KeyG"` |
| | H | `"h"`, `"H"`, `"KeyH"` |
| | I | `"i"`, `"I"`, `"KeyI"` |
| | J | `"j"`, `"J"`, `"KeyJ"` |
| | K | `"k"`, `"K"`, `"KeyK"` |
| | L | `"l"`, `"L"`, `"KeyL"` |
| | M | `"m"`, `"M"`, `"KeyM"` |
| | N | `"n"`, `"N"`, `"KeyN"` |
| | O | `"o"`, `"O"`, `"KeyO"` |
| | P | `"p"`, `"P"`, `"KeyP"` |
| | Q | `"q"`, `"Q"`, `"KeyQ"` |
| | R | `"r"`, `"R"`, `"KeyR"` |
| | S | `"s"`, `"S"`, `"KeyS"` |
| | T | `"t"`, `"T"`, `"KeyT"` |
| | U | `"u"`, `"U"`, `"KeyU"` |
| | V | `"v"`, `"V"`, `"KeyV"` |
| | W | `"w"`, `"W"`, `"KeyW"` |
| | X | `"x"`, `"X"`, `"KeyX"` |
| | Y | `"y"`, `"Y"`, `"KeyY"` |
| | Z | `"z"`, `"Z"`, `"KeyZ"` |
| Digits | 0 | `"0"`, `"Digit0"` |
| | 1 | `"1"`, `"Digit1"` |
| | 2 | `"2"`, `"Digit2"` |
| | 3 | `"3"`, `"Digit3"` |
| | 4 | `"4"`, `"Digit4"` |
| | 5 | `"5"`, `"Digit5"` |
| | 6 | `"6"`, `"Digit6"` |
| | 7 | `"7"`, `"Digit7"` |
| | 8 | `"8"`, `"Digit8"` |
| | 9 | `"9"`, `"Digit9"` |
| Symbols | Ampersand | `"&"` |
| | Asterisk | `"*"` |
| | At sign | `"@"` |
| | Backslash | `"Backslash"`, `"\\"` |
| | Backtick | `"Backquote"`, `` "`" `` |
| | Caret | `"^"` |
| | Closing brace | `"}"` |
| | Closing bracket | `"BracketRight"`, `"]"` |
| | Closing parenthesis | `")"` |
| | Colon | `":"` |
| | Comma | `"Comma"`, `","` |
| | Dollar sign | `"$"` |
| | Double quote | `'"'` |
| | Equal | `"Equal"`, `"="` |
| | Exclamation mark | `"!"` |
| | Greater-than sign | `">"` |
| | Hash | `"#"` |
| | Question mark | `"?"` |
| | Less-than sign | `"<"` |
| | Minus sign | `"Minus"`, `"-"` |
| | Opening brace | `"{"` |
| | Opening bracket | `"BracketLeft"`, `"["` |
| | Opening parenthesis | `"("` |
| | Percent sign | `"%"` |
| | Period | `"Period"`, `"."` |
| | Plus sign | `"+"` |
| | Semicolon | `"Semicolon"`, `";"` |
| | Single quote | `"Quote"`, `"'"` |
| | Slash | `"Slash"`, `"/"` |
| | Tilde | `"~"` |
| | Underscore | `"_"` |
| | Vertical bar | `"\|"` |
| Whitespace | Enter | `"Enter"`, `"\n"` |
| | Space | `"Space"`, `" "` |
| | Tab | `"Tab"` |
| Editing | Backspace | `"Backspace"`, `"\r"` |
| | Delete | `"Delete"` |
| | Insert | `"Insert"` |
| Navigation | Down arrow | `"ArrowDown"` |
| | Left arrow | `"ArrowLeft"` |
| | Page down | `"PageDown"` |
| | Page up | `"PageUp"` |
| | Right arrow | `"ArrowRight"` |
| | Up arrow | `"ArrowUp"` |
| Modifier | Alt | `"Alt"` |
| | Caps Lock | `"CapsLock"` |
| | Left Alt | `"AltLeft"` |
| | Left Control | `"ControlLeft"` |
| | Left Shift | `"ShiftLeft"` |
| | Right Alt | `"AltRight"` |
| | Right Control | `"ControlRight"` |
| | Right Shift | `"ShiftRight"` |
| | Shift | `"Shift"` |
| Numpad | 0 | `"Numpad0"` |
| | 1 | `"Numpad1"` |
| | 2 | `"Numpad2"` |
| | 3 | `"Numpad3"` |
| | 4 | `"Numpad4"` |
| | 5 | `"Numpad5"` |
| | 6 | `"Numpad6"` |
| | 7 | `"Numpad7"` |
| | 8 | `"Numpad8"` |
| | Add sign | `"NumpadAdd"` |
| | Decimal sign | `"NumpadDecimal"` |
| | Divide sign | `"NumpadDivide"` |
| | Enter | `"NumpadEnter"` |
| | Equal | `"NumpadEqual"` |
| | Multiply sign | `"NumpadMultiply"` |
| | Subtract sign | `"NumpadSubtract"` |
| UI | Break key | `"Pause"` |
| Composition | Non-conversion | `"NonConvert"` |
| Other | Null | `"\u0000"` |

* **Arguments:**
  * **key** (*string*) – Key ID.
* **Returns:** **Promise** –

#### Page.querySelector(selector)

Query the DOM for an element matching *selector* and return the first element found.

If no element matches *selector*, the return value resolves to `null`.

If multiple elements match *selector*, only the first element is returned. To get all elements, use `querySelectorAll()` instead.

This method throws an error if *selector* is invalid.

* **Arguments:**
  * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string.
* **Returns:** **Promise** – Promise that resolves to an `ElementHandle()` object for the first element matching *selector*.

#### Page.querySelectorAll(selector)

Query the DOM for all elements matching *selector*.

If no elements match *selector*, the return value resolves to `[]`.

This method throws an error if *selector* is invalid.

* **Arguments:**
  * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string.
* **Returns:** **Promise** – Promise that resolves to an array of all `ElementHandle()` objects matching *selector*.

#### Page.reload(options)

Refresh the current page.

* **Arguments:**
  * **options** (*GotoOptions*) – Navigation options
* **Returns:** **Promise** –

#### Page.scrollBottom(options)

Scroll to the bottom of the page.
* **Arguments:** * **options** (*ScrollBottomOptions*) – Scrolling options * **Returns:** **Promise** – #### Page.scrollTo(options) Scroll to a given position in the document. * **Arguments:** * **options** (*ScrollToOptions*) – Scrolling options * **Returns:** **Promise** – #### Page.select(selector, values) Selects an option in a [select](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/select) element. * **Arguments:** * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string. * **values** (*string[]*) – Array of [option](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/option) element [values](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/option#value) to select. * **Returns:** **Promise** – #### Page.setCookie(cookie) Sets a cookie to be sent with subsequent requests within current page context. * **Arguments:** * **cookie** (*Cookie*) – Cookie to be set, should have name and value properties * **Returns:** **Promise** – #### Page.type(selector, text, delay) Type *text* into the first element matching *selector*. * **Arguments:** * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string. * **text** (*string*) – Input text. * **delay** (*number*) – Time to wait between key presses in seconds, defaults to 0. * **Returns:** **Promise** – #### Page.url() Returns a string with the URL of the current page. * **Returns:** **Promise** – Promise that resolves to the URL of the page. #### Page.waitForNavigation(timeout, waitUntil) Wait for navigation to finish. * **Arguments:** * **timeout** (*number*) – Maximum time to wait, in seconds. Defaults to 30 seconds. * **waitUntil** ( *"load"|"domcontentloaded"|"networkidle0"*) – When to consider that navigation has finished. 
One of: `"load"` ([load event](https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event), default), `"domcontentloaded"` ([DOMContentLoaded event](https://developer.mozilla.org/en-US/docs/Web/API/Window/DOMContentLoaded_event)), or `"networkidle0"` (no ongoing network connections for at least 0.5 seconds). * **Returns:** **Promise** – Promise that resolves to a [Response](https://developer.mozilla.org/en-US/docs/Web/API/Response) object. #### Page.waitForSelector(selector, timeout) Wait for *selector* to match an item in the [DOM](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model). If there is already a match by the time the method is called, the returned promise resolves immediately. * **Arguments:** * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string. * **timeout** (*number*) – Maximum wait time, in seconds. Defaults to 30 seconds. * **Returns:** **Promise** – #### Page.waitForTimeout(timeout) Return a promise that resolves after *timeout*. * **Arguments:** * **timeout** (*number*) – Wait time, in seconds. * **Returns:** **Promise** – ### Selector and ElementHandle #### *class* Selector() Expression that aims to match one or more [DOM](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model) elements. *interface* *exported from* `api.page` #### Selector.allElements? **type:** boolean Whether to match all possible elements (`true`) or only the first one (`false`, default). #### Selector.state? **type:** “attached”|”visible”|”hidden” Visibility required for an element to match. Possible values are: - `"visible"` (default): only visible elements are matched. An element is visible if it has a non-empty bounding box and does not have `visibility` set to `hidden`. Note that elements with `display` set to `none` have an empty bounding box, and are hence not considered visible. - `"hidden"`: only non-visible elements are matched. 
- `"attached"`: any element may be matched, regardless of its visibility. #### Selector.type **type:** “css”|”xpath” Whether `value` is a [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) expression or an [XPath 1.0](https://www.w3.org/TR/1999/REC-xpath-19991116/) expression. You can find some resources to learn about these selector languages in the [parsel documentation](https://parsel.readthedocs.io/en/latest/usage.html#learning-expression-languages). #### Selector.value **type:** string Expression. #### *class* ElementHandle() [DOM](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model) element. *interface* *exported from* `api.page` #### ElementHandle.getAttribute(name) Return the value of the element attribute with the specified *name*, or `null` if the attribute does not exist. * **Arguments:** * **name** (*string*) * **Returns:** **Promise** – #### ElementHandle.getText() Return the text between the element start tag and end tag. * **Returns:** **Promise** – #### ElementHandle.querySelector(selector) Return a nested element matching *selector*. * **Arguments:** * **selector** (*Selector|string*) – `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string. * **Returns:** **Promise** – #### ElementHandle.screenshot() Return the screenshot of the element in PNG format as a base64-encoded string. * **Returns:** **Promise** – ### FetchResponse #### *class* FetchResponse() Response from `Page.fetch()`. Its API is a subset of the [Response](https://developer.mozilla.org/en-US/docs/Web/API/Response) API. 
*interface* *exported from* `api.page` #### FetchResponse.headers **type:** Record #### FetchResponse.ok **type:** boolean #### FetchResponse.status **type:** number #### FetchResponse.statusText **type:** string #### FetchResponse.type **type:** string #### FetchResponse.url **type:** string #### FetchResponse.bytes() * **Returns:** **Uint8Array** – #### FetchResponse.json() * **Returns:** **any** – #### FetchResponse.text() * **Returns:** **string** – ### Cookies #### *class* Cookie() *interface* *exported from* `api.page` #### Cookie.domain? **type:** string #### Cookie.expires? **type:** number #### Cookie.httpOnly? **type:** boolean #### Cookie.name **type:** string #### Cookie.path? **type:** string #### Cookie.sameSite? **type:** CookieSameSite #### Cookie.secure? **type:** boolean #### Cookie.url? **type:** string #### Cookie.value **type:** string #### *class* DeleteCookieRequest() *interface* *exported from* `api.page` #### DeleteCookieRequest.domain? **type:** string #### DeleteCookieRequest.name **type:** string #### DeleteCookieRequest.partitionKey? **type:** string #### DeleteCookieRequest.path? **type:** string #### DeleteCookieRequest.url? **type:** string #### *class* CookieSameSite() One of: `"Strict"`, `"Lax"`, `"None"`. ### Option classes #### *class* FetchOptions() Options for `Page.fetch()`. It is a subset of [RequestInit](https://developer.mozilla.org/en-US/docs/Web/API/RequestInit). *interface* *exported from* `api.page` #### FetchOptions.body? **type:** string #### FetchOptions.cache? **type:** “default”|”reload”|”no-cache”|”force-cache”|”only-if-cached” #### FetchOptions.headers? **type:** Record #### FetchOptions.method? **type:** “GET”|”POST”|”PUT”|”DELETE”|”PATCH”|”OPTIONS”|”HEAD” #### FetchOptions.mode? **type:** “same-origin”|”no-cors”|”cors” #### FetchOptions.redirect? **type:** “follow”|”error”|”manual” #### FetchOptions.referrer? **type:** string #### *class* GotoOptions() Options for `Page.goto()`. 
*interface* *exported from* `api.page` #### GotoOptions.timeout? **type:** number Maximum wait time, in seconds. Defaults to 30 seconds. Use 0 to disable the timeout. #### GotoOptions.waitUntil **type:** “load”|”networkidle0” When to consider that navigation has finished. Possible values: - `"load"` ([load event](https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event), default). - `"networkidle0"` (no ongoing network connections for at least 0.5 seconds). #### *class* ScrollBottomOptions() *interface* *exported from* `api.page` #### ScrollBottomOptions.maxPageHeight? **type:** number #### ScrollBottomOptions.maxScrollCount? **type:** number #### ScrollBottomOptions.maxScrollDelay? **type:** number #### ScrollBottomOptions.timeout? **type:** number #### *class* ScrollToOptions() *interface* *exported from* `api.page` #### ScrollToOptions.left? **type:** number #### ScrollToOptions.top **type:** number ### Special action interactions Zyte maintains a set of subclasses of `BaseInteraction()` that all Zyte API users may use as special actions. When implementing a custom interaction, instead of extending `BaseInteraction()`, you can extend one of these special action interaction classes: - `SearchKeywordInteraction()` - `SetLocationInteraction()` #### *class* SearchKeywordArgs() Argument interface for `SearchKeywordInteraction()`. *interface* *exported from* `base_classes.SearchKeyword` #### SearchKeywordArgs.keyword **type:** string Keyword or keywords to search for. #### *class* SearchKeywordInteraction() Interaction that uses the search box of a page. See `SearchKeywordArgs()` for the interface of the *args* parameter of the `do()` method of this class. 
*abstract* *exported from* `base_classes.SearchKeyword` **Extends:** : - `BaseInteraction()` #### SearchKeywordInteraction.keywordCssSelector **type:** Selector|string `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string to find the input field where the search keywords must be typed. #### SearchKeywordInteraction.typeKeyword(page, keyword) Type *keyword* into the search field on *page*, start the search, and wait for the results page to load. * **Arguments:** * **page** (*Page*) – Current page. * **keyword** (*string*) – Search keywords. * **Returns:** **Promise** – #### *class* SetLocation.Address() [Postal address](https://en.wikipedia.org/wiki/Address). *interface* *exported from* `base_classes.SetLocation` #### SetLocation.Address.city **type:** string #### SetLocation.Address.country **type:** string [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) country code. #### SetLocation.Address.postalCode **type:** string [Postal code](https://en.wikipedia.org/wiki/Postal_code). #### SetLocation.Address.region **type:** string Country subdivision of relevance for the [postal address](https://en.wikipedia.org/wiki/Address). #### SetLocation.Address.streetAddress **type:** string [Street address](https://en.wiktionary.org/wiki/street_address). #### *class* SetLocationArgs() Argument interface for `SetLocationInteraction()`. *interface* *exported from* `base_classes.SetLocation` #### SetLocationArgs.address **type:** Address [Postal address](https://en.wikipedia.org/wiki/Address). #### *class* SetLocationInteraction() Interaction to fill fields on the address form of a page. See `SetLocationArgs()` for the interface of the *args* parameter of the `do()` method of this class. 
*abstract* *exported from* `base_classes.SetLocation` **Extends:** : - `BaseInteraction()` #### SetLocationInteraction.postalCodeCssSelector **type:** Selector|string `Selector()` instance or [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) string to find the input field of the postal code. #### SetLocationInteraction.typePostalCode(page, postalCode) Type *postalCode* into the postal code field on *page*. * **Arguments:** * **page** (*Page*) – Current page. * **postalCode** (*string*) – [Postal code](https://en.wikipedia.org/wiki/Postal_code). * **Returns:** **Promise** – ### Versioning Every new version of the `smartbrowser-core-interactions` module is backward-compatible, and is automatically made available to every instance of the Zyte IDE, and to code already deployed to Zyte API. In the future, we may implement a versioning system to make it possible to introduce backward-incompatible changes into the `smartbrowser-core-interactions` module without breaking existing code. ## Browser script examples The following sections showcase ready-to-use browser scripts. See zyte-ide to learn how to use them. ### Search The following example subclasses `SearchKeywordInteraction()` to implement search for docs.zyte.com: ```typescript import { SearchKeywordInteraction } from "smartbrowser-core-interactions/index.ts"; export default class DocsZyteComSearchKeywordInteraction extends SearchKeywordInteraction { domains = ["docs.zyte.com"]; keywordCssSelector = "input.sidebar-search"; } ``` To try the example, use `https://docs.zyte.com/` as target URL, `{"keyword": "foo"}` as parameters, and any geolocation. > ###### NOTE > > If you debug the example in real time from the Zyte IDE, for the browser script to work you must widen the > browser view until the Zyte docs search box becomes visible. 
>
> ![](zyte-api/ide/examples/images/search.png)
>
> This is not a problem in Zyte API requests, where the default
> [viewport](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/viewport) is wide enough.

### Interactive form

The following example fills and submits the content filtering form at [quotes.toscrape.com/search.aspx](https://quotes.toscrape.com/search.aspx):

```typescript
import { BaseInteraction, Page } from "smartbrowser-core-interactions/index.ts";

interface Args {
  author: string;
  tag: string;
}

export default class QuotesToScrapeComSearchInteraction extends BaseInteraction {
  domains = ["quotes.toscrape.com"];

  async do(page: Page, args: Args): Promise<void> {
    await page.select("#author", [args.author]);
    // Wait until the tag options exist in the DOM, whether visible or not.
    await page.waitForSelector({type: "css", value: "#tag option[value]", state: "attached"});
    await page.select("#tag", [args.tag]);
    await page.waitForTimeout(3);
    await page.click('input[name="submit_button"]');
    await page.waitForSelector(".quote");
  }
}
```

To try the example, use `https://quotes.toscrape.com/search.aspx` as target URL, `{"author": "Steve Martin", "tag": "humor"}` as parameters, and any geolocation.
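Once the example is deployed, the same target URL and parameters can go into a Zyte API request. The following is a minimal sketch that only builds the request body. It assumes that the `interaction` action goes into the `actions` array of a `browserHtml` request; `buildInteractionRequest` is a hypothetical helper, and `"YOUR_INTERACTION_ID"` is a placeholder for the ID that you copy from the Zyte IDE.

```typescript
// Shape of the "interaction" browser action described in
// "Using a script with Zyte API".
interface InteractionAction {
  action: "interaction";
  id: string;
  args?: Record<string, unknown>;
}

// Subset of Zyte API request fields used by this sketch (assumption).
interface ExtractRequest {
  url: string;
  browserHtml: true;
  actions: InteractionAction[];
}

// Hypothetical helper: builds the JSON body of a data extraction request
// that lists a deployed interaction as its only browser action.
function buildInteractionRequest(
  url: string,
  interactionId: string,
  args: Record<string, unknown> = {},
): ExtractRequest {
  return {
    url,
    browserHtml: true,
    actions: [{ action: "interaction", id: interactionId, args }],
  };
}

const body = buildInteractionRequest(
  "https://quotes.toscrape.com/search.aspx",
  "YOUR_INTERACTION_ID", // placeholder, not a real interaction ID
  { author: "Steve Martin", tag: "humor" },
);

// Print the body, ready to be sent as JSON with any HTTP client.
console.log(JSON.stringify(body, null, 2));
```

Keeping the body construction separate from the HTTP call makes it easy to reuse with whichever client you already use for Zyte API.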
### API call

The following example shows how you can use JavaScript through the `evaluate` [action](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/actions) to send an API request from a browser and get the API response in the resulting [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml):

```typescript
import { BaseInteraction, Page } from "smartbrowser-core-interactions/index.ts";

interface Args {}

export default class QuotesToScrapeComAPICall extends BaseInteraction {
  domains = ["quotes.toscrape.com"];

  async do(page: Page, args: Args): Promise<void> {
    // JavaScript that runs in the page context: it calls the API and
    // writes the JSON response (or the error) into the document.
    const source = `
      fetch("http://quotes.toscrape.com/api/quotes?page=1")
        .then(response => response.json())
        .then(data => {
          document.write(JSON.stringify(data))
        })
        .catch(error => {
          document.write(JSON.stringify({error}));
        });
    `;
    await page.evaluate(source);
  }
}
```

To try the example, use `http://quotes.toscrape.com/scroll` as target URL, `{}` as parameters, and any geolocation.

## Migrating to Zyte API

Learn how to migrate from:

> ##### Browser automation tools
>
> Migrate from tools like Playwright, Puppeteer, Selenium, or Splash, for
> better productivity and scalability.

> ##### Bright Data Web Unlocker
>
> Enjoy browser HTML, screenshots, and browser actions.

> ##### ScrapingBee
>
> Migrate from ScrapingBee to Zyte API.

> ##### scrapy-zyte-api
>
> Upgrade from scrapy-zyte-smartproxy or from scrapy-crawlera.

> ##### ZenRows
>
> Migrate from ZenRows to Zyte API.

> ##### Zyte Smart Proxy Manager
>
> Enjoy lower ban rates, browser HTML, screenshots, browser actions, and
> smart geolocation.

## Migrating from browser automation to Zyte API

Learn how to migrate from browser automation tools, like [Playwright](https://playwright.dev/), [Puppeteer](https://pptr.dev/), [Selenium](https://www.selenium.dev/), or [Splash](https://splash.readthedocs.io/en/stable/), to Zyte API.
### Feature comparison

The following table summarizes the feature differences between Zyte API and browser automation tools:

| Feature           | Zyte API | Browser automation |
|-------------------|----------|--------------------|
| API               | HTTP     | Varies             |
| Website-aware API | Yes      | No                 |
| Avoid bans        | Yes      | Hard               |
| Scalable          | Yes      | Hard               |

### Migration examples

The following examples show common browser automation functionality implemented using several browser automation tools, followed by an example of the same functionality implemented using Zyte API. Use these examples to get started porting your own code.

To learn more about the browser automation features of Zyte API, see the Zyte API browser automation documentation.

If your code requires a non-linear flow or something else that cannot be translated into a JSON array with a static sequence of actions, you may need Zyte API browser scripts.

#### Getting browser HTML

This is how you get a browser [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) rendered as HTML using browser automation tools:

#### Playwright

> ###### NOTE
>
> This example uses JavaScript with [Playwright](https://playwright.dev/) for browser automation and
> [cheerio](https://github.com/cheeriojs/cheerio) for HTML parsing.

```js
const playwright = require('playwright')

async function main () {
  const browser = await playwright.chromium.launch()
  const page = await browser.newPage()
  await page.goto('https://toscrape.com')
  const browserHtml = await page.content()
  await browser.close()
}

main()
```

#### Puppeteer

> ###### NOTE
>
> This example uses JavaScript with [Puppeteer](https://pptr.dev/) for browser
> automation and [cheerio](https://github.com/cheeriojs/cheerio) for HTML parsing.
```js const puppeteer = require('puppeteer') async function main () { const browser = await puppeteer.launch() const page = await browser.newPage() await page.goto('https://toscrape.com') const browserHtml = await page.content() await browser.close() } main() ``` #### scrapy-playwright > ###### NOTE > > This example uses [scrapy-playwright](https://github.com/scrapy-plugins/scrapy-playwright). ```python from scrapy import Request, Spider class ToScrapeSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={"playwright": True}, ) def parse(self, response): browser_html: str = response.text ``` #### scrapy-splash > ###### NOTE > > This example uses [scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash). ```python from scrapy import Spider from scrapy_splash import SplashRequest class ToScrapeSpider(Spider): name = "toscrape_com" async def start(self): yield SplashRequest("https://toscrape.com") def parse(self, response): browser_html: str = response.text ``` #### Selenium > ###### NOTE > > This example uses [Selenium](https://www.selenium.dev/) with [Python bindings](https://pypi.org/project/selenium/) for browser > automation and [Parsel](https://parsel.readthedocs.io/en/latest/) for HTML parsing. ```python from selenium import webdriver driver = webdriver.Firefox() driver.get("https://toscrape.com") browser_html = driver.page_source driver.close() ``` #### Splash > ###### NOTE > > This example uses Python with [Splash](https://splash.readthedocs.io/en/stable/) for browser automation, > [requests](https://requests.readthedocs.io/en/latest/) to use the HTTP API of Splash, and [Parsel](https://parsel.readthedocs.io/en/latest/) for HTML parsing. 
```python
from urllib.parse import quote

import requests

splash_url = "YOUR_SPLASH_URL"
url = "https://toscrape.com"
response = requests.get(f"{splash_url}/render.html?url={quote(url)}")
browser_html: str = response.content.decode()
```

And this is how you do it using Zyte API:

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

// HTTP Basic authentication: API key as user name, empty password.
var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com"},
    {"browserHtml", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var browserHtml = data.RootElement.GetProperty("browserHtml").ToString();
```

#### CLI client

input.jsonl

```json
{"url": "https://toscrape.com", "browserHtml": true}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .browserHtml
```

#### curl

input.json

```json
{
    "url": "https://toscrape.com",
    "browserHtml": true
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .browserHtml
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of("url", "https://toscrape.com", "browserHtml", true);
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String browserHtml = jsonObject.get("browserHtml").getAsString();
          System.out.println(browserHtml);
          return null;
        });
  }

  // HTTP Basic authentication: API key as user name, empty password.
  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios =
require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://toscrape.com',
    browserHtml: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const browserHtml = response.data.browserHtml
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://toscrape.com',
        'browserHtml' => true,
    ],
]);
$api = json_decode($response->getBody());
$browser_html = $api->browserHtml;
```

#### Proxy mode

```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    -H "Zyte-Browser-Html: true" \
    https://toscrape.com
```

#### Python

```python
import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "browserHtml": True,
    },
)
browser_html: str = api_response.json()["browserHtml"]
```

#### Python client

```python
import asyncio

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://toscrape.com",
            "browserHtml": True,
        }
    )
    print(api_response["browserHtml"])


asyncio.run(main())
```

#### Scrapy

```python
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    async def start(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                },
            },
        )

    def parse(self, response):
        browser_html: str = response.text
```

Output (first 5 lines):

```html
Scraping Sandbox
```

See the Zyte API browser HTML documentation.

#### Taking a screenshot

This is how you take a screenshot using browser automation tools:

#### Playwright

> ###### NOTE
>
> This example uses JavaScript with [Playwright](https://playwright.dev/) for browser automation and
> [cheerio](https://github.com/cheeriojs/cheerio) for HTML parsing.
```js const playwright = require('playwright') async function main () { const browser = await playwright.chromium.launch() const context = await browser.newContext({ viewport: { width: 1920, height: 1080 } }) const page = await context.newPage() await page.goto('https://toscrape.com') const screenshot = await page.screenshot({ type: 'jpeg' }) await browser.close() } main() ``` #### Puppeteer > ###### NOTE > > This example uses JavaScript with [Puppeteer](https://pptr.dev/) for browser > automation and [cheerio](https://github.com/cheeriojs/cheerio) for HTML parsing. ```js const puppeteer = require('puppeteer') async function main () { const browser = await puppeteer.launch({ defaultViewport: { width: 1920, height: 1080 } }) const page = await browser.newPage() await page.goto('https://toscrape.com') const screenshot = await page.screenshot({ type: 'jpeg' }) await browser.close() } main() ``` #### scrapy-playwright > ###### NOTE > > This example uses [scrapy-playwright](https://github.com/scrapy-plugins/scrapy-playwright). ```python from scrapy import Request, Spider from scrapy_playwright.page import PageMethod class ToScrapeSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={ "playwright": True, "playwright_context": "new", "playwright_context_kwargs": { "viewport": {"width": 1920, "height": 1080}, }, "playwright_page_methods": [ PageMethod("screenshot", type="jpeg"), ], }, ) def parse(self, response): screenshot: bytes = response.meta["playwright_page_methods"][0].result ``` #### scrapy-splash > ###### NOTE > > This example uses [scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash). 
```python from scrapy import Spider from scrapy_splash import SplashRequest class ToScrapeSpider(Spider): name = "toscrape_com" async def start(self): yield SplashRequest( "https://toscrape.com", endpoint="render.jpeg", args={ "viewport": "1920x1080", }, ) def parse(self, response): screenshot: bytes = response.body ``` #### Selenium > ###### NOTE > > This example uses [Selenium](https://www.selenium.dev/) with [Python bindings](https://pypi.org/project/selenium/) for browser > automation and [Parsel](https://parsel.readthedocs.io/en/latest/) for HTML parsing. ```python from io import BytesIO from tempfile import NamedTemporaryFile from PIL import Image from selenium import webdriver # https://stackoverflow.com/a/37183295 def set_viewport_size(driver, width, height): window_size = driver.execute_script( """ return [window.outerWidth - window.innerWidth + arguments[0], window.outerHeight - window.innerHeight + arguments[1]]; """, width, height, ) driver.set_window_size(*window_size) def get_jpeg_screenshot(driver): f = NamedTemporaryFile(suffix=".png") driver.save_screenshot(f.name) f.seek(0) image = Image.open(f) rgb_image = image.convert("RGB") image_io = BytesIO() rgb_image.save(image_io, format="JPEG") return image_io.getvalue() driver = webdriver.Firefox() set_viewport_size(driver, 1920, 1080) driver.get("https://toscrape.com") screenshot = get_jpeg_screenshot(driver) driver.close() ``` #### Splash > ###### NOTE > > This example uses Python with [Splash](https://splash.readthedocs.io/en/stable/) for browser automation, > [requests](https://requests.readthedocs.io/en/latest/) to use the HTTP API of Splash, and [Parsel](https://parsel.readthedocs.io/en/latest/) for HTML parsing. 
```python from urllib.parse import quote import requests splash_url = "YOUR_SPLASH_URL" url = "https://toscrape.com" response = requests.get(f"{splash_url}/render.jpeg?url={quote(url)}&viewport=1920x1080") screenshot: bytes = response.content ``` And this is how you do it using Zyte API: > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://toscrape.com"}, {"screenshot", true} }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var base64Screenshot = data.RootElement.GetProperty("screenshot").ToString(); var screenshot = System.Convert.FromBase64String(base64Screenshot); ``` #### CLI client input.jsonl ```json {"url": "https://toscrape.com", "screenshot": true} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .screenshot \ | base64 --decode \ > screenshot.jpg ``` #### curl input.json ```json { "url": "https://toscrape.com", "screenshot": true } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ 
https://api.zyte.com/v1/extract \ | jq --raw-output .screenshot \ | base64 --decode \ > screenshot.jpg ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.FileOutputStream; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; class Example { private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map parameters = ImmutableMap.of("url", "https://toscrape.com", "screenshot", true); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String base64Screenshot = jsonObject.get("screenshot").getAsString(); byte[] screenshot = Base64.getDecoder().decode(base64Screenshot); try (FileOutputStream fos = 
new FileOutputStream("screenshot.jpg")) { fos.write(screenshot); } return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://toscrape.com', screenshot: true }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const screenshot = Buffer.from(response.data.screenshot, 'base64') }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' => 'gzip'], 'json' => [ 'url' => 'https://toscrape.com', 'screenshot' => true, ], ]); $api = json_decode($response->getBody()); $screenshot = base64_decode($api->screenshot); ``` #### Python ```python from base64 import b64decode import requests api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://toscrape.com", "screenshot": True, }, ) screenshot: bytes = b64decode(api_response.json()["screenshot"]) ``` #### Python client ```python import asyncio from base64 import b64decode from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://toscrape.com", "screenshot": True, } ) screenshot = b64decode(api_response["screenshot"]) with open("screenshot.jpg", "wb") as f: f.write(screenshot) asyncio.run(main()) ``` #### Scrapy ```python from base64 import b64decode from scrapy import Request, Spider class ToScrapeComSpider(Spider): name = "toscrape_com" async def start(self): yield Request( "https://toscrape.com", meta={ "zyte_api_automap": { "screenshot": True, }, }, ) def parse(self, response): screenshot: bytes = b64decode(response.raw_api_response["screenshot"]) ``` Output: ![](zyte-api/usage/code-examples/output/screenshot.jpg) See zapi-screenshot. 
#### Consuming scroll-based pagination This is how you use browser automation tools to load a webpage on a web browser, scroll to the bottom in a loop until it stops loading more content, and get the resulting [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) rendered as HTML: #### Playwright > ###### NOTE > > This example uses JavaScript with [Playwright](https://playwright.dev/) for browser automation and > [cheerio](https://github.com/cheeriojs/cheerio) for HTML parsing. ```js const cheerio = require('cheerio') const playwright = require('playwright') async function main () { const browser = await playwright.chromium.launch() const page = await browser.newPage() await page.goto('https://quotes.toscrape.com/scroll') await page.evaluate(async () => { const scrollInterval = setInterval( function () { const scrollingElement = (document.scrollingElement || document.body) scrollingElement.scrollTop = scrollingElement.scrollHeight }, 100 ) let previousHeight = null while (true) { const currentHeight = window.innerHeight + window.scrollY if (!previousHeight) { previousHeight = currentHeight await new Promise(resolve => setTimeout(resolve, 500)) } else if (previousHeight === currentHeight) { clearInterval(scrollInterval) break } else { previousHeight = currentHeight await new Promise(resolve => setTimeout(resolve, 500)) } } }) const $ = cheerio.load(await page.content()) const quoteCount = $('.quote').length await browser.close() } main() ``` #### Puppeteer > ###### NOTE > > This example uses JavaScript with [Puppeteer](https://pptr.dev/) for browser > automation and [cheerio](https://github.com/cheeriojs/cheerio) for HTML parsing. 
```js const cheerio = require('cheerio') const puppeteer = require('puppeteer') async function main () { const browser = await puppeteer.launch() const page = await browser.newPage() await page.goto('https://quotes.toscrape.com/scroll') await page.evaluate(async () => { const scrollInterval = setInterval( function () { const scrollingElement = (document.scrollingElement || document.body) scrollingElement.scrollTop = scrollingElement.scrollHeight }, 100 ) let previousHeight = null while (true) { const currentHeight = window.innerHeight + window.scrollY if (!previousHeight) { previousHeight = currentHeight await new Promise(resolve => setTimeout(resolve, 500)) } else if (previousHeight === currentHeight) { clearInterval(scrollInterval) break } else { previousHeight = currentHeight await new Promise(resolve => setTimeout(resolve, 500)) } } }) const $ = cheerio.load(await page.content()) const quoteCount = $('.quote').length await browser.close() } main() ``` #### scrapy-playwright > ###### NOTE > > This example uses [scrapy-playwright](https://github.com/scrapy-plugins/scrapy-playwright). 
```python from asyncio import sleep from scrapy import Request, Spider class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" async def start(self): yield Request( "https://quotes.toscrape.com/scroll", meta={ "playwright": True, "playwright_include_page": True, }, ) # Based on https://stackoverflow.com/a/69193325 async def scroll_to_bottom(self, page): await page.evaluate( """ var scrollInterval = setInterval( function () { var scrollingElement = (document.scrollingElement || document.body); scrollingElement.scrollTop = scrollingElement.scrollHeight; }, 100 ); """ ) previous_height = None while True: current_height = await page.evaluate( "(window.innerHeight + window.scrollY)" ) if not previous_height: previous_height = current_height await sleep(0.5) elif previous_height == current_height: await page.evaluate("clearInterval(scrollInterval)") break else: previous_height = current_height await sleep(0.5) async def parse(self, response): page = response.meta["playwright_page"] await self.scroll_to_bottom(page) body = await page.content() response = response.replace(body=body) quote_count = len(response.css(".quote")) await page.close() ``` #### scrapy-splash > ###### NOTE > > This example uses [scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash). 
```python from scrapy import Spider from scrapy_splash import SplashRequest # Based on https://stackoverflow.com/a/40366442 SCROLL_TO_BOTTOM_LUA = """ function main(splash) local num_scrolls = 10 local scroll_delay = 0.1 local scroll_to = splash:jsfunc("window.scrollTo") local get_body_height = splash:jsfunc( "function() {return document.body.scrollHeight;}" ) assert(splash:go(splash.args.url)) for _ = 1, num_scrolls do scroll_to(0, get_body_height()) splash:wait(scroll_delay) end return splash:html() end """ class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" async def start(self): yield SplashRequest( "https://quotes.toscrape.com/scroll", endpoint="execute", args={"lua_source": SCROLL_TO_BOTTOM_LUA}, ) def parse(self, response): quote_count = len(response.css(".quote")) ``` #### Selenium > ###### NOTE > > This example uses [Selenium](https://www.selenium.dev/) with [Python bindings](https://pypi.org/project/selenium/) for browser > automation and [Parsel](https://parsel.readthedocs.io/en/latest/) for HTML parsing. 
```python from time import sleep from parsel import Selector from selenium import webdriver # Based on https://stackoverflow.com/a/69193325 def scroll_to_bottom(driver): driver.execute_script( """ window.scrollInterval = setInterval( function () { var scrollingElement = (document.scrollingElement || document.body); scrollingElement.scrollTop = scrollingElement.scrollHeight; }, 100 ); """ ) previous_height = None while True: current_height = driver.execute_script( "return window.innerHeight + window.scrollY" ) if not previous_height: previous_height = current_height sleep(0.5) elif previous_height == current_height: driver.execute_script("clearInterval(window.scrollInterval)") break else: previous_height = current_height sleep(0.5) driver = webdriver.Firefox() driver.get("https://quotes.toscrape.com/scroll") scroll_to_bottom(driver) selector = Selector(driver.page_source) quote_count = len(selector.css(".quote")) driver.close() ``` #### Splash > ###### NOTE > > This example uses Python with [Splash](https://splash.readthedocs.io/en/stable/) for browser automation, > [requests](https://requests.readthedocs.io/en/latest/) to use the HTTP API of Splash, and [Parsel](https://parsel.readthedocs.io/en/latest/) for HTML parsing. 
```python from urllib.parse import quote import requests from parsel import Selector # Based on https://stackoverflow.com/a/40366442 SCROLL_TO_BOTTOM_LUA = """ function main(splash) local num_scrolls = 10 local scroll_delay = 0.1 local scroll_to = splash:jsfunc("window.scrollTo") local get_body_height = splash:jsfunc( "function() {return document.body.scrollHeight;}" ) assert(splash:go(splash.args.url)) for _ = 1, num_scrolls do scroll_to(0, get_body_height()) splash:wait(scroll_delay) end return splash:html() end """ splash_url = "YOUR_SPLASH_URL" url = "https://quotes.toscrape.com/scroll" response = requests.get( f"{splash_url}/execute?url={quote(url)}&lua_source={quote(SCROLL_TO_BOTTOM_LUA)}" ) selector = Selector(text=response.content.decode()) quote_count = len(selector.css(".quote")) ``` And this is how you do it using Zyte API: > ###### NOTE > > Install and configure code example requirements and > the Zyte CA certificate to run the example below. #### C# ```cs using System.Collections.Generic; using System.Net; using System.Net.Http; using System.Text; using System.Text.Json; using System.Threading.Tasks; using HtmlAgilityPack; HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.All }; HttpClient client = new HttpClient(handler); var apiKey = "YOUR_ZYTE_API_KEY"; var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":"); var auth = System.Convert.ToBase64String(bytes); client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth); client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate"); var input = new Dictionary<string, object>(){ {"url", "https://quotes.toscrape.com/scroll"}, {"browserHtml", true}, { "actions", new List<Dictionary<string, object>>() { new Dictionary<string, object>() { {"action", "scrollBottom"} } } } }; var inputJson = JsonSerializer.Serialize(input); var content = new StringContent(inputJson, Encoding.UTF8, "application/json"); HttpResponseMessage response = await 
client.PostAsync("https://api.zyte.com/v1/extract", content); var body = await response.Content.ReadAsByteArrayAsync(); var data = JsonDocument.Parse(body); var browserHtml = data.RootElement.GetProperty("browserHtml").ToString(); var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(browserHtml); var navigator = htmlDocument.CreateNavigator(); var quoteCount = (double)navigator.Evaluate("count(//*[@class=\"quote\"])"); ``` #### CLI client input.jsonl ```json {"url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{"action": "scrollBottom"}]} ``` ```shell zyte-api input.jsonl \ | jq --raw-output .browserHtml \ | xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null ``` #### curl input.json ```json { "url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [ { "action": "scrollBottom" } ] } ``` ```shell curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data @input.json \ --compressed \ https://api.zyte.com/v1/extract \ | jq --raw-output .browserHtml \ | xmllint --html --xpath 'count(//*[@class="quote"])' - 2> /dev/null ``` #### Java ```java import com.google.common.collect.ImmutableMap; import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonParser; import java.io.IOException; import java.nio.charset.StandardCharsets; import java.util.Base64; import java.util.Collections; import java.util.Map; import org.apache.hc.client5.http.classic.methods.HttpPost; import org.apache.hc.client5.http.impl.classic.CloseableHttpClient; import org.apache.hc.client5.http.impl.classic.HttpClients; import org.apache.hc.core5.http.ContentType; import org.apache.hc.core5.http.HttpEntity; import org.apache.hc.core5.http.HttpHeaders; import org.apache.hc.core5.http.ParseException; import org.apache.hc.core5.http.io.entity.EntityUtils; import org.apache.hc.core5.http.io.entity.StringEntity; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; class Example { 
private static final String API_KEY = "YOUR_ZYTE_API_KEY"; public static void main(final String[] args) throws InterruptedException, IOException, ParseException { Map action = ImmutableMap.of("action", "scrollBottom"); Map parameters = ImmutableMap.of( "url", "https://quotes.toscrape.com/scroll", "browserHtml", true, "actions", Collections.singletonList(action)); String requestBody = new Gson().toJson(parameters); HttpPost request = new HttpPost("https://api.zyte.com/v1/extract"); request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON); request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"); request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader()); request.setEntity(new StringEntity(requestBody)); CloseableHttpClient client = HttpClients.createDefault(); client.execute( request, response -> { HttpEntity entity = response.getEntity(); String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8); JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject(); String browserHtml = jsonObject.get("browserHtml").getAsString(); Document document = Jsoup.parse(browserHtml); int quoteCount = document.select(".quote").size(); System.out.println(quoteCount); return null; }); } private static String buildAuthHeader() { String auth = API_KEY + ":"; String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes()); return "Basic " + encodedAuth; } } ``` #### JS ```js const axios = require('axios') const cheerio = require('cheerio') axios.post( 'https://api.zyte.com/v1/extract', { url: 'https://quotes.toscrape.com/scroll', browserHtml: true, actions: [ { action: 'scrollBottom' } ] }, { auth: { username: 'YOUR_ZYTE_API_KEY' } } ).then((response) => { const browserHtml = response.data.browserHtml const $ = cheerio.load(browserHtml) const quoteCount = $('.quote').length }) ``` #### PHP ```php <?php $client = new GuzzleHttp\Client(); $response = $client->request('POST', 'https://api.zyte.com/v1/extract', [ 'auth' => ['YOUR_ZYTE_API_KEY', ''], 'headers' => ['Accept-Encoding' 
=> 'gzip'], 'json' => [ 'url' => 'https://quotes.toscrape.com/scroll', 'browserHtml' => true, 'actions' => [ ['action' => 'scrollBottom'], ], ], ]); $data = json_decode($response->getBody()); $doc = new DOMDocument(); $doc->loadHTML($data->browserHtml); $xpath = new DOMXPath($doc); $quote_count = $xpath->query("//*[@class='quote']")->count(); ``` #### Python ```python import requests from parsel import Selector api_response = requests.post( "https://api.zyte.com/v1/extract", auth=("YOUR_ZYTE_API_KEY", ""), json={ "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, ) browser_html = api_response.json()["browserHtml"] quote_count = len(Selector(browser_html).css(".quote")) ``` #### Python client ```python import asyncio from parsel import Selector from zyte_api import AsyncZyteAPI async def main(): client = AsyncZyteAPI() api_response = await client.get( { "url": "https://quotes.toscrape.com/scroll", "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, ) browser_html = api_response["browserHtml"] quote_count = len(Selector(browser_html).css(".quote")) print(quote_count) asyncio.run(main()) ``` #### Scrapy ```python from scrapy import Request, Spider class QuotesToScrapeComSpider(Spider): name = "quotes_toscrape_com" async def start(self): yield Request( "https://quotes.toscrape.com/scroll", meta={ "zyte_api_automap": { "browserHtml": True, "actions": [ { "action": "scrollBottom", }, ], }, }, ) def parse(self, response): quote_count = len(response.css(".quote")) ``` Output: ```none 100 ``` See zapi-actions. ## Migrating from Bright Data Web Unlocker to Zyte API Learn how to migrate from [Bright Data Web Unlocker](https://brightdata.com/products/web-unlocker) to Zyte API. 
### Feature comparison The following table summarizes the feature differences between both products: | Feature | Zyte API | Web Unlocker | |-----------------|---------------|----------------| | API | HTTP or proxy | Proxy | | Browser HTML | Yes | No | | Screenshots | Yes | No | | Browser actions | Yes | No | | Network capture | Yes | No | ### Proxy mode Zyte API offers a proxy mode, which makes it easier to migrate from Bright Data Web Unlocker. > ###### NOTE > > Before you decide whether to use the proxy mode or the HTTP API, learn their differences. To migrate, update your proxy endpoint and authentication, and migrate your request parameters. ### HTTP API When migrating from Bright Data Web Unlocker to the HTTP API of Zyte API, the main challenge is switching from a proxy API to an HTTP API. To read its rich output data, you need JSON parsing and sometimes base64-decoding. For example, to get the same output as the following Web Unlocker request: ```bash curl \ --proxy zproxy.lum-superproxy.io:22225 \ --proxy-user lum-customer-YOUR_USER-zone-YOUR_ZONE:YOUR_PASSWORD \ https://toscrape.com/ ``` Use Zyte API as follows: ```bash curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --compressed \ --data '{"url": "https://toscrape.com", "httpResponseBody": true}' \ https://api.zyte.com/v1/extract \ | jq --raw-output .httpResponseBody \ | base64 --decode ``` See zapi-usage for richer Zyte API examples, covering more scenarios and features. See also unlocker-params to migrate your request parameters. ### Parameter mapping If your Web Unlocker requests set the `country` parameter, migrate them as follows: #### Proxy mode Use the zyte-geolocation request header. 
For example, replace: ```bash curl … --proxy-user lum-customer-YOUR_USER-zone-YOUR_ZONE-country-us:YOUR_PASSWORD … ``` With: ```bash curl … -H Zyte-Geolocation:US … ``` #### HTTP API Use the [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) request field. For example, replace: ```bash curl … --proxy-user lum-customer-YOUR_USER-zone-YOUR_ZONE-country-us:YOUR_PASSWORD … ``` With: ```bash curl … --data '{…, "geolocation": "US", …}' … ``` ## Migrating from ScrapingBee to Zyte API Learn how to migrate from [ScrapingBee](https://www.scrapingbee.com/) to Zyte API. ### Feature comparison The following table summarizes the feature differences between both products: | Feature | ScrapingBee | Zyte API | |---------------------------------|---------------------------------------|-------------------------------------------------------------------------------------------------------| | Client software | Python, NodeJS | Python, Scrapy | | Pricing | Fixed plans | Pay as you go Monthly commitment over $100 | | Ban avoidance | Manual, may increase costs | Automatic, no extra costs | | Automatic extraction | Google SERP, custom LLM prompts | Standard schemas including Google SERP, custom LLM prompts | | Geolocation | 243 countries, no data center support | 249 countries, data center support | | Sessions | Client-managed only (5m) | Client-managed (15m) and server-managed | | Actions | Basic only (9) | Basic (15), advanced, website-specific and custom | | Screenshots | Yes, can target an element | Yes, cannot target an element | | Body size limit | 2 MB | 10 MB | | Custom headers | Yes | Only in HTTP requests, limited to `Referer` in browser requests, cannot disable ban-avoidance headers | | Ad blocking | Yes | No | | Resource blocking | Yes | No | | Custom proxies | Yes | No | | Server-side CSS/XPath selectors | Yes | No | | Rate limiting | Concurrency-based | RPM-based | | Usage API | Yes, up to 6 requests per second | Yes, up 
to 20 requests per second | #### Pricing ScrapingBee offers 4 plans with a fixed price per month, each with a fixed number of “credits” that you must spend within that month or lose. With Zyte API you pay only for what you use, up to a $100 monthly spending limit. If you need a higher spending limit, you must commit to paying half of it as a monthly commitment, which you do not get back if you spend less during a month. With ScrapingBee, HTTP requests cost 1 credit each, while browser requests cost 5 credits each. If you need to use device residential IPs (“premium proxies”) to avoid bans, costs rise to 10 credits per HTTP request (10×) and 25 credits per browser request (5×). For scenarios where device residential IPs do not avoid bans either, ScrapingBee offers special “stealth” proxies for browser requests at 75 credits per request (15×). ScrapingBee also charges 20 credits when targeting Google domains. With Zyte API, request cost varies depending not only on the type of request (HTTP or browser), but also on the tier of the target website, which covers the cost of any tech that Zyte API may use to get you a ban-free response, including browser rendering and device residential IPs. There is no extra cost for Google domains, not even for automatic extraction of SERP ([serp](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/serp)). Unless you never use premium or stealth proxies, you target mostly high-tier websites, and the number of credits that you need per month is close to the number included in one of ScrapingBee’s plans, **Zyte API tends to be a cheaper choice**. For example, the $49 ScrapingBee plan includes 150k credits, i.e. 150k HTTP requests. For tier 1-2 websites (i.e. most websites), Zyte API is cheaper. And Zyte API can also be cheaper for higher-tier websites if you need fewer than 150k requests: 114k requests for tier 3, 70k requests for tier 4, and 39k requests for tier 5. 
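The credit arithmetic above can be made concrete with a short sketch. It only uses the per-request credit costs and the 150k-credit plan quoted in this section; the dictionary keys and the function name are illustrative labels, not ScrapingBee or Zyte API identifiers.

```python
# Credits charged per request, as quoted in the Pricing section above.
# The keys are illustrative labels, not ScrapingBee parameter names.
CREDITS_PER_REQUEST = {
    "http": 1,
    "browser": 5,
    "http_premium": 10,
    "browser_premium": 25,
    "browser_stealth": 75,
}


def requests_per_plan(plan_credits: int) -> dict[str, int]:
    """Return how many requests of each kind a monthly credit budget covers."""
    return {
        kind: plan_credits // cost
        for kind, cost in CREDITS_PER_REQUEST.items()
    }


if __name__ == "__main__":
    # 150k credits is the budget of the $49 plan mentioned above.
    for kind, count in requests_per_plan(150_000).items():
        print(f"{kind}: {count} requests")
```

The same budget that covers 150k plain HTTP requests covers far fewer requests once browser rendering or premium or stealth proxies are needed, which is why the comparison with Zyte API depends on the mix of requests you actually send.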
#### Ban handling ScrapingBee makes it your responsibility to choose the right technologies (browser rendering, device residential IPs, “stealth” proxies) to avoid bans, with the corresponding cost increase. Zyte API automatically and transparently chooses the leanest technology possible, at no extra cost, and adapts automatically to website changes. #### Automatic extraction ScrapingBee supports automatic extraction through user-defined LLM prompts. Zyte API provides automatic extraction for supported types *and* user-defined LLM prompts to extract additional fields. Both ScrapingBee and Zyte API support Google SERP extraction ([serp](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/serp)). #### Rate limiting ScrapingBee limits the number of concurrent requests that you can send, starting at 5 with the most basic plan. Zyte API limits the number of requests per minute (RPM) that you can send. It is 3000 by default for all Zyte API keys, but you can request a higher limit. Services like these support advanced features, such as browser rendering or automatic extraction, that usually increase response times. RPM rate limiting allows you to maintain your throughput regardless of which features you use, thanks to unlimited concurrency, while concurrency-based limits slow down your crawls as you use features that make requests slower. For example, assuming an HTTP request takes 2 seconds and a browser request takes 20 seconds, switching from HTTP requests to browser requests with ScrapingBee would make your crawl 10 times slower, while Zyte API would allow you to maintain a similar crawl speed by using more concurrent requests to make up for the response time increase. ### Migrating The main differences between the HTTP APIs of ScrapingBee and Zyte API are how request parameters are defined and how the response is encoded. 
In **ScrapingBee**, you send a `GET` request, and you specify parameters in the URL query string, URL-encoded, e.g. ```bash curl "https://app.scrapingbee.com/api/v1/?api_key=YOUR_SCRAPINGBEE_API_KEY&url=https%3A%2F%2Ftoscrape.com" ``` The API response body comes straight from the target website: ```html Scraping Sandbox … ``` HTTP response headers and cookies from the target website are also received as regular headers and cookies, only prefixed with `Spb-`. ```none Spb-Content-Encoding: br Spb-Content-Type: text/html ``` In **Zyte API**, you send a `POST` request, and you specify parameters in the request body as JSON, e.g. > ###### TIP > > Like ScrapingBee, Zyte API offers a proxy mode > that you can use instead of the HTTP API if it makes things simpler. ```bash curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data '{"url": "https://toscrape.com", "httpResponseBody": true, "httpResponseHeaders": true}' \ --compressed \ https://api.zyte.com/v1/extract ``` The API response is a JSON object with all the response data from the target website: ```json { "url": "https://toscrape.com/", "statusCode": 200, "httpResponseBody": "PCFET0NUWVBFIGh0bWw+CjxodG1sIGxhbmc9ImVuIj4KICAgIDx…", "httpResponseHeaders": [ { "name": "content-type", "value": "text/html" }, { "name": "content-encoding", "value": "br" } ] } ``` > ###### NOTE > > [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) is base64-encoded to support binary > responses, like images or PDF files. Once you understand how to migrate a simple request like the one above, you can migrate any other request the same way, replacing ScrapingBee parameters with Zyte API counterparts. 
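When porting code that used to read the body and headers straight from the ScrapingBee response, a small helper can turn the Zyte API JSON object back into that shape. The following is a minimal sketch, assuming a response that already went through JSON parsing and has the structure shown above; the function name `decode_zyte_response` and the sample payload are hypothetical, and the sample body is generated locally rather than taken from a real response.

```python
from base64 import b64decode, b64encode


def decode_zyte_response(api_response: dict) -> tuple[bytes, dict[str, str]]:
    """Split a parsed Zyte API response into raw body bytes and a header mapping."""
    # httpResponseBody is base64-encoded, so decode it to get the original bytes.
    body = b64decode(api_response["httpResponseBody"])
    # httpResponseHeaders is a list of {"name": ..., "value": ...} objects.
    headers = {
        header["name"]: header["value"]
        for header in api_response.get("httpResponseHeaders", [])
    }
    return body, headers


# Hypothetical sample with the same shape as the response shown above.
sample = {
    "url": "https://toscrape.com/",
    "statusCode": 200,
    "httpResponseBody": b64encode(b"<html></html>").decode(),
    "httpResponseHeaders": [
        {"name": "content-type", "value": "text/html"},
    ],
}

body, headers = decode_zyte_response(sample)
print(body.decode())
print(headers["content-type"])
```

In real code, `sample` would be the result of `api_response.json()` from a `requests.post()` call like the ones shown in the Python examples of this documentation. Collapsing headers into a dictionary keeps only the last value of a repeated header name, which may or may not suit your use case.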
### Parameter mapping | ScrapingBee | Zyte API | |---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | (default) | [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody), [httpResponseHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseHeaders) | | `api_key` | Use basic authentication | | `url` | [url](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/url) | | `render_js` | [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/browserHtml) | | `js_scenario` | See below | | `wait` | `waitForTimeout` action (see below) | | `wait_for` | `waitForSelector` action (see below) | | `wait_browser` | `waitForNavigation` action (see below) | | `block_ads` | Not supported | | `block_resources` | Not supported | | `viewport_width` | [viewport](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/viewport) | | `window_height` | [viewport](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/viewport) | | `premium_proxy` | [ipType=residential](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/ipType) (not required to avoid bans) | | `country_code` | [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) (does not require [ipType=residential](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/ipType)) | | `stealth_proxy` | N/A, ban avoidance is a transparent feature | | `own_proxy` | Not supported | | `forward_headers` | 
[customHttpRequestHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customHttpRequestHeaders), [requestHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestHeaders) | | `forward_headers_pure` | Not supported | | `ai_query` | [customAttributes](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributes) | | `ai_selector` | Not supported | | `ai_extract_rules` | [customAttributes](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributes) | | `extract_rules` | Not supported | | `screenshot` | [screenshot](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/screenshot) | | `screenshot_selector` | Not supported | | `screenshot_full_page` | [screenshotOptions.fullPage=true](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/screenshotOptions.fullPage) | | `json_response` | See zapi-network-capture | | `return_page_source` | Not supported (use [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody) if you are only using browser rendering to avoid bans) | | `scraping_config` | Not supported | | `session_id` | [session.id](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/session.id) (must be UUID4) | | `timeout` | Not supported | | `cookies` | [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies) | | `device` | [device](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/device) | | `custom_google` | N/A | | `transparent_status_code` | N/A, Zyte API returns the response or not based on whether or not it is a ban, not based on the status code | ### Action mapping ScrapingBee allows defining a sequence of browser actions through the `"instructions"` JSON array of the `js_scenario` parameter. 
For example: ```json { "instructions": [ {"click": "#buttonId"} ] } ``` Which URL-encoded would become: ```none js_scenario=%7B%22instructions%22%3A+%5B%7B%22click%22%3A+%22%23buttonId%22%7D%5D%7D ``` The Zyte API equivalent is the [actions](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/actions) field. The following is a matching example: ```json { "actions": [ { "action": "click", "selector": { "type": "css", "value": "#buttonId" } } ] } ``` These are ScrapingBee actions and their Zyte API counterparts: `click`: `click` `evaluate`: `evaluate` `fill`: `type` `infinite_scroll`: `scrollBottom` `scroll_x`: `scrollTo` `scroll_y`: `scrollTo` `wait`: `waitForTimeout` `wait_for`: `waitForSelector` `wait_for_and_click`: `waitForSelector`, `click` The following Zyte API actions are not supported by ScrapingBee: `doubleClick` `goto` `hide` `hover` `keyPress` `reload` `searchKeyword` `select` `setLocation` `waitForRequest` `waitForResponse` Zyte API also supports custom actions. ## Migrating from scrapy-zyte-smartproxy to scrapy-zyte-api This migration guide provides the steps necessary to migrate from scrapy-zyte-smartproxy or scrapy-crawlera to scrapy-zyte-api. > ###### NOTE > > If you use Smart Proxy Manager, see > spm-migrate for general migration information. ### Maybe keep scrapy-zyte-smartproxy If you use scrapy-zyte-smartproxy for Scrapy integration with Smart Proxy Manager, and you only want to migrate to Zyte API to enjoy better ban avoidance or pricing, you can continue using scrapy-zyte-smartproxy: scrapy-zyte-smartproxy 2.3.1 and higher support the proxy mode of Zyte API. > ###### TIP > > If you are using scrapy-crawlera, you would need to migrate to > scrapy-zyte-smartproxy to use Zyte API proxy mode. > See the release notes of scrapy-zyte-smartproxy 2.0.0 for details. It might be worth migrating to > scrapy-zyte-api instead. 
To switch from Smart Proxy Manager to the proxy mode of Zyte API, replace your Smart Proxy Manager API key with [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access), and set the ZYTE_SMARTPROXY_URL setting to `"http://api.zyte.com:8011"`. Alternatively, you can enable Zyte API proxy mode for specific requests. You should also add 520 and 521 to the `RETRY_HTTP_CODES` setting: settings.py ```python from scrapy.settings.default_settings import RETRY_HTTP_CODES as DEFAULT_RETRY_HTTP_CODES RETRY_HTTP_CODES = DEFAULT_RETRY_HTTP_CODES + [520, 521] ``` scrapy-zyte-smartproxy will automatically translate Smart Proxy Manager headers into their Zyte API counterparts where possible, and drop them when not. But you should eventually update your headers, see spm-migrate-map. Using scrapy-zyte-smartproxy for Zyte API makes it easier to migrate from Smart Proxy Manager. However, the proxy mode of Zyte API has feature differences with the HTTP API. Continue reading to learn how to migrate to scrapy-zyte-api. > ###### TIP > > You can keep both scrapy-zyte-smartproxy and scrapy-zyte-api, and use > one or the other for different requests or spiders. ### Set up scrapy-zyte-api 1. You need Python 3.8 or higher to use the latest version of scrapy-zyte-api. 2. You need Scrapy 2.0.1 or higher to use the latest version of scrapy-zyte-api. If you are using a lower version of Scrapy, please upgrade to a higher Scrapy version, and make sure your code works as expected with the newer Scrapy version before you continue the migration process. The [Scrapy release notes](https://docs.scrapy.org/en/latest/news.html) of every Scrapy version cover backward-incompatible changes and deprecation removals, which should help you upgrade your existing code as you upgrade Scrapy. 3. Install the latest version of scrapy-zyte-api: ```bash pip install --upgrade scrapy-zyte-api ``` 4. Configure scrapy-zyte-api in your `settings.py` file. 
If your Scrapy version is 2.10 or higher, add the following settings: ```python ADDONS = { "scrapy_zyte_api.Addon": 500, } ZYTE_API_TRANSPARENT_MODE = False ``` Otherwise add the following settings: ```python DOWNLOAD_HANDLERS = { "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler", "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler", } DOWNLOADER_MIDDLEWARES = { "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000, } REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter" SPIDER_MIDDLEWARES = { "scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware": 100, } TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" ``` If any of these settings already exists in your `settings.py` file, modify the existing setting as needed instead of re-defining it. For example, if you already have `DOWNLOADER_MIDDLEWARES` defined, add `"scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,` to your existing definition, keeping existing downloader middlewares untouched. Also, make sure that these settings are not being overridden elsewhere. For example, make sure they are not defined in multiple lines of your `settings.py` file, and that they are not overridden in your [Scrapy Cloud project settings](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud). > ###### NOTE > > On projects that were not using the asyncio Twisted reactor, your > existing code may need changes, such as: > - [Handling a pre-installed Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#handling-a-pre-installed-reactor). > > Some Twisted imports install the default, non-asyncio Twisted > reactor as a side effect. Once a reactor is installed, it cannot be > changed for the whole run time. > - [Converting Twisted Deferreds into asyncio Futures](https://docs.scrapy.org/en/latest/topics/asyncio.html#awaiting-on-deferreds). 
> > Note that you might be using Deferreds without realizing it through > some Scrapy functions and methods. For example, when you yield the > return value of `self.crawler.engine.download()` from a spider > callback, you are yielding a Deferred. 5. Add [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access) to `settings.py` as well: ```python ZYTE_API_KEY = "YOUR_ZYTE_API_KEY" ``` 6. To enable cookie support, the `COOKIES_ENABLED` setting is not enough, you must also define an additional setting in `settings.py`: ```python ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED = True ``` ### Migrate Your next steps depend on how you want to approach your migration. You can migrate some requests, migrate some spiders, or migrate your entire project. #### Migrate a request Migrating requests makes sense if you want to keep scrapy-zyte-smartproxy but you need to drive specific requests through scrapy-zyte-api for features only available through the HTTP API. To migrate a Scrapy request, set the following fields in the request metadata: ```python yield Request( ..., meta={ "dont_proxy": True, "zyte_api_automap": True, }, ) ``` > ###### TIP > > If your spider stops with the `plugin_conflict` finish reason, make > sure the `ZYTE_API_TRANSPARENT_MODE` setting is `False`. Only set > `ZYTE_API_TRANSPARENT_MODE` to `True` when migrating an > entire spider or project. #### Migrate a spider Compared to migrating an entire project, migrating spiders one by one, incrementally, can be more time consuming, but also less disruptive, giving you time to validate the migration of each spider separately. 
To migrate a Scrapy spider, use [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) or [update_settings](https://docs.scrapy.org/en/master/topics/spiders.html#scrapy.Spider.update_settings) to toggle scrapy-zyte-smartproxy, scrapy-crawlera, and scrapy-zyte-api: ```python class MySpider(Spider): custom_settings = { "ZYTE_API_TRANSPARENT_MODE": True, "ZYTE_SMARTPROXY_ENABLED": False, "CRAWLERA_ENABLED": False, # Only needed if you use scrapy-crawlera } ``` You can look at the stats of a crawl after migration to check that the migration was successful: there should be `scrapy-zyte-api`-prefixed stats indicating scrapy-zyte-api usage, and there should be no scrapy-zyte-smartproxy stats, which are prefixed with either `zyte_smartproxy` (Smart Proxy Manager) or `zyte_api_proxy` (Zyte API), or scrapy-crawlera stats, which are prefixed with `crawlera`. #### Migrate a project To migrate a Scrapy project: 1. Disable scrapy-zyte-smartproxy or scrapy-crawlera. scrapy-zyte-smartproxy is enabled through the `ZYTE_SMARTPROXY_ENABLED` setting. scrapy-crawlera through `CRAWLERA_ENABLED`. To disable, find where you define that setting (e.g. `settings.py`, Scrapy Cloud settings), and remove it. Also, make sure you are not enabling those settings on specific spiders, e.g. through the [custom_settings](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.custom_settings) class attribute of a spider class, or in your cloud (e.g. in Scrapy Cloud, which allows overriding settings for specific spiders). 2. Configure Zyte API to run in transparent mode. If you use `scrapy_zyte_api.Addon`, remove the `ZYTE_API_TRANSPARENT_MODE = False` line from `settings.py`. The add-on enables transparent mode automatically. 
If you do *not* use `scrapy_zyte_api.Addon`, add the following line to `settings.py`: ```python ZYTE_API_TRANSPARENT_MODE = True ``` To check that the migration was successful, you can either check stats for each spider or remove scrapy-zyte-smartproxy and scrapy-crawlera. #### Remove proxy headers Regardless of whether you are migrating only some spiders or your whole project, review the code of requests that now go through Zyte API to look for proxy headers, i.e. those prefixed with `X-Crawlera-` or `Zyte-` (case-insensitive), and replace them with Zyte API counterparts according to this table. > ###### TIP > > You can usually find Scrapy requests by searching your code for uses > of the `Request` class, but mind that there are other > ways to create requests, including: `request.copy()`, `request.replace()`, `request.from_curl()`, > `request_from_dict()`, `response.follow()` and `response.follow_all()`. You can specify those parameters through a `zyte_api_automap` dictionary in request metadata. For example, to set the [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) of a request to the USA: ```python yield Request( ..., meta={ "zyte_api_automap": { "geolocation": "US", }, }, ) ``` For details, see automap. #### Handle retries scrapy-zyte-api implements an advanced retry mechanism, with a default retry policy that should work for most scenarios. If retries for ban responses are being exceeded and you want to increase retries, or if you want to retry permanent download errors, you can try switching to the aggressive retry policy: settings.py ```python ZYTE_API_RETRY_POLICY = "zyte_api.aggressive_retrying" ``` You can also create a custom retry policy, see the reference documentation of `RetryFactory` and `AggressiveRetryFactory` for examples. When retries are exceeded for a given request, an exception is raised, and if not caught, an error message is logged. 
See retry-non-successful to learn how to handle such exceptions.

#### Adjust the crawl speed

If your crawl speed drops significantly after migrating:

1. Ensure that you are not setting a `DOWNLOAD_DELAY`, which scrapy-zyte-smartproxy and scrapy-crawlera ignore, but scrapy-zyte-api respects.

   If you need to keep a download delay for some domains, you can use the `DOWNLOAD_SLOTS` setting. Note that requests sent through scrapy-zyte-api use a different slot, prefixed with `zyte-api@` (e.g. `zyte-api@example.com`).

2. Increase the `CONCURRENT_REQUESTS` and `CONCURRENT_REQUESTS_PER_DOMAIN` settings as needed.

3. If a higher concurrency does not improve your crawl speed, the cause may be rate limiting; if the `scrapy-zyte-api/throttle_ratio` Scrapy stat is high, you may want to request a higher limit.

If your crawl speed increases too much after migrating:

- If the AutoThrottle Scrapy extension is enabled (i.e. `AUTOTHROTTLE_ENABLED` is `True`, as it is by default in Scrapy Cloud), scrapy-zyte-api bypasses the extension for Zyte API requests, to let Zyte API handle rate limiting on its own. Set the `ZYTE_API_PRESERVE_DELAY` setting to `True` to prevent scrapy-zyte-api from bypassing the extension.

#### Memory may increase

Zyte API HTTP response bodies are Base64-encoded, making them 33-37% larger, hence increasing memory usage.

If your spider runs out of memory after migration, consider:

- Increasing available memory. If you use Scrapy Cloud, use more units.
- Lowering `SCRAPER_SLOT_MAX_ACTIVE_SIZE` to a value that prevents exceeding available memory while allowing an acceptable crawl speed.

### Remove scrapy-zyte-smartproxy (optional)

Once you have migrated all your code and are happy with the result, you can remove scrapy-zyte-smartproxy and scrapy-crawlera:

```bash
pip uninstall scrapy-zyte-smartproxy scrapy-crawlera
```

And remove from your code and from Scrapy Cloud any related Scrapy setting, i.e.
those prefixed with either `ZYTE_SMARTPROXY_` or `CRAWLERA_`, including those that you used to disable scrapy-zyte-smartproxy in an earlier migration step (no need to disable something that is not installed anymore). ## Migrating from ZenRows to Zyte API Learn how to migrate from [ZenRows](https://www.zenrows.com/) to Zyte API. ### Feature comparison The following table summarizes the feature differences between both products: | Feature | Zyte API | ZenRows | |-------------------------------|--------------------------------------------------------|------------------------------------------------------------------| | API | HTTP or proxy | HTTP | | Client software | Python, Scrapy | Python, NodeJS | | Restricted website categories | No broad category restrictions [^1] | Banks, payment gateways, visas/permits, government | | Advanced ban avoidance | Always available, automatic | Only Business+, manual | | Automatic extraction | AI-powered, standard schemas | Undocumented website support, item type support or output schema | | Markdown output | No | Yes | | Geolocation | 249 countries, data center support | 190 countries, no data center support | | Sessions | Client-managed (15m) and server-managed | Client-managed only (10m), no cookies | | Actions | Basic (15), advanced, website-specific and custom | Basic only (10) | | Screenshots | JPEG/PNG, configurable viewport, cannot target element | PNG only, fixed viewport, can target element | | Network capture | Up to 5 MiB / 10 responses | Unlimited | | Network blocking | No | Yes | | JavaScript disabling | Yes | No | | Server-side CSS selectors | No | Yes | | Rate limiting | RPM-based | Concurrency-based | | Overuse handling | Rate-limiting responses | Rate-limiting responses followed by IP blocking | [^1]: Some specific websites may be blocked for legal or compliance reasons. 
#### Automatic extraction ZenRows supports automatic extraction, but their documentation does not provide details on supported websites, item types or output schemas. Zyte API automatic extraction is AI-based, i.e. it works on any website of a supported type (e.g. e-commerce, blogs/news, job postings), and we provide detailed documentation about output schemas. #### Sessions ZenRows only supports client-managed sessions, and limits them to 10 minutes. Moreover, their sessions do not maintain cookies, you must do that on the client side. Zyte API allows 15 minutes for client-managed sessions, but also supports server-managed sessions with much longer lifetimes and an easier API. Moreover, the Scrapy plugin supports an additional session management API. #### Screenshots Both ZenRows and Zyte API support PNG screenshots of the visible viewport or the full page. ZenRows allows taking a screenshot of a specific element. Zyte API allows configuring the browser [viewport](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/viewport). Zyte API can return both [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml) and [screenshot](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/screenshot) on the same request, i.e. get the browser HTML matching a given screenshot. In ZenRows you would need 2 separate requests, and the contents of each might not be a perfect match. #### Rate limiting ZenRows limits the number of concurrent requests that you can send, starting at 10 with the most basic plan. Zyte API limits the number of requests per minute (RPM) that you can send. It is 3000 by default for all Zyte API keys, but you can request a higher limit. 
Services like these support advanced features, such as browser rendering or automatic extraction, that usually increase response times. RPM rate limiting lets you maintain your throughput regardless of which features you use, thanks to unlimited concurrency. Concurrency-based limits, in contrast, slow down your crawls as you use features that make requests slower.

For example, assuming an HTTP request takes 2 seconds and a browser request takes 20 seconds, switching from HTTP requests to browser requests with ZenRows would make your crawl 10 times slower, while Zyte API would allow you to maintain a similar crawl speed by using more concurrent requests to make up for the response time increase.

#### Overuse handling

When you exceed your concurrency limit with ZenRows, they start by sending rate-limiting responses, but eventually they block your IP address for increasing amounts of time.

With Zyte API, reaching your rate limit is not only allowed, but encouraged, and you can request a higher limit if you need it.

### Migrating

The main differences between the HTTP APIs of ZenRows and Zyte API are how request parameters are defined and how the response is encoded.

In **ZenRows**, you send a `GET` request, and you specify parameters in the URL query string, URL-encoded, e.g.

```bash
curl "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY&url=https%3A%2F%2Ftoscrape.com"
```

The API response body comes straight from the target website:

```html
Scraping Sandbox …
```

HTTP response headers from the target website are also received as regular headers, only prefixed with `Zr-`, and the response URL (which might not match the request URL, e.g. in case of redirection) is received as the special `Zr-Final-Url` header:

```none
Zr-Content-Encoding: br
Zr-Content-Type: text/html
Zr-Final-Url: https://toscrape.com/
```

In **Zyte API**, you send a `POST` request, and you specify parameters in the request body as JSON, e.g.
```bash curl \ --user YOUR_ZYTE_API_KEY: \ --header 'Content-Type: application/json' \ --data '{"url": "https://toscrape.com", "httpResponseBody": true, "httpResponseHeaders": true}' \ --compressed \ https://api.zyte.com/v1/extract ``` The API response is a JSON object with all the response data from the target website: ```json { "url": "https://toscrape.com/", "statusCode": 200, "httpResponseBody": "PCFET0NUWVBFIGh0bWw+CjxodG1sIGxhbmc9ImVuIj4KICAgIDx…", "httpResponseHeaders": [ { "name": "content-type", "value": "text/html" }, { "name": "content-encoding", "value": "br" } ] } ``` > ###### NOTE > > [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) is base64-encoded to support binary > responses, like images or PDF files. Once you understand how to migrate a simple request like the one above, you can migrate any other request the same way, replacing ZenRows parameters with Zyte API counterparts. ### Parameter mapping | ZenRows | Zyte API | |------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | (default) | [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody), [httpResponseHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseHeaders) | | `apikey` | Use basic authentication | | `url` | [url](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/url) | | `js_render` | [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/browserHtml) | | `custom_headers` | [customHttpRequestHeaders](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customHttpRequestHeaders) | | `premium_proxy` | 
[ipType=residential](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/ipType) (not required to avoid bans) | | `proxy_country` | [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) (does not require [ipType=residential](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/ipType)) | | `session_id` | [session.id](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/session.id) (must be UUID4) | | `device` | [device](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/device) | | `original_status` | N/A (see [statusCode](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/statusCode)) | | `allowed_status_codes` | N/A (see zapi-successful-responses-wrapping-bad-responses) | | `block_resources` | Not supported | | `json_response` | See zapi-network-capture | | `css_extractor` | Not supported | | `autoparse` | See zapi-extract | | `markdown_response` | Not supported | | `screenshot` | [screenshot](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/screenshot) | | `screenshot_fullpage` | [screenshotOptions.fullPage=true](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/screenshotOptions.fullPage) | | `screenshot_selector` | Not supported | For parameters defining browser actions, see zenrows-actions. ### Action mapping These are ZenRows actions and their Zyte API counterparts. `check`: `click` `click`: `click` `evaluate`: `evaluate` `fill`: `type` `scroll_x`: `scrollTo` `scroll_y`: `scrollTo` `select_option`: `select` `uncheck`: `click` `wait`: `waitForTimeout` `wait_for`: `waitForSelector` `wait_for` only supports CSS selectors, while `waitForSelector` also supports XPath selectors. 
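As an illustration of the selector difference, the sketch below builds a Zyte API request body with two `waitForSelector` actions, one using a CSS selector and one using an XPath selector. It is a hedged example rather than reference code: the helper `wait_for_selector` is hypothetical, the selector object follows the `{"type": ..., "value": ...}` shape used for `click` elsewhere in this documentation, the `"xpath"` type value is assumed to mirror `"css"`, and the selectors themselves are made up.

```python
import json

SUPPORTED_SELECTOR_TYPES = ("css", "xpath")  # "xpath" value is an assumption

def wait_for_selector(value: str, selector_type: str = "css") -> dict:
    """Build a waitForSelector action for the Zyte API `actions` field."""
    if selector_type not in SUPPORTED_SELECTOR_TYPES:
        raise ValueError(f"Unsupported selector type: {selector_type!r}")
    return {
        "action": "waitForSelector",
        "selector": {"type": selector_type, "value": value},
    }

# Request body to send as JSON in a POST request to the extract endpoint.
request_body = {
    "url": "https://toscrape.com",
    "browserHtml": True,
    "actions": [
        # Direct counterpart of a ZenRows `wait_for` with a CSS selector.
        wait_for_selector("h1"),
        # Only possible on the Zyte API side: an XPath selector.
        wait_for_selector("//a[contains(@href, 'books')]", "xpath"),
    ],
}

print(json.dumps(request_body, indent=2))
```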
ZenRows also has a `solve_captcha` action that requires you to specify which CAPTCHA you need to solve. Zyte API instead avoids bans automatically by default (no action necessary), and allows CAPTCHA management to be disabled through zapi-permissions-control.

The following Zyte API actions are not supported by ZenRows:

- `doubleClick`
- `goto`
- `hide`
- `hover`
- `keyPress`
- `reload`
- `scrollBottom`
- `searchKeyword`
- `setLocation`
- `waitForNavigation`
- `waitForRequest`
- `waitForResponse`

Zyte API also supports custom actions.

ZenRows actions have `frame_`-prefixed counterparts that work on [iframes](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe), and a utility action (`frame_reveal`) to inject iframe contents into the main DOM. On Zyte API you need to use custom actions to interact with iframes.

## Migrating from Smart Proxy Manager to Zyte API

Learn how to migrate from Smart Proxy Manager to Zyte API.

### Key differences

The following table summarizes the feature differences between both products:

| Feature                | Smart Proxy Manager | Zyte API                                        |
|------------------------|---------------------|-------------------------------------------------|
| API                    | Proxy               | HTTP or proxy                                   |
| Ban avoidance          | Good                | Great                                           |
| Device residential IPs | Add-on              | Automatic (configurable)                        |
| Session management     | Client-managed      | Server-managed or client-managed                |
| Geolocation            | Manual              | Automatic (configurable)                        |
| Browser HTML           | No                  | Yes (HTTP API only, proxy mode support planned) |
| Screenshots            | No                  | Yes (HTTP API only)                             |
| Browser actions        | No                  | Yes (HTTP API only)                             |
| Network capture        | No                  | Yes (HTTP API only)                             |
| HTTP redirection       | Not followed        | Followed by default, can be disabled            |
| User throttling        | Concurrency-based   | Request-based                                   |

See also spm-migrate-map below for some additional, lower-level differences.
#### Ban avoidance

Smart Proxy Manager does a good job of avoiding bans through proxy rotation, ban detection, retrying algorithms, and browser mimicking through browser profiles.

Zyte API improves on it by using an actual browser, if that is required to prevent bans on a particular website. Zyte API also supports webpage interaction.

#### Device residential IPs

Zyte API supports both static data center IPs and device residential IPs. It automatically chooses the right type of IP address as needed, but it also allows you to force a specific IP type.

#### Session management

While Smart Proxy Manager only supports client-managed sessions, Zyte API supports both client-managed sessions and server-managed sessions.

The main difference between the two implementations of client-managed sessions is that, in Zyte API, it is the client (you), not the server, who generates the session ID. See [session.id](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/session.id) for details.

Additionally, for some scenarios, using browser actions can remove the need for multiple requests with a shared session.

#### Geolocation

Both products let you choose which country of origin to use for a request. However, with Zyte API you usually do not need to manually choose which country of origin to use for each request, because Zyte API automatically chooses the best country of origin based on the target website.

Smart Proxy Manager does support a richer list of countries of origin that you can set manually. However, if you let Zyte API choose the right country of origin, it can use additional countries not available for manual override.

Smart Proxy Manager also allows defining an account with a set of geolocations, so that requests using that account pick a geolocation from that set. Zyte API does not support this: you can either omit the geolocation and let Zyte API choose the best geolocation, or set a specific geolocation for a given request.
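If you relied on a per-account set of geolocations, one possible workaround is to pick from the set on the client side and send the result as the `geolocation` of each request. The sketch below shows the idea under stated assumptions: `GEOLOCATION_POOL` and `build_request_body` are hypothetical names, the country codes are placeholders, and an empty pool simply omits the field so that Zyte API chooses the geolocation itself.

```python
import random

# Hypothetical pool replacing the set configured on a Smart Proxy Manager account.
GEOLOCATION_POOL = ("US", "GB", "DE")

def build_request_body(url: str, pool=GEOLOCATION_POOL, rng=random) -> dict:
    """Build a Zyte API request body, picking a geolocation from the pool.

    When the pool is empty, the geolocation field is omitted so that
    Zyte API chooses the country of origin on its own.
    """
    body = {"url": url, "httpResponseBody": True}
    if pool:
        body["geolocation"] = rng.choice(pool)
    return body

print(build_request_body("https://toscrape.com"))
print(build_request_body("https://toscrape.com", pool=()))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible in tests.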
For more information, see zapi-geolocation.

### Authentication

You cannot use your Smart Proxy Manager API key for Zyte API; you need to [get a separate API key to use Zyte API](https://app.zyte.com/o/zyte-api/api-access).

### Proxy mode

Zyte API offers a proxy mode, which makes it easier to migrate from Smart Proxy Manager.

> ###### NOTE
>
> Before you decide whether to use the proxy mode or the HTTP API, learn their differences.

To migrate, update your proxy endpoint and API key. You may also need to update some proxy headers as indicated below.

> ###### WARNING
>
> The proxy mode is not optimized for use in combination with
> browser automation tools. Consider using Zyte API’s browser
> automation features instead. See
> zapi-browser-automation.

The following example shows a basic request using Smart Proxy Manager:

#### C#

```cs
using System;
using System.IO;
using System.Net;
using System.Text;

var proxy = new WebProxy("http://proxy.zyte.com:8011", true);
proxy.Credentials = new NetworkCredential("YOUR_ZYTE_API_KEY", "");

var request = (HttpWebRequest)WebRequest.Create("https://toscrape.com");
request.Proxy = proxy;
request.PreAuthenticate = true;
request.AllowAutoRedirect = false;

var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
var reader = new StreamReader(stream);
var httpResponseBody = reader.ReadToEnd();
reader.Close();
response.Close();

Console.WriteLine(httpResponseBody);
```

#### curl

```bash
curl \
    --proxy proxy.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com
```

#### JS

```js
const axios = require('axios')

axios
  .get(
    'https://toscrape.com',
    {
      proxy: {
        protocol: 'http',
        host: 'proxy.zyte.com',
        port: 8011,
        auth: {
          username: 'YOUR_ZYTE_API_KEY',
          password: ''
        }
      }
    }
  )
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://toscrape.com', [
    'proxy' =>
        'http://YOUR_ZYTE_API_KEY:@proxy.zyte.com:8011',
]);
$http_response_body = (string) $response->getBody();
fwrite(STDOUT, $http_response_body);
```

#### Python

> ###### NOTE
>
> You need to install and configure our CA certificate for
> the requests library.

```python
import requests

response = requests.get(
    "https://toscrape.com",
    proxies={
        scheme: "http://YOUR_ZYTE_API_KEY:@proxy.zyte.com:8011"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Scrapy

After you install and configure [scrapy-zyte-smartproxy](https://github.com/scrapy-plugins/scrapy-zyte-smartproxy), you can use Scrapy as usual and all requests will be proxied through Smart Proxy Manager automatically.

```python
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        print(response.text)
```

And this is an identical request using the proxy mode of Zyte API:

#### C#

```cs
using System;
using System.Net;
using System.Net.Http;

var proxy = new WebProxy("http://api.zyte.com:8011", true);
proxy.Credentials = new NetworkCredential("YOUR_ZYTE_API_KEY", "");

var httpClientHandler = new HttpClientHandler
{
    Proxy = proxy,
};

var client = new HttpClient(handler: httpClientHandler, disposeHandler: true);
var message = new HttpRequestMessage(HttpMethod.Get, "https://toscrape.com");
var response = client.Send(message);
var body = await response.Content.ReadAsStringAsync();

Console.WriteLine(body);
```

#### curl

```bash
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com
```

#### Java

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hc.client5.http.auth.AuthCache;
import org.apache.hc.client5.http.auth.AuthScope;
import org.apache.hc.client5.http.auth.CredentialsProvider;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.auth.BasicAuthCache;
import org.apache.hc.client5.http.impl.auth.BasicScheme;
import org.apache.hc.client5.http.impl.auth.CredentialsProviderBuilder;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.routing.DefaultProxyRoutePlanner;
import org.apache.hc.client5.http.protocol.HttpClientContext;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHost;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;

class Example {
  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    HttpHost proxy = new HttpHost("api.zyte.com", 8011);
    DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
    CredentialsProvider credentialsProvider =
        CredentialsProviderBuilder.create()
            .add(new AuthScope(proxy), "YOUR_ZYTE_API_KEY", "".toCharArray())
            .build();
    AuthCache authCache = new BasicAuthCache();
    BasicScheme basicAuth = new BasicScheme();
    authCache.put(proxy, basicAuth);
    HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(credentialsProvider);
    context.setAuthCache(authCache);

    CloseableHttpClient client =
        HttpClients.custom()
            .setRoutePlanner(routePlanner)
            .setDefaultCredentialsProvider(credentialsProvider)
            .build();
    HttpGet request = new HttpGet("https://toscrape.com");
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String httpResponseBody = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          System.out.println(httpResponseBody);
          return null;
        });
  }
}
```

#### JS

```js
const axios = require('axios')

axios
  .get(
    'https://toscrape.com',
    {
      proxy: {
        protocol: 'http',
        host: 'api.zyte.com',
        port: 8011,
        auth: {
          username: 'YOUR_ZYTE_API_KEY',
          password: ''
        }
      }
    }
  )
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://toscrape.com', [
    'proxy' => 'http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011',
]);
$http_response_body = (string) $response->getBody();
fwrite(STDOUT, $http_response_body);
```

#### Python

> ###### NOTE
>
> You need to install and configure our CA certificate for
> the requests library.

```python
import requests

response = requests.get(
    "https://toscrape.com",
    proxies={
        scheme: "http://YOUR_ZYTE_API_KEY:@api.zyte.com:8011"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Ruby

```ruby
# frozen_string_literal: true

require 'net/http'

url = URI('https://toscrape.com/')
proxy_host = 'api.zyte.com'
proxy_port = '8011'

http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port, 'YOUR_ZYTE_API_KEY', '')
http.use_ssl = true

r = http.start do |h|
  h.request(Net::HTTP::Get.new(url))
end

puts r.body
```

#### Scrapy

When using [scrapy-zyte-smartproxy](https://github.com/scrapy-plugins/scrapy-zyte-smartproxy), set the `ZYTE_SMARTPROXY_URL` setting to `"http://api.zyte.com:8011"` and the `ZYTE_SMARTPROXY_APIKEY` setting to [your Zyte API key](https://app.zyte.com/o/zyte-api/api-access) for Zyte API.

> ###### NOTE
>
> **Important**: Use your **Zyte API key** here, not a Scrapy Cloud API key. Make sure you get this from the Zyte API access page.

Then you can continue using Scrapy as usual and all requests will be proxied through Zyte API automatically.

```python
from scrapy import Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        print(response.text)
```

### HTTP API

> ###### TIP
>
> If you are using [scrapy-zyte-smartproxy](https://github.com/scrapy-plugins/scrapy-zyte-smartproxy) (previously
> scrapy-crawlera), see scrapy-zyte-smartproxy-migrate for detailed
> migration steps.
When migrating from Smart Proxy Manager to the HTTP API of Zyte API, the main challenge is switching from a proxy API to an HTTP API. To read its rich output data, you need JSON parsing and sometimes base64-decoding.

The following example shows a basic request using Smart Proxy Manager:

#### C#

```cs
using System;
using System.IO;
using System.Net;
using System.Text;

var proxy = new WebProxy("http://proxy.zyte.com:8011", true);
proxy.Credentials = new NetworkCredential("YOUR_ZYTE_API_KEY", "");

var request = (HttpWebRequest)WebRequest.Create("https://toscrape.com");
request.Proxy = proxy;
request.PreAuthenticate = true;
request.AllowAutoRedirect = false;

var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
var reader = new StreamReader(stream);
var httpResponseBody = reader.ReadToEnd();
reader.Close();
response.Close();

Console.WriteLine(httpResponseBody);
```

#### curl

```bash
curl \
    --proxy proxy.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com
```

#### JS

```js
const axios = require('axios')

axios
  .get(
    'https://toscrape.com',
    {
      proxy: {
        protocol: 'http',
        host: 'proxy.zyte.com',
        port: 8011,
        auth: {
          username: 'YOUR_ZYTE_API_KEY',
          password: ''
        }
      }
    }
  )
  .then((response) => {
    const httpResponseBody = response.data
    console.log(httpResponseBody)
  })
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://toscrape.com', [
    'proxy' => 'http://YOUR_ZYTE_API_KEY:@proxy.zyte.com:8011',
]);
$http_response_body = (string) $response->getBody();
fwrite(STDOUT, $http_response_body);
```

#### Python

> ###### NOTE
>
> You need to install and configure our CA certificate for
> the requests library.
```python
import requests

response = requests.get(
    "https://toscrape.com",
    proxies={
        scheme: "http://YOUR_ZYTE_API_KEY:@proxy.zyte.com:8011"
        for scheme in ("http", "https")
    },
)
http_response_body: bytes = response.content
print(http_response_body.decode())
```

#### Scrapy

After you install and configure [scrapy-zyte-smartproxy](https://github.com/scrapy-plugins/scrapy-zyte-smartproxy), you can use Scrapy as usual and all requests will be proxied through Smart Proxy Manager automatically.

```python
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        print(response.text)
```

And this is an identical request using the HTTP API of Zyte API:

> ###### NOTE
>
> Install and configure code example requirements and
> the Zyte CA certificate to run the example below.

#### C#

```cs
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.All
};
HttpClient client = new HttpClient(handler);

var apiKey = "YOUR_ZYTE_API_KEY";
var bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(apiKey + ":");
var auth = System.Convert.ToBase64String(bytes);
client.DefaultRequestHeaders.Add("Authorization", "Basic " + auth);

client.DefaultRequestHeaders.Add("Accept-Encoding", "br, gzip, deflate");

var input = new Dictionary<string, object>(){
    {"url", "https://toscrape.com"},
    {"httpResponseBody", true}
};
var inputJson = JsonSerializer.Serialize(input);
var content = new StringContent(inputJson, Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync("https://api.zyte.com/v1/extract", content);
var body = await response.Content.ReadAsByteArrayAsync();

var data = JsonDocument.Parse(body);
var base64HttpResponseBody = data.RootElement.GetProperty("httpResponseBody").ToString();
var httpResponseBody = System.Convert.FromBase64String(base64HttpResponseBody);
```

#### CLI client

input.jsonl

```json
{"url": "https://toscrape.com", "httpResponseBody": true}
```

```shell
zyte-api input.jsonl \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    > output.html
```

#### curl

input.json

```json
{
    "url": "https://toscrape.com",
    "httpResponseBody": true
}
```

```shell
curl \
    --user YOUR_ZYTE_API_KEY: \
    --header 'Content-Type: application/json' \
    --data @input.json \
    --compressed \
    https://api.zyte.com/v1/extract \
    | jq --raw-output .httpResponseBody \
    | base64 --decode \
    > output.html
```

#### Java

```java
import com.google.common.collect.ImmutableMap;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import org.apache.hc.client5.http.classic.methods.HttpPost;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.ContentType;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHeaders;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.entity.StringEntity;

class Example {
  private static final String API_KEY = "YOUR_ZYTE_API_KEY";

  public static void main(final String[] args)
      throws InterruptedException, IOException, ParseException {
    Map<String, Object> parameters =
        ImmutableMap.of("url", "https://toscrape.com", "httpResponseBody", true);
    String requestBody = new Gson().toJson(parameters);

    HttpPost request = new HttpPost("https://api.zyte.com/v1/extract");
    request.setHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON);
    request.setHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate");
    request.setHeader(HttpHeaders.AUTHORIZATION, buildAuthHeader());
    request.setEntity(new StringEntity(requestBody));

    CloseableHttpClient client = HttpClients.createDefault();
    client.execute(
        request,
        response -> {
          HttpEntity entity = response.getEntity();
          String apiResponse = EntityUtils.toString(entity, StandardCharsets.UTF_8);
          JsonObject jsonObject = JsonParser.parseString(apiResponse).getAsJsonObject();
          String base64HttpResponseBody = jsonObject.get("httpResponseBody").getAsString();
          byte[] httpResponseBodyBytes = Base64.getDecoder().decode(base64HttpResponseBody);
          String httpResponseBody = new String(httpResponseBodyBytes, StandardCharsets.UTF_8);
          System.out.println(httpResponseBody);
          return null;
        });
  }

  private static String buildAuthHeader() {
    String auth = API_KEY + ":";
    String encodedAuth = Base64.getEncoder().encodeToString(auth.getBytes());
    return "Basic " + encodedAuth;
  }
}
```

#### JS

```js
const axios = require('axios')

axios.post(
  'https://api.zyte.com/v1/extract',
  {
    url: 'https://toscrape.com',
    httpResponseBody: true
  },
  {
    auth: { username: 'YOUR_ZYTE_API_KEY' }
  }
).then((response) => {
  const httpResponseBody = Buffer.from(
    response.data.httpResponseBody,
    'base64'
  )
})
```

#### PHP

```php
<?php

$client = new GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.zyte.com/v1/extract', [
    'auth' => ['YOUR_ZYTE_API_KEY', ''],
    'headers' => ['Accept-Encoding' => 'gzip'],
    'json' => [
        'url' => 'https://toscrape.com',
        'httpResponseBody' => true,
    ],
]);
$data = json_decode($response->getBody());
$http_response_body = base64_decode($data->httpResponseBody);
```

#### Proxy mode

With the proxy mode, you always get a response body.
```shell
curl \
    --proxy api.zyte.com:8011 \
    --proxy-user YOUR_ZYTE_API_KEY: \
    --compressed \
    https://toscrape.com \
    > output.html
```

#### Python

```python
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "httpResponseBody": True,
    },
)
http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"])
```

#### Python client

```python
import asyncio
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI()
    api_response = await client.get(
        {
            "url": "https://toscrape.com",
            "httpResponseBody": True,
        }
    )
    http_response_body = b64decode(api_response["httpResponseBody"]).decode()
    print(http_response_body)


asyncio.run(main())
```

#### Scrapy

In transparent mode, when you target a text resource (e.g. HTML, JSON), regular Scrapy requests work out of the box:

```python
from scrapy import Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        http_response_text: str = response.text
```

While regular Scrapy requests also work for binary responses at the moment, they may stop working in future versions of scrapy-zyte-api, so passing [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/httpResponseBody) is recommended when targeting binary resources:

```python
from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = "toscrape_com"

    async def start(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "httpResponseBody": True,
                },
            },
        )

    def parse(self, response):
        http_response_body: bytes = response.body
```

Output (first 5 lines):

```html
Scraping Sandbox
```

See zapi-usage for richer Zyte API examples, covering more scenarios and features.

See also spm-migrate-map to migrate your request parameters.
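The HTTP API examples above share the same final steps: parse the JSON response, then base64-decode `httpResponseBody`. The sketch below isolates those steps, using a hand-built sample response in place of a real API call. The `decode_http_response_body` helper and the sample payload are assumptions made for illustration.

```python
import json
from base64 import b64decode, b64encode


def decode_http_response_body(api_response_text: str) -> bytes:
    """Return the raw body bytes from a Zyte API JSON response (hypothetical helper)."""
    data = json.loads(api_response_text)
    return b64decode(data["httpResponseBody"])


# Hand-built stand-in for a real API response, so no network access is needed.
sample_response = json.dumps(
    {
        "url": "https://toscrape.com",
        "httpResponseBody": b64encode(b"<title>Scraping Sandbox</title>").decode(),
    }
)

body = decode_http_response_body(sample_response)
print(body.decode())
```

Returning bytes instead of text keeps the helper usable for binary resources; decode to a string only when you know the target is text.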
If your code seems to run slower with Zyte API, see zapi-optimize.

There is no easy way to use Zyte API to drive requests from browser automation tools. If you are using Smart Proxy Manager as a proxy for a browser automation tool, consider using Zyte API for your browser automation needs instead. See zapi-browser-automation.

### Parameter mapping

The following table shows a mapping of Smart Proxy Manager request headers and their corresponding proxy mode headers and Zyte API parameters:

| Smart Proxy Manager        | Zyte API (proxy mode)    | Zyte API (HTTP API)                                                                                      |
|----------------------------|--------------------------|----------------------------------------------------------------------------------------------------------|
| `X-Crawlera-Client`        | zyte-client              | `User-Agent` header                                                                                      |
| x-crawlera-cookies bc      | See below                | See below                                                                                                |
| x-crawlera-jobid bc        | zyte-jobid               | [jobId](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/jobId)             |
| x-crawlera-max-retries     | Not planned              | Not planned                                                                                              |
| x-crawlera-no-bancheck     | Planned                  | Not planned                                                                                              |
| x-crawlera-profile bc      | zyte-device \*           | See below                                                                                                |
| x-crawlera-profile-pass bc | zyte-override-headers \* | See below                                                                                                |
| X-Crawlera-Region bc       | zyte-geolocation         | [geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) |
| x-crawlera-session bc      | zyte-session-id          | [session.id](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/session.id)   |
| x-crawlera-timeout         | Not planned              | Not planned                                                                                              |
| x-crawlera-use-https       | N/A                      | N/A                                                                                                      |

Headers tagged with bc can be used in Zyte API proxy mode. See spm-migrate-bc.

#### Replacing X-Crawlera-Cookies

x-crawlera-cookies supports 3 values:

- `enable` causes automatic cookies to override request cookies. To achieve this in Zyte API, do not set [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies).
- `disable` causes request cookies to override automatic cookies. This is the default behavior of Zyte API: using [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies) overrides automatic cookies.
- `discard` causes both request cookies and automatic cookies to be discarded. To achieve this in Zyte API, do not set [requestCookies](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/requestCookies), and set [cookieManagement](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/cookieManagement) (zyte-cookie-management in proxy mode) to `discard`.

#### Replacing X-Crawlera-Profile and X-Crawlera-Profile-Pass

In general, you can replace `X-Crawlera-Profile` with the [device](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/device) Zyte API request parameter.

Mind, however, that the behavior of Zyte API is actually a middle ground between the `desktop` (or `mobile`) and `pass` values of x-crawlera-profile: browser-specific headers are always sent (unlike `pass`, which disables them altogether), but you can override them (unlike `desktop` or `mobile`, which force them unless you use `X-Crawlera-Profile-Pass`). See zapi-body-request-headers for more information.

#### Header backward compatibility

When using Zyte API proxy mode, migrating to Zyte API proxy mode headers is recommended. However, the following Smart Proxy Manager request headers can be used in Zyte API proxy mode: x-crawlera-cookies, x-crawlera-jobid, x-crawlera-profile, x-crawlera-profile-pass, X-Crawlera-Region, x-crawlera-session.

If any of these Smart Proxy Manager headers is used, the response will include x-crawlera-error if needed for the following error codes: `banned`, `invalid_request`, `bad_auth`, `bad_proxy_auth`, `max_header_size_exceeded`, `internal_server_error`, `timeout`, `domain_forbidden`.
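Since migrating to Zyte API proxy mode headers is recommended, a small lookup built from the parameter mapping table can handle the renaming. This is only a sketch: `rename_spm_headers` is a hypothetical helper, the `"US"` value is an example, and the function renames headers without translating their values. Header values do not always carry over unchanged (for example, device profiles and session IDs behave differently in Zyte API), so review them separately.

```python
from typing import Dict

# Header names taken from the parameter mapping table. Keys are lowercase
# because HTTP header names are case-insensitive.
SPM_TO_ZYTE_PROXY_HEADERS = {
    "x-crawlera-client": "zyte-client",
    "x-crawlera-jobid": "zyte-jobid",
    "x-crawlera-profile": "zyte-device",
    "x-crawlera-profile-pass": "zyte-override-headers",
    "x-crawlera-region": "zyte-geolocation",
    "x-crawlera-session": "zyte-session-id",
}


def rename_spm_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Rename known Smart Proxy Manager headers and keep all other headers.

    Hypothetical helper: values are copied as-is, not translated.
    """
    renamed = {}
    for name, value in headers.items():
        new_name = SPM_TO_ZYTE_PROXY_HEADERS.get(name.lower(), name)
        renamed[new_name] = value
    return renamed


old_headers = {"X-Crawlera-Region": "US", "Accept": "text/html"}
new_headers = rename_spm_headers(old_headers)
print(new_headers)
```

x-crawlera-cookies is deliberately left out of the lookup, because its replacement depends on the value used rather than on a one-to-one header rename.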
> ###### TIP
>
> To force getting x-crawlera-error on a request without Smart
> Proxy Manager request headers, add a no-op Smart Proxy Manager request
> header, e.g. `X-Crawlera-Profile-Pass: Foo`.

### Unsubscribe from Smart Proxy Manager

Once you have successfully migrated to Zyte API, remember to unsubscribe from Smart Proxy Manager.

If in doubt, [reach out to us](https://support.zyte.com/support/tickets/new).

## Zyte API pricing

[Sign up](https://app.zyte.com/account/signup/zyteapi) to get a standard plan with no commitment and some free credit.

Request cost depends on the target website and selected features. Use our [cost estimator](https://app.zyte.com/o/cost-estimator) to calculate costs.

We charge only for successful responses and provide volume discounts.

### Plans

|                     | Standard (PAYG)                                                    | Standard (commitment)                                               | Enterprise                                                        |
|---------------------|--------------------------------------------------------------------|---------------------------------------------------------------------|-------------------------------------------------------------------|
| How to enroll?      | Automatic on [signup](https://app.zyte.com/account/signup/zyteapi) | [Subscriptions](https://app.zyte.com/o/subscriptions/overview) page | [Contact sales](https://www.zyte.com/zyte-web-scraping-api/#form) |
| Initial free credit | $5 [^1]                                                            | $5 [^1]                                                             | $200                                                              |
| Rate limit          | 3000 RPM                                                           | 3000 RPM                                                            | Custom                                                            |
| Spending limit      | $100 / month                                                       | $200-$1000 / month                                                  | Custom                                                            |
| Commitment          |                                                                    | $100-$500                                                           | Custom                                                            |
| Volume discount     |                                                                    | 25%-52%                                                             | Custom                                                            |

[^1]: Shared across standard plans. Switching to another standard plan does not provide additional free credit.

**Enterprise** plans may also enjoy:

- Assistance from expert engineers for onboarding and troubleshooting.
- Access to compliance experts (see also zapi-permissions-control).
- Add-on consultancy to design and scale projects.
- Premium 24/7 support and service level agreements (SLAs): 1-hour response time on weekdays, and 8 hours on weekends.
- Early access to new features and priority in feature requests.

### Initial free credit

On [sign up](https://app.zyte.com/account/signup/zyteapi), you receive $5 free credit for your first billing month.

After your free credit expires, your account is suspended until you set a spending limit. Set a spending limit before your free credit expires to ensure uninterrupted service.

Standard plans include $5 free credit, while an Enterprise plan includes $200 free credit.

Switching between standard plans does not provide additional free credit. Your initial $5 credit carries over when you change your commitment level.

### Request costs

The **target website** and **request type** (HTTP or browser) determine the request tier and base cost.

Requests using extended geolocations or device residential IPs have different base costs that depend only on request type, plus additional costs based on network consumption.

Additional costs apply for:

- Actions: Based on CPU and network consumption [^3]
- Network captures: Based on output size
- Screenshots: $0.002 [^2]
- Automatic extraction: $0.0004-$0.0016 per data type [^2] (except [serp](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/serp), which is free)
- Custom attributes: Cost depends on the extraction [method](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributesOptions.method):
  - [“generate”](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributesOptions.method): Based on [inputTokens](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/customAttributes.metadata.inputTokens) ($0.002/1k) and [outputTokens](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/customAttributes.metadata.outputTokens) ($0.01/1k) [^2]
  - [“extract”](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributesOptions.method): Fixed $0.001 [^2]
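To make the token-based pricing of the “generate” method concrete, here is a small worked example using the rates listed above. The `generate_cost` helper and the token counts are assumptions made for illustration; the result covers only the custom attributes add-on, before volume discount, and excludes the base request cost.

```python
from decimal import Decimal

# Rates quoted above for the "generate" method, before volume discount.
INPUT_RATE_PER_1K = Decimal("0.002")
OUTPUT_RATE_PER_1K = Decimal("0.01")


def generate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the "generate" add-on cost of one request in USD (hypothetical helper)."""
    total = INPUT_RATE_PER_1K * input_tokens + OUTPUT_RATE_PER_1K * output_tokens
    # Rates are per 1,000 tokens, so scale the total down accordingly.
    return total / 1000


cost = generate_cost(input_tokens=1500, output_tokens=200)
print(cost)
```

With 1,500 input tokens and 200 output tokens, the estimate is 1.5 × $0.002 + 0.2 × $0.01 = $0.003 + $0.002 = $0.005. `Decimal` is used instead of floats to avoid rounding noise in currency arithmetic.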
[^2]: Cost before volume discount.

[^3]: Network costs scale with device residential IPs usage.

### Request tiers

Zyte API automatically uses the most cost-efficient technology for each website and assigns price tiers accordingly.

There are 5 tiers each for HTTP requests and browser requests. Every combination of target website and request type belongs to a tier that determines the base cost.

See [Pricing @ zyte.com](https://www.zyte.com/pricing/#pricing) for tier pricing and distribution data.

Tier assignment is automatic. New combinations start with a temporary tier until enough data is gathered for permanent assignment.

We review tier assignments quarterly. Affected customers receive 2 weeks' notice of any tier changes.

### Successful responses

You are only charged for successful responses. Rate-limiting and unsuccessful responses are free.

### Spending limit

Your spending limit is the maximum monthly charge for Zyte API usage. When reached, your account is suspended until the next billing month.

Enterprise plans have custom spending limits managed through your account manager.

On standard plans, you can either have a $100 spending limit which is pay-as-you-go, or a higher spending limit ($200, $400, $700 or $1000) where you pay 50% of your spending limit as monthly commitment plus additional spend based on actual usage up to your spending limit.

**Increasing spending limits:**

- Takes effect immediately and lifts account suspension
- For $200+ limits, you pay the monthly commitment difference immediately
- Changes your monthly commitment for future billing cycles

**Decreasing spending limits:**

- Takes effect next billing month

For limits above $1000, [contact sales](https://www.zyte.com/zyte-web-scraping-api/#form) for an Enterprise plan.

### Account suspension

When you reach your spending limit, requests return account suspension responses.

Enterprise plans: Contact your account manager.

Standard plans: Increase your spending limit to immediately resume service.

### Monthly commitment

Enterprise plans: Custom monthly commitment

Standard plans:

- $100 spending limit: No monthly commitment (PAYG)
- $200+ spending limit: Monthly commitment = 50% of spending limit

Your monthly commitment determines your volume discount.

Monthly commitment is paid at the start of each billing month. If your actual usage exceeds the commitment, you pay the additional spend on your next bill. If usage is below the commitment, no refund is provided.

**Example:** $200 spending limit = $100 monthly commitment

- Month 1: Pay $100 commitment
- Actual usage: $150 → Next bill includes $50 additional spend
- Actual usage: $80 → Next bill includes $0 additional spend

### Volume discount

Volume discounts are applied to each request.

Enterprise plans: Custom volume discount

Standard plans: Volume discount based on monthly commitment:

| Monthly commitment | Volume discount |
|--------------------|-----------------|
| $100               | 25%             |
| $200               | 40%             |
| $350               | 48%             |
| $500               | 52%             |

## Zyte API frequently asked questions

### How many concurrent requests can I send?

See zyte-api-concurrency.

### Is there a response size limit?

The size limit of [browserHtml](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/browserHtml) and [httpResponseBody](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/httpResponseBody) (before base64-encoding) is 10 MB. Longer responses are truncated.

[HTTP compression](https://en.wikipedia.org/wiki/HTTP_compression) does not affect this limit.

### Can I set a more granular geolocation?

[geolocation](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation) only supports country granularity.

However, websites seldom limit content by IP address at a lower granularity.
Provided you use the right country, you should be able to get content for any specific country subdivision (ZIP code, state, etc.).

The way to get content targeted at a specific country subdivision is usually through cookies. The way to get the right cookies depends on the target website:

- On some websites you can use our setLocation action or some other actions to configure the target subdivision.

  > ###### TIP
  >
  > You can use sessions to minimize the number
  > of browser requests you need.

- On some websites you can manually set a cookie that forces content for the target subdivision.
- On some websites you may need to start a session and configure that session for the target subdivision.

### Can I override the `User-Agent` header?

- You can on HTTP requests, but Zyte API may override your value for certain websites if needed for ban avoidance.
- You cannot on browser requests.

## Get started with Scrapy Cloud

Scrapy Cloud is a service that allows running web scraping code in the cloud.

Scrapy Cloud is designed for [Scrapy](https://scrapy.org) projects, but can support other technologies.

### First steps

1. Sign up for Scrapy Cloud on the [Zyte dashboard](https://app.zyte.com/) for free.
2. Follow our web scraping tutorial, which covers running a job in Scrapy Cloud.

### Using Scrapy Cloud

See sc-usage for general usage help.

sc-reference provides detailed reference documentation.

## Scrapy Cloud usage

### Basic usage

> ##### Projects
>
> Manage your projects.

> ##### Deployment
>
> Deploy code to projects.

> ##### Spiders
>
> Write and configure web crawlers.

> ##### Jobs
>
> Run spiders and scripts.

> ##### Items
>
> Handle data extracted by jobs.

### Advanced topics

> ##### Scripts
>
> Write and run Python scripts.

> ##### Units
>
> Manage your resources for jobs.

> ##### Reference
>
> See the complete API reference documentation.

## Scrapy Cloud projects

A Scrapy Cloud project represents a code base for web scraping.
You can have any number of Scrapy Cloud projects, each with its own code base, or with a specific version (e.g. commit) of some code base.

A common approach is to keep 2 projects:

- A project for production, with a stable code base.
- A project for development, to test changes before moving them to the production project.

For information about managing projects, see:

- [Organizations and Projects](https://support.zyte.com/support/solutions/articles/22000200432-organizations-and-projects)
- [Inviting Users to Projects](https://support.zyte.com/support/solutions/articles/22000200430-inviting-users-to-projects)
- [Managing Organization and Project members](https://support.zyte.com/support/solutions/articles/22000271734-managing-organization-and-project-members)
- [Customizing Scrapy settings in Scrapy Cloud](https://support.zyte.com/support/solutions/articles/22000200670-customizing-scrapy-settings-in-scrapy-cloud)
- [Deleting projects](https://support.zyte.com/support/solutions/articles/22000200397-deleting-projects)

## Deploying code to Scrapy Cloud projects

For information about deploying your code to a project, see:

- [Deploying your spiders to Scrapy Cloud](https://support.zyte.com/support/solutions/articles/22000204081-deploying-your-spiders-to-scrapy-cloud)
- [Deploying a Project from a Github Repository](https://support.zyte.com/support/solutions/articles/22000201935-deploying-a-project-from-a-github-repository)
- [Versioning your deploys to Zyte Developer Tool Scrapy Cloud](https://support.zyte.com/support/solutions/articles/22000204254-versioning-your-deploys-to-zyte-developer-tool-scrapy-cloud)
- [Deploying non-code files](https://support.zyte.com/support/solutions/articles/22000200416-deploying-non-code-files)

> ###### TIP
>
> ai-code support Scrapy Cloud deployment.

### Stacks

For information about Scrapy Cloud stacks, see:

- [Changing the Deploy Environment With Scrapy Cloud Stacks](https://support.zyte.com/support/solutions/articles/22000200402-changing-the-deploy-environment-with-scrapy-cloud-stacks)
- [Deploying Python 3 spiders to Scrapy Cloud](https://support.zyte.com/support/solutions/articles/22000200387-deploying-python-3-spiders-to-scrapy-cloud)

#### Requirements

For information about installing additional Python packages into stacks, see:

- [Deploying Python Dependencies for Your Projects in Scrapy Cloud](https://support.zyte.com/support/solutions/articles/22000200400-deploying-python-dependencies-for-your-projects-in-scrapy-cloud)

#### Addons

When using a stack, you can use [Scrapy Cloud add-ons](https://support.zyte.com/support/solutions/articles/22000200395-scrapy-cloud-addons) to extend your code, including:

- [Autothrottle](https://support.zyte.com/support/solutions/articles/22000200424-auto-throttle-addon), to crawl gently.
- [DeltaFetch](https://support.zyte.com/support/solutions/articles/22000200411-delta-fetch-addon), to crawl only new pages.
- [DotScrapy Persistence](https://support.zyte.com/support/solutions/articles/22000200401-dotscrapy-persistence-addon), to persist data between jobs.
- [Images](https://support.zyte.com/support/solutions/articles/22000200389-images-storage-addon), to download images into S3 storage.
- [Magic Fields](https://support.zyte.com/support/solutions/articles/22000200418-magic-fields-addon), to add item fields.
- [Page Storage](https://support.zyte.com/support/solutions/articles/22000200403-page-storage-addon), to store visited pages.
- [Query Cleaner](https://support.zyte.com/support/solutions/articles/22000200412-query-cleaner-addon), to clean request URL query parameters.
### Using Docker images For information about using Docker images to deploy your code, see: - [Deploying Custom Docker images on Scrapy Cloud](https://support.zyte.com/support/solutions/articles/22000200425-deploying-custom-docker-images-on-scrapy-cloud) - [Errors while deploying Custom Image to Scrapy Cloud](https://support.zyte.com/support/solutions/articles/22000232799-errors-while-deploying-custom-image-to-scrapy-cloud) ## Scrapy Cloud spiders A Scrapy Cloud spider is a Scrapy spider that is part of a Scrapy project that has been deployed into a Scrapy Cloud project. You can start jobs to execute the code of a spider. Our web scraping tutorial covers creating, deploying, and running spiders. For more information, see the Scrapy documentation. ### Spider templates and virtual spiders Scrapy Cloud supports defining spider templates, that you can use from the Scrapy Cloud UI to create virtual spiders that run the code of the corresponding spider template with predefined parameters. #### Spider templates To create a spider template: 1. Add [scrapy-spider-metadata](https://scrapy-spider-metadata.readthedocs.io/en/latest/) as a dependency to your Scrapy Cloud project. 2. On the spiders that you wish to use as templates, define [metadata](https://scrapy-spider-metadata.readthedocs.io/en/latest/metadata.html#defining-spider-metadata) including a `title` and `description` of your choice, and setting `template` to `True`: ```python from scrapy import Spider class MySpider(Spider): ... metadata = { "title": "My Template", "description": "Description of my template.", "template": True, } ``` When you redeploy your code, you can start creating virtual spiders from your spider templates. > ###### NOTE > > Spider templates are also regular spiders, and can be executed directly as well. #### Virtual spiders To create a virtual spider from a spider template, go to your Scrapy Cloud project page and, on the left-hand sidebar, under **Spiders**, select **Create spider**. 
On the **Create Spider** page, you can select a template, define the parameters of your new virtual spider, and save your spider. You can then use your virtual spider from Scrapy Cloud as if it were a regular spider. Virtual spiders exist only in Scrapy Cloud, not in your code. However, changes to the code of their spider template will affect them. #### Spider parameters The point of spider templates is to be able to create virtual spiders from them that each works differently based on predefined parameters. To expose parameters to the Scrapy Cloud UI so that they can be defined when creating a virtual spider, add a [parameter specification](https://scrapy-spider-metadata.readthedocs.io/en/latest/params.html) to your template spiders using [scrapy-spider-metadata](https://scrapy-spider-metadata.readthedocs.io/en/latest/): ```python from pydantic import BaseModel from scrapy import Spider from scrapy_spider_metadata import Args class MyParams(BaseModel): foo: str class MySpider(Args[MyParams], Spider): ... 
``` ##### Parameter types Scrapy Cloud supports the following parameter types: - `bool` - `int`, `float` (with `gt`, `lt`, `ge`, and `le` [numeric constraint](https://docs.pydantic.dev/latest/concepts/fields/#numeric-constraints) support) - `str` (with [string constraint](https://docs.pydantic.dev/latest/concepts/fields/#string-constraints) support) Scrapy Cloud also supports defining a placeholder through [json_schema_extra](https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field): ```python from pydantic import BaseModel, Field class MyParams(BaseModel): url: str = Field( json_schema_extra={ "placeholder": "https://books.toscrape.com", }, ) ``` - `str` + `Enum` Define `enumMeta` in [json_schema_extra](https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field) to give your enumeration choices an optional title and description: ```python from enum import Enum from pydantic import BaseModel, Field class Foo(str, Enum): bar: str = "bar" baz: str = "baz" class MyParams(BaseModel): foo: Foo = Field( json_schema_extra={ "enumMeta": { Foo.bar: { "title": "Bar", "description": "Bar description.", }, Foo.baz: { "title": "Baz", "description": "Baz description.", }, }, }, ) ``` ##### Widgets Scrapy Cloud also supports a few special UI widgets that you can enable through the `widget` key of [json_schema_extra](https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field), e.g. ```python from pydantic import BaseModel, Field class MyParams(BaseModel): foo: int = Field( json_schema_extra={ "widget": "widget-id", }, ) ``` The following widgets are supported: - `custom-attrs`, to specify a [custom attributes schema](https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/customAttributes). - `request-limit`, to specify a maximum number of requests. - `textarea`, for multi-line text input. 
##### Parameter groups

Scrapy Cloud also supports defining 2 or more optional parameters so that filling 1 of them (and only 1) is required:

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field


class MyParams(BaseModel):
    model_config = ConfigDict(
        json_schema_extra={
            "groups": [
                {
                    "id": "a-or-b",
                    "title": "A or B",
                    "description": "Fill A or B.",
                    "widget": "exclusive",
                },
            ],
        },
    )

    # Both fields default to None so that either one can be left empty.
    a: Optional[str] = Field(
        None,
        json_schema_extra={
            "group": "a-or-b",
            "exclusiveRequired": True,
        },
    )
    b: Optional[str] = Field(
        None,
        json_schema_extra={
            "group": "a-or-b",
            "exclusiveRequired": True,
        },
    )
```

## Scrapy Cloud jobs

A job is the execution of a Scrapy spider or a Python script in Scrapy Cloud.

### Running a job

For information about running a job, see:

- [Running a Scrapy spider](https://support.zyte.com/support/solutions/articles/22000200667-running-a-scrapy-spider)

  > ###### TIP
  >
  > Starting a job through the API allows
  > defining [Scrapy setting](https://scrapy-poet.readthedocs.io/en/stable/settings.html) overrides for the job.
- [Managing your jobs in the Jobs Dashboard](https://support.zyte.com/support/solutions/articles/22000200452-managing-your-jobs-in-the-jobs-dashboard) - [Understanding Job Outcomes](https://support.zyte.com/support/solutions/articles/22000200413-understanding-job-outcomes) - [Deploy Project and run Spiders with settings of different environments](https://support.zyte.com/support/solutions/articles/22000236412-deploy-project-and-run-spiders-with-settings-of-different-environments) - [Sharing data between spiders](https://support.zyte.com/support/solutions/articles/22000200420-sharing-data-between-spiders) - [What are the differences between running a spider locally and on Scrapy Cloud?](https://support.zyte.com/support/solutions/articles/22000200426-what-are-the-differences-between-running-a-spider-locally-and-on-scrapy-cloud-) - [Can I run the same spider in parallel?](https://support.zyte.com/support/solutions/articles/22000232777-can-i-run-the-same-spider-in-parallel-) - [Starting jobs programmatically (HTTP API or Python)](https://support.zyte.com/support/solutions/articles/22000200394-running-custom-python-scripts) - [Can I use an HTTP cache on Scrapy Cloud?](https://support.zyte.com/support/solutions/articles/22000201056-can-i-use-an-http-cache-on-scrapy-cloud-) - [Why do I get “Rejected message because it was too big” error?](https://support.zyte.com/support/solutions/articles/22000218173-why-do-i-get-rejected-message-because-it-was-too-big-error-) - [I clicked on STOP button but spider not stopped.](https://support.zyte.com/support/solutions/articles/22000222482-i-clicked-on-stop-button-but-spider-not-stopped-) ### Monitoring jobs For information about monitoring jobs, see: - [Getting Notifications on Certain Events](https://support.zyte.com/support/solutions/articles/22000200451-getting-notifications-on-certain-events) - [Inspecting your spider’s runtime environment with the Job 
Console](https://support.zyte.com/support/solutions/articles/22000200427-inspecting-your-spider-s-runtime-environment-with-the-job-console) ### Scheduling periodic jobs For information about scheduling periodic jobs, see: - [Scheduling Periodic Jobs](https://support.zyte.com/support/solutions/articles/22000200419-scheduling-periodic-jobs) - [Is it possible to schedule jobs to run sequentially?](https://support.zyte.com/support/solutions/articles/22000244891-is-it-possible-to-schedule-jobs-to-run-sequentially-) - [How Can I Set a Number of Scrapy Cloud Units to Use for a Periodic Job?](https://support.zyte.com/support/solutions/articles/22000256286-how-can-i-set-a-number-of-scrapy-cloud-units-to-use-for-a-periodic-job-) - [Can I Configure My Periodic Job To Run a Spider Every Minute?](https://support.zyte.com/support/solutions/articles/22000261680-can-i-configure-my-periodic-job-to-run-a-spider-every-minute-) ### See also - sc-items ## Scrapy Cloud job logs The **Log** tab of a job contains all the messages logged by the job. It includes all messages logged with Python’s `logging`, both those from Scrapy built-in components and your own code. ### Troubleshooting Here you can find some help to figure out the meaning of common log messages. #### Ignoring response > [scrapy.spidermiddlewares.httperror] Ignoring response <403 [https://example.com](https://example.com)>: HTTP status code is not handled or not allowed By default, after redirects have been followed and retries exceeded, Scrapy ignores responses with an HTTP status code outside the 200-299 range. Some HTTP status codes, such as 401, 403 or 429, may be the result of a ban. Consider using Zyte API to avoid bans. 
If you want to handle those responses in your request callback, instead of ignoring them: - Set `handle_httpstatus_all` or `handle_httpstatus_list` in your request metadata to handle such responses for a specific request: ```python Request("https://example.com", meta={"handle_httpstatus_list": {403}}) ``` - Use the `HTTPERROR_ALLOW_ALL` or `HTTPERROR_ALLOWED_CODES` settings to handle such responses for all requests. ## Scrapy Cloud job items For information about job items, see: - Downloading items - [Configuring scraped fields](https://support.zyte.com/support/solutions/articles/22000200410-configuring-scraped-fields) - [Providing feedback on scraped data](https://support.zyte.com/support/solutions/articles/22000200398-providing-feedback-on-scraped-data) - [Publishing and sharing datasets](https://support.zyte.com/support/solutions/articles/22000200453-publishing-and-sharing-datasets) - [Why do I get “Rejected message because it was too big” error?](https://support.zyte.com/support/solutions/articles/22000218173-why-do-i-get-rejected-message-because-it-was-too-big-error-) ## Downloading from Scrapy Cloud After you run a job on Scrapy Cloud, you can download your scraped data from Scrapy Cloud, be it from the Zyte dashboard, from a URL, or from the API. ### Downloading from the Zyte dashboard To download your job data from the [Zyte dashboard](https://app.zyte.com): 1. Open the details page of your job (`https://app.zyte.com/p/`). 2. Open the **Items** tab (`https://app.zyte.com/p//items`). 3. On the right-hand side, select **Download › **. **** can be one of: [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [JSON](https://www.json.org/json-en.html), [JSON Lines](https://jsonlines.org/), [XML](https://en.wikipedia.org/wiki/XML). ![](scrapy-cloud/usage/items/download-dash.png) ### Download from a URL Download links from the Zyte dashboard are transparent. 
Given a job ID and your Scrapy Cloud API key, you can build one manually, for example to automate downloads.

For JSON, JSON Lines and XML, download URLs follow this pattern:

```none
https://storage.zyte.com/items/<job ID>?apikey=<API key>&format=<format>
```

Where:

- `<job ID>` is the job ID, e.g. `00000/0/0`.
- `<API key>` is your [Scrapy Cloud API key](https://app.zyte.com/o/settings/apikey).
- `<format>` is the output file format, one of: `json`, `jl` (JSON Lines), `xml`.

For CSV, the download URL is similar, but you:

- Must specify a comma-separated list of fields to export as well, in the `fields` query string parameter.
- Can use the `include_headers` query string parameter to indicate whether you want the field names in the first row (`1`) or not (`0`, default).

For example:

```none
https://storage.zyte.com/items/<job ID>?apikey=<API key>&format=csv&fields=key,name,price,url&include_headers=1
```

### See also

- export

## Scrapy Cloud scripts

In addition to Scrapy spiders, you can include standalone Python scripts in your Scrapy project and run them on Scrapy Cloud.

Scrapy Cloud scripts need to be declared under `scripts` in your `setup.py` file:

```python
from setuptools import setup, find_packages

setup(
    name="myproject",
    version="1.0",
    packages=find_packages(),
    scripts=["bin/hello.py"],
    entry_points={"scrapy": ["settings = myproject.settings"]},
)
```

When starting a job, you can select a script instead of a spider. Scripts are listed with their file name, prefixed with `py:`; for example, `py:hello.py` for the script in the example above.

To access your Scrapy project settings from a script, including those defined in Scrapy Cloud, use the `sh_scrapy.utils.get_project_settings` function:

```python
from sh_scrapy.utils import get_project_settings

settings = get_project_settings()
```

> ###### NOTE
>
> This function was introduced in [scrapinghub-entrypoint-scrapy](https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy) 0.12.
> If you cannot import it, make sure you are using a modern Scrapy > stack, or add `scrapinghub-entrypoint-scrapy>=0.12` to your > requirements. ## Scrapy Cloud units Jobs run on Scrapy Cloud units. To run a job: 1. You assign between 1 and 6 units to your job. 2. Your job remains in the queue of pending jobs until the selected number of units are available. 3. When your job starts, the selected number of units are allocated to your job for the duration of the job. 4. When the job finishes, the job units are released, ready to be used by another job. You can only run as many parallel jobs as units you have. If you have 1 unit, you can only run 1 job at a time. If you have 2 units, you can run 2 jobs in parallel, each job using 1 unit. Every unit assigned to a job gives that job 1 computing unit, 1 GB of memory, and 2.5 GB of disk. A job running with 2 units has twice the compute power, memory and disk space as a job running with 1 unit. ## Scrapy Cloud reference HTTP API : HTTP API to interact with spiders, jobs, and other Scrapy Cloud resources. Entry Point API : Write custom Docker images that are compatible with Scrapy Cloud. ## Scrapy Cloud API Scrapy Cloud provides an HTTP API for interacting with your spiders, jobs and scraped data. ### Getting started #### Authentication You’ll need to authenticate using your [Scrapy Cloud API key](https://app.zyte.com/o/settings/apikey). > ###### IMPORTANT > > Scrapy Cloud uses a different API key than Zyte API. 
There are two ways to authenticate: HTTP Basic: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/foo ``` URL Parameter: ``` $ curl https://storage.zyte.com/foo?apikey=YOUR_SCRAPY_CLOUD_API_KEY ``` #### Example Running a spider is simple: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://app.zyte.com/api/run.json -d project=PROJECT -d spider=SPIDER ``` Where `YOUR_SCRAPY_CLOUD_API_KEY` is your Scrapy Cloud API key, `PROJECT` is the spider’s project ID, and `SPIDER` is the name of the spider you want to run. It’s possible to override Scrapy settings for a job: ``` $ curl \ -u YOUR_SCRAPY_CLOUD_API_KEY: \ https://app.zyte.com/api/run.json \ -d project=PROJECT \ -d spider=SPIDER \ -d job_settings='{"LOG_LEVEL": "DEBUG"}' ``` `job_settings` should be a valid JSON and will be merged with project and spider settings provided for given spider. ### API endpoints #### app.zyte.com #### storage.zyte.com #### Python client You can use the [python-scrapinghub](https://github.com/scrapinghub/python-scrapinghub) library to interact with Scrapy Cloud API. Check the [documentation](https://python-scrapinghub.readthedocs.io/) for installation instructions and usage examples. ### Pagination You can paginate the results for the majority of the APIs using a number of parameters. The pagination parameters differ depending on the target host for a given endpoint. #### app.zyte.com | Parameter | Description | |-------------|--------------------------------------| | count | Number of results per page. | | offset | Offset to retrieve specific records. | #### storage.zyte.com | Parameter | Description | |-------------|--------------------------------------------------------------------| | count | Number of results per page. | | index | Offset to retrieve specific records. Multiple values supported. | | start | Skip results before the given one. See a note about format below. | | startafter | Return results after the given one. See a note about format below. 
|

> ###### NOTE
>
> The inconsistency in parameter naming exists for historical reasons and will be fixed in upcoming platform updates.

> ###### NOTE
>
> While the `index` parameter is just a short `` (ex: `index=4`), the `start` and `startafter` parameters should have the full form `///` (ex: `start=1/2/3/4`, `startafter=1/2/3/3`).

### Result formats

There are two ways to specify the format of results: using the `Accept` header, or using the `format` parameter.

The `Accept` header supports the following values:

* application/x-jsonlines
* application/json
* application/xml
* text/plain
* text/csv

The `format` parameter supports the following values:

* json
* jl
* xml
* csv
* text

[XML-RPC data types](http://en.wikipedia.org/wiki/XML-RPC#Data_types) are used for XML output.

#### CSV parameters

| Parameter       | Description                                                             | Required   |
|-----------------|-------------------------------------------------------------------------|------------|
| fields          | Comma-delimited list of fields to include, in order from left to right. | Yes        |
| include_headers | When set to ‘1’ or ‘Y’, show header names in first row.                 | No         |
| sep             | Separator character.                                                    | No         |
| quote           | Quote character.                                                        | No         |
| escape          | Escape character.                                                       | No         |
| lineend         | Line end string.                                                        | No         |

When using CSV, you need to specify the `fields` parameter to indicate the required fields and their order. Example:

```
$ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/34/7?format=csv&fields=id,name&include_headers=1"
```

### Headers

*gzip* compression is supported. A client can signal that it can handle *gzip* responses using the `accept-encoding: gzip` request header. The `content-encoding: gzip` header must be present in the response to signal the *gzip* content encoding.

You can use the `saveas` request parameter to specify a filename for browser downloads. For example, specifying `?saveas=foo.json` will cause a header of `Content-Disposition: Attachment; filename=foo.json` to be returned.
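As a rough illustration of how the `format`, CSV, and `saveas` parameters described above combine, the following sketch builds an items URL with Python's standard library. The helper name `build_items_url` and its arguments are hypothetical and not part of any Zyte library; only the host, path, and query parameter names come from this page. Authentication is left out and would be supplied separately, e.g. through HTTP Basic.

```python
from urllib.parse import urlencode

# Host and path as used in the curl examples on this page.
STORAGE_ITEMS = "https://storage.zyte.com/items"


def build_items_url(job_id, fmt, fields=None, include_headers=False, saveas=None):
    """Hypothetical helper: build an items URL for a job and an output format."""
    params = [("format", fmt)]
    if fmt == "csv":
        # CSV output requires an explicit, ordered list of fields.
        if not fields:
            raise ValueError("CSV output requires the fields parameter")
        params.append(("fields", ",".join(fields)))
        if include_headers:
            params.append(("include_headers", "1"))
    if saveas:
        params.append(("saveas", saveas))
    # safe="," keeps the commas of the fields list readable.
    return f"{STORAGE_ITEMS}/{job_id}?{urlencode(params, safe=',')}"


csv_url = build_items_url("53/34/7", "csv", fields=["id", "name"], include_headers=True)
json_url = build_items_url("1/2/3", "json", saveas="foo.json")
print(csv_url)
print(json_url)
```

Raising an error when `fields` is missing mirrors the "Required" column of the CSV parameters table, so the mistake surfaces before any request is sent.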
### Meta parameters

You can use the `meta` parameter to return metadata for the record in addition to its core data.

The following values are available:

| Parameter   | Description                                                           |
|-------------|-----------------------------------------------------------------------|
| \_key       | The item key in the format `:project_id/:spider_id/:job_id/:item_no`. |
| \_ts        | Timestamp in milliseconds for when the item was added.                |

Example:

```
$ curl "https://storage.zyte.com/items/53/34/7?meta=_key&meta=_ts"
{"_key":"1111111/1/1/0","_ts":1342078473363, ... }
```

> ###### NOTE
>
> If the data contains fields with the same name as the requested fields, they will both appear in the result.

## Jobs API

The jobs API makes it easy to work with your spider’s jobs and lets you schedule, stop, update and delete them.

> ###### NOTE
>
> Most of the features provided by the API are also available through the
> python-scrapinghub client library.

### run.json

Schedules a job for a given spider.

| Parameter    | Description                                                                                                          | Required   |
|--------------|----------------------------------------------------------------------------------------------------------------------|------------|
| project      | Project ID.                                                                                                          | Yes        |
| spider       | Spider name.                                                                                                         | Yes        |
| jobq_id      | Spider ID as `spider` in `project`/`spider`/`job` identifier.                                                        | No         |
| add_tag      | Add specified tag to job.                                                                                            | No         |
| priority     | Job priority. Supported values: 0 (lowest) to 4 (highest). Default: 2.                                               | No         |
| job_settings | [Scrapy settings](https://docs.scrapy.org/en/latest/topics/settings.html) to override for the job, as a JSON object. | No         |
| units        | Number of units to run the job with. Supported values: 1 to 6.                                                       | No         |

> ###### NOTE
>
> Any other parameter will be treated as a spider argument.

> ###### NOTE
>
> If you use the `jobq_id` parameter, the `spider` parameter is not required.
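The parameters above can be assembled into a form-encoded request body as in the following minimal sketch, which uses only the standard library. The helper name `build_run_body` is hypothetical; it only prepares the body for a POST to `run.json` and does not send anything. It assumes that `job_settings` is serialized as a JSON object and that every extra keyword is passed through as a plain form field.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_run_body(project, spider=None, jobq_id=None, job_settings=None, **extra):
    """Hypothetical helper: form-encode parameters for a POST to run.json."""
    if spider is None and jobq_id is None:
        raise ValueError("either spider or jobq_id is required")
    params = {"project": project}
    if spider is not None:
        params["spider"] = spider
    if jobq_id is not None:
        params["jobq_id"] = jobq_id
    if job_settings:
        # Setting overrides travel as a single JSON-encoded field.
        params["job_settings"] = json.dumps(job_settings)
    # units, add_tag, priority, and spider arguments are all plain form fields.
    params.update(extra)
    return urlencode(params)


body = build_run_body(
    123,
    spider="somespider",
    units=2,
    spiderarg1="example",
    job_settings={"LOG_LEVEL": "DEBUG"},
)
fields = parse_qs(body)  # decode again to inspect what would be sent
print(fields["job_settings"][0])
```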
| Method   | Description                    | Supported parameters                                             |
|----------|--------------------------------|------------------------------------------------------------------|
| POST     | Schedule the specified spider. | project, spider, jobq_id, add_tag, priority, job_settings, units |

Example that specifies a spider name:

```
$ curl \
    -u YOUR_SCRAPY_CLOUD_API_KEY: \
    https://app.zyte.com/api/run.json \
    -d project=123 \
    -d spider=somespider \
    -d units=2 \
    -d add_tag=sometag \
    -d spiderarg1=example \
    -d job_settings='{"CLOSESPIDER_PAGECOUNT": "10"}'
{"status": "ok", "jobid": "123/1/1"}
```

Example that specifies a spider ID:

```
$ curl \
    -u YOUR_SCRAPY_CLOUD_API_KEY: \
    https://app.zyte.com/api/run.json \
    -d project=123 \
    -d jobq_id=1 \
    -d units=2 \
    -d add_tag=sometag \
    -d spiderarg1=example \
    -d job_settings='{"CLOSESPIDER_PAGECOUNT": "10"}'
{"status": "ok", "jobid": "123/1/1"}
```

### jobs/list.{json,jl}

Retrieve job information for a given project, spider, or specific job.

| Parameter   | Description                          | Required   |
|-------------|--------------------------------------|------------|
| project     | Project ID.                          | Yes        |
| job         | Job ID.                              | No         |
| spider      | Spider name.                         | No         |
| state       | Return jobs with specified state.    | No         |
| has_tag     | Return jobs with specified tag.      | No         |
| lacks_tag   | Return jobs that lack specified tag. | No         |

Supported `state` values: `pending`, `running`, `finished`, `deleted`.

| Method   | Description               | Supported parameters                            |
|----------|---------------------------|-------------------------------------------------|
| GET | Retrieve job information.
| project, job, spider, state, has_tag, lacks_tag | Examples: ``` # Retrieve the latest 3 finished jobs $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/jobs/list.json?project=123&spider=somespider&state=finished&count=3" { "status": "ok", "count": 3, "total": 3, "jobs": [ { "responses_received": 1, "items_scraped": 2, "close_reason": "finished", "logs": 29, "tags": [], "spider": "somespider", "updated_time": "2015-11-09T15:21:06", "priority": 2, "state": "finished", "version": "1447064100", "spider_type": "manual", "started_time": "2015-11-09T15:20:25", "id": "123/45/14544", "errors_count": 0, "elapsed": 138399 }, { "responses_received": 1, "items_scraped": 2, "close_reason": "finished", "logs": 29, "tags": [ "consumed" ], "spider": "somespider", "updated_time": "2015-11-09T14:21:02", "priority": 2, "state": "finished", "version": "1447064100", "spider_type": "manual", "started_time": "2015-11-09T14:20:25", "id": "123/45/14543", "errors_count": 0, "elapsed": 3433762 }, { "responses_received": 1, "items_scraped": 2, "close_reason": "finished", "logs": 29, "tags": [ "consumed" ], "spider": "somespider", "updated_time": "2015-11-09T13:21:08", "priority": 2, "state": "finished", "version": "1447064100", "spider_type": "manual", "started_time": "2015-11-09T13:20:31", "id": "123/45/14542", "errors_count": 0, "elapsed": 7034158 } ] } # Retrieve all running jobs $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/jobs/list.json?project=123&state=running" { "status": "ok", "count": 2, "total": 2, "jobs": [ { "responses_received": 483, "items_scraped": 22, "logs": 20, "tags": [], "spider": "somespider", "elapsed": 17442, "priority": 2, "state": "running", "version": "1447064100", "spider_type": "manual", "started_time": "2015-11-09T15:25:07", "id": "123/45/13140", "errors_count": 0, "updated_time": "2015-11-09T15:26:43" }, { "responses_received": 207, "items_scraped": 207, "logs": 468, "tags": [], "spider": "someotherspider", "elapsed": 4085, 
"priority": 3, "state": "running", "version": "1447064100", "spider_type": "manual", "started_time": "2015-11-09T13:00:46", "id": "123/67/11952", "errors_count": 0, "updated_time": "2015-11-09T15:26:57" } ] }

# Retrieve all jobs that lack the tag consumed
$ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/jobs/list.json?project=123&lacks_tag=consumed"
{
    "status": "ok",
    "count": 3,
    "total": 3,
    "jobs": [
        {
            "responses_received": 208,
            "items_scraped": 208,
            "logs": 471,
            "tags": ["sometag"],
            "spider": "somespider",
            "elapsed": 1010,
            "priority": 3,
            "state": "running",
            "version": "1447064100",
            "spider_type": "manual",
            "started_time": "2015-11-09T13:00:46",
            "id": "123/45/11952",
            "errors_count": 0,
            "updated_time": "2015-11-09T15:28:27"
        },
        {
            "responses_received": 619,
            "items_scraped": 22,
            "close_reason": "finished",
            "logs": 29,
            "tags": ["sometag"],
            "spider": "someotherspider",
            "updated_time": "2015-11-09T15:27:20",
            "priority": 2,
            "state": "finished",
            "version": "1447064100",
            "spider_type": "manual",
            "started_time": "2015-11-09T15:25:07",
            "id": "123/67/13140",
            "errors_count": 0,
            "elapsed": 67409
        },
        {
            "responses_received": 3,
            "items_scraped": 20,
            "close_reason": "finished",
            "logs": 58,
            "tags": ["sometag", "someothertag"],
            "spider": "yetanotherspider",
            "updated_time": "2015-11-09T15:25:28",
            "priority": 2,
            "state": "finished",
            "version": "1447064100",
            "spider_type": "manual",
            "started_time": "2015-11-09T15:25:07",
            "id": "123/89/1627",
            "errors_count": 0,
            "elapsed": 179211
        }
    ]
}
```

### jobs/update.json

Updates information about jobs.

| Parameter   | Description                    | Required   |
|-------------|--------------------------------|------------|
| project     | Project ID.                    | Yes        |
| job         | Job ID.                        | Yes        |
| add_tag     | Add specified tag to job.      | No         |
| remove_tag  | Remove specified tag from job. | No         |

| Method   | Description             | Supported parameters              |
|----------|-------------------------|-----------------------------------|
| POST | Update job information.
| project, job, add_tag, remove_tag | Example: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://app.zyte.com/api/jobs/update.json -d project=123 -d job=123/1/2 -d add_tag=consumed ``` ### jobs/delete.json Deletes one or more jobs. | Parameter | Description | Required | |-------------|---------------|------------| | project | Project ID. | Yes | | job | Job ID. | Yes | | Method | Description | Supported parameters | |----------|----------------|------------------------| | POST | Delete job(s). | project, job | Example: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://app.zyte.com/api/jobs/delete.json -d project=123 -d job=123/1/2 -d job=123/1/3 ``` ### jobs/stop.json Stops one running job. | Parameter | Description | Required | |-------------|---------------|------------| | project | Project ID. | Yes | | job | Job ID. | Yes | | Method | Description | Supported parameters | |----------|---------------|------------------------| | POST | Stop job. | project, job | Example: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://app.zyte.com/api/jobs/stop.json -d project=123 -d job=123/1/1 ``` ## Comments API The comments API lets you add comments directly to scraped data, which can later be viewed on the items page. ### Comment object | Field | Description | |----------|----------------------------------------| | id | Comment ID. | | created | Created date. | | archived | Archived date. | | author | Comment author. | | avatar | User gravatar URL. | | text | Comment text | | editable | If set to true, comment can be edited. | ### comments/:comment_id Edits or archives a comment. | Parameter | Description | Required | |-------------|---------------|------------| | comment_id | Comment ID. | Yes | | text | Comment text. | PUT | | Method | Description | Supported Parameters | |----------|----------------------|------------------------| | PUT | Update comment text. | comment_id, text | | DELETE | Delete comment. 
| comment_id | PUT example: ``` $ curl -X PUT -u YOUR_SCRAPY_CLOUD_API_KEY: --data 'text=my+new+text' "https://app.zyte.com/api/comments/12" ``` DELETE example: ``` $ curl -X DELETE -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/12" ``` ### comments/:project_id/:spider_id/:job_id Retrieves all comments for a job indexed by item or item/field. Example: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/14/13/12" { "0": [comment, comment, ...], "0/title": [comment, comment, ...], "12/url": [comment, comment, ...], } ``` Where `comment` is a comment object as defined above. ### comments/:project_id/stats Retrieves the number of items with unarchived comments for each job of the project. Example: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/51/stats" { "51/422/2": 1, "51/414/2": 1, "51/421/2": 1, "51/423/2": 4, "51/413/3": 3, "51/418/2": 1 } ``` ### comments/:project_id/:spider_id/:job_id/:item_no[/:field] Retrieves, updates or archives comments. | Parameter | Description | Required | |-------------|---------------|------------| | text | Comment text. | POST | | Method | Description | Supported parameters | |----------|----------------------------------------------------|------------------------| | GET | Retrieve comments for an item or field. | | | POST | Update the specified comments with the given text. | text | | DELETE | Archive the specified comment. 
| | GET examples: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/14/13/12/11" $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/14/13/12/11/logo" ``` POST examples: ``` $ curl -X POST --data 'text=some+text' -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/14/13/12/11" $ curl -X POST --data 'text=some+text' -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/14/13/12/11/logo" ``` DELETE examples: ``` $ curl -X DELETE -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/14/13/12/11" $ curl -X DELETE -u YOUR_SCRAPY_CLOUD_API_KEY: "https://app.zyte.com/api/comments/14/13/12/11/logo" ``` ## JobQ API The JobQ API allows you to retrieve finished jobs from the queue. > ###### NOTE > > Most of the features provided by the API are also available through the > python-scrapinghub client library. ### jobq/:project_id/count Count the jobs for the specified project. | Parameter | Description | Required | |-------------|------------------------------------------------------------|------------| | spider | Filter results by spider name. | No | | state | Filter results by state (pending/running/finished/deleted) | No | | startts | UNIX timestamp at which to begin results, in milliseconds. | No | | endts | UNIX timestamp at which to end results, in milliseconds. | No | | has_tag | Filter results by existing tags | No | | lacks_tag | Filter results by missing tags | No | > ###### HINT > > It’s possible to repeat `has_tag`, `lacks_tag` multiple times. In this case `has_tag` works as an `OR` operation, while `lacks_tag` works as an `AND` operation. 
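The tag semantics described in the hint can be modelled locally as in the sketch below. This is only an illustration of the `OR`/`AND` behaviour as stated in the hint, not the server implementation; the function name `matches_tags` and the job keys are hypothetical.

```python
def matches_tags(job_tags, has_tag=(), lacks_tag=()):
    """Return True if a job with job_tags passes the has_tag/lacks_tag filters."""
    job_tags = set(job_tags)
    # has_tag: OR semantics, at least one of the listed tags must be present.
    has_ok = not has_tag or any(tag in job_tags for tag in has_tag)
    # lacks_tag: AND semantics, every listed tag must be absent.
    lacks_ok = all(tag not in job_tags for tag in lacks_tag)
    return has_ok and lacks_ok


# Two hypothetical jobs, one tagged tagA and the other tagged tagB.
jobs = {"53/1/1": ["tagA"], "53/1/2": ["tagB"]}

count_all = sum(matches_tags(tags) for tags in jobs.values())
count_has = sum(matches_tags(tags, has_tag=["tagA", "tagB"]) for tags in jobs.values())
count_lacks = sum(matches_tags(tags, lacks_tag=["tagA", "tagB"]) for tags in jobs.values())
print(count_all, count_has, count_lacks)  # 2 2 0
```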
HTTP (assuming only 2 jobs, where the 1st one is tagged `tagA` and the 2nd one `tagB`):

```
$ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://jobq.zyte.com/jobq/53/count"
2
$ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://jobq.zyte.com/jobq/53/count?has_tag=tagA&has_tag=tagB"
2
$ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://jobq.zyte.com/jobq/53/count?lacks_tag=tagA&lacks_tag=tagB"
0
```

| Method   | Description                           | Supported parameters                              |
|----------|---------------------------------------|---------------------------------------------------|
| GET      | Count jobs for the specified project. | spider, state, startts, endts, has_tag, lacks_tag |

#### Examples

**Count jobs for a given project**

HTTP:

```
$ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://jobq.zyte.com/jobq/53/count
32110
```

### jobq/:project_id/list

Lists the jobs for the specified project, ordered from most recent to oldest.

| Field   | Description                                       |
|---------|---------------------------------------------------|
| ts      | The time at which the job was added to the queue. |

| Parameter   | Description                                                 | Required   |
|-------------|-------------------------------------------------------------|------------|
| spider      | Filter results by spider name.                              | No         |
| state       | Filter results by state (pending/running/finished/deleted). | No         |
| startts     | UNIX timestamp at which to begin results, in milliseconds.  | No         |
| endts       | UNIX timestamp at which to end results, in milliseconds.    | No         |
| count       | Limit results to a given number of jobs.                    | No         |
| start       | Skip the first N jobs in the results.                       | No         |
| stop        | The job key at which to stop showing results.               | No         |
| key         | Get job data for a given set of job keys.                   | No         |
| has_tag     | Filter results by existing tags.                            | No         |
| lacks_tag   | Filter results by missing tags.                             | No         |

| Method   | Description                          | Supported parameters   |
|----------|--------------------------------------|------------------------|
| GET | List jobs for the specified project.
| startts, endts, stop | #### Examples **List jobs for a given project** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://jobq.zyte.com/jobq/53/list {"key":"53/7/81","ts":1397762393489} {"key":"53/7/80","ts":1395111612849} {"key":"53/7/78","ts":1393972804722} {"key":"53/7/77","ts":1393972734215} ``` **List jobs finished between two timestamps** If you pass the `startts` and `endts` parameters, the API will return only the jobs finished between them. HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://jobq.zyte.com/jobq/53/list?startts=1359774955431&endts=1359774955440" {"key":"53/6/7","ts":1359774955439} {"key":"53/3/3","ts":1359774955437} {"key":"53/9/1","ts":1359774955431} ``` **Retrieve jobs finished after some job** JobQ returns the list of jobs, with the most recently finished first. We recommend associating the key of the most recently finished job with the downloaded data. When you want to update your data later on, you can list the jobs and stop at the previously downloaded job, through the `stop` parameter. Using HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://jobq.zyte.com/jobq/53/list?stop=53/7/81" {"key":"53/7/83","ts":1403610146780} {"key":"53/7/82","ts":1397827910849} ``` ## Job metadata API The Job metadata API allows you to get metadata for the given jobs. > ###### NOTE > > Most of the features provided by the API are also available through the > python-scrapinghub client library. ### jobs/:project_id/:spider_id/:job_id[/:field_name] Retrieve job data or specific meta field. 
#### Examples **Get metadata for the job** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/jobs/1/2/3 { "close_reason": "finished", "completed_by": "jobrunner", "deploy_id": 1, "finished_time": 1566311833872, "pending_time": 1566311800654, "priority": 2, "project": 1, "running_time": 1566311801163, "scheduled_by": "testuser", "scrapystats": { "downloader/request_bytes": 594, "downloader/request_count": 2, "downloader/request_method_count/GET": 2, "downloader/response_bytes": 1866, "downloader/response_count": 2, "downloader/response_status_count/200": 1, "downloader/response_status_count/404": 1, "elapsed_time_seconds": 3.211014, "finish_reason": "finished", "finish_time": 1566311822568.0, "item_scraped_count": 1, "log_count/DEBUG": 3, "log_count/INFO": 11, "log_count/WARNING": 1, "memusage/max": 72433664, "memusage/startup": 72433664, "response_received_count": 2, "robotstxt/request_count": 1, "robotstxt/response_count": 1, "robotstxt/response_status_count/404": 1, "scheduler/dequeued": 1, "scheduler/dequeued/disk": 1, "scheduler/enqueued": 1, "scheduler/enqueued/disk": 1, "start_time": 1566311819357.0 }, "spider": "testspider", "spider_args": {"arg1": "val1", "arg2": "val2"}, "spider_type": "manual", "started_by": "jobrunner", "state": "finished", "tags": [ "tag1", "tag2" ], "units": 2, "version": "6d32f52-master" } ``` > ###### WARNING > > Please consider the example response with caution. Some of the fields > appear only on specific conditions: for example, after finishing/deleting > or restoring a job. Some other fields highly depend on the given spider/job > configuration. There also might be some additional fields for internal use > only which can be changed at any given moment without prior notice. 
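Because some fields only appear under specific conditions, client code should not assume that every key in the example response is present. The following is a minimal Python sketch of defensive parsing; `summarize_job` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json


def summarize_job(meta):
    """Pick a few fields from job metadata, tolerating missing keys."""
    stats = meta.get("scrapystats") or {}
    return {
        "state": meta.get("state"),
        "close_reason": meta.get("close_reason"),  # may be absent
        "items": stats.get("item_scraped_count", 0),
        "tags": meta.get("tags", []),
    }


# Hypothetical metadata for a job that lacks most optional fields.
partial = json.loads('{"state": "running", "spider": "testspider"}')
print(summarize_job(partial))
```

Using `dict.get` with defaults keeps the code working when a field is missing or when new internal fields are added to the response.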
**Get specific metadata field for the job** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/jobs/1/2/3/tags [ "tag1", "tag2" ] ``` ## Items API > ###### NOTE > > Even though these APIs support writing, they are most often used for reading. The crawlers running on Scrapy Cloud are the ones that write to these endpoints. However, both operations are documented here for completeness. The Items API lets you interact with the items stored in the hubstorage backend for your projects. For example, you can download all the items for the job `'53/34/7'` through: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34/7 ``` > ###### NOTE > > Most of the features provided by the API are also available through the > python-scrapinghub client library. ### Item object | Field | Description | |------------------|---------------------------------------------------------------| | \_type | The item definition. | | \_template | The template matched against. Portia only. | | \_cached_page_id | Cached page ID. Used to identify the scraped page in storage. | Scraped fields will be top level alongside the internal fields listed above. ### items/:project_id[/:spider_id][/:job_id][/:item_no][/:field_name] Retrieve or insert items for a project, spider, or job, where `item_no` is the index of the item. | Parameter | Description | Required | |-------------|--------------------------------------------------------------------|------------| | format | Results format. See api-overview-resultformats. | No | | meta | Meta keys to show. | No | | nodata | If set, no data will be returned other than specified `meta` keys. | No | > ###### NOTE > > Pagination and meta parameters are supported, see api-overview-pagination and api-overview-metapar. | Header | Description | |---------------|------------------------------------------------------------| | Content-Range | Can be used to specify a start index when inserting items. 
| | Method | Description | Supported parameters | |----------|-----------------------------------------------------|------------------------| | GET | Retrieve items for a given project, spider, or job. | format, meta, nodata | | POST | Insert items for a given job | N/A | > ###### NOTE > > Always use the pagination parameters (`start`, `startafter` and `count`) to limit the number of items in a response, to prevent timeouts and other performance issues. See the pagination examples below for more details. #### Examples **Retrieve all items from a given job** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34/7 ``` **Retrieve the first item from a given job** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34/7/0 ``` **Retrieve values from a single field** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34/7/1/fieldname ``` Here, `1` is the index (`item_no`) of the item whose field value is retrieved. **Retrieve all items from a given spider** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34 ``` **Retrieve all items from a given project** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/ ``` **[Pagination] Retrieve first N items from a given job** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/34/7?count=10" ``` **[Pagination] Retrieve N items from a given job starting from the given item** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/34/7?count=10&start=53/34/7/20" ``` **[Pagination] Retrieve N items from a given job starting after the given item** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/34/7?count=10&startafter=53/34/7/19" ``` **[Pagination] Retrieve a few items from a given job by their IDs** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/34/7?index=5&index=6" 
``` **Get meta field from items** To get only metadata from items, pass the `nodata=1` parameter along with the meta field that you want to get. HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/1/7?meta=_key&nodata=1" {"_key":"53/1/7/0"} {"_key":"53/1/7/1"} {"_key":"53/1/7/2"} ``` **Get items in a specific format** Check the available formats in the api-overview-resultformats section of the API Overview. JSON: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/34/7?meta=_key&nodata=1" -H "Accept: application/json" [{"_key":"28144/1/1/0"},{"_key":"28144/1/1/1"},{"_key":"28144/1/1/2"}, ...] ``` JSON Lines: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/items/53/34/7?meta=_key&nodata=1" -H "Accept: application/x-jsonlines" {"_key":"28144/1/1/0"} {"_key":"28144/1/1/1"} {"_key":"28144/1/1/2"} ... ``` **Add items to a job via POST** Add the items stored in the file `items.jl` (JSON lines format) to the job `53/34/7`: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34/7 -X POST -T items.jl ``` Use the `Content-Range` header to specify a start index: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34/7 -X POST -T items.jl -H "content-range: items 500-/*" ``` The API will only return `200` if the data was successfully stored. There’s no limit on the amount of data you can send, but an `HTTP 413` response will be returned if any single item is over 1M. ### items/:project_id/:spider_id/:job_id/stats Retrieve the item stats for a given job. | Field | Description | |---------------------|--------------------------------------------| | counts[field] | The number of times the field was scraped. | | totals.input_bytes | The total size of all items in bytes. | | totals.input_values | The total number of items. 
| | Parameter | Description | Required | |-------------|-----------------------------------|------------| | all | Include hidden fields in results. | No | | Method | Description | Supported parameters | |----------|--------------------------------------------|------------------------| | GET | Retrieve item stats for the specified job. | all | #### Example **Get the stats from a given job** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/items/53/34/7/stats {"counts":{"field1":9350,"field2":514},"totals":{"input_bytes":14390294,"input_values":10000}} ``` ## Logs API The logs API lets you work with logs from your crawls. ### Log object | Field | Description | Required | |---------|--------------------------------------------------|------------| | message | Log message. | Yes | | level | Integer log level as defined in the table below. | Yes | | time | UNIX timestamp of the message, in milliseconds. | No | #### Log levels | Value | Log level | |---------|-------------| | 10 | DEBUG | | 20 | INFO | | 30 | WARNING | | 40 | ERROR | | 50 | CRITICAL | ### logs/:project_id/:spider_id/:job_id Retrieve or upload logs for a given job. | Parameter | Description | Required | |-------------|-------------------------------------------------|------------| | format | Results format. See api-overview-resultformats. | No | > ###### NOTE > > Pagination and meta parameters are supported, see api-overview-pagination and api-overview-metapar. | Method | Description | Supported parameters | |----------|----------------|------------------------| | GET | Retrieve logs. | format | | POST | Upload logs. 
| | #### Retrieving logs HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/logs/1111111/1/1/ {"time":1444822757227,"level":20,"message":"Log opened."} {"time":1444822757229,"level":20,"message":"[scrapy.log] Scrapy 1.0.3.post6+g2d688cd started"} ``` #### Submitting logs HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/logs/53/34/7 -X POST -T log.jl ``` ## Requests API The requests API allows you to work with request and response data from your crawls. > ###### NOTE > > Most of the features provided by the API are also available through the > python-scrapinghub client library. ### Request object | Field | Description | Required | |----------|-----------------------------------------|------------| | time | Request start timestamp in milliseconds | Yes | | method | HTTP method. Default: GET | Yes | | url | Request URL. | Yes | | status | HTTP response code. | Yes | | duration | Request duration in milliseconds. | Yes | | rs | Response size in bytes. | Yes | | parent | The index of the parent request. | No | | fp | Request fingerprint. | No | > ###### NOTE > > Seed requests from start URLs will have no parent field. ### requests/:project_id[/:spider_id][/:job_id][/:request_no] Retrieve or insert request data for a project, spider or job, where `request_no` is the index of the request. | Parameter | Description | Required | |-------------|--------------------------------------------------------------------|------------| | format | Results format. See api-overview-resultformats. | No | | meta | Meta keys to show. | No | | nodata | If set, no data will be returned other than specified `meta` keys. | No | > ###### NOTE > > Pagination and meta parameters are supported, see api-overview-pagination and api-overview-metapar. 
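When preparing request records as JSON Lines for upload, it can help to check each record against the Request object table above first. The sketch below is a hypothetical client-side check, not something the API requires: `REQUIRED_FIELDS` and `to_request_line` are illustrative names, and the sketch assumes that every field marked "Yes" in the table must be present.

```python
import json

# Fields marked as required in the Request object table.
REQUIRED_FIELDS = ("time", "method", "url", "status", "duration", "rs")


def to_request_line(record):
    """Serialize one request record as a single JSON line, or raise ValueError."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return json.dumps(record, separators=(",", ":"))


record = {
    "time": 1351521736957,
    "method": "GET",
    "url": "http://scrapy.org/",
    "status": 200,
    "duration": 12,
    "rs": 1024,
}
print(to_request_line(record))
```

Seed requests can simply omit the optional `parent` field, which this check allows.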
### requests/:project_id/:spider_id/:job_id #### Examples **Get the requests from a given job** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/requests/53/34/7 {"parent":0,"duration":12,"status":200,"method":"GET","rs":1024,"url":"http://scrapy.org/","time":1351521736957} ``` **Adding requests** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/requests/53/34/7 -X POST -T requests.jl ``` ### requests/:project_id/:spider_id/:job_id/stats Retrieve request stats for a given job. | Field | Description | |---------------------|------------------------------------------| | counts[field] | The number of times the field occurs. | | totals.input_bytes | The total size of all requests in bytes. | | totals.input_values | The total number of requests. | #### Example HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/requests/53/34/7/stats {"counts":{"url":21,"parent":19,"status":21,"method":21,"rs":21,"duration":21,"fp":21},"totals":{"input_bytes":2397,"input_values":21}} ``` ## Activity API Scrapinghub keeps track of certain project events such as when spiders are run or new spiders are deployed. This activity log can be accessed in the dashboard by clicking on **Activity** in the left sidebar, or programmatically through the API described below. ### activity/:project_id Retrieve messages for a specified project. Results are returned in reverse order. | Parameter | Description | Required | |-------------|--------------------------------------|------------| | count | Maximum number of results to return. | No | | Method | Description | Supported parameters | |----------|-------------------------------------------------|------------------------| | GET | Returns the messages for the specified project. | count | | POST | Creates a message. 
| | GET example: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/activity/1111111/?count=2" {"event":"job:completed","job":"1111111/3/4","user":"jobrunner"} {"event":"job:cancelled","job":"1111111/3/4","user":"example"} ``` POST example: ``` $ curl -d '{"foo": 2}' https://storage.zyte.com/activity/1111111/ {"foo":4} {"foo":3} ``` ### activity/projects Retrieve messages for multiple projects. Results are returned in reverse order. | Parameter | Description | Required | |-------------|-------------------------------------------------------------|------------| | count | Maximum number of results to return. | No | | p | Project ID. Multiple values supported. | No | | pcount | Maximum number of results to return per project. | No | | meta | Meta parameter to add to results. See api-overview-metapar. | No | | Method | Description | Supported parameters | |----------|--------------------------------------------------|------------------------| | GET | Returns the messages for the specified projects. | count, p, pcount, meta | GET example: ``` # Retrieve a single result for projects 1111111 and 2222222, using the ``meta`` parameter to include the project ID in the results. $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/activity/projects/?pcount=1&meta=_project&p=1111111&p=2222222" {"_project": 2222222, "bar": 1} {"_project": 1111111, "foo": 4} ``` ## Collections API *Collections* are key-value stores for an arbitrarily large number of records. They are especially useful for storing information produced or used by multiple scraping jobs. > ###### NOTE > > The frontier API is best suited to store queues of URLs > to be processed by scraping jobs. ### Quickstart A **collection** is identified by a *project id*, a *type*, and a *name*. A **record** can be any JSON dictionary. Records are identified by a `_key` field. *In the following examples, we use project id* `78` *and the regular storage type* `s` *for the collection named* `my_collection`. 
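The identification scheme above can be sketched in Python. This is only an illustration: `collection_url` and `to_jsonlines` are hypothetical helper names, and the sketch builds URLs and request bodies without sending anything.

```python
import json
from urllib.parse import quote

BASE_URL = "https://storage.zyte.com/collections"


def collection_url(project_id, store_type, name, key=None):
    """Build the URL of a collection, or of one record when a key is given."""
    url = f"{BASE_URL}/{project_id}/{store_type}/{name}"
    if key is not None:
        url += "/" + quote(key, safe="")
    return url


def to_jsonlines(records):
    """Serialize records as JSON Lines, checking that each one has a _key."""
    for record in records:
        if "_key" not in record:
            raise ValueError("every record needs a _key field")
    return "\n".join(json.dumps(record) for record in records)


print(collection_url(78, "s", "my_collection"))
print(collection_url(78, "s", "my_collection", key="foo"))
print(to_jsonlines([{"_key": "foo", "value": "bar"}, {"_key": "goo", "value": "baz"}]))
```

The output of `to_jsonlines` has the same shape as the body used when creating or updating multiple records in one request.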
> ###### NOTE > > Avoid using multiple collections with the same name and different types like `/s/my_collection` and `/cs/my_collection`. During operations on an entire collection, like renaming or deleting, Hubstorage will treat homonyms as a single entity and rename or delete both. #### Create/Update a record: ```shell $ curl -u $YOUR_SCRAPY_CLOUD_API_KEY: -X POST -d '{"_key": "foo", "value": "bar"}' \ https://storage.zyte.com/collections/78/s/my_collection ``` #### Access a record: ```shell $ curl -u $YOUR_SCRAPY_CLOUD_API_KEY: -X GET \ https://storage.zyte.com/collections/78/s/my_collection/foo ``` #### Delete a record: ```shell $ curl -u $YOUR_SCRAPY_CLOUD_API_KEY: -X DELETE \ https://storage.zyte.com/collections/78/s/my_collection/foo ``` #### List records: ```shell $ curl -u $YOUR_SCRAPY_CLOUD_API_KEY: -X GET \ https://storage.zyte.com/collections/78/s/my_collection ``` #### Create/Update multiple records: We use the `jsonline` format by default (json objects separated by a newline): ```shell $ curl -u $YOUR_SCRAPY_CLOUD_API_KEY: -X POST -d $'{"_key": "foo", "value": "bar"}\n{"_key": "goo", "value": "baz"}' \ https://storage.zyte.com/collections/78/s/my_collection ``` ### Details The following collection types are available: | Type | Full name | Hubstorage method | Description | |--------|-----------------------|----------------------------|------------------------------------------------------------------| | s | store | new_store | Basic set store | | cs | cached store | new_cached_store | Items expire after a month | | vs | versioned store | new_versioned_store | Up to 3 copies of each item will be retained | | vcs | versioned cache store | new_versioned_cached_store | Multiple copies are retained, and each one expires after a month | > ###### NOTE > > Avoid using multiple collections with the same name and different types like `/s/my_collection` and `/cs/my_collection`. 
During operations on an entire collection, like renaming or deleting, Hubstorage will treat homonyms as a single entity and rename or delete both. Records are `JSON` objects, with the following constraints: - Their serialized size can’t be larger than `1 MB`; - Javascript’s `inf` values are not supported; - Floating-point numbers can’t be larger than `2^64 - 1`. ### API #### collections/:project_id/list List all collections. ```shell $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/collections/78/list {"type":"s","name":"my_collection"} {"type":"s","name":"my_collection_2"} {"type":"cs","name":"my_other_collection"} ``` #### collections/:project_id/:type/:collection Read, write or remove items in a collection. | Parameter | Description | Required | |-------------|-----------------------------------------------------------------|------------| | key | Read items with a specified key. Multiple values are supported. | No | | prefix | Read items with a specified key prefix. | No | | prefixcount | Maximum number of values to return per prefix. | No | | startts | UNIX timestamp at which to begin results, in milliseconds. | No | | endts | UNIX timestamp at which to end results, in milliseconds. | No | | Method | Description | Supported parameters | |----------|---------------------------------------------|------------------------------------------| | GET | Read items from the specified collection. | key, prefix, prefixcount, startts, endts | | POST | Write items to the specified collection. | | | DELETE | Delete items from the specified collection. | key, prefix, prefixcount, startts, endts | > ###### NOTE > > Pagination and meta parameters are supported, > see api-overview-pagination and api-overview-metapar. 
GET examples: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/collections/78/s/my_collection?key=foo1&key=foo2" {"value":"bar1"} {"value":"bar2"} $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/collections/78/s/my_collection?prefix=f {"value":"bar"} $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: "https://storage.zyte.com/collections/78/s/my_collection?startts=1402699941000&endts=1403039369570" {"value":"bar"} ``` Prefix filters, unlike other filters, use indexes and should be used when possible. You can use the `prefixcount` parameter to limit the number of values returned for each prefix. A common pattern is to download changes within a certain time period. You can use the `startts` and `endts` parameters to select records within a certain time window. You can retrieve the current timestamp as follows: ``` $ curl https://storage.zyte.com/system/ts 1403039369570 ``` > ###### NOTE > > Timestamp filters may perform poorly when selecting a small number > of records from a large collection. #### collections/:project_id/:type/:collection/count Count the number of items in a collection. ```shell $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/collections/78/s/my_collection/count {"count":972,"scanned":972} ``` If the collection is large, the result may contain a `nextstart` field that is used for pagination, see api-overview-pagination. #### collections/:project_id/:type/:collection/:item Read, write, or delete an individual item. | Method | Description | |----------|------------------------------------| | GET | Read the item with the given key | | POST | Write the item with the given key | | DELETE | Delete the item with the given key | ```shell $ curl -u $YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/collections/78/s/my_collection/foo {"value":"bar"} ``` #### collections/:project_id/:type/:collection/:item/value Read an individual item value. 
```shell $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/collections/78/s/my_collection/foo/value bar ``` #### collections/:project_id/:type/:collection/deleted `POST` with a list of item keys to delete them. > ###### NOTE > > This endpoint is designed to delete a large number of > non-consecutive items. To delete consecutive items use > `DELETE`-based endpoints, which are faster. ```shell $ curl -u $YOUR_SCRAPY_CLOUD_API_KEY: -X POST -d '"foo"' -d '"bar"' \ https://storage.zyte.com/collections/78/s/my_collection/deleted ``` #### collections/:project_id/delete?name=:collection Delete an entire collection immediately. ```shell $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: -X POST "https://storage.zyte.com/collections/78/delete?name=my_collection" ``` #### collections/:project_id/rename?name=:collection&new_name=:new_name Rename a collection and move all its items immediately. ```shell $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: -X POST "https://storage.zyte.com/collections/78/rename?name=my_collection&new_name=my_collection_renamed" ``` ## Frontier API The *Hub Crawl Frontier* (HCF) stores pages visited and outstanding requests to make. It can be thought of as a persistent shared storage for a crawl scheduler. Web pages are identified by a fingerprint. This can be the URL of the page, but crawlers may use any other string (e.g. a hash of POST parameters, if the crawler processes POST requests), so there is no requirement for the fingerprint to be a valid URL. A project can have many frontiers and each frontier is broken down into slots. A separate priority queue is maintained per slot. This means that requests from each slot can be prioritized separately and crawled at different rates and at different times. Arbitrary data can be stored in both the crawl queue and with the set of fingerprints. A typical example would be to use the URL as a fingerprint and the hostname as a slot. 
The crawler should ensure that each host is only crawled from one process at any given time so that politeness can be maintained. > ###### NOTE > > Most of the features provided by the API are also available through the > python-scrapinghub client library. ### Batch object | Field | Description | |----------|------------------------------| | id | Batch ID. | | requests | An array of request objects. | ### Request object | Field | Description | Required | |---------|----------------------------------------------------------------------|------------| | fp | Request fingerprint. | Yes | | qdata | Data to be stored along with the fingerprint in the request queue. | No | | fdata | Data to be stored along with the fingerprint in the fingerprint set. | No | | p | Priority: lower priority numbers are returned first. Defaults to 0. | No | ### /hcf/:project_id/:frontier/s/:slot | Field | Description | |----------|--------------------------------------------------| | newcount | The number of new requests that have been added. | | Method | Description | Supported parameters | |----------|-------------------------------------------|------------------------| | POST | Enqueues a request in the specified slot. | fp, qdata, fdata, p | | DELETE | Deletes the specified slot. | | #### POST examples **Add a request to the frontier** HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: -d '{"fp":"/some/path.html"}' \ https://storage.zyte.com/hcf/78/test/s/example.com {"newcount":1} ``` **Add requests with additional parameters** By using the request depth as the priority, the website can be traversed in breadth-first order from the starting URL. HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: -d $'{"fp":"/"}\n{"fp":"page1.html", "p": 1, "qdata": {"depth": 1}}' \ https://storage.zyte.com/hcf/78/test/s/example.com {"newcount":2} ``` #### DELETE example The example below deletes the slot `example.com` from the frontier. 
HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: -X DELETE https://storage.zyte.com/hcf/78/test/s/example.com/ ``` ### /hcf/:project_id/:frontier/s/:slot/q Retrieve requests for a given slot. | Parameter | Description | Required | |-------------|---------------------------------------------|------------| | mincount | The minimum number of requests to retrieve. | No | HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/hcf/78/test/s/example.com/q {"id":"00013967d8af7b0001","requests":[["/",null]]} {"id":"01013967d8af7e0001","requests":[["page1.html",{"depth":1}]]} ``` ### /hcf/:project_id/:frontier/s/:slot/q/deleted Delete a batch of requests. Once a batch has been processed, clients should indicate that the batch is completed so that it will be removed and no longer returned when new batches are requested. This can be achieved by posting the IDs of the completed batches: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: -d '"00013967d8af7b0001"' https://storage.zyte.com/hcf/78/test/s/example.com/q/deleted ``` You can specify the IDs as arrays or single values. As with the previous examples, multiple lines of input are accepted. ### /hcf/:project_id/:frontier/s/:slot/f Retrieve fingerprints for a given slot. #### Example HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/hcf/78/test/s/example.com/f {"fp":"/"} {"fp":"page1.html"} ``` Results are ordered lexicographically by fingerprint value. ### /hcf/:project_id/list Lists the frontiers for a given project. #### Example HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/hcf/78/list ["test"] ``` ### /hcf/:project_id/:frontier/list Lists the slots for a given frontier. #### Example HTTP: ``` $ curl -u YOUR_SCRAPY_CLOUD_API_KEY: https://storage.zyte.com/hcf/78/test/list ["example.com"] ``` ## Scrapy Cloud Write Entrypoint > ###### NOTE > > This is the documentation of a low-level protocol that most Scrapy Cloud users don’t need to deal with. 
For more high-level documentation and user guides check the [Help Center](https://support.zyte.com/support/home). Scrapy Cloud Write Entrypoint is a write-only interface to Scrapy Cloud storage. Its main purpose is to make it easy to write crawlers and scripts compatible with Scrapy Cloud in different programming languages using [custom Docker images](https://support.zyte.com/support/solutions/articles/22000200425-deploying-custom-docker-images-on-scrapy-cloud). Jobs in Scrapy Cloud run inside Docker containers. When a Job container is started, a [named pipe](http://man7.org/linux/man-pages/man7/fifo.7.html) is created at the location stored in the `SHUB_FIFO_PATH` environment variable. To interface with Scrapy Cloud storage, your crawler has to open this named pipe and write messages to it, following a simple text-based protocol as described below. ### Protocol Each message is a line of ASCII characters terminated by a newline character. A message consists of the following parts: - a 3-character command (one of “ITM”, “LOG”, “REQ”, “STA”, or “FIN”), - followed by a space character, - then followed by a payload as a [JSON](http://json.org/) object, - and a final newline character `\n`. This is what an example log message looks like: ``` LOG {"time": 1485269941065, "level": 20, "message": "Some log message"} ``` This example and all the following examples omit the trailing newline character because it’s a non-printable character. This is how you would write the above example message in Python: ```python pipe.write('LOG {"time": 1485269941065, "level": 20, "message": "Some log message"}\n') pipe.flush() ``` Newline characters are used as message separators. So, make sure that the serialized JSON object payload doesn’t contain newline characters between key/value pairs, and that newline characters inside strings, for both keys and values, are properly escaped, i.e. written as an actual `\` (reverse solidus, backslash) followed by `n`. 
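One way to satisfy these rules is to let a JSON serializer produce the payload instead of writing it by hand. The following Python sketch assumes a hypothetical helper named `format_message`; it relies on the standard `json.dumps`, which by default emits the payload on a single line, escapes newlines inside strings, and escapes non-ASCII characters.

```python
import json

# The five commands listed in the protocol description.
COMMANDS = {"ITM", "LOG", "REQ", "STA", "FIN"}


def format_message(command, payload):
    """Return one protocol line: command, space, JSON payload, newline."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    # Without an indent argument, json.dumps never emits raw newlines, and
    # the default ensure_ascii=True keeps the whole message ASCII-only.
    return f"{command} {json.dumps(payload)}\n"


line = format_message("LOG", {"level": 20, "message": "Line 1\nLine 2"})
print(repr(line))
```

The resulting string can be passed to `pipe.write()` as in the example above; the only raw newline it contains is the final message terminator.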
Here’s an example of two consecutive log messages that carry multiline messages in the payload: ``` LOG {"time": 1485269941065, "level": 20, "message": "First multiline message. Line 1\nLine 2"} LOG {"time": 1485269941066, "level": 30, "message": "Second multiline message. Line 1\nLine 2"} ``` In Python this will look like this: ```python pipe.write('LOG {"time": 1485269941065, "level": 20, "message": "First multiline message. Line 1\\nLine 2"}\n') pipe.write('LOG {"time": 1485269941066, "level": 30, "message": "Second multiline message. Line 1\\nLine 2"}\n') pipe.flush() ``` Unicode characters in the JSON object MUST be escaped using the standard JSON `\u` four-hex-digits syntax, e.g. item `{"ключ": "значение"}` should look like this: ``` ITM {"\u043a\u043b\u044e\u0447": "\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u0435"} ``` The total size of the message MUST NOT exceed 1 MiB. For messages that exceed this size, an error is logged instead. #### ITM command The `ITM` command writes a single item into Scrapy Cloud storage. The `ITM` payload has no predefined schema. Example: ``` ITM {"key": "value"} ``` To support very simple scripts, the Scrapy Cloud Write Entrypoint allows sending plain JSON objects as items, i.e. without the 3-character command and space prefix. The following two messages are valid and equivalent: ``` ITM {"key": "value"} ``` ``` {"key": "value"} ``` #### LOG command The `LOG` command writes a single log message into Scrapy Cloud storage. The schema for the `LOG` payload is described in log-object. Example: ``` LOG {"level": 20, "message": "Some log message"} ``` #### REQ command The `REQ` command writes a single request into Scrapy Cloud storage. The schema for the `REQ` payload is described in request-object. Example: ``` REQ {"url": "http://example.com", "method": "GET", "status": 200, "rs": 10, "duration": 20} ``` #### STA command `STA` stands for stats and is used to populate the job stats page and to create graphs on the job details page. 
| Field | Description | Required | |---------|-------------------------------------------------|------------| | time | UNIX timestamp of the message, in milliseconds. | No | | stats | JSON object with arbitrary keys and values. | Yes | If the following keys are present in the `STA` payload, their values are used to populate the Scheduled Requests graph on the job details page: - `scheduler/enqueued` - `scheduler/dequeued` The key names above were picked for compatibility with [Scrapy stats](https://doc.scrapy.org/en/latest/topics/stats.html). Example: ``` STA {"time": 1485269941065, "stats": {"key": 0, "key2": 20.5, "scheduler/enqueued": 20, "scheduler/dequeued": 15}} ``` #### FIN command The `FIN` command is used to set the outcome of a crawler execution, once it’s finished. | Field | Description | Required | |---------|----------------------------------------------------------|------------| | outcome | String with custom outcome message, limited to 255 chars | Yes | Example: ``` FIN {"outcome": "finished"} ``` ### Printing to stdout and stderr The output printed by a job in Scrapy Cloud is automatically converted into log messages. Lines printed to `stdout` are converted into `INFO` level log messages. Lines printed to `stderr` are converted into `ERROR` level log messages. For example, if the script prints `Hello, world` to stdout, the resulting `LOG` command will look like this: ``` LOG {"time": 1485269941065, "level": 20, "message": "Hello, world"} ``` There’s very basic support for multiline standard output: if some output consists of multiple lines, where the first line starts with a non-space character and subsequent lines start with a space character, it is treated as a single log entry. 
For example, the following traceback in stderr:

```
Traceback (most recent call last):
  File "", line 1, in
NameError: name 'e' is not defined
```

will produce the following log messages:

```
LOG {"time": 1485269941065, "level": 40, "message": "Traceback (most recent call last):\n  File \"\", line 1, in "}
LOG {"time": 1485269941066, "level": 40, "message": "NameError: name 'e' is not defined"}
```

Resulting log messages are subject to the 1 MiB limit, which means that output longer than 1023 KiB is likely to cause errors.

> ###### WARNING
>
> Even though you can write log messages by printing them to stdout and stderr, we recommend
> that you use the named pipe and `LOG` messages instead. Due to the way data is sent between processes,
> it is not possible to maintain the order of messages coming from different sources
> (named pipe, stdout, stderr). Exclusive use of the named pipe both gives the best performance
> and guarantees that messages are received in exactly the same order they were sent.

### How to build a compatible crawler

Scripts or non-Scrapy spiders have to be deployed as [custom Docker images](https://support.zyte.com/support/solutions/articles/22000200425-deploying-custom-docker-images-on-scrapy-cloud).

Each spider needs to follow this pattern:

1. Get the path to the named pipe mentioned earlier from the `SHUB_FIFO_PATH` environment variable.

2. Open the named pipe for writing. E.g. in Python you do it like this:

```python
import os

path = os.environ['SHUB_FIFO_PATH']
pipe = open(path, 'w')
```

3. Write messages to the pipe. If you want to send a message instantly, you have to flush the stream; otherwise it may remain in the file buffer inside the crawler process. However, this is not always required, as the buffer is flushed once enough data is written or when the file object is closed (depending on the programming language you use):

```python
# write item
pipe.write('ITM {"a": "b"}\n')
pipe.flush()
# ...
# write request
pipe.write('REQ {"time": 1484337369817, "url": "http://example.com", "method": "GET", "status": 200, "rs": 10, "duration": 20}\n')
pipe.flush()
# ...
# write log entry
pipe.write('LOG {"time": 1484337369817, "level": 20, "message": "Some log message"}\n')
pipe.flush()
# ...
# write stats
pipe.write('STA {"time": 1485269941065, "stats": {"key": 0, "key2": 20.5}}\n')
pipe.flush()
# ...
# set outcome
pipe.write('FIN {"outcome": "finished"}\n')
pipe.flush()
```

4. Close the named pipe when the crawl is finished:

```python
pipe.close()
```

> ###### NOTE
>
> [scrapinghub-entrypoint-scrapy](https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/blob/master/sh_scrapy/writer.py) uses the Scrapy Cloud Write Entrypoint; check the code if you need an example.

## Frequently Asked Questions about Scrapy Cloud

### Can I use browser automation?

Of course!

We recommend using zapi-browser, to enjoy automatic ban avoidance and a powerful API. If you already have browser automation code, see zapi-browser-automation.

Alternatively, you can use a third-party service or remote tool.

If you are a paying customer, you can use Docker to run a browser automation tool like Playwright, Puppeteer, Selenium or Splash alongside your spider.

### How can I configure an unlisted spider setting?

When adding a setting to your spider settings, select the first entry from the drop-down list of settings, **Custom Name**. You can then enter the name and value of your setting.

> ###### TIP
>
> You can also use the **Raw Settings** tab to edit all of your spider
> settings as plain text.

### Do I have to use Zyte API?

While Zyte API and Scrapy Cloud work great together, they are separate products that you can use independently.

### Can I use third-party services in my spider code?

Yes, you can.

### What does “cancelled (stalled)” mean in my job’s outcome?
It means that your job was automatically cancelled by Scrapy Cloud because it did not produce any output (logs, requests or items) for one hour.

This outcome may indicate that your spider is stuck or not configured correctly. We recommend checking the job’s logs to investigate the issue.

### What does “killed by oom” mean in my job’s outcome?

It means that your job was killed by the operating system’s out-of-memory (OOM) killer because it exceeded the memory available to its unit(s).

To fix this, consider:

- Using more units to give your job more memory. Each unit provides 1 GB of memory.
- Reducing your spider’s memory usage. For example, lower `SCRAPER_SLOT_MAX_ACTIVE_SIZE` to limit how many response bodies are held in memory at once.

## Scrapy Cloud pricing

On signup, you get the following for free:

- A low-resource unit, with half as many resources as a regular unit.
- Your job data can be retained for up to 7 days before deletion.
- Your jobs can run for up to 1 hour.

If you purchase 1 or more units ($9/month per unit) you get the following:

- Your low-resource unit is replaced by your purchased units, each with twice as many resources.
- Your job data can be retained for up to 120 days before deletion.
- Your jobs have unlimited run time.
- You can schedule jobs and use Docker.

> ###### TIP
>
> Students get the benefits of purchasing 1 unit for free! [Learn more](https://www.zyte.com/scrapy-cloud-student-backpack/).

## Coding Agent Add-Ons

The following free-to-use Coding Agent Add-Ons help you write, run and test web scraping code faster, with no vendor lock-in (they work with and without Zyte services):

> ##### Zyte Web Data for Claude Code
>
> A **Claude Code** plugin for web scraping.

> ##### Web Scraping Copilot
>
> A **Visual Studio Code** extension for web scraping.
### Comparison

|                         | Zyte Web Data for Claude Code                           | Web Scraping Copilot                                                 |
|-------------------------|---------------------------------------------------------|----------------------------------------------------------------------|
| Supported harnesses     | [Claude Code](https://code.claude.com/docs/en/overview) | [Visual Studio Code](https://code.visualstudio.com/)                 |
| AI approach             | [Agent skills](https://agentskills.io/)                 | Custom agent and tools                                               |
| `Spider` class creation | Yes                                                     | No                                                                   |
| UI tooling              | Web app for reviewing schemas and extracted data        | Interactive tree views, test management, Scrapy Cloud job monitoring |

### Automatic extraction vs coding agents

zapi-extract is a feature of Zyte API that can extract structured data from any URL.

zapi-extract is designed to be robust to site changes, which you cannot get with Coding Agent Add-Ons. It increases per-request cost and response times, but it can save you time and effort in the long run by eliminating the need to write and maintain parsing code.

On the other hand, Coding Agent Add-Ons can be more flexible and cost-effective for many use cases, and while generated code may break when sites change, they make it easier and faster to adapt.

You should choose the approach that best fits your needs. It is also possible to combine both, for example by using generated code by default and zapi-extract as a fallback when websites change.

## Zyte Web Data for Claude Code

**Zyte Web Data for Claude Code** is a free-to-use [Claude Code](https://code.claude.com/docs/en/overview) plugin powered by [agent skills](https://agentskills.io) for web scraping.

> ##### Install
>
> Install Zyte Web Data for Claude Code.

> ##### Tutorial
>
> Take your first steps with Zyte Web Data for Claude Code.

## Install Zyte Web Data for Claude Code

To install **Zyte Web Data for Claude Code**:

1. [Install Claude Code](https://code.claude.com/docs/en/quickstart).
2.
Add the Claude Skills marketplace repository and install the plugin by running the following in a terminal: ```shell claude plugin marketplace add zyte-ai/claude-skills claude plugin install zyte-web-data@zyte-ai ``` See [https://github.com/zyte-ai/claude-skills](https://github.com/zyte-ai/claude-skills) for details. Follow the tutorial to learn more. ## Web Scraping Copilot **Web Scraping Copilot** is a free [Visual Studio Code](https://code.visualstudio.com/) extension by [Zyte](https://www.zyte.com/) that helps you generate web scraping code with [GitHub Copilot](https://github.com/features/copilot). It streamlines working with Scrapy projects and includes optional integration with Scrapy Cloud, making it easier to deploy and monitor your web scraping jobs. > ##### Requirements > > Find out what you need in order to use Web Scraping Copilot. > ##### Install > > Install Web Scraping Copilot. > ##### Features > > Discover the features of Web Scraping Copilot. > ##### FAQ > > Find all the answers about Web Scraping Copilot. > ##### Tutorial > > Follow the tutorial to learn the AI-assisted web scraping workflow. > ##### User interface > > Learn about the user interface of Web Scraping Copilot. ## Web Scraping Copilot requirements Web Scraping Copilot requires [Visual Studio Code 1.109+](https://code.visualstudio.com/Download). > ###### TIP > > All other requirements can be installed and set up with the help of > Web Scraping Copilot, but are listed below for reference. 
### Minimum requirements

The core features require a Scrapy project and a Python virtual environment that meets the following requirements:

- [Python 3.10+](https://www.python.org/downloads/)
- [Scrapy 2.7.0+](https://www.scrapy.org/download)
- [itemadapter 0.13.0+](https://github.com/scrapy/itemadapter)
- zyte-common-items 0.29.0+

### Code generation requirements

Code generation requires:

- [GitHub Copilot](https://github.com/features/copilot) and its [chat extension](https://marketplace.visualstudio.com/items?itemName=GitHub.copilot-chat).

  While any GitHub Copilot plan is technically supported, the limited requests of the Free plan can run out quickly, so Pro or better is recommended.

- The following additional packages in your virtual environment:
  - web-poet 0.22.0+
  - scrapy-poet 0.26.0+ (requires setup)
  - pytest 7.0.0+

## Install Web Scraping Copilot

[Install Web Scraping Copilot](vscode:extension/zyte.web-scraping) on Visual Studio Code ([marketplace](https://marketplace.visualstudio.com/items?itemName=zyte.web-scraping)), open the sidebar view, and follow the setup instructions.

Follow the tutorial to learn more.

### Troubleshooting

#### The MCP server fails to start

If the MCP server fails to start, check its output by opening **View › Command Palette… › MCP: List Servers › Web Scraping Copilot › Show Output**.

If you see `realpath: command not found` in the output, chances are you are running macOS 12 (Monterey). macOS 12 is end-of-life, so consider upgrading to a newer macOS version. If you cannot upgrade, [install realpath](https://ports.macports.org/port/realpath/).

#### Other issues

If you cannot find your issue in this list, or the proposed workarounds do not work for you, please [report it](https://github.com/zytedata/web-scraping-copilot/issues).

## Web Scraping Copilot features

Web Scraping Copilot provides the following features:

### Code generation

Generate maintainable web scraping code with [GitHub Copilot](https://github.com/features/copilot).
![image](_static/copilot/ai-workflow-0.1.0.gif) See copilot-tutorial. ### Test management Compare extracted data to expectations, including expected exceptions, check target pages in the embedded browser, and more. ![image](_static/copilot/test-management-1.0.0.gif) ### Project setup Start a new project in seconds. ![image](_static/copilot/new-project-1.0.0.gif) ### Scrapy Cloud integration If you use Scrapy Cloud, you can deploy your spiders with a click, and monitor cloud jobs from the spiders view. ![image](_static/copilot/scrapy-cloud-integration-0.1.0.png) ## Web Scraping Copilot FAQ These are some frequently asked questions about Web Scraping Copilot: ### How much does Web Scraping Copilot cost? **Web Scraping Copilot** in itself is **free**. To use **code generation**, you do need a [GitHub Copilot](https://github.com/features/copilot) plan. The Free plan is not recommended because you would spend your requests rather quickly. To use Scrapy Cloud features, you need a Scrapy Cloud account. The free plan is fine, though. ### Does the extension use AI from Zyte? No — your [GitHub Copilot](https://github.com/features/copilot) AIs are used. The extension provides instructions and prompts, and the MCP server tools use [MCP sampling](https://modelcontextprotocol.io/specification/2025-11-25/client/sampling) to start separate chats in the background to handle the different steps of code generation. To control which models can be used by the MCP server, open the [Command Palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) (`Ctrl + Shift + P`) and select **MCP: List Servers › Web Scraping Copilot › Configure Model Access**. ### Is my code sent to Zyte? The code generation workflow that the extension facilitates does not send any code to Zyte, only to [GitHub Copilot](https://github.com/features/copilot). Scrapy Cloud deployment, if used, does upload your code to Scrapy Cloud. ### Which model is best for code generation? 
The model you use in the **main chat** should be somewhat smart, since workflow management can be hard for smaller models. We recommend something like **GPT‑5**, although GPT‑5 mini has shown good results in our tests. The MCP web scraping **tools**, to generate expectations and code, are designed to work well enough with models for which [GitHub Copilot](https://github.com/features/copilot) paid plans (Pro or better) allow unlimited requests, like **GPT‑5 mini**. Given the number of requests that those tools can generate, it could be very costly to use a smarter model. If you don’t mind the extra cost, however, [Claude Sonnet 4.6 offers great quality](https://www.zyte.com/blog/llm-benchmark-claude-sonnet-46/). ## Web Scraping Copilot user interface ### Sidebar The Web Scraping Copilot sidebar provides access to the main features of the extension, organized in the following views: Items, Page Objects, Spiders, Zyte API, Extension Status, and Feedback. > The **Items** view provides details and access to all your Scrapy > items. > > Specifically, it shows items: > > - Defined in your `items.py` file. > - Declared as output items by your page objects. > - Provided by the installed version of zyte-common-items. > The **Page Objects** view provides details and access to all your > web-poet page-objects. > > It also makes it easy to use AI to add new page objects or to update > some or all fields of existing page objects. See > copilot-tutorial. > > It also helps run, view, manage and debug tests for your page objects. > The **Spiders** view provides access to all your Scrapy spiders, and helps you to run them locally. > > It also makes it easy to deploy them to Scrapy Cloud and monitor their jobs. > The **Zyte API** view helps you set up scrapy-zyte-api to use Zyte API. > The **Extension Status** view helps you set up the requirements of Web Scraping Copilot, and check if they are > met. 
> The **Feedback** view provides a link to [our issue tracker](https://github.com/zytedata/web-scraping-copilot/issues). > > It may sometimes also include links to surveys or other feedback > channels. ## Installing the Zyte CA certificate On some operating systems or web browsers, using Zyte services like Zyte API may require installing the Zyte CA certificate. You can tell that is your case when every attempt to use a given Zyte service results in an error about SSL certificate verification. To install the Zyte CA certificate, get it and follow the instructions below for your operating system or web browser. ### Get the Zyte CA certificate Download the certificate from [here](https://docs.zyte.com/_static/zyte-ca.crt). ### Operating systems #### Windows 10 1. Press the `Win key + R` hotkey and input `mmc` in Run to open the Microsoft Management Console window. 2. Click `File` and select `Add/Remove Snap-ins`. 3. In the opened window select `Certificates` and press the `Add >` button. 4. In the Certificates Snap-in window select `Computer account > Local Account`, and press the `Finish` button to close the window. 5. Press the `OK` button in the Add or Remove Snap-in window. 6. Back in the Microsoft Management Console window, select `Certificates` under Console Root and right-click `Trusted Root Certification Authorities`. 7. From the context menu select `All Tasks > Import` to open the Certificate Import Wizard window from which you can add the Zyte CA certificate. More details can be found [here](https://windowsreport.com/install-windows-10-root-certificates/). #### macOS 1. Install Python certificates: ```bash /Applications/Python\ 3.x/Install\ Certificates.command ``` > ###### NOTE > > Replace `3.x` with your Python version. 2. Install the Zyte CA certificate: 1. Open Keychain Access window (`Launchpad > Other > Keychain Access`). 2. 
Select the `System` tab under Keychains, then drag and drop the downloaded certificate file (or select File > `Import Items...` and navigate to the file).
3. Enter the administrator password to modify the keychain.
4. Double-click the `Crawlera CA` certificate entry, expand Trust, and next to When using this certificate: select `Always Trust`.
5. Close the window and enter the administrator password again to update the settings.

#### Linux

1. Install the downloaded Zyte CA certificate file:

```bash
sudo cp zyte-ca.crt /usr/local/share/ca-certificates/zyte-ca.crt
```

2. Update stored Certificate Authority files:

```bash
sudo update-ca-certificates
```

### Web browsers

#### Firefox

1. Open Preferences, go to the Privacy & Security tab, scroll down to the Certificates section, and click the View Certificates… button to open the Certificate Manager.
2. Under the Authorities tab, click the Import… button and navigate to the certificate file.
3. In the window that opens (You have been asked to trust a new Certificate Authority (CA)), check the first option, Trust this CA to identify websites, and click the OK button to finish importing the certificate.
4. Click the OK button to save settings and exit the Certificate Manager.

#### Chrome

1. Click the triple-dot icon in the top right corner and choose `Settings`.
2. Scroll to the `Privacy and security` section and click on `Security`.
3. Scroll down to find and click `Manage Certificates`.
4. The next steps depend on the operating system:
   - On macOS, the previous action opens `Keychain Access`; see the macOS instructions above.
   - On Windows, the Certificates application should appear. Select the `Trusted Root Certification Authorities` tab, click the `Import...` button, and navigate to the certificate file. Verify that the import was successful and that the installed certificate is displayed under the Trusted Root Certification Authorities tab, then close the Certificates window.
### Tech stacks

#### Node.js

Point the `NODE_EXTRA_CA_CERTS` environment variable to the Zyte CA certificate.

#### Python

To use [requests](https://requests.readthedocs.io/en/latest/), build a CA bundle and point the `REQUESTS_CA_BUNDLE` environment variable to it.

### Alternative files

#### CA bundle

Sometimes you cannot specify an *extra* certificate, like the Zyte CA certificate, and instead you must specify a CA certificate *bundle* that *includes* the Zyte CA certificate.

To generate such a CA certificate bundle:

1. Get a generic CA certificate bundle in PEM format, e.g. [curl’s](https://curl.se/ca/cacert.pem).
2. Append the contents of the Zyte CA certificate to the end of the generic CA certificate bundle file.

You can then use the resulting file as a CA certificate bundle that supports Zyte domains.

#### PKCS#12

If you need the certificate in PKCS#12 format, you can generate it with the following OpenSSL command:

```bash
openssl pkcs12 -export -nokeys -password pass: -in zyte-ca.crt -out zyte-ca.p12
```
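
The two steps for building a CA bundle described earlier can also be scripted. The following is a minimal Python sketch, assuming that you have already downloaded both files and that the Zyte CA certificate is PEM-encoded. The function name `build_ca_bundle` and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def build_ca_bundle(generic_bundle, zyte_ca, output):
    """Write a CA bundle: the generic bundle followed by the Zyte CA certificate.

    Files are handled as bytes so that any non-ASCII comments in the
    generic bundle are copied unchanged.
    """
    generic_pem = Path(generic_bundle).read_bytes()
    zyte_pem = Path(zyte_ca).read_bytes()
    # Make sure the appended certificate starts on its own line.
    if not generic_pem.endswith(b"\n"):
        generic_pem += b"\n"
    output_path = Path(output)
    output_path.write_bytes(generic_pem + zyte_pem)
    return output_path
```

For example, calling `build_ca_bundle("cacert.pem", "zyte-ca.crt", "zyte-ca-bundle.pem")` would write the combined file, and you could then point `REQUESTS_CA_BUNDLE` to the path of `zyte-ca-bundle.pem`.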