Exporting to Amazon S3 with Scrapy#

To configure a Scrapy project or spider to export scraped data to Amazon S3:

  1. Install boto3:

    pip install boto3
    

    If you are using Scrapy Cloud, remember to add the following line to your requirements.txt file:

    boto3
    
  2. Add a FEEDS setting to your project or spider, if not added yet.

    The value of FEEDS must be a JSON object ({}) when defined in Scrapy Cloud settings, or the equivalent Python dict when defined in code.

    If you already have FEEDS defined with key-value pairs, you can keep them; FEEDS supports exporting data to multiple file storage service locations at once (see the example at the end of this step).

    To add FEEDS to a project, define it in your Scrapy Cloud project settings or add it to your settings.py file:

    settings.py#
    FEEDS = {}
    

    To add FEEDS to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the Settings tab) or add it to your spider code with the update_settings method or the custom_settings class variable:

    spiders/myspider.py#
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        custom_settings = {
            "FEEDS": {},
        }
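
    For example, a project that already exports to a local file can keep that entry and add more destinations alongside it, because each key of FEEDS is a separate output location. A minimal sketch, with placeholder file names:

    settings.py#
    FEEDS = {
        "items.json": {"format": "json"},
        "backup/items.csv": {"format": "csv"},
    }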
    
  3. Add the following key-value pair to FEEDS:

    {
        "s3://<BUCKET>/<PATH>": {
            "format": "<FORMAT>"
        }
    }
    

    Where:

    • <BUCKET> is your bucket name, e.g. mybucket.

    • <PATH> is the path where you want to store the scraped data file, e.g. scraped/data.csv.

      The path can include placeholders that are replaced at run time, such as %(time)s, which is replaced by the current timestamp, or %(name)s, which is replaced by the spider name.

      Warning

      Any pre-existing file in the specified path will be overwritten. Amazon S3 does not support appending to a file.
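
      To keep earlier exports, you can include a timestamp placeholder in the path so that each run writes to a new object. A minimal sketch with a placeholder bucket name:

      {
          "s3://mybucket/scraped/%(time)s/data.csv": {
              "format": "csv"
          }
      }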

    • <FORMAT> is the desired output file format.

      Possible values include: csv, json, jsonlines, xml. You can also implement support for more formats.

      Warning

      If you export in CSV format and your spider yields items as Python dictionaries, only the fields present in the first yielded item are exported, for all items; fields that appear only in later items are dropped.

      One solution is to customize the output fields through the fields feed option of FEEDS or through the FEED_EXPORT_FIELDS Scrapy setting, explicitly listing every field to export (see the example below).

      Alternatively, instead of a plain Python dictionary, you can yield an object type that declares all possible fields, such as an Item object or an attrs object.
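
    For illustration, here is a FEEDS entry that puts these options together; the bucket name, path, and field names are placeholders:

    settings.py#
    FEEDS = {
        "s3://mybucket/scraped/data.csv": {
            "format": "csv",
            # Export exactly these columns for every item, even if the
            # first yielded item does not include all of them.
            "fields": ["title", "price", "url"],
        },
    }

    If you prefer to declare the fields on the items themselves, a minimal Item class could look like this, again with placeholder field names:

    items.py#
    import scrapy

    class Product(scrapy.Item):
        # Declaring every field up front means the exported columns do not
        # depend on which item happens to be yielded first.
        title = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()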

  4. Define the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY Scrapy settings with your AWS access key ID and secret access key:

    settings.py#
    AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"
    AWS_SECRET_ACCESS_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    

    If you use temporary security credentials, define the AWS_SESSION_TOKEN setting in addition to your access key ID and secret access key.

    Additional settings exist to define a target region, a custom access-control list, or a custom endpoint.
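
    For example, a configuration using temporary security credentials and an explicit region might look like the sketch below. The values are placeholders, and the region and ACL setting names shown here (AWS_REGION_NAME, FEED_STORAGE_S3_ACL) should be confirmed against the Scrapy documentation for your version:

    settings.py#
    AWS_ACCESS_KEY_ID = "ASIAIOSFODNN7EXAMPLE"
    AWS_SECRET_ACCESS_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    AWS_SESSION_TOKEN = "<YOUR-SESSION-TOKEN>"
    AWS_REGION_NAME = "us-east-1"
    FEED_STORAGE_S3_ACL = "bucket-owner-full-control"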

Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Amazon S3 location.
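
For reference, here is a minimal end-to-end sketch of a spider configured through custom_settings, assuming the credentials are defined in settings.py or in your Scrapy Cloud settings as shown above; the bucket, spider name, and start URL are placeholders:

    spiders/myspider.py#
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["https://example.com"]

        custom_settings = {
            "FEEDS": {
                "s3://mybucket/scraped/%(name)s_%(time)s.json": {
                    "format": "json",
                },
            },
        }

        def parse(self, response):
            # Placeholder extraction logic; replace with your own selectors.
            yield {"title": response.css("title::text").get()}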