# Exporting to Azure Storage with Scrapy
To configure a Scrapy project or spider to export scraped data to Azure Storage:
You need Python 3.8 or higher. If you are using Scrapy Cloud, make sure you are using stack `scrapy:1.7-py38` or higher; using the latest stack (`scrapy:2.13-20250721`) is generally recommended.

Install scrapy-feedexporter-azure-storage:

```bash
pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
```
If you are using Scrapy Cloud, remember to add the following line to your `requirements.txt` file:

```text
scrapy-feedexporter-azure-storage @ git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
```
In your `settings.py` file, define `FEED_STORAGES` as follows:

```python
# settings.py
FEED_STORAGES = {
    "azure": "scrapy_azure_exporter.AzureFeedStorage",
}
```

If the setting already exists in your `settings.py` file, add the key-value pair above to the existing setting instead of re-defining it.
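For example, if your project already maps another URI scheme in `FEED_STORAGES`, the merged setting might look like this (the pre-existing entry is hypothetical):

```python
# settings.py
FEED_STORAGES = {
    # Hypothetical entry that was already present in the project.
    "myscheme": "myproject.storages.MySchemeFeedStorage",
    # Entry required by scrapy-feedexporter-azure-storage.
    "azure": "scrapy_azure_exporter.AzureFeedStorage",
}
```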
Add a `FEEDS` setting to your project or spider, if you have not added one yet. The value of `FEEDS` must be a JSON object (`{}`). If you already have `FEEDS` defined with key-value pairs, you can keep them; `FEEDS` supports exporting data to multiple file storage service locations.

To add `FEEDS` to a project, define it in your Scrapy Cloud project settings or add it to your `settings.py` file:

```python
# settings.py
FEEDS = {}
```
To add `FEEDS` to a spider, define it in your Scrapy Cloud spider-specific settings (open a spider in Scrapy Cloud and select the Settings tab), or add it to your spider code with the `update_settings()` method or the `custom_settings` class variable:

```python
# spiders/myspider.py
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        "FEEDS": {},
    }
```
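If you prefer the `update_settings()` approach mentioned above, a minimal sketch could look like the following; the spider name is a placeholder, and the empty dictionary stands in for your actual feed configuration:

```python
# spiders/myspider.py (sketch)
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    @classmethod
    def update_settings(cls, settings):
        # Keep the default behaviour (which applies custom_settings),
        # then set FEEDS explicitly at spider priority.
        super().update_settings(settings)
        settings.set("FEEDS", {}, priority="spider")
```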
Add the following key-value pair to `FEEDS`:

```python
{
    "azure://<ACCOUNT>.blob.core.windows.net/<CONTAINER>/<PATH>": {
        "format": "<FORMAT>"
    }
}
```

Where:

- `<ACCOUNT>` is the name of your storage account, e.g. `myaccount`.
- `<CONTAINER>` is the name of your container, e.g. `mycontainer`.
- `<PATH>` is the path where you want to store the scraped data file, e.g. `scraped/data.csv`. The path can include placeholders that are replaced at run time, such as `%(time)s`, which is replaced by the current timestamp.
- `<FORMAT>` is the desired output file format. Possible values include `csv`, `json`, `jsonlines`, and `xml`. You can also implement support for more formats.
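For example, a filled-in entry using the example values above might look like this (account, container, and path are hypothetical):

```python
# settings.py
FEEDS = {
    "azure://myaccount.blob.core.windows.net/mycontainer/scraped/items-%(time)s.csv": {
        "format": "csv",
    },
}
```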
**Warning:** If you export in CSV format, and in your spider code you yield items as Python dictionaries, only the fields present on the first yielded item are exported for all items.

One solution is to customize the output fields through the `fields` feed option of `FEEDS` or through the `FEED_EXPORT_FIELDS` Scrapy setting, to explicitly indicate all fields to export. Alternatively, you can yield something other than a Python dictionary that supports declaring all possible fields, such as an `Item` object or an attrs object.
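For instance, both approaches could look like the following minimal sketches; the field names are hypothetical:

```python
# settings.py: explicitly list every field to export.
FEED_EXPORT_FIELDS = ["name", "price", "url"]
```

```python
# items.py: declare all possible fields on an Item class and yield
# instances of it from your spider instead of plain dictionaries.
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
```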
Define the `AZURE_ACCOUNT_URL` and `AZURE_ACCOUNT_KEY` settings with your credentials:

```python
# settings.py
AZURE_ACCOUNT_URL = "https://<ACCOUNT>.blob.core.windows.net"
AZURE_ACCOUNT_KEY = "<KEY>"
```
You can alternatively set the `AZURE_CONNECTION_STRING` setting to a connection string:

```python
# settings.py
AZURE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=xxxx;AccountKey=xxxx;EndpointSuffix=core.windows.net"
```
Or, if you have an account URL that includes a SAS token, use the `AZURE_ACCOUNT_URL_WITH_SAS_TOKEN` setting instead:

```python
# settings.py
AZURE_ACCOUNT_URL_WITH_SAS_TOKEN = "https://my.blob.core.windows.net/source-en/source-english.docx?sv=2019-12-12&st=2021-01-26T18%3A30%3A20Z&se=2021-02-05T18%3A30%3A00Z&sr=c&sp=rl&sig=d7PZKyQsIeE6xb%2B1M4Yb56I%2FEEKoNIF65D%2Fs0IFsYcE%3D"
```
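Rather than hard-coding credentials in `settings.py`, you may prefer to read them from an environment variable; a minimal sketch, where `AZURE_STORAGE_KEY` is a variable name of your own choosing:

```python
# settings.py
import os

AZURE_ACCOUNT_URL = "https://<ACCOUNT>.blob.core.windows.net"
AZURE_ACCOUNT_KEY = os.environ.get("AZURE_STORAGE_KEY")
```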
Running your spider now, locally or on Scrapy Cloud, will export your scraped data to the configured Azure Storage location.