
How to Connect Data from Azure Blob Storage to Databricks

Learn how to connect Azure Blob Storage to Databricks in real time using Estuary Flow. Stream JSON, CSV, or compressed files directly into Delta Lake—no Spark jobs or scripts required.


Introduction: Why Connect Azure Blob Storage to Databricks?

Azure Blob Storage is a go-to solution for storing structured and semi-structured data in the cloud—CSV logs, JSON files, Avro records, even compressed snapshots. And Databricks? It’s one of the most powerful engines for transforming, querying, and analyzing that data at scale.

Naturally, teams want to connect the two.

Whether you’re running analytics on raw events, staging data for machine learning, or building dashboards in real time, moving files from Azure Blob Storage into Databricks Delta Lake is a common requirement. But how you make that connection matters.

The traditional approach—mounting storage, writing Spark jobs, and manually parsing files—can be slow, error-prone, and difficult to scale. In this guide, we’ll explore both batch and real-time options for getting your Blob data into Databricks, including a low-latency, no-code pipeline using Estuary Flow.

By the end, you’ll know exactly how to:

  • Set up a working connection between Azure Blob and Databricks
  • Avoid common batch pitfalls
  • Build a future-proof streaming pipeline in minutes

Let’s get started.

Common Architectures for Azure Blob to Databricks Integration

There are several ways to move data from Azure Blob Storage into Databricks, depending on your scale, latency requirements, and technical preferences. Below are the most common approaches teams use today:

1. Direct Mounting of Azure Blob via DBFS

Databricks can access Azure Blob Storage through the ABFS (Azure Data Lake Storage Gen2) or legacy WASB drivers by mounting a container to the Databricks File System (DBFS). This method is simple and commonly used for ad hoc analysis or periodic data access.
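
For reference, here is a minimal sketch of the mount approach in a Databricks notebook, assuming an ADLS Gen2-enabled storage account accessed via a service principal whose secret is stored in a Databricks secret scope (all placeholder names below are illustrative, not required values):

    # Mount an ADLS Gen2 container to DBFS using OAuth credentials from a service principal.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("<secret-scope>", "<secret-key>"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/blob-data",
        extra_configs=configs,
    )

    # Once mounted, files are visible through ordinary DBFS paths.
    display(dbutils.fs.ls("/mnt/blob-data"))

After the mount exists, any notebook in the workspace can read those files with standard spark.read calls.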

Pros:

  • Easy setup using dbutils.fs.mount
  • No need for explicit ingestion—files can be accessed directly as a source

Cons:

  • Mounting is limited to workspace-scoped access
  • No automatic schema enforcement or updates
  • Not ideal for high-frequency updates or streaming scenarios

2. Using Auto Loader with Structured Streaming

Databricks offers Auto Loader, a feature built on Structured Streaming that incrementally processes new files arriving in Blob Storage.
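
A minimal Auto Loader sketch, assuming newline-delimited JSON files landing under an abfss:// path and a Unity Catalog target table (the paths and table name below are placeholders):

    # Incrementally discover and ingest new JSON files from Blob Storage.
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
        .load("abfss://<container>@<storage-account>.dfs.core.windows.net/events/")
    )

    # Write to a Delta table; checkpointing ensures only new files are processed on each run.
    (
        stream.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/events")
        .trigger(availableNow=True)  # or processingTime="1 minute" for continuous micro-batches
        .toTable("main.raw.events")
    )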

Pros:

  • Efficient file discovery via notifications or file listings
  • Supports schema evolution and checkpointing
  • Scales well for streaming ingestion

Cons:

  • Requires Spark-specific setup
  • Still involves some manual config for file formats and schema inference
  • Latency is higher than real-time ingestion tools (often seconds to minutes)

3. Batch Ingestion via COPY INTO or read APIs

You can run SQL or PySpark jobs in Databricks to read files from Blob Storage using commands like COPY INTO or spark.read, for example df = spark.read.format("csv").load(...).
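
As a rough sketch, both patterns look like this in a notebook (paths and table names are placeholders):

    # One-off batch read of CSV files from Blob Storage, appended to a Delta table.
    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("abfss://<container>@<storage-account>.dfs.core.windows.net/exports/2024/")
    )
    df.write.format("delta").mode("append").saveAsTable("main.raw.daily_exports")

    # COPY INTO is idempotent: files that were already loaded are skipped on re-runs.
    spark.sql("""
        COPY INTO main.raw.daily_exports
        FROM 'abfss://<container>@<storage-account>.dfs.core.windows.net/exports/2024/'
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    """)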

Pros:

  • Good for large batch jobs or one-time historical loads
  • Full control over file parsing and transformations

Cons:

  • No support for real-time or continuous sync
  • Prone to duplication or missed records if not managed carefully
  • Often requires orchestration via external tools (e.g., Airflow)

4. Real-Time Streaming with Estuary Flow

Estuary Flow allows you to continuously capture new or updated files from Azure Blob Storage and materialize them directly into Databricks Delta Lake with low latency—no Spark jobs required.

Pros:

  • Fully managed pipeline with automatic detection of new blobs
  • Real-time streaming delivery into Delta tables
  • Built-in schema inference, validation, and evolution
  • No need to configure Auto Loader or batch pipelines

Cons:

  • A fully managed pipeline may be more than you need for a one-off ingestion; it is best suited to teams that want continuous sync, operational simplicity, and automation

By understanding these architectures, you can choose the best path based on your project’s latency, scale, and transformation needs. In the next section, we’ll go deeper into how Estuary Flow simplifies the Azure Blob to Databricks integration.

Why Use Estuary Flow to Connect Azure Blob Storage to Databricks?

While Databricks offers several native options to pull data from Azure Blob Storage, setting them up typically requires Spark code, manual schema management, and scheduling logic. Estuary Flow offers a simpler, more automated solution—especially when you need continuous data movement with minimal effort.

Here’s why engineering teams choose Estuary Flow for this integration:

  • Streaming Instead of Polling - Estuary continuously watches your Azure Blob container for new or updated files. As soon as a matching file is added (e.g., a new JSON or CSV upload), Flow picks it up and streams it directly into your Databricks Delta Lake. No polling intervals or trigger jobs required.
  • Schema-Aware Ingestion - Flow can automatically infer schema from your data or let you define it explicitly. It validates each record against the schema and ensures clean, queryable data ends up in Databricks—no malformed rows or schema drift issues to debug downstream.
  • Support for Multiple File Formats - Whether you’re storing compressed JSON logs, zipped CSV reports, or newline-delimited Protobuf events, Flow can handle them. You define the file format once—or let Estuary detect the type—and Estuary takes care of parsing and normalization.

    Estuary supports popular formats like JSON, CSV, Avro, Protobuf, and even compressed file types (ZIP, GZIP, etc.). It auto-detects file structure in most cases, but you can override this by specifying schema, delimiters, encoding, or headers—especially useful for non-standard CSVs or mixed-type datasets.
  • Built-in Support for Delta Format - Estuary materializes data into Databricks as Delta Lake tables, supporting ACID transactions and native time travel. You don’t need to write merge statements, manage state, or build custom SCD logic—Delta handles the querying, Flow handles the ingestion.

    While Estuary detects and streams new blobs within seconds, the actual latency depends on blob creation timing and parser configuration. For most JSON/CSV files, the end-to-end sync to Databricks completes in seconds.
  • No Code, No Spark Jobs - All of this is managed from a simple UI (or YAML if you prefer infrastructure as code). No need to write or deploy Spark jobs, configure Auto Loader, or orchestrate workflows via Databricks notebooks.
  • Scales with You - Need to ingest dozens of files per minute or sync multiple containers in parallel? Estuary handles that out of the box. You can scale up data movement without reconfiguring infrastructure or overwhelming your data team.

By pairing Azure Blob Storage as a scalable, low-cost storage layer with Databricks as your processing and analytics engine—and letting Estuary manage the glue—you get a fast, reliable, and future-proof data pipeline.

How to Connect Azure Blob Storage to Databricks with Estuary (Step-by-Step)

Estuary Flow makes it easy to build a real-time pipeline from Azure Blob Storage to Databricks. No need for batch jobs, Spark code, or cron schedules — just point, configure, and stream.

Let’s walk through the process:

Step 1: Set Up Azure Access

Before using the connector, you'll need the following from your Azure account:

  • Subscription ID
  • Storage Account Name
  • Client ID & Secret
  • Tenant ID

To generate these:

  1. Log in to your Azure Portal and navigate to Microsoft Entra ID (formerly Azure Active Directory) > App registrations.
  2. Click New registration and create a service principal for Estuary access.
  3. Once registered, note down the Application (Client) ID and Directory (Tenant) ID.
  4. Go to Certificates & Secrets, generate a new client secret, and save it.
  5. Visit Subscriptions, select the one that owns your storage account, and copy the Subscription ID.
  6. Make sure the app registration has read access to your Blob containers (for example, the Storage Blob Data Reader role on the storage account); you can verify this with the optional check below.
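
Before moving on, you can optionally confirm that the service principal can list and read blobs. The snippet below is a quick sanity check using the Azure SDK for Python (azure-identity and azure-storage-blob); the container name and credential values are placeholders:

    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    # Authenticate as the app registration created above.
    credential = ClientSecretCredential(
        tenant_id="<tenant-id>",
        client_id="<application-client-id>",
        client_secret="<client-secret>",
    )

    service = BlobServiceClient(
        account_url="https://<storage-account>.blob.core.windows.net",
        credential=credential,
    )

    # If this lists your files without an authorization error, Estuary will be able to read them too.
    for blob in service.get_container_client("<container>").list_blobs():
        print(blob.name)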

Step 2: Capture Data from Azure Blob Storage in Estuary

Azure Blob Storage Capture Connector in Estuary
  1. Log in at dashboard.estuary.dev. If you don’t have an account, sign up for free — no credit card needed.
  2. Go to the Sources tab and click + New Source.
  3. Choose Azure Blob Storage from the connector list.
  4. Provide your credentials:
    • Storage account name
    • Client ID, Client Secret
    • Tenant ID
    • Subscription ID
  5. (Optional) Specify the container name you want to read from.
  6. (Optional) Add a file filter using regex (e.g. .*\.json) to capture only specific file types.
  7. Set parser settings under the Parser Configuration section to define how Estuary should interpret your files. This defaults to “auto” to let Estuary determine file type and compression settings from your existing files.
    • Format: CSV, JSON, Avro, etc.
    • Compression: zip, gzip, etc.
    • Optional: custom delimiters, headers, encoding for CSV files
  8. Save and activate the capture. Estuary now watches your Blob container and streams new files as they arrive.

Step 3: Materialize into Databricks Delta Tables

Databricks Materialization Connector in Estuary
  1. In the dashboard, go to Destinations and click + New Materialization.
  2. Choose Databricks from the list.
  3. Enter your configuration:
    • Warehouse Address and HTTP Path
    • Catalog Name
    • Personal Access Token
  4. Select the collection created by the Blob capture.
  5. Click Save & Publish to activate your end-to-end pipeline.
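
Once the materialization is live, you can confirm that data is landing by querying the Delta table from a Databricks notebook or SQL warehouse. The table name below is a placeholder for whatever you selected during materialization setup:

    # Row count as a quick freshness check.
    spark.sql("SELECT COUNT(*) AS row_count FROM main.estuary.blob_events").show()

    # Delta history: each write to the table appears as a new version you can time travel to.
    spark.sql("DESCRIBE HISTORY main.estuary.blob_events").show(truncate=False)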

Step 4: Monitor and Scale

  • Track pipeline status, throughput, and latency in the Estuary UI.
  • Add schema validations or transformations as needed using SQL or TypeScript.
  • Scale to more containers, file types, or destinations — Estuary handles parallel streams with ease.

Want to see the setup in action? Watch how to materialize real-time data into Databricks using Estuary Flow.

Note: Not all Blob Storage data is “streaming” in the traditional sense. For example, daily CSV exports or Avro snapshots may arrive on a schedule. Estuary still picks these up automatically and syncs them as soon as they appear. While the ingestion is real-time, the pipeline’s freshness depends on when files are written.

Conclusion: From Blob to Insights, Faster

Connecting Azure Blob Storage to Databricks doesn’t have to be complex. While traditional methods rely on scripts, manual parsing, and batch jobs that delay your insights, Estuary Flow gives you a faster, real-time alternative.

With Flow, you can continuously stream data from your blob containers into Databricks Delta tables, automatically handling formats, schema evolution, and delivery — no pipelines to maintain, no Spark jobs to debug.

Whether you’re ingesting JSON logs, CSV exports, or compressed snapshots, you can go from raw files in storage to query-ready Delta tables in minutes.

Ready to simplify your Azure-to-Databricks data flow? Estuary makes it easy.

Next Steps: Start Streaming from Azure Blob Storage to Databricks

If you're ready to modernize your data ingestion from Azure Blob Storage, Estuary Flow is the fastest way to get there, without Spark jobs, orchestration, or pipeline maintenance.

  • Try Estuary Flow Free
    Set up your first Azure Blob to Databricks pipeline in minutes. Get started — no credit card required.
  • Read the Documentation
    Learn more about connector setup, schema parsing, and file format support. View docs
  • Join Our Slack Community
    Connect with data engineers and get support directly from the Estuary team. Join Slack
  • Talk to an Expert
    Need help with complex formats, private networking, or scale planning? Contact us

About the author

Team Estuary, Estuary Editorial Team

Team Estuary is a group of engineers, product experts, and data strategists building the future of real-time and batch data integration. We write to share technical insights, industry trends, and practical guides.
