
How to Load Data into Databricks in Real Time with Estuary Flow

Learn how to load data into Databricks SQL Warehouse using Estuary Flow. This step-by-step guide covers real-time CDC, Delta updates, Unity Catalog, cost optimization, and production-ready best practices — no code required.


Loading data into Databricks is no longer just about batch jobs and CSV uploads. In today’s fast-moving landscape, data teams need real-time visibility, scalable architecture, and integrations that just work, without endless pipelines to maintain.

Whether you're migrating from legacy systems or looking to stream operational data into Databricks, the way you move data can make or break downstream analytics and AI initiatives. Traditional ETL tools often fall short on speed and flexibility.

That’s where modern platforms like Estuary Flow come in — offering a no-code, real-time way to integrate databases, SaaS sources, and more directly into Databricks SQL Warehouses with full support for schema evolution, delta updates, and Unity Catalog.

In this guide, you'll learn how to quickly and reliably load data into Databricks using Estuary Flow — while keeping your team agile and your data fresh.

Why Choose Databricks for Real-Time Analytics

Databricks isn’t just another data warehouse — it’s an all-in-one platform built for massive scale, collaborative data work, and serious analytics. It’s where teams go when they need more than just storage — they need performance, governance, and flexibility.

When you load data into Databricks, you unlock:

  • Databricks SQL Warehouse: Ideal for fast, scalable analytics with the familiarity of SQL.
  • Delta Lake: Ensures ACID transactions and versioned data, so your pipelines don’t break every time something changes.
  • Unity Catalog: Gives you centralized control over data governance, lineage, and discovery — critical for regulated industries and growing teams.

But Databricks shines brightest when paired with real-time data. Whether you're building machine learning models, streaming dashboards, or operational analytics, your outcomes are only as good as the data you feed in.

That’s why the ability to stream data into Databricks — not just batch-load it — is a game-changer.

Overview of Estuary Flow

Estuary Flow is a real-time data integration platform designed to make complex pipelines simple. It lets you stream, move, and sync data across systems — from operational databases and SaaS tools to warehouses like Databricks SQL — without writing custom code or managing brittle ETL jobs.

At its core, Flow is built on cloud-native streaming infrastructure and supports Change Data Capture (CDC) out of the box. That means instead of repeatedly extracting entire datasets, it watches for what's changed and streams only those changes downstream — in seconds, not hours.

Key features:

  • No-code + CLI workflows: Use a visual UI or YAML-based config, depending on your team’s style.
  • Connectors for 200+ sources and destinations: Including PostgreSQL, MySQL, MongoDB, Kafka, S3, Snowflake, BigQuery, Salesforce, HubSpot, and more.
  • Delta updates: For destinations like Databricks, you can choose between merge-based updates or efficient append-only streaming (Delta).
  • Schema evolution support: Flow tracks and applies schema changes intelligently to keep source and destination aligned.
  • Backfill + real-time in one pipeline: Start with a snapshot of your current data, then stream changes as they happen — no need to manage separate tools or pipelines.

With Estuary Flow, you’re not building a dozen one-off scripts. You’re defining reusable dataflows that adapt to your infrastructure and grow with your data needs.

Step-by-Step Guide: Loading Data into Databricks

Architecture diagram of a PostgreSQL to Databricks pipeline using Estuary

In this walkthrough, we’ll demonstrate how to stream data from a PostgreSQL database into Databricks SQL Warehouse using Estuary Flow. This setup supports both historical backfill and continuous real-time sync, without writing a single line of code.

We’ll use two example tables (users and transactions) from a PostgreSQL source and materialize them as Delta Lake tables in Databricks.

Prerequisites

Before we start, make sure you have the following ready:

In Databricks:

  • Databricks account with Unity Catalog enabled.
  • SQL Warehouse configured and running.
    • You can find its connection details under the Connection Details tab (includes hostname and HTTP path).
  • A schema within Unity Catalog to store the new tables.
  • Personal Access Token (PAT) or Service Principal with access to the warehouse and schema.
    • If using a Service Principal, ensure it’s part of the admins group.
    • Create the token using the Databricks CLI:
plaintext
databricks token-management create-obo-token <application-id>

In Estuary Flow:

  • An Estuary Flow account (registration is free).

Step 1: Capture Data from PostgreSQL

  1. Log into the Estuary Flow web app.
  2. Create a new Capture, and select the PostgreSQL connector.
Selection of PostgreSQL capture connector options in Estuary
  3. Enter connection details:
    • Host, port, database, user, and password
    • Choose whether to connect via SSH forwarding or directly
  4. Enable Auto-Discovery (optional) to detect all tables, or select just the ones you need (users and transactions in this case).
  5. Once connected, Flow will automatically generate collections for each selected table — these are internal representations of your streaming datasets.

At this point, Flow begins a backfill to capture the current state of each table and will continue to watch for real-time changes via PostgreSQL’s replication logs.
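
CDC from PostgreSQL assumes the database is set up for logical replication. The statements below are a minimal sketch of that prerequisite setup; the flow_capture role and flow_publication publication names are illustrative, so match them to whatever your capture configuration actually uses:
plaintext
-- Enable logical replication (requires a restart; on managed services such
-- as RDS or Cloud SQL, change this via the instance's parameter settings).
ALTER SYSTEM SET wal_level = logical;

-- Create a dedicated role the capture connector can log in with.
CREATE USER flow_capture WITH REPLICATION PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO flow_capture;

-- Publish the tables you want Flow to capture.
CREATE PUBLICATION flow_publication FOR TABLE users, transactions;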

Step 2: Set Up Databricks Materialization

  1. From the Flow UI, go to the Collections view and click Materialize on either of the PostgreSQL collections.
  2. Select the Databricks connector from the list.
Create a Databricks materialization connector in Estuary
  3. On the configuration screen, fill in the required fields for the Databricks connector:
    • Address: SQL Warehouse hostname (e.g., dbc-123456.cloud.databricks.com)
    • HTTP Path: HTTP path of the SQL Warehouse (e.g., /sql/1.0/warehouses/abcd1234)
    • Catalog Name: your Unity Catalog (e.g., main)
    • Schema Name: target schema for the tables (e.g., default)
    • Auth Type: set to PAT
    • Personal Access Token: paste your token here
  4. Add both collections (users and transactions) as bindings in the same materialization. For each:
    • Specify the table name you want created in Databricks.
    • Optionally, set delta_updates: true for append-only performance.

Example binding configuration in YAML:

plaintext
bindings:
  - resource:
      table: users
      schema: default
      delta_updates: false
    source: mynamespace/users
  - resource:
      table: transactions
      schema: default
      delta_updates: true
    source: mynamespace/transactions
  5. Click Next, and Flow will provision the connector and initialize the sync process.
Migrate Data to Databricks

Step 3: Monitor Sync and Verify in Databricks

  • Flow will begin streaming data into your Databricks warehouse.
  • You can monitor logs in the UI to confirm successful writes or check for errors.
  • In your Databricks workspace:
    • Navigate to Data > Catalog > Schema
    • You’ll see new tables users and transactions
    • Run SELECT * FROM default.users to view the latest data

Changes made in your PostgreSQL database — inserts, updates, deletes — will now stream into Databricks in near real-time, depending on your configured sync schedule.
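
To sanity-check the pipeline from the Databricks side, a couple of quick queries (using the example schema and table names from this walkthrough) are usually enough:
plaintext
-- Confirm the backfill landed and new rows keep arriving.
SELECT COUNT(*) AS row_count FROM default.users;

-- Spot-check recent records in the second table.
SELECT * FROM default.transactions LIMIT 10;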

Want to see it in action? In this short tutorial, we walk through how to stream data into Databricks using Estuary Flow. 

Next Steps

Now that your tables are live, you can:

  • Build dashboards on top of them using Databricks SQL
  • Feed fresh data to ML models or notebooks
  • Trigger downstream workflows automatically

Best Practices for Production Deployments

Once your data is streaming into Databricks, it’s important to go beyond just “it works.” Real-world pipelines face challenges like cost spikes, schema drift, access issues, and scaling bottlenecks. Here are key best practices to help your team avoid surprises and keep things running smoothly in production.

1. Use Service Principals for Secure, Scalable Authentication

While Personal Access Tokens (PATs) on personal accounts work for quick testing, PATs associated with Service Principals are the recommended way to authenticate in production:

  • They allow for better access control and auditing.
  • You can rotate credentials safely without interrupting pipelines.
  • Ensure the principal is added to the admins group in Databricks.
  • Use the Databricks CLI to create the token securely.
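
Token creation aside, the service principal also needs privileges on the objects it will write to. The exact grants depend on your governance model, but a minimal sketch using the example catalog and schema from this guide (run by a catalog admin, with the principal referenced by its application ID) might look like this:
plaintext
GRANT USE CATALOG ON CATALOG main TO `<application-id>`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.default TO `<application-id>`;
GRANT SELECT, MODIFY ON SCHEMA main.default TO `<application-id>`;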

2. Schedule Syncs Strategically to Optimize Costs

Flow lets you control how frequently it pushes updates into Databricks. By default, syncs are delayed by 30 minutes, which works well for most analytics use cases.

To save on Databricks compute:

  • Set Auto-Stop on your SQL Warehouse to the lowest idle timeout.
  • Adjust the sync frequency in Flow to balance freshness with cost.
  • Consider syncing during business hours only, if that fits your workload.
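
If you manage warehouses as code rather than through the UI, the Auto-Stop timeout from the first bullet above can also be set programmatically. The sketch below assumes the SQL Warehouses REST API's edit endpoint and auto_stop_mins field; verify both against your workspace's API reference before relying on it:
plaintext
# Set the warehouse to stop after 10 idle minutes (placeholder host and ID).
curl -X POST \
  "https://<workspace-host>/api/2.0/sql/warehouses/<warehouse-id>/edit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"auto_stop_mins": 10}'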

3. Use Delta Updates for Append-Only Workloads

If your data source emits only new rows (e.g., event logs, IoT data), enabling delta_updates can reduce load on Databricks and improve ingestion speed.

However:

  • Avoid delta updates for collections that require deduplication or upserts.
  • Tables using delta updates won’t be “reduced” — every event is appended as-is.

Use them when:

  • You have a reliable primary key and no late-arriving data.
  • You prefer immutability for auditing or replay purposes.

4. Watch Out for Reserved Words in Table and Column Names

Databricks reserves certain SQL keywords (like JOIN, USING, DEFAULT, etc.). If your source data has columns named with these terms:

  • Flow will automatically quote them to avoid syntax errors.
  • You must reference them with backticks in your SQL queries.
  • Tip: Rename fields in your source schema where possible to avoid this.

5. Plan for Schema Evolution

Estuary Flow supports schema changes, but some changes are safer than others:

  • Safe: Adding new fields, widening field types (e.g., int → float), adding nullable fields.
  • Needs caution: Removing fields, renaming fields, or changing key fields.
  • Pro tip: Use Flow’s version control and preview features before applying schema changes live.
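
To make the distinction concrete, here is a hedged sketch of a safe, additive change to a Flow collection schema: a new nullable field is added without touching the key (the collection name and fields are illustrative):
plaintext
collections:
  mynamespace/users:
    key: [/id]
    schema:
      type: object
      required: [id]
      properties:
        id: { type: integer }
        email: { type: string }
        # Newly added and nullable, so existing documents and the
        # materialized Databricks table stay valid without it.
        loyalty_tier: { type: [string, "null"] }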

6. Monitor Pipeline Health Proactively

  • Use Flow’s pipeline logs and status dashboard to track activity.
  • Integrate with monitoring tools via OpenMetrics (Prometheus, Datadog, etc.).
  • Set alerts for failed materializations, schema conflicts, or authentication issues.
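
If you scrape metrics with Prometheus, the job definition is standard OpenMetrics plumbing; only the endpoint and credentials are specific to your Estuary tenant. Everything in angle brackets below is a placeholder, so take the real host, path, and token from Estuary's monitoring documentation:
plaintext
scrape_configs:
  - job_name: estuary-flow
    scheme: https
    metrics_path: <openmetrics-path>        # placeholder
    authorization:
      credentials: <estuary-access-token>   # placeholder
    static_configs:
      - targets: ['<openmetrics-host>']     # placeholder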

By following these best practices, you’ll have a production-ready pipeline that’s fast, cost-efficient, and resilient to change — exactly what modern data teams need when working with Databricks.

Common Pitfalls & How to Avoid Them

Even with a smooth no-code interface and robust connector, production pipelines can still hit snags. Here are the most common issues teams run into when loading data into Databricks via Estuary Flow — and how to fix them before they become blockers.

1. Incorrect Databricks Credentials or Permissions

Problem:
The connector fails to authenticate or materialize due to missing or invalid credentials.

Fixes:

  • Double-check your Personal Access Token (PAT) or Service Principal token.
  • Ensure the user or principal has access to the target catalog, schema, and warehouse.
  • If using a service principal, it must belong to the admins group in Databricks.
  • Use the Databricks CLI to validate token creation:
plaintext
databricks token-management create-obo-token <application_id>

2. Forgetting to Enable Auto-Stop on SQL Warehouse

Problem:
The SQL Warehouse runs indefinitely, racking up unnecessary compute costs.

Fixes:

  • Set the Auto Stop timeout in Databricks to the minimum (e.g., 10 minutes).
  • Align this with your Flow sync schedule to avoid idle billing.

3. Using Delta Updates When You Shouldn’t

Problem:
Delta updates are enabled on collections that include updates or deletes, resulting in bloated tables or incorrect state.

Fixes:

  • Only use delta_updates: true if:
    • Your data is append-only
    • You don’t need upserts or deletes
  • For transactional data (e.g., orders or inventory), use standard merge updates.

4. Reserved SQL Keywords Breaking Queries

Problem:
Databricks rejects queries due to unescaped field or table names that clash with reserved SQL keywords (e.g., JOIN, LEFT, DEFAULT).

Fixes:

  • Flow automatically quotes these identifiers when writing to Databricks.
  • In your Databricks queries, reference them with backticks (Databricks SQL's identifier delimiter):
plaintext
SELECT `default` FROM transactions;
  • Optionally rename problematic fields at the source or during transformation.

5. Schema Drift Causing Pipeline Failures

Problem:
A field is removed or renamed in the source system, breaking downstream sync.

Fixes:

  • Use Flow’s schema evolution support with caution.
  • Avoid destructive changes (e.g., removing fields or changing key types) in production pipelines.
  • Use preview mode and version control for schema specs before applying changes live.

6. Assuming the First Sync is Instant

Problem:
Expecting near-real-time data immediately after provisioning, but the first sync takes time.

Why:
The initial backfill loads the full history of the source tables. This is expected behavior and may take minutes or hours depending on volume.

Fix:

  • Monitor progress in the Flow UI
  • Once the backfill completes, real-time CDC kicks in automatically

By addressing these common issues early, your team will spend less time debugging and more time building with fresh, reliable data in Databricks.

Real-World Use Cases

Loading data into Databricks is more than just a technical exercise — it's a strategic move that unlocks real-time decision-making, AI capabilities, and cross-functional collaboration. Here’s how teams are using Estuary Flow + Databricks across industries:

1. Real-Time Customer Dashboards

Use case:
A SaaS company syncs user activity from PostgreSQL and product analytics from Mixpanel into Databricks.

Why it matters:
With Flow streaming data in real time, their support, marketing, and product teams all use live dashboards — no more waiting for the next ETL run.

2. Continuous ML Feature Engineering

Use case:
A fintech company streams transaction events from Kafka into Databricks to power fraud detection models.

Why it matters:
Flow keeps the training dataset fresh with near-zero lag, and Databricks notebooks use Delta Lake to version and monitor features.

3. Marketing and Campaign Reporting

Use case:
An e-commerce business loads campaign data from Meta Ads, Google Ads, and Shopify orders into Databricks.

Why it matters:
Flow combines disparate data sources in a central warehouse, enabling unified ROI reporting across platforms — all automated and real-time.

4. Supply Chain Visibility

Use case:
A logistics provider syncs inventory updates from an on-premise SQL Server and IoT devices into Databricks for real-time ETA tracking.

Why it matters:
Flow handles both CDC from legacy databases and event streaming from Kafka, allowing for proactive alerts and route optimization.

5. Internal Tools and Operational Reporting

Use case:
A fast-scaling startup uses Flow to stream product and billing data into Databricks and materializes it for Looker dashboards.

Why it matters:
Non-engineering teams get real-time access to metrics without needing help from data engineering, reducing bottlenecks and boosting agility.

From high-frequency finance to fast-paced retail, the combination of Estuary Flow and Databricks makes real-time data not only possible, but practical.

Conclusion: Real-Time Data, Simplified

Loading data into Databricks doesn’t have to mean building fragile pipelines or waiting on slow ETL jobs. With Estuary Flow, you can stream, sync, or migrate data into Databricks SQL Warehouse in real time — all with the flexibility to handle schema changes, optimize costs, and support enterprise-scale governance.

Whether you're working with databases like PostgreSQL and MySQL, streaming platforms like Kafka, or SaaS apps like Salesforce or Shopify, Flow gives you a unified way to move data into Databricks with confidence.

From initial backfill to continuous sync, from development to production, Estuary Flow helps your team stay focused on what matters: making data useful.

Get Started

Let your data flow — and make Databricks truly real-time.

Start streaming your data for free
