
How to Load Data into Databricks in Real Time with Estuary Flow

Learn how to load data into Databricks SQL Warehouse using Estuary Flow. This step-by-step guide covers real-time CDC, Delta updates, Unity Catalog, cost optimization, and production-ready best practices — no code required.


Loading data into Databricks is no longer just about batch jobs and CSV uploads. In today’s fast-moving landscape, data teams need real-time visibility, scalable architecture, and integrations that just work, without endless pipelines to maintain.

Whether you're migrating from legacy systems or looking to stream operational data into Databricks, the way you move data can make or break downstream analytics and AI initiatives. Traditional ETL tools often fall short on speed and flexibility.

That’s where modern platforms like Estuary Flow come in — offering a no-code, real-time way to integrate databases, SaaS sources, and more directly into Databricks SQL Warehouses with full support for schema evolution, delta updates, and Unity Catalog.

In this guide, you'll learn how to quickly and reliably load data into Databricks using Estuary Flow — while keeping your team agile and your data fresh.

Why Choose Databricks for Real-Time Analytics

Databricks isn’t just another data warehouse — it’s an all-in-one platform built for massive scale, collaborative data work, and serious analytics. It’s where teams go when they need more than just storage — they need performance, governance, and flexibility.

When you load data into Databricks, you unlock:

  • Databricks SQL Warehouse: Ideal for fast, scalable analytics with the familiarity of SQL.
  • Delta Lake: Ensures ACID transactions and versioned data, so your pipelines don’t break every time something changes.
  • Unity Catalog: Gives you centralized control over data governance, lineage, and discovery — critical for regulated industries and growing teams.

But Databricks shines brightest when paired with real-time data. Whether you're building machine learning models, streaming dashboards, or operational analytics, your outcomes are only as good as the data you feed in.

That’s why the ability to stream data into Databricks — not just batch-load it — is a game-changer.

Overview of Estuary Flow

Estuary Flow is a real-time data integration platform designed to make complex pipelines simple. It lets you stream, move, and sync data across systems — from operational databases and SaaS tools to warehouses like Databricks SQL — without writing custom code or managing brittle ETL jobs.

At its core, Flow is built on cloud-native streaming infrastructure and supports Change Data Capture (CDC) out of the box. That means instead of repeatedly extracting entire datasets, it watches for what's changed and streams only those changes downstream — in seconds, not hours.

Key features:

  • No-code + CLI workflows: Use a visual UI or YAML-based config, depending on your team’s style.
  • Connectors for 200+ sources and destinations: Including PostgreSQL, MySQL, MongoDB, Kafka, S3, Snowflake, BigQuery, Salesforce, HubSpot, and more.
  • Delta updates: For destinations like Databricks, you can choose between merge-based updates or efficient append-only streaming (Delta).
  • Schema evolution support: Flow tracks and applies schema changes intelligently to keep source and destination aligned.
  • Backfill + real-time in one pipeline: Start with a snapshot of your current data, then stream changes as they happen — no need to manage separate tools or pipelines.

With Estuary Flow, you’re not building a dozen one-off scripts. You’re defining reusable dataflows that adapt to your infrastructure and grow with your data needs.

Step-by-Step Guide: Loading Data into Databricks

Architecture diagram of a PostgreSQL to Databricks pipeline using Estuary

In this walkthrough, we’ll demonstrate how to stream data from a PostgreSQL database into Databricks SQL Warehouse using Estuary Flow. This setup supports both historical backfill and continuous real-time sync, without writing a single line of code.

We’ll use two example tables (users and transactions) from a PostgreSQL source and materialize them as Delta Lake tables in Databricks.

Prerequisites

Before we start, make sure you have the following ready:

In Databricks:

  • Databricks account with Unity Catalog enabled.
  • SQL Warehouse configured and running.
    • You can find its connection details under the Connection Details tab (includes hostname and HTTP path).
  • A schema within Unity Catalog to store the new tables.
  • Personal Access Token (PAT) or Service Principal with access to the warehouse and schema.
    • If using a Service Principal, ensure it’s part of the admins group.
    • Create the token using the Databricks CLI:
plaintext
databricks token-management create-obo-token <application-id>

In Estuary Flow:

  • An Estuary Flow account (registration is free).

Step 1: Capture Data from PostgreSQL

  1. Log into the Estuary Flow web app.
  2. Create a new Capture, and select the PostgreSQL connector.
Selection of PostgreSQL capture connector options in Estuary
  3. Enter connection details:
    • Host, port, database, user, and password
    • Choose whether to connect via SSH forwarding or directly
  4. Enable Auto-Discovery (optional) to detect all tables, or select just the ones you need (users and transactions in this case).
  5. Once connected, Flow will automatically generate collections for each selected table — these are internal representations of your streaming datasets.

At this point, Flow begins a backfill to capture the current state of each table and will continue to watch for real-time changes via PostgreSQL’s replication logs.
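
CDC from PostgreSQL assumes the database is set up for logical replication. The statements below are a minimal sketch of that prerequisite setup; the flow_capture role and flow_publication publication names are illustrative, so match them to whatever your capture configuration actually uses:
plaintext
-- Enable logical replication (requires a restart; on managed services such
-- as RDS or Cloud SQL, change this via the instance's parameter settings).
ALTER SYSTEM SET wal_level = logical;

-- Create a dedicated role the capture connector can log in with.
CREATE USER flow_capture WITH REPLICATION PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO flow_capture;

-- Publish the tables you want Flow to capture.
CREATE PUBLICATION flow_publication FOR TABLE users, transactions;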

Step 2: Set Up Databricks Materialization

  1. From the Flow UI, go to the Collections view and click Materialize on either of the PostgreSQL collections.
  2. Select the Databricks connector from the list.
Create a Databricks materialization connector in Estuary
  3. On the configuration screen, fill in the required fields for the Databricks connector:
    • Address: SQL Warehouse hostname (e.g., dbc-123456.cloud.databricks.com)
    • HTTP Path: HTTP path of the SQL Warehouse (e.g., /sql/1.0/warehouses/abcd1234)
    • Catalog Name: your Unity Catalog (e.g., main)
    • Schema Name: target schema for the tables (e.g., default)
    • Auth Type: set to PAT
    • Personal Access Token: paste your token here
  4. Add both collections (users and transactions) as bindings in the same materialization. For each:
    • Specify the table name you want created in Databricks.
    • Optionally, set delta_updates: true for append-only performance.

Example binding configuration in YAML:

plaintext
bindings:
  - resource:
      table: users
      schema: default
      delta_updates: false
    source: mynamespace/users
  - resource:
      table: transactions
      schema: default
      delta_updates: true
    source: mynamespace/transactions
  5. Click Next, and Flow will provision the connector and initialize the sync process.
Migrate Data to Databricks

Step 3: Monitor Sync and Verify in Databricks

  • Flow will begin streaming data into your Databricks warehouse.
  • You can monitor logs in the UI to confirm successful writes or check for errors.
  • In your Databricks workspace:
    • Navigate to Data > Catalog > Schema
    • You’ll see new tables users and transactions
    • Run SELECT * FROM default.users to view the latest data

Changes made in your PostgreSQL database — inserts, updates, deletes — will now stream into Databricks in near real-time, depending on your configured sync schedule.
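
To sanity-check the pipeline from the Databricks side, a couple of quick queries (using the example schema and table names from this walkthrough) are usually enough:
plaintext
-- Confirm the backfill landed and new rows keep arriving.
SELECT COUNT(*) AS row_count FROM default.users;

-- Spot-check recent records in the second table.
SELECT * FROM default.transactions LIMIT 10;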

Want to see it in action? In this short tutorial, we walk through how to stream data into Databricks using Estuary Flow. 

Next Steps

Now that your tables are live, you can:

  • Build dashboards on top of them using Databricks SQL
  • Feed fresh data to ML models or notebooks
  • Trigger downstream workflows automatically

Best Practices for Production Deployments

Once your data is streaming into Databricks, it’s important to go beyond just “it works.” Real-world pipelines face challenges like cost spikes, schema drift, access issues, and scaling bottlenecks. Here are key best practices to help your team avoid surprises and keep things running smoothly in production.

1. Use Service Principals for Secure, Scalable Authentication

While Personal Access Tokens (PATs) on personal accounts work for quick testing, PATs associated with Service Principals are the recommended way to authenticate in production:

  • They allow for better access control and auditing.
  • You can rotate credentials safely without interrupting pipelines.
  • Ensure the principal is added to the admins group in Databricks.
  • Use the Databricks CLI to create the token securely.
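
Token creation aside, the service principal also needs privileges on the objects it will write to. The exact grants depend on your governance model, but a minimal sketch using the example catalog and schema from this guide (run by a catalog admin, with the principal referenced by its application ID) might look like this:
plaintext
GRANT USE CATALOG ON CATALOG main TO `<application-id>`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.default TO `<application-id>`;
GRANT SELECT, MODIFY ON SCHEMA main.default TO `<application-id>`;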

2. Schedule Syncs Strategically to Optimize Costs

Flow lets you control how frequently it pushes updates into Databricks. By default, syncs are delayed by 30 minutes, which works well for most analytics use cases.

To save on Databricks compute:

  • Set Auto-Stop on your SQL Warehouse to the lowest idle timeout.
  • Adjust the sync frequency in Flow to balance freshness with cost.
  • Consider syncing during business hours only, if that fits your workload.
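
If you manage warehouses as code rather than through the UI, the Auto-Stop timeout from the first bullet above can also be set programmatically. The sketch below assumes the SQL Warehouses REST API's edit endpoint and auto_stop_mins field; verify both against your workspace's API reference before relying on it:
plaintext
# Set the warehouse to stop after 10 idle minutes (placeholder host and ID).
curl -X POST \
  "https://<workspace-host>/api/2.0/sql/warehouses/<warehouse-id>/edit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"auto_stop_mins": 10}'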

3. Use Delta Updates for Append-Only Workloads

If your data source emits only new rows (e.g., event logs, IoT data), enabling delta_updates can reduce load on Databricks and improve ingestion speed.

However:

  • Avoid delta updates for collections that require deduplication or upserts.
  • Tables using delta updates won’t be “reduced” — every event is appended as-is.

Use them when:

  • You have a reliable primary key and no late-arriving data.
  • You prefer immutability for auditing or replay purposes.

4. Watch Out for Reserved Words in Table and Column Names

Databricks reserves certain SQL keywords (like JOIN, USING, DEFAULT, etc.). If your source data has columns named with these terms:

  • Flow will automatically quote them to avoid syntax errors.
  • You must reference them with backticks in your SQL queries.
  • Tip: Rename fields in your source schema where possible to avoid this.

5. Plan for Schema Evolution

Estuary Flow supports schema changes, but some changes are safer than others:

  • Safe: Adding new fields, widening field types (e.g., int → float), adding nullable fields.
  • Needs caution: Removing fields, renaming fields, or changing key fields.
  • Pro tip: Use Flow’s version control and preview features before applying schema changes live.
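
To make the distinction concrete, here is a hedged sketch of a safe, additive change to a Flow collection schema: a new nullable field is added without touching the key (the collection name and fields are illustrative):
plaintext
collections:
  mynamespace/users:
    key: [/id]
    schema:
      type: object
      required: [id]
      properties:
        id: { type: integer }
        email: { type: string }
        # Newly added and nullable, so existing documents and the
        # materialized Databricks table stay valid without it.
        loyalty_tier: { type: [string, "null"] }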

6. Monitor Pipeline Health Proactively

  • Use Flow’s pipeline logs and status dashboard to track activity.
  • Integrate with monitoring tools via OpenMetrics (Prometheus, Datadog, etc.).
  • Set alerts for failed materializations, schema conflicts, or authentication issues.
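
If you scrape metrics with Prometheus, the job definition is standard OpenMetrics plumbing; only the endpoint and credentials are specific to your Estuary tenant. Everything in angle brackets below is a placeholder, so take the real host, path, and token from Estuary's monitoring documentation:
plaintext
scrape_configs:
  - job_name: estuary-flow
    scheme: https
    metrics_path: <openmetrics-path>        # placeholder
    authorization:
      credentials: <estuary-access-token>   # placeholder
    static_configs:
      - targets: ['<openmetrics-host>']     # placeholder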

By following these best practices, you’ll have a production-ready pipeline that’s fast, cost-efficient, and resilient to change — exactly what modern data teams need when working with Databricks.

Common Pitfalls & How to Avoid Them

Even with a smooth no-code interface and robust connector, production pipelines can still hit snags. Here are the most common issues teams run into when loading data into Databricks via Estuary Flow — and how to fix them before they become blockers.

1. Incorrect Databricks Credentials or Permissions

Problem:
The connector fails to authenticate or materialize due to missing or invalid credentials.

Fixes:

  • Double-check your Personal Access Token (PAT) or Service Principal token.
  • Ensure the user or principal has access to the target catalog, schema, and warehouse.
  • If using a service principal, it must belong to the admins group in Databricks.
  • Use the Databricks CLI to validate token creation:
plaintext
databricks token-management create-obo-token <application_id>

2. Forgetting to Enable Auto-Stop on SQL Warehouse

Problem:
The SQL Warehouse runs indefinitely, racking up unnecessary compute costs.

Fixes:

  • Set the Auto Stop timeout in Databricks to the minimum (e.g., 10 minutes).
  • Align this with your Flow sync schedule to avoid idle billing.

3. Using Delta Updates When You Shouldn’t

Problem:
Delta updates are enabled on collections that include updates or deletes, resulting in bloated tables or incorrect state.

Fixes:

  • Only use delta_updates: true if:
    • Your data is append-only
    • You don’t need upserts or deletes
  • For transactional data (e.g., orders or inventory), use standard merge updates.

4. Reserved SQL Keywords Breaking Queries

Problem:
Databricks rejects queries due to unescaped field or table names that clash with reserved SQL keywords (e.g., JOIN, LEFT, DEFAULT).

Fixes:

  • Flow automatically quotes these identifiers when writing to Databricks.
  • In your Databricks queries, reference them with backticks (Databricks SQL's identifier delimiter):
plaintext
SELECT `default` FROM transactions;
  • Optionally rename problematic fields at the source or during transformation.

5. Schema Drift Causing Pipeline Failures

Problem:
A field is removed or renamed in the source system, breaking downstream sync.

Fixes:

  • Use Flow’s schema evolution support with caution.
  • Avoid destructive changes (e.g., removing fields or changing key types) in production pipelines.
  • Use preview mode and version control for schema specs before applying changes live.

6. Assuming the First Sync is Instant

Problem:
Expecting near-real-time data immediately after provisioning, but the first sync takes time.

Why:
The initial backfill loads the full history of the source tables. This is expected behavior and may take minutes or hours depending on volume.

Fix:

  • Monitor progress in the Flow UI
  • Once the backfill completes, real-time CDC kicks in automatically

By addressing these common issues early, your team will spend less time debugging and more time building with fresh, reliable data in Databricks.

Real-World Use Cases

Loading data into Databricks is more than just a technical exercise — it's a strategic move that unlocks real-time decision-making, AI capabilities, and cross-functional collaboration. Here’s how teams are using Estuary Flow + Databricks across industries:

1. Real-Time Customer Dashboards

Use case:
A SaaS company syncs user activity from PostgreSQL and product analytics from Mixpanel into Databricks.

Why it matters:
With Flow streaming data in real time, their support, marketing, and product teams all use live dashboards — no more waiting for the next ETL run.

2. Continuous ML Feature Engineering

Use case:
A fintech company streams transaction events from Kafka into Databricks to power fraud detection models.

Why it matters:
Flow keeps the training dataset fresh with near-zero lag, and Databricks notebooks use Delta Lake to version and monitor features.

3. Marketing and Campaign Reporting

Use case:
An e-commerce business loads campaign data from Meta Ads, Google Ads, and Shopify orders into Databricks.

Why it matters:
Flow combines disparate data sources in a central warehouse, enabling unified ROI reporting across platforms — all automated and real-time.

4. Supply Chain Visibility

Use case:
A logistics provider syncs inventory updates from an on-premise SQL Server and IoT devices into Databricks for real-time ETA tracking.

Why it matters:
Flow handles both CDC from legacy databases and event streaming from Kafka, allowing for proactive alerts and route optimization.

5. Internal Tools and Operational Reporting

Use case:
A fast-scaling startup uses Flow to stream product and billing data into Databricks and materializes it for Looker dashboards.

Why it matters:
Non-engineering teams get real-time access to metrics without needing help from data engineering, reducing bottlenecks and boosting agility.

From high-frequency finance to fast-paced retail, the combination of Estuary Flow and Databricks makes real-time data not only possible, but practical.

Conclusion: Real-Time Data, Simplified

Loading data into Databricks doesn’t have to mean building fragile pipelines or waiting on slow ETL jobs. With Estuary Flow, you can stream, sync, or migrate data into Databricks SQL Warehouse in real time — all with the flexibility to handle schema changes, optimize costs, and support enterprise-scale governance.

Whether you're working with databases like PostgreSQL and MySQL, streaming platforms like Kafka, or SaaS apps like Salesforce or Shopify, Flow gives you a unified way to move data into Databricks with confidence.

From initial backfill to continuous sync, from development to production, Estuary Flow helps your team stay focused on what matters: making data useful.

Get Started

Let your data flow — and make Databricks truly real-time.

Start streaming your data for free
