
How to Load Data into Databricks in Real Time with Estuary Flow
Learn how to load data into Databricks SQL Warehouse using Estuary Flow. This step-by-step guide covers real-time CDC, Delta updates, Unity Catalog, cost optimization, and production-ready best practices — no code required.

Loading data into Databricks is no longer just about batch jobs and CSV uploads. In today’s fast-moving landscape, data teams need real-time visibility, scalable architecture, and integrations that just work, without endless pipelines to maintain.
Whether you're migrating from legacy systems or looking to stream operational data into Databricks, the way you move data can make or break downstream analytics and AI initiatives. Traditional ETL tools often fall short on speed and flexibility.
That’s where modern platforms like Estuary Flow come in — offering a no-code, real-time way to integrate databases, SaaS sources, and more directly into Databricks SQL Warehouses with full support for schema evolution, delta updates, and Unity Catalog.
In this guide, you'll learn how to quickly and reliably load data into Databricks using Estuary Flow — while keeping your team agile and your data fresh.
Why Choose Databricks for Real-Time Analytics
Databricks isn’t just another data warehouse — it’s an all-in-one platform built for massive scale, collaborative data work, and serious analytics. It’s where teams go when they need more than just storage — they need performance, governance, and flexibility.
When you load data into Databricks, you unlock:
- Databricks SQL Warehouse: Ideal for fast, scalable analytics with the familiarity of SQL.
- Delta Lake: Ensures ACID transactions and versioned data, so your pipelines don’t break every time something changes.
- Unity Catalog: Gives you centralized control over data governance, lineage, and discovery — critical for regulated industries and growing teams.
But Databricks shines brightest when paired with real-time data. Whether you're building machine learning models, streaming dashboards, or operational analytics, your outcomes are only as good as the data you feed in.
That’s why the ability to stream data into Databricks — not just batch-load it — is a game-changer.
Overview of Estuary Flow
Estuary Flow is a real-time data integration platform designed to make complex pipelines simple. It lets you stream, move, and sync data across systems — from operational databases and SaaS tools to warehouses like Databricks SQL — without writing custom code or managing brittle ETL jobs.
At its core, Flow is built on cloud-native streaming infrastructure and supports Change Data Capture (CDC) out of the box. That means instead of repeatedly extracting entire datasets, it watches for what's changed and streams only those changes downstream — in seconds, not hours.
Key features:
- No-code + CLI workflows: Use a visual UI or YAML-based config, depending on your team’s style.
- Connectors for 200+ sources and destinations: Including PostgreSQL, MySQL, MongoDB, Kafka, S3, Snowflake, BigQuery, Salesforce, HubSpot, and more.
- Delta updates: For destinations like Databricks, you can choose between merge-based updates or efficient append-only streaming (Delta).
- Schema evolution support: Flow tracks and applies schema changes intelligently to keep source and destination aligned.
- Backfill + real-time in one pipeline: Start with a snapshot of your current data, then stream changes as they happen — no need to manage separate tools or pipelines.
With Estuary Flow, you’re not building a dozen one-off scripts. You’re defining reusable dataflows that adapt to your infrastructure and grow with your data needs.
Step-by-Step Guide: Loading Data into Databricks
In this walkthrough, we’ll demonstrate how to stream data from a PostgreSQL database into Databricks SQL Warehouse using Estuary Flow. This setup supports both historical backfill and continuous real-time sync, without writing a single line of code.
We’ll use two example tables (`users` and `transactions`) from a PostgreSQL source and materialize them as Delta Lake tables in Databricks.
Prerequisites
Before we start, make sure you have the following ready:
In Databricks:
- A Databricks account with Unity Catalog enabled.
- A SQL Warehouse configured and running.
- You can find its connection details under the Connection Details tab (includes hostname and HTTP path).
- A schema within Unity Catalog to store the new tables.
- A Personal Access Token (PAT) or Service Principal with access to the warehouse and schema (example grants below).
  - If using a Service Principal, ensure it’s part of the `admins` group.
  - Create the token using the Databricks CLI:

```bash
databricks token-management create-obo-token <application-id>
```
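If you use a Service Principal, it also needs privileges on the catalog and schema Flow will write to (access to the warehouse itself is granted under the SQL Warehouse’s Permissions settings). Here is a minimal sketch, assuming the `main` catalog and `default` schema used later in this guide and a placeholder application ID:

```sql
-- Run as a user who can manage grants on the catalog.
-- The principal name below is a placeholder application ID.
GRANT USE CATALOG ON CATALOG main TO `00000000-0000-0000-0000-000000000000`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.default TO `00000000-0000-0000-0000-000000000000`;
GRANT SELECT, MODIFY ON SCHEMA main.default TO `00000000-0000-0000-0000-000000000000`;
```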
In Estuary Flow:
- A Flow account (sign up at dashboard.estuary.dev/register)
- Your source data prepared (tables in PostgreSQL)
Step 1: Capture Data from PostgreSQL
- Log into the Estuary Flow web app.
- Create a new Capture, and select the PostgreSQL connector.
- Enter connection details:
  - Host, port, database, user, and password
  - Whether to connect directly or via SSH forwarding
- Enable Auto-Discovery (optional) to detect all tables, or select just the ones you need (`users` and `transactions` in this case).
- Once connected, Flow will automatically generate collections for each selected table — these are internal representations of your streaming datasets.
At this point, Flow begins a backfill to capture the current state of each table and will continue to watch for real-time changes via PostgreSQL’s replication logs.
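Flow reads those changes through PostgreSQL logical replication, so the source database needs to be prepared for CDC before the capture can run. The exact requirements (including any helper objects the connector needs) are covered in Estuary’s PostgreSQL connector docs; the SQL below is a rough sketch of the typical setup, with `flow_capture` and `flow_publication` as assumed names:

```sql
-- Logical replication must be enabled (requires a restart; on managed
-- services this is usually a parameter-group or server setting).
ALTER SYSTEM SET wal_level = logical;

-- A dedicated capture role with replication rights and read access.
CREATE USER flow_capture WITH REPLICATION PASSWORD 'change-me';
GRANT SELECT ON public.users, public.transactions TO flow_capture;

-- Publish the tables whose changes should be streamed.
CREATE PUBLICATION flow_publication FOR TABLE public.users, public.transactions;
```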
Step 2: Set Up Databricks Materialization
- From the Flow UI, go to the Collections view and click Materialize on either of the PostgreSQL collections.
- Select the Databricks connector from the list.
- On the configuration screen, fill in the following fields:
| Field | Description |
| --- | --- |
| Address | SQL Warehouse hostname (e.g., `dbc-123456.cloud.databricks.com`) |
| HTTP Path | Path from the Databricks SQL Warehouse (e.g., `/sql/1.0/warehouses/abcd1234`) |
| Catalog Name | Your Unity Catalog (e.g., `main`) |
| Schema Name | Target schema for the tables (e.g., `default`) |
| Auth Type | Set to PAT |
| Personal Access Token | Paste your token here |
- Add both collections (`users` and `transactions`) as bindings in the same materialization. For each:
  - Specify the table name you want created in Databricks.
  - Optionally, set `delta_updates: true` for append-only performance.
Example binding configuration in YAML:
```yaml
bindings:
  - resource:
      table: users
      schema: default
      delta_updates: false
    source: mynamespace/users
  - resource:
      table: transactions
      schema: default
      delta_updates: true
    source: mynamespace/transactions
```
- Click Next, and Flow will provision the connector and initialize the sync process.
Step 3: Monitor Sync and Verify in Databricks
- Flow will begin streaming data into your Databricks warehouse.
- You can monitor logs in the UI to confirm successful writes or check for errors.
- In your Databricks workspace:
  - Navigate to Data > Catalog > Schema
  - You’ll see the new tables `users` and `transactions`
  - Run `SELECT * FROM default.users` to view the latest data
Changes made in your PostgreSQL database — inserts, updates, deletes — will now stream into Databricks in near real-time, depending on your configured sync schedule.
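A few quick checks in the Databricks SQL editor can confirm that the backfill landed and that ongoing changes are being written, assuming the `main` catalog and `default` schema from the configuration above:

```sql
-- Row counts after the initial backfill.
SELECT COUNT(*) AS user_rows FROM main.default.users;
SELECT COUNT(*) AS transaction_rows FROM main.default.transactions;

-- Confirm the tables are Delta tables and inspect recent write operations.
DESCRIBE DETAIL main.default.transactions;
DESCRIBE HISTORY main.default.transactions LIMIT 5;
```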
Want to see it in action? In this short tutorial, we walk through how to stream data into Databricks using Estuary Flow.
Next Steps
Now that your tables are live, you can:
- Build dashboards on top of them using Databricks SQL (see the example query after this list)
- Feed fresh data to ML models or notebooks
- Trigger downstream workflows automatically
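As a concrete starting point, a dashboard tile can query the materialized table directly. The `amount` and `created_at` columns below are hypothetical stand-ins for whatever your `transactions` schema actually contains:

```sql
-- Daily transaction volume over the last 30 days (hypothetical columns).
SELECT
  DATE_TRUNC('DAY', created_at) AS day,
  COUNT(*) AS transactions,
  SUM(amount) AS total_amount
FROM main.default.transactions
WHERE created_at >= current_date() - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY 1;
```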
Best Practices for Production Deployments
Once your data is streaming into Databricks, it’s important to go beyond just “it works.” Real-world pipelines face challenges like cost spikes, schema drift, access issues, and scaling bottlenecks. Here are key best practices to help your team avoid surprises and keep things running smoothly in production.
1. Use Service Principals for Secure, Scalable Authentication
While Personal Access Tokens (PATs) on personal accounts work for quick testing, PATs associated with Service Principals are the recommended way to authenticate in production:
- They allow for better access control and auditing.
- You can rotate credentials safely without interrupting pipelines.
- Ensure the principal is added to the `admins` group in Databricks.
- Use the Databricks CLI to create the token securely.
2. Schedule Syncs Strategically to Optimize Costs
Flow lets you control how frequently it pushes updates into Databricks. By default, syncs are delayed by 30 minutes, which works well for most analytics use cases.
To save on Databricks compute:
- Set Auto-Stop on your SQL Warehouse to the lowest idle timeout.
- Adjust the sync frequency in Flow to balance freshness with cost.
- Consider syncing during business hours only, if that fits your workload.
3. Use Delta Updates for Append-Only Workloads
If your data source emits only new rows (e.g., event logs, IoT data), enabling delta_updates can reduce load on Databricks and improve ingestion speed.
However:
- Avoid delta updates for collections that require deduplication or upserts.
- Tables using delta updates won’t be “reduced” — every event is appended as-is, so you may need to select the latest row per key at query time (see the example after this list).
Use them when:
- You have a reliable primary key and no late-arriving data.
- You prefer immutability for auditing or replay purposes.
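Because delta updates append every change event instead of merging it, a table fed this way can contain several rows per key. A common pattern is to pick the most recent row per key at query time; the sketch below assumes hypothetical `id` and `updated_at` columns on `transactions`:

```sql
-- Latest state per key from an append-only (delta updates) table.
-- `id` and `updated_at` are hypothetical; use your own key and ordering columns.
SELECT *
FROM main.default.transactions
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1;
```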
4. Watch Out for Reserved Words in Table and Column Names
Databricks reserves certain SQL keywords (like `JOIN`, `USING`, `DEFAULT`, etc.). If your source data has columns named with these terms:
- Flow will automatically quote them to avoid syntax errors.
- You must reference them as backtick-quoted identifiers in your SQL queries (see the example after this list).
- Tip: Rename fields in your source schema where possible to avoid this.
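For example, if a source table ends up with columns named `join` and `default` (hypothetical here), backticks keep Databricks from parsing them as keywords:

```sql
-- Backticks delimit identifiers in Databricks SQL.
SELECT `join`, `default`
FROM main.default.users
WHERE `default` IS NOT NULL;
```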
5. Plan for Schema Evolution
Estuary Flow supports schema changes, but some changes are safer than others:
- Safe: Adding new fields, widening field types (e.g., int → float), adding nullable fields.
- Needs caution: Removing fields, renaming fields, or changing key fields.
- Pro tip: Use Flow’s version control and preview features before applying schema changes live.
6. Monitor Pipeline Health Proactively
- Use Flow’s pipeline logs and status dashboard to track activity.
- Integrate with monitoring tools via OpenMetrics (Prometheus, Datadog, etc.).
- Set alerts for failed materializations, schema conflicts, or authentication issues.
By following these best practices, you’ll have a production-ready pipeline that’s fast, cost-efficient, and resilient to change — exactly what modern data teams need when working with Databricks.
Common Pitfalls & How to Avoid Them
Even with a smooth no-code interface and robust connector, production pipelines can still hit snags. Here are the most common issues teams run into when loading data into Databricks via Estuary Flow — and how to fix them before they become blockers.
1. Incorrect Databricks Credentials or Permissions
Problem:
The connector fails to authenticate or materialize due to missing or invalid credentials.
Fixes:
- Double-check your Personal Access Token (PAT) or Service Principal token.
- Ensure the user or principal has access to the target catalog, schema, and warehouse.
- If using a service principal, it must belong to the `admins` group in Databricks.
- Use the Databricks CLI to validate token creation:

```bash
databricks token-management create-obo-token <application_id>
```
2. Forgetting to Enable Auto-Stop on SQL Warehouse
Problem:
The SQL Warehouse runs indefinitely, racking up unnecessary compute costs.
Fixes:
- Set the Auto Stop timeout in Databricks to the minimum (e.g., 10 minutes).
- Align this with your Flow sync schedule to avoid idle billing.
3. Using Delta Updates When You Shouldn’t
Problem:
Delta updates are enabled on collections that include updates or deletes, resulting in bloated tables or incorrect state.
Fixes:
- Only use `delta_updates: true` if:
  - Your data is append-only
  - You don’t need upserts or deletes
- For transactional data (e.g., orders or inventory), use standard merge updates.
4. Reserved SQL Keywords Breaking Queries
Problem:
Databricks rejects queries due to unescaped field or table names that clash with reserved SQL keywords (e.g., `JOIN`, `LEFT`, `DEFAULT`).
Fixes:
- Flow automatically quotes these identifiers when writing to Databricks.
- In your Databricks queries, reference them as backtick-quoted identifiers:

```sql
SELECT `default` FROM transactions;
```
- Optionally rename problematic fields at the source or during transformation.
5. Schema Drift Causing Pipeline Failures
Problem:
A field is removed or renamed in the source system, breaking downstream sync.
Fixes:
- Use Flow’s schema evolution support with caution.
- Avoid destructive changes (e.g., removing fields or changing key types) in production pipelines.
- Use preview mode and version control for schema specs before applying changes live.
6. Assuming the First Sync is Instant
Problem:
Expecting near-real-time data immediately after provisioning, but the first sync takes time.
Why:
The initial backfill loads the full history of the source tables. This is expected behavior and may take minutes or hours depending on volume.
Fix:
- Monitor progress in the Flow UI
- Once the backfill completes, real-time CDC kicks in automatically
By addressing these common issues early, your team will spend less time debugging and more time building with fresh, reliable data in Databricks.
Real-World Use Cases
Loading data into Databricks is more than just a technical exercise — it's a strategic move that unlocks real-time decision-making, AI capabilities, and cross-functional collaboration. Here’s how teams are using Estuary Flow + Databricks across industries:
1. Real-Time Customer Dashboards
Use case:
A SaaS company syncs user activity from PostgreSQL and product analytics from Mixpanel into Databricks.
Why it matters:
With Flow streaming data in real time, their support, marketing, and product teams all use live dashboards — no more waiting for the next ETL run.
2. Continuous ML Feature Engineering
Use case:
A fintech company streams transaction events from Kafka into Databricks to power fraud detection models.
Why it matters:
Flow keeps the training dataset fresh with near-zero lag, and Databricks notebooks use Delta Lake to version and monitor features.
3. Marketing and Campaign Reporting
Use case:
An e-commerce business loads campaign data from Meta Ads, Google Ads, and Shopify orders into Databricks.
Why it matters:
Flow combines disparate data sources in a central warehouse, enabling unified ROI reporting across platforms — all automated and real-time.
4. Supply Chain Visibility
Use case:
A logistics provider syncs inventory updates from an on-premise SQL Server and IoT devices into Databricks for real-time ETA tracking.
Why it matters:
Flow handles both CDC from legacy databases and event streaming from Kafka, allowing for proactive alerts and route optimization.
5. Internal Tools and Operational Reporting
Use case:
A fast-scaling startup uses Flow to stream product and billing data into Databricks and materializes it for Looker dashboards.
Why it matters:
Non-engineering teams get real-time access to metrics without needing help from data engineering, reducing bottlenecks and boosting agility.
From high-frequency finance to fast-paced retail, the combination of Estuary Flow and Databricks makes real-time data not only possible, but practical.
Conclusion: Real-Time Data, Simplified
Loading data into Databricks doesn’t have to mean building fragile pipelines or waiting on slow ETL jobs. With Estuary Flow, you can stream, sync, or migrate data into Databricks SQL Warehouse in real time — all with the flexibility to handle schema changes, optimize costs, and support enterprise-scale governance.
Whether you're working with databases like PostgreSQL and MySQL, streaming platforms like Kafka, or SaaS apps like Salesforce or Shopify, Flow gives you a unified way to move data into Databricks with confidence.
From initial backfill to continuous sync, from development to production, Estuary Flow helps your team stay focused on what matters: making data useful.
Get Started
- Sign up for Estuary Flow — it’s free to get started.
- Check out the documentation for step-by-step connector guides.
- Join the Estuary community Slack for support and inspiration.
- Want to see it in action? Book a live demo with our team.
Let your data flow — and make Databricks truly real-time.

About the author
Team Estuary is a group of engineers, product experts, and data strategists building the future of real-time and batch data integration. We write to share technical insights, industry trends, and practical guides.
