
Delta on a Dime: Fast Ingests + Cheap Merges with Estuary and Databricks AUTO CDC
Databricks quietly shipped one of the most elegant changes to data warehousing I’ve seen in a while: the AUTO CDC and AUTO CDC FROM SNAPSHOT APIs.
If you’re pushing change data into Delta tables today, you’ve probably played the “MERGE INTO” game. And you’ve probably lost a few times, to broken ordering, slow performance, or both. But now? We can ingest changes fast and cheap, and let Databricks clean them up downstream. Estuary Flow fits this model like a glove.
Let me show you.
The Real-Time Data Chain Is Only as Strong as the MERGE
When teams try to stream real-time data into Delta Lake, the same tension shows up every time:
- The ingestion layer wants to be fast, append-only, and cheap.
- The query layer wants everything to be cleanly deduplicated, ordered, and structured.
Historically, MERGE INTO helped square that circle, but at a cost. When data shows up out-of-order (as it always does with CDC), you need to either:
- Write nasty windowing logic to resequence everything, or
- Hope for the best and risk race conditions, missed updates, or bloated costs
This tradeoff made streaming into Delta Lake feel brittle.
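For context, here’s roughly what that windowing-plus-MERGE dance looks like: collapse the feed to the latest event per key, then merge. A sketch with illustrative table and column names (cdc_landing, id, sequence_num are assumptions, not anything from a real pipeline):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Collapse the CDC feed to the latest event per key before merging.
latest = (
    spark.read.table("cdc_landing")
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("id").orderBy(F.col("sequence_num").desc())))
    .filter("rn = 1")
    .drop("rn")
)
latest.createOrReplaceTempView("latest_changes")

# Upsert the survivors. Deletes, truncates, and events that arrive after
# this run each need yet more branches and bookkeeping.
spark.sql("""
    MERGE INTO my_table t
    USING latest_changes s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```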
AUTO CDC changes the contract
The new AUTO CDC API is a huge step forward. Instead of merging updates manually, Databricks now lets you:
- Append raw CDC records into a Delta table (with a sequencing column)
- Use AUTO CDC ... INTO or create_auto_cdc_flow() to declaratively process them
- Clean up late-arriving or out-of-order events after the fact
You just declare:
```python
import dlt
from pyspark.sql.functions import col

# The target streaming table must exist before the flow can write to it.
dlt.create_streaming_table("my_table")

dlt.create_auto_cdc_flow(
    target = "my_table",
    source = "my_stream",
    keys = ["id"],
    sequence_by = col("sequence_num"),
    stored_as_scd_type = 2
)
```
And Databricks will:
- De-dupe based on primary keys + sequencing
- Track version history (with __START_AT and __END_AT) for SCD Type 2
- Handle deletes and truncates
- Output a clean, queryable Delta table
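That last point matters in practice: with SCD Type 2, the current version of every row is the one whose __END_AT is still null, so serving fresh state is a one-line filter. A sketch, assuming a Databricks notebook where spark is predefined:

```python
# Current state of the table: rows whose version window is still open.
current = spark.sql("SELECT * FROM my_table WHERE __END_AT IS NULL")
```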
The ingestion side no longer has to care about order, batching, or race conditions. That’s now the warehouse’s job.
Note: The AUTO CDC APIs were previously called APPLY CHANGES and use essentially the same syntax.
How Estuary Flow fits in
Estuary Flow is a real-time data movement engine. We’re built for exactly the sort of fast, lightweight ingestion AUTO CDC is designed to pair with:
- We capture CDC from databases (Postgres, MySQL, SQL Server, etc.)
- We deliver updates as JSON rows into object storage (S3, ADLS, etc.) or directly into Delta tables
- We support strict sequencing with resume tokens and logical timestamps
- We handle schema evolution automatically with Flow’s schemas-as-code design
In the old model, we’d push changes directly into Delta and run MERGE INTO to reconcile them. But now, with AUTO CDC, we can just:
- Append CDC events from Flow into a landing table
- Trigger AUTO CDC jobs inside Databricks to upsert into query-ready tables
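Concretely, the Databricks half of that two-step is just a view over the landing table plus one flow declaration. A sketch where the landing table (flow_cdc_landing), key (order_id), and sequencing column (op_ts) are assumptions about your Flow setup:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical landing table that Flow appends raw CDC events into.
@dlt.view
def flow_landing():
    return spark.readStream.table("flow_cdc_landing")

dlt.create_streaming_table("orders")

dlt.create_auto_cdc_flow(
    target = "orders",
    source = "flow_landing",
    keys = ["order_id"],
    sequence_by = col("op_ts"),  # logical timestamp carried on each event
    stored_as_scd_type = 1
)
```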
This means:
- Cheaper compute: AUTO CDC handles the heavy lifting during scheduled runs
- Simpler ingestion: no fancy reordering logic in Flow configs
- Faster time to insight: changes appear in Delta Lake seconds after commit
SCD 1, SCD 2: take your pick
What I especially like here is how flexible the semantics are. You can choose:
- stored_as_scd_type = 1: overwrite updates, no history
- stored_as_scd_type = 2: preserve row versions via __START_AT, __END_AT
- Or mix and match with track_history_except_column_list to control what changes trigger a new version
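That last knob is handy when noisy columns would otherwise spam your history. A sketch with illustrative names, where changes to last_login alone don’t open a new row version:

```python
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("customers_history")

# SCD Type 2, but ignore last_login when deciding whether to version.
dlt.create_auto_cdc_flow(
    target = "customers_history",
    source = "customers_cdc",
    keys = ["customer_id"],
    sequence_by = col("sequence_num"),
    stored_as_scd_type = 2,
    track_history_except_column_list = ["last_login"]
)
```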
This makes it really easy to tailor the integration for:
- BI tools that need fresh, deduplicated views (SCD1)
- Auditing and historical analysis that relies on versioning (SCD2)
With Flow as the input and AUTO CDC as the downstream logic, you basically get versioned, warehouse-native materializations of your source databases.
Bonus: AUTO CDC FROM SNAPSHOT is great for batch-y sources too
Not all systems emit change feeds. Sometimes you just get periodic CSV dumps or timestamped table snapshots.
That’s where AUTO CDC FROM SNAPSHOT shines. Estuary can land these snapshots (say, from an Oracle extract or nightly job), and Databricks will compare them, track diffs, and update your target table accordingly.
We just stream snapshots into a Delta table, and Databricks figures out the diffs:
```python
import dlt

# The target streaming table must exist before the flow can write to it.
dlt.create_streaming_table("customers_scd2")

dlt.create_auto_cdc_from_snapshot_flow(
    target="customers_scd2",
    source="nightly_snapshot",
    keys=["customer_id"],
    stored_as_scd_type=2
)
```
Boom! Historical tracking, without writing diff logic.
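And because every version carries __START_AT and __END_AT, point-in-time questions become plain filters. A sketch against the table above (the key value and date are made up):

```python
# Which version of customer 42 was live on 2024-01-01?
as_of = spark.sql("""
    SELECT * FROM customers_scd2
    WHERE customer_id = 42
      AND __START_AT <= '2024-01-01'
      AND (__END_AT IS NULL OR __END_AT > '2024-01-01')
""")
```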
From ingestion firehose to clean tables
To wrap it all up:
- Estuary Flow lets you stream database change events in real time.
- Databricks AUTO CDC lets you merge them in a fast, fault-tolerant, declarative way.
- You can handle out-of-order events, deletes, truncates, and historical replays.
- And you barely need to write code to make it work.
This is what modern data pipelines should feel like: fast upstream, clean downstream, minimal glue code.
If you’re building on Delta and want to stop babysitting your MERGE statements, give this combo a spin.
FAQs
1. What is Databricks AUTO CDC?
AUTO CDC (formerly APPLY CHANGES) is Databricks’ declarative API for applying change data capture events to Delta tables. It deduplicates by key, orders events by a sequencing column, handles deletes and truncates, and can maintain SCD Type 1 or Type 2 history.
2. How does Estuary Flow integrate with AUTO CDC?
Flow captures CDC from source databases and appends the raw change events into a Delta landing table; an AUTO CDC flow inside Databricks then upserts those events into clean, query-ready tables.
3. Why use AUTO CDC instead of MERGE INTO?
AUTO CDC handles out-of-order and late-arriving events declaratively, so you skip hand-written resequencing logic and avoid the race conditions and compute costs that come with frequent MERGE statements.

About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
