
Delta on a Dime: Fast Ingests + Cheap Merges with Estuary and Databricks AUTO CDC
Databricks quietly shipped one of the most elegant changes to data warehousing I’ve seen in a while: the AUTO CDC and AUTO CDC FROM SNAPSHOT APIs.
If you’re pushing change data into Delta tables today, you’ve probably played the “MERGE INTO” game. And you’ve probably lost a few times, to broken ordering, slow performance, or both. But now? We can ingest changes fast and cheap, and let Databricks clean them up downstream. Estuary Flow fits this model like a glove.
Let me show you.
The Real-Time Data Chain Is Only as Strong as the MERGE
When teams try to stream real-time data into Delta Lake, the same tension shows up every time:
- The ingestion layer wants to be fast, append-only, and cheap.
- The query layer wants everything to be cleanly deduplicated, ordered, and structured.
Historically, MERGE INTO helped square that circle, but at a cost. When data shows up out-of-order (as it always does with CDC), you need to either:
- Write nasty windowing logic to resequence everything, or
- Hope for the best and risk race conditions, missed updates, or bloated costs
This tradeoff made streaming into Delta Lake feel brittle.
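For context, here’s roughly what that windowing-plus-MERGE dance looks like: collapse the feed to the latest event per key, then merge. A sketch with illustrative table and column names (cdc_landing, id, sequence_num are assumptions, not anything from a real pipeline):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Collapse the CDC feed to the latest event per key before merging.
latest = (
    spark.read.table("cdc_landing")
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("id").orderBy(F.col("sequence_num").desc())))
    .filter("rn = 1")
    .drop("rn")
)
latest.createOrReplaceTempView("latest_changes")

# Upsert the survivors. Deletes, truncates, and events that arrive after
# this run each need yet more branches and bookkeeping.
spark.sql("""
    MERGE INTO my_table t
    USING latest_changes s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```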
AUTO CDC changes the contract
The new AUTO CDC API is a huge step forward. Instead of merging updates manually, Databricks now lets you:
- Append raw CDC records into a Delta table (with a sequencing column)
- Use AUTO CDC ... INTO or create_auto_cdc_flow() to declaratively process them
- Clean up late-arriving or out-of-order events after the fact
You just declare:
```python
import dlt
from pyspark.sql.functions import col

# The target streaming table must exist before the flow can write to it.
dlt.create_streaming_table("my_table")

dlt.create_auto_cdc_flow(
    target = "my_table",
    source = "my_stream",
    keys = ["id"],
    sequence_by = col("sequence_num"),
    stored_as_scd_type = 2
)
```
And Databricks will:
- De-dupe based on primary keys + sequencing
- Track version history (with __START_AT and __END_AT) for SCD Type 2
- Handle deletes and truncates
- Output a clean, queryable Delta table
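That last point matters in practice: with SCD Type 2, the current version of every row is the one whose __END_AT is still null, so serving fresh state is a one-line filter. A sketch, assuming a Databricks notebook where spark is predefined:

```python
# Current state of the table: rows whose version window is still open.
current = spark.sql("SELECT * FROM my_table WHERE __END_AT IS NULL")
```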
The ingestion side no longer has to care about order, batching, or race conditions. That’s now the warehouse’s job.
Note: The AUTO CDC APIs were previously called APPLY CHANGES and use essentially the same syntax.
How Estuary Flow fits in
Estuary Flow is a real-time data movement engine. We’re built for exactly the sort of fast, lightweight ingestion AUTO CDC is designed to pair with:
- We capture CDC from databases (Postgres, MySQL, SQL Server, etc.)
- We deliver updates as JSON rows into object storage (S3, ADLS, etc.) or directly into Delta tables
- We support strict sequencing with resume tokens and logical timestamps
- We handle schema evolution automatically with Flow’s schemas-as-code design
In the old model, we’d push changes directly into Delta and run MERGE INTO to reconcile them. But now, with AUTO CDC, we can just:
- Append CDC events from Flow into a landing table
- Trigger AUTO CDC jobs inside Databricks to upsert into query-ready tables
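Concretely, the Databricks half of that two-step is just a view over the landing table plus one flow declaration. A sketch where the landing table (flow_cdc_landing), key (order_id), and sequencing column (op_ts) are assumptions about your Flow setup:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical landing table that Flow appends raw CDC events into.
@dlt.view
def flow_landing():
    return spark.readStream.table("flow_cdc_landing")

dlt.create_streaming_table("orders")

dlt.create_auto_cdc_flow(
    target = "orders",
    source = "flow_landing",
    keys = ["order_id"],
    sequence_by = col("op_ts"),  # logical timestamp carried on each event
    stored_as_scd_type = 1
)
```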
This means:
- Cheaper compute: AUTO CDC handles the heavy lifting during scheduled runs
- Simpler ingestion: no fancy reordering logic in Flow configs
- Faster time to insight: changes appear in Delta Lake seconds after commit
SCD 1, SCD 2: take your pick
What I especially like here is how flexible the semantics are. You can choose:
- stored_as_scd_type = 1: overwrite updates, no history
- stored_as_scd_type = 2: preserve row versions via __START_AT, __END_AT
- Or mix and match with track_history_except_column_list to control what changes trigger a new version
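That last knob is handy when noisy columns would otherwise spam your history. A sketch with illustrative names, where changes to last_login alone don’t open a new row version:

```python
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("customers_history")

# SCD Type 2, but ignore last_login when deciding whether to version.
dlt.create_auto_cdc_flow(
    target = "customers_history",
    source = "customers_cdc",
    keys = ["customer_id"],
    sequence_by = col("sequence_num"),
    stored_as_scd_type = 2,
    track_history_except_column_list = ["last_login"]
)
```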
This makes it really easy to tailor the integration for:
- BI tools that need fresh, deduplicated views (SCD1)
- Auditing and historical analysis that relies on versioning (SCD2)
With Flow as the input and AUTO CDC as the downstream logic, you basically get versioned, warehouse-native materializations of your source databases.
Bonus: AUTO CDC FROM SNAPSHOT is great for batch-y sources too
Not all systems emit change feeds. Sometimes you just get periodic CSV dumps or timestamped table snapshots.
That’s where AUTO CDC FROM SNAPSHOT shines. Estuary can land these snapshots (say, from an Oracle extract or nightly job), and Databricks will compare them, track diffs, and update your target table accordingly.
We just stream snapshots into a Delta table, and Databricks figures out the diffs:
```python
import dlt

# The target streaming table must exist before the flow can write to it.
dlt.create_streaming_table("customers_scd2")

dlt.create_auto_cdc_from_snapshot_flow(
    target="customers_scd2",
    source="nightly_snapshot",
    keys=["customer_id"],
    stored_as_scd_type=2
)
```
Boom! Historical tracking, without writing diff logic.
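And because every version carries __START_AT and __END_AT, point-in-time questions become plain filters. A sketch against the table above (the key value and date are made up):

```python
# Which version of customer 42 was live on 2024-01-01?
as_of = spark.sql("""
    SELECT * FROM customers_scd2
    WHERE customer_id = 42
      AND __START_AT <= '2024-01-01'
      AND (__END_AT IS NULL OR __END_AT > '2024-01-01')
""")
```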
From ingestion firehose to clean tables
To wrap it all up:
- Estuary Flow lets you stream database change events in real time.
- Databricks AUTO CDC lets you merge them in a fast, fault-tolerant, declarative way.
- You can handle out-of-order events, deletes, truncates, and historical replays.
- And you barely need to write code to make it work.
This is what modern data pipelines should feel like: fast upstream, clean downstream, minimal glue code.
If you’re building on Delta and want to stop babysitting your MERGE statements, give this combo a spin.
FAQs
1. What is Databricks AUTO CDC?
AUTO CDC (formerly APPLY CHANGES) is Databricks’ declarative API for applying change data capture events to Delta tables. It deduplicates by key, orders events by a sequencing column, handles deletes and truncates, and can maintain SCD Type 1 or Type 2 history.
2. How does Estuary Flow integrate with AUTO CDC?
Flow captures CDC from source databases and appends the raw change events into a Delta landing table; an AUTO CDC flow inside Databricks then upserts those events into clean, query-ready tables.
3. Why use AUTO CDC instead of MERGE INTO?
AUTO CDC handles out-of-order and late-arriving events declaratively, so you skip hand-written resequencing logic and avoid the race conditions and compute costs that come with frequent MERGE statements.

About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
