
Why is Everyone Buying Change Data Capture?

Why are world-class engineering teams buying Change Data Capture instead of building it? Explore the hidden complexity of CDC, the wave of billion-dollar acquisitions, and why even top companies admit it’s easier to buy than build.


Why are the smartest engineering organizations buying what they could build?

Something strange is happening in the data infrastructure world. Companies that pride themselves on engineering excellence (organizations with thousands of developers who could build anything) are opening their checkbooks instead of their IDEs when it comes to Change Data Capture (CDC).

The numbers tell most of the story:

  • ClickHouse acquired PeerDB (July 2024)
  • IBM acquired StreamSets for $2.3B (December 2023)
  • Qlik acquired Talend for $2.4B (May 2023) after already buying Attunity for $560M (2019)
  • Databricks acquired Arcion (October 2023)
  • Fivetran acquired HVR (September 2021) after acquiring Teleport Data (June 2021)
  • Google Cloud acquired Alooma (February 2019)
  • Salesforce acquired Griddable (January 2019)
  • And many more…

Since 2015, we've seen over 15 major acquisitions of companies whose core offering included CDC technology. Private equity firms circled like sharks; Thoma Bravo took Talend private for $2.4B, while Permira and CPPIB grabbed Informatica for $5.3B. Even smaller players like FlyData and Stitch couldn't stay independent.

The pattern is right there: Everybody wants Change Data Capture. Nobody wants to build it.

The Deceptive Simplicity of CDC

CDC architecture

source: seattledataguy.com

"It's just tailing a log file, right?"

Every CDC build-vs-buy discussion starts the same way. An engineer looks at the problem and thinks, "How hard could it be? Databases write changes to a log. I'll read that log, parse the events, and stream them out. Give me a sprint, maybe two."

The proof-of-concept is indeed seductively simple. Here's a basic CDC implementation for PostgreSQL in Python:

python
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

# Connect to PostgreSQL
conn = psycopg2.connect(
    "dbname=mydb user=replicator",
    connection_factory=LogicalReplicationConnection
)
cur = conn.cursor()

# Create replication slot
cur.create_replication_slot('my_slot', output_plugin='wal2json')

# Start streaming changes
cur.start_replication(slot_name='my_slot')

while True:
    msg = cur.read_message()
    if msg:
        print(f"Change detected: {msg.payload}")
        cur.send_feedback(flush_lsn=msg.data_start)

Twenty-odd lines of code. It works! Ship it to production.

Two weeks later, at 3 AM, your pager goes off. The replication slot filled up the disk. Your primary database is down. As you try to fix it, you realize you've just discovered the first of approximately 1,000 edge cases that make CDC one of the hardest problems in data infrastructure.

The Iceberg Beneath The Surface

What that simple demo hides is an ocean of complexity that would make distributed systems researchers weep. Let me show you some of what's actually lurking beneath those innocent-looking log files.

  • Database heterogeneity is usually your first wake-up call. PostgreSQL's logical replication is nothing like MySQL's binlog, which is nothing like MongoDB's oplog, which is nothing like Oracle's LogMiner. Each database has its own format, its own quirks, its own failure modes. That elegant abstraction you designed? It shatters on contact with reality.
  • Performance at scale becomes apparent when you're parsing 100GB/hour of WAL files. Suddenly, that naive parsing loop is consuming more CPU than your actual database. You need zero-copy parsing, efficient memory management, and parallel processing – all while guaranteeing order preservation for transactions.
  • State management reveals itself during your first network partition. Where exactly were you in the replication stream? How do you resume without missing changes or creating duplicates? Your checkpoint logic needs to handle crashes, network failures, and upgrades – often simultaneously. (A minimal checkpoint sketch follows this list.)
  • Schema evolution delivers the killing blow. Someone runs ALTER TABLE ADD COLUMN in production. Your CDC pipeline explodes. Messages in flight have the old schema. New messages have the new schema. Your downstream systems are expecting something else entirely. Welcome to the special hell of online schema changes in distributed systems.
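
To make the state-management point concrete, here's a minimal sketch of checkpoint-and-resume logic layered onto the earlier PostgreSQL example. The load_lsn, store_lsn, and write_to_destination helpers are hypothetical stand-ins for your own durable storage and destination; the key property is that the checkpoint only advances after the destination write succeeds, so a crash replays changes instead of losing them.

python
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

# Hypothetical helpers: persist/recover the last LSN that was durably
# applied to the destination (a file, a control table, etc.).
def load_lsn():
    ...

def store_lsn(lsn):
    ...

def write_to_destination(payload):
    ...  # hypothetical downstream write

conn = psycopg2.connect(
    "dbname=mydb user=replicator",
    connection_factory=LogicalReplicationConnection
)
cur = conn.cursor()

# Resume from our own checkpoint; PostgreSQL re-sends anything never confirmed.
cur.start_replication(slot_name='my_slot', start_lsn=load_lsn() or 0)

def consume(msg):
    write_to_destination(msg.payload)   # apply the change downstream first
    store_lsn(msg.data_start)           # checkpoint only after the write lands
    # Confirm to PostgreSQL so the slot advances and old WAL can be recycled.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)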

A staff SWE at a FAANG company once told me: "We had a team of 12 working on our internal CDC solution. After 3 years and who-knows-how-many millions of dollars, we gave up and bought a commercial solution. It was the best engineering decision we ever made: admitting defeat."

But why? What makes CDC so uniquely difficult that even companies with world-class engineering teams choose to buy instead of build? The answer lies not in any single technical challenge, but in the sheer accumulation of edge cases, the unforgiving nature of data consistency requirements, and the operational complexity of keeping these systems running 24/7.

Let's take a look at the technical challenges that await anyone brave (or foolish) enough to build their own CDC system...

Hidden Complexity


1. "Exactly Once"

Let me tell you about the time a friend of mine working at a large fintech data team processed $47 million in duplicate payments because their homegrown CDC system hiccupped during a network partition.

The problem starts like this: you need to guarantee that every change in your source database appears exactly once in your destination. Not zero times (data loss). Not twice (data corruption). Exactly once.

This is like asking for a perpetual motion machine in a distributed system. Here's what actually happens:

python
# The naive approach
def process_change(change):
    # Network fails here? Change is lost
    send_to_destination(change)
    # Network fails here? Change is duplicated
    acknowledge_to_source(change.lsn)

The "solution" involves distributed transactions, two-phase commits, or idempotency tokens. But wait! Your destination is a data warehouse that doesn't support transactions - oops. Or it's an event stream where consumers might process messages out of order. Or it's a microservice that goes down for deployment right in the middle of your carefully orchestrated dance.
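
To see why idempotency helps, here's a minimal sketch of an idempotent apply step, assuming a destination that supports keyed upserts. The payments table, column names, and change object are hypothetical; the point is that replaying the same change is harmless, so at-least-once delivery upstream stops corrupting the destination.

python
def apply_change(dest_cursor, change):
    # Keyed upsert: applying the same change twice is a no-op, and stale
    # replays are filtered out by comparing LSNs.
    dest_cursor.execute(
        """
        INSERT INTO payments (id, amount, updated_lsn)
        VALUES (%(id)s, %(amount)s, %(lsn)s)
        ON CONFLICT (id) DO UPDATE
           SET amount = EXCLUDED.amount,
               updated_lsn = EXCLUDED.updated_lsn
         WHERE payments.updated_lsn < EXCLUDED.updated_lsn
        """,
        {"id": change.key, "amount": change.amount, "lsn": change.lsn},
    )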

Suddenly, guaranteeing exactly-once delivery becomes a very hard problem to solve. Do you tackle it head-on, or push the issue downstream and settle for at-least-once delivery? The most popular open source CDC framework, Debezium, chose the latter and guarantees only at-least-once.

Here’s a deep dive blog on delivery guarantees and how Estuary achieves exactly-once.

2. Initial Snapshot vs Incremental Sync

DBLog architecture

source: Netflix Tech Blog

Here's a fun exercise: try to snapshot a 10TB table while simultaneously capturing all changes to that table, ensuring perfect consistency between the snapshot and the change stream.

The conversation usually goes like this:

Engineer: "We'll take a snapshot, then start CDC from that point."
Reality: "The snapshot takes 6 hours. 50 million changes happened during the snapshot."
Engineer: "Okay, we'll capture changes during the snapshot and replay them."
Reality: "Some of those changes modified rows that were already snapshotted. Others modified rows not yet snapshotted."
Engineer: "We'll track which rows were snapshotted when..."
Reality: "Your tracking table is now larger than the original table."

A data engineer friend of mine working at a major e-commerce company shared their horror story: "Our initial sync never converged. We'd get to like 90%, then the business would run a massive UPDATE statement, and we'd essentially start over. After three weeks, we hired a consultant who told us about LSN-based consistent snapshots. Three. Weeks."

Netflix shook up the CDC scene when it released DBLog in 2019, a framework that describes how a CDC system can interleave snapshot chunks with log events so both are captured continuously at the same time.
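
The core trick in DBLog is watermarking: bracket each small snapshot chunk between a low and a high watermark written to the source, and let any log event that arrives inside that window win over the snapshot row with the same key. Here's a heavily simplified sketch of one iteration; db, log, emit, and the watermark helpers are all hypothetical:

python
def snapshot_chunk_with_watermarks(db, log, emit, chunk_size=1000):
    """One DBLog-style iteration: snapshot a chunk without pausing the log stream."""
    low = db.write_watermark()      # marker row shows up in the change log
    chunk = {row.key: row for row in db.select_next_chunk(chunk_size)}
    high = db.write_watermark()

    inside_window = False
    for event in log.stream():
        if event.is_watermark(low):
            inside_window = True
        elif event.is_watermark(high):
            break                            # window closed, chunk is reconciled
        else:
            emit(event)                      # log events always flow downstream
            if inside_window:
                chunk.pop(event.key, None)   # log wins over the stale snapshot copy

    for row in chunk.values():               # rows untouched during the window
        emit(row)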

3. Database-Specific Dragons

Each database is a unique snowflake of pain. Let me share some war stories:

PostgreSQL

sql
-- This innocent-looking command just broke your CDC
ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}';

PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) stores large values out-of-line. Your CDC captures the change, but not the TOAST data. Congratulations, you're now replicating pointers to nowhere.

Then there's replication slot management. Slots prevent WAL cleanup until changes are consumed. Your consumer goes down for maintenance? The slot doesn't advance, WAL piles up, the disk fills. Database: crashed. Your weekend: ruined.
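
A cheap line of defense is to watch how much WAL each slot is pinning before it eats the disk. A minimal monitoring sketch against PostgreSQL's pg_replication_slots view; the connection string and 50 GB threshold are hypothetical, and the print would be wired into whatever pages you at 3 AM:

python
import psycopg2

MAX_RETAINED_BYTES = 50 * 1024**3   # hypothetical threshold: 50 GB of pinned WAL

conn = psycopg2.connect("dbname=mydb user=monitor")
with conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
          FROM pg_replication_slots
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        if retained_bytes and retained_bytes > MAX_RETAINED_BYTES:
            print(f"ALERT: slot {slot_name} (active={active}) is pinning "
                  f"{retained_bytes / 1024**3:.1f} GB of WAL")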

Estuary’s PostgreSQL connector uses merge reductions to carry forward the last known value when a WAL update omits an unchanged TOASTed column, so downstream collections retain complete values across updates (even if the WAL doesn’t include the large field).

MySQL

MySQL's binlog comes in three flavors: STATEMENT, ROW, and MIXED. STATEMENT seems efficient until a non-deterministic statement like UPDATE users SET api_token = UUID() produces different results on the replica. ROW format is safe but can be massive. MIXED tries to be smart and usually picks wrong.

Don't get me started on GTIDs (Global Transaction IDs). They were supposed to make replication easier. Instead, you get errors like: "The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires."
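
If you do go down this road, the usual Python starting point is the third-party python-mysql-replication package, which decodes ROW-format binlog events for you. A minimal sketch, assuming a server running with binlog_format=ROW and hypothetical connection details:

python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "replicator", "passwd": "secret"},
    server_id=100,            # must be unique among replicas
    blocking=True,
    resume_stream=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:
    for row in event.rows:
        # ROW format hands you actual before/after images, so non-deterministic
        # statements stop being a correctness problem and become a volume one.
        print(event.table, row)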

MongoDB

MongoDB's oplog has a fixed size. Fall behind in processing? Oldest changes get overwritten. Your CDC silently loses data. The "solution" is to size your oplog for your worst-case lag scenario. One team set it to 50GB. They still lost data during a network outage.

Resume tokens expire. Sharded clusters require reading from multiple oplogs simultaneously while maintaining global ordering. Each shard can fail independently. The complexity grows exponentially with cluster size.
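
Resume tokens, for all their quirks, are still how you pick up where you left off. A minimal sketch with PyMongo change streams; load_token, persist_token, and handle_change are hypothetical helpers backed by your own durable storage:

python
from pymongo import MongoClient
from pymongo.errors import PyMongoError

def load_token(): ...           # hypothetical: returns None on first run
def persist_token(token): ...   # hypothetical: durable checkpoint store
def handle_change(change): ...  # hypothetical downstream write

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

try:
    with orders.watch(resume_after=load_token(),
                      full_document="updateLookup") as stream:
        for change in stream:
            handle_change(change)
            persist_token(stream.resume_token)  # checkpoint after the write lands
except PyMongoError:
    # If the token has already rotated out of the oplog, the resume is rejected
    # and you're back to a full re-snapshot.
    raise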

4. Operational Nightmares

Running CDC in production is where dreams go to die. Consider these everyday scenarios:

"Why didn't this record replicate?" 

You have a customer complaint. One specific record didn't make it to the destination. Now you need to:

  • Check if the change was captured from the source
  • Verify it wasn't filtered out by business logic
  • Confirm it was sent to the destination
  • Validate that it was successfully processed
  • Trace through potentially millions of log entries

One company built an entire observability platform just for their CDC pipeline. It was more complex than their actual product.

Zero-downtime upgrades

You need to upgrade your CDC system. But you can't stop replication because the business needs real-time data. So you need to:

  1. Deploy the new version alongside the old
  2. Have both versions process in parallel
  3. Ensure no duplicates or gaps
  4. Seamlessly cutover
  5. Pray nothing goes wrong

The classic line about upgrading a distributed system definitely applies here: "It's like changing the engine of a car while driving at 100mph on the highway."

Why Even Good Engineers Fail

There are three main areas of concern when you set out to build your own CDC solution: time-to-value, talent, and maintenance.

Time-to-value vs Time-to-failure

The business case always starts the same way: "We need real-time data synchronization. Our competitors have it. How long will it take?"

The engineering estimate follows a predictable pattern:

  • Month 1-2: "We have a working prototype!"
  • Month 3-4: "We're handling the main edge cases."
  • Month 5-6: "Performance optimization is going well."
  • Month 7-9: "We've hit some unexpected complexity..."
  • Month 10-12: "We need to redesign for production requirements."
  • Month 13-18: "It's mostly working, but the operational burden is high."
  • Month 19+: "Maybe we should look at vendor solutions."

Meanwhile, your competitor implemented Debezium or bought Estuary and has been shipping features for a year.

The Talent Problem

Building CDC requires a rare combination of expertise:

  1. Deep database internals knowledge: Understanding WAL formats, transaction isolation levels, MVCC implementations
  2. Distributed systems expertise: Consensus protocols, partition tolerance, eventual consistency
  3. Operational excellence: Monitoring, alerting, capacity planning, incident response
  4. Performance optimization: Zero-copy buffers, memory management, parallel processing

Finding one engineer with all these skills is hard. Building a team of them? Nearly impossible. And if you do find them, is CDC really where you want them spending their time?

Endless Maintenance

The initial build is just the beginning. Every database version upgrade is a potential breaking change:

  • PostgreSQL 14 changed logical replication protocol messages
  • MySQL 8.0 modified binlog event formats
  • MongoDB 5.0 introduced new oplog entry types

Security patches need immediate attention, but might break your carefully tuned parsing logic. Performance mysteriously degrades because a cloud provider changed their network infrastructure. Your on-call rotation becomes a revolving door of burned-out engineers.

And make no mistake, databases evolve fast, and your users get automatically bumped to the latest version by their providers, so you have no choice but to follow along as quickly as possible.

Build vs Buy (vs Acquire)


When Building Makes Sense

Let's be honest: there are valid reasons to build your own CDC:

1. Single database, controlled environment: If you're only replicating PostgreSQL to PostgreSQL within your own infrastructure, the complexity drops significantly.

2. Extreme performance requirements: Facebook's Wormhole processes trillions of events per day. At that scale, every optimization matters. (They also have 100+ engineers working on it.)

3. Unique filtering needs: If you need to filter out 99% of changes based on complex business logic, a custom solution might be more efficient.

4. Regulatory requirements: Some organizations can't use third-party services for data movement due to compliance requirements.

The True Cost of Building

TCO (Total Cost of Ownership) is the metric that actually matters here. Let's do the math that nobody wants to do:

Initial Development

  • 3-5 senior engineers × 12 months × $300k/year = $900k - $1.5M
  • Infrastructure costs for testing: $100k
  • Lost opportunity cost: Immeasurable

Ongoing Maintenance

  • 2-3 engineers permanently × $300k = $600k - $900k/year
  • On-call burden: 24/7 coverage across the team
  • Incident costs: Data loss and corruption events, which quickly escalate into critical incidents

Risk Costs

  • Customer churn from data inconsistencies
  • Compliance violations from data loss
  • Reputation damage from outages

Compare that to commercial solutions ranging from $50k-$500k/year, and the build option looks less attractive.

What Acquirers Are Really Buying

When Databricks acquired Arcion or IBM bought StreamSets, they weren't just buying code. They were buying:

  1. Battle-tested reliability: Millions of hours of production usage across diverse environments
  2. Edge case coverage: Every obscure bug fixed represents weeks of engineering time saved
  3. Connector ecosystem: Pre-built integrations with 100+ data sources
  4. Operational playbooks: Runbooks for every failure scenario
  5. Specialized talent: Engineers who live and breathe CDC
  6. Time to market: Immediate capability vs. 18-month build cycle

Most importantly, they're buying the ability to focus their engineers on their core differentiators instead of rebuilding infrastructure.

The Technical Moat

Why CDC is Hard to Commoditize

You might think CDC would be commoditized by now. It's been around for decades. Yet companies still command premium prices and major acquisitions. Why?

1. The long tail of edge cases: Every production deployment discovers new edge cases. A solution handling 99% of scenarios sounds good until you realize that 1% represents millions of critical business records.

2. Performance optimization is non-trivial: The difference between a naive implementation and an optimized one can be 100x. Those optimizations come from years of profiling, tuning, and algorithmic improvements.

3. Operational excellence requires experience: Knowing how to handle every failure mode takes years of accumulated wisdom. You can't Google "PostgreSQL replication slot corrupted during major version upgrade while processing TOAST data" and find a Stack Overflow answer.

The Innovation Layer

Modern CDC has evolved far beyond simple log tailing:

  • Automatic schema inference and evolution: Detecting and adapting to schema changes without manual intervention
  • Declarative transformations: SQL-based transformations within the CDC pipeline
  • Exactly-once guarantees: Sophisticated protocols ensuring consistency
  • Time travel capabilities: Replaying historical changes
  • Cloud-native architectures: Auto-scaling, multi-region support

The innovation hasn't stopped; if anything, it has accelerated. The gap between build-your-own and commercial solutions widens every year.

Beyond Databases: CDC from SaaS APIs (Salesforce, etc.)


CDC isn’t just about tailing database logs. A huge share of “change capture” today comes from SaaS apps (Salesforce, NetSuite, HubSpot, Shopify, Stripe, Zendesk) where you don’t control the database and must rely on API semantics. These sources expose changes via one (or more) of three patterns: (1) polling with cursors (e.g., updated_at, SystemModstamp, incremental IDs), (2) webhooks/event streams (Salesforce Change Data Capture/Platform Events, Shopify webhooks, Stripe events), and (3) bulk backfills to seed history, then incremental deltas.

Why it’s tricky:

  • Rate limits & back-pressure: APIs throttle; bursts or replays can starve downstreams or create gaps.
  • At-least-once delivery: Webhooks retry → duplicates; polling can miss late-arriving updates if windows are wrong.
  • Deletes & soft-deletes: Not all APIs emit tombstones; you often need extra passes to reconcile hard vs. soft deletes.
  • Partial payloads & schema drift: Events may omit unchanged fields; object schemas evolve without notice.
  • Cross-object ordering: Related objects (e.g., Opportunity → OpportunityLineItem) arrive out of order across endpoints.
  • Replay windows: Some providers (e.g., Salesforce) limit how far back you can replay with tokens/Replay IDs.

Battle-tested patterns:

  • Idempotency everywhere: Deterministic dedupe keys (event ID + version), upserts/merges in sinks.
  • Checkpointing per object & channel: Persist cursors/replay IDs atomically with delivered offsets (see the sketch after this list).
  • Tombstone strategy: Infer deletes via audit logs, delta comparisons, or dedicated “deleted” endpoints.
  • Backfill + live interleave: Seed with bulk exports, then interleave live events while guaranteeing no gaps/dupes.
  • Adaptive pacing: Token-bucket rate control, exponential backoff, and dead-letter queues for poison events.
  • Schema mediation: Late-binding schemas and “merge reductions” to carry forward missing fields across partial updates.
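
Here's how a couple of those patterns (per-object cursor checkpointing plus idempotent upserts) might fit together for a polling-style source. The endpoint, cursor field, and helpers below are hypothetical; a real connector also needs pagination, backoff, and delete reconciliation:

python
import requests

def poll_changes(base_url, object_name, state):
    """One polling pass: fetch records modified since the saved cursor,
    apply them idempotently, then advance the cursor."""
    cursor = state.load_cursor(object_name)   # hypothetical durable cursor store

    resp = requests.get(
        f"{base_url}/{object_name}",          # hypothetical REST endpoint
        params={"modified_since": cursor, "order_by": "modified_at"},
        timeout=30,
    )
    resp.raise_for_status()
    records = resp.json()["records"]

    for rec in records:
        # Hypothetical keyed upsert into the sink: replaying the same record
        # is harmless, so at-least-once delivery stays safe.
        upsert(object_name, key=rec["id"], version=rec["modified_at"], payload=rec)

    if records:
        # Advance the cursor only after the whole batch is applied, ideally in
        # the same transaction as the upserts, so a crash replays rather than skips.
        state.save_cursor(object_name, records[-1]["modified_at"])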

The punchline matches the DB world: each SaaS has unique quirks. Supporting dozens of them multiplies edge cases, SLAs, and operational rules. That’s why teams often buy CDC platforms with mature SaaS connectors rather than building and babysitting a zoo of one-off adapters.

Conclusion: The Pragmatic Choice

Engineering maturity means recognizing when building isn't the best use of our time. CDC joins a select group of technologies – alongside production databases, cloud infrastructure, and ML training platforms – where the complexity-to-value ratio strongly favors buying.

The consolidation will continue. My predictions:

  • By 2027, 2-3 major players will dominate the CDC market
  • Open source solutions will improve, but remain operationally demanding
  • CDC becomes invisible infrastructure – like load balancers or message queues
  • Real innovation moves up the stack to stream processing and real-time analytics

Final Thoughts

"The best code is code you don't have to write. The second best is code you don't have to maintain. CDC is both."

Your engineering time is precious. Your business needs real-time data. Unless CDC is your core differentiator, swallow your pride and buy it. Your future self will thank you when you ship features instead of debugging why replication lag suddenly spikes at 3 AM.

FAQs

Why are engineering teams buying CDC instead of building it?
Because CDC is deceptively complex. While a proof-of-concept may look simple, production-ready CDC requires solving challenges like exactly-once delivery, database-specific quirks, schema evolution, scaling to high volumes, and operational reliability. Even the best engineering teams often find the time, talent, and maintenance costs outweigh the benefits of building in-house.

What makes CDC so hard to build and maintain in-house?
CDC involves deep database internals, distributed systems design, and operational excellence. Problems like WAL parsing, replication slot management, schema drift, snapshot consistency, and handling failures at scale quickly multiply. Edge cases across databases such as PostgreSQL, MySQL, and MongoDB introduce further complexity, making long-term maintenance a massive burden.

Why are major tech companies acquiring CDC vendors?
CDC has become critical infrastructure for real-time data platforms, fueling analytics, AI, and cloud-native applications. Instead of reinventing the wheel, major players like Databricks, IBM, and Google Cloud have acquired CDC vendors to accelerate time-to-market, gain proven technology, inherit connector ecosystems, and secure specialized talent. The consolidation trend reflects CDC’s importance and difficulty to commoditize.

About the author

Dani Pálma, Head of Data & Marketing

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
