

Last updated: May 6, 2025

Change Data Capture (CDC): The Complete Guide

Understand what Change Data Capture (CDC) is, how it works, and when to use it. Compare top CDC tools like Estuary, Debezium, Fivetran & more.

Jeffrey Richman


What is Change Data Capture?

Change Data Capture (CDC) is a real-time data integration technique that captures and delivers changes, such as inserts, updates, and deletes, from a source database to downstream systems. Unlike traditional batch ETL, CDC ensures data is continuously synchronized with minimal latency, making it essential for real-time analytics, operational reporting, and system replication.

Because it streams change events continuously rather than re-extracting whole tables, CDC also reduces the load on the source system. It has become a standard approach in modern data engineering for use cases like database replication and event-driven applications.

In this guide, we’ll explain how CDC works, compare its core methods (query-based, trigger-based, and log-based), and explore when and when not to use it. You’ll also find tool comparisons, use cases, and step-by-step resources for implementing CDC for databases like MySQL, PostgreSQL, SQL Server, and more.

Benefits of Change Data Capture

Change Data Capture (CDC) has become the preferred approach for modern data integration, especially when compared to legacy batch-based ETL and ELT pipelines. While alternative extraction methods still exist, few can match the efficiency, scalability, and precision that CDC offers.

Near Real-Time Visibility

CDC enables low-latency updates by capturing and delivering changes as they happen. This real-time synchronization is critical for use cases like fraud detection, inventory updates, and real-time analytics pipelines, where stale data can lead to poor decisions.

Minimal Load on Source Systems

Traditional batch extracts can place heavy strain on production databases. CDC, especially log-based CDC, typically adds only 1–3% additional load because it reads from transaction logs instead of querying the data directly. For a deep dive into low-impact CDC, see our PostgreSQL CDC guide.

High Scalability

As data volumes grow, batch windows get longer and harder to manage. In contrast, CDC streams changes continuously, handling high-throughput environments efficiently. Platforms like Estuary Flow are designed to scale with exactly-once guarantees and many-to-many syncs.

Full Change History (Inserts, Updates, Deletes)

Unlike batch jobs that only capture the current state, CDC captures the entire lifecycle of a record, including deletes, making it ideal for data replication and audit logging.

Simplified Backfill & Recovery

Originally designed for database recovery, CDC makes it easier to backfill historical data or recover from data loss without re-extracting everything. Tools like Debezium and Estuary Flow allow backfill-on-demand without disrupting real-time syncs.

How Does Change Data Capture Work?

Change Data Capture (CDC) works by continuously monitoring a source system for change events, such as row inserts, updates, or deletes, and transmitting those changes to a downstream system like a data warehouse, data lake, or real-time application.

Whether you're working with PostgreSQL, MySQL, or MongoDB, the CDC process typically includes three stages:

Detect and Capture Change Events: CDC identifies data changes in a source system, such as a database (e.g., PostgreSQL, MySQL, or MongoDB) or an application (e.g., Salesforce). Changes include new inserts, updates to existing data, and deletions.
Stream or Transform Data: The captured changes are streamed directly to the target (replication) or processed through transformations (streaming ETL). For example, transformations may involve filtering specific fields, standardizing data formats, or aggregating data.
Deliver Changes to Target Systems: Finally, changes are delivered to your destination—whether it’s a warehouse like Snowflake, a real-time engine like Kafka, or operational systems like Salesforce or Firestore. This ensures your targets are always in sync with the latest state of the source.
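
The three stages above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular CDC tool is built: the `ChangeEvent` type, the stage functions, and the `email` field are all hypothetical names chosen for the example, and the "target" is just a dictionary keyed by primary key.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str                 # "insert", "update", or "delete"
    key: str                # primary key of the changed row
    row: Optional[dict]     # new row values; None for deletes


def capture(source_changes: Iterable[ChangeEvent]) -> Iterator[ChangeEvent]:
    """Stage 1: detect and emit change events in source order."""
    yield from source_changes


def transform(events: Iterable[ChangeEvent]) -> Iterator[ChangeEvent]:
    """Stage 2: optional in-flight processing; here, standardize one field's format."""
    for event in events:
        if event.row is not None and "email" in event.row:
            row = {**event.row, "email": event.row["email"].strip().lower()}
            event = ChangeEvent(event.op, event.key, row)
        yield event


def deliver(events: Iterable[ChangeEvent], target: dict) -> dict:
    """Stage 3: apply each change so the target mirrors the source."""
    for event in events:
        if event.op == "delete":
            target.pop(event.key, None)
        else:
            target[event.key] = event.row
    return target


changes = [
    ChangeEvent("insert", "0001", {"name": "Joe", "email": " Joe@Example.com "}),
    ChangeEvent("update", "0001", {"name": "Joe", "email": "JOE@example.com"}),
    ChangeEvent("insert", "0002", {"name": "Ann", "email": "ann@example.com"}),
    ChangeEvent("delete", "0002", None),
]
target_state = deliver(transform(capture(changes)), {})
print(target_state)
```

Because the stages are chained generators, each event flows through capture, transform, and delivery one at a time, which is the streaming behavior described above rather than a batch hand-off.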

For a detailed look at how CDC behaves in end-to-end pipelines—including replication and transformations—see our CDC with Kafka and CDC with Salesforce guides.

Practical Use Case: Synchronizing Databases with Change Data Capture

Consider a company that manages customer records in PostgreSQL and uses Snowflake for data analytics. When a customer updates their address in PostgreSQL, the CDC mechanism detects the change event and captures it. This update is then processed (if needed) and sent to Snowflake. As a result, Snowflake reflects the latest customer record without requiring a full batch extract.

This is a simplified but common example of using CDC to synchronize an operational database with a data warehouse in near real-time.

Understanding CDC Latency

While CDC is often described as a real-time solution, actual latency—the delay between a change in the source system and its reflection in the target—varies depending on the CDC method and infrastructure.

Low latency means minimizing this delay to seconds or milliseconds.

For example:

A query-based CDC job running every 10 minutes will introduce up to 10 minutes of delay.
A log-based CDC pipeline, like those built with Debezium or Estuary Flow, often applies changes within seconds, sometimes sub-second.

CDC latency is influenced by:

The capture method (log-based being fastest)
Transformation complexity
Load on the target system
Network throughput and connector performance
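
To make the latency comparison concrete, here is a small back-of-the-envelope sketch. The function names and the 30-second job runtime are assumptions for illustration; the model simply says that a polled change can wait one full interval plus the time the next job takes to run, while end-to-end lag for any single change is the apply time minus the commit time.

```python
from datetime import datetime, timedelta


def worst_case_polling_delay(poll_interval: timedelta, job_runtime: timedelta) -> timedelta:
    """A change committed right after a poll begins waits one full interval,
    then the runtime of the next job that picks it up."""
    return poll_interval + job_runtime


def observed_lag(committed_at: datetime, applied_at: datetime) -> timedelta:
    """End-to-end latency for one change: target apply time minus source commit time."""
    return applied_at - committed_at


# Query-based job every 10 minutes, assuming each run takes 30 seconds.
polling = worst_case_polling_delay(timedelta(minutes=10), timedelta(seconds=30))

# A single streamed change, with made-up commit and apply timestamps.
streaming = observed_lag(
    datetime(2023, 2, 3, 15, 32, 11),
    datetime(2023, 2, 3, 15, 32, 12, 500000),
)
print(polling, streaming)
```

Tracking `observed_lag` per event (for example, as a percentile over time) is one simple way to check whether a pipeline actually meets the latency your use case needs.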

If your business requires real-time analytics, alerting, fraud detection, or dynamic dashboards, reducing CDC latency is essential. For a deeper dive into the infrastructure behind ultra-low-latency data movement, check out our CDC reference architecture.

An Analog Example of Change Data Capture: 50 First Dates

The 2004 movie 50 First Dates, starring Adam Sandler and Drew Barrymore, offers a surprisingly relatable analogy for understanding Change Data Capture (CDC).

In the film, Drew’s character suffers from severe amnesia and cannot retain new memories beyond a single day. Determined to help her, Adam’s character creates a videotape summarizing her life, including daily news and milestones. Each day, he adds updates to the end of the tape so she always wakes up with an accurate and “stateful” understanding of her life.

This process mirrors the principles of CDC:

The videotape acts as a continuous log or database, while the daily updates represent append-only change events.
Instead of re-creating the tape from scratch each day (similar to a batch extract), Adam appends only the new changes.
The updates are recorded in exact order, ensuring historical accuracy and enabling backfill for Drew’s memory.

The videotape remains intact, allowing it to be shared or accessed by others if needed—much like a scalable, append-only log in CDC systems.

While this analogy isn’t perfect, it highlights the fundamental efficiency of CDC compared to traditional methods. Rather than repeatedly extracting the full dataset, CDC focuses on capturing and propagating incremental changes with precision.

For enterprises with growing data volumes or time-sensitive applications, this efficient approach is transformative. Let’s explore what implementing CDC correctly can mean for your organization.

Methods of Change Data Capture

There are three primary ways to implement Change Data Capture (CDC), each with trade-offs in performance, accuracy, and complexity:

Query-Based CDC
Trigger-Based CDC
Log-Based CDC (the gold standard for scalability and latency)

Your choice depends on your system’s scale, latency needs, and operational constraints. Let’s break each one down.

Query-Based Change Data Capture

This method involves running SQL queries on the source database to identify changes. Often referred to as "polling," it works by scheduling recurring queries that check for data changes since the last run.

To implement this approach, you’ll need an additional column in the source tables to track when each record was last modified (e.g., a time_updated or versioning column). A CRON job or a similar scheduler is then configured to run the query at regular intervals.

Example Table Schema:

id	firstname	lastname	address	time_updated
0001	Joe	Shmoe	123 Main St	2023-02-03 15:32:11

To find changes to the table, you’d run a query against it to select records where the timestamp value is greater than the time at which you last ran the query. For example, if you last queried the customers table at 7 AM on February 1:

SELECT * FROM customers WHERE time_updated > '2023-02-01 07:00:00';

Typically, you’d configure a recurring query at a standard time interval. The selected records would be used to update the target at that same cadence.
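
Here is a minimal sketch of that recurring query, using Python's built-in sqlite3 module as a stand-in for the source database. The `poll_changes` function and the second sample row are assumptions added for illustration; the table and the cutoff timestamp follow the example above. In practice, a scheduler would call the function at each interval and persist the returned watermark between runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id TEXT PRIMARY KEY, firstname TEXT, lastname TEXT, "
    "address TEXT, time_updated TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        ("0001", "Joe", "Shmoe", "123 Main St", "2023-02-03 15:32:11"),
        ("0002", "Ann", "Smith", "9 Oak Ave", "2023-01-15 09:00:00"),
    ],
)


def poll_changes(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, firstname, lastname, address, time_updated "
        "FROM customers WHERE time_updated > ? ORDER BY time_updated",
        (last_seen,),
    ).fetchall()
    # Advance the watermark only as far as the newest row actually returned.
    new_mark = rows[-1][4] if rows else last_seen
    return rows, new_mark


changed, watermark = poll_changes(conn, "2023-02-01 07:00:00")
print(changed)
print(watermark)
```

Using the newest returned `time_updated` as the next cutoff, rather than the wall-clock time of the run, avoids skipping rows that were committed while the query was executing. Rows that share the exact watermark timestamp can still be missed with a strict `>` comparison, so treat this as a sketch rather than a complete solution.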

Advantages of Query-Based CDC:

Easy to implement with basic SQL queries.
Requires only read permissions on the source database.
Suitable for smaller, slower OLTP databases where a full audit trail is unnecessary.

Disadvantages of Query-Based CDC:

Requires schema modification: Adding a new column for timestamps may not be feasible in all cases.
Cannot detect deletions: A hard delete in the source system is missed unless soft deletes are used, which can bloat the database.
Higher latency: Changes are captured at intervals, not in real-time.
Performance issues: Polling large tables frequently can overload the source system, especially as data grows.
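
The deletion problem is easy to reproduce. The sketch below, again using sqlite3 as a stand-in source, hard-deletes one row and soft-deletes another, then runs the polling query. The `is_deleted` column is a hypothetical soft-delete flag added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id TEXT PRIMARY KEY, address TEXT, time_updated TEXT, "
    "is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO customers (id, address, time_updated) VALUES (?, ?, ?)",
    [
        ("0001", "123 Main St", "2023-02-01 06:00:00"),
        ("0002", "9 Oak Ave", "2023-02-01 06:00:00"),
    ],
)

POLL = "SELECT id, is_deleted FROM customers WHERE time_updated > ? ORDER BY id"
last_run = "2023-02-01 07:00:00"

# Hard delete: the row is gone, so nothing is left for the poll to match.
conn.execute("DELETE FROM customers WHERE id = '0001'")

# Soft delete: the row stays, flagged and re-stamped, so the poll sees it.
conn.execute(
    "UPDATE customers SET is_deleted = 1, time_updated = ? WHERE id = '0002'",
    ("2023-02-01 08:00:00",),
)

seen = conn.execute(POLL, (last_run,)).fetchall()
print(seen)
```

Only customer 0002 shows up in the poll. The target never learns that 0001 was removed, and the soft-deleted row remains in the source table, which is the bloat trade-off noted above.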

Want a walkthrough? See our SQL CDC guide for more examples.

Trigger-Based Change Data Capture

This method uses database triggers to record changes in an audit or shadow table. Triggers are predefined functions in databases like PostgreSQL, MySQL, and SQL Server that execute whenever a specific event (e.g., an INSERT, UPDATE, or DELETE) occurs.

Triggers are commonly used for tasks such as:

Complex data validation (for example, when a new row is created in table customers, a new row should also exist in table orders).
Keeping an audit log of changes.
Handling database errors.

However, trigger-based CDC has significant limitations for high-velocity environments. The additional overhead created by triggers can impact the source database's performance, making it unsuitable for large-scale implementations.
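
As a rough illustration of the pattern, the following sketch creates three triggers that copy every change on a table into a shadow table. It uses SQLite trigger syntax through Python's sqlite3 module; trigger syntax differs in PostgreSQL, MySQL, and SQL Server, and the `customers_audit` table and trigger names are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id TEXT PRIMARY KEY, address TEXT);

    -- Shadow table: one row per change, in the order the triggers fired.
    CREATE TABLE customers_audit (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT NOT NULL,
        id TEXT NOT NULL,
        address TEXT
    );

    CREATE TRIGGER customers_ins AFTER INSERT ON customers
    BEGIN
        INSERT INTO customers_audit (op, id, address)
        VALUES ('INSERT', NEW.id, NEW.address);
    END;

    CREATE TRIGGER customers_upd AFTER UPDATE ON customers
    BEGIN
        INSERT INTO customers_audit (op, id, address)
        VALUES ('UPDATE', NEW.id, NEW.address);
    END;

    CREATE TRIGGER customers_del AFTER DELETE ON customers
    BEGIN
        INSERT INTO customers_audit (op, id, address)
        VALUES ('DELETE', OLD.id, OLD.address);
    END;
    """
)

conn.execute("INSERT INTO customers VALUES ('0001', '123 Main St')")
conn.execute("UPDATE customers SET address = '456 Elm St' WHERE id = '0001'")
conn.execute("DELETE FROM customers WHERE id = '0001'")

audit = conn.execute(
    "SELECT op, id, address FROM customers_audit ORDER BY seq"
).fetchall()
print(audit)
```

Note that every write to `customers` now causes a second write to `customers_audit`. That doubling is the source-side overhead described above, and something still has to read the shadow table and ship its rows downstream.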

Advantages of Trigger-Based CDC:

Real-time: Triggers capture changes immediately.
Complete data capture: Includes delete events.
Supports metadata: For example, you can track which statement caused the change event.

Disadvantages of Trigger-Based CDC:

Performance impact: Triggers increase write operations on the database, which can slow down high-frequency transactions.
Complexity in data extraction: Moving data from the shadow table to a target system adds latency and complexity.
Scalability issues: Managing multiple triggers across several tables can become unmanageable.

Note: While triggers were once popular for CDC, they are rarely used at scale today due to performance concerns.

Log-Based Change Data Capture

Log-based CDC reads changes directly from the write-ahead log (WAL) or equivalent (e.g., binlog in MySQL). These logs are maintained by the database to ensure transactional integrity and serve as a reliable source for capturing data changes.

This method is widely regarded as the most efficient and scalable CDC approach because it:

Captures every change (inserts, updates, deletes) in the exact order they occur.
Minimizes latency, often processing changes in milliseconds.
Places minimal load on the source database by reading logs rather than executing queries or triggers.

Because of these three benefits, WAL-based CDC has become the most popular approach for change data capture.

Log-based CDC requires an additional component to process and capture change events from the WAL (unlike queries and triggers, which leverage native capabilities of the database). All three methods will also require some way to deliver events. Usually this is in the form of a message broker or other form of streaming with source and target connectors.
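
To show the shape of that additional component, here is a toy model in Python. It does not read a real WAL or binlog and does not use any database API: `ToyLog` is a hypothetical append-only list standing in for the log, `position` is a stand-in for a log sequence number, and `capture_once` plays the role of the connector that remembers how far it has read.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LogRecord:
    position: int          # monotonically increasing log position (toy stand-in for an LSN)
    op: str                # "insert", "update", or "delete"
    key: str
    row: Optional[dict]


class ToyLog:
    """Append-only list of committed changes, standing in for a WAL or binlog."""

    def __init__(self) -> None:
        self._records: List[LogRecord] = []

    def append(self, op: str, key: str, row: Optional[dict] = None) -> None:
        self._records.append(LogRecord(len(self._records) + 1, op, key, row))

    def read_after(self, position: int) -> List[LogRecord]:
        # Positions are 1-based and dense, so slicing at `position`
        # skips exactly the records that were already read.
        return self._records[position:]


def capture_once(log: ToyLog, checkpoint: int) -> Tuple[List[LogRecord], int]:
    """Read everything past the checkpoint; return the records and the new checkpoint."""
    records = log.read_after(checkpoint)
    new_checkpoint = records[-1].position if records else checkpoint
    return records, new_checkpoint


log = ToyLog()
log.append("insert", "0001", {"address": "123 Main St"})
log.append("update", "0001", {"address": "456 Elm St"})

batch1, checkpoint = capture_once(log, 0)
log.append("delete", "0001")
batch2, checkpoint = capture_once(log, checkpoint)

print([r.op for r in batch1], [r.op for r in batch2], checkpoint)
```

The reader never queries the table itself. It only follows the log from its last checkpoint, which is why this approach preserves ordering, sees deletes, and can resume after a restart without re-reading the source data.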

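Implementation and configuration vary a bit, as each database type has its own version of the write-ahead log. For Postgres or MySQL, log-based CDC roughly looks like this: the database records each committed change in its log, a CDC connector reads those change events from the log, and a message broker or streaming layer delivers them to the target systems.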

Log-based CDC is the gold standard of change data capture implementations because it captures all events, in exact order, in near real-time, while adding very little load to the source database. That's why most enterprise-scale solutions offering managed CDC include a connector that reads directly from the write-ahead log.

Building an end-to-end change data capture pipeline is complicated, especially at any reasonable scale and low latency. You are better off leveraging existing proprietary offerings or extending open source than trying to build your own.

Advantages of Log-Based CDC:

Accurate and complete data: Captures every event, including deletions, with guaranteed ordering.
Low latency: Real-time data updates with near-zero delays.
Minimal impact on the source system: Only reads from the log files without affecting database performance.

Disadvantages of Log-Based CDC:

Complex implementation: Requires specialized connectors and expertise to set up.
Permissions: Accessing the WAL or equivalent may require elevated permissions, which can complicate implementation.

For enterprise applications, leveraging tools like Estuary Flow, Debezium, or other managed log-based CDC platforms can simplify implementation while maximizing efficiency.

Want to learn more? Gunnar Morling (lead on the Debezium project, who has likely written more about CDC than any human alive) summed up the differences between query-based and log-based methods here.

Batch vs Real-Time CDC: Latency Matters

Data changes can be captured in two ways:

Batch Process:
1. Changes are collected and processed in bulk at scheduled intervals (e.g., once daily or hourly).
2. Latency depends on the frequency of batch runs.
3. Suitable for use cases where real-time updates aren’t critical and cost savings are prioritized.
Real-Time (Streaming) Process:
1. Changes are detected and streamed as soon as they occur, ensuring minimal latency.
2. Ideal for scenarios requiring low-latency updates, such as real-time analytics or operational reporting.

Key Difference: Real-time CDC processes changes as they occur, typically within seconds, while batch processes introduce delays tied to their schedule, making them less suitable for time-sensitive applications.

When Should You Use CDC?

Modern Change Data Capture implementations increasingly rely on real-time pipelines. With log sources like the PostgreSQL WAL and MySQL binlog, and destinations such as Snowflake, BigQuery, and Databricks, real-time CDC lets downstream systems react to change events within seconds. This reduces latency and improves the efficiency of downstream systems.

Here’s why real-time CDC is indispensable for modern operations:

E-commerce: A customer places an order for an item that appears in stock but is actually sold out. The website, out of sync with the database, displays outdated inventory data, leading to a poor customer experience.
Financial services: A user makes a decision based on outdated information about their bank balance, potentially leading to financial mistakes.
Fraud detection: Real-time CDC can identify suspicious transactions and flag them instantly. Without it, fraudulent activity may go unnoticed until it’s too late.
Logistics: Delays in shipment updates prevent companies from optimizing delivery routes or making informed decisions about new orders, leading to customer dissatisfaction and inefficiencies.

In these scenarios, batch CDC introduces latency that could disrupt operations, increase costs, and degrade user experiences. Real-time CDC, by contrast, ensures data accuracy and timeliness, making it the preferred choice for fast-paced, data-driven businesses.

When Might CDC Be Overkill?

While log-based Change Data Capture (CDC) is often the preferred method for extracting and replicating data in real-time, there are scenarios where implementing CDC might not be the best choice. Consider avoiding CDC in the following situations:

Low Priority for Real-Time Data Feeds: If your organization doesn’t require low-latency data updates and can operate effectively with periodic data refreshes, the complexity and cost of CDC may not justify its implementation.
Small Data Volumes with Minimal Workload Stress: For small tables or databases with minimal workloads, batch processes may suffice. In such cases, the overhead of setting up CDC may not provide enough value to outweigh the effort.
Restricted Data Access: If policies or regulations (e.g., HIPAA, PII compliance) restrict direct access to source data, implementing CDC may become challenging. While obfuscation or data masking rules can be applied during transformations, these must comply with your organization's data governance policies.

Common Reasons for Choosing Not to Use CDC

Historically, teams have opted against implementing CDC for the following reasons:

Reliance on Legacy Batch Processes: Many organizations still depend on legacy systems that support batch data extraction and don't require real-time updates. Transitioning to CDC may seem unnecessary if existing processes meet business needs.
Complexity of Implementation: Configuring CDC pipelines can require significant effort, especially for teams unfamiliar with database logs or streaming platforms. For smaller use cases, batch methods often prove simpler and "good enough."
Limited Access to Source Logs: Some teams lack access to the write-ahead log (WAL) or equivalent log files in their databases, relying only on specific views or tables. This limitation makes it difficult to implement log-based CDC.

A Reddit thread in May of 2023 captures the above well. You can read the full discussion here.

As a final note here, if you're building a new application, it might make more sense to make it stream-native directly from production to the new application. This can be done by, say, writing to Kafka, Kinesis, or Gazette.

Change Data Capture Alternatives

With dozens of CDC tools on the market, it’s easy to get lost in feature checklists. But here’s what really matters when selecting a CDC platform:

Low latency with real-time updates
Exactly-once delivery to avoid duplication or data loss
Support for complex transforms in-flight (not just extract & load)
Support for backfill and schema changes without rework
Multiple destinations from a single source (many-to-many sync)

Most tools offer one or two of these. Estuary Flow offers them all, plus open-source flexibility and transparent pricing.

Tool / Platform	Summary	Pricing Model	Supported Databases
Estuary Flow	Fully managed + open-source log-based CDC with real-time transforms in SQL & TypeScript, exactly-once delivery, backfill, and many-to-many sync.	$0.50/GB of data moved + $100/connector/month. Open-source self-hosted option available.	PostgreSQL, MySQL, MongoDB, SQL Server, DynamoDB, Firestore
Debezium	Popular open-source log-based CDC framework built for Kafka ecosystems.	Free (Apache 2.0 License)	PostgreSQL, MySQL, MongoDB
Fivetran	Fully managed batch and log-based connectors. Minimum 5-min syncs.	Usage-based pricing per active row & destination. Free tier available.	PostgreSQL, SQL Server, MySQL, Oracle, Snowflake
Striim	Real-time log-based CDC platform with GUI, monitoring, and Azure integration.	Starts at $100 per million events; compute-based tiers.	PostgreSQL, MySQL, SQL Server, Oracle
Airbyte	Open-source batch CDC; real-time CDC in beta for select sources.	Free self-hosted; Cloud plans with usage-based pricing.	PostgreSQL, MySQL, MongoDB, SQL Server
Confluent	Managed Kafka platform with Debezium-based CDC connectors.	Pay-as-you-go or enterprise contracts (based on compute/storage).	PostgreSQL, MySQL, MongoDB

How to implement change data capture step-by-step

There are several options for implementing change data capture from different databases and applications to various targets. For those ready to implement CDC, here are some detailed guides to get started with specific databases and targets:

Additionally, follow these Best Practices for Implementing Change Data Capture to optimize your implementation and avoid common pitfalls.

Why Consider Estuary for CDC?

Estuary Flow is a no-code platform designed to simplify real-time Change Data Capture (CDC) and streaming ETL. Built on the powerful Gazette framework, it extends the capabilities of log-based CDC with unmatched flexibility and ease of use.

Key Benefits of Estuary Flow:

Many-to-Many Pipelines: Seamlessly move the same data to multiple targets for diverse use cases.
Real-Time and Batch Transforms: Use SQL and TypeScript for custom compute in both streaming and batch modes.
Historical Backfill: Add new targets and backfill historical data without re-extracting from sources.
Exactly-Once Processing: Leverage Gazette’s unique semantics to ensure accurate, deduplicated data delivery.
Massive Scalability: Handle high-change environments with true elastic scale and decoupled storage-compute architecture.
Schema Drift Support: Validate and evolve schemas without interrupting your pipeline.

Estuary Flow combines the reliability of log-based CDC with the simplicity of a no-code interface, making it a standout solution for modern data integration needs.

Ready to explore Estuary Flow? Try Flow for free here!

We welcome your questions, comments, and (friendly!) debate. Find our team on Slack.

Conclusion

Change Data Capture isn’t just a modern alternative to batch ETL—it’s a foundational capability for organizations that need to move fast, stay in sync, and make decisions in real time.

Whether you’re syncing operational databases with analytical platforms, feeding dashboards with fresh data, or enabling machine learning pipelines, CDC helps you avoid stale, duplicated, or missing data.

And while building your own CDC system is possible, it’s rarely worth the effort.

Tools like Estuary Flow make log-based, real-time CDC accessible to any team, combining power, flexibility, and simplicity in one platform.

Want to experience real-time data pipelines without writing connectors, managing Kafka, or worrying about duplicates? Try Estuary Flow for free.

FAQs

1. What is real-time CDC vs batch?

Real-time CDC captures and delivers changes instantly (seconds or less). Batch systems collect and process changes in bulk at set intervals (minutes or hours), leading to delays and stale data.

2. Does CDC support exactly-once delivery?

Some tools like Estuary Flow offer exactly-once guarantees. Others, like Debezium, provide at-least-once delivery and require deduplication downstream.

3. Is CDC suitable for cloud-native databases?

Yes—many modern platforms (e.g., PostgreSQL on AWS/GCP, Snowflake, MongoDB Atlas) support CDC. Tools like Estuary Flow provide native integrations for these environments.

4. Can I use CDC for multiple targets?

Yes. Advanced platforms like Estuary Flow support many-to-many delivery, allowing you to sync the same data to multiple destinations simultaneously (e.g., Snowflake + Kafka + Firestore).
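
On question 2, here is a minimal sketch of what downstream deduplication can look like when a pipeline delivers at least once. It assumes each event carries a monotonically increasing `position` from a single source and that events arrive in order apart from replays; the `apply_idempotently` function and the event tuple layout are hypothetical, not the API of any tool named above.

```python
from typing import Dict, Iterable, Optional, Tuple

# (position, op, key, row): position is a hypothetical per-source sequence number.
Event = Tuple[int, str, str, Optional[dict]]


def apply_idempotently(
    events: Iterable[Event],
    target: Dict[str, dict],
    applied_up_to: int = 0,
) -> int:
    """Apply events in order, skipping any whose position was already applied."""
    for position, op, key, row in events:
        if position <= applied_up_to:
            continue  # redelivered duplicate; applying it again could undo newer state
        if op == "delete":
            target.pop(key, None)
        else:
            target[key] = row
        applied_up_to = position
    return applied_up_to


target: Dict[str, dict] = {}
delivered = [
    (1, "insert", "0001", {"address": "123 Main St"}),
    (2, "update", "0001", {"address": "456 Elm St"}),
    (1, "insert", "0001", {"address": "123 Main St"}),  # replayed after a retry
    (3, "insert", "0002", {"address": "9 Oak Ave"}),
]
last_applied = apply_idempotently(delivered, target)
print(target, last_applied)
```

Without the position check, the replayed insert would overwrite the newer address for customer 0001. Storing `last_applied` durably alongside the target data is what lets a consumer pick up safely after a restart.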