What is Change Data Capture?
Change data capture (CDC) is the process of identifying and capturing changes made to a database in real-time and delivering them downstream to one or more destinations. CDC ensures that data remains synchronized across systems with minimal latency.
Benefits of Change Data Capture
Change Data Capture (CDC) has emerged as a preferred method for data extraction, surpassing the traditional batch-based techniques used first in ETL pipelines and later in ELT pipelines. While alternative data extraction methods exist, none match the efficiency and advantages offered by CDC.
- Near real-time visibility - CDC provides the fastest and lowest-latency method for keeping data synchronized across systems. By capturing and propagating changes as they occur, CDC ensures real-time updates, making it ideal for use cases like database replication and real-time analytics.
- Lowest load on source systems - Unlike batch extracts, which can strain source systems during scheduled runs, CDC typically adds only 1-3% additional load. This lightweight impact makes CDC an optimal choice for maintaining system performance while ensuring data consistency. Advances in CDC tooling have also steadily reduced the corresponding load on target systems.
- Improved scalability - Batch pipelines often struggle to keep up as data volumes grow, with extraction times exceeding batch windows. Because CDC streams data continuously around the clock, at rates faster than the database's transaction throughput, it scales effectively and handles even the most demanding workloads.
- Preserves all changes - Batch processes only capture the current state of data, potentially overwriting previous changes. CDC, on the other hand, records every change—including updates and deletions—ensuring a comprehensive history of all events. This level of detail is critical for applications requiring full audit trails or historical analysis.
- Data backfill and recovery - Originally designed for database recovery, CDC simplifies accessing the full history of changes. This makes it easier to backfill missing data or recover from errors, significantly reducing downtime and improving operational resilience.
The few challenges traditionally associated with CDC—such as cost and implementation complexity—have been addressed by modern tools and platforms. With these advancements, CDC is now more accessible than ever, offering organizations a seamless way to stay ahead in the data-driven landscape.
How Does Change Data Capture Work?
Change Data Capture (CDC) works by recognizing change events in a source system, capturing them, and then streaming or batching these changes to a target system in near real-time. The goal is to ensure the target system reflects the latest data state with minimal latency and reduced load on the source database compared to traditional methods.
Typically, CDC involves three main steps:
- Detect and Capture Change Events: CDC identifies data changes in a source system, such as a database (e.g., PostgreSQL, MySQL, or MongoDB) or an application (e.g., Salesforce). Changes include new inserts, updates to existing data, and deletions.
- Stream or Transform Data: The captured changes are streamed directly to the target (replication) or processed through transformations (streaming ETL). For example, transformations may involve filtering specific fields, standardizing data formats, or aggregating data.
- Deliver Changes to Target Systems: The transformed or raw changes are sent to one or more destinations, such as data warehouses (e.g., Snowflake), data lakes, or operational applications. This ensures the downstream systems always have up-to-date information.
Practical Use Case: Synchronizing Databases with Change Data Capture
Consider a company that manages customer records in PostgreSQL and uses Snowflake for data analytics. When a customer updates their address in PostgreSQL, the CDC mechanism detects the change event and captures it. This update is then processed (if needed) and sent to Snowflake. As a result, Snowflake reflects the latest customer record without requiring a full batch extract.
This is a simplified but common example of using CDC to synchronize an operational database with a data warehouse in near real-time.
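As a rough sketch, here's what the two ends of that pipeline might look like in SQL. The schema and column names are illustrative, and a real pipeline would batch many change events together rather than apply them one at a time:

```sql
-- 1) The change event in the PostgreSQL source
UPDATE customers SET address = '456 Oak Ave' WHERE id = '0001';

-- 2) The kind of upsert a CDC pipeline might apply in Snowflake
MERGE INTO analytics.customers AS t
USING (SELECT '0001' AS id, '456 Oak Ave' AS address) AS s
    ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.address = s.address
WHEN NOT MATCHED THEN INSERT (id, address) VALUES (s.id, s.address);
```

Only the changed row crosses the pipeline; the rest of the table is never re-extracted.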
Low Latency in CDC Pipelines
While CDC is often marketed as a real-time data solution, the actual latency depends on the method used. "Low latency" refers to the time it takes for a change in the source system to appear in the target system: log-based CDC can propagate changes in milliseconds to seconds, while query-based polling adds delay equal to its polling interval.
An Analog Example of Change Data Capture: 50 First Dates
The 2004 movie 50 First Dates, starring Adam Sandler and Drew Barrymore, offers a surprisingly relatable analogy for understanding Change Data Capture (CDC).
In the film, Drew’s character suffers from severe amnesia and cannot retain new memories beyond a single day. Determined to help her, Adam’s character creates a videotape summarizing her life, including daily news and milestones. Each day, he adds updates to the end of the tape so she always wakes up with an accurate and “stateful” understanding of her life.
This process mirrors the principles of CDC:
- The videotape acts as a continuous log or database, while the daily updates represent append-only change events.
- Instead of re-creating the tape from scratch each day (similar to a batch extract), Adam appends only the new changes.
- The updates are recorded in exact order, ensuring historical accuracy and enabling backfill for Drew’s memory.
- The videotape remains intact, allowing it to be shared or accessed by others if needed—much like a scalable, append-only log in CDC systems.
While this analogy isn’t perfect, it highlights the fundamental efficiency of CDC compared to traditional methods. Rather than repeatedly extracting the full dataset, CDC focuses on capturing and propagating incremental changes with precision.
For enterprises with growing data volumes or time-sensitive applications, this efficient approach is transformative. Let’s explore what implementing CDC correctly can mean for your organization.
Change Data Capture Methods
There are three primary methods for implementing Change Data Capture (CDC):
- Query-Based Change Data Capture
- Trigger-Based Change Data Capture
- Log-Based Change Data Capture (preferred method)
Each method has its own advantages and drawbacks, depending on the scale and requirements of your system.
What is Query-Based Change Data Capture?
This method involves running SQL queries on the source database to identify changes. Often referred to as "polling," it works by scheduling recurring queries that check for data changes since the last run.
To implement this approach, you'll need an additional column in the source tables to track when each record was last modified (e.g., a `time_updated` or versioning column). A CRON job or similar scheduler is then configured to run the query at regular intervals.
Example Table Schema:
| id | firstname | lastname | address | time_updated |
|---|---|---|---|---|
| 0001 | Joe | Shmoe | 123 Main St | 2023-02-03 15:32:11 |
To find changes to the table, you’d run a query against it to select records where the timestamp value is greater than the time at which you last ran the query. For example, if you last queried the customers table at 7 AM on February 1:
```sql
SELECT * FROM customers WHERE time_updated > '2023-02-01 07:00:00';
```
Typically, you’d configure a recurring query at a standard time interval. The selected records would be used to update the target at that same cadence.
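In practice, you also need to persist the "high-water mark" between runs so each poll picks up where the previous one left off. A minimal sketch of that pattern, with illustrative table and column names:

```sql
-- Bookkeeping table storing the last successful extraction point per table
CREATE TABLE cdc_watermark (
    table_name        text PRIMARY KEY,
    last_extracted_at timestamp NOT NULL
);

-- Each run: pull only the rows changed since the stored watermark
SELECT c.*
FROM customers c
JOIN cdc_watermark w ON w.table_name = 'customers'
WHERE c.time_updated > w.last_extracted_at;

-- After a successful load, advance the watermark. Using the max
-- time_updated actually extracted (rather than now()) avoids skipping
-- rows committed while the query was running.
UPDATE cdc_watermark
SET last_extracted_at = '2023-02-01 07:00:00'  -- max(time_updated) from this batch
WHERE table_name = 'customers';
```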
Advantages of Query-Based CDC:
- Easy to implement with basic SQL queries.
- Requires only read permissions on the source database.
- Suitable for smaller, slower OLTP databases where a full audit trail is unnecessary.
Disadvantages of Query-Based CDC:
- Requires schema modification: Adding a new column for timestamps may not be feasible in all cases.
- Cannot detect deletions: A hard delete in the source system is missed unless soft deletes are used, which can bloat the database.
- Higher latency: Changes are captured at intervals, not in real-time.
- Performance issues: Polling large tables frequently can overload the source system, especially as data grows.
What is Trigger-Based Change Data Capture?
This method uses database triggers to record changes in an audit or shadow table. Triggers are predefined functions in databases like PostgreSQL, MySQL, and SQL Server that execute whenever a specific event (e.g., an `INSERT`, `UPDATE`, or `DELETE`) occurs.
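Here's a minimal sketch of the pattern in PostgreSQL, with illustrative names: a shadow table plus a trigger function that records every row-level change.

```sql
-- Shadow table holding one row per change event
CREATE TABLE customers_audit (
    audit_id   bigserial   PRIMARY KEY,
    operation  text        NOT NULL,              -- 'INSERT', 'UPDATE', or 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now(),
    row_data   jsonb       NOT NULL               -- full image of the changed row
);

-- Trigger function that writes each change into the shadow table
CREATE OR REPLACE FUNCTION capture_customer_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customers_audit (operation, row_data) VALUES (TG_OP, to_jsonb(OLD));
        RETURN OLD;
    ELSE
        INSERT INTO customers_audit (operation, row_data) VALUES (TG_OP, to_jsonb(NEW));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Fire on every row-level change to the source table
CREATE TRIGGER customers_cdc
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_change();
```

Note that every write to `customers` now performs a second write to the audit table, which is exactly the overhead discussed below.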
Triggers are commonly used for tasks such as:
- Complex data validation (for example, ensuring that when a new row is created in the `customers` table, a corresponding row exists in the `orders` table).
- Keeping an audit log of changes.
- Handling database errors.
However, trigger-based CDC has significant limitations for high-velocity environments. The additional overhead created by triggers can impact the source database's performance, making it unsuitable for large-scale implementations.
Advantages of Trigger-Based CDC:
- Real-time: Triggers capture changes immediately.
- Complete data capture: includes delete events.
- Supports metadata: For example, you can track which statement caused the change event.
Disadvantages of Trigger-Based CDC:
- Performance impact: Triggers increase write operations on the database, which can slow down high-frequency transactions.
- Complexity in data extraction: Moving data from the shadow table to a target system adds latency and complexity.
- Scalability issues: Managing multiple triggers across several tables can become unmanageable.
Note: While triggers were once popular for CDC, they are rarely used at scale today due to performance concerns.
What is Log-Based Change Data Capture?
Log-based CDC reads changes directly from the write-ahead log (WAL) or equivalent (e.g., binlog in MySQL). These logs are maintained by the database to ensure transactional integrity and serve as a reliable source for capturing data changes.
This method is widely regarded as the most efficient and scalable CDC approach because it:
- Captures every change (inserts, updates, deletes) in the exact order they occur.
- Minimizes latency, often processing changes in milliseconds.
- Places minimal load on the source database by reading logs rather than executing queries or triggers.
Because of these three benefits, WAL-based CDC has become the most popular approach for change data capture.
Log-based CDC requires an additional component to process and capture change events from the WAL (unlike queries and triggers, which leverage native capabilities of the database). All three methods also require some way to deliver events, usually a message broker or other streaming system with source and target connectors.
Implementation and configuration vary a bit, as each database type has its own version of the write-ahead log: Postgres exposes its WAL through logical decoding, while MySQL exposes changes through the binlog.
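As an illustration of the raw building blocks, PostgreSQL's built-in logical decoding functions can be driven by hand; CDC connectors automate this loop and add delivery guarantees. The slot name here is illustrative, and the database must run with `wal_level = logical`:

```sql
-- Create a replication slot using the built-in test_decoding output plugin
SELECT * FROM pg_create_logical_replication_slot('cdc_demo_slot', 'test_decoding');

-- Make a change so there is something in the log
UPDATE customers SET address = '456 Oak Ave' WHERE id = '0001';

-- Peek at pending change events without consuming them
SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('cdc_demo_slot', NULL, NULL);

-- Consume the changes, advancing the slot past them
SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_demo_slot', NULL, NULL);

-- Clean up; an abandoned slot prevents WAL from being recycled
SELECT pg_drop_replication_slot('cdc_demo_slot');
```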
Log-based CDC is the gold standard of change data capture implementations because it captures all events, in exact order, in real-time, while adding minimal load to the source database. That's why most enterprise-scale solutions offering managed CDC include a connector that reads directly from the write-ahead log.
Building an end-to-end change data capture pipeline is complicated, especially at any reasonable scale and low latency. You are better off leveraging existing proprietary offerings or extending open source than trying to build your own.
Advantages of Log-Based CDC:
- Accurate and complete data: Captures every event, including deletions, with guaranteed ordering.
- Low latency: Real-time data updates with near-zero delays.
- Minimal impact on the source system: Only reads from the log files without affecting database performance.
Disadvantages of Log-Based CDC:
- Complex implementation: Requires specialized connectors and expertise to set up.
- Permissions: Accessing the WAL or equivalent may require elevated permissions, which can complicate implementation.
For enterprise applications, leveraging tools like Estuary Flow, Debezium, or other managed log-based CDC platforms can simplify implementation while maximizing efficiency.
Want to learn more? Gunnar Morling (lead on the Debezium project, who has likely written more about CDC than anyone alive) summed up the differences between query-based and log-based methods here.
What Isn't Change Data Capture?
Not every method of extracting data qualifies as Change Data Capture (CDC). For instance, running a nightly cron job that executes a `SELECT *` query to extract the entire database snapshot in batch mode does not count as CDC.
While this method reflects changes in the downstream system, it doesn’t capture individual change events. Instead, it extracts the current state of the data at a specific point in time. This approach has several drawbacks:
- Missed Changes: Any data updates overwritten between two batch runs are lost, as batch processes don’t track intermediate changes.
- High Latency: Batch jobs introduce significant delays in propagating updates to downstream systems.
- Increased System Load: Extracting the full dataset repeatedly adds considerable overhead to the source system.
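A small example makes the first drawback concrete. Suppose the same row changes twice between two nightly snapshot runs (illustrative table and values):

```sql
-- 10:00: first change after last night's snapshot
UPDATE orders SET status = 'shipped' WHERE id = 42;

-- 16:00: second change, overwriting the first
UPDATE orders SET status = 'delivered' WHERE id = 42;

-- Tonight's batch extract sees only the final state ('delivered'),
-- so the intermediate 'shipped' event never reaches downstream systems.
SELECT * FROM orders WHERE id = 42;
```

Log-based CDC would have emitted both change events, in order.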
Batch vs Real-Time CDC: Latency Matters
Data changes can be captured in two ways:
- Batch Process:
  - Changes are collected and processed in bulk at scheduled intervals (e.g., once daily or hourly).
  - Latency depends on the frequency of batch runs.
  - Suitable for use cases where real-time updates aren't critical and cost savings are prioritized.
- Real-Time (Streaming) Process:
  - Changes are detected and streamed as soon as they occur, ensuring minimal latency.
  - Ideal for scenarios requiring low-latency updates, such as real-time analytics or operational reporting.
Key Difference: Real-time CDC processes changes as they occur, while batch processes introduce delays, making them less suitable for time-sensitive applications.
When to Use Change Data Capture
Modern Change Data Capture implementations increasingly rely on real-time pipelines. With technologies like PostgreSQL WAL, MySQL binlog, and tools such as Snowflake, BigQuery, and Databricks, real-time CDC ensures instant reaction to change events. This reduces latency and improves the efficiency of downstream systems.
Here's why real-time CDC is indispensable for modern operations. Consider what happens when data lags:
- E-commerce: A customer places an order for an item that appears in stock but is actually sold out. The website, out of sync with the database, displays outdated inventory data, leading to a poor customer experience.
- Financial services: A user makes a decision based on outdated information about their bank balance, potentially leading to financial mistakes.
- Fraud detection: Real-time CDC can identify suspicious transactions and flag them instantly. Without it, fraudulent activity may go unnoticed until it’s too late.
- Logistics: Delays in shipment updates prevent companies from optimizing delivery routes or making informed decisions about new orders, leading to customer dissatisfaction and inefficiencies.
In these scenarios, batch CDC introduces latency that could disrupt operations, increase costs, and degrade user experiences. Real-time CDC, by contrast, ensures data accuracy and timeliness, making it the preferred choice for fast-paced, data-driven businesses.
When Not to Use Change Data Capture?
While log-based Change Data Capture (CDC) is often the preferred method for extracting and replicating data in real-time, there are scenarios where implementing CDC might not be the best choice. Consider avoiding CDC in the following situations:
- Low Priority for Real-Time Data Feeds: If your organization doesn’t require low-latency data updates and can operate effectively with periodic data refreshes, the complexity and cost of CDC may not justify its implementation.
- Small Data Volumes with Minimal Workload Stress: For small tables or databases with minimal workloads, batch processes may suffice. In such cases, the overhead of setting up CDC may not provide enough value to outweigh the effort.
- Restricted Data Access: If policies or regulations (e.g., HIPAA, PII compliance) restrict direct access to source data, implementing CDC may become challenging. While obfuscation or data masking rules can be applied during transformations, these must comply with your organization's data governance policies.
Common Reasons for Choosing Not to Use CDC
Historically, teams have opted against implementing CDC for the following reasons:
- Reliance on Legacy Batch Processes: Many organizations still depend on legacy systems that support batch data extraction and don't require real-time updates. Transitioning to CDC may seem unnecessary if existing processes meet business needs.
- Complexity of Implementation: Configuring CDC pipelines can require significant effort, especially for teams unfamiliar with database logs or streaming platforms. For smaller use cases, batch methods often prove simpler and "good enough."
- Limited Access to Source Logs: Some teams lack access to the write-ahead log (WAL) or equivalent log files in their databases, relying only on specific views or tables. This limitation makes it difficult to implement log-based CDC.
A Reddit thread in May of 2023 captures the above well. You can read the full discussion here.
As a final note, if you're building a new application, it might make more sense to make it stream-native, writing events directly from production to the new application by, say, publishing to Kafka, Kinesis, or Gazette.
Change Data Capture Alternatives
Searching for Change Data Capture (CDC) solutions in databases like PostgreSQL, MySQL, SQL Server, or MongoDB can quickly become overwhelming. There are countless options across vendors, ranging from open-source projects to paid, fully managed solutions, as well as cloud-native and custom implementations.
In this next section, we’ll help you sort through the noise and provide options and steps for the 4 most popular database management systems: MySQL, PostgreSQL, SQL Server, and MongoDB.
Paid & Open-Source CDC Solutions
The following table highlights some of the leading CDC tools and their key features:
| Solution | Summary | Pricing | Postgres & MySQL | SQL Server | Mongo | Dynamo | Firestore |
|---|---|---|---|---|---|---|---|
| Estuary Flow | Open-source and fully managed streaming CDC connectors for cloud with transforms in streaming SQL. | Open-source or $1/GB of incremental data each month. | X | X | X | X | X |
| Fivetran / HVR | Fully managed batch connectors for on-prem or cloud. 5-minute syncs. | Monthly active rows. | X | X | | | |
| Airbyte | Open-source batch connectors with upgradable service available. 5-minute syncs. | Open-source or enterprise pricing. | X | X | X | | |
| Confluent | Fully managed real-time CDC connector built on top of Debezium. | Priced on compute and data transfer used each month. | X | X | X | X | |
| Striim | Fully managed real-time CDC connectors for on-prem or cloud. | Priced on compute and data transfer used each month. | X | X | X | | |
| Debezium | Popular open-source project for relaying change data into Kafka. See our analysis. | N/A | X | X | X | | |
| Maxwell's Daemon | Written in Java. Outputs JSON from the event stream. Similar to Debezium but not as popular. | N/A | MySQL only | | | | |
Native CDC Solutions
Most major cloud providers now offer their own zero-ETL services for basic CDC implementations within their ecosystems. While these solutions work well for specific use cases, they are limited in scope and lack advanced features like transformation or governance.
- MySQL & PostgreSQL:
  Services like AWS Aurora MySQL allow direct data movement to Redshift without third-party tools. Similarly, Google Cloud and Microsoft offer CDC solutions tied to their environments. However, these tools are typically restricted to cloud-hosted databases and lack flexibility for multi-cloud or hybrid use cases.
  Pro Tip: Tools like AWS DMS may also be an option, but they come with trade-offs like limited transformations and higher latency.
- SQL Server:
  Microsoft continues to expand support for CDC within Azure, making it easier to move data across Azure-native services. SQL Server also ships with built-in CDC that can be enabled per table (see the sketch after this list).
- MongoDB:
  MongoDB includes a built-in CDC connector that streams change data into Kafka topics, making it a good fit for event-driven architectures.
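For reference, here's roughly what enabling SQL Server's built-in CDC looks like; schema and table names are illustrative:

```sql
-- Enable CDC at the database level (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for one table; SQL Server creates a change table behind the scenes
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'customers',
    @role_name     = NULL;  -- NULL: no gating role required to read changes

-- Read all changes captured so far for that table
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_customers');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, N'all');
```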
Manual CDC Solutions
Manual CDC methods include the foundational approaches discussed earlier: query-based CDC, trigger-based CDC, and log-based CDC. While manual implementations give teams more control, they require significant effort to set up and maintain, especially at scale.
For most organizations, leveraging an established CDC solution (open-source or managed) is a more practical and scalable approach.
How to Implement Change Data Capture Step-by-Step
There are several options for implementing change data capture from different databases and applications to various targets. For those ready to implement CDC, here are some detailed guides to get started with specific databases and targets:
- MySQL CDC guide
- Postgres CDC guide
- SQL Server CDC guide
- DynamoDB CDC guide
- How to extract data from Firestore, and the Firestore CDC guide
- Salesforce change data capture
- Snowflake CDC
Additionally, follow these Best Practices for Implementing Change Data Capture to optimize your implementation and avoid common pitfalls.
Why Consider Estuary for CDC?
Estuary Flow is a no-code platform designed to simplify real-time Change Data Capture (CDC) and streaming ETL. Built on the powerful Gazette framework, it extends the capabilities of log-based CDC with unmatched flexibility and ease of use.
Key Benefits of Estuary Flow:
- Many-to-Many Pipelines: Seamlessly move the same data to multiple targets for diverse use cases.
- Real-Time and Batch Transforms: Use SQL and TypeScript for custom compute in both streaming and batch modes.
- Historical Backfill: Add new targets and backfill historical data without re-extracting from sources.
- Exactly-Once Processing: Leverage Gazette’s unique semantics to ensure accurate, deduplicated data delivery.
- Massive Scalability: Handle high-change environments with true elastic scale and decoupled storage-compute architecture.
- Schema Drift Support: Validate and evolve schemas without interrupting your pipeline.
Estuary Flow combines the reliability of log-based CDC with the simplicity of a no-code interface, making it a standout solution for modern data integration needs.
Ready to explore Estuary Flow? Try Flow for free here!
We welcome your questions, comments, and (friendly!) debate. Find our team on Slack.
Conclusion
Change Data Capture (CDC) is a game-changer for keeping data synchronized across systems in real-time, powering use cases like fraud detection, e-commerce, and real-time analytics. While log-based CDC remains the gold standard for scalability and efficiency, simpler methods like batch processing can suit less time-critical needs.
For organizations seeking an easy-to-implement, real-time solution, tools like Estuary Flow offer unmatched scalability, reliability, and simplicity. Ready to modernize your data pipelines? Explore Estuary Flow today.
CDC Frequently Asked Questions (FAQ)
What is change data capture?
Change Data Capture (CDC) is the process of capturing data changes in a database as they occur and delivering them in real-time to one or more destinations.
What are the different types of change data capture?
The main types of CDC are:
- Query-based CDC (periodic polling of changes).
- Trigger-based CDC (using database triggers to track changes).
- Log-based CDC (capturing changes from the database’s write-ahead log). Log-based CDC is the most efficient and scalable method.
Does Change Data Capture guarantee exactly-once delivery?
The write-ahead log records every change exactly once, but not all CDC implementations ensure exactly-once delivery to the target system. Tools like Debezium provide "at-least-once" delivery, while others rely on deduplication at the target to achieve exactly-once semantics.
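As an illustration of target-side deduplication, a pipeline might collapse possibly-duplicated events to one row per key. This sketch assumes a hypothetical `change_events` staging table with a monotonically increasing `lsn` column:

```sql
-- Keep only the most recent event per key, by log sequence number
SELECT id, address, operation
FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY lsn DESC) AS rn
    FROM change_events e
) AS dedup
WHERE rn = 1;
```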
What is change data capture latency?
CDC latency refers to the time taken for a change in the source system to appear in the target system. Factors influencing latency include the CDC method, the time needed for data transformation, and the performance of the target system.
About the author
The author has over 15 years in data engineering and is a seasoned expert in driving growth for early-stage data companies, focusing on strategies that attract customers and users. Their extensive writing provides insights that help companies scale efficiently and effectively in an evolving data landscape.