Looking for a relatively quick but no-fluff explanation of what change data capture (CDC) is? Trying to understand your CDC options for different data systems, and get nuanced comparisons without reading documentation that's liable to melt your brain?
Excellent. We'll start exactly where you'd expect...
Preamble: Change Data Capture Explained by Rom-Com
The 2004 flick 50 First Dates starring Adam Sandler and Drew Barrymore can help explain the system design behind efficient change data capture, as well as Apache Kafka.
In the movie, one of our main characters suffers from severe amnesia after an accident and is unable to retain new memories. Our other lovestruck character doesn’t lose hope, though.
Instead, he creates a videotape for her to watch each morning when she wakes up. The tape linearly summarizes her life, the news, and the major milestones. He adds on to the end of the tape each day so she is always ‘stateful’ with regard to current reality.
Our creative hopeless romantic has effectively employed an efficient, yet simple change data capture process. The videotape of events can be thought of as a database or continuous log, and new memories can be thought of as INSERT or ‘append’ events.
- Appending the new video of each new date to the end of the videotape is the essence of Apache Kafka's log design as well as change data capture...
- Instead of re-creating the tape from scratch each day, which would be very time-consuming and equivalent to a nightly full database export, Adam* appends onto the end of the tape.
- The videos are added in the exact order they occurred.
- When Drew* starts out knowing nothing each day, the videotape is able to backfill her knowledge, as if she was a data application being spun up and we're replaying the Kafka log.
- The tape is never destroyed so the videotape "log" can be shared with others should others need to remember… this scalable process of sharing the append log is at the heart of Kafka.
*We forgot the real names of Adam Sandler and Drew Barrymore's characters and did not look them up while writing this article. It's easier this way.
Yes, this analogy breaks down in places. But at a high level, it works. Adam Sandler data puns aside for now, let’s explore what efficient change data capture means when implemented within the enterprise. This article will particularly resonate with you if your data volumes are large, growing quickly, or being synced too slowly from source to destination.
Short on time? Here’s the 60-second run down:
- Early in the enterprise lifecycle, there may be no need for a complex Change Data Capture implementation due to insignificant volume; running a nightly query to extract all data is a fine replacement.
- Once data volumes grow and there are more transactions, this nightly query becomes problematic for both technical and business reasons. A more robust database replication solution is needed because:
- The production database will begin to have performance or network issues.
- Downstream applications and models aren't continuously in sync with the source.
- With a distributed (microservice-based) architecture, simple queries won't work.
- The production database will begin to have performance or network issues.
- Business: You'll miss out on increased revenue or incur more costs due to slow-moving data. You'll deal with more fraud, bad user experiences, and sub-par machine-learning algorithms
- CDC can be implemented by query, database triggers, or from the database log.
In general, only log-based or query-based CDC are used in production for full-scale replication today.
CDC can also be built manually, via open-source platforms (famously, Debezium), or with a fully managed solution.
- There are trade-offs of each approach we cover, but generally pulling change data from the database log is best long-term.
- Edge cases exist where using change data capture actually doesn’t make sense… mostly when the organization has no need.
- There are a lot of ‘gotchas’ to be wary of when choosing a CDC solution.
What is Change Data Capture?
Change data capture, or CDC, is the process of recognizing a change event in a source data system so that a downstream system can act on that change, usually by updating a target system to reflect new information.
- Detects and captures change events from a source database like PostgreSQL, MySQL, or MongoDB; or an API like Salesforce.
- Processes the events.
- Allows you to move these events to business applications.
And (this is important) CDC does all these things with low latency* and with reduced stress on the source database compared to other methods.
From a technical standpoint, CDC is an umbrella term for a variety of processes that fulfill this goal (these processes are also known as ETL, ELT, data pipelines, etc).
Let’s look at a common scenario. Say you’re a company that manages customer records in a relational database, like PostgreSQL, and powers data analysis from a data warehouse, like Snowflake. When a customer record is updated in Postgres, your CDC mechanism takes note and updates the corresponding record(s) in Snowflake.
That’s a simplified use case, but the pattern is a very common one: migrating data updates from an RDBMS to a data warehouse. (Need a refresher on different types of data storage systems? We’ve got you.)
*"Low latency" is a subjective term. Though CDC is often marketed as a real-time data solution, that’s not always the case. Depending on the method you use, there can be a bit of latency in your CDC pipeline, as we'll see.
What Isn't CDC?
A nightly cron job running
SELECT * and extracting the full current snapshot of the database is not change data capture.
Yes, the changed data will ultimately be reflected in the downstream system for database replication, but, in this instance, all of the data is being extracted and loaded, as opposed to just the change events. The difference in patterns here may have a massive impact on latency potential, and database and network stress.
When Should You Use Change Data Capture?
Change data capture is a hugely useful pattern for just about any modern business. It helps companies get more value from their data in a faster and more cost-effective way.
By implementing CDC on a database like Postgres or MySQL, companies can keep the source database running more efficiently and relay data to analytics and production faster so data teams can drive more business value.
Let's ground that down with a few change data capture use cases, as told by cautionary tales.
Without a solid change data capture implementation, we start to see problems like:
- Someone places an order for an item that appears to be in stock from an online retailer... only to find out that it was not in stock. (The store's website was out of date with the database).
- Someone makes a financial mistake based on outdated information on their bank account.
- Fraudulent activity isn't caught in time, and the perpetrator rides off into the sunset.
- A logistics company doesn't know where their shipment is and can't make informed decisions for new orders.
You get the idea.
When Should Change Data Capture Not Be Used?
By and large, you should choose to implement change data capture. We’d go as far as to say the only time it shouldn’t be implemented is when:
- There is simply no value in lower latency data feeds for the organization AND
- Data volumes for the table or database are small and under no workload stress.
Historically speaking, we hear of two reasons teams have chosen not to implement CDC. Either:
- Legacy processes simply don’t allow it, or
- It was just too much work.
A Reddit thread in May of 2023 captures the above well. You can read the full discussion here.
As a final note here, if you're building a new application, it might make more sense to stream directly from production to the source system. This can be done by, say writing to Kafka, Kinesis, or Gazette.
Batch vs Real-time CDC: Exploring Latency
PostgreSQL change data, MySQL change data, or change data from any other source can be captured in one of two ways:
- Through a batch process.
- Through a real-time (streaming) process.
In most modern change data capture implementations, the change data is available immediately to be sent to downstream systems like SnowFlake, BigQuery, Databricks, and beyond. The CDC pipeline is reacting instantaneously to a change event, making this real-time CDC.
But every organization's latency needs are different. Batch CDC pipelines can run once a day, or once every five minutes. Depending on the use case, that time delay might be perfectly acceptable. The cost savings of extracting new data only might be benefit enough.
Change Data Capture Methods
Now, how do you implement change data capture?
To simplify matters, we’ll group common change data capture methods into three categories:
- Query-based change data capture
- Change data capture using triggers, and finally
- Real-time change data capture with database logs
What is Query-Based Change Data Capture?
The first change data capture method involves simple SQL queries on tables in the source database. This is also known as polling.
Since CDC sources tend to be relational databases, administrators will be quite comfortable using SQL to manipulate tables. This makes the query method approachable for most teams.
To set this up, you create a CRON job or similar to run the query at a set interval.
The source tables must have an extra column to track when each record was last changed. This is usually a timestamp column, but incremented version numbers can be used, as well, with a bit of extra legwork.
|0001||Joe||Shmoe||123 Main St||2023-02-03 15:32:11|
To find changes to the table, you’d run a query against it to select records where the timestamp value is greater than the time at which you last ran the query. For example, if you last queried the customers table at 7 AM on February 1:
plaintextSELECT * FROM customers WHERE time_updated > ‘2023-02-01 07:00:00’;
Typically, you’d configure a recurring query at a standard time interval. The selected records would be used to update the target at that same cadence.
- Easy to implement: It's a simple SQL query run on a schedule.
- Permissions are simple: Only read privileges are required.
- Good fit for: Smaller and slower OLTP databases where a full audit trail isn't required.
- Add a column to tables: Adding a new column to track timestamps might not be feasible.
- Can’t capture deletes: When you completely remove a record from the table — this is known as a hard delete — the query won’t detect it. You can work around this by using soft deletes. This means you retain all records, but add a boolean
deletedfield. However, this isn’t usually ideal because it can bloat tables with useless records.
- Latency: This is a batch method. There will always be some lag between data updates in the source and target.
- Performance: Every time you run the query, you’re polling the entire table. With a large table, you’ll run into performance issues. This tends to be particularly problematic when your table starts fairly small and grow significantly — a common occurrence. You won’t be able to query it as often as you like, and each time you run the query, your entire system can get bogged down, sometimes for hours.
What is Trigger-Based Change Data Capture?
Databases like PostgreSQL, MySQL, and SQL Server have built-in tooling for automatically invoking a function whenever a change is made in a table. These functions, or triggers, vary a bit in these respective databases, but the high-level workflow remains the same (see Postgres Trigger Docs, MySQL Trigger, and SQL Server Triggers).
You can define a function and replicate all the change events to another table internally within the database, often called an audit table or shadow table.
Triggers are useful and popular in scenarios like:
- Complex data validation (For example, when a new row is created in table
customers,a new row should also exist in table
- Keeping an audit log of the database changes.
- Handling database errors.
Here comes the big BUT… Triggers cannot be used with high-velocity data.
Before log-based CDC, triggers were quite popular for change data capture. Today, they're far less common. (We challenge you to find a single Medium article or engineering blog from a respectable company detailing their enterprise trigger-based CDC at scale!)
This is likely because, if you are looking to scalably implement change data capture, trigger-based CDC is simply not the best option.
While some triggers are more costly than others, the cost on the database will be too high as a continuous deluge of triggers will add too much overhead for any dataset at scale.
- Real-time: Triggers capture changes immediately.
- Complete data capture: Including delete events.
- Supports metadata: For example, you can track which statement caused the change event.
- Performance: Triggers impact the performance of the source database, especially at scale. This is because each trigger creates additional write operations. This may be insignificant at first, but can become debilitating on large datasets with frequent changes.
- Getting the data out of the source is complex: Even after the trigger, you still need to get the data out of the "shadow table" in the source system. This introduces more complexity, and often latency.
- Complex with multiple tables: When you need to capture data from multiple tables, each one needs its own trigger function. The upkeep often becomes unsustainable.
What is Log-Based Change Data Capture?
Log-based change data capture is a data design pattern in which database change events (updates, inserts, deletes) are sourced from the write-ahead log (WAL) of the database (or binlog in MySQL) and delivered to downstream data consumers.
Historically, the WAL was used to ensure data wasn't lost in the event of a system failure. Thus, it helped ensure database acidity and durability.
Today, the WAL has been re-purposed by many for enabling change data capture without adding overhead to the production system.
As inserts, updates, and deletes flow into this transaction log to be committed to the database, they're also asynchronously parsed for delivery into downstream applications in real time.
This CDC method requires additional components (unlike queries and triggers, which leverage native capabilities of the database). These typically include:
- A message broker; often, Apache Kafka.
- A connector to interface between the source database and message broker.
Implementation and configuration vary a bit as each database type has its own version of the write-ahead log. The Postgres log-based CDC or MySQL log-based CDC roughly looks like this:
Log-based CDC is the gold standard of change data capture implementations because it captures all events, in exact order, in real-time, all without adding any load to the source database. That's why most enterprise-scale solutions offering managed CDC include a connector that reads directly from the write-ahead log.
Building such a connector is where things get complicated. Luckily, there are many open-source and managed solutions available today to take that part off your plate.
- Ordering and completeness: Every event (including deletes) is captured in order, and you can retain full history.
- Latency: Latency as low as milliseconds.
- Easy on the database: Doesn't increase the database load.
- Complex: Custom-coded connectors are challenging to engineer.
- Permissions: Requires more permissions to use the WAL.
Want to learn more? Gunnar Morling (lead on the Debezium project, and who has likely written more about CDC than any human alive) summed the differences between query-based and log-based methods here.
How to Implement Change Data Capture
A Google search for how to implement change data capture in Postgres, MySQL, or another database will quickly become overwhelming. You'll find scores of potential solutions across vendors, as well as open-source, Cloud-native, and custom code implementations.
In this next section, we’ll help you sort through the noise and provide options and steps for the 4 most popular database management systems: MySQL, PostgreSQL, SQL Server, and MongoDB.
Paid & Open-Source CDC Solutions
|Solution||Summary||Pricing||Postgres & MySQL||SQL Server||MongoDB|
|Estuary Flow||Open-source and fully managed streaming CDC connectors for cloud with transforms in streaming SQL.||Open-source or $0.75/GB on incremental data each month.||X||X||X|
|Fivetran / HVR||Fully-managed batch connectors for on-prem or cloud. 5-minute syncs.||Monthly active rows.||X||X|
|Airbyte||Open-source batch connectors with upgradable service available. 5-minute syncs.||Open-source or enterprise pricing.||X||X||X|
|Confluent||Fully managed real-time CDC connector built on top of Debezium.||Priced on compute and data transfer used each month.||X||X||X|
|Striim||Fully managed real-time CDC connectors for on-prem or cloud.||Priced on compute and data transfer used each month.||X||X||X|
|Debezium||Popular open-source project for relaying change data into Kafka. See our analysis.||NA||X||X||X|
|Maxwell’s Daemon||Written in Java. Outputs from the event stream. Similar to Debezium but not as popular.||NA||MySQL only|
We maintain a MUCH more complete version of this table in Google Sheets, including more vendors and source systems!
Native CDC Solutions
MySQL & PostgreSQL
The major cloud providers each offer their own zero-ETL services. If you are using, say, AWS Aurora MySQL and only need to move data to Redshift, then these native services may be an affordable fit. Google Cloud and Microsoft also offer their own options.
Note this only allows you to move cloud-hosted MySQL change data only within these environments, and there are no features enabling transformation, governance, and more.
This Medium article provides a good overview of the pros and cons of using zero-etl.
Microsoft is working on functionality to more easily move data within Azure environments.
MongoDB has a built-in connector sending change data into a Kafka topic. You can read more about it here.
Manual CDC Solutions
Manual CDC solutions are those discussed above (triggers, queries, and log-based)— the same methods that underly all the paid and managed tools here.
To go deeper into how each method looks in each database, see:
Change Data Capture in Estuary Flow
So if you read that whole thing, congrats! Take a quick screen break… Welcome back.
You might have noticed Estuary Flow as an option for real-time CDC for these sources. Allow us to quickly introduce you to our open-source and fully-managed CDC connector solution.
Estuary Flow works on an ecosystem of open-source connectors, including CDC connectors we build and maintain in-house. We use the log-based CDC method, but do things a bit different.
Flow is a UI-based product, and it's core runtime is based on Gazette, a distributed pub-sub streaming framework.
Benefits of Estuary Flow include...
- Exactly-once processing guarantees using Gazette's unique de-duplication of real-time data.
- Scales for massive changes: connectors avoid DML issues when working with millions of rows.
- Backfill from the cloud: Flow uniquely stores intermediate data in your CDC pipeline in a cloud storage bucket you own. Use this reduced, de-duplicated data to backfill without re-reading from the database.
- Validating schema changes: all data is validated against a JSON schema you can evolve without halting the pipeline.
- In-flight SQL transforms: Easy-to-define stateful data transformation. SQL and TypeScript are both supported.
You can try Flow for free here!