MySQL Change Data Capture (CDC): Complete Guide

A comprehensive guide to change data capture in MySQL for cross-functional data teams.


MySQL Change Data Capture (CDC) is a method used to capture and track real-time data changes in a MySQL database and synchronize them with target systems, ensuring accurate and up-to-date data. This guide provides an in-depth look at MySQL CDC methods and benefits for business leaders, data engineers, and IT professionals.

Who is this guide for?

This resource is designed to support a wide range of data-focused and data-adjacent professionals who meet two simple requirements:

  • You’re part of an organization that uses the relational database MySQL, or intends to migrate to it. If you're preparing to switch, here's a detailed guide on how to migrate a MySQL database.
  • You’re aware of the industry buzz around change data capture (CDC) and want to learn more.

Business leaders: you need a no-nonsense foundation to make decisions. We cover the essentials of MySQL and CDC, plus the pros and cons of different approaches.

Engineers and IT professionals: you intimately know the systems you work with, but figuring out the best changes to make can be time-consuming. Here you’ll find a succinct intro to a range of MySQL CDC methodologies.

Implementing a new data architecture or strategy requires understanding and buy-in from a variety of teams that may not work together day-to-day. Having a common starting point can make the collaborative process easier. That’s what this guide is for.

Let’s get started.

Why Should You Use MySQL CDC?

MySQL is a popular open-source relational database. Though it's free to use, MySQL has been owned and sponsored by a corporation since its inception: first by the Swedish company MySQL AB, then by Sun Microsystems, and today by Oracle Corporation, which acquired Sun in 2010.

There are a handful of other popular relational databases, including PostgreSQL, Oracle Database, SQL Server, and Db2.

Here are a few reasons to choose MySQL for Change Data Capture:

  • Performance: MySQL was designed for fast performance and supports multiple storage engines for flexibility.
  • Security: MySQL is known for its robust security features, making it suitable for sensitive data environments.
  • ACID Compliance: Transactions are reliable and fault-tolerant, ensuring data integrity.
  • Open-Source Flexibility: While MySQL is open-source, it also has official support from Oracle.
  • Easy Setup: Compared to other RDBMSs, MySQL is simpler to set up, making it an attractive option for organizations of all sizes.

MySQL is commonly used as the primary transactional database for a data stack. That is, it acts as the quick, robust system of record that powers applications and keeps track of rapidly changing information.

Consider one very popular application for MySQL: e-commerce. An online store needs a transactional database to track data about customers and inventory, and to process sales. MySQL is an ideal choice because it’s a fast and secure row-based relational database.

Just as an online store is unlikely to be the only application a retailer builds, MySQL is unlikely to stand alone as a data system. Relational databases simply aren't ideal for every workload; data analytics is the most notable exception.

While MySQL is great at managing the transactions that make sales and record-keeping possible, it’d be a poor fit for analysis. And for our hypothetical retailer to succeed, they need to analyze their store’s performance, marketing, inventory, and more. Data warehouses are designed to power analysis like this. However, they are ill-suited for transactional workflows, making them complementary to systems like MySQL.

To learn more about different types of data storage, see this blog post.

What is Change Data Capture (CDC) for MySQL?

MySQL Change Data Capture (CDC) refers to the process of capturing and tracking data changes in a MySQL database. These changes are then reflected in other systems like data warehouses or analytics platforms.

CDC solves the problem of synchronizing data between MySQL and other systems in a way that is:

  • Timely, in a world that expects fresh data within minutes or seconds
  • Reliable in the face of inevitable errors and downtime
  • Minimally harmful to performance, even as your data grows to terabyte scale

When implemented correctly, change data capture can meet all these requirements.

Change data capture, or CDC, is the process of recognizing a change in a source data system so that a downstream system can act on that change, usually by updating a target system to reflect new information.

MySQL is one of the most commonly requested sources for CDC. Fortunately, it’s also well suited to it.

  • For database administrators, simpler CDC methods are easy to set up using SQL alone.
  • Advanced, real-time CDC methods are also attainable using MySQL’s binary log, or binlog.

In the next section, we’ll cover these methods in more depth.

What Are the Methods for MySQL Change Data Capture (CDC)?

When we talk about change data capture these days, we're generally referring to real-time CDC. In this type of CDC setup, data changes from MySQL are reflected in the targets within seconds or milliseconds.

However, CDC has no temporal requirement by definition. Various methods exist, and they can be batch or real-time. Your choice of MySQL connector can significantly impact the reliability and latency of your CDC pipeline.

In general, batch methods tend to be easier to implement, but they introduce latency and can have a significant performance impact. Event-driven, real-time CDC scales better, but has historically been more challenging to build.

In the following sections, we’ll describe three approaches that fall into both categories:

  • CDC using queries (batch)
  • CDC using triggers (batch)
  • CDC using the binlog (real-time)

Queries for MySQL Change Data Capture

SQL queries are a simple method for batch CDC in MySQL. They can be set up on the command line or in a client application like MySQL Workbench.

For this method to work, the source table’s schema needs to have a version number or timestamp column indicating when it was last updated. If it doesn’t, you’re better off using the trigger method, which we’ll cover below.
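If you're able to modify the schema, MySQL can maintain such a timestamp automatically. Here's a minimal sketch, using the placeholder names my_table and time_updated:

-- Add a timestamp column that MySQL updates on every write to the row.
ALTER TABLE my_table
  ADD COLUMN time_updated TIMESTAMP NOT NULL
  DEFAULT CURRENT_TIMESTAMP
  ON UPDATE CURRENT_TIMESTAMP;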

Your SQL query will look something like:

SELECT * FROM my_table WHERE time_updated > @time_last_query;

Here, @time_last_query stands in for the timestamp of your previous run.

You can run the query at a recurring interval using MySQL’s built-in Event Scheduler.
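As a sketch, here's what that might look like. It assumes a staging table, my_table_staging, with the same schema as my_table plus a unique key so repeated rows are deduplicated; all names are placeholders:

-- Make sure the scheduler is running (requires admin privileges):
SET GLOBAL event_scheduler = ON;

-- Every 5 minutes, copy recently updated rows into a staging table.
CREATE EVENT capture_my_table_changes
ON SCHEDULE EVERY 5 MINUTE
DO
  REPLACE INTO my_table_staging
  SELECT * FROM my_table
  WHERE time_updated > NOW() - INTERVAL 5 MINUTE;

A production version would persist a high-water mark from the previous run rather than relying on a fixed interval, which can miss rows if the event is delayed.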

Now, the selected records still need to make it to the target system, which is where the real complexity of this method lies. Most solutions involve either a bespoke ETL pipeline or a challenging event-driven framework like Kafka Connect.

Table: Advantages and disadvantages of SQL queries for MySQL CDC, as described in the text.

Triggers for MySQL Change Data Capture

An alternative method uses MySQL triggers: functions that listen for insert, update, and delete events on a given table.

You can write these events to a second table, or use them to update timestamps, as mentioned in the section on queries above.
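Here's a minimal sketch of the first option. All table and column names are hypothetical, and the INSERT and DELETE events would each need a similar trigger:

-- A change table recording which row changed, how, and when.
CREATE TABLE my_table_changes (
  change_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
  row_id      INT NOT NULL,
  change_type ENUM('INSERT', 'UPDATE', 'DELETE') NOT NULL,
  changed_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Fire after every UPDATE on the source table.
CREATE TRIGGER my_table_after_update
AFTER UPDATE ON my_table
FOR EACH ROW
  INSERT INTO my_table_changes (row_id, change_type)
  VALUES (NEW.id, 'UPDATE');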

At this point, you have a log of the changes, but it still lives inside MySQL. From there, you face the same challenge as with queries: building or buying a pipeline solution to move the changes to the target system.

Triggers also have a significant performance impact on the database. A good practice is to run the trigger against a replica of the main table, which adds another layer of complexity.

Table: Advantages and disadvantages of triggers for MySQL CDC, as described in the text.

MySQL Binary Log for Real-time Change Data Capture

Figure: Binlog-based CDC architecture flow — from MySQL binary log to CDC connector, messaging layer, and target system.

The MySQL binary log (binlog) is the cornerstone of real-time CDC. It records all data-changing operations in a MySQL database and stores them as a binary stream. Originally designed for replication and recovery, the binlog also enables log-based, low-latency CDC.
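You can inspect the binlog directly from a SQL client. The file name below is the MySQL 8.0 default and will differ on your server:

SHOW BINARY LOGS;                                -- list the server's binlog files
SHOW BINLOG EVENTS IN 'binlog.000001' LIMIT 10;  -- peek at raw change events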

(Hint: Check out the terminology reference at the bottom of the post for more details on these concepts.)

Table: Advantages and disadvantages of the binlog for MySQL CDC, as described in the text.

When implementing CDC using the binlog:

  1. Enable the binary log on your MySQL server. (It’s often enabled by default in managed MySQL services like Amazon RDS and Google Cloud SQL — but always confirm your configuration.)
  2. Set binlog_format to ROW, which captures changes at the row level — essential for accurate CDC.
  3. Set binlog_row_image to FULL, unless you're working with systems that can handle partial row images.
  4. Ensure every table has a primary key, which is crucial for identifying and applying changes reliably downstream.
  5. Set an appropriate binlog retention period using either expire_logs_days (older versions) or binlog_expire_logs_seconds (MySQL 8.0+). Most CDC setups use 7 to 30 days of retention to account for pipeline delays and fault recovery.
  6. Parse the binlog using a CDC connector or plugin that translates binary events into structured records (typically JSON).
  7. Send the change events to a messaging system like Apache Kafka or directly into target destinations via CDC platforms.
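As a concrete reference for steps 1 through 5, here's a minimal sketch for a self-managed MySQL 8.0 server; 'your_db' is a placeholder, and managed services expose the same settings through parameter groups or console options instead:

-- Step 1: confirm the binary log is enabled (ON by default in MySQL 8.0).
SHOW VARIABLES LIKE 'log_bin';

-- Steps 2, 3, and 5: row-level events, full row images, 7-day retention.
SET PERSIST binlog_format = 'ROW';
SET PERSIST binlog_row_image = 'FULL';
SET PERSIST binlog_expire_logs_seconds = 604800;

-- Step 4: list tables in your schema that lack a primary key.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'your_db'
  AND table_type = 'BASE TABLE'
  AND (table_schema, table_name) NOT IN (
    SELECT table_schema, table_name
    FROM information_schema.table_constraints
    WHERE constraint_type = 'PRIMARY KEY'
  );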

Setting up a binlog-based CDC pipeline manually can be complex and time-consuming, particularly steps 6 and 7, which involve stream processing, queueing, and schema management.

That’s where CDC platforms come in.

The open-source project Debezium helped standardize real-time CDC by decoding binlog events and publishing them to Kafka topics. It supports MySQL and several other databases out of the box.
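Whichever connector you choose, it will need a MySQL user with replication privileges. The exact set varies by tool, but a typical grant, with placeholder names, looks something like this:

CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'choose-a-strong-password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
GRANT SELECT ON your_db.* TO 'cdc_user'@'%';  -- needed for initial backfills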

Today, several platforms — including Estuary Flow — make the binlog approach even more accessible by removing the need for a separate streaming layer like Kafka and simplifying deployment, monitoring, and transformations.

When evaluating your CDC options, consider:

  • Source scale and change volume
  • Data engineering resources available
  • Destination systems and update frequency requirements
  • Security and compliance constraints

With the binlog approach and the right platform, you can achieve millisecond-latency replication from MySQL to warehouses, lakes, and analytics tools — without major strain on your transactional systems.

For example, here’s how to connect MySQL to Snowflake in real time using a log-based CDC approach.

Let’s take a closer look at one managed CDC service: Estuary Flow.

Estuary Flow for Real-Time MySQL CDC

Estuary’s DataOps platform, Flow, includes a connector for MySQL CDC.

So, what exactly does that mean?

Flow is a centralized platform for building real-time data pipelines that connect a wide variety of source and destination data systems using open-source connectors. But unlike a typical ETL platform, Flow has an event-driven runtime, allowing true, real-time CDC.

When you use the MySQL source connector, Flow acts as a log-driven, real-time CDC pipeline.

After running some basic configuration on your database, as outlined above, the MySQL connector and Flow take over the last two steps. They integrate with the binlog to capture change events into your Flow data collection. Collection data, backed by cloud storage, is streamed through the Flow runtime, which is built on Gazette, an open-source streaming broker also created by Estuary’s engineering team.

Your data collections captured from MySQL are automatically backed up to a data lake in cloud storage. From there, they can be streamed to one or more target systems using any available materialization or destination connector. This happens with millisecond latency.

For a step-by-step example, see how to capture data from MySQL with Estuary Flow.

Features of Flow as a MySQL CDC pipeline:

  • High degree of customization using the intuitive GitOps workflow or UI: Add additional data sources or multiple destinations to your CDC pipeline, and combine or separate streams of data without direct engineering requirements on the streaming broker.
  • Ability to add transformations without an additional tool: Aggregations, joins, and stateful stream transformations are all possible.
  • Minimize performance impact of historical backfills: Backfilling data from MySQL tables is important for data integrity, even in event-driven CDC, but re-reading large tables can hurt performance or cause errors like data duplication. Flow avoids re-reading the whole MySQL table by writing a checkpoint of where it left off in that source table.

Key MySQL Features and Terms for Change Data Capture

Real-time MySQL CDC requires you to leverage some of MySQL’s defining features and components. Here’s a quick reference:

Binary log (binlog):

The MySQL binlog, or binary log, is a database transaction log that records change events in the source MySQL database and writes them to a binary file. It’s a native MySQL feature, though it isn’t always enabled by default: it’s off in versions before MySQL 8.0 and on by default in 8.0 and later. Enabling the binlog allows two important workflows:

  • Data recovery operations in case of failure. When a server goes down and must be restored from a backup, the binlog events are used to catch up with what has occurred since the backup was created.
  • Replication. Many workflows require database replicas, and, more importantly for the sake of this guide, MySQL’s replication capabilities are the basis of CDC.

The binlog retention period varies by MySQL version and hosting provider. In MySQL 8.0+, use binlog_expire_logs_seconds. Defaults are often 7 or 30 days — adjust based on your backfill and recovery needs.
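For example, to check and raise the retention window on MySQL 8.0+:

SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
SET PERSIST binlog_expire_logs_seconds = 2592000;  -- 30 days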

Source-replica architecture

In MySQL, the main server is called the source. Servers to which the source is replicated are called replicas.
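On a replica, you can check replication health from a SQL client:

SHOW REPLICA STATUS\G  -- MySQL 8.0.22+; use SHOW SLAVE STATUS on older versions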

Replication

Several types of replication are available in MySQL, all of which use the binary log. Replication can be:

  • Asynchronous (the default), semisynchronous, or delayed. Asynchronous replication can still happen in near-real-time.
  • Row-based (the default) or statement-based. Row-based replication copies only the row changes that occurred, rather than the entire SQL statement. See the binlog format docs for more information.

Typically, CDC setups use the defaults.

Ready to implement MySQL CDC? Explore Estuary Flow for scalable, real-time data integration. Sign up for a free trial today.


