
Log-Based CDC vs. Traditional ETL: A Technical Deep Dive

Learn the key differences between Change Data Capture (CDC) and traditional ETL. Explore challenges like missed updates, schema changes, and scaling. See how log-based CDC with Estuary simplifies real-time data pipelines.

Log-based CDC vs traditional ETL — a deep dive into data pipeline design.

Introduction

The primary objective of data teams is to enable the effective use of data, primarily for analytical purposes. To achieve this, most data teams must extract data from multiple sources and make it available in a user-friendly format for stakeholders, who are typically business-savvy users, analysts, or data scientists.

ETL refers to this process and stands for Extract, Transform, and Load. Data teams build and maintain these ETL systems; the sequence of the steps may vary (e.g., ETL, ELT, ETLT, etc.).

CDC (Change Data Capture) on the other hand, is a design pattern that refers to the process of capturing every change that occurs to a source dataset. In modern systems, you typically have multiple source systems, including microservices, third-party SaaS systems, and Excel files, among others. Knowing what the data looked like at any given time is a critical need for analytical use cases. This is where the discussion of Log-Based CDC vs Traditional ETL becomes important, as the two approaches handle data history very differently.

Think of how you'd answer this question: "How many customers were based out of New York, USA, in the last year?" While this question may seem simple, it may overlook specific nuances, such as:

  • What happens if a customer moves to another state? Would they be considered living in NY?
  • How long must a customer reside in NY to be considered based in NY?

As you can see, what appears to be a simple question can quickly become complex.

The ability to see what the data looked like at some point in the past is critical for analytical questions. CDC enables capturing every change to a dataset, allowing us to see what a dataset looked like at a specific point in time.

In this article, we will delve into how ETL data pipelines are constructed without CDC, the challenges they encounter, and how CDC-based ETL can help you deliver reliable data.

By the end of this article, you will know the value a well-built CDC system can bring to your stakeholders and how it can help data teams manage changes across upstream teams and ever-increasing stakeholder asks.

Log-Based CDC vs Traditional ETL: A Comparison

This section provides a TL;DR comparison of Change Data Capture (CDC) and ETL, looking at both traditional ETL pipelines and modern log-based CDC approaches, as well as how Estuary improves on both.

Estuary is a unified data movement and transformation platform for the whole data team.

Estuary enables you to build low-latency ETL and ELT pipelines using connectors for any database or data warehouse, leveraging Change Data Capture (CDC) to power your analytics, operations, and AI.

| Topic | Description | Traditional ETL | Log-Based CDC | Estuary |
| --- | --- | --- | --- | --- |
| Time to POC | Time it takes to create a working proof of concept | 1-2 days | 2-5 days | A few hours |
| Upstream schema change | Time it takes to change data pipelines in response to an upstream change | 1-2 days | 1-2 days | Automatically handled; a few hours |
| Adding new sources | Time it takes to add a new source (different DB, API, webhooks, etc.) | 1-2 days | 2-5 days | A few hours; Estuary has 150+ pre-built connectors |
| Latency | Time taken for a change upstream to be reflected in the warehouse | Hours to days | Real-time | Real-time, with batch fallbacks as needed |
| Maintenance | Time spent fixing bugs, investigating pipeline issues, and debugging why data is not showing up in the warehouse | Hours to days | Hours to days | A few hours |
| Scale | Ability to deal with 100s of messages per minute | Can handle large data sets | Needs a middle-layer system (e.g., Kafka) to handle the influx of data | Can handle large amounts of data (7+ GB/sec) |
| Duplicate event handling | Ability to deal with duplicate events | Needs custom logic | Needs custom logic | Built-in feature |
| Out-of-order events | Ability to deal with events arriving out of order (e.g., a delete before a create) | Needs custom logic | Needs custom logic | Built-in feature |
| Recovery | Ability to restart or recover from failure without manual intervention | Requires development time to define runbooks | Requires development time to define runbooks | UI to debug issues and restart pipelines |
| State management | Ability to persist pipeline progress across failures or restarts, to prevent duplicated or missing data | Needs custom logic | Needs custom logic | Built-in |
| Fall back to batch | Ability to fall back to batch pipelines for concerns like backfills, DB load, or governance issues with granting access | Batch by default | Needs custom logic | Built-in ability to fall back to batch |

Want to avoid building complex CDC pipelines from scratch? Try Estuary Flow to set up real-time data pipelines in minutes.

Setting the Stage: Building a Simple ETL Pipeline

Prerequisites:

  1. Docker
  2. uv

Let's assume we are part of the data team at an e-commerce company. Our team is responsible for ensuring that data is available for analytical use cases via our warehouse. To demonstrate how a CDC ETL pipeline differs from traditional ETL, we will start by setting up a simple ETL process using PostgreSQL.

For this post, let's focus on one upstream source: the Supplier table from our application database. This table contains information about all our suppliers, with one row per supplier. Any change to a supplier will update this table. For example, if a supplier deletes their account, we will mark them as inactive.

Let's set up our application DB with a supplier table.

Run the following commands on your terminal.

plaintext
docker run --name application-db -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d postgres:16

# Create the suppliers table and insert fake data
docker exec -ti application-db psql -U postgres -c "
DROP TABLE IF EXISTS suppliers;
CREATE TABLE suppliers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    address TEXT,
    is_active BOOLEAN DEFAULT TRUE,
    created_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Insert sample data with simplified columns
INSERT INTO suppliers ( name, address, is_active ) VALUES
    ( 'Tech Solutions Inc.', '1234 Innovation Drive, Suite 200, San Francisco, CA 94105, USA', TRUE ),
    ( 'Global Supply Co.', '789 Commerce Boulevard, Chicago, IL 60601, USA', TRUE );"

The above code block starts the Postgres application DB and creates a suppliers table with some fake data.

Our objective as the data team is to extract data from the suppliers table and insert it into a dim_suppliers table (a snapshot dimension) in our warehouse.

Let's set up our warehouse with a dim_suppliers table.

Run the following commands on your terminal.

plaintext
docker run --name warehouse -e POSTGRES_PASSWORD=mysecretpassword -p 5433:5432 -d postgres:16

# Create the warehouse dim_suppliers table
docker exec -it warehouse psql -U postgres -c "
DROP TABLE IF EXISTS dim_suppliers;
CREATE TABLE dim_suppliers (
    dim_supplier_key SERIAL PRIMARY KEY,
    supplier_id INTEGER NOT NULL,
    name VARCHAR(255) NOT NULL,
    address TEXT,
    is_active BOOLEAN DEFAULT TRUE,
    snapshot_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_ts TIMESTAMP,
    updated_ts TIMESTAMP
);"

The above code block starts the Postgres warehouse DB and creates an empty dim_suppliers table.

Now we have the upstream application database and warehouse set up.

In the following sections, we will cover how to implement ETL without CDC, with CDC, and how Estuary can enable you to focus on adding business value.

Traditional ETL: Easy to Start & Very Hard to Scale Out

Let's build a simple Python script to pull data from the supplier table into the dim_supplier table as shown below. This represents a traditional ETL pipeline without Change Data Capture (CDC).

python
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.13"
# dependencies = [
#   "psycopg2-binary",
# ]
# ///
import psycopg2

# Connect to the application database
conn = psycopg2.connect("postgresql://postgres:mysecretpassword@localhost:5432/postgres")
cursor = conn.cursor()

# Extract data from the suppliers table
cursor.execute("SELECT id, name, address, is_active, created_ts, updated_ts FROM suppliers")
rows = cursor.fetchall()

print("#" * 100)
print("INPUT DATA")
for row in rows:
    print(row)
print("#" * 100)

cursor.close()
conn.close()

# Connect to the warehouse database and load the rows into dim_suppliers
warehouse_conn = psycopg2.connect("postgresql://postgres:mysecretpassword@localhost:5433/postgres")
warehouse_cursor = warehouse_conn.cursor()

for row in rows:
    warehouse_cursor.execute(
        "INSERT INTO dim_suppliers (supplier_id, name, address, is_active, created_ts, updated_ts) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        row,
    )
warehouse_conn.commit()

warehouse_cursor.execute("SELECT * FROM dim_suppliers")
output_rows = warehouse_cursor.fetchall()

print("#" * 100)
print("OUTPUT DATA")
for row in output_rows:
    print(row)
print("#" * 100)

Save the above code block as a file named etl.py. You can run this ETL script from your terminal, as shown below.

plaintext
uv run etl.py

You now have a dim_suppliers table ready to use. However, you soon start encountering problems.

Problem 1: Missing Updates in Traditional ETL Pipelines

Assuming this pipeline runs daily, you notice updates within a day are being missed.

Let's look at an example:

plaintext
# Run the pipeline on day 1
uv run etl.py

# Update the suppliers table
docker exec -it application-db psql -U postgres -c "
UPDATE suppliers SET address = 'New Jersey' WHERE id = 1;
UPDATE suppliers SET name = 'Tech Solutions International Inc.' WHERE id = 1;
UPDATE suppliers SET is_active = FALSE, address = '1234 Innovation Drive, Suite 500, San Francisco, CA 94105, USA' WHERE id = 1;"

# Run the pipeline on day 2
uv run etl.py

docker exec -it warehouse psql -U postgres -c "SELECT name, address, is_active FROM dim_suppliers WHERE supplier_id = 1;"

Although this supplier was based in New Jersey (NJ) for a specific period, we do not capture this information in our warehouse.

A business user will assume that this supplier has always been based in California (CA). Now they are basing their analysis on incorrect information.

Let’s look at a simple example below:

Traditional ETL pipeline missing intermediate updates over time

You increase the pipeline frequency, hoping to capture all the updates, but something else breaks this time.

Problem 2: Handling Schema Changes in ETL

When there is an upstream schema change, you will need to update the code to prevent it from breaking.

Let's look at an example. Run the following commands on your terminal.

plaintext
# Update the table schema upstream
docker exec -it application-db psql -U postgres -c "ALTER TABLE suppliers RENAME COLUMN id TO supplier_id;"

uv run etl.py
Traceback (most recent call last):
    cursor.execute("SELECT id, name, address, is_active, created_ts, updated_ts FROM suppliers")
psycopg2.errors.UndefinedColumn: column "id" does not exist
LINE 1: SELECT id, name, address, is_active, created_ts, updated_ts ...

Now we will need to update the Python script (etl.py) to account for the schema change, or implement a system that can handle schema changes.

While this is a simple schema change, in production systems, upstream teams often move quickly and make numerous changes. When you don't have a system or process in place to handle schema changes, a significant amount of developer time will be spent resolving schema issues.

If not managed, this can cause significant chaos and waste the time of data engineers as they attempt to fix the issues.
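One common mitigation is to stop hardcoding column names and instead discover the current schema at runtime. The sketch below is a minimal, illustrative version of that idea (it assumes the same connection string and suppliers table as etl.py); it keeps the extract step working through a column rename, though downstream models still need to be reconciled.

python
# Sketch: discover the suppliers table's current columns at runtime instead of
# hardcoding them, so an upstream rename doesn't break the extract step.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:mysecretpassword@localhost:5432/postgres")
cursor = conn.cursor()

# Ask the catalog which columns exist right now.
cursor.execute(
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_schema = 'public' AND table_name = 'suppliers' "
    "ORDER BY ordinal_position"
)
columns = [row[0] for row in cursor.fetchall()]

# Build the SELECT from the live schema; works whether the key column is
# called id or supplier_id.
cursor.execute(f"SELECT {', '.join(columns)} FROM suppliers")
for row in cursor.fetchall():
    print(dict(zip(columns, row)))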

plaintext
# Reverse the schema change
docker exec -it application-db psql -U postgres -c "ALTER TABLE suppliers RENAME COLUMN supplier_id TO id;"

Problem 3: ETL Maintenance Overhead

A single Python ETL script is relatively simple to maintain. However, between handling schema changes, connecting to different source systems, and ETL-ing multiple tables, APIs, and webhooks, the amount of development work needed to build these systems increases significantly.

Then comes maintenance:

  • What happens when the schema of the upstream source changes?
  • What happens when the upstream system is upgraded and you need to update the connection script?
  • What happens when a pipeline fails and an on-call engineer needs to fix issues?
  • What should the runbooks look like?
  • What are the systems for monitoring and alerting? How would potential issues be resolved?
  • What about issue resolution for end users (when applicable)?

While building ETL scripts can help you get started quickly, maintenance will soon become a giant time sink.

Log-Based CDC: Advantages and Scaling Challenges

Change Data Capture is a software design pattern that enables the capture of every change that occurs to a dataset. Compared to traditional ETL, a log-based CDC ETL pipeline brings stronger data integrity but introduces new challenges at scale.

There are multiple ways to implement CDC, such as:

  1. Custom code: Capture every change in the application itself, e.g., the outbox pattern, where an application making a change to the data also sends out an event describing the change (a sketch follows this list).
  2. Log tables: In this approach, every change to a table is recorded in a log table. A log-table-based approach can be implemented in the application or with database triggers. Log tables significantly increase the amount of data that needs to be stored in the application database.
  3. Log-based replication: In this approach, every change to the data is pulled directly from the database's log. Most databases have a log to which changes are written before they are applied to disk storage.
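For illustration, a minimal (hypothetical) outbox pattern in our setup could look like the sketch below: the application writes the business change and the change event in the same transaction, and a separate job later ships the outbox rows downstream. The supplier_outbox table is an assumption, not part of the schema we created earlier.

python
# Sketch of the outbox pattern: the data change and the event describing it are
# written in one transaction. The supplier_outbox table is illustrative.
import json
import psycopg2

conn = psycopg2.connect("postgresql://postgres:mysecretpassword@localhost:5432/postgres")
cursor = conn.cursor()

cursor.execute(
    "CREATE TABLE IF NOT EXISTS supplier_outbox ("
    "  event_id SERIAL PRIMARY KEY,"
    "  payload JSONB NOT NULL,"
    "  created_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)

# Business change and outbox event committed atomically.
cursor.execute("UPDATE suppliers SET address = 'Austin, TX, USA' WHERE id = 2")
cursor.execute(
    "INSERT INTO supplier_outbox (payload) VALUES (%s)",
    (json.dumps({"table": "suppliers", "op": "UPDATE", "id": 2, "address": "Austin, TX, USA"}),),
)
conn.commit()  # either both changes land or neither does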

Both the outbox pattern and the log table pattern require a significant amount of custom code from upstream teams, who generally don't have the time or are external, making these approaches infeasible in many cases.

Log-based replication systems offer a decoupled architecture, reducing the workload required from upstream teams. Let's continue with this section, assuming a log-based replication system.

While every database has its own log system, Postgres's WAL (write-ahead log) is representative of most database logs, so let's use it here.

Let's set up WAL for our suppliers table. Run the following commands on your terminal.

plaintext
docker exec -it application-db psql -U postgres -c "ALTER SYSTEM SET wal_level = logical;"
docker restart application-db && sleep 5
docker exec -it application-db psql -U postgres -c "ALTER ROLE postgres WITH REPLICATION;"
docker exec -it application-db psql -U postgres -c "GRANT pg_read_all_data TO postgres;"
docker exec -it application-db psql -U postgres -c "ALTER TABLE suppliers REPLICA IDENTITY FULL;"
docker exec -it application-db psql -U postgres -c "CREATE PUBLICATION cdc_publication FOR ALL TABLES;"
docker exec -it application-db psql -U postgres -c "SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');"

Let's review what happens in the code above:

  • Set wal_level = logical, which ensures all the changes to a table are stored in the log with enough detail for logical decoding.
  • Restart Postgres for the wal_level change to take effect.
  • Grant our user (postgres) replication rights and permission to read all data.
  • Set REPLICA IDENTITY FULL on the suppliers table so that updates and deletes are logged with the full old row.
  • Create a publication called cdc_publication for all the tables, enabling our suppliers table to publish its changes.
  • Create a replication slot (from which the logs will be read). Note that we use the test_decoding plugin here, which is intended for testing purposes.

Note: In production systems, we would use the richer and more performant pgoutput plugin.

PostgreSQL write-ahead log showing captured changes for CDC

Let's examine some WAL entries. Run the following commands on your terminal.

plaintext
docker exec -it application-db psql -U postgres -c "INSERT INTO suppliers ( name, address, is_active ) VALUES ( 'Solutions Inc.', 'Japan', TRUE );"
docker exec -it application-db psql -U postgres -c "UPDATE suppliers SET name = 'Asia Soln' WHERE name = 'Solutions Inc.';"

# Let's look at our WAL entries
docker exec -it application-db psql -U postgres -c "SELECT * FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL, 'include-xids', '0');"
| lsn | xid | data |
| --- | --- | --- |
| 0/155B398 | 749 | BEGIN |
| 0/155B400 | 749 | table public.suppliers: INSERT: id[integer]:3 ... |
| 0/155B7E8 | 749 | COMMIT |
| 0/155B820 | 750 | BEGIN |
| 0/155B820 | 750 | table public.suppliers: UPDATE: id[integer]:3 ... |
| 0/155B8C0 | 750 | COMMIT |

Note: The lsn and xid may be different on your run.

The BEGIN .. COMMIT block indicates a transaction (i.e., the changes were committed to PostgreSQL's storage). The transaction block ensures ACID compliance.

Every entry has a monotonically increasing key called the LSN (log sequence number), which uniquely identifies its position in the WAL.

The logical readers (which most databases provide) ensure that only committed transactions are readable.

When we use the pgoutput plugin, we also retrieve the old values (as shown below).

| lsn | xid | data |
| --- | --- | --- |
| 0/155B820 | 750 | BEGIN |
| 0/155B820 | 750 | table public.suppliers: UPDATE: id[integer]:3 ... 'Asia Soln' (new value) 'Solutions Inc.' (old value) |
| 0/155B8C0 | 750 | COMMIT |

From the above explanation, we can see how CDC handles the problems we faced using traditional ETL.

  1. Missing updates: Since all changes to our table are captured, we will not miss any updates.
  2. Schema evolution: CDC captures both old and new row values and indicates when schema changes occur. Some connectors enable various approaches to handle schema evolution.
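Before we look at the challenges, here is a minimal, illustrative sketch of what a CDC consumer for our setup could look like. It only peeks at the committed changes in the cdc_slot replication slot; a real consumer would consume them with pg_logical_slot_get_changes, parse each payload, and upsert the rows into the warehouse, which leads directly to the challenges below.

python
# Minimal CDC consumer sketch (illustrative only): it peeks at the committed
# changes in the replication slot created earlier. A real consumer would parse
# each payload and upsert the rows into dim_suppliers.
import psycopg2

source = psycopg2.connect("postgresql://postgres:mysecretpassword@localhost:5432/postgres")
source.autocommit = True
cursor = source.cursor()

# Peek does not advance the slot; pg_logical_slot_get_changes would consume it.
cursor.execute("SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL);")
for lsn, xid, data in cursor.fetchall():
    if not data.startswith("table public.suppliers:"):
        continue  # skip BEGIN/COMMIT markers and other tables
    print(lsn, data)  # a real consumer would apply this change to the warehouse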

CDC Challenge 1: Recovery from Failures

While such a consumer script appears simple enough, any failure means rerunning it, and to rerun it we need to know the last event that was processed.

Postgres's WAL has an increasing key called the LSN (log sequence number). Your CDC system needs to be able to restart processing from where it previously failed.

For this, we need to store the LSN in persistent storage. This information (aka state), representing the position up to which our system has processed records, needs to be stored, updated, and retrieved as needed.
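As an illustration, one simple approach is to keep a small checkpoint table in the warehouse and commit the last processed LSN together with the rows it produced. The cdc_checkpoint table below is an assumption for this sketch, not part of the schema we created earlier.

python
# Sketch: persist the last processed LSN in the warehouse so the consumer can
# resume where it left off. Table and column names are illustrative.
import psycopg2

warehouse = psycopg2.connect("postgresql://postgres:mysecretpassword@localhost:5433/postgres")
cursor = warehouse.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS cdc_checkpoint (slot_name TEXT PRIMARY KEY, last_lsn PG_LSN);")
warehouse.commit()

def load_checkpoint():
    cursor.execute("SELECT last_lsn FROM cdc_checkpoint WHERE slot_name = 'cdc_slot';")
    row = cursor.fetchone()
    return row[0] if row else None

def save_checkpoint(lsn):
    # Commit this in the same transaction as the rows you applied, so the
    # checkpoint and the warehouse data never drift apart.
    cursor.execute(
        "INSERT INTO cdc_checkpoint (slot_name, last_lsn) VALUES ('cdc_slot', %s) "
        "ON CONFLICT (slot_name) DO UPDATE SET last_lsn = EXCLUDED.last_lsn;",
        (lsn,),
    )
    warehouse.commit()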

CDC Challenge 2: Handling Primary Key Changes

When modeling data in the warehouse, a table's primary key is often used to detect updates to existing rows. If the primary key itself is updated upstream, it is no longer possible to tell which of the older rows a change belongs to.

You would need to create your own CDC key, for example by taking an MD5 hash of all the column values (except the create/update timestamp columns) to build a custom primary key.
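A minimal sketch of such a custom key, assuming we hash every column except the timestamp columns:

python
# Sketch: derive a surrogate CDC key by hashing a row's business columns.
# The column selection is illustrative; exclude volatile columns such as
# created_ts/updated_ts so the key stays stable across reloads.
import hashlib

def cdc_key(row: dict) -> str:
    excluded = {"created_ts", "updated_ts"}
    parts = [f"{col}={row[col]}" for col in sorted(row) if col not in excluded]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

print(cdc_key({"id": 1, "name": "Tech Solutions Inc.", "address": "New Jersey", "is_active": True,
               "created_ts": "2024-01-01", "updated_ts": "2024-02-01"}))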

CDC Challenge 3: Scaling to Large Data Volumes

While a simple script may work for small data sizes, as the data grows to medium or large sizes, a simple script will not suffice, as there may be numerous changes to catch up on.

For such cases, we will need to incorporate a middle layer to store the events to be processed. Kafka is typically used for this purpose to enable scaling up for large table sizes and scaling out to consume change events from multiple tables.
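For illustration, a producer that publishes change events to Kafka could look like the sketch below (it assumes a local broker on localhost:9092 and the confluent-kafka client; the topic name and key are also assumptions). Keying events by table or primary key matters for ordering, as discussed in the list that follows.

python
# Sketch: publish WAL change events to Kafka. Broker address, topic name, and
# key choice are illustrative assumptions.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_change(lsn: str, data: str) -> None:
    # Key by table (or the row's primary key) so related events share a
    # partition; Kafka only guarantees ordering within a single partition.
    event = {"lsn": lsn, "payload": data}
    producer.produce("suppliers_changes", key="public.suppliers", value=json.dumps(event))

# Call producer.flush() after publishing a batch of WAL entries.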

With scale, we face the following concerns:

  1. Event Duplication: In a large-scale CDC setup there are 3 systems: a system to pull log events, a system to store in-flight change events, and a system to process the log events into the warehouse. With Kafka in the mix, you might get multiple copies of the same event. There are ways to design for at-least-once, at-most-once, and exactly-once delivery (Reference Docs), but you will have to ensure your producers and consumers are designed for this, and that each event has a unique key (which not all data sources have).
  2. Event Order: Another concern is the order of events. A significant amount of effort has been invested in keeping database logs consistent and ordered, as this forms the cornerstone of ACID compliance. However, when we introduce a distributed middle layer (Kafka), we may inadvertently lose the order of events. To overcome this, route all events for a table to the same partition, typically by keying them on the table's primary key, since Kafka only guarantees event order within a single partition. A consumer-side sketch that guards against both duplicates and out-of-order events follows this list.
Change Data Capture event order not guaranteed for consumers
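Here is a minimal, illustrative sketch of such a consumer-side guard: it tracks the highest LSN applied per row and skips anything older or already seen. In a real system this state would live in durable storage rather than in memory.

python
# Sketch: consumer-side guard against duplicate and out-of-order change events.
# In practice, last_applied would live in durable storage (e.g., the warehouse).
last_applied = {}  # row key -> highest LSN applied so far

def lsn_to_int(lsn: str) -> int:
    # A Postgres LSN like '0/155B820' is two hex numbers; combine them so they compare correctly.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def should_apply(row_key: str, lsn: str) -> bool:
    position = lsn_to_int(lsn)
    if position <= last_applied.get(row_key, -1):
        return False  # duplicate or stale (out-of-order) event: skip it
    last_applied[row_key] = position
    return True

print(should_apply("suppliers:3", "0/155B820"))  # True: first time we see this change
print(should_apply("suppliers:3", "0/155B820"))  # False: duplicate delivery
print(should_apply("suppliers:3", "0/155B400"))  # False: an older change arriving late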

CDC Challenge 4: Bootstrapping and Backfills

If any issue requires downtime or reprocessing data from the beginning (e.g., backfilling an upstream table), you will need to perform a snapshot dump and start reading the change events at the same time.

When you begin a snapshot (depending on the size of the source data), your system will take a while to read and ingest all the information, and during that window the snapshotted rows in the source may be updated. This bootstrapping problem becomes significantly worse as the table size increases.

Big Tech companies have developed systems like Netflix's DBLog to address this issue, ensuring that change events are captured and applied correctly alongside the initial dump.
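As a rough illustration (not DBLog itself), the key idea is to create the replication slot before taking the snapshot, so every change made while the snapshot is running is retained and can be replayed on top of it. The backfill_slot name and the apply step are assumptions for this sketch.

python
# Sketch: backfill plus catch-up. The slot is created *before* the snapshot,
# so changes made during the snapshot accumulate there instead of being lost.
import psycopg2

source = psycopg2.connect("postgresql://postgres:mysecretpassword@localhost:5432/postgres")
source.autocommit = True
cursor = source.cursor()

# 1. Create a slot first; it pins a consistent starting point in the WAL.
cursor.execute("SELECT * FROM pg_create_logical_replication_slot('backfill_slot', 'test_decoding');")

# 2. Take the snapshot. Rows may change while this runs, but those changes
#    accumulate in the slot.
cursor.execute("SELECT id, name, address, is_active FROM suppliers;")
snapshot_rows = cursor.fetchall()

# 3. Replay the changes that happened during the snapshot on top of it,
#    applying them in LSN order (last write wins).
cursor.execute("SELECT lsn, xid, data FROM pg_logical_slot_get_changes('backfill_slot', NULL, NULL);")
for lsn, xid, data in cursor.fetchall():
    pass  # upsert the change into the warehouse copy here

# Drop the illustrative slot so it does not retain WAL indefinitely.
cursor.execute("SELECT pg_drop_replication_slot('backfill_slot');")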

CDC Challenge 5: Database Upgrades and Complexities

Sometimes database upgrades break WAL formats, and your CDC system needs to handle these cases. There are also database-specific nuances, such as Postgres TOAST, where large values are stored out of line on disk and the WAL only contains a pointer to them; in such cases, your system needs to query the specific table and row to retrieve the TOASTed data.

To recap, we saw how CDC-based systems can overcome many of the issues faced by traditional ETL systems. However, with the increasing scale and variability of input source systems, you will need to spend a significant amount of time building industry-standard features for data movement.

You know your data team's time would be better spent adding business value. There is a way to streamline data movement for your enterprise: enter Estuary.

Estuary: Easy to Start & Easy to Scale Out

As we saw in the comparison of Log-Based CDC and Traditional ETL, building and maintaining reliable data pipelines requires significant effort. Estuary Flow simplifies CDC ETL pipelines by combining ingestion, transformation, and delivery into one scalable platform.

Estuary is a unified data movement and transformation platform for the whole data team.

It enables you to build low-latency ETL and ELT pipelines using connectors for any database or data warehouse, leveraging Change Data Capture (CDC) to power your analytics, operations, and AI.

Check out the following links to learn about:

  1. Estuary’s built-in schema evolution system
  2. How Estuary achieves exactly-once delivery
  3. How to get started with data streaming on Estuary

Your data team can focus on enabling stakeholders to use data effectively, without worrying about whether all upstream changes are captured or spending time building and maintaining pipelines for a multitude of input sources.

In addition to supporting CDC, Estuary can also fall back to batch processing when CDC is not possible, either due to governance issues or performance concerns related to writing a full WAL in a production environment.


Conclusion

To recap, we compared Log-Based CDC vs Traditional ETL, showed how Change Data Capture overcomes many of ETL’s limitations, and explained how Estuary helps data teams scale without the heavy maintenance burden.

  1. We looked at a high-level comparison of traditional ETL, CDC-based ETL, and Estuary.
  2. We explored how CDC solves issues like missed updates, schema evolution, and recovery challenges that traditional ETL pipelines face.
  3. We saw how Estuary Flow enables data teams to deliver reliable data to stakeholders without manually managing complex data movement systems.

Empower your data team to deliver value and see their impact, without having to manage the data movement system manually. Let the data team handle the tasks that add immediate value to your business by utilizing Estuary to facilitate data movement.

Estuary Flow is rapidly gaining adoption among teams that want to simplify their data stacks by consolidating real-time ingestion, transformation, and delivery into a single, scalable system.

With Estuary, you can capture changes in real time, handle schema evolution automatically, and scale effortlessly. Get started free today.

 

About the author

Joseph Machado, Data Engineer

Joseph Machado is a Software engineer specializing in data with over a decade of experience optimizing data infrastructure and driving organizational growth. Throughout his career, he has contributed to both enterprise-level organizations and small to medium-sized businesses, gaining comprehensive insights into the unique operational challenges and opportunities across different business scales. His expertise spans the full spectrum of data engineering solutions, from building data platforms to modeling data to enable decision-making.
