
The Problem with Change Data Capture (CDC)
Change Data Capture (CDC) has evolved significantly, becoming a core part of modern data architectures. However, it still faces persistent challenges, particularly around reliability and auditability. Addressing them requires rethinking the reference architecture, with a focus on data consistency, system resilience, and operational efficiency.
This article explores a new CDC architecture designed to overcome key limitations such as log storage constraints, expensive backfills, and schema drift, offering a scalable, reliable, and audit-ready approach to data flows.
Common Challenges in CDC Architecture
Let's analyze the flaws inherent in most conventional CDC architectures.
Unbounded Transaction Log Storage
CDC implementations must acknowledge their position in the transaction log of systems like PostgreSQL so the database can reclaim consumed log segments; without this, the log grows without bound. Failure to manage acknowledgment properly can exhaust disk space, potentially impacting production environments.
This challenge manifests in several scenarios (a monitoring sketch follows the list):
- Small Table Captures: When CDC is configured to capture changes only from small tables with infrequent updates, the replication slot's acknowledged position stops advancing between updates, so Write-Ahead Log (WAL) generated by the rest of the database accumulates and drives excessive storage usage. For example, imagine a finance application tracking audit logs in a small table that only receives updates during monthly reviews. The WAL will grow unchecked until the next update, risking storage overflow.
- Destination Downtime: If the destination system becomes unavailable, the CDC system must absorb the backpressure without letting the transaction log fill the disk.
- Backfills: During historical data backfills, storage can quickly become overwhelmed if WAL acknowledgments are paused.
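To make the risk measurable, here is a minimal monitoring sketch in Python with psycopg2, assuming PostgreSQL 10 or later; the connection string and alert threshold are illustrative. It reports how much WAL each replication slot is forcing the server to retain.

```python
import psycopg2

# Alert when a slot pins more than 5 GiB of WAL (illustrative threshold).
RETAINED_WAL_LIMIT = 5 * 1024**3

conn = psycopg2.connect("dbname=app user=monitor")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # restart_lsn is the oldest WAL position the slot still needs; its
    # distance from the current LSN is the WAL the server must retain.
    cur.execute("""
        SELECT slot_name, active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        if retained_bytes and retained_bytes > RETAINED_WAL_LIMIT:
            print(f"WARNING: slot {slot_name} (active={active}) is retaining "
                  f"{retained_bytes / 1024**2:.0f} MiB of WAL")
```

An inactive slot whose retained WAL keeps climbing is exactly the small-table or destination-downtime scenario described above.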
Auditability & Time Travel
CDC provides a foundational mechanism for auditability through transaction logs, which are critical for complying with regulations like SOX. Additionally, retaining these logs enables time travel capabilities, allowing organizations to reconstruct historical states for debugging, analytics, or recovery purposes. The challenge lies in balancing the storage costs with the operational benefits of retaining detailed historical data.
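As a toy illustration of time travel, the sketch below folds a retained change log into the state of a table as of a chosen timestamp; the event shape (op, ts, key, row) is a simplified stand-in for real CDC formats.

```python
from typing import Iterable

def state_as_of(events: Iterable[dict], as_of_ts: float) -> dict:
    """Fold insert/update/delete events up to as_of_ts into a key -> row map."""
    state: dict = {}
    for ev in events:  # events are assumed ordered by commit time
        if ev["ts"] > as_of_ts:
            break
        if ev["op"] in ("insert", "update"):
            state[ev["key"]] = ev["row"]
        else:  # delete
            state.pop(ev["key"], None)
    return state

# Reconstruct the table as it looked at t=100, before the delete at t=120.
log = [
    {"op": "insert", "ts": 10,  "key": 1, "row": {"balance": 50}},
    {"op": "update", "ts": 90,  "key": 1, "row": {"balance": 75}},
    {"op": "delete", "ts": 120, "key": 1, "row": None},
]
print(state_as_of(log, 100))  # {1: {'balance': 75}}
```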
For a deep dive into CDC best practices, check out CDC Done Correctly.
Expensive Backfills for New Destinations
Adding new destinations often requires extensive backfills, which can strain source databases. A well-architected CDC system minimizes this load by leveraging stored transaction logs, enabling historical data to be replayed without repeatedly querying the production database.
For example, a SaaS company expanding to a new region might need to replicate historical customer data to a regional data center. Without CDC log storage, this process would place a significant load on the live production database, affecting user performance.
Backfill Automation
Efficient backfills are essential for frictionless data integration. An ideal CDC implementation integrates backfill processes with real-time replication, ensuring data consistency without manual intervention, and it keeps WAL growth in check during these operations to maintain system stability.
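One widely used pattern, popularized by Netflix's DBLog, interleaves short chunked table scans with the live change stream so a backfill never holds a long transaction or stalls WAL acknowledgment. The sketch below shows just the chunked-scan half in Python with psycopg2, with an assumed integer primary key `id` and illustrative table and column names.

```python
import psycopg2

CHUNK_SIZE = 1000  # rows per chunk; small enough to keep each transaction short

def backfill_chunks(dsn: str, emit):
    """Stream the (illustrative) users table in primary-key order as short,
    independent chunk reads, so the scan never blocks WAL acknowledgment."""
    last_id = 0
    conn = psycopg2.connect(dsn)
    while True:
        with conn, conn.cursor() as cur:  # one short transaction per chunk
            cur.execute(
                "SELECT id, email FROM users WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, CHUNK_SIZE),
            )
            rows = cur.fetchall()
        if not rows:
            break
        for row in rows:
            emit(row)           # downstream, interleave with live change events
        last_id = rows[-1][0]   # keyset pagination: resume after the last key
    conn.close()
```

Because each chunk is its own short read, the live CDC consumer keeps confirming WAL positions throughout the backfill, which is what keeps log growth bounded.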
Schema Drift and Migrations
Managing schema changes is a complex aspect of CDC. Automated mechanisms are needed to detect and propagate schema modifications downstream, including handling new columns, datatype changes, and primary key adjustments. Effective CDC systems reduce the risk of data discrepancies during schema migrations.
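As a sketch of what such automation might look like, the snippet below compares each incoming change event against a cached destination schema and propagates additive column changes before applying the row; the type inference and `alter_fn` hook are deliberately naive placeholders.

```python
# Cached view of the destination schema: column name -> type.
known_schema = {"id": "bigint", "email": "text"}

def reconcile_schema(event_row: dict, alter_fn):
    """Detect columns present in a change event but missing downstream,
    and propagate them before the row is applied (additive changes only)."""
    for col, value in event_row.items():
        if col not in known_schema:
            inferred = "bigint" if isinstance(value, int) else "text"  # naive inference
            alter_fn(col, inferred)  # e.g. issue ALTER TABLE ... ADD COLUMN
            known_schema[col] = inferred

# Example: an upstream migration added a 'plan' column.
reconcile_schema(
    {"id": 1, "email": "a@b.co", "plan": "pro"},
    alter_fn=lambda col, typ: print(f"ALTER TABLE users ADD COLUMN {col} {typ}"),
)
```

Datatype changes and primary key adjustments are harder; they generally call for versioning the destination table rather than altering it in place.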
Key Requirements for a Modern CDC Architecture
- Transaction Acknowledgment: Timely acknowledgment of processed transactions prevents unbounded log growth.
- Real-Time Data Capture: Continuous data capture ensures low-latency updates, which is critical for real-time analytics and applications.
- Heartbeats: Periodic heartbeat signals verify system health and ensure transaction logs are regularly trimmed (see the sketch after this list).
- Watermarking: This technique tracks data processing progress, aiding recovery and ensuring data consistency.
- Durable Storage for Transaction Logs: Reliable, external storage mitigates data loss risks and supports historical data replay.
- Decoupled Replay Capability: The ability to replay transaction logs independently of source systems enhances flexibility and reduces operational dependencies.
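To illustrate the heartbeat requirement, here is a minimal sketch that periodically touches a dedicated heartbeat row, giving the capture a steady trickle of confirmable changes even when the captured tables are idle; the `cdc_heartbeat` table and interval are assumptions, not a prescribed schema.

```python
import time
import psycopg2

HEARTBEAT_INTERVAL_S = 30  # illustrative; tune to your WAL retention budget

def run_heartbeat(dsn: str):
    """Upsert a single heartbeat row forever, assuming a table created as:
    CREATE TABLE cdc_heartbeat (id int PRIMARY KEY, beat_at timestamptz)."""
    conn = psycopg2.connect(dsn)
    while True:
        with conn, conn.cursor() as cur:  # each upsert is its own transaction
            cur.execute(
                "INSERT INTO cdc_heartbeat (id, beat_at) VALUES (1, now()) "
                "ON CONFLICT (id) DO UPDATE SET beat_at = now()"
            )
        time.sleep(HEARTBEAT_INTERVAL_S)
```

Debezium exposes a comparable knob (heartbeat.interval.ms); the underlying point is the same: an idle source still needs periodic confirmable traffic so the slot can advance.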
A New Reference Architecture for CDC
To address these requirements, the following architectural components are recommended:
- Real-Time CDC Captures: Implement CDC systems with native support for watermarking and efficient change tracking.
- Durable External Storage: Store transaction logs in resilient, always-available storage systems like cloud object storage.
- Flexible Replay Mechanism: This enables data replays to downstream systems without impacting source databases, facilitating easy integration with new data consumers (a storage-and-replay sketch follows this list).
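Here is a minimal sketch of the storage-and-replay idea in Python with boto3, assuming a hypothetical bucket layout of batched, newline-delimited JSON objects under a sequence-numbered prefix; this is illustrative, not any vendor's on-disk format.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "cdc-log-archive"  # hypothetical bucket

def persist_batch(batch: list, seq: int):
    """Write a batch of change events as one immutable, sequence-numbered object."""
    key = f"wal/users/{seq:012d}.jsonl"  # zero-padding keeps keys in replay order
    body = "\n".join(json.dumps(ev) for ev in batch)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode())

def replay(prefix: str = "wal/users/"):
    """Stream archived events to a new destination without touching the source DB."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):  # listed in key (i.e. sequence) order
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            for line in body.decode().splitlines():
                yield json.loads(line)
```

Because replay reads only from object storage, a new destination can be bootstrapped without issuing a single query against the production database.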
Estuary Flow represents a significant shift in how CDC architectures are designed. Instead of following the traditional broker-centric architecture, it leverages cloud-native object storage, providing users with the flexibility to plug in their preferred storage solutions.
At the heart of Flow is a decoupled architecture that separates data capture and materialization from processing and storage. This means that real-time CDC captures are ingested and immediately stored in durable object storage, such as AWS S3, Google Cloud Storage, or Azure Blob Storage. Users can configure their own object storage, ensuring data sovereignty and compliance with any organizational policies.
This object storage-centric approach offers several benefits and solves the challenges of the traditional approach:
- Scalability: By using object storage, Flow can handle massive volumes of CDC data without the limitations of traditional log storage systems. The architecture automatically scales to accommodate growing data needs, making it suitable for enterprises with large, dynamic datasets.
- Durability and Resilience: Object storage provides high durability, ensuring that CDC logs are securely stored and can be accessed for replay or recovery at any time. This eliminates concerns about data loss and enhances system resilience.
- Flexible Data Replay: The decoupled nature of Flow's architecture allows for independent data replay from object storage. This means that historical data can be replayed to new destinations or for recovery purposes without impacting the source systems. This flexibility is critical for minimizing the load on production databases and simplifying data integration workflows.
- Simplified Operations: By abstracting away the complexities of managing traditional log-based systems, Flow reduces the operational burden on engineering teams. The architecture automates many processes, including backfill management and schema evolution, allowing organizations to focus on leveraging their data rather than managing infrastructure.
- Cost Efficiency: Using object storage can be more cost-effective than traditional storage methods, especially when considering the high costs associated with cross-AZ networking and durable replication in systems like Kafka. Estuary Flow's architecture optimizes storage costs while maintaining high performance and reliability.
This architecture not only addresses the common challenges faced by traditional CDC implementations but also sets a new standard for modern data integration.
CDC Implementation Options
Debezium + Kafka + Cloud Storage
This combination represents a traditional, open-source-based CDC architecture. Debezium acts as the CDC engine, Kafka handles real-time data streaming, and cloud storage ensures durable log retention. A connector-registration sketch follows the lists below.
- Strengths:
- Self-hostable, providing flexibility for organizations with specific infrastructure requirements.
- Real-time data streaming capabilities with robust scalability.
- Strong community support and extensibility.
- Weaknesses:
- Replaying logs requires routing through production Kafka brokers, which can introduce capacity bottlenecks.
- High operational overhead for managing Kafka, Kafka Connect, and Debezium components.
- Limited native support for features like automated backfills and schema evolution.
- Kafka's tiered storage suffers from various limitations, such as the inability to compact tiered segments.
- Cross-AZ networking costs for durable replication can quickly eat up your cloud budget.
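For reference, a Debezium Postgres connector is typically registered through Kafka Connect's REST API. The sketch below does so from Python with requests; the hosts, credentials, slot, and table names are placeholders.

```python
import requests

# Register a Debezium Postgres connector with Kafka Connect's REST API.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "db.internal",       # placeholder host
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "app",
        "topic.prefix": "app",                     # Debezium 2.x topic namespace
        "table.include.list": "public.orders",
        "slot.name": "debezium_orders",
        "heartbeat.interval.ms": "30000",          # keeps the slot advancing
    },
}
resp = requests.post("http://connect.internal:8083/connectors", json=connector)
resp.raise_for_status()
```

Even in this stack, much of the WAL-growth story above comes down to a single line: the heartbeat interval.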
Estuary Flow
Estuary Flow offers a modern, managed CDC solution to minimize operational complexity. It automates many traditionally manual processes, making it ideal for organizations looking to scale quickly without heavy engineering involvement.
- Strengths:
- Fully managed service offering real-time data capture with high scalability.
- Automated backfill processes that minimize production impact.
- Simplified operations with minimal engineering effort, even in private networking environments.
- Weaknesses:
- Limited self-hosting options may be a consideration for organizations with strict on-premise requirements, though these concerns are largely mitigated by private deployments.
Conclusion
While CDC has become a staple in data integration strategies, its challenges around reliability, auditability, and operational complexity remain. Organizations can build more resilient and efficient CDC pipelines by adopting a reference architecture emphasizing real-time capture, durable storage, and flexible replay mechanisms.
Whether self-managed with tools like Debezium or fully automated with solutions like Estuary Flow, the implementation choice depends on specific business needs and technical constraints. Regardless of the approach, the goal remains the same: ensuring data integrity, availability, and ease of use across the data ecosystem.

About the author
David Yaffe is a co-founder and the CEO of Estuary. He previously served as the COO of LiveRamp and the co-founder/CEO of Arbor, which was sold to LiveRamp in 2016. He has an extensive background in product management, serving as head of product for Doubleclick Bid Manager and Invite Media.