
What are your best options for using change data capture (CDC) in a data pipeline for analytics, operations, data science/ML, generative AI projects, or all of these projects at once?
This guide helps answer that question, both for a single project with a single destination, and as a broader platform across multiple projects and destinations.
It first summarizes the major challenges you need to consider for CDC-based pipelines, including reliable capture, end-to-end schema evolution and change, snapshots and backfilling, the different types and speeds of loading data, and managing end-to-end data consistency.
The guide then continues with a side-by-side comparison of the leading vendors and technologies before moving into a more detailed description of their strengths, weaknesses, and when to use each.
In the end, ELT and ETL, messaging, and streaming analytics/stream processing vendors and technologies were all included as the major options to consider:
- ELT: Fivetran was included to represent the category. Hevo, Airbyte, and Meltano were close enough to Fivetran that it didn’t make sense to cover all of them separately; the Fivetran column in the comparison matrix effectively evaluates all of these vendors. If you’d like a more detailed comparison of Fivetran, Hevo, and Airbyte, you can read A Data Engineer’s Guide to Fivetran Alternatives.
- ETL: Estuary was the main option for real-time and hybrid (streaming + batch) ETL. You could also consider some of the more traditional ETL vendors such as Informatica, Matillion, and Talend.
- Streaming integration: Striim is one of the leading vendors in this category. Compared to others, they’ve added the most data integration capabilities on top of their CDC technology.
- Replication: Debezium is the most widely used open-source CDC option. There can’t be any CDC comparison without Debezium. GoldenGate and Amazon DMS deserve consideration as well (more on them below).
This guide evaluates each vendor for the main CDC use cases:
- Database replication - replication across databases for operational use cases
- Operational data store (ODS) - creating a store for read offloading or staging data
- Historical (traditional) analytics - loading a data warehouse
- Operational data integration - data synchronization across apps
- Data migration - retiring an old app or database and replacing it with a new one
- Stream processing and streaming analytics - real-time processing of streaming data
- Operational analytics - analytics or reporting for operational use cases
- Data science and machine learning - data mining, statistics, and data exploration
- AI - data pipelines for generative AI and AI-assisted applications
It also evaluates each vendor based on the following categories and features:
- Connectors - CDC sources, non-CDC sources, and destinations
- Core features - full/incremental snapshots, backfilling, delivery guarantee, time travel, schema evolution, DataOps
- Deployment options - public (multi-tenant) cloud, private cloud, and on-premises
- The “abilities” - performance, scalability, reliability, and availability
- Security - including authentication, authorization, and encryption
- Costs - ease of use, vendor costs, and other costs you need to consider
Key Takeaways
Change data capture (CDC) is most valuable when you need low-latency change capture with minimal load on source databases.
The hardest CDC problems are not extraction, but snapshots, backfills, schema evolution, and keeping multiple destinations consistent.
ELT tools (such as Fivetran and similar vendors) work well for warehouse-first analytics, but typically operate as scheduled syncs rather than strict real-time pipelines.
Debezium with Kafka is the most common open-source CDC foundation, but it requires building and operating additional services for replay, backfills, and schema management.
Stream processing platforms (such as Striim) are powerful for real-time transformations, but their complexity and connector behavior must be evaluated carefully.
Platforms designed for multi-destination reuse, messaging, and reliable backfills (such as Estuary) are better suited when the same CDC data must support analytics, operations, and AI workloads.
Use cases for CDC
You have two options when choosing CDC technologies today. The first is to choose a technology purpose-built for a specific type of project. The second is to start building a common real-time data backbone that supports all your projects.
There are great reasons, especially business reasons, for choosing either a specialized or a general-purpose CDC technology. The most important thing to understand is the strengths and weaknesses of each technology.
Where each vendor started continues to be one of the best ways to understand its strengths and weaknesses for different use cases. Most vendors remain best suited to their original market.
If you review the history of CDC, you can easily understand the strengths and weaknesses of each vendor.
Build CDC pipelines at the right time
See how Estuary supports CDC, batch, and streaming across multiple destinations in one platform.
Replication
CDC was invented as a built-in feature for database-specific replication. Someone realized that the write-ahead transaction log (WAL), which records transactions made after a database snapshot so the database can recover after a failure, was an ideal source of changes for replication. Just stream the writes from the log as they happen and use them to create exact replicas. It’s not only the lowest-latency option; it also puts the least load on the database itself.
But database-specific CDC remained just that. Each vendor used it to scale a database deployment, not for general-purpose data sharing. In general, you can only create database replicas; you can’t merge data from different databases, for example.
Decades of hardening built-in replication helped identify the best architecture for more general-purpose cross-database replication. By now, log-based CDC is mature and hardened. It is the most cost-effective and lowest-load way to extract real-time data from databases.
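To make this concrete, here is a minimal sketch of what log-based capture looks like on Postgres, using psycopg2’s logical replication support and the built-in test_decoding output plugin. The connection settings and slot name are placeholders, and it assumes a source configured with wal_level=logical; real CDC products layer snapshots, schema handling, and delivery guarantees on top of this basic loop.

```python
# Minimal sketch: read changes from the Postgres WAL via a logical replication slot.
# Assumes wal_level=logical on the source; connection details and slot name are placeholders.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "host=localhost dbname=appdb user=replicator password=secret",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create a slot using the built-in test_decoding plugin (skip if it already exists).
try:
    cur.create_replication_slot("cdc_demo", output_plugin="test_decoding")
except psycopg2.errors.DuplicateObject:
    pass

cur.start_replication(slot_name="cdc_demo", decode=True)

def handle_change(msg):
    # Each message is a decoded change (INSERT/UPDATE/DELETE) from the WAL.
    print(msg.data_start, msg.payload)
    # Acknowledge so the database can recycle WAL segments behind this position.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)  # blocks, streaming changes as they are committed
```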
GoldenGate, founded in 1995, was one of the first major vendors to release general-purpose replication.
Operational data stores (ODS)
Companies started to use GoldenGate and other similar vendors to create operational data stores (ODS), read-only stores with up-to-date operational data, for data sharing, offloading reads, and staging areas for data pipelines. Eventually, Oracle acquired GoldenGate in 2009.
GoldenGate and similar products aren’t included in the final comparison because this guide focuses on broader CDC pipeline platforms, not replication-first products. But if you’re only looking for a replication product and need to support Oracle as a source, GoldenGate should be on your short list.
Most data replication vendors like GoldenGate predate SaaS and DataOps. As a result, they are a little behind in these areas.
Historical analytics
CDC has been used for roughly 20 years now to load data warehouses, starting with ETL vendors. It is not only proven; it provides the lowest latency, the least load on the source, and the greatest scalability for a data pipeline.
ETL vendors had their own staging areas for data, which allowed them to stream from various sources, process the data in batch, and then load destinations at batch intervals. Historically, data warehouses preferred batch loading because they are columnar stores. (A few, like Teradata, Vertica, or Greenplum, could handle real-time data ingestion and still support fast query times.)
The original ETL vendors like Informatica continue to evolve, and Informatica was one of the first to release a cloud offering (Informatica Cloud). Other competitors to consider include Talend, now part of Qlik, and Matillion. These three, and many other smaller vendors, are good options for ETL, especially for on-premises data warehouses, data migration, and operational data integration (see below), and they offer solid CDC support. But they are missing some of the more recent CDC innovations, including incremental snapshots and backfilling, as well as modern data pipeline features such as schema evolution and DataOps support.
Modern ELT vendors like Fivetran, Hevo, and Airbyte implemented CDC-based replication to support cloud data warehouse projects. But these ELT vendors don’t have an intermediate staging area. They need to extract and load at the same interval.
For this reason, most ELT platforms prioritize scheduled syncs, so they’re usually better for analytics workloads than strict real-time CDC requirements.
CDC is meant to be real-time. It’s designed for real-time capture of change data from a transaction or write-ahead log (WAL). CDC uses the WAL as a buffer when read delays happen, but the WAL was never intended to support large-scale buffering for batch-based reads. Increasing WAL/log retention to buffer batch-style reads can add latency and operational risk under load.
Make sure you understand each vendor’s CDC architecture and limitations like this, and evaluate whether it will meet your needs. If you’d like to learn more about this, read Change Data Capture (CDC) Done Correctly.
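Whatever vendor you choose, it’s worth monitoring how much WAL your CDC consumer is holding back. Below is a small, hedged example for Postgres that estimates retained WAL per replication slot; the connection details are placeholders, and other databases have equivalent views for log or redo retention.

```python
# Minimal sketch: estimate how much WAL each replication slot is retaining.
# A slot that falls far behind (or is abandoned) forces Postgres to keep WAL around,
# which is the risk described above when CDC is consumed in large batch intervals.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=appdb user=monitor password=secret")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT slot_name,
               active,
               pg_size_pretty(
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
               ) AS retained_wal
        FROM pg_replication_slots
        ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC
        """
    )
    for slot_name, active, retained_wal in cur.fetchall():
        print(f"{slot_name}: active={active}, retained WAL ~ {retained_wal}")
```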
Operational Data Integration and Data Migration
While ETL vendors started as tools for loading data warehouses, they quickly ended up with 3 major use cases: data warehousing (analytics), operational data integration (data synchronization), and data migration. By 2010 ETL vendors were also using CDC across all three use cases.
Data migration is very similar to operational data integration, so it’s not included as its own use case.
ELT vendors are typically not used for operational integration or data migration because both require real-time movement and ETL-style transformations.
Stream Processing and Streaming Analytics
Around the same time that replication and ETL were evolving, messaging and event processing software was being used for various real-time control systems and applications. Technologies started to develop on top of messaging, in many different variants. Collectively, you could call all these technologies stream processing and streaming analytics. This category has also been described as ‘operational intelligence’ in analyst research. These technologies have always been more complex than ETL or ELT.
- Stream processing was initially called complex event processing (CEP) when it gained some popularity around 2005, and later became known more generally as stream processing. Following the open-sourcing of Apache Kafka in 2011, technologies like Apache Samza and Apache Flink started to emerge in this category.
- Streaming analytics, which was more directly tied to BI and analytics, was another variant. Many streaming analytics vendors used rules engines or machine learning techniques to look for patterns in data.
Several of the people involved with GoldenGate went off to found Striim in 2012 to focus on using CDC for replication, stream processing, and streaming analytics. Over time they added connectors to support integration use cases.
But Striim remains more complex than ELT and ETL with its proprietary SQL-like language (TQL) and streaming concepts like windowing. This is what mostly limits Striim to stream processing and sophisticated real-time use cases.
Operational analytics
Messaging software started to evolve to support operational analytics over two decades ago. In 2011, Kafka was open-sourced after originating at LinkedIn. Very quickly people started to use Kafka to support real-time operational analytics use cases. Today, if you look at the myriad of high-performance databases that get used to support sub-second analytics in operations - ClickHouse, Druid/Imply, Elasticsearch, Firebolt, Pinot/StarTree, Materialize, RisingWave, Rockset (acquired by OpenAI in 2024) - Kafka is one of the most common ingestion backbones used to load them.
Modern Real-time Data Pipelines
Ever since the rise of modern technology giants - including Facebook, Amazon, Apple, Netflix, and Google (formerly grouped under FAANG) - we’ve watched their architectures, read their papers, and tried out their open source projects to help get a glimpse into the future. It’s not quite the same; the early versions of these technologies are often too complex for many companies to adopt. But those first technologies end up getting simplified. This includes cloud computing, Hadoop/Spark, BigQuery and Snowflake, Kafka, stream processing, and yes, modern CDC.
Over the last decade CDC has been changing to support modern streaming data architectures. People started talking about turning the database inside out, with projects like Apache Samza recasting the database as streams of change data. Netflix and others built custom replication frameworks whose learnings led to modern CDC frameworks like Debezium and Gazette. Both have started to simplify initial (incremental) snapshotting and add modern pipeline (DataOps) features, including schema evolution.
All of these companies have had real-time backbones for a decade that stream data from many sources to many destinations. They use the same backbone for operations, analytics, data science and ML, and generative AI. These technologies continue to evolve into the core of modern real-time data pipelines.
Shared streaming data pipelines across operations are becoming more common, and they’re also working their way towards analytics. They’re more the norm for newer AI projects.
- Tens of thousands of companies use Kafka and other messaging technologies in operations.
- Kafka is one of the most popular ways to load BigQuery, Databricks, and Snowflake, to name just a few destinations, and their streaming support has improved. Ingestion latency has dropped into the low seconds at scale.
- Spark and Databricks, BigQuery, and Snowflake are increasingly used for operational workloads. Just look at the case studies and conference sessions. Query performance continues to improve, though it’s not quite sub-second at scale.
- Many high-performance OLAP databases now see a significant share of deployments using streaming ingestion, especially Kafka.
- AI projects are only accelerating the shift, and many generative AI use cases support operations. They are also driving the need for more data destinations.
You can either choose replication, ELT, or streaming technologies for specific use cases, or you can start to use this latest round of technologies to build a modern real-time data pipeline: a common backbone that shares data across operations, analytics, data science/ML, and AI.
The requirements for a common modern data pipeline are relatively well defined. They are a mix of traditional requirements for operational data integration and newer data engineering best practices. They include support for:
- Real-time and batch sources and destinations, including extracting data with real-time CDC and batch-loading data warehouses. If you want to learn why CDC should be real-time, you can read Change Data Capture (CDC) Done Correctly.
- Many destinations at the same time, with only one extract per source. Most ETL only supports one destination, which means you need to re-extract for each destination.
- A common intermediate data model for mapping source data in different ways to destinations.
- Schema evolution and automation to minimize pipeline disruption and downtime. This is in addition to modern DataOps support, which is a given.
- Transformations for those destinations that do not support dbt or other types of transformations.
- Stream-store-and-forward to enable exactly-once extraction semantics, reliable delivery, and data reuse. Kafka is one of the most common streaming backbones today. But it is missing services for backfilling, time travel, and updating data during recovery or schema evolution. Kafka retention is configurable but typically finite. You can either build these services along with a data lake or get some of the services as part of your pipeline technology.
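As a small illustration of the retention point in the last item above, this hypothetical sketch uses the confluent-kafka Python AdminClient to check a topic’s retention.ms; the broker address and topic name are placeholders. Changes older than that window can no longer be replayed from Kafka alone.

```python
# Minimal sketch: check how long a Kafka topic retains change events.
# Broker address and topic name are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
topic = "cdc.appdb.orders"

# describe_configs() returns {resource: future}; each result is {config name: ConfigEntry}.
futures = admin.describe_configs([ConfigResource(ConfigResource.Type.TOPIC, topic)])
for resource, future in futures.items():
    configs = future.result()
    retention_ms = int(configs["retention.ms"].value)
    if retention_ms < 0:
        print(f"{topic}: retention.ms=-1 (unbounded; unusual outside compacted topics)")
    else:
        print(f"{topic}: retention.ms={retention_ms} (~{retention_ms / 86_400_000:.1f} days)")
        print("Changes older than this window cannot be replayed from Kafka alone.")
```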
These features are not required for individual projects, though a few are nice to have. They only become really important once you have to support multiple projects.
Comparison Criteria
Now that you understand the different use cases and their history, and the concept of a modern data pipeline, you can start to evaluate different vendors.
The detailed comparison below covers the following categories:
1. Use cases - Over time, you will end up using data integration for most of these use cases. Make sure you look across your organization for current and future needs. Otherwise you might end up with multiple data integration technologies, and a painful migration project.
- Replication - Read-only and read-write load balancing of data from a source to a target for operational use cases. CDC vendors are often used in cases where built-in database replication does not work.
- Operational data store (ODS) - Real-time replication of data from many sources into an ODS for offloading reads or staging data for analytics.
- Historical analytics - The use of data warehouses for dashboards, analytics, and reporting. CDC-based ETL or ELT is used to feed the data warehouse.
- Operational data integration - Synchronizing operational data across apps and systems, such as master data or transactional data, to support business processes and transactions. NOTE: None of the vendors in this evaluation support many apps as destinations; they only support data synchronization for the underlying databases.
- Data migration - This usually involves extracting data from multiple sources, building the rules for merging data and data quality, testing it out side-by-side with the old app, and migrating users over time. The data integration vendor used becomes the new operational data integration vendor. NOTE: Like operational data integration, the vendors only support database destinations.
- Stream processing - Using streams to capture and respond to specific events
- Operational analytics - The use of data in real-time to make operational decisions. It requires specialized databases with sub-second query times, and usually also requires low end-to-end latency with sub-second ingestion times as well. For this reason the data pipelines usually need to support real-time ETL with streaming transformations.
- Data science and machine learning - This generally involves loading raw data into a data lake that is used for data mining, statistics and machine learning, or data exploration including some ad hoc analytics. For data integration vendors this is very similar to data warehousing.
- AI - the use of large language models (LLM) or other artificial intelligence and machine learning models to do anything from generating new content to automating decisions. This usually involves different data pipelines for model training, and model execution.
2. Connectors - The ability to connect to sources and destinations in batch and real-time for different use cases. Most vendors have so many connectors that the best way to evaluate vendors is to pick your connectors and evaluate them directly in detail.
- Number of connectors - The number of source and target connectors. What’s important is the number of high-quality and real-time connectors, and that the connectors you need are included. Make sure to evaluate each vendor’s specific connectors and their capabilities for your projects. The devil is in the details.
- CDC sources - Does the vendor support your required CDC sources now for current and future projects?
- Destinations - Does the vendor support all the destinations that need the source data, or will you need to find another way to load for select projects?
- Non-CDC - when needed for batch-only destinations or to lower costs, is there an option to use batch loading into destinations?
- Support for 3rd party connectors - Is there an option to use 3rd party connectors?
- CDK - Can you build your own connectors using a connector development kit (CDK)?
- API - Is an admin API available to help integrate and automate pipelines?
3. Core features - How well does each vendor support core data features required to support different use cases? Source and target connectivity are covered in the Connectors section.
- Batch and streaming support - Can the product support streaming, batch, and both together in the same pipeline?
- Transformations - What level of support is there for streaming and batch ETL and ELT? This includes streaming transforms, and incremental and batch dbt support in ELT mode. What languages are supported? How do you test?
- Delivery guarantees - Is delivery guaranteed to be exactly once, and in order?
- Data types - Support for structured, semi-structured, and unstructured data types.
- Backfilling - The ability to add historical data during integration, or later additions of new data in targets.
- Time travel - The ability to review or reuse historical data without going back to sources.
- Schema drift - Support for tracking schema changes over time and handling them automatically.
- DataOps - Does the vendor support multi-stage pipeline automation?
4. Deployment options - does the vendor support public (multi-tenant) cloud, private cloud, and on-premises (self-deployed)?
5. The “abilities” - How does each vendor rank on performance, scalability, reliability, and availability?
- Performance (latency) - what is the end-to-end latency in real-time and batch mode?
- Scalability - Does the product provide elastic, linear scalability?
- Reliability - How does the product ensure reliability for real-time and batch modes? One of the biggest challenges, especially with CDC, is ensuring reliability.
6. Security - Does the vendor implement strong authentication, authorization, RBAC, and end-to-end encryption from sources to targets?
7. Costs - the vendor costs, and total cost of ownership associated with data pipelines
- Ease of use - The degree to which the product is intuitive and straightforward for users to learn, build, and operate data pipelines.
- Vendor costs - including total costs and cost predictability
- Labor costs - Amount of resources required and relative productivity
- Other costs - Including additional source, pipeline infrastructure or destination costs
Comparison Matrix
|  | Debezium/Kafka | Fivetran (+ Airbyte + Hevo + Meltano) | Striim (and GoldenGate) | Estuary |
| --- | --- | --- | --- | --- |
| Use cases |  |  |  |  |
| Database replication | Real-time replication (sub-second to seconds) | Scheduled sync (1 min to hours, plan/connector dependent) | Real-time (and batch) replication (sub-second to hours) | Real-time and batch replication (sub-second to hours) |
| Replication to ODS | Yes (real-time only) | Possible (DB destinations), but analytics-first | Yes | Yes (real-time and batch) |
| Historical analytics | Replication only, limited sources | Yes (batch-only ELT) | Yes (real-time + incremental batch; more complex) | Yes (batch and streaming ELT/ETL) |
| Op. data integration | No | No (no ETL support) | Yes (real-time replication; transforms via TQL) | Yes |
| Data migration | No | No (no ETL support) | No | Yes |
| Stream processing | Yes (via Kafka and coding, or streaming into a destination) | No | Yes (Striim via TQL; GoldenGate is replication-first) | Yes (using SQL, TypeScript) |
| Operational analytics | Yes | Higher-latency batch ELT only | Yes (TQL transforms) | Yes (streaming SQL, TypeScript) |
| Data science and ML | Possible (not turnkey; requires Kafka ecosystem tooling) | ELT only | Not used | Yes (ELT/ETL) |
| AI pipelines | Kafka support by vector database vendors; custom coding (API calls to LLMs, etc.) | Limited (vector DB destinations like Milvus in preview) | Limited (stream to Google, Microsoft, AWS, Snowflake) | Vector database support, API calls to ChatGPT and other AI, data prep/transform |

|  | Debezium/Kafka | Fivetran (+ Airbyte + Hevo + Meltano) | Striim (and GoldenGate) | Estuary |
| --- | --- | --- | --- | --- |
| Connectors |  |  |  |  |
| Number of connectors | 100+ Kafka sources and destinations (via Confluent, vendors) | <300 connectors | 100+ | 200+ |
| CDC connectors (sources) | PostgreSQL (includes TimescaleDB), MySQL, SQL Server, MongoDB, Oracle, Db2 (plus community connectors depending on distro) | MySQL, SQL Server, Postgres, Oracle (scheduled CDC; single destination per sync) | Cosmos DB, MariaDB, MongoDB, MySQL, Oracle, Postgres, SQL Server | AlloyDB, Firestore, MySQL, Postgres, MariaDB, MongoDB, Oracle, Salesforce, Snowflake, SQL Server |
| Destinations | Kafka ecosystem (indirect): data warehouses, OLAP databases, time-series databases, etc. | Fivetran destinations (https://fivetran.com/docs/destinations): data warehouses, databases (above), SingleStore, Materialize, MotherDuck | Striim targets (https://www.striim.com/docs/en/targets.html) | Estuary integrations (https://estuary.dev/integrations/) |
| Non-CDC connectors | Kafka ecosystem | Batch only | No | 200+ native batch and real-time connectors |
| Support for 3rd-party connectors | Kafka ecosystem | No | No | Support for 500+ Airbyte, Stitch, and Meltano connectors |
| Custom SDK | Kafka Connect | Limited (Functions; Lite connectors) | No | Yes |
| API | Kafka API | Yes | No | Yes (Estuary API docs) |

|  | Debezium/Kafka | Fivetran (+ Airbyte + Hevo + Meltano) | Striim (and GoldenGate) | Estuary |
| --- | --- | --- | --- | --- |
| Core features |  |  |  |  |
| Batch and streaming (extract and load) | Streaming-centric (subscribers can pick up at intervals) | Scheduled sync (1 min to 24 hrs) | Streaming-centric, but can do incremental batch | Can mix streaming and batch in the same pipeline |
| Snapshots | Full or incremental | Initial full sync + incremental updates | Incremental | Incremental |
| ETL transforms | Coding (SMT) | None | TQL transforms | SQL, TypeScript, Python |
| Workflow | Coding or 3rd party (including OSS) | None | Striim Flows | Many-to-many pub-sub ETL (SQL, TypeScript) |
| ELT transforms | No | ELT only, with dbt | No | dbt, with integrated orchestration |
| Delivery guarantees | At-least-once by default; exactly-once possible with Kafka Connect | Exactly once (Fivetran); at least once (Airbyte) | Exactly-once processing (per Striim), connector-dependent | Exactly once (streaming, batch, mixed) |
| Multiple destinations | Yes (identical data by topic) | No | Yes (same data, but different flows) | Yes (different data per destination) |
| Backfilling | Yes (requires re-extract for each destination) | Yes (requires re-extract for each destination) | Yes (requires re-extract for new destinations) | Yes (extract once, backfill multiple destinations) |
| Time travel | No | No | No | Yes |
| Schema inference and drift | Message-level schema evolution (Kafka Schema Registry), with limits by source and destination | Good schema inference, automated schema evolution | Yes, with some limits by destination | Good schema inference, automated schema evolution |
| DataOps support | CLI, API | CLI | CLI, API | CLI |

|  | Debezium/Kafka | Fivetran (+ Airbyte + Hevo + Meltano) | Striim (and GoldenGate) | Estuary |
| --- | --- | --- | --- | --- |
| Deployment options | Open source, Confluent Cloud (public) | Public cloud; private cloud (Airbyte, Meltano OSS) | On-prem, private cloud, public cloud | Open source, private cloud, public cloud |
| The “abilities” |  |  |  |  |
| Performance (minimum latency) | <100 ms (streaming) | Scheduled syncs; as low as 1 minute on Enterprise/Business Critical plans (connector-dependent) | <100 ms (streaming) | <100 ms (streaming) |
| Scalability | High | Medium–High (HVR higher; Fivetran lower per-connector throughput) | High (GB/sec) | High |
| Reliability | High (Kafka); medium (Debezium connectors) | Medium–High (CDC reliability varies by source and connector) | High | High |
| Security |  |  |  |  |
| Data source authentication | SSL/SSH | OAuth, HTTPS, SSH, SSL, API tokens | SAML, RBAC, SSH/SSL, VPN | OAuth 2.0 / API tokens |
| Encryption | Encryption in transit (Kafka topic security; at-rest depends on broker/storage) | Encryption in transit | Encryption in transit | Encryption at rest and in transit |

|  | Debezium/Kafka | Fivetran (+ Airbyte + Hevo + Meltano) | Striim (and GoldenGate) | Estuary |
| --- | --- | --- | --- | --- |
| Support | Community support (Debezium); enterprise support via Confluent Cloud | Medium (generally good ratings; support quality varies by plan and connector) | High (enterprise support model) | High (direct vendor support; fast response and resolution reported by customers) |
| Costs |  |  |  |  |
| Ease of use | Hard (requires Kafka, connector setup, and ongoing maintenance) | Easy for ingestion; dbt and cost management require learning | Medium–Hard (requires learning Flows and TQL) | Easy for core pipelines; streaming transforms require some learning |
| Vendor costs | Low (open source); Medium–High with Confluent Cloud | High | High (enterprise licensing) | Low (usage-based pricing; generally lower at higher volumes) |
| Data engineering costs | High (custom development for pipelines, recovery, schema handling) | Low–Medium (simplified ingestion; dbt-based transforms) | High (proprietary language and stream-processing expertise required) | Low–Medium (higher productivity via built-in transforms and schema automation) |
| Admin costs | High (infrastructure, Kafka ops, connector management) | Medium–High (monitoring, troubleshooting, CDC edge cases) | High (platform administration and tuning) | Low (managed platform; minimal operational overhead) |

Evaluating CDC Tools
The sections below compare leading CDC technologies, outlining where each works well, where it falls short, and which use cases it best supports.
1. Debezium
Debezium has been sponsored by Red Hat since the project started in 2015, following the release of Kafka and Kafka Connect. It aligns with ideas popularized by Martin Kleppmann’s ‘turning the database inside out’ work.
Debezium is the open source option for general-purpose replication, and it does many things right for replication, from scaling to incremental snapshots (use Debezium’s incremental snapshots to avoid long, blocking full snapshots when possible.) If you are committed to open source, have the specialized resources needed, and need to build your own pipeline infrastructure for scalability or other reasons, Debezium is a great choice.
Otherwise think twice about using Debezium because it will be a big investment in specialized data engineering and admin resources. While the core CDC connectors are solid, you will need to build the rest of your data pipeline including:
- The many non-CDC source connectors you will eventually need. You can leverage the Kafka Connect-based connectors to over 100 different sources and destinations, but they have a long list of limits (see the Confluent docs on limits).
- Data schema management and evolution - while the Kafka Schema Registry does support message-level schema evolution, the number of limitations on destinations and the translation from sources to messages makes this much harder to manage.
- Kafka retention is configurable but typically finite. Kafka supports replay within configured retention, but long-term backfills/time travel typically require additional storage and tooling.
- By default, snapshot (backfill) events and CDC events are written to the same topics. So if you need to redo a snapshot, all destinations will get it. If you want to change this behavior you need to have separate source connectors and topics for each destination, which adds costs and source loads.
- You will need to maintain your Kafka cluster(s), which is no small task.
If you are already invested in Kafka as your backbone, it does make good sense to evaluate Debezium. Using Confluent Cloud does simplify your deployment, but at a cost.
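To give a feel for what operating Debezium involves, here is a hypothetical sketch that registers a Debezium Postgres connector through the Kafka Connect REST API using Python. Hostnames, credentials, and the table list are placeholders, and property names such as topic.prefix and signal.data.collection should be verified against the Debezium documentation for the version you run.

```python
# Minimal sketch: register a Debezium Postgres source connector with Kafka Connect.
# All connection details are placeholders; verify property names against the
# Debezium docs for the version you run.
import requests

connector = {
    "name": "appdb-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "appdb.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "appdb",
        "topic.prefix": "appdb",                 # prefix for change topics, e.g. appdb.public.orders
        "table.include.list": "public.orders,public.customers",
        "plugin.name": "pgoutput",               # Postgres's built-in logical decoding plugin
        "snapshot.mode": "initial",              # full snapshot once, then stream the WAL
        # Incremental snapshots are triggered later by inserting rows into a signal table:
        "signal.data.collection": "public.debezium_signal",
    },
}

resp = requests.post(
    "http://connect.internal:8083/connectors",
    json=connector,
    timeout=30,
)
resp.raise_for_status()
print("Connector created:", resp.json()["name"])
```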
2. Fivetran
Fivetran is listed here as the main ELT vendor. But this section is meant to represent Fivetran, Airbyte, Hevo, Meltano, Stitch, and other ELT vendors. For more on the other vendors including Airbyte and Hevo you can read A Data Engineer’s Guide to Fivetran Alternatives.
If you are only loading a cloud data warehouse, you should evaluate Fivetran and other ELT vendors. They are easy-to-use, mature options.
If you want to understand and evaluate Fivetran, it’s important to know Fivetran’s history. It will help you understand Fivetran’s strengths and weaknesses relative to other vendors, including other ELT vendors.
Fivetran was started in late 2012 by George Fraser and Taylor Brown, who wanted an integrated stack to capture and analyze data. The name was a play on Fortran and meant to refer to a programming language for big data. After a few years the focus shifted to providing just the data integration part because that’s what so many prospects wanted. Fivetran was designed as an ELT (Extract, Load, and Transform) architecture because in data science you don’t usually know what you’re looking for, so you want the raw data.
In 2018, Fivetran raised its Series A, and then added more transformation capabilities in September 2020 when it released Data Build Tool (dbt) support. Fivetran supports CDC for key databases and expanded its enterprise CDC capabilities significantly with the HVR acquisition (announced 2021, completed Oct 2021).
Fivetran’s design worked well for many companies adopting cloud data warehouses starting a decade ago. While all ETL vendors also supported “EL” and it was occasionally used that way, Fivetran was cloud-native, which helped make it much easier to use. The “EL” is mostly configured, not coded, and the transformations are built on dbt core (SQL and Jinja), which many data engineers are comfortable using.
But today Fivetran often comes up in conversations as a vendor customers are trying to replace. Understanding why can help you understand Fivetran’s limitations.
The most common points that come up in these conversations and online forums are about needing lower latency, improved reliability, and lower, more predictable costs:
- Latency: Fivetran supports plan-based sync frequencies (including 1 minute on Enterprise or Business Critical for most standard connectors), but this is still scheduled batch syncing and end-to-end freshness also depends on source limits, destination loading, and downstream transforms.
- Costs: Another major complaint is Fivetran’s high vendor costs, which customers have reported at anywhere from 2-10x the cost of Estuary. Fivetran costs are based on monthly active rows (MAR) that change at least once per month. This may seem low, but for several reasons (see below) it can quickly add up. Lower latency is also very expensive: reducing latency from 1 hour to 15 minutes can cost 33-50% more (1.5x) per million MAR, and reducing it to 1 minute can cost 100% (2x) or more. Even then, you still have the latency of the data warehouse load and transformations. The additional cost and time of frequent ingestion and transformation in the data warehouse add up as well, so companies often keep latency high to save money.
- Unpredictable costs: Another major reason for high costs is that MAR are based on the unique identifiers (primary keys) Fivetran uses to track transfers each month, counted separately per destination, connection, and table. For some data sources you have to extract all the data across tables, which can mean many more rows. Fivetran also converts data from non-relational sources such as SaaS apps into highly normalized relational data. Both can make MAR counts, and costs, soar unexpectedly.
- Reliability: Another reason for replacing Fivetran is reliability. Customers have struggled with a combination of load-failure alerts and subsequent support calls that result in a longer time to resolution. CDC reliability and freshness depend on connector behavior, sync scheduling, source constraints, and how long syncs take relative to the configured frequency. During evaluation, review the vendor’s public status page, incident history, and connector-specific docs, and validate how incidents are counted toward uptime commitments for your contract. Make sure you understand Fivetran’s current SLA in detail: SLA definitions vary by plan and contract, so verify how downtime is defined (connector-level vs. platform-level), what’s excluded, and what credits apply.
- Support: Customers also complain about Fivetran support being slow to respond. Combined with reliability issues, this can lead to a substantial amount of data engineering time being lost to troubleshooting and administration.
- DataOps: Fivetran does not provide much control or transparency into what they do with data and schema. Fivetran applies standard naming conventions, so destination column names can differ from the source; type changes may introduce renamed or _deprecated columns, and renaming is typically handled downstream (for example in dbt). This can make it harder to migrate to other technologies. Fivetran also doesn’t always bring in all the data depending on the data structure, and does not explain why.
3. Striim
Pronounced “Stream”, Striim is a replication and stream processing vendor that has since moved into data integration.
Several of the people involved with GoldenGate moved on to found Striim in 2012 to focus on using CDC for replication, stream processing, and streaming analytics. Over time, they added connectors to support integration use cases.
If you need replication and also need to do more complex stream processing, Striim is a great option and a proven vendor. It has some of the best CDC around, including its Oracle support. It is designed for high-throughput, low-latency CDC at scale. If you have the skill sets to do stream processing, then it can also be a good option for supporting both stream processing and data integration use cases.
But its origins in stream processing are its weakness for data integration use cases.
- ELT tools are much easier to learn and use than Striim. This comes up during evaluations.
- Tungsten Query Language (TQL) is a SQL-like language designed for stream processing. While it’s very powerful, it’s not as simple as SQL.
- Striim Flows are a great graphical tool for stream processing, but they take time to learn, and you need to build a flow (or use TQL) for each capture. This makes it much more complex than ELT vendors for building CDC flows from sources to a destination.
- While Striim can persist streams (for example to Kafka) for recovery, it does not provide a built-in long-term storage layer for managed backfills or time travel; backfilling typically requires taking a new snapshot. That leads to a lack of older change data in newer destinations or losing older change data.
Striim is a great option for stream processing and related real-time use cases. If you do not have the skill sets or the appetite to spend the extra time learning TQL and the intricacies of stream processing, an ELT/ETL vendor might make more sense.
4. Estuary
Estuary was founded in 2019. Its core technology is based on Gazette, an open-source streaming and storage system that has been developed and used for large-scale, real-time data workloads for over a decade, particularly in high-throughput environments such as ad tech.
Estuary is the right-time data platform in this comparison: it is designed from the ground up to support change data capture (CDC), batch ingestion, and event streaming together, while letting teams choose when data moves (sub-second, near real-time, or batch) using a single platform.
Unlike tools that originated as ELT, replication, or stream processing systems, Estuary was built specifically to serve as a shared data backbone across multiple use cases and destinations at once.
- Architecture: Estuary captures data once and stores it durably as collections, which act as a transactional intermediate layer between sources and destinations. This architecture supports both streaming and batch pipelines while minimizing repeated reads from source systems.
- Multi-destination pipelines: Data captured into collections can be delivered to multiple destinations without re-extracting from the source. This makes it easier to support analytics, operational systems, and downstream applications from the same data.
- Delivery guarantees: Estuary provides exactly-once delivery semantics across streaming and batch pipelines. This applies to CDC, batch ingestion, and fan-out to multiple destinations.
- Backfilling and reprocessing: Because data is stored durably, Estuary supports managed backfills and reprocessing. New destinations can be added using historical data without taking a new snapshot from the source system.
- Messaging and streaming: Estuary supports messaging and event-streaming use cases alongside analytics pipelines. Data can be delivered to Kafka destinations, and using Dekaf, Estuary collections can be consumed as Kafka topics by standard Kafka clients.
- Transformations and DataOps: Estuary supports SQL (SQLite), TypeScript, and Python for transformations. Schema evolution is handled automatically where possible, and pipelines are managed through versioned specifications with built-in testing.
- Connectivity: Estuary provides 200+ native connectors designed for low latency and scale, and also supports Airbyte, Meltano, and Stitch connectors to extend coverage. Third-party connectors are validated before production use.
- Costs: Estuary pricing is based on data movement, at $0.50 per GB moved. This avoids per-row pricing and repeated extraction costs, and can result in lower and more predictable costs at scale.
Honorable Mentions
There are a few other vendors you might consider outside of this short list.
GoldenGate
GoldenGate was the original replication vendor, founded in 1995. It is rock-solid replication technology that was acquired by Oracle in 2009. It supports filtering, mapping, and row-level transforms (not DDL), but it was built long before modern DataOps was even a twinkle in any data engineer’s eye. If you just need replication or are building an operational data store, it is arguably the best option for you, and also the most expensive; it should be on your short list. GoldenGate also has a host of source and destination connectors.
You can also use it for replication (EL) into a low-latency destination. But it is not well suited for other use cases. GoldenGate supports transformations (for example, via column conversion functions and mapping rules), but it isn’t designed for multi-destination pipeline reuse, time travel, or modern schema-change automation.
Amazon DMS
DMS stands for Database Migration Service, not CDC service. So if you’re thinking about using it for general-purpose CDC … think really carefully.
In 2016, Amazon released the Database Migration Service (DMS) to help migrate on-prem databases to AWS services. In June 2023, AWS released DMS Serverless, which reduces infrastructure management by automatically provisioning and scaling migration resources, but it still comes with service-specific constraints (for example, serverless replications rely on VPC endpoints for several AWS endpoint types).
DMS is a great data migration service for moving to Aurora, DynamoDB, RDS, Redshift, or an EC2-hosted database. But it’s not for general-purpose change data capture (CDC).
Whenever you use something outside of its intended use, you can expect to get some weird behaviors and errors. Look at the forums for DMS issues, then look at what they’re trying to use it for, and you’ll start to see a pattern.
By many accounts, DMS was built using an older version of Attunity. If you look at how DMS works and compare it to Attunity, you can see the similarities. That should tell you some of its limitations as well.
Here are some specific reasons why you shouldn’t use DMS for general-purpose CDC:
- You’re limited to CDC database sources and Amazon database targets. DMS requires using a VPC for all targets. You also can’t do some cross-region migrations, such as with DynamoDB. If you need more than this for your data pipeline, you’ll need to add something else. NOTE: Serverless is even more restricted, so make sure to check the limitations pages.
- The initial full load of source data is done table by table, as of a point in time. Changes to each table are cached and applied before it goes into CDC-based replication. This is old-school Attunity, and full load + CDC can impact source performance depending on configuration, which matters at scale.
- You’re limited in scale. For example, DMS capacity limits depend on the replication instance class (standard DMS) or the serverless replication configuration you choose. For DMS Serverless, there are also feature limitations: not all endpoint types are supported, serverless requires VPC endpoints for several AWS services, and it does not support custom CDC start points.
Also, please remember the difference between migration and replication! Using a database migration tool with replication can lead to mistakes. For example, if you forget to drop your Postgres replication slot at the end of a migration, bad things can happen to your source.
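If you do use DMS (or any CDC tool) against Postgres, a quick post-migration check for leftover replication slots is cheap insurance. The sketch below, with placeholder connection details and slot name, lists slots and drops one you know is abandoned; never drop a slot that an active consumer still needs.

```python
# Minimal sketch: find leftover Postgres replication slots after a migration.
# Connection details and the slot name to drop are placeholders.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=appdb user=admin password=secret")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("SELECT slot_name, plugin, active FROM pg_replication_slots")
    for slot_name, plugin, active in cur.fetchall():
        print(f"slot={slot_name} plugin={plugin} active={active}")

    # Only drop a slot you are certain is abandoned; an inactive slot keeps WAL
    # accumulating on the source until it is removed.
    cur.execute("SELECT pg_drop_replication_slot(%s)", ("old_dms_slot",))
```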
Talk to a CDC expert
Discuss your CDC use case, architecture, and performance requirements with the Estuary team.
How to choose the best option
For teams evaluating CDC technologies, the right choice depends on how broadly data needs to be shared and how much operational complexity you’re willing to manage.
If you are looking for a cloud-based option that supports real-time and batch pipelines, scales to high data volumes, and minimizes operational overhead, Estuary is worth evaluating.
- Real-time and batch support: Estuary supports sub-second CDC as well as native batch ingestion, allowing teams to combine streaming and batch workloads in the same pipelines.
- Scalability: Estuary is designed for high-throughput data movement and operates in the same general scalability class as Debezium/Kafka and Striim, while ELT tools typically prioritize scheduled syncs over sustained throughput.
- Efficiency: By capturing data once and reusing it across multiple destinations, Estuary can reduce repeated extraction and source load compared to tools that require separate pipelines per destination.
- Reliability: Estuary’s transactional delivery model and durable intermediate storage help support exactly-once semantics across streaming and batch pipelines, which simplifies recovery and backfilling.
- Cost considerations: Estuary’s data-movement-based pricing model tends to be more predictable at higher volumes, particularly when supporting multiple downstream systems from the same source data.
- Operational overhead: For teams that want to avoid building and maintaining their own streaming and CDC infrastructure, Estuary offers a managed alternative to assembling Kafka, CDC tooling, and batch systems independently.
If you are committed to open source and want to build a custom data backbone, Debezium with Kafka is the primary alternative. This approach is proven at large scale but typically requires significant investment in engineering, operations, and long-term maintenance.
Ultimately, the best approach is to evaluate tools against your current and future requirements for connectivity, latency, scalability, reliability, security, and cost — and to choose a solution that will still fit as your data use cases expand.
Getting Started with Estuary
If you want to explore Estuary hands-on, you can start with a free account.
- Create an account to build your first CDC pipeline.
- Review the documentation, especially the getting started guides, to understand core concepts and pipeline setup.
- Join the Estuary Slack community to ask questions and get help during evaluation and early implementation.
- Watch the Estuary 101 webinar for an end-to-end walkthrough of common CDC use cases.
- For architecture questions or help evaluating fit for your use case, you can also contact the Estuary team.
FAQs
What is the difference between snapshotting and backfilling in CDC?
When is Debezium plus Kafka the best option?
How should teams choose between replication tools, ELT, streaming integration, and a shared data backbone?

About the author
Rob has worked extensively in marketing and product marketing on database, data integration, API management, and application integration technologies at WSO2, Firebolt, Imply, GridGain, Axway, Informatica, and TIBCO.