
7 Best Tools to Stream & Ingest Data into Apache Iceberg (2026)

Discover 7 powerful tools to stream and ingest data into Apache Iceberg. Build real-time, scalable pipelines for your data lakehouse with ease.


TL;DR

Apache Iceberg supports transactional, scalable lakehouses, but ingesting data into Iceberg requires tools that handle streaming, CDC, and batch updates correctly. This article compares 7 tools for ingesting data into Iceberg, highlighting how they differ in latency, schema handling, and operational complexity.


Streaming Data into Apache Iceberg: Tools for a Scalable Data Lakehouse

Apache Iceberg has transformed how organizations handle large-scale data, offering features like ACID transactions, schema evolution, and time travel. It allows businesses to build robust data lakehouses that unify structured and unstructured data for analytics and machine learning.

To fully leverage Iceberg’s capabilities, efficient data ingestion and streaming are crucial. Whether it’s real-time streaming, batch processing, or change data capture (CDC), choosing the right ingestion tool can ensure data consistency, performance, and ease of use.

This article explores 7 top tools for streaming and ingesting data into Apache Iceberg. From real-time data integration platforms to scalable batch processing engines, these solutions cater to a range of use cases and organizational needs, making it easier to harness the full power of your data lakehouse.

7 Best Tools to Stream and Ingest Data into Apache Iceberg

Building an efficient, scalable Iceberg-based data lakehouse starts with choosing the right pipeline tools. Here are 7 solutions that help make real-time streaming and ingestion to Iceberg faster and more reliable:

1. Estuary

Stream and Ingest Real-Time Data into Apache Iceberg With Estuary

Estuary is the Right-Time Data Platform that unifies CDC, streaming, and batch ingestion into a single dependable system. It enables teams to build right-time data pipelines into Apache Iceberg, allowing data to be delivered in near real-time or on scheduled intervals based on cost, latency, and analytics needs.

Rather than writing directly to Iceberg tables row by row, Estuary materializes data into Apache Iceberg by orchestrating Spark jobs on AWS EMR Serverless. Data changes are continuously captured from source systems, staged in Amazon S3, and then merged transactionally into Iceberg tables using Iceberg’s ACID guarantees. This approach ensures correctness, fault tolerance, and predictable performance for production lakehouse workloads.

Estuary supports ingesting data from operational databases, SaaS applications, and event streams, enforcing schemas at ingestion time and safely propagating compatible schema changes into Iceberg tables. This makes it well-suited for teams that need reliable CDC-driven lakehouse ingestion without managing Spark jobs, orchestration layers, or custom pipelines.

Key Features:

  • Right-Time Iceberg Ingestion: Control how frequently data is merged into Iceberg tables, from near real-time micro-batches to scheduled batch updates, balancing freshness, cost, and query performance.
  • Change Data Capture (CDC): Continuously captures inserts, updates, and deletes from source systems and applies them as transactional upserts and deletes in Iceberg tables.
  • Schema Enforcement and Evolution: Enforces schemas upstream and safely handles compatible schema changes, preventing corrupt or incompatible writes to Iceberg.
  • Transactional and Fault-Tolerant Writes: Uses Spark-based transactional merges with exactly-once materialization semantics, ensuring consistency even in the presence of retries or failures.
  • Native Apache Iceberg Integration: Supports Iceberg REST catalogs backed by Amazon S3, including AWS Glue, AWS S3 Tables, and Snowflake Open Catalog, with additional cloud support planned.
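The staged-merge step described above can be sketched as the kind of Iceberg `MERGE INTO` statement a Spark job would execute against staged CDC rows. The table name, staging view, key column, and `_op` change-type column here are illustrative assumptions, not Estuary's actual generated SQL:

```python
# Hedged sketch: building the kind of transactional MERGE a Spark job
# might run to apply staged CDC rows (inserts, updates, deletes) to an
# Iceberg table. All identifiers below are hypothetical.

def build_merge_sql(target: str, staging: str, key: str) -> str:
    """Compose an Iceberg MERGE INTO that upserts rows and applies deletes."""
    return (
        f"MERGE INTO {target} t "
        f"USING {staging} s ON t.{key} = s.{key} "
        "WHEN MATCHED AND s._op = 'delete' THEN DELETE "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED AND s._op != 'delete' THEN INSERT *"
    )

sql = build_merge_sql("lake.orders", "lake.orders_staged", "order_id")
print(sql)
```

Because the whole statement commits as one Iceberg transaction, a retry after a failure either reapplies the full merge or leaves the table untouched, which is what makes exactly-once materialization semantics possible.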


2. Dremio

Dremio is a data lakehouse platform that simplifies data management and analytics. It offers an enterprise data catalog for Apache Iceberg, providing features like data versioning and governance. Dremio's SQL query engine delivers high-performance queries, and its unified analytics layer supports self-service access across various data sources.

3. Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It integrates with Apache Iceberg, allowing users to perform batch and streaming data processing with ease. Spark's DataFrame API enables complex transformations and actions on Iceberg tables, supporting operations like reading, writing, and managing table metadata. 
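As a rough illustration of the Spark integration described above, here is the kind of Spark SQL a batch job might run against an Iceberg catalog. The statements are shown as plain Python strings so they can be read without a running Spark cluster; the catalog, database, table, and snapshot ID are all hypothetical:

```python
# Illustrative Spark SQL for Iceberg tables (all names hypothetical).
# In a real job, each statement would be passed to spark.sql(...) on a
# session configured with an Iceberg catalog named "lake".

create_stmt = """
CREATE TABLE IF NOT EXISTS lake.db.events (
    event_id BIGINT,
    payload  STRING,
    ts       TIMESTAMP
) USING iceberg
PARTITIONED BY (days(ts))
"""

# Batch ingestion: append staged rows into the Iceberg table.
insert_stmt = "INSERT INTO lake.db.events SELECT * FROM staged_events"

# Time travel: query the table as of an earlier snapshot
# (1234 stands in for a real snapshot ID).
time_travel_stmt = "SELECT * FROM lake.db.events VERSION AS OF 1234"

for stmt in (create_stmt, insert_stmt, time_travel_stmt):
    print(stmt.strip())
```

Spark can also drive streaming writes into the same tables via Structured Streaming, using Iceberg's snapshot-based commits for consistency.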

4. Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over data streams. It integrates with Apache Iceberg to provide real-time data ingestion and processing capabilities. Flink's support for event-time processing and exactly-once state consistency ensures accurate and reliable data pipelines when working with Iceberg tables.
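To make the Flink integration concrete, here is a sketch of the Flink SQL DDL that registers an Iceberg sink table, shown as a Python string so it reads without a Flink runtime. The catalog settings follow the Iceberg Flink connector's conventions, but every name and endpoint below is an assumption:

```python
# Hedged sketch: Flink SQL DDL for an Iceberg sink (hypothetical
# catalog, metastore URI, and bucket). A real job would execute this
# via TableEnvironment.executeSql(...) with the Iceberg Flink runtime
# on the classpath.

flink_ddl = """
CREATE TABLE iceberg_sink (
    user_id BIGINT,
    action  STRING,
    ts      TIMESTAMP(3)
) WITH (
    'connector'    = 'iceberg',
    'catalog-name' = 'lake',
    'catalog-type' = 'hive',
    'uri'          = 'thrift://metastore:9083',
    'warehouse'    = 's3://my-bucket/warehouse'
)
"""

# A continuous INSERT then streams rows from a source (e.g. Kafka)
# into the Iceberg table with exactly-once semantics.
streaming_insert = (
    "INSERT INTO iceberg_sink SELECT user_id, action, ts FROM kafka_source"
)
print(flink_ddl.strip())
print(streaming_insert)
```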

5. Kafka Connect

Kafka Connect is a framework for integrating Apache Kafka with external systems, including data lakes and analytical storage. It enables teams to move streaming data from Kafka topics into Apache Iceberg tables, supporting real-time analytics use cases.

Kafka Connect typically relies on third-party Iceberg sink connectors rather than native support. As a result, production deployments often require custom configuration to handle schema evolution, upserts, and deletes, particularly when ingesting CDC-style data. While Kafka Connect integrates well within Kafka-centric ecosystems, managing correctness and transactional consistency with Iceberg can require additional operational effort.
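For a sense of the configuration surface involved, here is a sketch of an Iceberg sink connector definition, modeled on the Apache Iceberg Kafka Connect sink. Property names follow that connector's conventions, but exact keys, catalog settings, and CDC-handling options vary by connector and version, so treat this as a shape rather than a drop-in config:

```python
# Hedged sketch of a Kafka Connect Iceberg sink configuration.
# Topic, table, and catalog URI are hypothetical placeholders.
import json

connector_config = {
    "name": "iceberg-sink",
    "config": {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "orders",
        "iceberg.tables": "db.orders",
        "iceberg.catalog.type": "rest",
        "iceberg.catalog.uri": "https://catalog.example.com",
        # CDC-style data usually needs further settings (upsert mode,
        # record ID columns, schema handling) that differ per version.
    },
}

print(json.dumps(connector_config, indent=2))
```

This JSON would typically be POSTed to the Kafka Connect REST API to create the connector; the operational work of tuning commit intervals and handling schema drift remains with the team running it.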

6. Upsolver

Upsolver is a cloud-native data integration platform designed for high-scale workloads. It simplifies the ingestion and transformation of streaming data into Apache Iceberg tables. In January 2025, Upsolver was acquired by Qlik, a global leader in data integration, data quality, analytics, and AI. This acquisition enhances Qlik's ability to provide real-time data streaming and Iceberg optimization solutions.

7. Fivetran

Fivetran is an automated data movement platform that provides managed connectors for replicating data from a wide range of sources into analytical destinations, including Apache Iceberg. It emphasizes ease of setup and fully managed operations, making it a common choice for batch-oriented ingestion workflows.

Fivetran’s Iceberg support is primarily batch-oriented and may not support fine-grained CDC merge semantics out of the box. While it works well for periodic data replication and analytics use cases, teams with strict real-time or transactional ingestion requirements into Iceberg may need additional tooling or downstream processing.

Conclusion

Streaming and ingesting data into Apache Iceberg is a critical step in building an efficient, scalable data lakehouse. Each tool in this list offers distinct strengths, from high-performance processing engines to user-friendly integration platforms.

While solutions like Apache Spark, Kafka Connect, and Fivetran provide reliable ingestion capabilities, Estuary stands out as the most flexible and dependable option for right-time data delivery. Its combination of real-time streaming, CDC, and schema evolution ensures that Iceberg tables always reflect the most accurate version of your data, with minimal latency and zero manual effort. 

Take control of your data pipelines today! Register for Estuary and start for free. Experience real-time data integration with Apache Iceberg, designed to fit your needs effortlessly.

FAQs

    What is Apache Iceberg, and why is it important?

    Apache Iceberg is an open table format built for large-scale data storage and analytics. It brings features like ACID transactions, schema evolution, and time travel to modern data lakehouses. Iceberg helps organizations unify structured and unstructured data, ensuring consistent, reliable, and high-performance analytics across massive datasets.
    Which tool should I choose for streaming data into Apache Iceberg?

    The right tool depends on how frequently your data changes and how quickly you need insights. Estuary is ideal if you want dependable, right-time data streaming into Iceberg. It combines CDC, batch, and streaming ingestion in one platform, making it easy to maintain accurate Iceberg tables without complex engineering or scheduling jobs.


About the author

Dani Pálma, Head of Data & Marketing

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
