
How to Stream Data from Kafka to Apache Iceberg in Minutes

Learn how to stream data from Apache Kafka into Apache Iceberg efficiently. This guide will introduce you to the foundations of a streaming data lakehouse and showcase the best options for data integrations.


To move data from Apache Kafka to Apache Iceberg, teams typically use one of two approaches:
a managed real-time pipeline for minimal operational overhead, or Kafka Connect with the Apache Iceberg Sink Connector for full infrastructure control. The right choice depends on whether you need low-latency analytics, schema evolution, upserts, and minimal maintenance, or prefer a fully self-managed setup.

Kafka is excellent at ingesting high-throughput event streams, but it is not designed for complex analytical queries. Apache Iceberg solves this by providing a transactional, scalable table format optimized for large analytical datasets and query engines like Spark and Trino. Combining Kafka with Iceberg lets you power real-time ingestion with reliable lakehouse analytics.

Key Takeaways

  • Kafka handles real-time event streaming, while Iceberg enables fast, scalable analytics on that data

  • Kafka data can be written to Iceberg using managed streaming pipelines or Kafka Connect

  • Iceberg supports schema evolution, snapshot isolation, and hidden partitioning, making it well-suited for streaming data lakes

  • Kafka Connect offers fine-grained control but introduces higher operational complexity and ongoing maintenance

  • Managed Kafka-to-Iceberg pipelines such as Estuary reduce setup time, schema-handling overhead, and failure recovery effort

In this guide, you’ll learn two proven ways to stream data from Kafka to Iceberg, including architecture considerations, step-by-step setup instructions, and best practices to avoid common production issues like small files, schema incompatibilities, and commit failures.

Choose the Right Kafka to Iceberg Integration Method

There is no single “best” way to move data from Kafka to Iceberg. The right approach depends on latency requirements, schema evolution complexity, operational ownership, and scale. Most teams choose between a managed streaming pipeline or a self-managed Kafka Connect deployment.

Option 1: Managed Kafka to Iceberg Streaming (Estuary)

A managed Kafka-to-Iceberg pipeline is the best choice when you want low-latency ingestion, automatic schema handling, and minimal operational overhead.

Estuary manages the full data movement lifecycle, including batching Kafka events, handling schema evolution, committing data to Iceberg tables, and recovering from failures. This approach is well-suited for production analytics pipelines where reliability and simplicity matter more than deep connector customization.

Choose this option if you need:

  • Near real-time Kafka to Iceberg ingestion
  • Automatic schema evolution and compatibility handling
  • Built-in batching and file sizing to avoid Iceberg small-file issues
  • Minimal infrastructure to deploy, monitor, and maintain
  • Fast setup without managing Kafka Connect workers or control topics

Trade-offs:

  • Less low-level connector customization compared to Kafka Connect
  • Requires using a managed data platform

Option 2: Kafka Connect with Apache Iceberg Sink Connector

Kafka Connect with the Apache Iceberg Sink Connector is best for teams that require full control over infrastructure, connector configuration, and deployment topology.

This approach allows you to manage task parallelism, commit behavior, routing logic, and catalog configuration directly. However, it also introduces significant operational complexity, including connector upgrades, failure handling, and ongoing table maintenance.

Choose this option if you need:

  • Full control over Kafka Connect and Iceberg configurations
  • Custom routing from Kafka topics to multiple Iceberg tables
  • Tight integration with an existing Kafka Connect ecosystem
  • On-prem or air-gapped deployment requirements

Trade-offs:

  • Higher operational overhead and maintenance costs
  • Manual handling of schema evolution edge cases
  • Greater responsibility for file compaction, snapshot expiration, and failure recovery

Kafka to Iceberg Method Comparison

| Requirement | Managed Streaming (Estuary) | Kafka Connect Iceberg Sink |
| --- | --- | --- |
| Setup time | Minutes | Hours to days |
| Operational overhead | Low | High |
| Schema evolution handling | Automatic | Limited, manual |
| Small-file management | Built-in batching | Manual tuning required |
| Exactly-once guarantees | Handled by platform | Connector-dependent |
| Infrastructure ownership | Minimal | Full responsibility |
| Best for | Production analytics | Custom, self-managed pipelines |

Which Kafka to Iceberg Approach Should You Use?

  • Use a managed streaming pipeline like Estuary if you want the fastest path to production with minimal operational burden
  • Use Kafka Connect if you already operate Kafka Connect at scale and need deep connector-level control
  • For most analytics-focused teams, managed Kafka-to-Iceberg pipelines offer the best balance of reliability, performance, and simplicity


What Is Apache Kafka? 


Apache Kafka is an open-source distributed event streaming platform used to ingest, store, and deliver high-throughput data streams in real time. It is commonly used for event-driven architectures, streaming pipelines, and real-time analytics ingestion.

Kafka organizes data into topics, which are split into partitions and distributed across a cluster of brokers. Producers write events to topics, while consumers read events independently, enabling scalable and fault-tolerant data streaming.
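
As a minimal illustration of this model, the sketch below uses the kafka-python client to publish a JSON event to a topic and read it back. The broker address, topic name, and field names are assumptions to adapt to your cluster.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publish a JSON-encoded event to the "events" topic (broker address is illustrative).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"id": "1", "type": "list", "payload": "hello"})
producer.flush()

# Read events back independently; each consumer group tracks its own offsets.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="demo-readers",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```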

Modern Kafka deployments manage cluster metadata using KRaft (Kafka Raft), removing the dependency on ZooKeeper and simplifying cluster operations. Kafka excels at durable, low-latency event ingestion, but it is not designed for complex analytical queries across large historical datasets.

What Is Apache Iceberg?


Apache Iceberg is an open-source table format for large-scale analytics that brings transactional guarantees and schema evolution to data lakes. Iceberg tables are optimized for query engines such as Apache Spark, Trino, Flink, and Presto, enabling consistent analytics over large datasets stored in object storage.

Iceberg uses table-level metadata to track data files, partitions, and snapshots, allowing query engines to efficiently plan scans even for multi-petabyte tables. This metadata-driven design enables fast query planning, snapshot isolation, and safe concurrent writes.

Key Iceberg features include:

  • Schema evolution: Safely add, rename, or reorder columns without rewriting data files
  • Hidden partitioning: Automatic partition management that improves performance without manual tuning
  • Snapshot isolation: Consistent reads and time travel across table versions
  • Concurrent writes: Optimistic concurrency control for multiple writers

These capabilities make Iceberg well-suited for streaming data lakes, where data arrives continuously from systems like Kafka and must remain queryable over time.
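The PySpark snippet below sketches what a few of these features look like in practice. It assumes a Spark session that already has an Iceberg catalog registered as demo and an existing demo.default.events table, so treat the identifiers and timestamp as placeholders.

```python
# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.default.events ADD COLUMN source STRING")

# Snapshot isolation and time travel: query the table as of an earlier point in time.
spark.sql("SELECT * FROM demo.default.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# Inspect snapshots to see how each commit (e.g. a streaming batch) is tracked.
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.default.events.snapshots").show()
```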

To dive deeper into how to load data into Apache Iceberg tables efficiently, explore this detailed guide: Loading Data into Apache Iceberg

Common Challenges of Kafka to Iceberg Integration

While Kafka and Iceberg complement each other well, integrating them introduces several production challenges:

Efficient and Reliable Ingestion

Streaming Kafka data into Iceberg requires balancing latency, correctness, and cost. Pipelines must handle batching, exactly-once semantics, retries, and schema changes without introducing duplicates or partial commits.

Small Files and Table Maintenance

Frequent Kafka writes can generate many small files, degrading Iceberg query performance. Production pipelines must manage batch sizing, file compaction, snapshot expiration, and metadata growth to keep tables efficient.

Schema Evolution and Compatibility

Kafka topics often evolve over time. Handling schema changes safely while maintaining Iceberg table consistency is difficult, especially when dealing with type changes, deletes, or incompatible message formats.

Security and Access Control

Iceberg abstracts object storage into tables, which requires table-level access controls rather than traditional file-based permissions. Integrating Iceberg with enterprise identity, authorization, and governance systems adds complexity.

These challenges are why teams typically choose between managed Kafka-to-Iceberg pipelines or self-managed Kafka Connect deployments, depending on their operational tolerance and control requirements.

Method 1: Stream Data from Kafka to Iceberg Using Estuary

Watch this quick video to learn how Apache Iceberg structures data lakes and how Estuary enables seamless Kafka-to-Iceberg integration.

This method uses two Estuary connectors:

  • The Apache Kafka capture connector, which continuously reads events from your Kafka topics into Estuary collections
  • The Apache Iceberg materialization connector, which writes those collections to Iceberg tables

It is the fastest way to get a production Kafka to Iceberg pipeline running without operating Kafka Connect, managing control topics, or writing custom ingestion code.

Prerequisites

Kafka prerequisites

  • A Kafka cluster with:
    • bootstrap.servers reachable from Estuary
    • An authentication mechanism (recommended for production)
    • TLS enabled (recommended for production; TLS is the only supported connection security mechanism for this connector)
  • Kafka messages in Avro or JSON
    • For Avro, you must configure a schema registry
    • For JSON, a schema registry is optional but recommended if you want better discovery of keys and schemas

If using a schema registry, you need (see the registration sketch after this list):

  • Schema registry endpoint
  • Username + password (for Confluent Cloud, these are the Schema Registry API key and secret)
  • Flat schemas only (schema references like import or $ref are not supported)
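
For reference, a flat Avro schema can be registered with a Confluent-compatible registry roughly as follows. The endpoint, credentials, and subject name are placeholders, and the schema deliberately defines every field inline with no references.

```python
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

# Hypothetical Confluent Cloud endpoint and Schema Registry API key/secret.
client = SchemaRegistryClient({
    "url": "https://psrc-example.us-east-2.aws.confluent.cloud",
    "basic.auth.user.info": "SR_API_KEY:SR_API_SECRET",
})

# A flat record schema: all fields defined inline, no imports or $ref-style references.
event_schema = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "type", "type": "string"},
    {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "payload", "type": "string"}
  ]
}
"""
client.register_schema("events-value", Schema(event_schema, schema_type="AVRO"))
```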

Authentication options supported by the Kafka connector:

  • SASL/SCRAM-SHA-256
  • SASL/SCRAM-SHA-512
  • SASL/PLAIN (common for Confluent Cloud)
  • AWS MSK IAM (for Amazon MSK clusters)

Iceberg prerequisites (AWS based)

The Apache Iceberg materialization connector requires:

  • An Iceberg catalog implementing the Iceberg REST Catalog API
  • An AWS EMR Serverless Application with Spark runtime (the connector submits Spark jobs and monitors them)
  • An S3 staging bucket (used to stage files to be merged into Iceberg tables)
  • IAM setup:
    • An EMR execution role (used by EMR jobs)
    • An IAM user or role for submitting EMR Serverless jobs
    • Catalog credentials depending on catalog auth type (AWS SigV4, AWS IAM, or OAuth2 client credentials)

Important operational note:

The connector does not automatically run Iceberg table maintenance (compaction, snapshot expiration, metadata cleanup). Plan to run maintenance separately.
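
As a rough sketch of what that maintenance can look like, the Iceberg Spark procedures below compact small files and expire old snapshots. Catalog, table, and retention values are placeholders, and the example assumes a Spark session with the Iceberg extensions enabled.

```python
# Compact small data files into larger ones (helps query performance after frequent streaming commits).
spark.sql("CALL demo.system.rewrite_data_files(table => 'default.events')")

# Expire snapshots older than a retention window to keep metadata and storage in check.
spark.sql(
    "CALL demo.system.expire_snapshots(table => 'default.events', "
    "older_than => TIMESTAMP '2024-01-01 00:00:00')"
)

# Remove files no longer referenced by any snapshot (e.g. left behind by failed writes).
spark.sql("CALL demo.system.remove_orphan_files(table => 'default.events')")
```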

Step 1: Configure Apache Kafka as the Source

  • Sign in to your Estuary account.
  • Click the Sources menu from the left navigation pane of the dashboard.
  • Click the + NEW CAPTURE button on the Sources page and search for Apache Kafka using the Search connectors field.
  • When you see the Kafka connector in the search results, click the connector’s Capture button.
  • You will be redirected to the Kafka connector configuration page; provide a unique name for your capture in the Name field within the Capture Details section.
  • Expand the Endpoint Config section and specify the following mandatory fields:
    • Bootstrap Servers: comma-separated brokers (example: broker1:9092,broker2:9092)
    • TLS: use system certificates or your chosen TLS setting
    • Credentials (choose one):
      • UserPassword (SASL): mechanism + username + password
      • AWS (MSK IAM): AWS access key + secret + region
  • Configure Schema Registry:
    • Choose Confluent Schema Registry if using Avro or JSON schema discovery
    • Choose No schema registry if you are capturing JSON without registry
  • Once you enter all the necessary information, click NEXT > SAVE AND PUBLISH.

This creates a live capture that continuously ingests Kafka topic events into Estuary collections.

Step 2: Create an Iceberg Materialization in Estuary

  • Go to Destinations and click + New Materialization.
  • Search for Apache Iceberg and select the connector.
  • Fill in the required Iceberg configuration:

Catalog settings

  • Base URL: your REST catalog URL
    • For AWS Glue Iceberg REST endpoint, use: https://glue.<region>.amazonaws.com/iceberg
  • Warehouse:
    • For AWS Glue: your AWS Account ID without hyphens (example 012345678901)
  • Namespace: the namespace used for bound collection tables
  • Base Location:
    • Required for AWS Glue
    • Must be an S3 path like: s3://your-table-bucket/your-prefix/

Catalog authentication

Choose one:

  • AWS SigV4
  • AWS IAM
  • OAuth 2.0 Client Credentials (common for non-Glue REST catalogs)

Compute settings (AWS EMR Serverless)
Provide:

  • Region (must match the staging bucket region)
  • EMR Application ID
  • Execution Role ARN
  • Staging Bucket
  • EMR credentials: either
    • AWS Access Key credentials, or
    • “Use Catalog Auth”, depending on how you’ve configured permissions

Optional but common:

  • Bucket Path: prefix within the staging bucket for staged data files
  • Systems Manager Prefix: used only for OAuth2 credentials storage (can be left blank for Glue)

Finally:

  • Add your source collections:
    • In Source Collections, bind the collections produced by your Kafka capture (auto-suggested or select manually)
    • Map each collection to a destination table name
  • Click Save and Publish.

How the Iceberg connector writes data:

  • It stages data to S3, then runs Spark merge jobs on EMR Serverless to apply new updates into the Iceberg tables as new data arrives.

Optional configuration notes (only when relevant)

Hard deletes vs soft deletes

  • By default, deletes are handled as soft deletes.
  • Enable Hard Delete only if you want delete events in source collections to delete rows in Iceberg.

Lowercase column names

If you plan to query Iceberg tables using AWS analytics services like Athena, enable Lowercase Column Names to avoid issues with uppercase field names.


If you prefer a fully self-managed approach, the next method shows how to stream Kafka data into Iceberg using Kafka Connect and the Apache Iceberg Sink Connector. This option gives you fine-grained control over connector configuration and deployment, but it also requires managing Kafka Connect workers, connector upgrades, and ongoing table maintenance.

Method 2: Kafka to Iceberg Replication Using Kafka Connect Apache Iceberg Sink Connector

Kafka Connect is a popular framework for moving data into and out of Kafka through various connectors. For Kafka to Iceberg integration, Kafka Connect offers a sink connector called the Apache Iceberg Sink Connector.

The Apache Iceberg Sink Connector guarantees that each record from Kafka is written to the Iceberg tables exactly once, even during failures or retries. Besides exactly-once delivery semantics, the connector also supports multi-table fan-out, which lets you route data from a single Kafka topic to multiple Iceberg tables.

Here are the steps to create a Kafka to Iceberg data pipeline using the Apache Iceberg sink connector:

Prerequisites:

  • Download and install Kafka 2.5 or higher.
  • Assume a source Kafka topic “events” already exists.
  • Configure the Iceberg catalog.
  • Create a Kafka topic to serve as the Iceberg Sink Connector control topic (a minimal creation sketch follows this list).
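
The control topic can be created with any Kafka admin tooling; here is a minimal Python sketch using kafka-python. The broker address is illustrative, and the topic name assumes the connector's default iceberg.control.topic value of control-iceberg, so adjust it if you override that property.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Broker address is illustrative; "control-iceberg" assumes the connector's default
# control topic name. Change both to match your cluster and connector configuration.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="control-iceberg", num_partitions=1, replication_factor=1)
])
admin.close()
```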

Step 1: Install the Apache Iceberg Connector in the Kafka Connect Instance

  1. Download the Apache Iceberg sink connector from Confluent Hub or the project’s GitHub repository. Alternatively, you can build the connector yourself by running the following Gradle command from the connector’s source directory:

```bash
./gradlew -x test -x integrationTest clean build
```

This will generate a ZIP archive containing the connector.

  2. Once the build is completed, the ZIP file will be located at:

```
./kafka-connect/kafka-connect-runtime/build/distributions/
```

Extract the ZIP archive. You will find two versions of the archive: one that includes the Hive Metastore client and related dependencies, and one without it. Select the version that suits your environment and copy it into the Kafka Connect plugin directory.

To do this, you can create a plugin directory using the following command and copy the required archive. 

```bash
mkdir -p CONFLUENT_HOME/share/kafka/plugins
```

Replace CONFLUENT_HOME with the actual path to your Confluent installation.

  3. Once the connector is copied to the plugin directory, add that directory to the plugin.path property in your Kafka Connect worker configuration (properties) file, for example:

```properties
plugin.path=/usr/local/share/kafka/plugins
```

Kafka Connect will then discover the plugin from this path. plugin.path is a comma-separated list of directories defined in the Kafka Connect worker configuration file.

  4. After updating the configuration, restart your Kafka Connect instance to load the Iceberg sink connector across all Kafka Connect nodes.

Step 2: Create a Destination Table

Use the Spark SQL interface to create the Iceberg tables that will receive incoming records from the Kafka topic “events.” The connector configuration in the next step routes events to default.events_list and default.events_create, so create both tables with the same schema (this assumes your Iceberg catalog is registered in Spark as demo):

```sql
CREATE TABLE demo.default.events_list (
    id STRING,
    type STRING,
    ts TIMESTAMP,
    payload STRING
)
USING iceberg
PARTITIONED BY (hours(ts));
```

Repeat the statement for demo.default.events_create, and modify the schema to meet your requirements.

Step 3: Configure the Iceberg Connector using Kafka Connect

Create the following connector configuration file to connect the Kafka topic to an Iceberg REST catalog:

```json
{
  "name": "events-sink",
  "config": {
    "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
    "tasks.max": "2",
    "topics": "events",
    "iceberg.tables": "default.events_list,default.events_create",
    "iceberg.tables.route-field": "type",
    "iceberg.table.default.events_list.route-regex": "list",
    "iceberg.table.default.events_create.route-regex": "create",
    "iceberg.catalog": "demo",
    "iceberg.catalog.type": "rest",
    "iceberg.catalog.uri": "https://localhost",
    "iceberg.catalog.credential": "<credential>",
    "iceberg.catalog.warehouse": "<warehouse name>"
  }
}
```

This configuration is tailored for a project using a REST-based Iceberg catalog with S3 as the storage layer. Adjust values like the catalog URI, credential, and warehouse name to match your environment, then save the file as events-sink.json.

Step 4: Launch the Connector

  1. After creating the configuration file, you can use the Kafka Connect REST API to launch the connector. To do this, use the curl command to send the configuration to your Kafka Connect cluster:

```bash
curl -X PUT http://localhost:8080/connectors/events-sink/config \
     -i -H "Content-Type: application/json" \
     -d @events-sink.json
```

  2. Then, you can verify that the connector is running by querying its status via the REST API:

```bash
curl -s http://localhost:8080/connectors/events-sink/status | jq
```

This command uses jq to format the JSON output for better readability. If everything is working correctly, the response should look like this:

```json
{
  "name": "events-sink",
  "connector": {
    "state": "RUNNING",
    "worker_id": "connect:8080"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "connect:8080"
    },
    ...
  ],
  "type": "sink"
}
```

Step 5: Query Data in Iceberg Using PySpark

To verify that data is flowing from Kafka to Iceberg, use PySpark, a Python API for Apache Spark, to query the data in the Iceberg table:

```python
df = spark.table("demo.default.events_list")
df.show(5)
```

This will display the first five records in your Iceberg catalog’s events_list table.
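
The query above assumes spark is a session with the Iceberg runtime on the classpath and the demo REST catalog registered. A minimal sketch of that setup is shown below; the package version, catalog URI, and warehouse value are placeholders for your environment.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-verification")
    # The Iceberg Spark runtime version is illustrative; match it to your Spark and Iceberg versions.
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register the "demo" catalog against the same REST catalog used by the sink connector.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "rest")
    .config("spark.sql.catalog.demo.uri", "https://localhost")
    .config("spark.sql.catalog.demo.warehouse", "<warehouse name>")
    .getOrCreate()
)
```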

By following these five steps, you will have:

  • Set up an Iceberg Kafka Sink Connector.
  • Deployed it via Kafka Connect.
  • Confirmed that data is successfully flowing into your Iceberg tables.

Limitations of Kafka Connect for Kafka to Iceberg Pipelines

  • Challenges in Schema Evolution
    While Apache Iceberg supports schema evolution, the Iceberg Sink Connector does not always handle complex schema changes reliably. Changes such as column type modifications can cause mapping failures or write errors, which are especially problematic in streaming pipelines where data is continuously ingested from Kafka into Iceberg.
  • Higher Operational Overhead
    Running Kafka Connect at scale introduces operational complexity. Teams must manage Kafka Connect clusters, connector configurations, upgrades, monitoring, and failure recovery. Additional effort is often required to handle schema compatibility, data transformations, and connector tuning.
  • Debezium Change Event Format Dependencies
    For CDC-based pipelines, the Iceberg Sink Connector typically expects Kafka events to follow the Debezium change event format. If the source system does not emit Debezium-compatible events, additional preprocessing or transformation steps may be required before data can be written to Iceberg (see the example envelope after this list).
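
For reference, a Debezium-style change event value roughly follows the envelope shape below, shown here as a Python literal with illustrative field values.

```python
# Simplified Debezium change-event envelope for an UPDATE on a hypothetical "users" table.
debezium_event = {
    "before": {"id": 42, "email": "old@example.com"},   # row state before the change (None for inserts)
    "after":  {"id": 42, "email": "new@example.com"},   # row state after the change (None for deletes)
    "source": {"connector": "postgresql", "db": "appdb", "table": "users"},
    "op": "u",          # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1700000000000,
}
```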

Best Practices for Kafka to Iceberg Integration

Whether you use a managed streaming pipeline or Kafka Connect, the following best practices help ensure reliable, scalable, and cost-effective Kafka to Iceberg integrations.

  • Prepare a Data Migration Plan: You must develop a comprehensive data migration plan by clearly identifying the data to be migrated, mapping out any dependencies, and assessing the resources required. This practice ensures efficient migration with minimal downtime.
  • Batch Data Writes: Instead of writing each Kafka record individually, group messages into larger batches to reduce the number of small files in Iceberg. Small files can lead to slow read performance and high storage costs (see the sketch after this list).
  • Perform Data Transformations: You must standardize, clean, or enrich data before or after writing from Kafka to Iceberg, ensuring that downstream users get high-quality data.
  • Manage Schema Evolution: If you manage ingestion yourself, use a schema format such as Avro together with a schema registry like Confluent Schema Registry. These tools help you manage schema updates smoothly between Kafka and Iceberg.
  • Benchmark Performance: Test the Kafka to Iceberg integration setup under different workloads. This will help evaluate throughput and latency, resource utilization, and failure recovery mechanisms.
  • Use Compatible Versions: Always use compatible and stable versions of Kafka, Iceberg, and other middleware. This involves regular updates to the latest releases for improved performance, bug fixes, and new features.
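
To make the batching guidance concrete, here is a hedged sketch, independent of the two methods above, that uses Spark Structured Streaming to append Kafka events to an Iceberg table. The trigger interval controls how often data is committed, so longer intervals produce fewer, larger files. Broker, topic, checkpoint path, and table names are placeholders, and the session is assumed to be configured as in the earlier verification sketch, with the spark-sql-kafka package available.

```python
# Read raw events from Kafka and append them to an Iceberg table in batched micro-batches.
stream = (
    spark.readStream.format("kafka")  # requires the spark-sql-kafka-0-10 package on the classpath
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")
)

query = (
    stream.writeStream.format("iceberg")
    .outputMode("append")
    # A longer trigger interval batches more records per commit, producing fewer, larger files.
    .trigger(processingTime="5 minutes")
    .option("checkpointLocation", "s3://your-bucket/checkpoints/events")
    .toTable("demo.default.events")
)
```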

Conclusion

Streaming data from Apache Kafka into Apache Iceberg enables teams to combine real-time ingestion with scalable, reliable analytics. Kafka excels at capturing high-volume event streams, while Iceberg provides a transactional table format optimized for long-term analytical workloads.

There are two primary ways to build Kafka-to-Iceberg pipelines. Kafka Connect with the Iceberg Sink Connector offers deep configuration control for teams that already operate Kafka Connect at scale, but it also introduces operational complexity around schema evolution, maintenance, and failure recovery. Managed streaming pipelines simplify ingestion by handling batching, schema compatibility, and Iceberg commits automatically, making them a strong fit for production analytics use cases where reliability and simplicity matter.

Choosing the right approach depends on factors such as latency requirements, schema change frequency, infrastructure ownership, and operational tolerance. By understanding the trade-offs and following proven best practices, teams can build Kafka-to-Iceberg pipelines that remain performant, maintainable, and scalable as data volumes grow.

To explore how Estuary automates data integration and boosts business productivity, connect with the Estuary experts.



FAQs

    What is the best way to stream data from Kafka to Iceberg?

    The best approach depends on your operational requirements. Managed Kafka-to-Iceberg pipelines such as Estuary are well suited for teams that need low latency, automatic schema handling, and minimal maintenance. Kafka Connect with the Iceberg Sink Connector is a better fit for teams that require full control over infrastructure and already operate Kafka Connect at scale.

    Which engines can query Iceberg tables populated from Kafka?

    Iceberg tables populated from Kafka can be queried using engines such as Apache Spark, Trino, Flink, Presto, and cloud analytics services that support Iceberg, provided the catalog and permissions are configured correctly.

    Is Kafka Connect reliable for Kafka-to-Iceberg pipelines?

    Kafka Connect can be reliable when properly configured, but it requires careful management of schema evolution, batching, retries, and table maintenance. Operational overhead increases as pipelines scale, especially in environments with frequent schema changes or high throughput.


About the author

Dani Pálma, Head of Data & Marketing

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
