
DynamoDB to Apache Iceberg: Streaming Change Data into a Lakehouse

Learn the common ways to stream DynamoDB data into Apache Iceberg, including AWS-native pipelines and a managed CDC approach with Estuary. Covers architecture, tradeoffs, and design considerations.


Streaming data from Amazon DynamoDB into Apache Iceberg is a common requirement for teams that want to analyze application data at scale without sacrificing freshness. DynamoDB is optimized for low-latency transactional workloads, while Iceberg is designed for analytical queries, historical analysis, and lakehouse architectures. Bridging these systems requires capturing DynamoDB change events and applying them correctly to Iceberg tables on object storage.

This guide explains the common DynamoDB to Iceberg architectures, their tradeoffs, and how Estuary simplifies real-time streaming into Iceberg without requiring teams to operate complex streaming infrastructure.

Key takeaways

  • DynamoDB is not designed for analytical queries, joins, or historical analysis.

  • Apache Iceberg provides ACID transactions, schema evolution, and time travel on low-cost object storage.

  • DynamoDB Streams is the authoritative source for change data capture from DynamoDB.

  • AWS-native pipelines can stream DynamoDB data into Iceberg but introduce operational and correctness challenges.

  • Estuary provides a right-time data platform that streams DynamoDB changes into Iceberg with built-in CDC handling and minimal operational overhead.

The business problem: why DynamoDB data is hard to analyze

Amazon DynamoDB is purpose-built for operational workloads. It excels at high-throughput key-value access, predictable latency, and horizontal scalability. These strengths come with tradeoffs that become apparent as data volumes and analytical needs grow.

Limited analytical query capabilities

DynamoDB does not support complex analytical queries. Even with PartiQL, querying remains constrained to access patterns defined by primary keys and indexes. Joins, aggregations across large datasets, and historical trend analysis are not practical.

No native historical or time-based analysis

DynamoDB is optimized for current state access, not historical exploration. While point-in-time recovery exists for backup and restore, it is not designed for querying data over time or analyzing change history.

Analytics workflows require external systems

To analyze DynamoDB data, teams typically export data into systems like Amazon S3, data warehouses, or lakehouse platforms. This immediately introduces data movement, transformation logic, and operational complexity.

Why batch exports fall short

A common workaround is to export DynamoDB data to S3 using periodic jobs or managed services such as AWS Glue. While simple to reason about, batch pipelines introduce several limitations.

High latency

Batch exports run on schedules. Whether hourly or nightly, insights lag behind production activity. This impacts real-time dashboards, operational monitoring, and machine learning pipelines that depend on fresh data.

Fragile and hard to evolve

Batch pipelines often rely on scripts or Glue jobs that must be updated when schemas change. DynamoDB is schemaless by nature, but analytical systems are not. Over time, these pipelines become brittle and costly to maintain.

Inefficient for large-scale change data

Re-exporting large tables repeatedly is inefficient when only a small subset of records change. Batch jobs waste compute and storage while still failing to deliver low-latency insights.

For teams that need timely analytics, streaming change data capture is a more appropriate model.

Common ways to stream DynamoDB changes into Apache Iceberg

There are several established, AWS-native approaches to streaming DynamoDB data into Iceberg tables on object storage. Each approach makes different tradeoffs between simplicity, correctness, and operational effort.

Option 1: DynamoDB Streams to Firehose to Iceberg

This is the simplest managed option.

How it works

  • DynamoDB Streams is enabled on a table, typically using NEW_AND_OLD_IMAGES to capture full change events.
  • Stream records are forwarded into Amazon Data Firehose, often via a Lambda or Kinesis intermediary.
  • Firehose writes records into Apache Iceberg tables on Amazon S3 using a supported catalog.

When this works well

  • Minimal operational overhead
  • Mostly append-oriented analytics
  • Low transformation requirements

Tradeoffs

DynamoDB Streams emits CDC events, not row-level table updates. Mapping inserts, updates, and deletes into correct Iceberg row semantics requires careful design. Frequent updates and deletes often require downstream merge or compaction processes. Schema evolution and deduplication logic must be handled outside of Firehose.
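
For illustration, the sketch below shows the kind of flattening a Lambda between DynamoDB Streams and Firehose typically has to perform. It is a minimal example under assumed names, not a production implementation: the delivery stream name and the _cdc_op and _cdc_sequence fields are placeholders.

```python
import json

import boto3
from boto3.dynamodb.types import TypeDeserializer

firehose = boto3.client("firehose")
deserializer = TypeDeserializer()

# Hypothetical Firehose delivery stream configured with an Iceberg destination.
DELIVERY_STREAM = "iceberg-delivery-stream"


def handler(event, context):
    """Flatten DynamoDB Streams records into newline-delimited CDC rows."""
    records = []
    for record in event["Records"]:
        change = record["dynamodb"]
        # REMOVE events carry no NewImage, so fall back to the old image.
        image = change.get("NewImage") or change.get("OldImage", {})
        row = {k: deserializer.deserialize(v) for k, v in image.items()}
        row["_cdc_op"] = record["eventName"]             # INSERT, MODIFY, or REMOVE
        row["_cdc_sequence"] = change["SequenceNumber"]  # preserves per-key ordering
        records.append({"Data": (json.dumps(row, default=str) + "\n").encode("utf-8")})

    # PutRecordBatch accepts at most 500 records per call; larger Lambda
    # batches would need to be chunked.
    if records:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
```

Even in this simplest path, the function must decide how deletes are represented downstream, which is exactly the kind of merge logic Firehose does not handle for you.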

Option 2: DynamoDB Streams to Apache Flink to Iceberg

This approach offers the most control and correctness.

How it works

  • DynamoDB Streams provides ordered change events per partition.
  • Apache Flink reads the stream using a DynamoDB Streams connector, commonly via Amazon Managed Service for Apache Flink.
  • Flink applies CDC semantics, transformations, enrichment, and writes into Iceberg using the Iceberg sink connector.

When this works well

  • True upserts and deletes are required
  • Data must be enriched, re-keyed, or transformed
  • Strong control over watermarking and state

Tradeoffs

Operating Flink introduces significant complexity. Teams must manage state backends, checkpoints, failure recovery, and small-file mitigation. Iceberg table maintenance, including compaction and optimization, must be planned and operated continuously.
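
To make the Flink path more concrete, here is a minimal PyFlink sketch under assumed catalog, warehouse, and table names. The DynamoDB Streams source (dynamodb_changes below) is assumed to be registered separately via the connector, and the catalog and table properties shown are illustrative rather than prescriptive.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register an Iceberg catalog backed by AWS Glue; paths and names are assumptions.
t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
        'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
        'warehouse' = 's3://my-lakehouse/warehouse'
    )
""")

# An Iceberg v2 table with a primary key so the sink can apply upserts and deletes.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.orders (
        order_id STRING,
        status STRING,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'format-version' = '2',
        'write.upsert.enabled' = 'true'
    )
""")

# dynamodb_changes stands in for the changelog table produced by the
# DynamoDB Streams connector; inserts, updates, and deletes flow through
# as row-level changes on the Iceberg table.
t_env.execute_sql("""
    INSERT INTO lake.analytics.orders
    SELECT order_id, status, updated_at FROM dynamodb_changes
""")
```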

Option 3: DynamoDB Streams to AWS Glue streaming jobs to Iceberg

This option is familiar to Spark-centric teams.

How it works

  • DynamoDB Streams data is routed through a streaming backbone such as Kinesis.
  • AWS Glue streaming jobs read the stream and apply transformations using Spark.
  • Data is written to Iceberg tables using the Glue Data Catalog.

When this works well

  • Existing Spark and Glue expertise
  • Hybrid batch and streaming transformations
  • Integration with existing Glue-based data lakes

Tradeoffs

Although Glue abstracts some infrastructure, CDC correctness, merge semantics, and Iceberg table maintenance remain the responsibility of the data team. Operational complexity is higher than with Firehose and similar to that of Flink-based approaches.
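
As a sketch of what that responsibility looks like in practice, a Glue streaming job might apply each micro-batch of change events with an Iceberg MERGE. The table names, the _cdc_op column, and the checkpoint location below are assumptions, and in Glue the Iceberg Spark configuration is often supplied through job parameters rather than in code.

```python
from pyspark.sql import SparkSession

# Iceberg-enabled Spark session; in Glue these settings are typically passed
# as job parameters (for example --datalake-formats iceberg) instead.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-lakehouse/warehouse")
    .getOrCreate()
)


def upsert_batch(batch_df, batch_id):
    """Apply one micro-batch of CDC events to the Iceberg table."""
    batch_df.createOrReplaceTempView("changes")
    spark.sql("""
        MERGE INTO glue_catalog.analytics.orders AS t
        USING changes AS s
        ON t.order_id = s.order_id
        WHEN MATCHED AND s._cdc_op = 'REMOVE' THEN DELETE
        WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
        WHEN NOT MATCHED AND s._cdc_op != 'REMOVE' THEN
            INSERT (order_id, status, updated_at)
            VALUES (s.order_id, s.status, s.updated_at)
    """)


# stream_df stands in for the DataFrame read from the Kinesis source; the
# checkpoint location is required for reliable micro-batch processing.
# stream_df.writeStream \
#     .foreachBatch(upsert_batch) \
#     .option("checkpointLocation", "s3://my-lakehouse/checkpoints/orders") \
#     .start()
```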

The shared challenges across all AWS-native approaches

While the architectures above differ, they share a set of underlying challenges that teams consistently encounter.

CDC correctness is not automatic

DynamoDB Streams produces change events, not relational updates. Teams must define how primary keys map to Iceberg rows, how updates overwrite prior values, and how deletes are represented and applied.

Schema evolution requires discipline

DynamoDB allows attributes to appear and disappear freely. Analytical systems require stable schemas. Without careful handling, schema drift can break downstream queries or force repeated manual intervention.
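
Iceberg's own schema evolution makes the additive case manageable: when a new DynamoDB attribute appears, the table can be widened without rewriting existing data. A minimal sketch, assuming the Iceberg-enabled Spark session and hypothetical table from the earlier Glue example, plus a hypothetical new attribute:

```python
# Add the new attribute as a nullable column; existing data files are untouched
# and older rows simply read the column as NULL.
spark.sql("""
    ALTER TABLE glue_catalog.analytics.orders
    ADD COLUMNS (loyalty_tier STRING)
""")
```

The hard part is not the DDL but deciding, consistently and automatically, when and how to apply it as attributes drift.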

Backfills and reprocessing are complex

Replaying historical data or recovering from pipeline failures often requires custom logic. This is particularly difficult in streaming systems where state and ordering matter.

Operational burden grows over time

Flink and Spark pipelines require monitoring, tuning, upgrades, and cost management. Even fully managed services still require an operating model.

For many teams, the challenge is not whether DynamoDB data can reach Iceberg, but how much infrastructure they want to own to make it reliable.

A simpler approach: DynamoDB to Iceberg with Estuary

Estuary addresses these challenges by providing a right-time data platform that natively handles change data capture and delivery into analytical systems like Apache Iceberg.

Right-time data movement means teams can choose when data moves, whether sub-second, near real-time, or batch, without rebuilding pipelines.

At a high level, Estuary:

  • Reads change events directly from DynamoDB Streams
  • Applies CDC semantics consistently for inserts, updates, and deletes
  • Enforces and evolves schemas for analytical use
  • Writes data into Apache Iceberg tables on object storage
  • Operates as a managed service with predictable reliability and cost

Instead of assembling Firehose, Flink, or Spark pipelines, Estuary collapses these responsibilities into a single managed system designed specifically for streaming operational data into analytics-ready formats.

When Estuary is the right choice

AWS-native pipelines for DynamoDB to Iceberg are viable, but they assume teams are willing to design, operate, and continuously maintain streaming infrastructure. Estuary is designed for teams that want correct, real-time data movement without owning that operational complexity.

Estuary is a strong fit when:

  • Correct CDC semantics matter
    Inserts, updates, and deletes from DynamoDB need to be applied deterministically to Iceberg tables without custom merge logic.
  • Low-latency analytics are required
    Dashboards, monitoring, or downstream systems depend on near real-time visibility into application data.
  • Operational simplicity is a priority
    Teams want to avoid running Flink clusters, Spark streaming jobs, or custom retry and recovery logic.
  • Schema evolution is expected
    DynamoDB attributes change over time, and the analytics layer must adapt without breaking queries.
  • Predictable cost and reliability are important
    Streaming infrastructure sprawl often leads to hidden costs and operational risk.

In these cases, Estuary functions as a purpose-built CDC-to-lakehouse layer rather than a general-purpose streaming framework.

How to Set Up a DynamoDB to Iceberg Pipeline with Estuary

With Estuary, you can move from raw DynamoDB change data to structured Iceberg tables in minutes. Let's walk through the steps to build a DynamoDB to Iceberg pipeline.

Prerequisites

  • An Estuary account
  • A DynamoDB table with DynamoDB Streams enabled
  • AWS credentials with access to DynamoDB, plus the S3 bucket and Iceberg catalog (AWS Glue or REST) you plan to write to

Step 1: Configure Amazon DynamoDB as the Source

Selecting DynamoDB as a source
  1. Log into your Estuary account.
  2. In the left sidebar, click Sources, then click + NEW CAPTURE.
  3. In the Search connectors field, search for DynamoDB.
  4. Click the Capture button on the Amazon DynamoDB connector.
  5. On the configuration page, fill in:
    • Name: A unique identifier for your capture
    • AWS Access Key ID / Secret Access Key: Credentials with access to DynamoDB
    • Region: AWS region of your DynamoDB table
  6. Click NEXT > SAVE AND PUBLISH to activate the capture.

Once configured, Estuary will begin reading real-time change events (inserts, updates, deletes) from your table and write them to a collection.

Step 2: Configure Apache Iceberg as the Destination

Selecting an Apache Iceberg materialization connector
  1. After your capture is active, click MATERIALIZE COLLECTIONS in the pop-up, or navigate to Destinations > + NEW MATERIALIZATION.
  2. In the Search connectors field, type Iceberg.
  3. Select the appropriate materialization:
    • Amazon S3 Iceberg (for delta update pipelines using S3 + AWS Glue)
    • Apache Iceberg (for full updates using a REST catalog, S3, and EMR)

Configuration Fields

Amazon S3 Iceberg (Delta Updates):

  • Name: Unique materialization name
  • AWS Access Key ID / Secret Access Key: Must have permissions for S3 and Glue
  • Bucket: Your S3 bucket for data storage
  • Region: AWS region for the S3 bucket and Glue catalog
  • Namespace: Logical grouping of your Iceberg tables (e.g., prod/analytics)
  • Catalog:
    • Glue: If using AWS Glue as the catalog
    • REST: Provide REST URI, warehouse path, and credentials if using a custom catalog

Apache Iceberg (Standard Updates with EMR Serverless):

  • URL: REST catalog base URI
  • Warehouse: Iceberg warehouse path
  • Namespace: Logical table grouping
  • Authentication: OAuth or AWS SigV4, depending on catalog type
  • Compute Settings:
    • Application ID: EMR Serverless application ID
    • Execution Role ARN: IAM role for job execution
    • Bucket / Region: S3 bucket and AWS region for EMR
    • AWS Access Key ID / Secret Access Key: Credentials to access EMR and S3
  4. In the Source Collections section, click SOURCE FROM CAPTURE to bind the collection created by your DynamoDB capture.
  5. Click NEXT > SAVE AND PUBLISH to finalize your materialization.

What You Get

Once active, Estuary continuously syncs change events from your DynamoDB table into Iceberg-backed tables — enabling real-time analytics, queryability via SQL engines, and durable, governed data storage.
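
Once the tables are live, any engine that understands Iceberg can read them. As one example, here is a minimal sketch using the pyiceberg client against an AWS Glue catalog; the namespace and table name are assumptions.

```python
from pyiceberg.catalog import load_catalog

# Load the Glue catalog (credentials come from the standard AWS chain);
# "analytics.orders" is a hypothetical namespace and table.
catalog = load_catalog("default", **{"type": "glue"})
table = catalog.load_table("analytics.orders")

# Scan a small slice into pandas for quick exploration; heavier analytics
# would typically go through Spark, Trino, or another SQL engine.
df = table.scan(limit=100).to_pandas()
print(df.head())
```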

You can also:

  • Backfill historical data (if needed)
  • Apply schema evolution rules
  • Monitor the pipeline via metrics and alerts

How Estuary compares to AWS-native approaches

| Approach                     | Latency | CDC correctness | Operational effort | Schema handling |
|------------------------------|---------|-----------------|--------------------|-----------------|
| DynamoDB export + batch jobs | Hours   | Limited         | Medium             | Manual          |
| Firehose to Iceberg          | Minutes | Partial         | Low                | Manual          |
| Flink to Iceberg             | Seconds | High            | Very high          | Manual          |
| Glue streaming to Iceberg    | Minutes | High            | High               | Manual          |
| Estuary to Iceberg           | Seconds | High            | Low                | Automatic       |

This comparison highlights the core distinction: most pipelines focus on data movement, while Estuary focuses on change data correctness and lifecycle management.

Common use cases

Streaming DynamoDB into Iceberg enables a range of analytical and operational workloads.

Real-time operational analytics

Application events, transactions, and state changes can be queried in near real time using SQL engines without impacting production workloads.

Lakehouse integration

DynamoDB data can be joined with relational, event, and batch data in a unified Iceberg-based lakehouse architecture.

Machine learning pipelines

Fresh, versioned data supports feature generation, model training, and reproducibility using Iceberg’s snapshot and time-travel capabilities.
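
For example, a training run can pin itself to a fixed point in table history so that features are reproducible. A short Spark SQL sketch, reusing the hypothetical table from the earlier examples; the timestamp is illustrative:

```python
# Read the table as of a fixed timestamp so repeated training runs see
# identical data, regardless of changes that arrived afterward.
features_df = spark.sql("""
    SELECT order_id, status, updated_at
    FROM glue_catalog.analytics.orders
    TIMESTAMP AS OF '2025-01-01 00:00:00'
""")
```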

Long-term retention and compliance

Historical records can be stored cost-effectively on object storage with full auditability and schema governance.

Conclusion

Streaming DynamoDB data into Apache Iceberg is essential for teams that want scalable analytics, historical visibility, and lakehouse interoperability. While AWS-native approaches can achieve this, they often require significant infrastructure and operational investment.

Estuary provides a purpose-built alternative that delivers correct, right-time change data into Iceberg without requiring teams to assemble and operate complex streaming systems. By abstracting CDC handling, schema evolution, and delivery mechanics, Estuary allows teams to focus on analytics rather than infrastructure.

Ready to Get Started?

Sign up for Estuary to build your DynamoDB to Iceberg pipeline today — and unlock real-time analytics at scale.

FAQs

    Can DynamoDB data be streamed into Apache Iceberg?

    Yes. DynamoDB Streams provides change data capture, which can be applied to Iceberg tables using streaming pipelines or managed platforms such as Estuary.

    Do I need Spark or Flink to stream DynamoDB data into Iceberg?

    Not necessarily. While Spark and Flink are common choices, managed platforms can handle CDC ingestion and Iceberg writes without requiring teams to operate streaming frameworks.

    How are DynamoDB updates and deletes reflected in Iceberg?

    Updates overwrite existing rows based on logical keys, and deletes are applied using CDC semantics supported by Iceberg.

    How fresh is the data in Iceberg?

    Latency depends on the pipeline design. Streaming approaches can deliver data within seconds, while batch exports introduce significantly higher delays.


About the author

Jeffrey Richman

With over 15 years in data engineering, Jeffrey is a seasoned expert in driving growth for early-stage data companies, focusing on strategies that attract customers and users. His writing provides insights that help companies scale efficiently and effectively in an evolving data landscape.
