
How to Stream MongoDB Data to Apache Iceberg for Analytics and AI

Learn how to stream MongoDB data to Apache Iceberg in real time using Estuary. Capture MongoDB change streams, ensure exactly-once delivery, and keep your Iceberg tables fresh for analytics, AI, and BI without complex ETL jobs.


Moving data from MongoDB to Apache Iceberg is a common requirement for teams that want real-time analytics, AI workloads, and BI dashboards on operational data. The challenge is not just copying data, but keeping it continuously in sync as MongoDB changes, while preserving correctness, schema evolution, and transactional guarantees.

A MongoDB CDC to Iceberg pipeline captures inserts, updates, and deletes from MongoDB and materializes them into Iceberg tables with ACID transactions, schema evolution, partition evolution, and time travel. This allows analytics and machine learning systems to query the most current data directly from cloud storage, without relying on fragile batch jobs or manual exports.

Teams can implement this pipeline in different ways, ranging from custom-built CDC architectures to fully managed streaming solutions. The sections below explain why streaming MongoDB data into Iceberg matters, how the pipeline works, and what approaches teams typically use.

Why Stream MongoDB Data to Apache Iceberg

MongoDB is often the system of record for high-velocity data such as application events, IoT readings, transactions, and user interactions. This data changes continuously, and analytics or AI workloads lose value when they operate on stale or delayed information. Apache Iceberg provides an ideal destination for this data by combining the scalability of a data lake with the reliability and transactional guarantees of a database.

With a MongoDB CDC to Iceberg integration, every insert, update, and delete can be captured as it happens and written into Iceberg tables that support ACID transactions, schema evolution, and time travel. This ensures that queries, dashboards, and machine learning pipelines always operate on the most current and consistent view of the data.

Streaming MongoDB into Iceberg also removes the need for manual exports or scheduled batch ETL jobs that require constant upkeep. Instead of coordinating snapshots and reconciling partial loads, a continuous pipeline delivers incremental updates, keeps schemas in sync, and maintains a live lakehouse that is always ready for analytics and AI workloads.

Method 1: Streaming MongoDB Data to Apache Iceberg with Estuary

One common and increasingly preferred approach is to use a managed CDC pipeline that continuously streams MongoDB changes directly into Apache Iceberg. In this model, change data capture handles both the initial historical load and ongoing inserts, updates, and deletes, while the destination layer ensures transactional writes and schema evolution in Iceberg tables.

With Estuary, this pipeline is fully managed end-to-end. Estuary connects to MongoDB using change streams, performs an initial backfill to capture existing data, and then continuously streams changes with exactly-once guarantees. On the destination side, data is materialized into Apache Iceberg using a transactional write path that preserves ACID semantics, schema evolution, and time travel.

This approach removes the need to build and operate custom CDC consumers, orchestration logic, or merge jobs, while still delivering real-time freshness and reliability.

Other Ways Teams Move MongoDB Data to Apache Iceberg

Before managed streaming pipelines became common, most teams relied on custom or manual approaches to move MongoDB data into Apache Iceberg. These methods are still used today, but they come with meaningful trade-offs.

One common approach is to build a custom CDC pipeline using MongoDB Change Streams. In this model, engineers write and operate their own consumers to read change events, manage offsets, handle retries, and apply updates to Iceberg tables using processing frameworks such as Spark, Flink, or cloud-native jobs like AWS Glue. While flexible, this approach requires significant engineering effort to implement exactly-once guarantees, manage schema evolution, and safely handle backfills and failures.
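For a sense of what this involves, here is a minimal sketch of such a consumer built on pymongo's change stream API. The connection string, database, and checkpointing are illustrative placeholders, and a real pipeline would still need batching, schema handling, and an Iceberg writer on top.

```python
# Minimal sketch of a hand-rolled MongoDB CDC consumer (illustrative only).
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.net")  # placeholder URI
orders = client["appdb"]["orders"]  # hypothetical database and collection

resume_token = None  # in practice, load the last persisted token from durable storage

with orders.watch(full_document="updateLookup", resume_after=resume_token) as stream:
    for change in stream:
        # operationType is "insert", "update", "delete", "replace", etc.
        # Each event must be translated into an upsert or delete against
        # Iceberg, typically by buffering events and running periodic merge jobs.
        print(change["operationType"], change["documentKey"])
        resume_token = change["_id"]  # persist this token to survive restarts
```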

Another widely used method is periodic batch exports. MongoDB collections are exported to cloud storage on a schedule and then loaded into Iceberg tables using batch jobs. This approach is easier to reason about but results in stale data, higher compute costs, and limited support for incremental updates or deletes.
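As a rough sketch of the load step, assuming newline-delimited JSON dumps from mongoexport landing in S3 and a Spark session already configured with an Iceberg catalog (paths and table names below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-batch-load").getOrCreate()

# Read last night's mongoexport dump (newline-delimited JSON).
df = spark.read.json("s3://my-bucket/exports/orders/dt=2024-01-01/")  # placeholder path

# Full replace: easy to reason about, but the table is stale until the next
# run, and updates or deletes that happen between runs are invisible.
df.writeTo("glue_catalog.analytics.orders").createOrReplace()
```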

Some organizations attempt a manual hybrid architecture that combines batch snapshots with custom streaming updates. While this approach can reduce staleness compared to pure batch pipelines, it typically requires additional logic for deduplication, ordering, and reconciliation when writing into Iceberg tables.
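The reconciliation step usually boils down to merging deduplicated change events into the target table. A hedged sketch in Spark SQL against Iceberg, where the staging view, table, and column names are all illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-merge").getOrCreate()

# Assumes `updates_batch` is a registered staging view holding the latest
# deduplicated event per key, with an `op` column recording the change type.
spark.sql("""
    MERGE INTO glue_catalog.analytics.orders AS t
    USING updates_batch AS s
    ON t._id = s._id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op <> 'delete' THEN INSERT *
""")
```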

These approaches can work, but they demand ongoing maintenance, custom logic, and deep expertise in both CDC systems and Iceberg internals, particularly when snapshot ingestion and streaming updates are managed as separate systems. This is why many teams choose a fully managed solution that provides real-time change capture, schema governance, and exactly-once delivery without the operational overhead.

Prerequisites Checklist

Before you set up your pipeline, ensure you have:

  • A MongoDB cluster with change streams enabled (Atlas, self-hosted, or compatible services like Amazon DocumentDB and Azure Cosmos DB for MongoDB API; note that some versions may have limited change stream support). A quick way to verify change stream support is sketched after this list.
  • An Apache Iceberg REST catalog (e.g., AWS Glue, Snowflake Open Catalog).
  • Access to Estuary (create a free account if you don’t have one).
  • Network and IAM permissions for Estuary to connect to both MongoDB and your Iceberg environment.
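Change streams require a replica set or sharded cluster, so a quick sanity check with pymongo might look like the following; the URI is a placeholder, and the hello command needs MongoDB 5.0 or later (older servers use isMaster):

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.net")  # placeholder URI
hello = client.admin.command("hello")  # use "isMaster" on servers older than 5.0

# Replica set members report a setName; a mongos router reports "isdbgrid".
if "setName" in hello or hello.get("msg") == "isdbgrid":
    print("Change streams should be available.")
else:
    print("Standalone server: convert to a replica set or use a batch capture mode.")
```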

Architecture Overview

A MongoDB to Apache Iceberg streaming pipeline built with Estuary has three main layers:

  1. Source layer: The pipeline begins with MongoDB change streams, which capture every data change in real time. When you first connect, Estuary automatically performs a backfill to take a complete snapshot of your existing collections as part of a single CDC pipeline, before switching seamlessly to streaming mode. This ensures there are no gaps between historical and live data.
  2. Transport and governance layer: Captured events flow into Estuary collections. These collections validate data against JSON schemas, enforce exactly-once delivery, and maintain data lineage for governance and auditability. This layer also manages schema evolution so that changes in MongoDB are propagated safely to downstream systems.
  3. Destination layer: On the Iceberg side, Estuary writes data through a REST catalog such as AWS Glue or S3 Tables. It stages data in an S3 bucket and uses EMR Serverless compute to merge new events into Iceberg tables with full ACID compliance. This transactional approach ensures that queries in tools like Athena or Spark always return consistent and up-to-date results.

This architecture is fault-tolerant, recoverable, and scalable, making it suitable for enterprise-grade MongoDB to Iceberg streaming workloads.
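To make the destination layer concrete, here is what querying the resulting tables can look like from PySpark; the catalog and table names are illustrative, and the time-travel syntax assumes Spark 3.3 or later:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-queries").getOrCreate()

# Current state of the table, reflecting the latest committed merge.
spark.sql("""
    SELECT status, count(*) AS n
    FROM glue_catalog.analytics.orders
    GROUP BY status
""").show()

# Iceberg time travel: read the table as of an earlier point in time.
spark.sql("""
    SELECT count(*) AS n
    FROM glue_catalog.analytics.orders TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()
```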

Step-by-Step: Setting Up MongoDB to Apache Iceberg in Estuary

Watch it in Action:

See how to set up a real-time MongoDB to Apache Iceberg pipeline in Estuary.

Step 1: Create the MongoDB Capture in Estuary

  1. Sign Up or Log In to Estuary
    If you do not have an Estuary account, visit dashboard.estuary.dev/register and sign up. Once logged in, open the Estuary web application.
  2. Select the MongoDB Capture Connector
    In the Estuary UI, choose the MongoDB capture connector. Estuary supports MongoDB Atlas, self-hosted MongoDB replica sets, sharded clusters, Amazon DocumentDB, and Azure Cosmos DB for MongoDB API.
  3. Enter Connection Details
    • Address / Connection String: Paste the MongoDB URI. For Atlas, use the SRV format (mongodb+srv://); for self-hosted deployments, include the authSource parameter when needed. Example URIs are sketched after these steps.
    • Authentication: Provide a MongoDB username and password with read privileges.
    • Database Discovery: Select the databases and collections you want Estuary to discover.
  4. Choose a Capture Mode Per Collection (Optional)
    • Change Stream Incremental: For low-latency, real-time CDC.
    • Batch Snapshot: For one-time backfills without ongoing CDC.
    • Batch Incremental: For workloads that want periodic updates using a strictly increasing indexed field (e.g., a timestamp or ObjectId).
    If left unset, collections default to incremental CDC as long as the MongoDB instance supports change streams.
  5. Configure Polling (If Needed)
    For collections using batch modes, set a polling schedule in cron or interval format to control how often Estuary fetches new data.
  6. Save and Publish the Capture
    Publishing starts the initial backfill, followed by continuous change stream capture (if enabled).
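For reference, the two URI shapes mentioned in step 3 look roughly like this; hosts, credentials, and the replica set name are placeholders:

```python
# Atlas: SRV lookup resolves the replica set members automatically.
atlas_uri = "mongodb+srv://cdc_user:<password>@cluster0.ab1cd.mongodb.net/"

# Self-hosted replica set; authSource names the database that stores the user.
self_hosted_uri = (
    "mongodb://cdc_user:<password>@mongo1.internal:27017,mongo2.internal:27017"
    "/?replicaSet=rs0&authSource=admin"
)
```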

Step 2: Create the Iceberg Materialization

  1. In the Estuary UI, go to Materializations and select the Apache Iceberg connector. A delta-updates variant of the connector is also available.
  2. Enter the endpoint details: Base URL, Warehouse, Namespace, and Base Location if required.
  3. Configure catalog authentication:
    • AWS SigV4 for Glue or S3 Tables.
    • OAuth client credentials for other REST catalogs.
  4. Set up compute: EMR Serverless Application ID, AWS access keys, AWS Region, Execution Role ARN, and S3 staging bucket.
  5. Optionally choose a sync schedule for your materialization. To sync new data to Iceberg as soon as it is available, set the sync frequency to 0s; otherwise, set a batch schedule.
  6. Other optional settings include hard delete support, lowercase column names for Athena, and a Systems Manager prefix for credentials.

Step 3: Bind Collections to Iceberg Tables

  1. Move on to the Source Collections section within connector creation. Here, you can add a binding for each MongoDB collection you want to stream.
  2. For each binding:
    • Select the Estuary collection created by your MongoDB capture.
    • Alter the binding’s default table name or namespace if desired. You can also set the default naming convention before adding collections to quickly choose between common naming schemes.
  3. Click Next.
  4. Save and Publish the materialization to start the continuous stream from MongoDB change streams into Iceberg tables.

Once published, this materialization will be ready to receive and merge data from MongoDB in real time.

For connector-specific parameters, see the official documentation for the MongoDB Capture Connector and Apache Iceberg Materialization Connector.


Performance and Cost Tuning

To get the most value from your MongoDB to Apache Iceberg stream, you should balance low latency with predictable spend. Estuary offers several configuration options to help optimize performance and cost.

  1. Prefer Change Streams Over Batch: Always choose Change Stream Incremental mode when possible. This delivers new events with sub-second latency and reduces compute costs compared to repeated polling.
  2. Use Batch Incremental for Append-Only Workloads: If your collections grow with strictly increasing indexed fields (such as timestamps or IDs), Batch Incremental mode can reduce costs by scanning only new data at each poll interval.
  3. Configure EMR Serverless Autoscaling: For Iceberg materialization, enable autoscaling in EMR Serverless so compute resources adjust dynamically to workload size. Set auto-stop timers to release resources when idle.
  4. Optimize Partition Strategy: Align your Iceberg table partitioning with query patterns. For example, partition by date for time-series workloads or by customer ID for segmentation queries. This minimizes the amount of data scanned and reduces query costs. A sketch covering this item and the next follows the list.
  5. Schedule Compactions Strategically: Compact Iceberg data files during off-peak hours to reduce EMR runtime costs and improve read performance.
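As a sketch of items 4 and 5, assuming Iceberg's Spark SQL extensions are enabled and using illustrative catalog, table, and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Item 4: partition a time-series table by day so queries prune by date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.analytics.events (
        event_id STRING,
        customer_id STRING,
        ts TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Partition specs can evolve later without rewriting existing data, e.g. to
# also bucket by customer for segmentation queries.
spark.sql("""
    ALTER TABLE glue_catalog.analytics.events
    ADD PARTITION FIELD bucket(16, customer_id)
""")

# Item 5: compact small files during off-peak hours with Iceberg's
# rewrite_data_files maintenance procedure.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'analytics.events')")
```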

With these practices, you can maintain near real-time freshness from MongoDB while keeping infrastructure expenses under control.

Security Deep Dive

When streaming MongoDB data to Apache Iceberg, especially for enterprise workloads, security and compliance are non-negotiable. Estuary’s architecture is designed to let you keep full control over your data while meeting strict regulatory requirements.

BYOC for Maximum Control

With the Bring Your Own Cloud (BYOC) deployment model, you can run your own Estuary data plane inside your own VPC. This means data never leaves your infrastructure, and you maintain direct control over compute, storage, and networking.

Private Networking Options

You can connect on-premises or cloud-hosted MongoDB instances securely using:

  • VPC Peering: Directly connect your VPC to the MongoDB or Iceberg catalog’s VPC without internet exposure.
  • PrivateLink: Establish a private connection over AWS or Azure without traversing public networks.
  • SSH Tunnels: For on-premises or restricted networks, tunnel traffic through a secure SSH connection to Estuary’s runtime.

IAM and Access Control

Follow the principle of least privilege by granting Estuary only the permissions it needs:

  • For AWS Glue or S3 Tables catalogs, allow minimal read/write access to metadata and staging buckets (a minimal staging-bucket policy is sketched after this list).
  • For EMR Serverless, provide an execution role scoped to the specific application and region.
  • Store secrets securely in AWS Systems Manager Parameter Store or a similar vault.
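As one illustration of the first point, a minimal staging-bucket policy might look like the following, expressed here as a Python dict; the bucket name and prefix are placeholders, and Glue, S3 Tables, and EMR Serverless permissions are separate and not shown:

```python
# Illustrative least-privilege policy for the S3 staging bucket only.
staging_bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-staging-bucket/estuary/*",  # placeholder
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-staging-bucket",  # placeholder
        },
    ],
}
```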

Data Encryption

All communication between Estuary and endpoints is encrypted in transit using TLS. Data at rest in staging buckets or intermediate collections is encrypted using the storage provider’s native encryption (SSE-S3, SSE-KMS, or equivalent).

Audit and Traceability

Estuary’s collections act as immutable, schema-enforced logs of every record written. This supports data lineage tracking and helps with compliance audits. Access logs, execution metrics, and configuration history are available for review in the Estuary UI or via APIs.

Conclusion

Streaming MongoDB data to Apache Iceberg enables teams to build a reliable, real-time data lakehouse that supports analytics, AI models, and BI dashboards without relying on fragile batch pipelines. While it is possible to achieve this using custom CDC pipelines or manual hybrid architectures, those approaches require significant engineering effort to manage correctness, schema evolution, and operational reliability.

With Estuary, teams get a unified, end-to-end solution that captures changes from MongoDB with exactly-once guarantees and materializes them into Iceberg tables with full transactional integrity and schema governance. The pipeline handles initial backfills and continuous streaming as a single system, reducing operational complexity while maintaining real-time freshness.

Whether you are using MongoDB Atlas, self-hosted MongoDB, Amazon DocumentDB, or Azure Cosmos DB with the MongoDB API, Estuary makes it possible to set up a production-ready MongoDB to Iceberg pipeline in minutes. This allows teams to focus on using their data rather than maintaining the infrastructure that moves it.

FAQs

    Can I stream MongoDB to Apache Iceberg without Estuary?

    Yes, you can build a custom pipeline using scripts or open-source tools, but it requires coding connectors, managing schema evolution, and handling exactly-once delivery yourself. This approach is time-consuming and complex compared to a managed solution.

    What does Estuary provide for MongoDB to Iceberg pipelines?

    Estuary Flow offers a fully managed, exactly-once streaming pipeline from MongoDB to Iceberg. It supports MongoDB Atlas, self-hosted MongoDB, Amazon DocumentDB, and Azure Cosmos DB for MongoDB API, with schema governance and BYOC deployment options.

    Is the pipeline real-time?

    Yes, with MongoDB change streams, Estuary streams updates to Iceberg in seconds, keeping analytics and BI dashboards current without batch delays.

    Can I still use Estuary if my deployment does not support change streams?

    Yes, Estuary supports batch snapshot and batch incremental modes for deployments without change streams, ensuring your Iceberg tables still stay updated.

    Can I route MongoDB changes through Kafka into Iceberg?

    Yes, you can capture MongoDB change streams into Kafka and then write to Iceberg, but this adds complexity. Using Estuary removes the need for a separate messaging layer and ensures exactly-once delivery directly into Iceberg.

About the author

Dani Pálma, Head of Data & Marketing

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has supported startups and enterprises in building robust data solutions.
