
How to Stream MongoDB Data to Apache Iceberg for Analytics and AI

Learn how to stream MongoDB data to Apache Iceberg in real time using Estuary Flow. Capture MongoDB change streams, ensure exactly-once delivery, and keep your Iceberg tables fresh for analytics, AI, and BI without complex ETL jobs.

MongoDB-Apache Iceberg Streaming Pipeline with Estuary

Want to stream MongoDB data to Apache Iceberg without managing complex ETL pipelines?

With a MongoDB CDC to Iceberg integration, you can capture changes from MongoDB as they happen and store them in Iceberg tables. These tables give you ACID transactions, schema evolution, partition evolution, and time travel, all while using your existing cloud storage.

This means your analytics, AI models, and BI dashboards always have the freshest data without waiting for nightly batch jobs. With Estuary Flow, a MongoDB to Apache Iceberg pipeline takes minutes to set up, runs with exactly-once guarantees, and keeps data flowing continuously so you never have to worry about stale information.

Why Stream MongoDB Data to Apache Iceberg

MongoDB is often the system of record for high-velocity data such as application events, IoT readings, transactions, and user interactions. This data changes constantly, and analytics or AI workloads lose value if they work with stale information. Apache Iceberg provides the ideal destination for this data by combining the scalability of a data lake with the reliability of a database.

With MongoDB CDC to Iceberg integration, you can capture every insert, update, and delete as it happens and store it in Iceberg tables that support ACID transactions, schema evolution, and time travel. This ensures that queries, dashboards, and ML pipelines always operate on the most current state of your data.

Streaming MongoDB into Iceberg also removes the need for manual exports or batch ETL jobs that require scheduling and constant upkeep. Instead, a continuous pipeline delivers exactly-once guarantees, keeps schema in sync, and maintains a live data lakehouse ready for any analytics or AI workload.

Prerequisites Checklist

Before you set up your pipeline, ensure you have:

  • A MongoDB cluster with change streams enabled (Atlas, self-hosted, or compatible services such as Amazon DocumentDB and Azure Cosmos DB for MongoDB API; note that some versions have limited change stream support). A quick way to verify change stream support appears after this list.
  • An Apache Iceberg REST catalog (e.g., AWS Glue, Snowflake Open Catalog).
  • Access to Estuary Flow (create a free account if you don’t have one).
  • Network and IAM permissions for Estuary Flow to connect to both MongoDB and your Iceberg environment.
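
If you are unsure whether your deployment supports change streams, a quick check like the following can confirm it before you configure the capture. This is a minimal sketch using pymongo; the connection URI, database, and collection names are placeholders for your own deployment.

```python
# Minimal change-stream availability check with pymongo.
# The URI, database, and collection names below are placeholders.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net/")

try:
    # watch() requires a replica set or sharded cluster; standalone servers
    # and some MongoDB-compatible services reject it.
    with client["appdb"]["orders"].watch(max_await_time_ms=1000):
        print("Change streams are available; Change Stream Incremental mode should work.")
except PyMongoError as exc:
    print(f"Change streams unavailable ({exc}); consider the batch capture modes instead.")
```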

Architecture Overview

A MongoDB to Apache Iceberg streaming pipeline built with Estuary Flow has three main layers:

  1. Source layer: The pipeline begins with MongoDB change streams, which capture every data change in real time. When you first connect, Estuary Flow automatically performs a backfill to take a complete snapshot of your existing collections before switching to streaming mode. This ensures there are no gaps between historical and live data.
  2. Transport and governance layer: Captured events flow into Estuary collections. These collections validate data against JSON schemas, enforce exactly-once delivery, and maintain data lineage for governance and auditability. This layer also manages schema evolution so that changes in MongoDB are propagated safely to downstream systems.
  3. Destination layer: On the Iceberg side, Estuary writes data through a REST catalog such as AWS Glue or S3 Tables. It stages data in an S3 bucket and uses EMR Serverless compute to merge new events into Iceberg tables with full ACID compliance. This transactional approach ensures that queries in tools like Athena or Spark always return consistent and up-to-date results.

The architecture is designed to be fault-tolerant, recoverable, and scalable for any workload size, making it ideal for enterprise-grade MongoDB to Iceberg integration.
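
For intuition, the source layer behaves roughly like the pymongo sketch below: a one-time read of existing documents (the backfill) followed by a resumable change stream. This is purely illustrative; Estuary Flow's connector manages the backfill, resume tokens, and exactly-once bookkeeping for you, and the URI and names here are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["appdb"]["orders"]                   # placeholder database/collection

def handle(event):
    # Stand-in for whatever the pipeline does with a document or change event.
    print(event)

# Backfill: snapshot the existing documents once.
for doc in coll.find({}):
    handle(doc)

# Streaming: tail the change stream, keeping the resume token so the
# pipeline can continue where it left off after a restart.
with coll.watch(full_document="updateLookup") as stream:
    for change in stream:
        handle(change)
        resume_token = stream.resume_token  # persist durably in a real pipeline
```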

Watch it in Action:

See how to set up a real-time MongoDB to Apache Iceberg pipeline in Estuary Flow.

Step-by-Step: Setting Up MongoDB to Apache Iceberg in Estuary Flow

Step 1: Create the MongoDB Capture in Estuary Flow

  1. Sign Up or Log In to Estuary Flow
    If you do not have an Estuary Flow account, visit dashboard.estuary.dev/register and sign up. Once logged in, open the Flow web application.
  2. Select MongoDB Capture Connector
    In the Estuary Flow UI, choose the MongoDB capture connector. Flow supports MongoDB Atlas, self-hosted MongoDB replica sets, sharded clusters, Amazon DocumentDB, and Azure Cosmos DB for MongoDB API.
Real-time MongoDB CDC source connector
  3. Enter Connection Details
Configure MongoDB capture details
  • Address / Connection String: Paste the MongoDB URI. For Atlas, use the SRV format (mongodb+srv://); for self-hosted deployments, include the authSource parameter when needed.
  • Authentication: Provide a MongoDB username and password with read privileges.
  • Database Discovery: Select the databases and collections you want Flow to discover. (A short sketch after these steps shows how to preview what your credentials can see.)
  4. Choose Capture Mode Per Collection (Optional)

    • Change Stream Incremental: For low-latency, real-time CDC.
    • Batch Snapshot: For one-time backfills without ongoing CDC.
    • Batch Incremental: For workloads where you want periodic updates using a strictly increasing indexed field (e.g., a timestamp or ObjectId).

If left unset, collections default to incremental CDC as long as the MongoDB instance supports change streams.

  5. Configure Polling (If Needed)
    For collections using batch modes, set a polling schedule in cron or interval format to control how often Flow fetches new data.
  6. Save and Publish the Capture
    Publishing starts the initial backfill, followed by continuous change stream capture (if enabled).
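
Before publishing, you can sanity-check the credentials and URI you just entered by listing what they can see. This is a rough preview of what Flow's discovery step will find; the URI below is a placeholder.

```python
from pymongo import MongoClient

# Placeholder URI; for self-hosted deployments something like
# "mongodb://flow_capture:password@host:27017/?authSource=admin" may be needed.
client = MongoClient("mongodb+srv://flow_capture:password@cluster0.example.mongodb.net/")

for db_name in client.list_database_names():
    if db_name in ("admin", "local", "config"):
        continue  # skip system databases
    print(db_name, client[db_name].list_collection_names())
```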

Step 2: Create the Iceberg Materialization

  1. In the Estuary Flow UI, go to Materializations and select the Apache Iceberg connector. A delta updates connector is also available.
Apache Iceberg destination connector options in Estuary
  2. Enter the endpoint details: Base URL, Warehouse, Namespace, and Base Location if required. (A quick catalog connectivity check appears after these steps.)
  3. Configure catalog authentication:
    • AWS SigV4 for Glue or S3 Tables.
    • OAuth client credentials for other REST catalogs.
  4. Set up compute: EMR Serverless Application ID, AWS access keys, AWS Region, Execution Role ARN, and S3 staging bucket.
  5. Optionally choose a sync schedule for your materialization. To sync new data to Iceberg as soon as it is available, set the sync frequency to 0s; otherwise, set a batch schedule.
  6. Other optional settings include hard delete support, lowercase column names for Athena, and a Systems Manager prefix for credentials.
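
To confirm the catalog details before publishing, you can reach the same REST catalog from a client such as PyIceberg. The sketch below assumes an AWS Glue Iceberg REST endpoint with SigV4 signing; the URI, warehouse (account ID), and region are placeholders, and property names may vary by PyIceberg version, so treat this as a starting point rather than a definitive configuration.

```python
from pyiceberg.catalog import load_catalog

# Placeholder values for a Glue Iceberg REST catalog with SigV4 auth.
catalog = load_catalog(
    "glue_rest",
    **{
        "type": "rest",
        "uri": "https://glue.us-east-1.amazonaws.com/iceberg",
        "warehouse": "123456789012",          # AWS account ID for Glue
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": "us-east-1",
    },
)

# Listing namespaces verifies connectivity and permissions.
print(catalog.list_namespaces())
```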

Step 3: Bind Collections to Iceberg Tables

  1. Move on to the Source Collections section within connector creation. Here, you can add a binding for each MongoDB collection you want to stream.
  2. For each binding:
    • Select the Estuary collection created by your MongoDB capture.
    • Alter the binding’s default table name or namespace if desired. You can also set the default naming convention before adding collections to quickly choose between common naming schemes.
  3. Click Next.
  4. Save and Publish the materialization to start the continuous stream from MongoDB change streams into Iceberg tables.

Once published, this materialization will be ready to receive and merge data from MongoDB in real time.
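
Once the first sync completes, you can verify that rows are landing by reading the table back through the same REST catalog. A minimal PyIceberg sketch, assuming a catalog configured as in the previous example and a hypothetical estuary.orders table created by the binding:

```python
from pyiceberg.catalog import load_catalog

# Assumes a catalog named "glue_rest" is configured (e.g., in ~/.pyiceberg.yaml
# or with the properties shown earlier). The table name is hypothetical.
catalog = load_catalog("glue_rest")
table = catalog.load_table("estuary.orders")

# Read a handful of rows to confirm the materialization is writing data.
print(table.scan(limit=5).to_arrow())
```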

For connector-specific parameters, see the official documentation for the MongoDB Capture Connector and Apache Iceberg Materialization Connector.

Ready to try it yourself?
Set up your MongoDB to Iceberg pipeline in minutes with Estuary Flow.

  • Free forever for up to 10 GB/month (no credit card needed)
  • Millisecond latency for real-time freshness
  • Exactly-once delivery with incremental syncing
  • 30-day free trial for higher-volume Cloud plans

Get Started Free

Performance and Cost Tuning

To get the most value from your MongoDB to Apache Iceberg stream, you should balance low latency with predictable spend. Estuary Flow offers several configuration options to help optimize performance and cost.

  1. Prefer Change Streams Over Batch: Always choose Change Stream Incremental mode when possible. This delivers new events with sub-second latency and reduces compute costs compared to repeated polling.
  2. Use Batch Incremental for Append-Only Workloads: If your collections grow with strictly increasing indexed fields (such as timestamps or IDs), Batch Incremental mode can reduce costs by scanning only new data at each poll interval; a sketch following these tips shows how to create a supporting index.
  3. Configure EMR Serverless Autoscaling: For Iceberg materialization, enable autoscaling in EMR Serverless so compute resources adjust dynamically to workload size. Set auto-stop timers to release resources when idle.
  4. Optimize Partition Strategy: Align your Iceberg table partitioning with query patterns. For example, partition by date for time-series workloads or by customer ID for segmentation queries. This minimizes the amount of data scanned and reduces query costs.
  5. Schedule Compactions Strategically: Compact Iceberg data files during off-peak hours to reduce EMR runtime costs and improve read performance.

With these practices, you can maintain near real-time freshness from MongoDB while keeping infrastructure expenses under control.
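
If you opt into Batch Incremental mode, make sure the cursor field is indexed so each poll only touches new documents. A small pymongo sketch, with placeholder names:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
events = client["appdb"]["events"]                 # placeholder collection

# Batch Incremental mode pages through a strictly increasing field; an
# ascending index on that field keeps each poll cheap.
events.create_index([("created_at", ASCENDING)], name="created_at_asc")
```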

Security Deep Dive

When streaming MongoDB data to Apache Iceberg, especially for enterprise workloads, security and compliance are non-negotiable. Estuary Flow’s architecture is designed to let you keep full control over your data while meeting strict regulatory requirements.

BYOC for Maximum Control

With the Bring Your Own Cloud (BYOC) deployment model, you run the Flow data plane inside your own VPC. This means data never leaves your infrastructure, and you maintain direct control over compute, storage, and networking.

Private Networking Options

You can connect on-premises or cloud-hosted MongoDB instances securely using:

  • VPC Peering: Directly connect your VPC to the MongoDB or Iceberg catalog’s VPC without internet exposure.
  • PrivateLink: Establish a private connection over AWS or Azure without traversing public networks.
  • SSH Tunnels: For on-premises or restricted networks, tunnel traffic through a secure SSH connection to Flow’s runtime.

IAM and Access Control

Follow the principle of least privilege by granting Flow only the permissions it needs:

  • For AWS Glue or S3 Tables catalogs, allow minimal read/write access to metadata and staging buckets.
  • For EMR Serverless, provide an execution role scoped to the specific application and region.
  • Store secrets securely in AWS Systems Manager Parameter Store or a similar vault.
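
For example, a MongoDB password can be stored as an encrypted parameter in AWS Systems Manager Parameter Store and referenced by prefix in the materialization settings. A boto3 sketch with placeholder names and region:

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")  # placeholder region

# Store the credential as an encrypted SecureString under a prefix that
# matches the Systems Manager prefix configured in the connector.
ssm.put_parameter(
    Name="/estuary/flow/mongodb_password",  # hypothetical parameter name
    Value="example-password",
    Type="SecureString",
    Overwrite=True,
)
```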

Data Encryption

All communication between Flow and endpoints is encrypted in transit using TLS. Data at rest in staging buckets or intermediate collections is encrypted using the storage provider’s native encryption (SSE-S3, SSE-KMS, or equivalent).
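
If you manage the staging bucket yourself, you can enforce default encryption on it so staged files are always protected at rest. A boto3 sketch with a placeholder bucket name and KMS key alias:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # placeholder region

# Apply default SSE-KMS encryption to the S3 staging bucket used by the
# Iceberg materialization. Bucket name and key alias are placeholders.
s3.put_bucket_encryption(
    Bucket="my-flow-staging-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/flow-staging",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```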

Audit and Traceability

Estuary’s collections act as immutable, schema-enforced logs of every record written. This supports data lineage tracking and helps with compliance audits. Access logs, execution metrics, and configuration history are available for review in the Flow UI or via APIs.

Conclusion

Streaming MongoDB data to Apache Iceberg unlocks a reliable, real-time data lakehouse that can power advanced analytics, AI models, and BI dashboards without the headaches of building and maintaining custom jobs. With Estuary Flow, you get an end-to-end solution that captures changes from MongoDB with exactly-once guarantees, materializes them into Iceberg with full schema governance, and runs securely in your own environment if needed.

Whether you are using MongoDB Atlas, self-hosted MongoDB, Amazon DocumentDB, or Azure Cosmos DB with the MongoDB API, Estuary Flow can help you set up a high-performance streaming pipeline in minutes. You will save engineering effort, reduce operational risk, and lower total cost of ownership while delivering the freshest data possible to your teams.


FAQs

Can I build a MongoDB to Iceberg pipeline without a managed tool?
    Yes, you can build a custom pipeline using scripts or open-source tools, but it requires coding connectors, managing schema evolution, and handling exactly-once delivery yourself. This approach is time-consuming and complex compared to a managed solution.

What does Estuary Flow provide for MongoDB to Iceberg streaming?
    Estuary Flow offers a fully managed, exactly-once streaming pipeline from MongoDB to Iceberg. It supports MongoDB Atlas, self-hosted MongoDB, Amazon DocumentDB, and Azure Cosmos DB for MongoDB API, with schema governance and BYOC deployment options.

Can Iceberg tables stay current in real time?
    Yes, with MongoDB change streams, Estuary Flow streams updates to Iceberg in seconds, keeping analytics and BI dashboards current without batch delays.

Can I still sync MongoDB to Iceberg if change streams are not available?
    Yes, Estuary Flow supports batch snapshot and batch incremental modes for deployments without change streams, ensuring your Iceberg tables still stay updated.

Can I use Kafka to move MongoDB data into Iceberg?
    Yes, you can capture MongoDB change streams into Kafka and then write to Iceberg, but this adds complexity. Using Estuary Flow removes the need for a separate messaging layer and ensures exactly-once delivery directly into Iceberg.

About the author

Team Estuary (Estuary Editorial Team)

Team Estuary is a group of engineers, product experts, and data strategists building the future of real-time and batch data integration. We write to share technical insights, industry trends, and practical guides.
