
How to Stream MongoDB Data to Apache Iceberg for Analytics and AI
Learn how to stream MongoDB data to Apache Iceberg in real time using Estuary. Capture MongoDB change streams, ensure exactly-once delivery, and keep your Iceberg tables fresh for analytics, AI, and BI without complex ETL jobs.

Moving data from MongoDB to Apache Iceberg is a common requirement for teams that want real-time analytics, AI workloads, and BI dashboards on operational data. The challenge is not just copying data, but keeping it continuously in sync as MongoDB changes, while preserving correctness, schema evolution, and transactional guarantees.
A MongoDB CDC to Iceberg pipeline captures inserts, updates, and deletes from MongoDB and materializes them into Iceberg tables with ACID transactions, schema evolution, partition evolution, and time travel. This allows analytics and machine learning systems to query the most current data directly from cloud storage, without relying on fragile batch jobs or manual exports.
Teams can implement this pipeline in different ways, ranging from custom-built CDC architectures to fully managed streaming solutions. The sections below explain why streaming MongoDB data into Iceberg matters, how the pipeline works, and what approaches teams typically use.
Why Stream MongoDB Data to Apache Iceberg
MongoDB is often the system of record for high-velocity data such as application events, IoT readings, transactions, and user interactions. This data changes continuously, and analytics or AI workloads lose value when they operate on stale or delayed information. Apache Iceberg provides an ideal destination for this data by combining the scalability of a data lake with the reliability and transactional guarantees of a database.
With a MongoDB CDC to Iceberg integration, every insert, update, and delete can be captured as it happens and written into Iceberg tables that support ACID transactions, schema evolution, and time travel. This ensures that queries, dashboards, and machine learning pipelines always operate on the most current and consistent view of the data.
Streaming MongoDB into Iceberg also removes the need for manual exports or scheduled batch ETL jobs that require constant upkeep. Instead of coordinating snapshots and reconciling partial loads, a continuous pipeline delivers incremental updates, keeps schemas in sync, and maintains a live lakehouse that is always ready for analytics and AI workloads.
Method 1: Streaming MongoDB Data to Apache Iceberg with Estuary
One common and increasingly preferred approach is to use a managed CDC pipeline that continuously streams MongoDB changes directly into Apache Iceberg. In this model, change data capture handles both the initial historical load and ongoing inserts, updates, and deletes, while the destination layer ensures transactional writes and schema evolution in Iceberg tables.
With Estuary, this pipeline is fully managed end-to-end. Estuary connects to MongoDB using change streams, performs an initial backfill to capture existing data, and then continuously streams changes with exactly-once guarantees. On the destination side, data is materialized into Apache Iceberg using a transactional write path that preserves ACID semantics, schema evolution, and time travel.
This approach removes the need to build and operate custom CDC consumers, orchestration logic, or merge jobs, while still delivering real-time freshness and reliability.
Other Ways Teams Move MongoDB Data to Apache Iceberg
Before managed streaming pipelines became common, most teams relied on custom or manual approaches to move MongoDB data into Apache Iceberg. These methods are still used today, but they come with meaningful trade-offs.
One common approach is to build a custom CDC pipeline using MongoDB Change Streams. In this model, engineers write and operate their own consumers to read change events, manage offsets, handle retries, and apply updates to Iceberg tables using processing frameworks such as Spark, Flink, or cloud-native jobs like AWS Glue. While flexible, this approach requires significant engineering effort to implement exactly-once guarantees, manage schema evolution, and safely handle backfills and failures.
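For illustration, here is a minimal sketch of what such a hand-rolled consumer can look like, using pymongo. The database, collection, and checkpoint file names are placeholders, and a production version would still need batching, schema mapping, and an actual Iceberg writer.

```python
# Minimal sketch of a hand-rolled MongoDB change stream consumer (pymongo).
# "shop", "orders", and the checkpoint file are illustrative placeholders; a
# real pipeline must also batch events, map schemas, and write to Iceberg.
import json
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:password@cluster.example.net")  # placeholder URI
collection = client["shop"]["orders"]

def load_resume_token():
    try:
        with open("resume_token.json") as f:
            return json.load(f)
    except FileNotFoundError:
        return None

def save_resume_token(token):
    with open("resume_token.json", "w") as f:
        json.dump(token, f)

# full_document="updateLookup" asks MongoDB to include the post-image of updates.
with collection.watch(full_document="updateLookup",
                      resume_after=load_resume_token()) as stream:
    for change in stream:
        op = change["operationType"]       # insert / update / replace / delete
        doc = change.get("fullDocument")   # None for deletes
        # ... buffer the event and periodically merge it into the Iceberg table ...
        save_resume_token(stream.resume_token)  # checkpoint so restarts resume cleanly
```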
Another widely used method is periodic batch exports. MongoDB collections are exported to cloud storage on a schedule and then loaded into Iceberg tables using batch jobs. This approach is easier to reason about but results in stale data, higher compute costs, and limited support for incremental updates or deletes.
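As a rough sketch of this pattern, a scheduled PySpark job might reload an exported dump as shown below. The paths, catalog, and table names are assumptions, and the full snapshot replace is exactly where the staleness and missing-delete problems come from.

```python
# Sketch of a scheduled batch load: newline-delimited JSON produced by
# mongoexport and landed in S3 is rewritten into an Iceberg table.
# Paths, catalog, and table names are illustrative; the Spark session must be
# configured with an Iceberg catalog (here assumed to be named "lake").
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-batch-to-iceberg")
    .getOrCreate()
)

# Read the latest nightly export.
df = spark.read.json("s3://my-bucket/exports/orders/dt=2024-01-01/")

# Replace the table contents with this snapshot. Incremental updates and
# deletes are not handled, which is the main drawback of the batch approach.
df.writeTo("lake.analytics.orders").createOrReplace()
```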
Some organizations attempt a manual hybrid architecture that combines batch snapshots with custom streaming updates. While this approach can reduce staleness compared to pure batch pipelines, it typically requires additional logic for deduplication, ordering, and reconciliation when writing into Iceberg tables.
These approaches can work, but they demand ongoing maintenance, custom logic, and deep expertise in both CDC systems and Iceberg internals, particularly when snapshot ingestion and streaming updates are managed as separate systems. This is why many teams choose a fully managed solution that provides real-time change capture, schema governance, and exactly-once delivery without the operational overhead.
Prerequisites Checklist
Before you set up your pipeline, ensure you have:
- A MongoDB cluster with change streams enabled (Atlas, self-hosted, or compatible services like Amazon DocumentDB and Azure Cosmos DB for MongoDB API; note that some versions may have limited change stream support). A quick way to verify change stream support is shown after this checklist.
- An Apache Iceberg REST catalog (e.g., AWS Glue, Snowflake Open Catalog).
- Access to Estuary (create a free account if you don’t have one).
- Network and IAM permissions for Estuary to connect to both MongoDB and your Iceberg environment.
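If you want to verify the first item on this checklist before configuring anything, a short pymongo script can attempt to open a change stream. The connection string and database name below are placeholders, and the call will fail on deployments that do not support change streams, such as a standalone mongod.

```python
# Quick check that the cluster accepts change streams with your credentials.
# The URI and database name are placeholders.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb+srv://readonly_user:secret@cluster.example.net")

try:
    # Opening a change stream fails fast if change streams are unsupported
    # (e.g., a standalone server) or the user lacks the required privileges.
    with client["shop"].watch(max_await_time_ms=1000):
        print("Change streams are available.")
except PyMongoError as exc:
    print(f"Change streams not available: {exc}")
```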
Architecture Overview
A MongoDB to Apache Iceberg streaming pipeline built with Estuary has three main layers:
- Source layer: The pipeline begins with MongoDB change streams, which capture every data change in real time. When you first connect, Estuary automatically performs a backfill to take a complete snapshot of your existing collections as part of a single CDC pipeline, before switching seamlessly to streaming mode. This ensures there are no gaps between historical and live data.
- Transport and governance layer: Captured events flow into Estuary collections. These collections validate data against JSON schemas, enforce exactly-once delivery, and maintain data lineage for governance and auditability. This layer also manages schema evolution so that changes in MongoDB are propagated safely to downstream systems.
- Destination layer: On the Iceberg side, Estuary writes data through a REST catalog such as AWS Glue or S3 Tables. It stages data in an S3 bucket and uses EMR Serverless compute to merge new events into Iceberg tables with full ACID compliance. This transactional approach ensures that queries in tools like Athena or Spark always return consistent and up-to-date results.
This architecture is fault-tolerant, recoverable, and scalable, making it suitable for enterprise-grade MongoDB to Iceberg streaming workloads.
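Estuary runs this merge step for you, but to make the destination layer concrete, the sketch below shows the kind of transactional upsert an Iceberg MERGE performs in Spark. The table, view, and column names are illustrative; this is not Estuary's actual job, just the shape of the operation.

```python
# Illustration only: the shape of a transactional merge into an Iceberg table.
# The target table glue.analytics.orders is assumed to exist; Estuary's
# managed pipeline performs an equivalent operation on EMR Serverless.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-merge-example").getOrCreate()

# A tiny stand-in for a batch of captured change events (normally staged in S3).
changes = spark.createDataFrame(
    [("a1", "shipped", "update"), ("b2", None, "delete")],
    ["_id", "status", "op"],
)
changes.createOrReplaceTempView("staged_changes")

# Apply inserts, updates, and deletes in a single ACID transaction.
spark.sql("""
    MERGE INTO glue.analytics.orders AS target
    USING staged_changes AS source
    ON target._id = source._id
    WHEN MATCHED AND source.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET target.status = source.status
    WHEN NOT MATCHED AND source.op != 'delete'
        THEN INSERT (_id, status) VALUES (source._id, source.status)
""")
```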
Step-by-Step: Setting Up MongoDB to Apache Iceberg in Estuary
Watch it in Action:
See how to set up a real-time MongoDB to Apache Iceberg pipeline in Estuary.
Step 1: Create the MongoDB Capture in Estuary
- Sign Up or Log In to Estuary
If you do not have an Estuary account, visit dashboard.estuary.dev/register and sign up. Once logged in, open the Estuary web application.
- Select MongoDB Capture Connector
In the Estuary UI, choose the MongoDB capture connector. Estuary supports MongoDB Atlas, self-hosted MongoDB replica sets, sharded clusters, Amazon DocumentDB, and Azure Cosmos DB for MongoDB API.
- Enter Connection Details
- Address / Connection String: Paste the MongoDB URI. For Atlas, use the SRV format (mongodb+srv://); for self-hosted deployments, include the authSource parameter when needed (example URIs are shown after this step).
- Authentication: Provide a MongoDB username and password with read privileges.
- Database Discovery: Select the databases and collections you want Estuary to discover.
- Choose Capture Mode Per Collection (Optional)
- Change Stream Incremental: For low-latency real-time CDC.
- Batch Snapshot: For one-time backfills without ongoing CDC.
- Batch Incremental: For workloads where you want periodic updates using a strictly increasing indexed field (e.g., timestamp or ObjectId).
If left unset, collections will default to incremental CDC as long as the MongoDB instance supports change streams.
- Configure Polling (If Needed)
For collections using batch modes, set a polling schedule in cron or interval format to control how often Estuary fetches new data.
- Save and Publish the Capture
Publishing starts the initial backfill followed by continuous change stream capture (if enabled).
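For reference, these are the two URI shapes mentioned under Enter Connection Details. Hosts, credentials, replica set, and database names are placeholders.

```python
# Placeholder URIs showing the two connection string shapes referenced above.

# MongoDB Atlas: SRV format; TLS and host discovery are handled automatically.
atlas_uri = "mongodb+srv://flow_capture:<password>@cluster0.abc123.mongodb.net/"

# Self-hosted replica set: list members explicitly and, when the user is
# defined in a different database than the one being captured, add authSource.
self_hosted_uri = (
    "mongodb://flow_capture:<password>@mongo1.internal:27017,"
    "mongo2.internal:27017/?replicaSet=rs0&authSource=admin"
)
```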
Step 2: Create the Iceberg Materialization
- In the Estuary UI, go to Materializations and select the Apache Iceberg connector. A delta updates connector is also available.
- Enter the endpoint details: Base URL, Warehouse, Namespace, and Base Location if required.
- Configure catalog authentication:
- AWS SigV4 for Glue or S3 Tables.
- OAuth client credentials for other REST catalogs.
- Set up compute: EMR Serverless Application ID, AWS access keys, AWS Region, Execution Role ARN, and S3 staging bucket.
- Optionally choose a sync schedule for your materialization. To sync new data to Iceberg as soon as it is available, set the sync frequency to 0s, or set a batch schedule instead.
- Other optional settings include hard delete support, lowercase column names for Athena, and a Systems Manager prefix for credentials.
Step 3: Bind Collections to Iceberg Tables
- Move on to the Source Collections section within connector creation. Here, you can add a binding for each MongoDB collection you want to stream.
- For each binding:
- Select the Estuary collection created by your MongoDB capture.
- Alter the binding’s default table name or namespace if desired. You can also set the default naming convention before adding collections to quickly choose between common naming schemes.
- Click Next.
- Save and Publish the materialization to start the continuous stream from MongoDB change streams into Iceberg tables.
Once published, this materialization will be ready to receive and merge data from MongoDB in real time.
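To confirm rows are landing, you can query the resulting table directly. The sketch below uses PyIceberg against an AWS Glue catalog; the catalog, namespace, and table names are assumptions, and Athena or Spark work just as well.

```python
# Read back the materialized table to spot-check the pipeline end to end.
# Namespace and table names are placeholders; requires pyiceberg with the
# Glue extra installed and AWS credentials available in the environment.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", **{"type": "glue"})
table = catalog.load_table("estuary_demo.orders")

# Pull a handful of rows into Arrow to verify the merged data.
print(table.scan(limit=10).to_arrow())
```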
For connector-specific parameters, see the official documentation for the MongoDB Capture Connector and Apache Iceberg Materialization Connector.
Ready to try it yourself?
Set up your MongoDB to Iceberg pipeline in minutes with Estuary.
- Free forever for up to 10 GB/month (no credit card needed)
- Millisecond latency for real-time freshness
- Exactly-once delivery with incremental syncing
- 30-day free trial for higher-volume Cloud plans
Performance and Cost Tuning
To get the most value from your MongoDB to Apache Iceberg stream, you should balance low latency with predictable spend. Estuary offers several configuration options to help optimize performance and cost.
- Prefer Change Streams Over Batch: Always choose Change Stream Incremental mode when possible. This delivers new events with sub-second latency and reduces compute costs compared to repeated polling.
- Use Batch Incremental for Append-Only Workloads: If your collections grow with strictly increasing indexed fields (such as timestamps or IDs), Batch Incremental mode can reduce costs by scanning only new data at each poll interval.
- Configure EMR Serverless Autoscaling: For Iceberg materialization, enable autoscaling in EMR Serverless so compute resources adjust dynamically to workload size. Set auto-stop timers to release resources when idle.
- Optimize Partition Strategy: Align your Iceberg table partitioning with query patterns. For example, partition by date for time-series workloads or by customer ID for segmentation queries. This minimizes the amount of data scanned and reduces query costs; a short sketch at the end of this section shows what this looks like.
- Schedule Compactions Strategically: Compact Iceberg data files during off-peak hours to reduce EMR runtime costs and improve read performance.
With these practices, you can maintain near real-time freshness from MongoDB while keeping infrastructure expenses under control.
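To illustrate the partitioning advice above, the snippet below creates a date-partitioned Iceberg table and later evolves its partition spec without rewriting existing data. Table and column names are assumptions, and the ALTER statement requires Iceberg's Spark SQL extensions to be enabled.

```python
# Illustrative only: align Iceberg partitioning with query patterns.
# Catalog, table, and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-partitioning").getOrCreate()

# Time-series workloads: partition by day so date-range queries prune files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.analytics.events (
        _id STRING,
        customer_id STRING,
        ts TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Partition evolution: add another partition field later without rewriting
# existing data files; only new writes pick up the new spec.
spark.sql("ALTER TABLE glue.analytics.events ADD PARTITION FIELD customer_id")
```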
Security Deep Dive
When streaming MongoDB data to Apache Iceberg, especially for enterprise workloads, security and compliance are non-negotiable. Estuary’s architecture is designed to let you keep full control over your data while meeting strict regulatory requirements.
BYOC for Maximum Control
With the Bring Your Own Cloud (BYOC) deployment model, you can run your own Estuary data plane inside your own VPC. This means data never leaves your infrastructure, and you maintain direct control over compute, storage, and networking.
Private Networking Options
You can connect on-premises or cloud-hosted MongoDB instances securely using:
- VPC Peering: Directly connect your VPC to the MongoDB or Iceberg catalog’s VPC without internet exposure.
- PrivateLink: Establish a private connection over AWS or Azure without traversing public networks.
- SSH Tunnels: For on-premises or restricted networks, tunnel traffic through a secure SSH connection to Estuary’s runtime.
IAM and Access Control
Follow the principle of least privilege by granting Estuary only the permissions it needs:
- For AWS Glue or S3 Tables catalogs, allow minimal read/write access to metadata and staging buckets.
- For EMR Serverless, provide an execution role scoped to the specific application and region.
- Store secrets securely in AWS Systems Manager Parameter Store or a similar vault.
Data Encryption
All communication between Estuary and endpoints is encrypted in transit using TLS. Data at rest in staging buckets or intermediate collections is encrypted using the storage provider’s native encryption (SSE-S3, SSE-KMS, or equivalent).
Audit and Traceability
Estuary’s collections act as immutable, schema-enforced logs of every record written. This supports data lineage tracking and helps with compliance audits. Access logs, execution metrics, and configuration history are available for review in the Estuary UI or via APIs.
Conclusion
Streaming MongoDB data to Apache Iceberg enables teams to build a reliable, real-time data lakehouse that supports analytics, AI models, and BI dashboards without relying on fragile batch pipelines. While it is possible to achieve this using custom CDC pipelines or manual hybrid architectures, those approaches require significant engineering effort to manage correctness, schema evolution, and operational reliability.
With Estuary, teams get a unified, end-to-end solution that captures changes from MongoDB with exactly-once guarantees and materializes them into Iceberg tables with full transactional integrity and schema governance. The pipeline handles initial backfills and continuous streaming as a single system, reducing operational complexity while maintaining real-time freshness.
Whether you are using MongoDB Atlas, self-hosted MongoDB, Amazon DocumentDB, or Azure Cosmos DB with the MongoDB API, Estuary makes it possible to set up a production-ready MongoDB to Iceberg pipeline in minutes. This allows teams to focus on using their data rather than maintaining the infrastructure that moves it.
Next steps:
- Try it yourself – Sign up for a free Estuary account and start streaming from MongoDB to Iceberg today.
- Read the docs – Check out the official guides for MongoDB Capture and Iceberg Materialization for detailed configuration steps.
- Talk to our experts – Book a consultation to discuss your use case and compliance requirements.
FAQs
What is the best tool for MongoDB to Iceberg integration?
It depends on how much you want to build and operate yourself. Custom Change Streams consumers and batch exports give you full control but require significant engineering effort, while a managed platform like Estuary handles backfills, exactly-once delivery, schema evolution, and Iceberg merges for you.
Is MongoDB to Iceberg integration real time?
Yes, when the pipeline is built on MongoDB change streams. Estuary captures inserts, updates, and deletes as they happen and merges them into Iceberg tables continuously; the materialization's sync schedule is the main source of added latency.
Can it work without MongoDB change streams?
Yes. Estuary's MongoDB connector also supports Batch Snapshot and Batch Incremental modes, which poll collections on a schedule. These work on deployments without change stream support, at the cost of higher latency and weaker handling of deletes.
Can I use Kafka as an intermediary between MongoDB and Apache Iceberg?
You can route change events through Kafka using source and sink connectors, but that adds another system to deploy and operate. With Estuary, data streams from MongoDB to Iceberg directly, so an intermediate broker is not required.

About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.