
Moving data from Microsoft SQL Server to Apache Iceberg is typically done in one of three ways: a batch migration, incremental loads based on watermarks, or continuous change data capture (CDC). Batch and incremental approaches are simpler to operate but sacrifice freshness and often miss deletes, while CDC pipelines provide correct inserts, updates, and deletes at the cost of higher operational complexity.
Apache Iceberg is commonly used as the analytics target because it supports ACID transactions, schema evolution, and time travel on object storage, and can be queried by engines such as Spark, Trino, and Flink. SQL Server, by contrast, remains the system of record for transactional workloads.
This guide explains the common SQL Server to Iceberg architectures, their tradeoffs, and when a managed CDC platform such as Estuary is an appropriate alternative to running Spark, Kafka, or Flink pipelines.
TL;DR
- SQL Server data is typically moved into Apache Iceberg using batch migration, incremental polling, or log-based CDC.
- Batch and polling approaches are simpler but sacrifice freshness and often miss deletes.
- CDC pipelines using tools like Debezium or Flink provide correct inserts, updates, and deletes but require operating streaming infrastructure.
- Managed CDC platforms such as Estuary stream SQL Server changes directly into Iceberg while reducing operational complexity.
Decide your Iceberg target first
Before choosing how to move data out of SQL Server, it is important to decide where and how your Apache Iceberg tables will live. This choice influences which ingestion approaches are viable and how complex the overall pipeline becomes.
Storage layer
Iceberg tables are stored on a filesystem or object store rather than inside a database engine. Common choices include:
- Amazon S3
- Azure Data Lake Storage (ADLS Gen2)
- Google Cloud Storage (GCS)
- HDFS or compatible systems such as MinIO
Iceberg itself does not manage storage; it relies on these systems for durability and scalability. Your ingestion pipeline must therefore be able to write files and metadata into the chosen storage layer.
Catalog
Iceberg tables are managed through a catalog, which tracks table metadata, snapshots, and schema versions. Common catalog options include:
- Hive Metastore
- REST catalog
- AWS Glue Catalog
- Project Nessie
The catalog choice determines how writers and query engines coordinate commits and how table metadata is discovered. For example, Spark, Flink, and Trino all require compatible catalog configurations to read from and write to Iceberg tables safely.
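To make that coordination concrete, here is a minimal PySpark sketch of registering an Iceberg REST catalog. The catalog name (lakehouse), REST URI, bucket, and library versions are placeholders to adapt to your own environment.

```python
# Minimal sketch: a Spark session wired to an Iceberg REST catalog on S3.
# Catalog name, URI, bucket, and versions are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-catalog-example")
    # The Iceberg runtime jar must match your Spark/Scala versions; S3 access
    # additionally needs the appropriate FileIO and credential configuration.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "lakehouse" backed by a REST catalog service.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://rest-catalog.example.com")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://my-iceberg-warehouse/")
    .getOrCreate()
)

# Tables are then addressed as <catalog>.<namespace>.<table> by every engine.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.sql_server_sync")
```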
Query engines
Iceberg is designed to be queried by multiple engines. In practice, teams commonly use:
- Spark for large-scale analytics and batch processing
- Trino or Presto for interactive SQL queries
- Flink for streaming or near-real-time analytics
Ingestion pipelines typically write to Iceberg through Spark, Flink, or a managed service that implements Iceberg’s write and commit protocols correctly.
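As a quick illustration of multi-engine access, reusing the session sketched above (the table name is hypothetical), the same Iceberg table can be queried directly from Spark, including with time travel:

```python
# Read the current state of an Iceberg table (table name is illustrative).
spark.sql("SELECT count(*) FROM lakehouse.sql_server_sync.orders").show()

# Time travel: query the table as of an earlier point in time.
spark.sql("""
    SELECT count(*)
    FROM lakehouse.sql_server_sync.orders TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()
```

Trino and Flink can read the same table through the shared catalog, which is why the catalog choice above matters.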
Why this decision matters
The combination of storage, catalog, and query engine determines:
- Whether batch-only ingestion is sufficient or CDC is required
- Whether streaming engines such as Spark Structured Streaming or Flink are already part of your stack
- How much operational infrastructure your team is willing to manage
Once the Iceberg target is defined, you can evaluate the available approaches for moving SQL Server data into Iceberg and understand the tradeoffs each approach introduces.
Common ways to move data from SQL Server to Apache Iceberg
There is no single “best” way to move data from SQL Server into Iceberg. The right approach depends on whether you are doing a one-time migration or need continuous updates, whether deletes must be captured correctly, and how much operational complexity your team is willing to manage.
The approaches below are the most common patterns used in practice.
1) One-time or scheduled batch migration (Spark JDBC → Iceberg)
This is the most straightforward way to move data from SQL Server into Iceberg and is often used for initial migrations or periodic refreshes.
How it works
- Spark reads tables from SQL Server using the JDBC connector.
- Data is written into Iceberg tables on object storage.
- Tables can be partitioned and sorted to match analytical query patterns.
This approach is widely documented and well supported by the Iceberg ecosystem.
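As an illustration, a minimal PySpark version of this pattern might look like the following, reusing the session and lakehouse catalog sketched earlier; the connection details, column names, and bounds are placeholders.

```python
from pyspark.sql import functions as F

# Batch read a SQL Server table over JDBC; partition the read so a single
# long-running query does not overload the source. All values are placeholders,
# and the Microsoft JDBC driver must be on the Spark classpath.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sqlserver-host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "...")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "8")
    .load()
)

# Write (or fully replace) the Iceberg table, partitioned for analytical queries.
(
    orders.writeTo("lakehouse.sql_server_sync.orders")
    .partitionedBy(F.col("order_date"))
    .createOrReplace()
)
```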
When this works well
- Initial backfills or historical migrations
- Low-frequency refreshes (daily or hourly)
- Environments where SQL Server CDC cannot be enabled
Tradeoffs
- Changes between runs are not captured automatically.
- Deletes and updates require full rewrites or custom merge logic.
- Querying large tables repeatedly can put load on SQL Server.
- Latency is bounded by the batch schedule.
Batch JDBC ingestion is simple to implement, but it is not suitable when you need continuous updates or low-latency analytics.
2) Incremental loads without log-based CDC (watermark or polling)
Some teams avoid log-based CDC by relying on timestamps or monotonically increasing columns.
How it works
- A column such as updated_at or an increasing ID is used as a watermark.
- Each run reads only rows newer than the last processed value.
- Data is written to Iceberg using partition overwrites or merge semantics (see the sketch after this list).
When this works well
- Tables with reliable update timestamps
- Use cases where deletes are rare or can be ignored
- Lower operational complexity than streaming CDC
Tradeoffs
- Deletes are not captured unless modeled explicitly.
- Late updates can be missed if timestamps are unreliable.
- Merge logic must still be implemented in Spark or another engine.
- Latency remains batch-oriented.
Incremental polling is often used as a compromise, but it does not provide full correctness for CDC workloads.
3) True change data capture (CDC) into Iceberg
When correctness and freshness matter, teams use log-based CDC to capture inserts, updates, and deletes.
3A) Debezium (SQL Server CDC) → Iceberg
Debezium can read SQL Server change data from CDC tables or transaction logs and emit structured change events.
How it works
- Debezium captures row-level changes from SQL Server.
- Change events are emitted to a streaming system or consumed directly.
- A downstream process applies those events to Iceberg tables using merge or upsert semantics.
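For the downstream step, one common shape is a Spark Structured Streaming job that reads Debezium envelopes from Kafka and merges them into Iceberg. The sketch below is illustrative only: the topic name, key column, and flat two-column schema are assumptions, it expects payload-only JSON (no schema wrapper), and it omits the per-key ordering and deduplication a production job needs.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Simplified Debezium envelope: before/after row images plus the operation type.
row = StructType([
    StructField("order_id", LongType()),
    StructField("status", StringType()),
])
envelope = StructType([
    StructField("before", row),
    StructField("after", row),
    StructField("op", StringType()),  # c = insert, u = update, d = delete
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sqlserver.sales.dbo.orders")
    .load()
    .select(F.from_json(F.col("value").cast("string"), envelope).alias("e"))
    .select(
        F.col("e.op").alias("op"),
        F.coalesce("e.after.order_id", "e.before.order_id").alias("order_id"),
        F.col("e.after.status").alias("status"),
    )
)

def apply_changes(batch_df, batch_id):
    # Apply each micro-batch with an Iceberg MERGE: deletes remove rows,
    # inserts and updates are upserted. (Deduplicate per key in real jobs.)
    batch_df.createOrReplaceTempView("changes")
    spark.sql("""
        MERGE INTO lakehouse.sql_server_sync.orders t
        USING changes s
        ON t.order_id = s.order_id
        WHEN MATCHED AND s.op = 'd' THEN DELETE
        WHEN MATCHED THEN UPDATE SET t.status = s.status
        WHEN NOT MATCHED AND s.op <> 'd' THEN INSERT (order_id, status)
            VALUES (s.order_id, s.status)
    """)

(
    events.writeStream
    .foreachBatch(apply_changes)
    .option("checkpointLocation", "s3://my-iceberg-warehouse/_checkpoints/orders")
    .start()
)
```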
When this works well
- Near-real-time replication is required
- Deletes must be captured accurately
- Teams already operate Kafka or Debezium infrastructure
Tradeoffs
- Requires careful design of merge semantics in Iceberg.
- Operational complexity increases with Kafka, connectors, and consumers.
- Schema evolution and retries must be handled explicitly.
- Ongoing table maintenance (compaction, optimization) is required.
3B) Flink CDC → Iceberg
Apache Flink provides native CDC connectors and can write directly into Iceberg using streaming sinks.
How it works
- Flink CDC reads SQL Server change logs.
- Streaming jobs transform and apply changes to Iceberg tables.
- Checkpoints and state ensure fault tolerance.
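A hedged PyFlink SQL sketch of this pipeline. Connector names and options follow the Flink CDC and Iceberg Flink documentation, but exact option names vary by release and should be verified; hosts, credentials, and table names are placeholders.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: the SQL Server change stream via the Flink CDC connector.
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id BIGINT,
        status STRING,
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'sqlserver-cdc',
        'hostname' = 'sqlserver-host',
        'port' = '1433',
        'username' = 'etl_user',
        'password' = '...',
        'database-name' = 'sales',
        'table-name' = 'dbo.orders'  -- some releases split this into schema-name / table-name
    )
""")

# Target: an Iceberg catalog backed by a REST catalog service.
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-type' = 'rest',
        'uri' = 'https://rest-catalog.example.com',
        'warehouse' = 's3://my-iceberg-warehouse/'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lakehouse.sql_server_sync")
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sql_server_sync.orders (
        order_id BIGINT,
        status STRING,
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH ('format-version' = '2', 'write.upsert.enabled' = 'true')
""")

# Continuously apply inserts, updates, and deletes from SQL Server to Iceberg.
t_env.execute_sql(
    "INSERT INTO lakehouse.sql_server_sync.orders SELECT * FROM orders_cdc"
)
```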
When this works well
- Teams already use Flink for streaming workloads
- Low-latency ingestion with fine-grained control is required
- Complex transformations or enrichments are needed
Tradeoffs
- Flink clusters and state backends must be operated.
- CDC correctness depends on careful job configuration.
- Small-file management and compaction must be planned.
- Operational cost and complexity are high.
3C) Managed CDC platforms → Iceberg (Estuary)
Managed CDC platforms provide an alternative to running custom Spark, Kafka, or Flink pipelines.
How it works
- The platform captures SQL Server changes using CDC.
- Inserts, updates, and deletes are interpreted and applied consistently.
- Changes are written directly into Iceberg tables on object storage.
When this works well
- Correct CDC semantics are required, including deletes.
- Low-latency delivery into Iceberg is needed.
- Teams want to avoid operating streaming infrastructure.
Estuary is an example of a managed CDC platform that streams SQL Server changes directly into Apache Iceberg tables, handling backfill, schema evolution, retries, and exactly-once delivery without requiring teams to run Spark or Flink jobs.
Step-by-Step: SQL Server to Iceberg with Estuary
Prerequisites
Before you begin:
- A free Estuary account
- A SQL Server database with:
- CDC enabled on target tables (see the enablement sketch after this list)
- A user with VIEW DATABASE STATE and SELECT permissions
- An S3 bucket + REST Catalog for Iceberg
- An AWS EMR Serverless application with the Spark runtime
- AWS credentials (access key & secret)
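If CDC is not yet enabled, here is a hedged sketch of the enablement steps, run through pyodbc for convenience. The connection string, schema, table, and the flow_capture user name are placeholders; your DBA may prefer to run the equivalent T-SQL directly.

```python
import pyodbc

# Placeholder connection string; adjust driver, host, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=sqlserver-host,1433;DATABASE=sales;"
    "UID=admin_user;PWD=...;TrustServerCertificate=yes",
    autocommit=True,
)
cur = conn.cursor()

# Enable CDC at the database level (requires sysadmin).
cur.execute("EXEC sys.sp_cdc_enable_db")

# Enable CDC for each table you plan to capture.
cur.execute("""
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'orders',
        @role_name     = NULL
""")

# Grant the capture user (illustratively named flow_capture) the required permissions.
cur.execute("GRANT VIEW DATABASE STATE TO flow_capture")
cur.execute("GRANT SELECT ON SCHEMA::dbo TO flow_capture")

conn.close()
```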
Step 1: Set Up SQL Server as the Source
- Go to Sources → + New Capture in the Estuary dashboard
- Select the SQL Server connector
- Provide the connection details:
- Address: <host>:1433
- Database: Your target DB
- Username/password: With CDC + SELECT permissions
- Choose your tables and specify primary keys if needed
- Click Next → Save and Publish
👉 Estuary will start capturing inserts, updates, and deletes in real time using CDC.
Step 2: Materialize to Apache Iceberg
- After capture, click Materialize Collections
- Search for and select the Apache Iceberg connector
- The Apache Iceberg connector can merge CDC updates into your tables; the Amazon S3 Iceberg connector instead appends delta updates, which are not reduced into a single current row per key
- Fill in your destination config:
- URL: base URL for the REST catalog
- Warehouse
- Namespace (e.g. sql_server_sync)
- Catalog authentication: OAuth 2.0 credentials or AWS SigV4 authentication
- Compute details: EMR application ID, S3 bucket, and access credentials
- Map your collections to Iceberg table names in the Source Collections section
- Click Next → Save and Publish
Estuary batches CDC updates, converts them to Parquet, and commits them to your Iceberg tables in near real time.
Advanced Options
Estuary Flow also supports:
- Delta Updates: Skip table queries, write faster (great for large-scale inserts)
- Backfill + CDC: Load historical rows, then stream new ones continuously
- Transformations: Filter, rename, or enrich data in-flight using SQL or TypeScript
- Scheduling: Control sync intervals (as low as 1 minute)
Comparison of approaches and when to choose each
Each SQL Server to Iceberg approach makes different tradeoffs across freshness, correctness, and operational effort. Comparing them side by side helps clarify which pattern fits a given workload.
High-level comparison
| Approach | Inserts / Updates | Deletes | Latency | Operational effort |
|---|---|---|---|---|
| Batch migration (Spark JDBC) | Yes | No | Hours to days | Medium |
| Incremental polling (watermark) | Partial | No | Hours | Medium |
| Debezium CDC → Iceberg | Yes | Yes | Seconds to minutes | High |
| Flink CDC → Iceberg | Yes | Yes | Seconds | Very high |
| Managed CDC (Estuary) | Yes | Yes | Seconds | Low |
This table highlights a key distinction: batch and polling approaches prioritize simplicity but sacrifice correctness and freshness, while CDC-based approaches deliver accurate change capture at the cost of additional infrastructure.
Choosing the right approach
The following guidelines are commonly used in practice.
- Initial migration or historical backfill: Batch ingestion using Spark JDBC is often sufficient for copying existing tables into Iceberg for the first time.
- Periodic refresh without strict correctness requirements: Incremental polling can work if tables have reliable update timestamps and deletes can be tolerated or ignored.
- Near-real-time replication with full correctness: Log-based CDC using Debezium or Flink is appropriate when inserts, updates, and deletes must be reflected accurately in Iceberg.
- Near-real-time replication without operating streaming systems: Managed CDC platforms such as Estuary are well suited when teams want correct CDC into Iceberg without running Spark, Kafka, or Flink pipelines themselves.
In many architectures, teams combine approaches: a batch backfill for historical data followed by a CDC pipeline for continuous updates. The main difference lies in whether CDC infrastructure is built and operated in-house or delegated to a managed system.
Use Cases: SQL Server to Iceberg
Scalable Analytics
Run complex joins, aggregations, or time-travel queries on years of data — without hitting SQL Server.
ML Feature Stores
Sync operational data to Iceberg to train and serve real-time ML models.
BI Dashboards
Query streaming tables in Spark or Trino without stressing your primary database.
Compliance & Auditing
Store every change in Iceberg for secure, queryable historical records.
Final Thoughts: Real-Time SQL Server to Iceberg, Simplified
Moving data from SQL Server to Apache Iceberg enables scalable analytics, historical analysis, and lakehouse architectures without compromising transactional performance. There are multiple ways to achieve this, ranging from simple batch migrations to fully managed CDC pipelines.
Batch and polling approaches are easy to implement but sacrifice freshness and correctness. CDC-based approaches provide accurate, near-real-time replication but introduce operational complexity. Managed CDC platforms such as Estuary offer a middle ground by delivering correct change capture into Iceberg while reducing the burden of operating streaming infrastructure.
Choosing the right approach depends on your requirements for latency, correctness, and operational ownership.
Ready to modernize your SQL Server data strategy? Start streaming to Iceberg with Estuary →
FAQs
Do I need Spark, Kafka, or Flink to stream SQL Server data into Iceberg?
No. Those engines are common building blocks for self-managed pipelines, but a managed CDC platform such as Estuary can capture SQL Server changes and write them to Iceberg tables without you operating any streaming infrastructure.
Can I backfill historical data and then stream new changes?
Yes. The typical pattern is a one-time backfill of existing rows followed by continuous CDC. Estuary handles both in the same pipeline, while self-managed architectures usually pair a batch load with a separate streaming job.
How real-time can SQL Server to Iceberg pipelines be?
Log-based CDC pipelines generally deliver changes within seconds to minutes. Because Iceberg commits are transactional operations on object storage, most pipelines group changes into small commits; Estuary, for example, supports sync intervals as low as one minute.

About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.