
SQL Server to Apache Iceberg: Migration and CDC Options

Learn the common ways to move data from SQL Server to Apache Iceberg, including batch migration, incremental loads, and CDC pipelines, and when a managed CDC platform like Estuary is the right choice.


Moving data from Microsoft SQL Server to Apache Iceberg is typically done in one of three ways: a batch migration, incremental loads based on watermarks, or continuous change data capture (CDC). Batch and incremental approaches are simpler to operate but sacrifice freshness and often miss deletes, while CDC pipelines provide correct inserts, updates, and deletes at the cost of higher operational complexity.

Apache Iceberg is commonly used as the analytics target because it supports ACID transactions, schema evolution, and time travel on object storage, and can be queried by engines such as Spark, Trino, and Flink. SQL Server, by contrast, remains the system of record for transactional workloads.

This guide explains the common SQL Server to Iceberg architectures, their tradeoffs, and when a managed CDC platform such as Estuary is an appropriate alternative to running Spark, Kafka, or Flink pipelines.

TL;DR

  • SQL Server data is typically moved into Apache Iceberg using batch migration, incremental polling, or log-based CDC.

  • Batch and polling approaches are simpler but sacrifice freshness and often miss deletes.

  • CDC pipelines using tools like Debezium or Flink provide correct inserts, updates, and deletes but require operating streaming infrastructure.

  • Managed CDC platforms such as Estuary stream SQL Server changes directly into Iceberg while reducing operational complexity.

Decide your Iceberg target first

Before choosing how to move data out of SQL Server, it is important to decide where and how your Apache Iceberg tables will live. This choice influences which ingestion approaches are viable and how complex the overall pipeline becomes.

Storage layer

Iceberg tables are stored on a filesystem or object store rather than inside a database engine. Common choices include:

  • Amazon S3
  • Azure Data Lake Storage (ADLS Gen2)
  • Google Cloud Storage (GCS)
  • HDFS or compatible systems such as MinIO

Iceberg itself does not manage storage; it relies on these systems for durability and scalability. Your ingestion pipeline must therefore be able to write files and metadata into the chosen storage layer.

Catalog

Iceberg tables are managed through a catalog, which tracks table metadata, snapshots, and schema versions. Common catalog options include:

  • Hive Metastore
  • REST catalog
  • AWS Glue Catalog
  • Project Nessie

The catalog choice determines how writers and query engines coordinate commits and how table metadata is discovered. For example, Spark, Flink, and Trino all require compatible catalog configurations to read from and write to Iceberg tables safely.
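
As a concrete illustration, the sketch below registers an Iceberg REST catalog with Spark. The catalog name, URI, and warehouse location are placeholders, and other catalog types (Hive Metastore, Glue, Nessie) use analogous properties; treat it as a starting point rather than a production configuration.

```python
# A minimal sketch (not a production setup): registering an Iceberg REST catalog
# named "lake" with Spark. The catalog name, URI, and warehouse path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-catalog-example")
    # Iceberg's Spark runtime JAR must be on the classpath (e.g. via spark.jars.packages).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://rest-catalog.example.com")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Every engine configured against the same catalog sees the same tables and snapshots.
spark.sql("SHOW NAMESPACES IN lake").show()
```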

Query engines

Iceberg is designed to be queried by multiple engines. In practice, teams commonly use:

  • Spark for large-scale analytics and batch processing
  • Trino or Presto for interactive SQL queries
  • Flink for streaming or near-real-time analytics

Ingestion pipelines typically write to Iceberg through Spark, Flink, or a managed service that implements Iceberg’s write and commit protocols correctly.

Why this decision matters

The combination of storage, catalog, and query engine determines:

  • Whether batch-only ingestion is sufficient or CDC is required
  • Whether streaming engines such as Spark Structured Streaming or Flink are already part of your stack
  • How much operational infrastructure your team is willing to manage

Once the Iceberg target is defined, you can evaluate the available approaches for moving SQL Server data into Iceberg and understand the tradeoffs each approach introduces.

Common ways to move data from SQL Server to Apache Iceberg

There is no single “best” way to move data from SQL Server into Iceberg. The right approach depends on whether you are doing a one-time migration or need continuous updates, whether deletes must be captured correctly, and how much operational complexity your team is willing to manage.

The approaches below are the most common patterns used in practice.

1) One-time or scheduled batch migration (Spark JDBC → Iceberg)

This is the most straightforward way to move data from SQL Server into Iceberg and is often used for initial migrations or periodic refreshes.

How it works

  • Spark reads tables from SQL Server using the JDBC connector.
  • Data is written into Iceberg tables on object storage.
  • Tables can be partitioned and sorted to match analytical query patterns.

This approach is widely documented and well supported by the Iceberg ecosystem.
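
The sketch below shows the general shape of such a job in PySpark. It reuses the Spark session and "lake" catalog configured in the earlier sketch, assumes the Microsoft JDBC driver is on the classpath, and uses illustrative connection details, table names, and partition bounds.

```python
# Minimal sketch: one-time copy of a SQL Server table into Iceberg with Spark.
# Reuses the "lake" catalog from the earlier sketch; the Microsoft JDBC driver is
# assumed to be on the classpath, and all names and bounds below are illustrative.
from pyspark.sql import functions as F

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "********")
    # Partitioned reads spread the extraction across executors instead of one connection.
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "16")
    .load()
)

# Create (or replace) the Iceberg table, partitioned by day to match analytical queries.
(
    orders.writeTo("lake.sales.orders")
    .partitionedBy(F.days(F.col("order_date")))  # assumes an order_date column exists
    .createOrReplace()
)
```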

When this works well

  • Initial backfills or historical migrations
  • Low-frequency refreshes (daily or hourly)
  • Environments where SQL Server CDC cannot be enabled

Tradeoffs

  • Changes between runs are not captured automatically.
  • Deletes and updates require full rewrites or custom merge logic.
  • Querying large tables repeatedly can put load on SQL Server.
  • Latency is bounded by the batch schedule.

Batch JDBC ingestion is simple to implement, but it is not suitable when you need continuous updates or low-latency analytics.

2) Incremental loads without log-based CDC (watermark or polling)

Some teams avoid log-based CDC by relying on timestamps or monotonically increasing columns.

How it works

  • A column such as updated_at or an increasing ID is used as a watermark.
  • Each run reads only rows newer than the last processed value.
  • Data is written to Iceberg using partition overwrites or merge semantics.

When this works well

  • Tables with reliable update timestamps
  • Use cases where deletes are rare or can be ignored
  • Lower operational complexity than streaming CDC

Tradeoffs

  • Deletes are not captured unless modeled explicitly.
  • Late updates can be missed if timestamps are unreliable.
  • Merge logic must still be implemented in Spark or another engine.
  • Latency remains batch-oriented.

Incremental polling is often used as a compromise, but it does not provide full correctness for CDC workloads.
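
To make the pattern and its limitations concrete, here is a minimal PySpark sketch of a watermark-based load that reuses the Spark and Iceberg setup from the earlier examples; the watermark handling and column names are illustrative.

```python
# Minimal sketch of a watermark-based incremental load, reusing the Spark and Iceberg
# setup from the earlier examples; column names and watermark handling are illustrative.

# 1. Read only rows changed since the last successful run.
last_watermark = "2024-01-01T00:00:00"  # in practice, persist and reload this value
incremental_query = (
    f"(SELECT * FROM dbo.orders WHERE updated_at > '{last_watermark}') AS incr"
)

changes = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=sales")
    .option("dbtable", incremental_query)
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# 2. Upsert the changed rows into the Iceberg table with MERGE semantics
#    (requires the Iceberg Spark SQL extensions configured earlier).
changes.createOrReplaceTempView("orders_changes")
spark.sql("""
    MERGE INTO lake.sales.orders AS t
    USING orders_changes AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Rows deleted in SQL Server never appear in the incremental query, which is why deletes
# must be modeled explicitly (e.g. soft-delete flags) or captured with log-based CDC.
```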

3) True change data capture (CDC) into Iceberg

When correctness and freshness matter, teams use log-based CDC to capture inserts, updates, and deletes.

3A) Debezium (SQL Server CDC) → Iceberg

Debezium's SQL Server connector reads row-level changes from the CDC change tables that SQL Server populates from its transaction log and emits structured change events.

How it works

  • Debezium captures row-level changes from SQL Server.
  • Change events are emitted to a streaming system or consumed directly.
  • A downstream process applies those events to Iceberg tables using merge or upsert semantics.

When this works well

  • Near-real-time replication is required
  • Deletes must be captured accurately
  • Teams already operate Kafka or Debezium infrastructure

Tradeoffs

  • Requires careful design of merge semantics in Iceberg.
  • Operational complexity increases with Kafka, connectors, and consumers.
  • Schema evolution and retries must be handled explicitly.
  • Ongoing table maintenance (compaction, optimization) is required.
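
For reference, the sketch below registers a Debezium SQL Server source connector through the Kafka Connect REST API. Property names follow the Debezium 2.x documentation, hosts and credentials are illustrative, and the downstream step that applies the resulting change events to Iceberg (an Iceberg sink connector or a consumer issuing merges) is not shown.

```python
# Sketch: registering a Debezium SQL Server source connector via the Kafka Connect
# REST API. Property names follow the Debezium 2.x documentation; hosts, credentials,
# and topic names are illustrative.
import json
import urllib.request

connector = {
    "name": "sqlserver-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "sqlserver.example.com",
        "database.port": "1433",
        "database.user": "debezium",
        "database.password": "********",
        "database.names": "sales",
        "table.include.list": "dbo.orders",
        "topic.prefix": "sales-server",
        # Debezium keeps its schema history in a dedicated Kafka topic.
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.sales",
    },
}

request = urllib.request.Request(
    "http://connect.example.com:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode())
```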

3B) Flink CDC (SQL Server) → Iceberg

Apache Flink provides native CDC connectors and can write directly into Iceberg using streaming sinks.

How it works

  • Flink CDC reads SQL Server change logs.
  • Streaming jobs transform and apply changes to Iceberg tables.
  • Checkpoints and state ensure fault tolerance.

When this works well

  • Teams already use Flink for streaming workloads
  • Low-latency ingestion with fine-grained control is required
  • Complex transformations or enrichments are needed

Tradeoffs

  • Flink clusters and state backends must be operated.
  • CDC correctness depends on careful job configuration.
  • Small-file management and compaction must be planned.
  • Operational cost and complexity are high.
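
The sketch below outlines what such a job can look like with PyFlink SQL: a changelog source backed by the Flink CDC SQL Server connector feeding an Iceberg REST catalog. Connector option names follow the Flink CDC 2.x and Iceberg Flink documentation and should be verified against the versions you deploy; hosts, credentials, and table names are illustrative, and the target table is assumed to already exist as a format-v2 Iceberg table with upsert enabled.

```python
# Sketch of a PyFlink job that reads SQL Server changes with the Flink CDC connector
# and writes them to an Iceberg table. Option names follow the Flink CDC 2.x and
# Iceberg Flink docs and should be verified against the versions you deploy.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Changelog source backed by SQL Server CDC.
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(10, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'sqlserver-cdc',
        'hostname' = 'sqlserver.example.com',
        'port' = '1433',
        'username' = 'flink_cdc',
        'password' = '********',
        'database-name' = 'sales',
        'schema-name' = 'dbo',
        'table-name' = 'orders'
    )
""")

# Iceberg REST catalog on the sink side; the target table is assumed to already exist
# as a format-v2 table with upsert enabled.
t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-type' = 'rest',
        'uri' = 'https://rest-catalog.example.com',
        'warehouse' = 's3://my-bucket/warehouse'
    )
""")

# Continuously apply inserts, updates, and deletes to the Iceberg table.
t_env.execute_sql("INSERT INTO lake.sales.orders SELECT * FROM orders_cdc")
```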

3C) Managed CDC platforms → Iceberg (Estuary)

Managed CDC platforms provide an alternative to running custom Spark, Kafka, or Flink pipelines.

How it works

  • The platform captures SQL Server changes using CDC.
  • Inserts, updates, and deletes are interpreted and applied consistently.
  • Changes are written directly into Iceberg tables on object storage.

When this works well

  • Correct CDC semantics are required, including deletes.
  • Low-latency delivery into Iceberg is needed.
  • Teams want to avoid operating streaming infrastructure.

Estuary is an example of a managed CDC platform that streams SQL Server changes directly into Apache Iceberg tables, handling backfill, schema evolution, retries, and exactly-once delivery without requiring teams to run Spark or Flink jobs.

Step-by-Step: SQL Server to Iceberg with Estuary

Prerequisites

Before you begin, make sure you have:

  • An Estuary Flow account with access to the dashboard
  • A SQL Server instance reachable from Estuary, with CDC enabled on the source database and tables and a user that has CDC and SELECT permissions
  • An Iceberg REST catalog and object storage (for example, an S3 bucket), plus the compute details the Iceberg connector requires, such as an EMR application ID and access credentials
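
If CDC is not yet enabled on the source database, it must be turned on before the capture will work. The following is a minimal sketch of the required commands, issued here through pyodbc; server, database, and table names are illustrative, and the statements require db_owner or sysadmin rights.

```python
# Hypothetical sketch: enabling CDC on the source database and table before configuring
# the capture. Names and credentials are illustrative; the statements require db_owner
# or sysadmin rights, and SQL Server Agent must be running for the capture jobs.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=sqlserver.example.com,1433;DATABASE=sales;"
    "UID=admin_user;PWD=********;TrustServerCertificate=yes",
    autocommit=True,
)
cursor = conn.cursor()

# Enable CDC at the database level, then for each table you plan to capture.
cursor.execute("EXEC sys.sp_cdc_enable_db")
cursor.execute("""
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'orders',
        @role_name     = NULL
""")
```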

Step 1: Set Up SQL Server as the Source

Search for the SQL Server source connector in the Estuary dashboard
  1. Go to Sources → + New Capture in the Estuary dashboard
  2. Select the SQL Server connector
  3. Provide the connection details:
    • Address: <host>:1433
    • Database: Your target DB
    • Username/password: With CDC + SELECT permissions
  4. Choose your tables and specify primary keys if needed
  5. Click Next → Save and Publish

👉 Estuary will start capturing inserts, updates, and deletes in real time using CDC.

Step 2: Materialize to Apache Iceberg

Search for the Apache Iceberg destination connector in the Estuary dashboard
  1. After capture, click Materialize Collections
  2. Search for and select the Apache Iceberg connector
    • The Apache Iceberg connector can merge CDC updates into the target tables, while the Amazon S3 Iceberg connector relies on delta updates, which append change events without reducing them to the current state of each row
  3. Fill in your destination config:
    • URL: base URL for the REST catalog
    • Warehouse
    • Namespace (e.g. sql_server_sync)
    • Catalog authentication: OAuth 2.0 credentials or AWS SigV4 authentication
    • Compute details: EMR application ID, S3 bucket, and access credentials
  4. Map your collections to Iceberg table names in the Source Collections section
  5. Click Next → Save and Publish

Estuary will batch CDC updates, convert to Parquet, and stream to your Iceberg tables — all in real time.
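
Once the materialization is running, you can sanity-check the target from any engine attached to the same catalog. For example, with a Spark session configured against the same REST catalog (as in the earlier catalog sketch), and using illustrative names:

```python
# Quick sanity check from Spark (or any engine on the same REST catalog) that CDC rows
# are arriving; the catalog, namespace, and table names follow the earlier examples.
spark.sql("SELECT COUNT(*) FROM lake.sql_server_sync.orders").show()

# Iceberg's snapshots metadata table shows each commit the materialization has made.
spark.sql("""
    SELECT committed_at, operation
    FROM lake.sql_server_sync.orders.snapshots
    ORDER BY committed_at DESC
""").show(5)
```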

Advanced Options

Estuary Flow also supports:

  • Delta Updates: Skip table queries, write faster (great for large-scale inserts)
  • Backfill + CDC: Load historical rows, then stream new ones continuously
  • Transformations: Filter, rename, or enrich data in-flight using SQL or TypeScript
  • Scheduling: Control sync intervals (as low as 1 minute)

Comparison of approaches and when to choose each

Each SQL Server to Iceberg approach makes different tradeoffs across freshness, correctness, and operational effort. Comparing them side by side helps clarify which pattern fits a given workload.

High-level comparison

Approach                          Inserts / Updates   Deletes   Latency              Operational effort
Batch migration (Spark JDBC)      Yes                 No        Hours to days        Medium
Incremental polling (watermark)   Partial             No        Hours                Medium
Debezium CDC → Iceberg            Yes                 Yes       Seconds to minutes   High
Flink CDC → Iceberg               Yes                 Yes       Seconds              Very high
Managed CDC (Estuary)             Yes                 Yes       Seconds              Low

This table highlights a key distinction: batch and polling approaches prioritize simplicity but sacrifice correctness and freshness, while CDC-based approaches deliver accurate change capture at the cost of additional infrastructure.

Choosing the right approach

The following guidelines are commonly used in practice.

  • Initial migration or historical backfill
    Batch ingestion using Spark JDBC is often sufficient for copying existing tables into Iceberg for the first time.
  • Periodic refresh without strict correctness requirements
    Incremental polling can work if tables have reliable update timestamps and deletes can be tolerated or ignored.
  • Near-real-time replication with full correctness
    Log-based CDC using Debezium or Flink is appropriate when inserts, updates, and deletes must be reflected accurately in Iceberg.
  • Near-real-time replication without operating streaming systems
    Managed CDC platforms such as Estuary are well suited when teams want correct CDC into Iceberg without running Spark, Kafka, or Flink pipelines themselves.

In many architectures, teams combine approaches: a batch backfill for historical data followed by a CDC pipeline for continuous updates. The main difference lies in whether CDC infrastructure is built and operated in-house or delegated to a managed system.

Use Cases: SQL Server to Iceberg

Scalable Analytics

Run complex joins, aggregations, or time-travel queries on years of data — without hitting SQL Server.

ML Feature Stores

Sync operational data to Iceberg to train and serve real-time ML models.

BI Dashboards

Query streaming tables in Spark or Trino without stressing your primary database.

Compliance & Auditing

Store every change in Iceberg for secure, queryable historical records.

Final Thoughts: Real-Time SQL Server to Iceberg, Simplified

Moving data from SQL Server to Apache Iceberg enables scalable analytics, historical analysis, and lakehouse architectures without compromising transactional performance. There are multiple ways to achieve this, ranging from simple batch migrations to fully managed CDC pipelines.

Batch and polling approaches are easy to implement but sacrifice freshness and correctness. CDC-based approaches provide accurate, near-real-time replication but introduce operational complexity. Managed CDC platforms such as Estuary offer a middle ground by delivering correct change capture into Iceberg while reducing the burden of operating streaming infrastructure.

Choosing the right approach depends on your requirements for latency, correctness, and operational ownership.

Ready to modernize your SQL Server data strategy? Start streaming to Iceberg with Estuary →

FAQs

    Does SQL Server natively support Apache Iceberg?

    No. SQL Server does not have native support for Iceberg. Data must be exported or streamed into Iceberg using external pipelines or platforms.

    Do I need Kafka, Spark, or Flink to move SQL Server data into Iceberg?

    Not necessarily. While Spark, Kafka, and Flink are commonly used, managed CDC platforms such as Estuary can stream SQL Server changes directly into Iceberg without requiring teams to operate these systems.

    Can I combine a historical backfill with ongoing CDC?

    Yes. A common pattern is to perform an initial backfill of existing tables followed by continuous CDC to keep Iceberg tables up to date.

    How fresh will the data in Iceberg be?

    Latency depends on the approach. Batch pipelines operate on schedules, while CDC-based pipelines can deliver changes within seconds.

