
Introduction
Moving data from IBM Db2 to Databricks is typically done using one of four approaches: batch ingestion over JDBC, file-based exports loaded with Databricks Auto Loader, third-party ETL tools, or change data capture for continuous replication. The right choice depends primarily on network reachability between Databricks and Db2, the Db2 platform variant in use, and whether you need near real-time updates with correct handling of inserts, updates, and deletes.
Databricks commonly lands incoming data into Delta Lake tables, which support transactional writes and merge semantics for applying changes over time. Db2 remains the system of record for transactional workloads, while Databricks serves as the analytics and lakehouse layer.
This guide explains the practical architectures for moving data from IBM Db2 to Databricks, the trade-offs of each approach, and when a managed, right-time data platform such as Estuary can simplify ingestion without requiring teams to operate custom Spark, streaming, or CDC infrastructure.
TL;DR
Db2 data is most commonly loaded into Databricks using JDBC batch jobs, file exports with Auto Loader, ETL tools, or CDC pipelines.
Batch and file-based approaches are simpler to operate but sacrifice freshness and often miss deletes.
CDC pipelines provide correct inserts, updates, and deletes but introduce significant operational complexity.
Managed right-time platforms such as Estuary provide scheduled or near real-time movement into Databricks while reducing pipeline and infrastructure overhead.
Constraints to identify before choosing an approach
Before selecting a Db2 to Databricks ingestion pattern, it is important to understand a small set of constraints that strongly influence which options are viable. In practice, most architecture decisions are determined by these factors rather than by tooling preferences.
Network reachability between Databricks and Db2
The first question is whether Databricks compute can establish a direct network connection to the Db2 instance.
- If Databricks can reach Db2 over TCP, JDBC-based ingestion and some managed ingestion tools are viable.
- If Db2 is on-premises or in a restricted network, connectivity may require VPN, private connectivity, or may not be permitted at all.
- In locked-down environments, file-based exports from Db2 into object storage are often the only acceptable integration point.
Network reachability often becomes the primary decision driver, especially in regulated environments.
Db2 platform variant
IBM Db2 exists in several variants, and they differ materially in how data can be extracted and how change data capture works.
- Db2 LUW is commonly used on distributed systems and is the most straightforward for JDBC access and scheduled query-based ingestion.
- Db2 for z/OS and Db2 for i often require additional configuration and, in many cases, specialized replication tooling for CDC.
- Db2 Warehouse may support different performance and access patterns depending on deployment.
Identifying the Db2 variant early helps avoid selecting an approach that is not supported or that introduces unexpected operational constraints.
Freshness and correctness requirements
Not all use cases require the same level of data freshness or correctness.
- Some analytical workloads are satisfied with daily or hourly snapshots.
- Others require near real-time visibility into operational data.
- Some use cases can tolerate missing deletes or late-arriving updates, while others cannot.
Batch and incremental approaches prioritize simplicity, while CDC-based approaches prioritize correctness. The acceptable tradeoff should be explicit before choosing a pipeline.
Operational ownership
Finally, consider how much infrastructure your team is willing to operate.
- JDBC batch jobs and file exports are relatively simple but require scheduling, monitoring, and error handling.
- CDC pipelines often require streaming systems, state management, and ongoing maintenance.
- Managed platforms reduce operational ownership at the cost of delegating parts of the pipeline to an external system.
Once these constraints are clear, it becomes easier to evaluate how Databricks will ingest and apply Db2 data and which integration patterns are appropriate.
Decide your Databricks landing target
Once the constraints are clear, the next step is to decide how Db2 data will be represented and managed inside Databricks. This decision affects how ingestion pipelines are designed, how changes are applied, and how downstream analytics are built.
Delta Lake as the default target
In Databricks, incoming data from external systems such as Db2 is most commonly written into Delta Lake tables. Delta tables provide transactional guarantees, schema enforcement, and support for merge operations, which are essential when data is refreshed incrementally or updated over time.
Using Delta as the landing format allows teams to:
- Append data efficiently for batch or micro-batch ingestion
- Apply updates and deletes using merge semantics
- Evolve schemas without full table rewrites
- Query the same tables from SQL, Spark, and BI tools
For Db2 ingestion, Delta tables typically serve as the initial landing zone, even if data is later transformed into additional layers.
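As a minimal illustration, landing an extracted batch into a Delta table can be as simple as the sketch below. The DataFrame `db2_df` and the table name are placeholders, not part of any specific pipeline.

```python
# Minimal sketch: land an extracted Db2 DataFrame in a Delta table.
# `db2_df` and the table name are placeholders.
(
    db2_df.write
    .format("delta")
    .mode("append")                    # append for batch or micro-batch loads
    .option("mergeSchema", "true")     # tolerate additive schema changes
    .saveAsTable("db2_landing.customers")
)
```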
Streaming versus batch ingestion in Databricks
Databricks supports both batch and streaming ingestion patterns.
- Batch ingestion is commonly used for JDBC pulls or scheduled exports. Jobs run on a fixed schedule and write full snapshots or incremental changes into Delta tables.
- Streaming ingestion is used when data arrives continuously, such as from file-based pipelines or CDC streams. Structured Streaming and Databricks Auto Loader are often used to process new data as it becomes available.
The choice between batch and streaming ingestion determines whether the upstream Db2 extraction needs to produce discrete files, continuous event streams, or periodic query results.
Applying changes to Delta tables
When Db2 data is refreshed incrementally or via CDC, changes must be applied correctly to Delta tables.
- Inserts and updates are typically applied using merge operations keyed on the primary key.
- Deletes require explicit handling so that removed rows in Db2 are reflected in Databricks.
- Ordering and deduplication are important when changes arrive out of sequence or are replayed.
These requirements influence whether a simple append-only pipeline is sufficient or whether a more sophisticated change-application strategy is needed.
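For illustration, the sketch below applies a batch of changed rows with a SQL MERGE. It assumes the staging DataFrame `changes_df` carries the Db2 primary key plus a change-type column; the table and column names are placeholders.

```python
# Sketch: apply a batch of changed rows to a Delta table with MERGE.
# Assumes `changes_df` carries the Db2 primary key and a change-type column.
changes_df.createOrReplaceTempView("db2_changes")

spark.sql("""
  MERGE INTO db2_landing.customers AS t
  USING db2_changes AS s
    ON t.customer_id = s.customer_id
  WHEN MATCHED AND s._op = 'D' THEN DELETE          -- delete events
  WHEN MATCHED THEN UPDATE SET *                    -- updates
  WHEN NOT MATCHED AND s._op <> 'D' THEN INSERT *   -- inserts
""")
```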
Governance and table management
Databricks deployments commonly use Unity Catalog to manage table metadata, permissions, and lineage. Ingestion pipelines must be compatible with this governance model, particularly when writing into managed tables or shared schemas.
Deciding on the Databricks landing target clarifies what the upstream pipeline must produce: full snapshots, incremental change sets, or row-level change events. With that context, the available options for moving data from Db2 into Databricks can be evaluated more concretely.
Practical ways to move data from IBM Db2 to Databricks
There is no single best way to move data from Db2 into Databricks. The right approach depends on connectivity, freshness requirements, and how much operational complexity your team is willing to manage. The patterns below represent the approaches most commonly used in production.
Option 1: JDBC batch ingestion from Databricks
This is the most common starting point.
When this works well
- Databricks compute can reach Db2 over the network
- Batch or scheduled updates are acceptable
- You are performing an initial migration or periodic refresh
How it works
Databricks jobs use the Db2 JDBC driver to read tables or queries directly from Db2 and write the results into Delta Lake tables. Jobs typically run on a schedule and may read full tables or filtered subsets.
To improve performance on large tables, JDBC reads are often partitioned across a numeric or time-based column so multiple tasks can pull data in parallel.
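A hedged sketch of such a partitioned read is shown below. The host, database, credentials, table, and partition bounds are placeholders, and the secret scope is assumed to exist in your workspace.

```python
# Sketch of a partitioned JDBC read from Db2. Host, database, credentials,
# table, and partition bounds are placeholders for your environment.
db2_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "APPSCHEMA.ORDERS")
    .option("user", dbutils.secrets.get("db2", "user"))
    .option("password", dbutils.secrets.get("db2", "password"))
    # Parallelize extraction across a numeric or time-based column.
    .option("partitionColumn", "ORDER_ID")
    .option("lowerBound", "1")
    .option("upperBound", "50000000")
    .option("numPartitions", "16")
    .load()
)

db2_df.write.format("delta").mode("overwrite").saveAsTable("db2_landing.orders")
```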
Operational considerations
- JDBC drivers must be available to the cluster
- Large tables can put a sustained load on Db2 during extraction
- Data types such as DECIMAL and TIMESTAMP, as well as LOB columns, may require explicit casting or special handling
- Updates and deletes are not automatically handled unless custom merge logic is added
This approach is simple and well understood, but it is fundamentally batch-oriented and does not provide continuous updates.
Option 2: File-based exports with Databricks Auto Loader
This pattern is common when direct database connectivity is restricted.
When this works well
- Db2 cannot be reached directly from Databricks
- Security or network policies require file-based data exchange
- Batch or near-batch freshness is sufficient
How it works
Data is exported or unloaded from Db2 into files such as CSV or Parquet in object storage. Databricks Auto Loader monitors the storage location and incrementally processes new files as they arrive, writing them into Delta tables.
Auto Loader provides scalable file ingestion and avoids repeatedly scanning entire directories, which makes it suitable for large volumes of data.
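A minimal Auto Loader sketch is shown below, under the assumption that Db2 exports arrive as Parquet files in a storage path; the paths and table names are placeholders.

```python
# Sketch: incrementally ingest Db2 export files with Auto Loader.
# Storage paths and table names are placeholders.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")                 # or "csv" with a schema
    .option("cloudFiles.schemaLocation", "/Volumes/landing/db2/_schemas/orders")
    .load("/Volumes/landing/db2/exports/orders/")
    .writeStream
    .option("checkpointLocation", "/Volumes/landing/db2/_checkpoints/orders")
    .trigger(availableNow=True)                             # process new files, then stop
    .toTable("db2_landing.orders")
)
```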
Operational considerations
- Export jobs must be built and scheduled outside of Databricks
- Deletes and updates require explicit modeling, often via full refreshes or reconciliation jobs
- Schema changes must be coordinated between the export logic and the ingestion pipeline
- File arrival ordering can affect downstream merge logic
This approach provides a clean security boundary but shifts responsibility for correctness to the export process.
Option 3: ETL and partner ingestion tools
Many teams rely on managed ETL or ingestion platforms to move data into Databricks.
When this works well
- You want UI-based configuration, scheduling, and monitoring
- Licensing and managed infrastructure are acceptable
- You prefer not to write and operate custom Spark jobs
How it works
ETL tools extract data from Db2, stage it in cloud storage or directly into Delta tables, and manage scheduling, retries, and schema changes. Some tools support incremental loads or CDC-like behavior depending on configuration.
Operational considerations
- Connector capabilities vary by Db2 variant
- Staging layers may add latency and cost
- Schema evolution and deletes may be partially supported
- Tool-specific limits and pricing models apply
This approach trades engineering effort for vendor-managed operations.
Some teams also evaluate managed right-time platforms such as Estuary alongside traditional ETL tools, especially when they want incremental movement into Databricks with lower operational overhead.
Option 4: Change data capture into Databricks
CDC is used when correctness and freshness are critical.
When this works well
- Near real-time replication is required
- Inserts, updates, and deletes must be reflected accurately
- The team can operate a streaming or replication infrastructure
How it works
CDC systems read changes from Db2 logs or change tables and emit row-level events. These events are then applied to Delta tables using merge semantics so that each change is reflected in Databricks.
Depending on the Db2 platform and tooling, CDC may involve a streaming backbone, state management, and downstream compaction or optimization.
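One common change-application pattern is sketched below for a Databricks notebook or job context. It assumes each event carries a primary key (`customer_id`), a change type (`_op`: I/U/D), and a monotonically increasing sequence (`_seq`); the incoming stream `cdc_events_df`, the table, and all column names are placeholders, and real CDC tooling and event schemas vary.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sketch: apply row-level change events to a Delta table from a streaming source.
# Assumes events with a primary key (customer_id), a change type (_op: I/U/D),
# and a monotonically increasing sequence (_seq). Names are placeholders.
def apply_changes(batch_df, batch_id):
    # Keep only the latest event per key to tolerate replays and out-of-order arrival.
    latest = (
        batch_df.withColumn(
            "_rn",
            F.row_number().over(
                Window.partitionBy("customer_id").orderBy(F.col("_seq").desc())
            ),
        )
        .filter("_rn = 1")
        .drop("_rn")
    )

    target = DeltaTable.forName(spark, "db2_landing.customers")
    (
        target.alias("t")
        .merge(latest.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedDelete(condition="s._op = 'D'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll(condition="s._op <> 'D'")
        .execute()
    )

(
    cdc_events_df.writeStream
    .foreachBatch(apply_changes)
    .option("checkpointLocation", "/Volumes/landing/db2/_checkpoints/customers_cdc")
    .start()
)
```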
Operational considerations
- CDC availability and complexity depend on the Db2 variant
- Ordering, deduplication, and replay handling must be designed carefully
- Streaming infrastructure increases operational overhead
- Long-term maintenance and monitoring are required
CDC provides the highest level of correctness but comes with the highest operational cost.
Option 5: Managed right-time ingestion with Estuary
Some teams want incremental data movement into Databricks without building and operating a full CDC stack.
When this works well
- You want incremental movement into Databricks on a schedule (for example, every few minutes or hourly)
- You want consistent handling of updates without writing and maintaining custom Spark jobs
- You want to reduce pipeline and infrastructure ownership
How it works
Estuary captures data from Db2 using scheduled queries with cursor-based incremental tracking and delivers changes into Databricks tables using its Databricks materialization connector. Data is staged and applied transactionally to Databricks SQL Warehouse tables, supporting standard and delta update patterns.
This model allows teams to choose batch or near-batch intervals while avoiding direct management of Spark jobs, export scripts, or streaming systems.
The Db2 batch capture connector is tested against Db2 LUW and supports full-refresh and cursor-incremental query patterns.
Operational considerations
- Capture frequency is schedule-based rather than log-based
- Deletes are not automatically emitted by cursor-incremental queries and typically require either periodic full-refresh reconciliation or explicit delete modeling (for example, tombstone rows)
- Correctness depends on stable primary keys and reliable incremental tracking columns
- Infrastructure, retries, and state management are handled by the platform
Managed right-time ingestion is often used when teams want predictable operations and lower maintenance while still keeping Databricks tables incrementally up to date.
Readers who want a hands-on walkthrough can refer to the step-by-step guide on loading data into Databricks using Estuary.
Comparison of approaches and when to choose each
The options above solve different problems. Comparing them side by side makes the tradeoffs clearer and helps narrow down the right choice for a given environment.
High-level comparison
| Approach | Deletes handled | Typical latency | Requires direct Db2 connectivity | Operational burden | Best fit |
|---|---|---|---|---|---|
| JDBC batch from Databricks | No | Hours to days | Yes | Medium | Initial migration, periodic refresh |
| File export + Auto Loader | No | Hours | No | Medium | Locked-down networks, file-based exchange |
| ETL / partner tools | Partial | Minutes to hours | Varies | Medium to high | UI-driven ingestion, managed scheduling |
| CDC pipelines | Yes | Seconds to minutes | Yes (or via agents) | High | Continuous replication with full correctness |
| Managed right-time ingestion (Estuary) | Limited (full-refresh or explicit delete modeling) | Minutes | No direct cluster connectivity required | Low | Incremental movement without CDC infrastructure |
This comparison highlights a recurring pattern:
- Batch and file-based approaches minimize infrastructure but sacrifice freshness and delete handling.
- CDC pipelines maximize correctness but require operating and maintaining complex systems.
- Managed right-time platforms aim to deliver incremental data movement with predictable operations and lower maintenance.
Which approach should you choose
The following guidelines reflect how teams typically make this decision in practice.
One-time migration or historical backfill
If the goal is to copy existing Db2 tables into Databricks for analytics, JDBC batch ingestion or file-based exports are usually sufficient. These approaches are simple to implement and work well for large, static datasets.
Recurring refresh with limited correctness requirements
If data needs to be refreshed daily or hourly and deletes are rare or unimportant, incremental JDBC loads or ETL tools can be acceptable. These approaches reduce complexity compared to CDC but still require careful handling of update logic.
Continuous updates with full correctness
If downstream analytics require accurate inserts, updates, and deletes with low latency, CDC pipelines are the most robust option. This is appropriate when teams already operate streaming or replication infrastructure and are prepared to manage its complexity.
Incremental ingestion without operating CDC infrastructure
If the goal is to keep Databricks reasonably up to date without running Kafka, Flink, or custom Spark streaming jobs, managed right-time ingestion platforms such as Estuary are often a practical compromise. They are commonly used when teams want predictable operations, simpler maintenance, and incremental data movement into Databricks tables.
In many real-world architectures, teams combine approaches. A batch backfill may be followed by incremental ingestion, or file-based exports may be used alongside a managed platform. The key is to match the ingestion pattern to the required level of freshness and operational ownership rather than defaulting to the most complex solution.
Practical engineering considerations for Db2 to Databricks pipelines
Regardless of which ingestion approach you choose, several engineering details consistently determine whether a Db2 to Databricks pipeline remains reliable over time. These considerations are often where otherwise sound architectures fail in production.
Choosing a stable row identity
Databricks applies updates and deletes to Delta tables based on a defined row identity. In most cases, this should map directly to the primary key of the Db2 table.
- Ensure every table has a stable, unique key.
- Composite keys should be preserved rather than replaced with derived identifiers.
- If surrogate keys are introduced during ingestion, they must be deterministic and consistent across runs.
Without a reliable row identity, merge-based updates and delete handling become error-prone.
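When a surrogate key is unavoidable, it can be derived deterministically by hashing the composite source key, as in the sketch below. The column names are illustrative.

```python
from pyspark.sql import functions as F

# Sketch: derive a deterministic surrogate key from a composite Db2 key so the
# same source row always maps to the same identifier across runs. Column names
# are illustrative; avoid non-deterministic functions such as
# monotonically_increasing_id for this purpose.
with_key = db2_df.withColumn(
    "row_key",
    F.sha2(
        F.concat_ws(
            "||",
            F.col("ORDER_ID").cast("string"),
            F.col("LINE_NO").cast("string"),
        ),
        256,
    ),
)
```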
Incremental correctness and late-arriving changes
Incremental ingestion depends on correctly identifying which rows have changed since the last run.
- Timestamp-based watermarks assume reliable clock behavior and monotonic updates.
- Late-arriving updates or backfilled records can be missed if the watermark logic is too strict.
- Cursor-based approaches that track a monotonically increasing value often provide stronger guarantees than timestamps.
Whether the pipeline is built in-house or run on a managed platform, incremental state should be retained in a durable store so that restarts or retries do not cause data loss or duplication.
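A simplified sketch of cursor-based extraction with a durable watermark follows; the cursor column, watermark table, and connection details are assumptions, and credentials are omitted for brevity.

```python
from pyspark.sql import functions as F

# Sketch of cursor-based incremental extraction with a durable watermark.
# The cursor column, watermark table, and connection details are assumptions;
# credentials are omitted for brevity.
state = (
    spark.table("db2_landing.ingest_watermarks")
    .filter("table_name = 'ORDERS'")
    .collect()
)
last_cursor = state[0]["cursor_value"] if state else 0

incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("query", f"SELECT * FROM APPSCHEMA.ORDERS WHERE CHANGE_SEQ > {last_cursor}")
    .load()
)

new_cursor = incremental_df.agg(F.max("CHANGE_SEQ")).collect()[0][0]
# ...merge incremental_df into the target table, then persist new_cursor back to
# db2_landing.ingest_watermarks so a restart or retry resumes from the right point.
```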
Handling deletes explicitly
Deletes are frequently overlooked in analytics pipelines.
- Batch and file-based approaches often require full refreshes or reconciliation jobs to detect deletions.
- CDC and incremental merge-based approaches can apply deletes deterministically if delete events or tombstones are available.
- When deletes are business-critical, the ingestion approach must be designed with delete propagation in mind from the start.
Ignoring deletes early can lead to subtle data quality issues that are difficult to correct later.
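One way to reconcile deletes without CDC is to compare a full key snapshot from Db2 against the Delta table and remove anything no longer present in the source. The sketch below assumes a snapshot DataFrame `key_snapshot_df` is available and relies on the `whenNotMatchedBySourceDelete` merge clause, which requires a reasonably recent Delta Lake or Databricks Runtime version.

```python
from delta.tables import DeltaTable

# Sketch: remove rows from the Delta table whose keys no longer exist in Db2.
# Assumes `key_snapshot_df` holds the primary keys currently present in Db2;
# table and column names are illustrative.
target = DeltaTable.forName(spark, "db2_landing.customers")

(
    target.alias("t")
    .merge(key_snapshot_df.alias("s"), "t.customer_id = s.customer_id")
    .whenNotMatchedBySourceDelete()   # delete target rows absent from the snapshot
    .execute()
)
```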
Data type mapping and schema evolution
Db2 data types do not always map cleanly to Delta Lake types.
- High-precision DECIMAL values should be reviewed to avoid silent truncation.
- TIMESTAMP and timezone semantics should be normalized consistently.
- LOB columns may need to be excluded, truncated, or handled separately depending on size and query requirements.
Schema evolution should be handled deliberately. Adding columns is generally safe, but type changes and column removals require coordination to avoid breaking downstream queries.
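As an illustration, type mapping can be pinned at read time rather than left to inference. The columns, precision, and source time zone below are assumptions, and credentials are again omitted for brevity.

```python
from pyspark.sql import functions as F

# Sketch: pin type mapping during the JDBC read instead of relying on inference.
# Column names, precision, and the source time zone are assumptions.
typed_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "APPSCHEMA.INVOICES")
    # Pin high-precision decimals explicitly to avoid silent truncation.
    .option("customSchema", "AMOUNT DECIMAL(31, 6), CREATED_AT TIMESTAMP")
    .load()
    # Normalize timestamps if the source stores local time rather than UTC.
    .withColumn("CREATED_AT", F.to_utc_timestamp("CREATED_AT", "America/New_York"))
)
```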
Performance and scalability considerations
Performance issues typically surface as data volumes grow.
- JDBC-based ingestion benefits from partitioned reads to parallelize extraction.
- File-based ingestion should avoid generating excessive small files.
- Merge-heavy workloads in Delta tables benefit from thoughtful partitioning and periodic optimization.
Planning for growth early reduces the need for disruptive re-architecture later.
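A sketch of routine maintenance for a merge-heavy Delta table is shown below. The table name, clustering column, and properties are illustrative, and the right settings depend on the workload.

```python
# Sketch: routine maintenance for a merge-heavy Delta table. The table,
# clustering column, and properties are illustrative; run OPTIMIZE on a
# schedule rather than after every batch.
spark.sql("OPTIMIZE db2_landing.orders ZORDER BY (ORDER_ID)")

spark.sql("""
  ALTER TABLE db2_landing.orders SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',   -- write fewer, larger files
    'delta.autoOptimize.autoCompact'   = 'true'    -- compact small files automatically
  )
""")
```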
Conclusion
Moving data from IBM Db2 into Databricks is a practical necessity for modern analytics, but the correct approach depends on connectivity, correctness requirements, and operational ownership. Batch and file-based methods are easy to adopt but limited in freshness and delete handling. CDC pipelines offer strong correctness but introduce significant complexity.
Managed right-time platforms, such as Estuary, provide an alternative that balances incremental data movement with predictable operations, allowing teams to keep Databricks tables up to date without building and maintaining complex ingestion infrastructure. Choosing the right approach requires aligning technical constraints with long-term operational goals.
FAQs
What if Databricks cannot reach Db2 over the network?
File-based exchange is the usual fallback: unload data from Db2 into object storage and ingest it with Auto Loader, or use a managed platform that does not require direct connectivity from the Databricks cluster.
Do I need CDC to handle deletes correctly?
Not necessarily. Deletes can also be handled through periodic full refreshes, reconciliation jobs, or explicit delete modeling such as tombstone rows. CDC is the most robust option when deletes must be reflected with low latency.
Can I combine multiple approaches?
Yes. Many teams run a batch backfill first and then switch to incremental or CDC-based ingestion, or use file-based exchange for some tables alongside a managed platform for others.

About the author
Team Estuary is a group of engineers, product experts, and data strategists building the future of real-time and batch data integration. We write to share technical insights, industry trends, and practical guides.