
Key Takeaways
- Moving Amazon RDS for PostgreSQL data into Databricks lets you join operational data with analytics, machine learning workflows, and downstream business logic. Databricks provides scalable storage, Delta Lake reliability, and a unified environment for SQL, notebooks, and ML pipelines.
- There are three reliable methods to connect Amazon RDS for PostgreSQL to Databricks:
  - Using Estuary, which captures PostgreSQL changes with CDC and delivers them into Databricks tables with minimal setup and predictable sync schedules.
  - Using an intermediate warehouse or storage layer, such as S3, Snowflake, or BigQuery, and then loading that data into Databricks through Auto Loader or COPY INTO.
  - Building a custom CDC pipeline, using PostgreSQL logical replication, WAL decoding, and a custom process that writes change events to cloud storage for Databricks to ingest.
- Each method has different tradeoffs in freshness, complexity, cost, and maintenance. This guide walks you through the strengths and limitations of each option, and how to choose the approach that best fits your team's data architecture and operational needs.
Introduction
If you want reliable analytics, real time insights, or ML-ready data, you need your operational PostgreSQL data from Amazon RDS available inside Databricks. RDS for PostgreSQL is often the system of record for orders, customers, transactions, and application events, while Databricks is where teams build their unified lakehouse, run transformations, and power dashboards or machine learning. Connecting the two is a common challenge, especially when you need accurate change data capture, scalable ingestion, and predictable compute costs.
The good news is that there are several dependable ways to move Amazon RDS for PostgreSQL data into Databricks. Even better, you can choose the method that matches your latency needs, your team’s engineering capacity, and how much complexity you want to manage. In this guide, we break down the three most practical methods—a managed CDC pipeline with Estuary, a warehouse or storage based workflow, and a fully custom ingestion pipeline. You will learn how each approach works, the tradeoffs involved, and when to choose one over the others. The goal is to help you move from raw Postgres data to analytics ready Delta tables in the most efficient and dependable way.
Why move Amazon RDS for PostgreSQL data to Databricks
Amazon RDS for PostgreSQL is usually where your most important operational data lives: orders, customers, payments, events from your applications, and internal systems. On its own, that data is great for powering your product and performing point in time checks, but it is not the easiest place to run complex analytics, historical trend analysis, or machine learning workloads. And you definitely don’t want those complex analytics competing for resources with, and slowing down, your application database.
Databricks, on the other hand, is designed for large scale analytics and data science. It gives you Delta tables for reliable storage, notebooks for experimentation, and powerful compute for everything from simple BI queries to advanced ML models. When you bring RDS PostgreSQL data into Databricks, you unlock a few key capabilities:
- Unified analytics across systems: Join transactional data from RDS with data from your warehouse, events, logs, and third party sources inside a single lakehouse.
- More accurate reporting and dashboards: Build reliable reports on revenue, churn, product performance, and operations without overloading your production database with heavy queries.
- Feature engineering and machine learning: Turn raw RDS tables into feature sets for fraud detection, recommendations, personalization, forecasting, and other ML use cases in Databricks.
- Historical and audit friendly storage: Keep long term history, versioned data, and CDC style change logs in Delta without bloating the production database or affecting application performance.
The challenge is to move this RDS PostgreSQL data into Databricks in a way that respects production stability, handles schema changes, and keeps costs predictable. That is where the three methods in this article differ: they each solve the same problem, but with different tradeoffs in complexity, latency, and operational overhead.
Prerequisites
Before choosing a method to move Amazon RDS for PostgreSQL data into Databricks, make sure the following basics are in place.
For Amazon RDS PostgreSQL
- Logical replication enabled (wal_level = logical)
- A user role with the REPLICATION attribute
- A publication for the tables you want to capture
- A replication slot (created automatically by some tools)
- Network access from your ingestion tool, either through public accessibility or an SSH tunnel
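If you set these up by hand, the database side usually looks something like the sketch below. It is a minimal example rather than any connector's exact requirements: the role name flow_capture, the table list, and the publication name are placeholders, and on Amazon RDS replication rights come from the rds_replication role while wal_level is controlled by the rds.logical_replication parameter in your parameter group.
```sql
-- Minimal example setup; adjust role, password, and table names to your environment.
CREATE USER flow_capture WITH PASSWORD 'secret';
GRANT rds_replication TO flow_capture;          -- RDS-specific replication grant

-- Let the capture role read the tables you plan to sync.
GRANT SELECT ON public.orders, public.customers TO flow_capture;

-- Publication limited to the tables you want to capture.
CREATE PUBLICATION flow_publication FOR TABLE public.orders, public.customers;

-- Optional: many tools create their replication slot automatically;
-- otherwise create one yourself with the pgoutput plugin.
SELECT pg_create_logical_replication_slot('flow_slot', 'pgoutput');
```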
For Estuary (Method 1)
- An Estuary account to configure the capture and materialization
  Sign up at dashboard.estuary.dev/register
For Databricks
- A Databricks workspace with:
  - Unity Catalog enabled
  - A SQL Warehouse
  - A catalog and schema for the destination tables
- A personal access token (PAT) or service principal token that grants access to these resources
With these prerequisites ready, you can choose the method that best fits your pipeline needs.
Methods to Move Amazon RDS for PostgreSQL Data into Databricks
There are three reliable and commonly used ways to sync operational PostgreSQL data from Amazon RDS into Databricks. The right option depends on the level of freshness you need, how much engineering effort you want to invest, and whether you prefer fully managed pipelines or custom control. This section gives you a clear overview of all three approaches before we walk through one of them in detail.
Method 1: Using Estuary (Right-Time Data Movement with Managed CDC)
This is the most direct method. Estuary captures change data from Amazon RDS for PostgreSQL using logical replication and delivers it into Databricks Delta tables on a schedule you control. Because Estuary handles CDC state, schema evolution, backfills, and exactly once delivery, you avoid a large amount of operational overhead.
How it works:
- Estuary reads changes from RDS using logical replication.
- It stores the data in strongly typed collections.
- A Databricks materialization writes those collections into Delta tables.
- You choose the sync frequency: sub second, near real time, or batch.
This method is ideal if you want dependable CDC without writing code, and if you need predictable Databricks usage.
Pros
- A unified path from CDC to Databricks Delta tables with minimal configuration.
- Right time data delivery that you can control, from sub second to classic batch.
- Predictable Databricks cost thanks to explicit sync scheduling.
- Automatic handling of schema evolution, backfills, replication slots, and WAL decoding.
- Supports SSH tunneling, read-only replicas, multiple tables, and large schema discovery.
Cons
- Some advanced Postgres replication features (like custom publications) require correct configuration on the database side.
We provide the full step by step configuration for this method in the next section.
Method 2: Using an Intermediate Layer (S3, Glue, Snowflake, or Redshift)
Some teams already push Amazon RDS data into an intermediate system. In that case, Databricks becomes the second hop.
Common versions of this pattern:
- RDS → S3 → Databricks via Auto Loader
- RDS → Snowflake → Databricks via table ingestion
- RDS → Redshift → Databricks
This approach can work if:
- You already centralize your data in another system.
- You do not need right-time performance.
- You are comfortable maintaining multiple pipelines.
However, if you don’t already load data into an intermediate layer, setting up a robust pipeline from RDS to the intermediate system adds architectural complexity and increases the number of systems that must stay in sync.
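As a sketch of the simplest variant, if exports from RDS already land in S3 as Parquet files, the Databricks hop can be a scheduled COPY INTO statement. The bucket path and table name below are placeholders, not part of any specific setup.
```sql
-- Incrementally load Parquet files from S3 into a Delta table.
-- 's3://my-bucket/rds-exports/orders/' and main.default.orders are example names.
COPY INTO main.default.orders
FROM 's3://my-bucket/rds-exports/orders/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
```
COPY INTO only ingests files it has not already loaded, which is what makes it convenient for this kind of repeated batch load.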
Pros
- Leverages existing storage or warehouse systems already in your data stack.
- Works well if you already centralize RDS data elsewhere before pushing to Databricks.
- Easy integration with Databricks Auto Loader or COPY INTO.
Cons
- More moving parts: storage, ETL tools, scheduled jobs, and Databricks ingestion.
- Latency depends on multiple tools and lower latency may result in higher costs.
- Schema changes must be managed across systems.
Method 3: Custom CDC or Full Load Pipeline (DIY Approach)
The third option is to build and operate your own ingestion pipeline.
Teams often choose this path when they have very specialized requirements or want absolute control over every part of the workflow.
Teams typically build this from PostgreSQL logical replication or scheduled full extracts, WAL decoding, and a custom process that writes change events to cloud storage for Databricks to ingest.
This method can deliver flexibility, but it comes with responsibilities:
- Managing logical replication slots
- Handling WAL retention
- Dealing with schema evolution
- Orchestrating retries, backfills, and monitoring
It offers maximum control but the highest engineering cost.
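To give a sense of what sits underneath the DIY path, here is a minimal sketch of creating and reading a logical replication slot directly in Postgres, using the built-in test_decoding plugin purely for illustration. A production pipeline would typically use pgoutput or wal2json, track LSNs, and write decoded changes to cloud storage before acknowledging them.
```sql
-- Create a logical replication slot (test_decoding is illustrative only).
SELECT pg_create_logical_replication_slot('diy_slot', 'test_decoding');

-- Peek at pending changes without consuming them.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('diy_slot', NULL, NULL);

-- Consume changes (this advances the slot); persist them durably first,
-- otherwise a crash at this point means lost events.
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('diy_slot', NULL, NULL);

-- Drop the slot when the pipeline is retired, or WAL will accumulate.
SELECT pg_drop_replication_slot('diy_slot');
```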
Pros
- Full control over pipeline logic, extraction queries, WAL decoding, batching, and scheduling.
- Can embed domain-specific transformations directly into code.
- Works for teams with very specialized requirements or bespoke architectures.
Cons
- Highest engineering overhead: building, scaling, monitoring, and maintaining everything.
- Complexity of managing Postgres WAL retention, logical replication slots, and retries.
- Risk of data drift or duplication without exactly once handling.
- Difficult to maintain long term as requirements grow.
Step by Step: Move Amazon RDS for PostgreSQL Data to Databricks with Estuary
This walkthrough shows you exactly how to configure a right time pipeline using Estuary. You will create a capture that reads change data from Amazon RDS for PostgreSQL, then create a materialization that writes that data into Databricks Delta tables on your chosen sync schedule.
Step 1: Create the Amazon RDS for PostgreSQL Capture
Start the capture
- In the Estuary dashboard, click Sources in the left navigation.
- Click New Capture.
- Search for Amazon RDS for PostgreSQL.
- Select the connector tile and click Capture.
You are now on the capture configuration screen.
Fill capture details
- Name – Enter something clear, such as rds_postgres_prod.
- Data Plane – Choose the plane closest to your RDS region, such as aws-us-east-1-c1.
Configure the RDS endpoint
Scroll to Endpoint Config:
- Server Address – Paste your RDS endpoint, e.g., mydb.abc123xyz.us-east-1.rds.amazonaws.com:5432.
- User – Enter the replication user you created (e.g. flow_capture).
- Database – Enter the logical database, usually postgres.
Authentication
Under Authentication:
- Auth Type – Choose UserPassword.
- Password – Enter the database user password.
Advanced options (optional, but powerful)
The connector also exposes several advanced options:
- Publication Name – Defaults to flow_publication.
- Slot Name – Defaults to flow_slot.
- Watermarks Table – Defaults to public.flow_watermarks.
- Skip Backfills – Add tables here if you want to avoid initial full loads.
- SSL Mode – Set if your RDS instance requires a specific mode.
- Discovery Schema Selection – Limit discovery to specific schemas if needed.
- Read Only Capture – Enable if capturing from a read only RDS replica.
- Network Tunnel – Use this if your database is not publicly accessible; enter the SSH endpoint and private key as shown in the UI.
Test and publish
- Click Next (top right).
- Estuary will test connectivity, replication, privileges, WAL decoding, and publication access.
- Once tests pass, click Publish.
Your RDS PostgreSQL data is now flowing into Estuary collections in real time.
Step 2: Configure the Databricks Materialization
Start the materialization
- Go to Destinations.
- Click New Materialization.
- Search for Databricks.
- Select the connector tile and click Materialization.
Materialization details
- Name – Something like postgres_to_databricks.
- Data Plane – Use the same plane as your capture.
Enter Databricks endpoint configuration
Under Endpoint Config:
- Address – Your Databricks SQL Warehouse host, e.g. dbc-xxxx.cloud.databricks.com.
- HTTP Path – Copied from the SQL Warehouse connection info.
- Catalog Name – e.g. main.
- Schema Name – e.g. default.
(Optional) Enable Hard Delete if you want delete events from RDS to delete corresponding rows in Databricks.
Authentication
- Auth Type – Select PAT.
- Personal Access Token – Paste your Databricks PAT or service principal token.
Define your sync schedule
These settings determine how often Estuary writes to Databricks:
- Sync Frequency – Choose from 0s (real-time), 30s, 5m, 30m, 1h, etc.
- Timezone – Optional timezone associated with fast sync settings.
- Fast Sync Start / Stop – Optional windows for higher frequency.
- Fast Sync Enabled Days – Optional days to use the fast sync schedule.
These controls help you balance freshness and Databricks cost.
Default table behavior
- Delta Updates default – Turn on if you want inserts-only behavior by default.
- Default Naming Convention – Commonly Mirror Schemas.
- Default Field Depth – Keep default unless you need flattened nested structures.
Select collections
- Under Link Capture, click Modify and select your RDS capture. This will automatically add all data collections associated with the RDS capture.
- Alternatively, under Collections, click Add and select individual tables you want in Databricks.
- For each table:
- Confirm Table Name.
- Adjust Schema if needed.
- Toggle Delta Updates for high-volume tables (with caveats—see the section on TOASTed values below).
Schema evolution options
Under Advanced Options:
Choose how to handle incompatible schema changes:
- Abort
- Backfill
- Disable Binding
- Disable Task
Choose based on your governance needs.
Test and publish
- Click Next.
- Estuary validates Databricks permissions, table creation, and binding readiness.
- Click Save & Publish.
Your pipeline is now live.
Step 3: Data Begins Streaming into Databricks
Once published:
- Estuary continuously reads new WAL records from Amazon RDS for PostgreSQL.
- Changes appear in Estuary collections.
- On your schedule, the Databricks materialization merges or delta-appends rows into Delta tables.
You now have a right time pipeline delivering operational PostgreSQL data straight into Databricks, ready for analytics, reporting, or machine learning.
Try a right time PostgreSQL to Databricks pipeline.
Set up a small Amazon RDS PostgreSQL capture and materialize it into Databricks to see how continuous CDC syncs work in practice. Try it now
Modeling RDS PostgreSQL Data in Databricks
Once your Amazon RDS for PostgreSQL data is flowing into Databricks, the next step is shaping it into analytics friendly tables. Databricks works best when data is organized in a Lakehouse medallion pattern, which separates raw CDC records from cleaned and curated models. This helps maintain high query performance and keeps transformations predictable as your datasets grow.
Start with Bronze tables
Your Bronze layer should contain the raw CDC events exactly as they arrive from RDS. These tables preserve inserts, updates, and deletes from the logical replication stream. They act as an audit friendly record of all changes and make it easy to recover if the upstream schema shifts.
Build Silver tables for cleaned models
From Bronze, create Silver tables that:
- Deduplicate rows using primary keys
- Reconstruct the latest version of each record
- Normalize TOASTed values if your tables contain large fields
- Apply type casting and field cleanup
- Filter out soft deletes when needed
This layer produces clean, ready to query tables that mirror your PostgreSQL schema.
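As an illustration, a Silver job that collapses CDC events into the latest row per key might look like the following. The table and column names (bronze.orders_cdc, silver.orders, _op, _event_ts) are hypothetical; adapt them to however your Bronze layer stores change metadata.
```sql
-- Keep only the most recent change per primary key, then upsert into Silver.
MERGE INTO silver.orders AS t
USING (
  SELECT order_id, customer_id, amount, updated_at, _op
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _event_ts DESC) AS rn
    FROM bronze.orders_cdc
  ) ranked
  WHERE rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s._op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  t.customer_id = s.customer_id,
  t.amount      = s.amount,
  t.updated_at  = s.updated_at
WHEN NOT MATCHED AND s._op <> 'delete' THEN
  INSERT (order_id, customer_id, amount, updated_at)
  VALUES (s.order_id, s.customer_id, s.amount, s.updated_at);
```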
Create Gold tables for analytics and ML
Gold tables support:
- BI models such as revenue, customers, orders, or product data
- Operational reporting for growth, churn, supply chain, or finance
- Aggregated facts (daily, weekly, monthly)
- ML features for fraud detection, recommendations, forecasting, or retention modeling
These curated tables are where Databricks shines, thanks to Delta optimizations and scalable compute.
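A Gold table is often just an aggregation over Silver. For instance, a daily revenue fact could be built like this, reusing the hypothetical silver.orders table from the Silver example above:
```sql
-- Daily revenue fact built from the cleaned Silver orders table.
CREATE OR REPLACE TABLE gold.daily_revenue AS
SELECT
  DATE(updated_at) AS order_date,
  COUNT(*)         AS order_count,
  SUM(amount)      AS total_revenue
FROM silver.orders
GROUP BY DATE(updated_at);
```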
Performance tips
To keep queries fast and storage efficient:
- Partition large tables by date or a natural shard key
- Use Delta features like Z-ordering or clustering
- Keep schema evolution simple by avoiding unnecessary nested structures
- Regularly vacuum or optimize tables depending on your retention policies
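For example, periodic maintenance on a large Silver table might look like this; the table name and columns follow the hypothetical examples above, and you should confirm your retention requirements before lowering VACUUM thresholds.
```sql
-- Compact small files and co-locate data for common filter columns.
OPTIMIZE silver.orders ZORDER BY (customer_id, updated_at);

-- Remove files no longer referenced by the Delta log (168 hours matches the 7-day default).
VACUUM silver.orders RETAIN 168 HOURS;
```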
With clean modeling practices in place, your RDS PostgreSQL data becomes a dependable foundation for analytics, reporting, and machine learning.
WAL Retention, Replication Slot Health, and Performance Tips
When you capture data from Amazon RDS for PostgreSQL using logical replication, the health of your WAL (write-ahead log) and replication slots directly impacts pipeline reliability. Understanding these concepts helps prevent common failures such as slots filling disk space, pipelines stalling, or RDS maintenance events breaking CDC.
Be aware of WAL retention behavior
PostgreSQL cannot delete old WAL segments if any replication slot still depends on them. If the slot’s restart position does not advance—usually because the consumer is paused, disconnected, or failing—you will see WAL files accumulate.
If allowed to grow, this can fill the database storage and cause downtime.
Monitor replication slot positions
Replication slots track two important log sequence numbers (LSNs):
- confirmed_flush_lsn – The point up to which the connector has fully processed WAL.
- restart_lsn – The point from which PostgreSQL must retain WAL so the slot can restart.
If restart_lsn falls too far behind, WAL volume grows and may hit RDS storage limits.
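A quick way to watch this is to compare each slot's restart_lsn with the current WAL position on the source database:
```sql
-- Retained WAL per replication slot.
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```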
Set max_slot_wal_keep_size
For production RDS environments, configure a safety limit on WAL retention:
- A setting such as 50GB works for many workloads.
- Too small a value risks slot invalidation.
- Too large a value may consume more storage, but it is safer.
When the limit is exceeded, PostgreSQL invalidates the slot, which forces a full re-capture or backfill.
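On RDS this value is changed through the DB parameter group, but you can verify the effective setting and each slot's health from a SQL session (the wal_status and safe_wal_size columns require PostgreSQL 13 or later):
```sql
-- Effective WAL retention cap; -1 means unlimited.
SHOW max_slot_wal_keep_size;

-- wal_status is 'reserved', 'extended', 'unreserved', or 'lost';
-- safe_wal_size is how much more WAL can be written before the slot is at risk.
SELECT slot_name, wal_status, safe_wal_size
FROM pg_replication_slots;
```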
Avoid long-running transactions
A single long transaction in your database can prevent restart_lsn from advancing, even when the connector is processing events quickly. This can cause WAL buildup.
If you see slot lag rising during ETL or batch workloads, this is often the reason.
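You can spot the offending sessions with a query against pg_stat_activity, for example:
```sql
-- Open transactions ordered by age; long-lived ones hold back restart_lsn.
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_age,
       left(query, 80)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```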
Ensure replication slot cleanup when disabling pipelines
If you delete or disable a CDC pipeline, always delete the associated replication slot.
You can list slots with:
```sql
SELECT * FROM pg_replication_slots;
```
And drop one with:
```sql
SELECT pg_drop_replication_slot('flow_slot');
```
Leaving unused slots behind is a common source of WAL bloat.
Use read-only mode carefully
If you are capturing from a read-only replica, ensure at least one captured table receives periodic writes—or create a dedicated heartbeat table.
If no table in the publication changes, the slot cannot advance even if other tables change.
Consider replication to a standby
Capturing from a read replica can offload work from the primary database, but it requires:
- PostgreSQL 16+ on RDS
- hot_standby_feedback = on
- Regular heartbeat updates on captured tables
RDS parameter tuning
Alongside logical replication parameters, ensure:
- rds.logical_replication = 1
- Proper VPC security configuration
- Correct parameter group attached and rebooted
TOAST Values and Large Record Handling
PostgreSQL uses a storage mechanism called TOAST (The Oversized Attribute Storage Technique) to handle column values too large to fit in a standard 8 KB page. These large values are stored separately from the main table row. When logical replication streams change events, PostgreSQL may omit TOASTed values if they did not change during the update. This behavior improves performance but can lead to missing fields in downstream systems if not handled correctly.
Why TOAST matters for CDC pipelines
During CDC replication:
- If a row updates but the TOASTed field stays the same, PostgreSQL does not resend that field.
- This means the logical replication stream contains only the changed columns, not the full record.
- If your downstream pipeline expects full records, missing TOASTed values can lead to partial updates or schema inconsistencies.
How Estuary handles TOASTed values
Estuary’s PostgreSQL connector automatically resolves TOAST behavior by:
- Using merge reductions to fill in previously known TOASTed values
- Ensuring downstream collections always have complete records
- Avoiding data loss when large fields are unchanged but other columns are updated
This prevents issues when materializing tables in Databricks or building derivations inside Estuary.
When TOAST can still appear as a problem
Challenges may occur if:
- You materialize data into a destination using delta updates instead of merges
- You perform derivations that rely on TOASTed fields
In those cases, you may see null or missing values if not modeled carefully.
Recommended best practices
- Ensure the collection schema uses merge reductions (Estuary sets this up automatically).
- Avoid delta updates when modeling pipelines that depend on large or nested fields.
- For critical tables, set REPLICA IDENTITY FULL to force full rows to be logged:
ALTER TABLE your_table REPLICA IDENTITY FULL;
- This guarantees complete data in the WAL but may increase storage overhead.
- Test your large-field tables to confirm they appear correctly in Databricks after materialization.
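A simple spot check is to measure how often a large column arrives empty in the materialized Databricks table; a spike in nulls on recently updated rows usually points to unhandled TOAST behavior. The table and column names below are placeholders.
```sql
-- Fraction of rows where a large text/JSON column is missing (run in Databricks).
SELECT
  COUNT(*)                                       AS total_rows,
  COUNT_IF(payload IS NULL)                      AS null_payload_rows,
  ROUND(COUNT_IF(payload IS NULL) / COUNT(*), 4) AS null_ratio
FROM main.default.orders;
```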
With the right configurations, TOASTed fields flow reliably from RDS PostgreSQL into Databricks, even for large JSON blobs, documents, descriptions, or unstructured data.
Read-only Captures and Replica-Based Ingestion
Some teams prefer to capture change data from a read replica instead of the primary Amazon RDS for PostgreSQL instance. This approach reduces load on the production database and isolates CDC activity, but it requires specific conditions for logical replication to work correctly.
When read-only capture is useful
A read-only capture makes sense when:
- Your production database is sensitive to additional connections
- You want to offload CDC workload to a replica
- You operate in a high-traffic environment where minimizing primary overhead is important
In these cases, capturing from a replica can be the safer operational choice.
Requirements for capturing from a read replica
PostgreSQL 16 or later is required for logical decoding on a standby. Earlier versions do not allow it, and attempting to decode on a replica results in the error:
ERROR: logical decoding cannot be used while in recovery
If your replica is on PostgreSQL 16+, ensure that:
- hot_standby_feedback = on is enabled
- The correct parameter group is attached and applied
- The replica is restarted after enabling the parameter
This setting prevents the primary database from vacuuming rows needed by the replica’s logical decoding process.
Read-only capture mode
Estuary supports a dedicated Read-Only Capture mode for situations where the connector cannot write temporary data (such as watermarks) back to the source database. When enabled:
- The capture operates without writing tables or metadata
- Backfill accuracy requires that at least one captured table receives regular updates
- If no table changes for a long period, the logical replication slot cannot advance
To avoid slot stalling, teams often include a small “heartbeat” table that updates periodically.
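A minimal heartbeat setup might look like the following; the table name and cadence are illustrative, and the periodic UPDATE can be driven by pg_cron or any scheduler you already run. Because the replica is read-only, the table lives on the primary, its writes replicate down, and it must be part of the publication for its changes to advance the slot.
```sql
-- One-row heartbeat table included in the publication so the slot keeps advancing.
CREATE TABLE public.flow_heartbeat (
  id        SMALLINT PRIMARY KEY DEFAULT 1,
  last_beat TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO public.flow_heartbeat (id) VALUES (1);
ALTER PUBLICATION flow_publication ADD TABLE public.flow_heartbeat;

-- Run periodically (for example, every minute) from a scheduler:
UPDATE public.flow_heartbeat SET last_beat = now() WHERE id = 1;
```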
Operational considerations
While read-only capture is powerful, keep these points in mind:
- A replica must have low replication lag to ensure timely CDC delivery
- Long-running transactions on the primary can still delay slot progress
- WAL retention must still be monitored on the replica
When to use this approach
Pick a read-only capture when:
- Your production RDS instance cannot absorb additional CDC load
- You run PostgreSQL 16+
- You can guarantee regular updates to at least one captured table
- You want isolation between OLTP operations and CDC
Read-only and replica-based ingestion can significantly improve stability in production environments, as long as the correct settings are in place.
Troubleshooting and Common Pitfalls
CDC pipelines from Amazon RDS for PostgreSQL to Databricks are powerful, but Postgres’s logical replication has strict requirements. Most failures come from a few common issues. Here are the problems you are most likely to encounter and how to resolve them quickly.
Logical replication not enabled
If you see connection or publication errors during capture setup, your RDS instance may not have:
- wal_level = logical
- rds.logical_replication = 1 (RDS parameter)
- The correct parameter group attached
- A reboot performed after changes
Fix by verifying the parameter group settings and applying a reboot.
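You can confirm the current values from a SQL session on the instance; rds.logical_replication is an RDS-specific parameter and will not exist on other PostgreSQL deployments.
```sql
-- Should return 'logical' once the parameter group change is applied
-- and the instance has been rebooted.
SHOW wal_level;

-- Related capacity settings worth checking at the same time.
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders', 'rds.logical_replication');
```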
Missing permissions for the replication user
The user must have:
- REPLICATION attribute
- SELECT on all captured tables
- The ability to read from the publication
- Access to the watermarks table if not in read-only mode
Insufficient permissions often show up as publication or slot errors.
Replication slot stuck or lagging
A replication slot can stall when:
- The connector is paused or unhealthy
- A long-running transaction holds back restart_lsn
- A table in the publication receives no new changes (read-only issue)
Monitor slot lag and ensure long transactions are minimized. If a slot is stuck, consider manually dropping it and letting the connector recreate it.
WAL retention growing and filling disk
This happens when:
- Restart LSN does not advance
- The slot is inactive
- max_slot_wal_keep_size is not set
- Or set too low and slots become invalidated
For production, configure a reasonable limit (many teams use ~50GB) and ensure slot movement is monitored.
Publication does not include all required tables
If a table is missing from your capture:
- Confirm it is added to the publication
- Or allow Estuary to manage publication creation
- Avoid using FOR ALL TABLES unless absolutely necessary
Changes to partitioned tables may also require publish_via_partition_root = true.
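To check what the publication currently covers and add a missing table, you can run statements like these on the source database (the publication and table names match the defaults used earlier; adjust as needed):
```sql
-- Tables currently included in the publication.
SELECT * FROM pg_publication_tables WHERE pubname = 'flow_publication';

-- Add a missing table.
ALTER PUBLICATION flow_publication ADD TABLE public.orders;

-- For partitioned tables, publish changes under the root table's name.
ALTER PUBLICATION flow_publication SET (publish_via_partition_root = true);
```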
Network access blocked
Common issues:
- RDS instance is private and connector cannot reach it
- SSH tunnel not configured correctly
- VPC security group lacks inbound rules for Estuary or your pipeline tool
Check connectivity through security groups, bastion host configuration, and firewall rules.
SSL mode mismatch
If RDS enforces SSL, set the correct SSL mode (verify-full, require, etc.).
SSL misconfiguration frequently appears as handshake or TLS errors.
TOASTed values missing or incomplete
If large fields appear missing:
- Confirm your connector or pipeline handles TOAST correctly
- Ensure table schemas use merge reductions in Estuary
- Consider REPLICA IDENTITY FULL for specific tables if needed
Get help designing your pipeline.
If you want guidance on CDC setup, replication slots, or sync scheduling for Databricks, the Estuary team can help. Contact us
Conclusion and Next Steps
Moving Amazon RDS for PostgreSQL data into Databricks is one of the highest-impact steps you can take to strengthen your analytics, reporting, and machine learning workflows. You now have a clear understanding of the three most practical ways to do it: a managed CDC pipeline with Estuary, a multi hop workflow using an intermediate system, or a fully custom ingestion pipeline. Each option comes with different tradeoffs around latency, control, and maintenance, and the best choice depends on how your team balances freshness, complexity, and long term reliability.
If you want to see how a dependable RDS PostgreSQL to Databricks sync performs, the simplest next step is to try a small capture and materialize a few tables into a Databricks schema. It only takes a few minutes to validate your setup and compare it to whatever process you use today. If the pipeline gives you fresher insights with less operational overhead, you will know you are moving in the right direction.
FAQs
Can I sync multiple PostgreSQL databases from RDS into one Databricks environment?
Yes. Each capture connects to one logical database, so create a capture per database and point their materializations at the same Databricks catalog, using separate schemas or naming conventions to keep tables organized.
How often should RDS PostgreSQL data be refreshed inside Databricks?
It depends on how fresh your analytics need to be. Estuary lets you choose the sync frequency per materialization, from sub second to hourly or longer batch intervals, so you can balance freshness against Databricks compute cost.
Does Estuary handle CDC from Amazon RDS PostgreSQL automatically?
Yes. Once logical replication is enabled on the RDS instance, Estuary manages the replication slot, WAL decoding, backfills, and schema evolution for you.
Do I need to replicate every table from my RDS database?
No. You can select individual tables when configuring the capture and materialization, and you can skip backfills for tables where you only need new changes.

About the author
Emily is a software engineer and technical content creator with an interest in developer education. She has experience across Developer Relations roles from her FinTech background and is always learning something new.