
Key Takeaways
- Moving Amazon RDS for PostgreSQL data into Databricks lets you join operational data with analytics, machine learning workflows, and downstream business logic. Databricks provides scalable storage, Delta Lake reliability, and a unified environment for SQL, notebooks, and ML pipelines.
- There are three reliable methods to connect Amazon RDS for PostgreSQL to Databricks:
  - Using Estuary, which captures PostgreSQL changes with CDC and delivers them into Databricks tables with minimal setup and predictable sync schedules.
  - Using an intermediate warehouse or storage layer, such as S3, Snowflake, or BigQuery, and then loading that data into Databricks through Auto Loader or COPY INTO.
  - Building a custom CDC pipeline, using PostgreSQL logical replication, WAL decoding, and a custom process that writes change events to cloud storage for Databricks to ingest.
- Each method has different tradeoffs in freshness, complexity, cost, and maintenance. This guide walks you through the strengths and limitations of each option, and how to choose the approach that best fits your team's data architecture and operational needs.
Introduction
If you want reliable analytics, real time insights, or ML-ready data, you need your operational PostgreSQL data from Amazon RDS available inside Databricks. RDS for PostgreSQL is often the system of record for orders, customers, transactions, and application events, while Databricks is where teams build their unified lakehouse, run transformations, and power dashboards or machine learning. Connecting the two is a common challenge, especially when you need accurate change data capture, scalable ingestion, and predictable compute costs.
The good news is that there are several dependable ways to move Amazon RDS for PostgreSQL data into Databricks. Even better, you can choose the method that matches your latency needs, your team’s engineering capacity, and how much complexity you want to manage. In this guide, we break down the three most practical methods—a managed CDC pipeline with Estuary, a warehouse or storage based workflow, and a fully custom ingestion pipeline. You will learn how each approach works, the tradeoffs involved, and when to choose one over the others. The goal is to help you move from raw Postgres data to analytics ready Delta tables in the most efficient and dependable way.
Why move Amazon RDS for PostgreSQL data to Databricks
Amazon RDS for PostgreSQL is usually where your most important operational data lives: orders, customers, payments, events from your applications, and internal systems. On its own, that data is great for powering your product and performing point in time checks, but it is not the easiest place to run complex analytics, historical trend analysis, or machine learning workloads. And you definitely don’t want those complex analytics competing for resources with, and slowing down, your application database.
Databricks, on the other hand, is designed for large scale analytics and data science. It gives you Delta tables for reliable storage, notebooks for experimentation, and powerful compute for everything from simple BI queries to advanced ML models. When you bring RDS PostgreSQL data into Databricks, you unlock a few key capabilities:
- Unified analytics across systems: Join transactional data from RDS with data from your warehouse, events, logs, and third party sources inside a single lakehouse.
- More accurate reporting and dashboards: Build reliable reports on revenue, churn, product performance, and operations without overloading your production database with heavy queries.
- Feature engineering and machine learning: Turn raw RDS tables into feature sets for fraud detection, recommendations, personalization, forecasting, and other ML use cases in Databricks.
- Historical and audit friendly storage: Keep long term history, versioned data, and CDC style change logs in Delta without bloating the production database or affecting application performance.
The challenge is to move this RDS PostgreSQL data into Databricks in a way that respects production stability, handles schema changes, and keeps costs predictable. That is where the three methods in this article differ: they each solve the same problem, but with different tradeoffs in complexity, latency, and operational overhead.
Prerequisites
Before choosing a method to move Amazon RDS for PostgreSQL data into Databricks, make sure the following basics are in place.
For Amazon RDS PostgreSQL
- Logical replication enabled (wal_level = logical)
- A user role with the REPLICATION attribute
- A publication for the tables you want to capture
- A replication slot (created automatically by some tools)
- Network access from your ingestion tool, either through public accessibility or an SSH tunnel
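If you set these up by hand, the database side usually looks something like the sketch below. It is a minimal example rather than any connector's exact requirements: the role name flow_capture, the table list, and the publication name are placeholders, and on Amazon RDS replication rights come from the rds_replication role while wal_level is controlled by the rds.logical_replication parameter in your parameter group.
```sql
-- Minimal example setup; adjust role, password, and table names to your environment.
CREATE USER flow_capture WITH PASSWORD 'secret';
GRANT rds_replication TO flow_capture;          -- RDS-specific replication grant

-- Let the capture role read the tables you plan to sync.
GRANT SELECT ON public.orders, public.customers TO flow_capture;

-- Publication limited to the tables you want to capture.
CREATE PUBLICATION flow_publication FOR TABLE public.orders, public.customers;

-- Optional: many tools create their replication slot automatically;
-- otherwise create one yourself with the pgoutput plugin.
SELECT pg_create_logical_replication_slot('flow_slot', 'pgoutput');
```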
For Estuary (Method 1)
- An Estuary account to configure the capture and materialization
  Sign up at dashboard.estuary.dev/register
For Databricks
- A Databricks workspace with:
  - Unity Catalog enabled
  - A SQL Warehouse
  - A catalog and schema for the destination tables
- A personal access token (PAT) or service principal token that grants access to these resources
With these prerequisites ready, you can choose the method that best fits your pipeline needs.
Methods to Move Amazon RDS for PostgreSQL Data into Databricks
There are three reliable and commonly used ways to sync operational PostgreSQL data from Amazon RDS into Databricks. The right option depends on the level of freshness you need, how much engineering effort you want to invest, and whether you prefer fully managed pipelines or custom control. This section gives you a clear overview of all three approaches before we walk through one of them in detail.
Method 1: Using Estuary (Right-Time Data Movement with Managed CDC)
This is the most direct method. Estuary captures change data from Amazon RDS for PostgreSQL using logical replication and delivers it into Databricks Delta tables on a schedule you control. Because Estuary handles CDC state, schema evolution, backfills, and exactly once delivery, you avoid a large amount of operational overhead.
How it works:
- Estuary reads changes from RDS using logical replication.
- It stores the data in strongly typed collections.
- A Databricks materialization writes those collections into Delta tables.
- You choose the sync frequency: sub second, near real time, or batch.
This method is ideal if you want dependable CDC without writing code, and if you need predictable Databricks usage.
Pros
- A unified path from CDC to Databricks Delta tables with minimal configuration.
- Right time data delivery that you can control, from sub second to classic batch.
- Predictable Databricks cost thanks to explicit sync scheduling.
- Automatic handling of schema evolution, backfills, replication slots, and WAL decoding.
- Supports SSH tunneling, read-only replicas, multiple tables, and large schema discovery.
Cons
- Some advanced Postgres replication features (like custom publications) require correct configuration on the database side.
We provide the full step by step configuration for this method in the next section.
Method 2: Using an Intermediate Layer (S3, Glue, Snowflake, or Redshift)
Some teams already push Amazon RDS data into an intermediate system. In that case, Databricks becomes the second hop.
Common versions of this pattern:
- RDS → S3 → Databricks via Auto Loader
- RDS → Snowflake → Databricks via table ingestion
- RDS → Redshift → Databricks
This approach can work if:
- You already centralize your data in another system.
- You do not need right-time performance.
- You are comfortable maintaining multiple pipelines.
However, if you don’t already load data into an intermediate layer, setting up a robust pipeline from RDS to the intermediate system adds architectural complexity and increases the number of systems that must stay in sync.
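As a sketch of the simplest variant, if exports from RDS already land in S3 as Parquet files, the Databricks hop can be a scheduled COPY INTO statement. The bucket path and table name below are placeholders, not part of any specific setup.
```sql
-- Incrementally load Parquet files from S3 into a Delta table.
-- 's3://my-bucket/rds-exports/orders/' and main.default.orders are example names.
COPY INTO main.default.orders
FROM 's3://my-bucket/rds-exports/orders/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
```
COPY INTO only ingests files it has not already loaded, which is what makes it convenient for this kind of repeated batch load.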
Pros
- Leverages existing storage or warehouse systems already in your data stack.
- Works well if you already centralize RDS data elsewhere before pushing to Databricks.
- Easy integration with Databricks Auto Loader or COPY INTO.
Cons
- More moving parts: storage, ETL tools, scheduled jobs, and Databricks ingestion.
- Latency depends on multiple tools and lower latency may result in higher costs.
- Schema changes must be managed across systems.
Method 3: Custom CDC or Full Load Pipeline (DIY Approach)
The third option is to build and operate your own ingestion pipeline.
Teams often choose this path when they have very specialized requirements or want absolute control over every part of the workflow.
Teams typically build this from PostgreSQL logical replication or scheduled full extracts, WAL decoding, and a custom process that writes change events to cloud storage for Databricks to ingest.
This method can deliver flexibility, but it comes with responsibilities:
- Managing logical replication slots
- Handling WAL retention
- Dealing with schema evolution
- Orchestrating retries, backfills, and monitoring
It offers maximum control but the highest engineering cost.
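To give a sense of what sits underneath the DIY path, here is a minimal sketch of creating and reading a logical replication slot directly in Postgres, using the built-in test_decoding plugin purely for illustration. A production pipeline would typically use pgoutput or wal2json, track LSNs, and write decoded changes to cloud storage before acknowledging them.
```sql
-- Create a logical replication slot (test_decoding is illustrative only).
SELECT pg_create_logical_replication_slot('diy_slot', 'test_decoding');

-- Peek at pending changes without consuming them.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('diy_slot', NULL, NULL);

-- Consume changes (this advances the slot); persist them durably first,
-- otherwise a crash at this point means lost events.
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('diy_slot', NULL, NULL);

-- Drop the slot when the pipeline is retired, or WAL will accumulate.
SELECT pg_drop_replication_slot('diy_slot');
```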
Pros
- Full control over pipeline logic, extraction queries, WAL decoding, batching, and scheduling.
- Can embed domain-specific transformations directly into code.
- Works for teams with very specialized requirements or bespoke architectures.
Cons
- Highest engineering overhead: building, scaling, monitoring, and maintaining everything.
- Complexity of managing Postgres WAL retention, logical replication slots, and retries.
- Risk of data drift or duplication without exactly once handling.
- Difficult to maintain long term as requirements grow.
Step by Step: Move Amazon RDS for PostgreSQL Data to Databricks with Estuary
This walkthrough shows you exactly how to configure a right time pipeline using Estuary. You will create a capture that reads change data from Amazon RDS for PostgreSQL, then create a materialization that writes that data into Databricks Delta tables on your chosen sync schedule.
Step 1: Create the Amazon RDS for PostgreSQL Capture
Start the capture
- In the Estuary dashboard, click Sources in the left navigation.
- Click New Capture.
- Search for Amazon RDS for PostgreSQL.
- Select the connector tile and click Capture.
You are now on the capture configuration screen.
Fill capture details
- Name – Enter something clear, such as rds_postgres_prod.
- Data Plane – Choose the plane closest to your RDS region, such as aws-us-east-1-c1.
Configure the RDS endpoint
Scroll to Endpoint Config:
- Server Address – Paste your RDS endpoint, e.g., mydb.abc123xyz.us-east-1.rds.amazonaws.com:5432.
- User – Enter the replication user you created (e.g. flow_capture).
- Database – Enter the logical database, usually postgres.
Authentication
Under Authentication:
- Auth Type – Choose UserPassword.
- Password – Enter the database user password.
Advanced options (optional, but powerful)
The connector also exposes several advanced options:
- Publication Name – Defaults to flow_publication.
- Slot Name – Defaults to flow_slot.
- Watermarks Table – Defaults to public.flow_watermarks.
- Skip Backfills – Add tables here if you want to avoid initial full loads.
- SSL Mode – Set if your RDS instance requires a specific mode.
- Discovery Schema Selection – Limit discovery to specific schemas if needed.
- Read Only Capture – Enable if capturing from a read only RDS replica.
- Network Tunnel – Use this if your database is not publicly accessible; enter the SSH endpoint and private key as shown in the UI.
Test and publish
- Click Next (top right).
- Estuary will test connectivity, replication, privileges, WAL decoding, and publication access.
- Once tests pass, click Publish.
Your RDS PostgreSQL data is now flowing into Estuary collections in real time.
Step 2: Configure the Databricks Materialization
Start the materialization
- Go to Destinations.
- Click New Materialization.
- Search for Databricks.
- Select the connector tile and click Materialization.
Materialization details
- Name – Something like postgres_to_databricks.
- Data Plane – Use the same plane as your capture.
Enter Databricks endpoint configuration
Under Endpoint Config:
- Address – Your Databricks SQL Warehouse host, e.g. dbc-xxxx.cloud.databricks.com.
- HTTP Path – Copied from the SQL Warehouse connection info.
- Catalog Name – e.g. main.
- Schema Name – e.g. default.
(Optional) Enable Hard Delete if you want delete events from RDS to delete corresponding rows in Databricks.
Authentication
- Auth Type – Select PAT.
- Personal Access Token – Paste your Databricks PAT or service principal token.
Define your sync schedule
These settings determine how often Estuary writes to Databricks:
- Sync Frequency – Choose from 0s (real-time), 30s, 5m, 30m, 1h, etc.
- Timezone – Optional timezone associated with fast sync settings.
- Fast Sync Start / Stop – Optional windows for higher frequency.
- Fast Sync Enabled Days – Optional days to use the fast sync schedule.
These controls help you balance freshness and Databricks cost.
Default table behavior
- Delta Updates default – Turn on if you want inserts-only behavior by default.
- Default Naming Convention – Commonly Mirror Schemas.
- Default Field Depth – Keep default unless you need flattened nested structures.
Select collections
- Under Link Capture, click Modify and select your RDS capture. This will automatically add all data collections associated with the RDS capture.
- Alternatively, under Collections, click Add and select individual tables you want in Databricks.
- For each table:
- Confirm Table Name.
- Adjust Schema if needed.
- Toggle Delta Updates for high-volume tables (with caveats—see the section on TOASTed values below).
Schema evolution options
Under Advanced Options:
Choose how to handle incompatible schema changes:
- Abort
- Backfill
- Disable Binding
- Disable Task
Choose based on your governance needs.
Test and publish
- Click Next.
- Estuary validates Databricks permissions, table creation, and binding readiness.
- Click Save & Publish.
Your pipeline is now live.
Step 3: Data Begins Streaming into Databricks
Once published:
- Estuary continuously reads new WAL records from Amazon RDS for PostgreSQL.
- Changes appear in Estuary collections.
- On your schedule, the Databricks materialization merges or delta-appends rows into Delta tables.
You now have a right time pipeline delivering operational PostgreSQL data straight into Databricks, ready for analytics, reporting, or machine learning.
Try a right time PostgreSQL to Databricks pipeline.
Set up a small Amazon RDS PostgreSQL capture and materialize it into Databricks to see how continuous CDC syncs work in practice. Try it now
Modeling RDS PostgreSQL Data in Databricks
Once your Amazon RDS for PostgreSQL data is flowing into Databricks, the next step is shaping it into analytics friendly tables. Databricks works best when data is organized in a Lakehouse medallion pattern, which separates raw CDC records from cleaned and curated models. This helps maintain high query performance and keeps transformations predictable as your datasets grow.
Start with Bronze tables
Your Bronze layer should contain the raw CDC events exactly as they arrive from RDS. These tables preserve inserts, updates, and deletes from the logical replication stream. They act as an audit friendly record of all changes and make it easy to recover if the upstream schema shifts.
Build Silver tables for cleaned models
From Bronze, create Silver tables that:
- Deduplicate rows using primary keys
- Reconstruct the latest version of each record
- Normalize TOASTed values if your tables contain large fields
- Apply type casting and field cleanup
- Filter out soft deletes when needed
This layer produces clean, ready to query tables that mirror your PostgreSQL schema.
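As an illustration, a Silver job that collapses CDC events into the latest row per key might look like the following. The table and column names (bronze.orders_cdc, silver.orders, _op, _event_ts) are hypothetical; adapt them to however your Bronze layer stores change metadata.
```sql
-- Keep only the most recent change per primary key, then upsert into Silver.
MERGE INTO silver.orders AS t
USING (
  SELECT order_id, customer_id, amount, updated_at, _op
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _event_ts DESC) AS rn
    FROM bronze.orders_cdc
  ) ranked
  WHERE rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s._op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  t.customer_id = s.customer_id,
  t.amount      = s.amount,
  t.updated_at  = s.updated_at
WHEN NOT MATCHED AND s._op <> 'delete' THEN
  INSERT (order_id, customer_id, amount, updated_at)
  VALUES (s.order_id, s.customer_id, s.amount, s.updated_at);
```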
Create Gold tables for analytics and ML
Gold tables support:
- BI models such as revenue, customers, orders, or product data
- Operational reporting for growth, churn, supply chain, or finance
- Aggregated facts (daily, weekly, monthly)
- ML features for fraud detection, recommendations, forecasting, or retention modeling
These curated tables are where Databricks shines, thanks to Delta optimizations and scalable compute.
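A Gold table is often just an aggregation over Silver. For instance, a daily revenue fact could be built like this, reusing the hypothetical silver.orders table from the Silver example above:
```sql
-- Daily revenue fact built from the cleaned Silver orders table.
CREATE OR REPLACE TABLE gold.daily_revenue AS
SELECT
  DATE(updated_at) AS order_date,
  COUNT(*)         AS order_count,
  SUM(amount)      AS total_revenue
FROM silver.orders
GROUP BY DATE(updated_at);
```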
Performance tips
To keep queries fast and storage efficient:
- Partition large tables by date or a natural shard key
- Use Delta features like Z-ordering or clustering
- Keep schema evolution simple by avoiding unnecessary nested structures
- Regularly vacuum or optimize tables depending on your retention policies
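For example, periodic maintenance on a large Silver table might look like this; the table name and columns follow the hypothetical examples above, and you should confirm your retention requirements before lowering VACUUM thresholds.
```sql
-- Compact small files and co-locate data for common filter columns.
OPTIMIZE silver.orders ZORDER BY (customer_id, updated_at);

-- Remove files no longer referenced by the Delta log (168 hours matches the 7-day default).
VACUUM silver.orders RETAIN 168 HOURS;
```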
With clean modeling practices in place, your RDS PostgreSQL data becomes a dependable foundation for analytics, reporting, and machine learning.
WAL Retention, Replication Slot Health, and Performance Tips
When you capture data from Amazon RDS for PostgreSQL using logical replication, the health of your WAL (write-ahead log) and replication slots directly impacts pipeline reliability. Understanding these concepts helps prevent common failures such as slots filling disk space, pipelines stalling, or RDS maintenance events breaking CDC.
Be aware of WAL retention behavior
PostgreSQL cannot delete old WAL segments if any replication slot still depends on them. If the slot’s restart position does not advance—usually because the consumer is paused, disconnected, or failing—you will see WAL files accumulate.
If allowed to grow, this can fill the database storage and cause downtime.
Monitor replication slot positions
Replication slots track two important log sequence numbers (LSNs):
- confirmed_flush_lsn – The point up to which the connector has fully processed WAL.
- restart_lsn – The point from which PostgreSQL must retain WAL so the slot can restart.
If restart_lsn falls too far behind, WAL volume grows and may hit RDS storage limits.
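A quick way to watch this is to compare each slot's restart_lsn with the current WAL position on the source database:
```sql
-- Retained WAL per replication slot.
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```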
Set max_slot_wal_keep_size
For production RDS environments, configure a safety limit on WAL retention:
- A setting such as 50GB works for many workloads.
- Too small a value risks slot invalidation.
- Too large a value may consume more storage, but it is safer.
When the limit is exceeded, PostgreSQL invalidates the slot, which forces a full re-capture or backfill.
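On RDS this value is changed through the DB parameter group, but you can verify the effective setting and each slot's health from a SQL session (the wal_status and safe_wal_size columns require PostgreSQL 13 or later):
```sql
-- Effective WAL retention cap; -1 means unlimited.
SHOW max_slot_wal_keep_size;

-- wal_status is 'reserved', 'extended', 'unreserved', or 'lost';
-- safe_wal_size is how much more WAL can be written before the slot is at risk.
SELECT slot_name, wal_status, safe_wal_size
FROM pg_replication_slots;
```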
Avoid long-running transactions
A single long transaction in your database can prevent restart_lsn from advancing, even when the connector is processing events quickly. This can cause WAL buildup.
If you see slot lag rising during ETL or batch workloads, this is often the reason.
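You can spot the offending sessions with a query against pg_stat_activity, for example:
```sql
-- Open transactions ordered by age; long-lived ones hold back restart_lsn.
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_age,
       left(query, 80)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```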
Ensure replication slot cleanup when disabling pipelines
If you delete or disable a CDC pipeline, always delete the associated replication slot.
You can list slots with:
```sql
SELECT * FROM pg_replication_slots;
```
And drop one with:
```sql
SELECT pg_drop_replication_slot('flow_slot');
```
Leaving unused slots behind is a common source of WAL bloat.
Use read-only mode carefully
If you are capturing from a read-only replica, ensure at least one captured table receives periodic writes—or create a dedicated heartbeat table.
If no table in the publication changes, the slot cannot advance even if other tables change.
Consider replication to a standby
Capturing from a read replica can offload work from the primary database, but it requires:
- PostgreSQL 16+ on RDS
- hot_standby_feedback = on
- Regular heartbeat updates on captured tables
RDS parameter tuning
Alongside logical replication parameters, ensure:
- rds.logical_replication = 1
- Proper VPC security configuration
- Correct parameter group attached and rebooted
TOAST Values and Large Record Handling
PostgreSQL uses a storage mechanism called TOAST (The Oversized Attribute Storage Technique) to handle column values too large to fit in a standard 8 KB page. These large values are stored separately from the main table row. When logical replication streams change events, PostgreSQL may omit TOASTed values if they did not change during the update. This behavior improves performance but can lead to missing fields in downstream systems if not handled correctly.
Why TOAST matters for CDC pipelines
During CDC replication:
- If a row updates but the TOASTed field stays the same, PostgreSQL does not resend that field.
- This means the logical replication stream contains only the changed columns, not the full record.
- If your downstream pipeline expects full records, missing TOASTed values can lead to partial updates or schema inconsistencies.
How Estuary handles TOASTed values
Estuary’s PostgreSQL connector automatically resolves TOAST behavior by:
- Using merge reductions to fill in previously known TOASTed values
- Ensuring downstream collections always have complete records
- Avoiding data loss when large fields are unchanged but other columns are updated
This prevents issues when materializing tables in Databricks or building derivations inside Estuary.
When TOAST can still appear as a problem
Challenges may occur if:
- You materialize data into a destination using delta updates instead of merges
- You perform derivations that rely on TOASTed fields
In those cases, you may see null or missing values if not modeled carefully.
Recommended best practices
- Ensure the collection schema uses merge reductions (Estuary sets this up automatically).
- Avoid delta updates when modeling pipelines that depend on large or nested fields.
- For critical tables, set REPLICA IDENTITY FULL to force full rows to be logged:
ALTER TABLE your_table REPLICA IDENTITY FULL;
- This guarantees complete data in the WAL but may increase storage overhead.
- Test your large-field tables to confirm they appear correctly in Databricks after materialization.
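A simple spot check is to measure how often a large column arrives empty in the materialized Databricks table; a spike in nulls on recently updated rows usually points to unhandled TOAST behavior. The table and column names below are placeholders.
```sql
-- Fraction of rows where a large text/JSON column is missing (run in Databricks).
SELECT
  COUNT(*)                                       AS total_rows,
  COUNT_IF(payload IS NULL)                      AS null_payload_rows,
  ROUND(COUNT_IF(payload IS NULL) / COUNT(*), 4) AS null_ratio
FROM main.default.orders;
```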
With the right configurations, TOASTed fields flow reliably from RDS PostgreSQL into Databricks, even for large JSON blobs, documents, descriptions, or unstructured data.
Read-only Captures and Replica-Based Ingestion
Some teams prefer to capture change data from a read replica instead of the primary Amazon RDS for PostgreSQL instance. This approach reduces load on the production database and isolates CDC activity, but it requires specific conditions for logical replication to work correctly.
When read-only capture is useful
A read-only capture makes sense when:
- Your production database is sensitive to additional connections
- You want to offload CDC workload to a replica
- You operate in a high-traffic environment where minimizing primary overhead is important
In these cases, capturing from a replica can be the safer operational choice.
Requirements for capturing from a read replica
PostgreSQL 16 or later is required for logical decoding on a standby. Earlier versions do not allow it, and attempting to decode on a replica results in the error:
ERROR: logical decoding cannot be used while in recovery
If your replica is on PostgreSQL 16+, ensure that:
- hot_standby_feedback = on is enabled
- The correct parameter group is attached and applied
- The replica is restarted after enabling the parameter
This setting prevents the primary database from vacuuming rows needed by the replica’s logical decoding process.
Read-only capture mode
Estuary supports a dedicated Read-Only Capture mode for situations where the connector cannot write temporary data (such as watermarks) back to the source database. When enabled:
- The capture operates without writing tables or metadata
- Backfill accuracy requires that at least one captured table receives regular updates
- If no table changes for a long period, the logical replication slot cannot advance
To avoid slot stalling, teams often include a small “heartbeat” table that updates periodically.
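A minimal heartbeat setup might look like the following; the table name and cadence are illustrative, and the periodic UPDATE can be driven by pg_cron or any scheduler you already run. Because the replica is read-only, the table lives on the primary, its writes replicate down, and it must be part of the publication for its changes to advance the slot.
```sql
-- One-row heartbeat table included in the publication so the slot keeps advancing.
CREATE TABLE public.flow_heartbeat (
  id        SMALLINT PRIMARY KEY DEFAULT 1,
  last_beat TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO public.flow_heartbeat (id) VALUES (1);
ALTER PUBLICATION flow_publication ADD TABLE public.flow_heartbeat;

-- Run periodically (for example, every minute) from a scheduler:
UPDATE public.flow_heartbeat SET last_beat = now() WHERE id = 1;
```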
Operational considerations
While read-only capture is powerful, keep these points in mind:
- A replica must have low replication lag to ensure timely CDC delivery
- Long-running transactions on the primary can still delay slot progress
- WAL retention must still be monitored on the replica
When to use this approach
Pick a read-only capture when:
- Your production RDS instance cannot absorb additional CDC load
- You run PostgreSQL 16+
- You can guarantee regular updates to at least one captured table
- You want isolation between OLTP operations and CDC
Read-only and replica-based ingestion can significantly improve stability in production environments, as long as the correct settings are in place.
Troubleshooting and Common Pitfalls
CDC pipelines from Amazon RDS for PostgreSQL to Databricks are powerful, but Postgres’s logical replication has strict requirements. Most failures come from a few common issues. Here are the problems you are most likely to encounter and how to resolve them quickly.
Logical replication not enabled
If you see connection or publication errors during capture setup, your RDS instance may not have:
- wal_level = logical
- rds.logical_replication = 1 (RDS parameter)
- The correct parameter group attached
- A reboot performed after changes
Fix by verifying the parameter group settings and applying a reboot.
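You can confirm the current values from a SQL session on the instance; rds.logical_replication is an RDS-specific parameter and will not exist on other PostgreSQL deployments.
```sql
-- Should return 'logical' once the parameter group change is applied
-- and the instance has been rebooted.
SHOW wal_level;

-- Related capacity settings worth checking at the same time.
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders', 'rds.logical_replication');
```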
Missing permissions for the replication user
The user must have:
- REPLICATION attribute
- SELECT on all captured tables
- The ability to read from the publication
- Access to the watermarks table if not in read-only mode
Insufficient permissions often show up as publication or slot errors.
Replication slot stuck or lagging
A replication slot can stall when:
- The connector is paused or unhealthy
- A long-running transaction holds back restart_lsn
- A table in the publication receives no new changes (read-only issue)
Monitor slot lag and ensure long transactions are minimized. If a slot is stuck, consider manually dropping it and letting the connector recreate it.
WAL retention growing and filling disk
This happens when:
- Restart LSN does not advance
- The slot is inactive
- max_slot_wal_keep_size is not set
- Or set too low and slots become invalidated
For production, configure a reasonable limit (many teams use ~50GB) and ensure slot movement is monitored.
Publication does not include all required tables
If a table is missing from your capture:
- Confirm it is added to the publication
- Or allow Estuary to manage publication creation
- Avoid using FOR ALL TABLES unless absolutely necessary
Changes to partitioned tables may also require publish_via_partition_root = true.
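To check what the publication currently covers and add a missing table, you can run statements like these on the source database (the publication and table names match the defaults used earlier; adjust as needed):
```sql
-- Tables currently included in the publication.
SELECT * FROM pg_publication_tables WHERE pubname = 'flow_publication';

-- Add a missing table.
ALTER PUBLICATION flow_publication ADD TABLE public.orders;

-- For partitioned tables, publish changes under the root table's name.
ALTER PUBLICATION flow_publication SET (publish_via_partition_root = true);
```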
Network access blocked
Common issues:
- RDS instance is private and connector cannot reach it
- SSH tunnel not configured correctly
- VPC security group lacks inbound rules for Estuary or your pipeline tool
Check connectivity through security groups, bastion host configuration, and firewall rules.
SSL mode mismatch
If RDS enforces SSL, set the correct SSL mode (verify-full, require, etc.).
SSL misconfiguration frequently appears as handshake or TLS errors.
TOASTed values missing or incomplete
If large fields appear missing:
- Confirm your connector or pipeline handles TOAST correctly
- Ensure table schemas use merge reductions in Estuary
- Consider REPLICA IDENTITY FULL for specific tables if needed
Get help designing your pipeline.
If you want guidance on CDC setup, replication slots, or sync scheduling for Databricks, the Estuary team can help. Contact us
Conclusion and Next Steps
Moving Amazon RDS for PostgreSQL data into Databricks is one of the highest-impact steps you can take to strengthen your analytics, reporting, and machine learning workflows. You now have a clear understanding of the three most practical ways to do it: a managed CDC pipeline with Estuary, a multi hop workflow using an intermediate system, or a fully custom ingestion pipeline. Each option comes with different tradeoffs around latency, control, and maintenance, and the best choice depends on how your team balances freshness, complexity, and long term reliability.
If you want to see how a dependable RDS PostgreSQL to Databricks sync performs, the simplest next step is to try a small capture and materialize a few tables into a Databricks schema. It only takes a few minutes to validate your setup and compare it to whatever process you use today. If the pipeline gives you fresher insights with less operational overhead, you will know you are moving in the right direction.
FAQs
Can I sync multiple PostgreSQL databases from RDS into one Databricks environment?
Yes. Each capture connects to one logical database, so create a capture per database and point their materializations at the same Databricks catalog, using separate schemas or naming conventions to keep tables organized.
How often should RDS PostgreSQL data be refreshed inside Databricks?
It depends on how fresh your analytics need to be. Estuary lets you choose the sync frequency per materialization, from sub second to hourly or longer batch intervals, so you can balance freshness against Databricks compute cost.
Does Estuary handle CDC from Amazon RDS PostgreSQL automatically?
Yes. Once logical replication is enabled on the RDS instance, Estuary manages the replication slot, WAL decoding, backfills, and schema evolution for you.
Do I need to replicate every table from my RDS database?
No. You can select individual tables when configuring the capture and materialization, and you can skip backfills for tables where you only need new changes.

About the author
Emily is a software engineer and technical content creator with an interest in developer education. She has experience across Developer Relations roles from her FinTech background and is always learning something new.