Moving data from Postgres to Snowflake is a crucial step for organizations looking to leverage Snowflake's scalable, cloud-native analytics platform. Whether you're building a modern data pipeline for advanced analytics, operational insights, or machine learning, choosing the right migration approach is critical.
This guide explores two primary methods for transferring data from PostgreSQL to Snowflake:
- Batch processing: Transferring data in scheduled intervals, ideal for less time-sensitive use cases.
- Real-time streaming: Continuously syncing data as changes occur, perfect for up-to-the-minute analytics.
We’ll compare these methods, discuss the tools available (like Apache Airflow and Estuary Flow), and help you decide the best approach for your use case. Whether you prioritize cost-efficiency, ease of use, or real-time data integration, this guide has you covered.
Choosing Between Real-Time Streaming and Batch Processing for PostgreSQL to Snowflake
One of the first and most important decisions when moving data from PostgreSQL to Snowflake is choosing between real-time and batch processing. The right choice depends on your specific requirements and priorities.
With Postgres, the fastest, most efficient, and most reliable way to extract data is using real-time change data capture (Postgres to Snowflake CDC). However, there are several reasons you might still opt for batch processing.
Batch processing is often acceptable for analytics, even though real-time streaming is more efficient and reliable. Many analytics use cases do not require real-time data, since reports are often viewed daily or hourly for historical purposes. Batch processing can also be more cost-effective: near real-time loading into Snowflake tends to cost more because it keeps a virtual warehouse running almost continuously or relies on Snowpipe Streaming.
Whichever you choose, look ahead to future projects before settling on an approach. The best option is a vendor or open-source technology that lets you go real-time, or at least shorten batch intervals, as your needs evolve, while still managing the data transfer into Snowflake effectively.
Data Pipeline for Postgres to Snowflake Migration: Build vs Buy?
You may choose to build your own data pipeline. For batch processing, there are options like Apache Airflow for orchestrating batch jobs. Airflow is proven to work and scale, but it requires Airflow experts, time, and money. You can also build your own streaming data pipeline using change data capture and messaging software such as Debezium and Kafka. However, like Airflow, this requires investing in both a skilled team and infrastructure.
Most companies don’t have the experts, infrastructure resources, or appetite to build their own data pipelines. For this reason, the majority of Snowflake customers rely on ELT/ETL tools to connect Postgres to Snowflake.
Data integration tools play a crucial role in connecting databases like Postgres to Snowflake, in part because they make it easy to set up connectors to sources and destinations, and because they sync data automatically. Pricing, particularly usage-based models based on rows of data or other metrics, can become costly, so be sure to understand potential expenses before committing.
The most common vendor choice for a batch data pipeline is an ELT vendor, since they all support batch. Many ELT vendors implement CDC from Postgres to Snowflake as micro-batch because batch is all they support. Also, many companies can’t get CDC-level access to the database for security reasons, and batch processing may be the only feasible approach for extracting data from legacy or restricted-source systems. Before you choose a vendor, make sure you will be able to get access to your CDC sources.
If you need real-time data for analytics, or plan to use Postgres change data capture (CDC), you can also start to look at those few vendors that support streaming as well, such as Estuary. Streaming data pipelines are also used for operational analytics - including IT or operational support, troubleshooting, or monitoring use cases - all of which require up-to-date data.
ELT vs. ETL for PostgreSQL to Snowflake: Choosing the Right Data Integration Approach
Most data integration vendors either support ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load). While ETL is actually older, more Snowflake customers use batch ELT than ETL. Most ELT vendors replicate from Postgres to Snowflake in batch intervals (not real-time), and then integrate with dbt to run transforms inside Snowflake.
There are several cases where using ETL and running transforms outside the data warehouse makes more sense. You can achieve lower latency with streaming transforms. ETL is less work if you need to run the same transform for multiple destinations. And some destinations don’t support ELT using SQL or dbt.
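As a concrete illustration of the ELT pattern, here is a minimal sketch of the kind of SQL transform a dbt model or ELT tool would run inside Snowflake once raw Postgres rows have landed. The table and column names are illustrative, not part of any specific tool:

```sql
-- Illustrative ELT transform executed inside Snowflake (e.g., as a dbt model).
-- Assumes raw Postgres rows have already been loaded into raw.orders.
CREATE OR REPLACE TABLE analytics.daily_order_totals AS
SELECT
    order_date,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_amount
FROM raw.orders
GROUP BY order_date;
```

Because the heavy lifting runs on Snowflake compute, the pipeline itself only has to extract and load; that is the core trade-off between ELT and ETL.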
PostgreSQL: A Quick Overview
PostgreSQL is an open-source object-relational database management system (RDBMS). Originally called POSTGRES as the successor to Ingres, it became PostgreSQL to emphasize its focus on SQL. People love open-source Postgres because it’s proven, has an extensive feature set, and is very extensible.
Why do people love Postgres?
- Data support: It supports a broad set of built-in data types, including character, date/time, numeric, boolean, JSON, and more.
- Extensibility: With Postgres, you can change just about anything, from functions, aggregates, and data types to core modules, and there is a rich ecosystem of extensions to choose from.
- Query optimizer: Postgres is better than most at optimizing queries, so you spend less time rewriting them by hand.
- Security: PostgreSQL provides strong security features, including user roles and privileges, SSL certificates, and encryption choices to protect the confidentiality and integrity of data.
- Superior indexing: PostgreSQL supports numerous indexing techniques, such as B-tree, hash, GIN (Generalized Inverted Index), and GiST (Generalized Search Tree), which enable efficient data retrieval and query optimization (see the example after this list).
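For example, a GIN index accelerates containment queries over JSONB, while a B-tree index covers ordinary range scans. A minimal sketch, using illustrative table and column names:

```sql
-- Illustrative only: a JSONB column indexed with GIN for containment queries,
-- plus a B-tree index for time-range scans.
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    payload    jsonb,
    created_at timestamptz NOT NULL
);

CREATE INDEX idx_events_payload    ON events USING GIN (payload);
CREATE INDEX idx_events_created_at ON events (created_at);

-- The GIN index speeds up containment queries like this one:
SELECT * FROM events WHERE payload @> '{"type": "login"}';
```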
A pre-built connector to PostgreSQL makes replication, transformation, and data migration from Postgres to Snowflake easier. The best connectors handle most aspects of the migration, from translating schemas to removing duplicate records to mapping data types for loading into Snowflake.
Snowflake: A Quick Overview
Snowflake is one of the most popular cloud data warehouses. Founded in 2012, it was one of the first cloud data warehouses with decoupled storage and compute and consumption-based pricing. Snowflake has since evolved into a broader cloud data platform that supports other compute workloads for analytics, data science, machine learning, and AI. It is the only major cloud data warehouse that is independent of, and can be hosted on, Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
Snowflake's architecture is structured into three decoupled multi-tenant layers that scale independently—database storage, query processing, and cloud services.
- Database storage is a proprietary columnar storage format for tables, with each table composed of immutable micro-partitions holding roughly 50-500MB of uncompressed data. Each time you write anything to Snowflake, you’re rewriting at least one entire micro-partition. This makes Snowflake better suited to (micro-)batch than real-time writes, and it’s one reason why most writes happen at intervals of a minute or more. But Snowflake is becoming more real-time.
- Data type support: Snowflake supports a rich set of data types, including semi-structured types like JSON and XML (stored in VARIANT columns), which is important both for loading data directly into Snowflake and for supporting different kinds of analytics and machine learning (see the sketch after this list).
- Compute (query processing) consists of virtual warehouses: dedicated, independent compute clusters that range from X-Small (1 credit/hour, 8 vCPUs) to 6X-Large (512 credits/hour, 4,096 vCPUs). Each warehouse loads data from storage into a local “cache” as its consumers need it, which reduces the data scanned from remote storage and makes queries faster and more efficient. There are also Snowpark-optimized warehouses, which offer more memory per node for workloads like machine learning. Compute is kept “warm” in a shared pool for fast startup.
- Cloud services are all the control services you need to support storage and compute, including security, metadata, and infrastructure management.
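To illustrate the semi-structured support mentioned above, here is a minimal sketch (table and field names are illustrative) that stores JSON in a VARIANT column and queries it with Snowflake’s path syntax and FLATTEN:

```sql
-- Illustrative only: load and query semi-structured JSON with a VARIANT column.
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

INSERT INTO raw_events
SELECT PARSE_JSON('{"device": "sensor-1", "temp": 21.5, "tags": ["indoor", "lab"]}');

-- Path syntax reaches into the JSON without a predefined schema;
-- FLATTEN turns the array of tags into rows.
SELECT
    payload:device::string AS device,
    payload:temp::float    AS temp,
    f.value::string        AS tag
FROM raw_events,
     LATERAL FLATTEN(input => payload:tags) f;
```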
Why Move Data from Postgres to Snowflake Data Warehouse?
Snowflake has been an innovator in the market, with several key features that lead companies to adopt it.
- Scalability: Snowflake offers true elastic scale. You can resize a warehouse with a click, and multi-cluster warehouses automatically add and remove clusters of the same size as concurrency demands change.
- Data sharing: Snowflake’s data-sharing capabilities let you share live, up-to-date data securely and efficiently with internal teams and external partners.
- Zero-copy cloning: Snowflake’s zero-copy cloning creates instant copies of databases, schemas, or tables without duplicating storage or impacting performance, making it easier to develop and test new features, run experiments, or troubleshoot issues.
- Ecosystem support: Snowflake’s large number of integrations connects it with popular BI, ETL, and data visualization tools, and its cloud architecture integrates with your preferred cloud provider, be it Azure, AWS, or GCP.
- Virtual private cloud: Snowflake does not let you run it on your own on-premises or cloud infrastructure, but Virtual Private Snowflake (VPS) runs in a separate, dedicated Snowflake environment that accomplishes much the same goal. With VPS, you get your own dedicated compute pool for fast scaling.
- Time travel: Snowflake doesn’t just archive data. Time Travel lets you query, clone, and restore historical data for up to 90 days (with Enterprise Edition or higher). See the examples after this list.
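The following sketch illustrates several of these features with hypothetical warehouse and table names:

```sql
-- Illustrative examples only; object names are hypothetical.

-- Elastic scale: resize a warehouse, or let it auto-scale across clusters
-- (multi-cluster warehouses require Enterprise Edition or higher).
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE analytics_wh SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 4;

-- Zero-copy cloning: an instant, storage-free copy for development and testing.
CREATE TABLE orders_dev CLONE orders;

-- Time travel: query a table as it looked an hour ago, or restore a dropped table.
SELECT * FROM orders AT (OFFSET => -60*60);
UNDROP TABLE orders;
```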
Snowflake has helped redefine data management by showing how a data warehouse can act more like a data lake, letting companies store unlimited amounts of raw data without having to spend on compute up front. This approach enables modern data engineering best practices: extract and load raw data directly into Snowflake as a cloud data platform, then transform it as needed for analytics and for other workloads, such as machine learning and AI.
2 Methods to Load Data from Postgres to Snowflake
You have a few options to move data from Postgres to Snowflake, from completely manual approaches or scripts to using pre-built connectors and data integration products.
- Method 1: Using Apache Airflow for batch orchestration
- Method 2: Using Estuary Flow
Method 1: Using Apache Airflow to Load Data from Postgres to Snowflake
If you decide to use Apache Airflow to replicate data from Postgres to Snowflake, you will need to set up and maintain your own Apache Airflow deployment. There are several articles with instructions on how to connect Snowflake to Postgres once you’re all set up, like this Medium article.
Step 1: Set up Airflow
Assuming you already have a Postgres database set up, you can follow this article to set up Airflow if you’re using Linux. You will also need to install the relevant Airflow packages on your client.
Step 2: Set up your Airflow Postgres and Snowflake connectors
From there, you will have to configure connections to Postgres and Snowflake. While this takes a little time, Airflow’s connection management makes it fairly straightforward.
Step 3: Define your Airflow DAG
Next, you’ll need to write your DAG (Directed Acyclic Graph) as a Python script. Airflow uses DAGs to define orchestrations, so you need to be familiar with Python and the Airflow syntax for DAGs.
Step 4: Run and monitor your Airflow DAG
Once you start the Airflow scheduler and web server, you can log in, find your DAG, and trigger it. You can then monitor its progress from the Airflow UI.
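Whatever the DAG itself looks like, the Snowflake side of a batch pipeline usually reduces to staging an extract and running COPY INTO and MERGE statements. Here is a hedged sketch of that load step, with illustrative stage, file, and table names (it assumes staging_orders and orders already exist with matching columns):

```sql
-- Illustrative load step a batch DAG might execute against Snowflake.
-- Assumes a CSV extract of the Postgres table has been uploaded to a named stage.
CREATE STAGE IF NOT EXISTS pg_extract_stage;

COPY INTO staging_orders
FROM @pg_extract_stage/orders_2024_01_01.csv
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);

-- Upsert the batch into the target table.
MERGE INTO orders AS t
USING staging_orders AS s
    ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (id, amount, updated_at) VALUES (s.id, s.amount, s.updated_at);
```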
Summary: Airflow Benefits and Challenges
Airflow is one of the most mature open source tools you can use for building your own batch data pipelines. It has been proven to work even in some of the largest environments.
But there are still some major challenges that make the majority of companies choose other options:
- Expertise Required: You need a team of Airflow experts to install and manage Airflow, and developers to code your pipelines in Python. The more sources and destinations, the more complex your DAGs can become to develop, manage, monitor, and scale.
- Open Source Limitations: While there are commercially supported options like Astronomer, running open-source Airflow yourself means you own hosting, upgrades, scaling, and support.
- Batch-Only Processing: If you need real-time data movement, you’ll need a separate tool for it, leaving you with two different technologies to deploy and maintain.
Method 2: Steps to Move Data from Postgres to Snowflake Using Estuary Flow
Estuary Flow is the first platform built from the ground up to deliver the most real-time and reliable change data capture (CDC), streaming, and batch data using a single data pipeline for analytics, operations, and AI.
Key features of Estuary Flow include:
- Sub-100ms end-to-end latency for real-time data capture, ensuring up-to-date insights.
- Exactly-once transactional delivery to guarantee data integrity in complex pipelines.
- The only stream-store-replay architecture, which immediately persists in-flight streams as a durable transaction log for later replay, such as backfilling, stream processing, or time travel.
- Flexibility to mix real-time with any-speed batch in the same pipeline.
- DataOps support via CLI and API, along with built-in schema evolution.
- Choice of replication that writes in-place or saves entire change history.
- True elastic scale, like Snowflake, that scales horizontally and can support bursting for use cases like data backfilling.
Ready to move your data from PostgreSQL to Snowflake? Follow this comprehensive guide to set up your pipeline in just minutes. If you have any questions or need assistance, our team is always available to help—just reach out on Slack.
Prerequisites
- An Estuary Flow account
- A Postgres database
- A Snowflake account
Step 1: Create an Estuary Flow account
To begin, create a free account in the Estuary Flow web app:
- Sign up using your GitHub or Google account.
- After logging in, you’ll see a welcome page, indicating you’re ready to build your pipeline from PostgreSQL to Snowflake.
Tip: Steps 2–5 will help you set up PostgreSQL and Snowflake. If you’ve already configured them, jump directly to Step 6 for integration.
Step 2: Configure an Amazon RDS for PostgreSQL Instance
Set up a PostgreSQL instance using Amazon RDS or another cloud database service like Google Cloud SQL or Azure Database. For this guide, we’ll use Amazon RDS:
- Log in to your AWS account and search for "Amazon RDS" in the AWS search bar.
- Navigate to Databases > Create Database and select PostgreSQL as the database type.
- Configure the following settings:
- Username: estuary_username
- Database name: estuary_db
- Public access: Enabled
AWS will take some time to create the database. Once ready, note the endpoint for later use.
- Create a Parameter Group:
- Go to RDS > Parameter Groups and create a new group named estuary_parameter_group.
- Edit this group to enable logical replication by setting rds.logical_replication=1.
- Associate the Parameter Group with your database:
- Navigate to RDS > Databases > Modify, and change the DB parameter group to estuary_parameter_group.
- Reboot the database to apply the changes (you can verify them with the quick check below).
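After the reboot, you can confirm the settings took effect from any SQL client connected to the instance. These are standard checks, not Estuary-specific:

```sql
SHOW rds.logical_replication;  -- should return 'on'
SHOW wal_level;                -- should return 'logical'
```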
Step 3: Set Up a Local Postgres Client
To finalize your PostgreSQL setup, install and configure a local Postgres client:
- Download the client from the official PostgreSQL website.
- Open the Postgres client and register a new server.
- Enter the following details from your Amazon RDS instance:
- Hostname
- Port number
- Username
- Database name
- Password
These are the same parameters you configured when you created your remote Amazon RDS database instance. When all the parameters have been entered, you should be able to successfully connect your local Postgres client to the remote Postgres database instance.
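A quick query confirms the client is connected to the right database as the right user:

```sql
SELECT current_database(), current_user, version();
```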
Step 4: Create a Table and Populate Data
Prepare a sample table and populate it with data:
- Create a schema and table in your database by running the following SQL:
```sql
create schema iot_schema;

create table iot_schema.iot_table(
    id varchar(100) PRIMARY KEY,
    room_id varchar(100),
    noted_date varchar(100),
    temp integer,
    "out/in" varchar(50)
);
```
- Populate the table with sample data:
```sql
\COPY iot_schema.iot_table FROM '/path-to-file/IOT-temp.csv' WITH (FORMAT CSV, HEADER TRUE, DELIMITER ',');
```
- After executing this command, the data will be successfully loaded into your Postgres table. You can confirm by extracting the first ten records.
```sql
select * from iot_schema.iot_table limit 10;
```
This query should return the first ten rows of the IoT data.
You can also get the total number of records in the table:
```sql
select count(*) from iot_schema.iot_table;
```
The table should now contain 97,606 rows (if you used the same sample dataset). You’re ready to stream this data to Snowflake!
Step 5: Configure Snowflake
Set up your Snowflake environment:
- If you are new to Snowflake, follow Snowflake’s guide to get started. On the Snowflake classic console, you can create a Snowflake database by running the command:
```sql
create database estuary_snowflake_db;
```
- You can also create a schema in Snowflake by running the command:
```sql
use "ESTUARY_SNOWFLAKE_DB";
create schema estuary_snowflake_schema;
```
- You also need to create a user for Estuary and grant it the necessary warehouse and database permissions. You can do that by executing the block of code below (a quick verification sketch follows it).
```sql
-- Run this as a role with sufficient privileges (e.g., ACCOUNTADMIN).
set database_name = 'ESTUARY_SNOWFLAKE_DB';
set estuary_schema = 'ESTUARY_SNOWFLAKE_SCHEMA';
set estuary_role = 'ESTUARY_ROLE';
set estuary_user = 'ESTUARY_USER';
set estuary_password = 'password';
create database if not exists identifier($database_name);
use database identifier($database_name);
create schema if not exists identifier($estuary_schema);
create role if not exists identifier($estuary_role);
grant role identifier($estuary_role) to role SYSADMIN;
create user if not exists identifier($estuary_user)
    password = $estuary_password
    default_role = $estuary_role;
grant role identifier($estuary_role) to user identifier($estuary_user);
grant all on schema estuary_snowflake_schema to role identifier($estuary_role);
-- create the warehouse Estuary will use (skip if it already exists)
create warehouse if not exists estuary_wh;
grant USAGE on warehouse estuary_wh to role identifier($estuary_role);
-- grant Estuary access to the database
grant CREATE SCHEMA, MONITOR, USAGE on database estuary_snowflake_db to role identifier($estuary_role);
-- switch to ACCOUNTADMIN for STORAGE INTEGRATION support (only needed for Snowflake on GCP)
use role ACCOUNTADMIN;
grant CREATE INTEGRATION on account to role identifier($estuary_role);
use role identifier($estuary_role);
```
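As an optional sanity check, you can confirm the objects and grants exist before moving on:

```sql
SHOW GRANTS TO ROLE ESTUARY_ROLE;
SHOW WAREHOUSES LIKE 'ESTUARY_WH';
SHOW SCHEMAS IN DATABASE ESTUARY_SNOWFLAKE_DB;
```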
Step 6: Create a Capture in Estuary Flow
Estuary Flow reads data changes from PostgreSQL using Change Data Capture (CDC). To set up:
- Grant permissions in PostgreSQL:
```sql
CREATE USER flow_capture WITH PASSWORD 'password';
GRANT rds_replication TO flow_capture;
GRANT SELECT ON ALL TABLES IN SCHEMA iot_schema TO flow_capture;
GRANT ALL ON ALL TABLES IN SCHEMA iot_schema TO flow_capture;
ALTER DEFAULT PRIVILEGES IN SCHEMA iot_schema GRANT SELECT ON TABLES TO flow_capture;
GRANT USAGE ON SCHEMA iot_schema TO flow_capture;
CREATE TABLE IF NOT EXISTS public.flow_watermarks (slot TEXT PRIMARY KEY, watermark TEXT);
GRANT ALL PRIVILEGES ON TABLE public.flow_watermarks TO flow_capture;
GRANT ALL PRIVILEGES ON TABLE iot_schema.iot_table TO flow_capture;
CREATE PUBLICATION flow_publication FOR ALL TABLES;
```
- Log in to Estuary Flow and create a New Capture
- Select PostgreSQL as the source.
- Enter connection details such as hostname, database name, username, and password.
These configurations were set when you created your Amazon RDS instance in the previous section.
When you click NEXT, you will see the schema and tables you created in the previous section.
- Save and publish the capture to start reading your data.
Estuary Flow will then start reading the WAL while executing an incremental snapshot of the database in parallel. Estuary Flow is the only product that does both in real-time and in parallel. As it reads, it combines the change stream and snapshot into a single stream with exactly-once delivery guarantees and commits it to a durable, append-only log called a collection. A collection allows you to replay and reuse the stream at any time, as a stream or as a batch load.
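If you want to verify the CDC plumbing from the Postgres side, the publication should be visible immediately, and a replication slot should appear once the capture is running:

```sql
SELECT pubname FROM pg_publication;                  -- should list flow_publication
SELECT slot_name, active FROM pg_replication_slots;  -- a slot appears once the capture runs
```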
Step 7: Create a Materialization in Snowflake
Materialization transfers data from Estuary Flow to Snowflake:
- In the Estuary Flow UI, navigate to Materialization and click New Materialization.
- Select Snowflake as the destination and configure connection details (e.g., Host URL, Database name, Warehouse name, Username, and Password).
- Specify the collection representing the IoT data from PostgreSQL.
- When you enable the materialization, data starts streaming from Postgres to Snowflake in real time.
Step 8: Verify Data Transfer
Finally, test the pipeline by inserting new data into PostgreSQL and confirming it appears in Snowflake:
- On your local PostgreSQL client, run the following command:
```sql
insert into iot_schema.iot_table
values ('__export__new_id_28282', 'Room admin', '2023-02-01', 67, 'IN');
```
- Check Snowflake to ensure the record is present:
```sql
SELECT * FROM estuary_snowflake_schema.iot_table WHERE id = '__export__new_id_28282';
```
Congratulations! Your data pipeline is now successfully moving data from PostgreSQL to Snowflake in real-time.
Key Benefits of Using Estuary Flow
- Estuary Flow allows you to build data pipelines in minutes, making it simple to get started without extensive training or setup time.
- Effortlessly manage both real-time streaming and batch data within a single pipeline.
- Recommended for real-time CDC to reduce source database load but flexible enough for batch processing when needed.
- Data is stored as it moves, ensuring recoverability and preventing data loss.
- Enables reuse of data streams for backfilling, stream processing, or time travel at any time.
- One of the most affordable data integration tools on the market, with flexible pricing options and a calculator on the pricing page to estimate costs based on usage.
Estuary Flow’s combination of speed, reliability, and cost-effectiveness makes it an excellent choice for building modern data pipelines.
Conclusion
Hopefully, you’ve learned a few ways to load data from Postgres to Snowflake and figured out which method is best for your needs.
While tools like Apache Airflow are suitable for batch pipelines, Estuary Flow offers a faster, more reliable, and flexible solution that supports both real-time and batch data movement within a single pipeline. With features like sub-100ms latency, exactly-once delivery, and schema evolution, Estuary Flow empowers businesses to streamline their data integration processes.
Ready to simplify your Postgres to Snowflake data pipeline? Try Estuary Flow for free and start moving your data in minutes.
FAQs
What are the top tools for integrating PostgreSQL into Snowflake?
- Estuary Flow: A fully managed data platform designed for real-time and batch data pipelines. Ideal for organizations seeking a simpler, reliable, and cost-effective solution that supports both streaming (with change data capture) and batch processing.
- Fivetran: A popular ELT (Extract, Load, Transform) tool that simplifies data replication from PostgreSQL to Snowflake. While easy to use, Fivetran's pricing can be high for large data volumes or frequent updates due to its usage-based model.
- Stitch: Another ELT tool with pre-built connectors and automation features. Similar to Fivetran, Stitch's row-based pricing can become expensive as your data needs grow.
- Airbyte: An open-source ETL/ELT platform with a large library of connectors. Flexible (self-hosted or cloud-based) but requires technical expertise, and connector quality can vary.
What are my options for real-time data movement using Postgres to Snowflake CDC?
Estuary Flow and Debezium are the two main options. Debezium only handles CDC and, as open-source software, requires you to set it up, maintain it, and support it on your own. With Estuary Flow, you’re set up in minutes and can seamlessly mix CDC with other real-time or batch connections.
What are the risks of using CDC with Postgres?
The main risks are misconfigured Postgres settings (for example, insufficient WAL retention) and increased load on the source when the WAL is read in batch intervals instead of continuously. These can be mitigated by using tools like Estuary Flow, which simplifies CDC setup and ensures reliable, low-latency streaming.
Why do I need to provide write access to a CDC connector?
A CDC connector needs to track the position in the WAL where it last read so that it can resume correctly if the connection fails. The most reliable way to do this in Postgres is to record that offset as a database write, which is why the connector needs write access to a small watermarks table.
About the author
The author has over 15 years in data engineering and is a seasoned expert in driving growth for early-stage data companies, focusing on strategies that attract customers and users. Their writing provides insights to help companies scale efficiently and effectively in an evolving data landscape.