
Many businesses have long favored PostgreSQL for its robust data storage and management operations. Its ability to support different data types and advanced query execution makes it a popular enterprise-operational database.
However, despite its many benefits, Postgres struggles with auto-scaling, distributed queries, and managing large datasets spread across multiple servers. To work around these limits, you can move your data into a highly scalable lakehouse powered by Apache Iceberg, an open-source table format for managing and querying large volumes of data. This makes Iceberg a suitable choice for enterprises that need advanced data-handling capabilities.
Here, you will learn two methods for transferring data from Postgres to Iceberg tables, along with common use cases for this integration. This will help you work with large datasets and extract the analytical insights your business needs.
Key Takeaways
- Moving data from PostgreSQL to Apache Iceberg lets you scale analytics without burdening your transactional database.
- You can do this in two ways:
1. Estuary (Streaming CDC): Streams inserts, updates, and deletes from PostgreSQL 10+ to Iceberg in near real time. Requires logical replication, a publication, and a replication slot. Works with Iceberg REST catalogs such as AWS Glue, Amazon S3 Tables, or Snowflake Open Catalog. Best for continuous sync and fresh analytics.
2. pg_dump + Spark (Batch): Exports Postgres data to CSV or SQL and loads it into Iceberg with Spark. Easier to start but fully manual, slower, and lacks real-time updates—suited for one-time or periodic loads.
- Always plan IAM or OAuth permissions for the catalog and EMR Serverless access.
- Regular Iceberg maintenance like compaction and partition optimization keeps queries fast.
- Choose Estuary for right-time, automated pipelines; choose pg_dump for simple bulk transfers.
What is PostgreSQL?
PostgreSQL, or Postgres, is an open-source object-relational database management system (ORDBMS). It supports various data types, including numeric, boolean, date/time, geometric, and JSON data. Postgres is built to handle heavy transactional workloads, such as serving as the backing database for large-scale web applications.
For faster query execution, Postgres provides the parallel query feature. Using this feature, you can divide a single query into multiple sub-tasks and execute them simultaneously across different CPUs. This ability makes Postgres a high-performing database that can quickly query massive amounts of data. As a result, you can use it in enterprise software, geospatial applications, telecommunications, and financial services.
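For example, you can see parallelism in action with EXPLAIN. This is a minimal sketch assuming a hypothetical orders table large enough for the planner to choose a parallel plan; the table name and setting are illustrative.

```sql
-- Allow up to 4 parallel workers per Gather node (session-level, illustrative value)
SET max_parallel_workers_per_gather = 4;

-- On a sufficiently large table, the plan includes a Gather node with parallel workers
EXPLAIN (ANALYZE)
SELECT status, count(*)
FROM orders
GROUP BY status;
```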
What is Apache Iceberg?
Apache Iceberg is a table format that assists in managing the data files in data lakes. It acts as an abstraction layer between your physical data files and how they are organized into relational tables.
You can use compute engines like Spark, Flink, PrestoDB, or Hive to manage data in Iceberg tables. These tables work just like other database tables, supporting concurrent operations and features such as ACID compliance. Iceberg tables can accommodate massive amounts of data at petabyte scale. As a result, you can use Iceberg-based applications for a range of big data operations.
Here are some key features of Iceberg that make it an impressive solution for data management (a short Spark SQL sketch after this list shows a few of them in action):
- Schema Evolution: Schema evolution is a process of modifying the schema of a database system without changing the existing data within the database. Iceberg supports robust schema evolution and allows you to add, drop, or rename columns while maintaining data integrity.
- Hidden Partitioning: You can query data faster in Iceberg thanks to its hidden partitioning, which splits tables into smaller, more manageable units without exposing partition columns to users. Iceberg also supports partition evolution, letting you change a table's partition keys without rewriting existing data; new partition values are derived automatically from the configured partition transforms.
- Time Travel and Rollback: Iceberg offers a time travel feature that allows you to run reproducible queries on historical snapshots of the table. This is useful to track changes over time. You can also utilize the version rollback feature to revert the tables to a previous state using either snapshot ID or timestamp.
- Faster Querying: While using Iceberg, you can prune metadata files, scan tables quickly, and filter out data files that do not contain relevant data. These capabilities increase the querying speed and help you perform faster data analytics on large datasets.
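To make these features concrete, here is a minimal Spark SQL sketch. It assumes a Spark session that already has an Iceberg catalog named local configured (as in the Spark setup later in this article); the table, columns, timestamp, and snapshot ID are illustrative.

```sql
-- Hidden partitioning: partition by day without exposing a separate partition column
CREATE TABLE local.db.events (
    id        BIGINT,
    payload   STRING,
    event_ts  TIMESTAMP
) USING iceberg
PARTITIONED BY (days(event_ts));

-- Schema evolution: a metadata-only change, no data rewrite required
ALTER TABLE local.db.events ADD COLUMNS (source STRING);

-- Time travel: query the table as it existed at an earlier point in time
SELECT * FROM local.db.events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Rollback: revert the table to a previous snapshot by ID
CALL local.system.rollback_to_snapshot('db.events', 1234567890123456789);
```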
2 Methods for Moving Data From Postgres to Apache Iceberg
- Method 1: Using Estuary to Load Data from Postgres to Iceberg
- Method 2: Exporting Postgres Data to CSV via pg_dump and Using Spark to Load to Iceberg
Watch this quick video to see how Apache Iceberg transforms data management and how Estuary makes Postgres-to-Iceberg integration seamless.
Method 1: Using Estuary to Load Data from Postgres to Iceberg
To streamline the data transfer from Postgres to Iceberg, you can opt for an automation tool like Estuary, a unified data pipeline integration platform. It offers 200+ pre-built connectors to help you collect and consolidate data from various sources into the most common data destinations, such as Snowflake, Databricks, or Apache Iceberg.
With Estuary, you can ingest data from several sources to collections and then materialize to many destinations simultaneously. The platform also allows you to reuse your data across various projects, saving time and resources required to consolidate data.
Steps
You can follow the steps below to move data from Postgres to Iceberg after fulfilling the following prerequisites (a SQL sketch of the database-side setup follows the list):
Prerequisites
- An Estuary account
- PostgreSQL 10 or later with logical replication enabled
  - wal_level=logical, a user with the REPLICATION attribute, a publication, and a replication slot
  - Network access from Estuary to the database, optionally via SSH tunneling
- An Iceberg REST catalog
  - AWS Glue Iceberg REST, Amazon S3 Tables REST, or another REST catalog such as Snowflake Open Catalog
- AWS EMR Serverless for compute
  - Spark application, S3 staging bucket, and an EMR execution role with permissions
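The database-side prerequisites above boil down to a few SQL statements. This is a minimal sketch run as a superuser; the user, publication, slot, and table names are placeholders, so match them to whatever your capture configuration expects, and note that managed services (RDS, Cloud SQL, and similar) set wal_level through their own parameter settings.

```sql
-- Enable logical replication (requires a server restart; managed services use parameter groups)
ALTER SYSTEM SET wal_level = logical;

-- A dedicated capture user with the REPLICATION attribute and read access to the captured tables
CREATE USER estuary_capture WITH REPLICATION PASSWORD 'secret';
GRANT SELECT ON public.orders, public.customers TO estuary_capture;

-- A publication limited to the tables you plan to capture
CREATE PUBLICATION estuary_publication FOR TABLE public.orders, public.customers;

-- A logical replication slot using the built-in pgoutput plugin
SELECT pg_create_logical_replication_slot('estuary_slot', 'pgoutput');
```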
Step 1: Configure Postgres as Source
- Sign in to your Estuary account.
- From the left-side menu of the dashboard, click the Sources tab. You will be redirected to the Sources page.
- Click the + NEW CAPTURE button and type PostgreSQL in the Search connectors field.
- You will see several options for PostgreSQL, including real-time and batch. Select the one that fits your requirements and click on the Capture button of the connector.
For this tutorial, let’s select the PostgreSQL real-time connector.
- On the connector’s configuration page, you need to enter all the essential fields, including:
- Address host or host:port
- Database name
- User and password
- SSL mode if your provider requires it
- After entering these details, click on NEXT > SAVE AND PUBLISH.
This connector leverages change data capture (CDC) to continuously capture inserts, updates, and deletes from your PostgreSQL database into Flow collections. Whether you picked the real-time or batch connector, the captured changes land in Flow collections that you can materialize downstream.
Recommended setup details
- Read Only Captures are available if you cannot create the watermarks table. Ensure at least one captured table is updated regularly or create a small heartbeat table to prevent WAL from growing.
- Standby capture is supported on PostgreSQL 16 and later with hot_standby_feedback=on.
- WAL hygiene (see the SQL sketch after this list)
  - Set max_slot_wal_keep_size to prevent unbounded WAL growth.
  - Drop replication slots when you disable or delete captures.
  - Be aware that long-running transactions can delay restart_lsn advancement.
- TOAST handling
  - Estuary fills omitted TOAST values using merge reductions.
  - If you still see edge cases, consider REPLICA IDENTITY FULL after weighing the impact on the database.
- Publications
  - Include only the tables you intend to capture. Creating a publication for all tables can break updates or deletes for tables without primary keys.
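The WAL hygiene and TOAST recommendations above map to a handful of statements as well. A minimal sketch with illustrative names and values:

```sql
-- Cap how much WAL a lagging slot can retain (PostgreSQL 13+; the size is illustrative)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

-- A small heartbeat table, updated on a schedule, keeps the slot advancing on quiet databases
CREATE TABLE public.heartbeat (id int PRIMARY KEY, beat_at timestamptz);
INSERT INTO public.heartbeat VALUES (1, now())
    ON CONFLICT (id) DO UPDATE SET beat_at = now();

-- Only if TOAST edge cases persist: log full old rows, at the cost of extra WAL volume
ALTER TABLE public.orders REPLICA IDENTITY FULL;

-- When you disable or delete a capture, drop its slot so WAL can be recycled
SELECT pg_drop_replication_slot('estuary_slot');
```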
Step 2: Configure Iceberg as Destination
- Choose your catalog type and gather its connection values:
  - AWS Glue Iceberg REST
    - URL: https://glue.<region>.amazonaws.com/iceberg
    - Warehouse: your AWS account ID
    - Base Location: required, and must be an S3 path
    - Auth options: AWS SigV4 or AWS IAM
  - Amazon S3 Tables REST
    - URL: https://s3tables.<region>.amazonaws.com/iceberg
    - Warehouse: the S3 Tables bucket ARN
    - Auth options: AWS SigV4 or AWS IAM
  - Other REST catalogs, for example Snowflake Open Catalog
    - Use OAuth 2.0 Client Credentials with a scope such as PRINCIPAL_ROLE:<role>
- In Estuary, go to Destinations and create a new materialization using the Apache Iceberg connector
- Set URL, Warehouse, Namespace, and Base Location if your catalog requires it
- Choose Catalog Authentication:
  - OAuth 2.0 Client Credentials
  - AWS SigV4
  - AWS IAM
- Configure Compute:
  - EMR region
  - EMR Serverless application ID
  - EMR execution role ARN
  - S3 staging bucket and an optional bucket path
  - Optional Systems Manager prefix for securely storing OAuth credentials used by EMR jobs
- The collections created by your Postgres capture should be linked to your materialization automatically. If they aren't, you can link the capture manually.
To do this, click the SOURCE FROM CAPTURE button in the Source Collections section. Then, select your Postgres data collection.
- Finally, click NEXT and SAVE AND PUBLISH to complete the configuration process.
This connector materializes delta updates from collections into Iceberg tables using your configured Iceberg REST catalog and AWS EMR Serverless for compute.
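Once the materialization is running, you can sanity-check it from any engine attached to the same Iceberg catalog. A minimal Spark SQL sketch, assuming a hypothetical catalog named rest_catalog and a materialized table estuary.public_orders:

```sql
-- Confirm rows are arriving
SELECT count(*) FROM rest_catalog.estuary.public_orders;

-- Inspect recent Iceberg snapshots to verify that new commits keep landing
SELECT committed_at, snapshot_id, operation
FROM rest_catalog.estuary.public_orders.snapshots
ORDER BY committed_at DESC
LIMIT 5;
```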
Advanced options you will likely use
- Hard Delete applies source deletes as physical deletes. Off by default for soft delete.
- Lowercase Column Names makes all columns lowercase to improve compatibility with engines such as Athena.
- Sync Schedule lets you run on a schedule if you prefer periodic merges over continuous streaming.
Here’s a quick video that shows how to set up Apache Iceberg with Amazon S3 and AWS Glue to simplify your data workflows:
Permissions checklist
Glue catalog
- Catalog user or role needs Glue permissions to create and modify databases and tables
- Access to the table bucket that stores Iceberg data and metadata
- If Lake Formation is enabled, grant Data Location, Create Database, and table level permissions to both the catalog user and the EMR execution role
S3 Tables catalog
- s3tables permissions for the target bucket for both the catalog user or role and the EMR execution role
EMR Serverless
- EMR execution role policy for reading credentials from Parameter Store when OAuth is used
- Read and write access to the S3 staging bucket
- Application start and job run permissions for the IAM principal configured in Estuary
Benefits of Estuary
Here is an overview of essential features of Estuary that are beneficial while moving data from Postgres to Iceberg:
- Change Data Capture (CDC): Estuary streams changes from Postgres with low end-to-end latency, typically sub-second to near real time depending on workload and network. It supports streaming via transaction logs and incremental backfill.
- Transformation: With Estuary, you can perform ETL or ELT data integration according to the requirements of your organizational workflows. To conduct transformations, you can use streaming SQL (ETL) and TypeScript (ETL) or dbt (ELT) in your destination system.
- Multiple Deployment Options: You can deploy Estuary in one of three ways, depending on your requirements. The first is the Public Deployment option, a fully managed offering suitable for small and medium-sized businesses.
The second is the Private Deployment option, which lets you run Flow in your own private environment, giving you more control over your data pipeline development process. The third is BYOC (Bring Your Own Cloud), which runs Estuary's data plane inside your own cloud account.
Method 2: Exporting Postgres Data to CSV via pg_dump and Using Spark to Load to Iceberg
pg_dump is a command-line utility in PostgreSQL that enables you to create a backup of a Postgres database. To sync data from Postgres to Iceberg tables with this method, you export your Postgres data (ultimately as CSV files) and then load those files into Iceberg tables using Apache Spark.
You can follow the steps below to move data from Postgres to an Iceberg table via the CSV export/import method:
Step 1: Extract CSV Data from Postgres
- Connect to the Postgres database you want to export from.
- On your local system, make sure the pg_dump tool is available, then open a terminal or command prompt. If you use a cloud-based Postgres database, you can still run pg_dump from your machine against the remote host.
Then, use the following code to extract data from a particular table in the Postgres database using pg_dump:
```bash
pg_dump --column-inserts --data-only --table=<table> <database> > yourfile_name.sql
```

- <table> is the Postgres table from which you want to extract data. Enter the name of your table here.
- <database> is the name of the Postgres database containing that table.
- yourfile_name.sql is the file where your exported data is saved.
- The output of pg_dump is an SQL file of INSERT statements. To get CSV instead, export the table directly with the PostgreSQL COPY command (the HEADER option matters later, because the Spark loader reads the first row as column names). If you don't have filesystem access on the database server, use psql's client-side \copy variant, which writes the file to your local machine:

```sql
COPY (SELECT * FROM <table>) TO '/path/output.csv' WITH CSV HEADER;
```

- /path/output.csv is the destination path for the extracted data. Replace it with the actual file path on your system.
You can store these CSV files in your local system and easily access them to load data to Iceberg tables.
Step 2: Load CSV Data to Iceberg Using Apache Spark
To load data from CSV files into Apache Iceberg tables using Apache Spark, follow these steps:
- Launch a Spark shell with the Iceberg library. You can use the following command to include the Iceberg Spark runtime package:

```bash
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
```

- You can then add catalogs that let you manage tables with SQL and load them by name. Use the following options to create a path-based catalog for your Iceberg tables:

```bash
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```

Replace local with your preferred catalog name and $PWD/warehouse with the directory where the catalog should store Iceberg data and metadata (not the location of your CSV files).
- Now, load the CSV data exported from Postgres into an Iceberg table using the following PySpark code. Since this snippet is Python, run it in a pyspark session launched with the same --packages and --conf options shown above.
```python
from pyspark.sql.functions import days

# Create a database (namespace) in the path-based "local" catalog configured above
spark.sql("CREATE DATABASE IF NOT EXISTS local.nyc")

# Read the CSV exported from Postgres; header=True relies on the HEADER option used in COPY
csv_df = spark.read.csv('/path/to/output.csv', header=True, inferSchema=True)

# Write the data to an Iceberg table, partitioned by day of the pickup timestamp
csv_df.writeTo("local.nyc.taxis") \
    .partitionedBy(days("tpep_pickup_datetime")) \
    .createOrReplace()
```

The explanation of the code is as follows:
- First, import the days partition transform from PySpark's functions module. You can replace days with another transform, such as months or hours, to partition the data differently.
- Create a database in the local catalog (if it doesn't already exist) to hold the loaded data. You can replace nyc with your own database name.
- Finally, write the CSV data from PostgreSQL into an Iceberg table in that database. In place of local.nyc.taxis, enter your own catalog, database, and table name, and replace tpep_pickup_datetime with the timestamp column you want to partition your Iceberg table by.
This simple Postgres to Iceberg example shows how to manually transfer data between these two data systems for improved analytics.
This completes the Postgres Iceberg data transfer process through a CSV file and Apache Spark.
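As noted in the key takeaways, Iceberg tables loaded this way benefit from routine maintenance. Here is a minimal sketch using Iceberg's built-in Spark procedures (available because the session was started with the Iceberg SQL extensions); the table name matches the example above and the retention timestamp is illustrative:

```sql
-- Compact the many small files a bulk load tends to produce
CALL local.system.rewrite_data_files(table => 'nyc.taxis');

-- Expire old snapshots so metadata stays lean; pick a retention window that fits your rollback needs
CALL local.system.expire_snapshots(
    table => 'nyc.taxis',
    older_than => TIMESTAMP '2024-01-01 00:00:00'
);
```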
Limitations of Using CSV Export/Import to Move Data from Postgres to Iceberg
Some of the challenges of using the CSV export/import method to move data from Postgres to Iceberg are as follows:
- Time-Consuming and Effort-Intensive: Loading Postgres data to Iceberg tables via CSV files can be time-consuming as it involves multiple steps. The process also requires a good understanding of both systems. It involves manual efforts to prepare and transform the data, adding to the complexity and duration of the task.
- Shortcomings of Apache Spark: Spark processes data in batches (or micro-batches when streaming), so loads like this introduce latency rather than continuous updates. You also have to provision the Spark environment and manage the resulting file layout yourself, which complicates handling large datasets efficiently.
- Inefficiency of pg_dump: pg_dump produces a one-off logical snapshot rather than an incremental or continuous export, so every refresh means re-dumping and re-loading everything. It also works on one database at a time, does not capture cluster-wide objects such as roles and tablespaces, and can be slow to dump and restore for very large databases.
Use Cases for PostgreSQL to Apache Iceberg Integration
The high-performing nature of Iceberg tables makes them a suitable data system for numerous use cases. Some of these are as follows:
- Data Lake Architectures: A data lake is a centralized data system where you can store both structured and unstructured data. Using Iceberg tables within data lakes facilitates adequate data storage, management, and retrieval. You can then use this data for various finance, healthcare, banking, or e-commerce operations.
- High-Scale Data Analytics: By loading data to Iceberg tables, you can analyze petabyte-scale datasets for big enterprises, financial institutions, or government agencies. This simplifies the data-related workflow of such institutional bodies and popularizes data science for real-life applications.
- Developing Real-time Software Applications: Iceberg tables allow you to query data faster. You can leverage this capability to build software applications that provide real-time services, such as online retail stores.
Conclusion
If you are looking for solutions for effective data management and analytics, Postgres to Iceberg replication offers significant advantages. This data integration can help you develop a highly scalable workflow infrastructure that enables you to query large volumes of datasets for better business insights.
You can use one of two methods for transferring data between Postgres and Iceberg. The first uses pg_dump, a Postgres utility for creating data backups, to export the data, which you then load into Iceberg tables using Apache Spark. This method is effort-intensive, prone to errors, and inefficient for voluminous datasets.
An alternative for a Postgres to Iceberg sync is to use Estuary, an efficient data movement platform that simplifies the data transfer through automation. You can quickly load Postgres data into Iceberg tables and utilize it for your operational needs.
Sign up for your Estuary account today and use it to develop robust data pipelines for creating high-performing applications!
FAQs
How does Apache Iceberg differ from Apache Parquet?
Parquet is a columnar file format for storing data, while Iceberg is a table format that organizes data files (often Parquet) into tables and adds features such as ACID transactions, schema evolution, hidden partitioning, and time travel.
Can Iceberg tables handle real-time data from Postgres?
Yes. Iceberg commits data in snapshots, and a CDC pipeline such as Estuary can stream inserts, updates, and deletes from Postgres into Iceberg tables in near real time, as described in Method 1 above.

About the author
A data engineering professional with over 15 years of experience, focused on helping early-stage data companies grow by attracting customers and users. Their writing offers insights to help companies scale efficiently and effectively in an evolving data landscape.