Many businesses have long favored PostgreSQL for its robust data storage and management capabilities. Its support for diverse data types and advanced query execution makes it a popular operational database for enterprises.
However, despite its many benefits, Postgres comes with challenges around auto-scaling, distributed queries, and managing large datasets across multiple servers. To address these limitations, you can transfer your data to a highly scalable lakehouse powered by Apache Iceberg, an open-source table format that helps you manage and query large volumes of data. This makes Iceberg a suitable choice for enterprises that require advanced data-handling capabilities.
Here, you will learn two methods for transferring data from Postgres to Iceberg tables, along with common use cases for this integration, so you can handle large datasets and turn them into analytical insights for your business.
What is PostgreSQL?
PostgreSQL, or Postgres, is an open-source object-relational database management system (ORDBMS). It supports various data types, including numeric, boolean, date/time, geometric, and JSON data. Postgres is built to handle massive transactional workloads, such as serving as the backing database for large-scale web applications.
For faster query execution, Postgres provides the parallel query feature. Using this feature, you can divide a single query into multiple sub-tasks and execute them simultaneously across different CPUs. This ability makes Postgres a high-performing database that can quickly query massive amounts of data. As a result, you can use it in enterprise software, geospatial applications, telecommunications, and financial services.
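As a quick, hedged illustration of parallel query, the snippet below raises the per-query worker limit and checks the plan for a Gather node; the orders table is hypothetical, and Postgres only parallelizes when the planner decides it is worthwhile:

```sql
-- Check and raise the number of parallel workers a single query may use
SHOW max_parallel_workers_per_gather;
SET max_parallel_workers_per_gather = 4;

-- A "Gather" node with "Workers Planned/Launched" in the plan confirms parallel execution
EXPLAIN (ANALYZE)
SELECT status, count(*)
FROM orders            -- hypothetical large table
GROUP BY status;
```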
What is Apache Iceberg?
Apache Iceberg is a table format that assists in managing the data files in data lakes. It acts as an abstraction layer between your physical data files and how they are organized into relational tables.
You can use compute engines like Spark, Flink, PrestoDB, or Hive to manage data in Iceberg tables. These tables work much like other database tables, supporting concurrent operations and features such as ACID compliance. Iceberg tables can accommodate massive amounts of data on a petabyte scale. As a result, you can use Iceberg-based applications for several big data operations.
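To give a sense of how this looks in practice, here is a minimal Spark SQL sketch. It assumes an Iceberg catalog named local is already configured in your Spark session (similar to Method 2 below), and the database and table names are purely illustrative:

```sql
-- Create an Iceberg table; it behaves like a regular database table
CREATE TABLE local.db.events (
    event_id   BIGINT,
    event_type STRING,
    event_ts   TIMESTAMP
) USING iceberg;

-- Standard reads and writes work as expected
INSERT INTO local.db.events VALUES (1, 'signup', current_timestamp());
SELECT event_type, count(*) FROM local.db.events GROUP BY event_type;
```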
Here are some key features of Iceberg that make it an impressive solution for data management:
- Schema Evolution: Schema evolution is the process of modifying a table’s schema without rewriting the existing data in it. Iceberg supports robust schema evolution and allows you to add, drop, or rename columns while maintaining data integrity.
- Hidden Partitioning: You can query data faster in Iceberg thanks to its hidden partitioning system, which divides tables into smaller, more manageable units. Iceberg also supports partition evolution, letting you change a table’s partition keys without rewriting its existing data. New partition values are created automatically in Iceberg tables based on the configured partitioning scheme.
- Time Travel and Rollback: Iceberg offers a time travel feature that allows you to run reproducible queries against historical snapshots of a table, which is useful for tracking changes over time. You can also use the version rollback feature to revert a table to a previous state using either a snapshot ID or a timestamp (see the Spark SQL sketch after this list).
- Faster Querying: While using Iceberg, you can prune metadata files, scan tables quickly, and filter out data files that do not contain relevant data. These capabilities increase the querying speed and help you perform faster data analytics on large datasets.
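The following Spark SQL sketch shows what a few of these features look like in practice. It assumes the Iceberg runtime and SQL extensions are enabled (as configured in Method 2 below) and reuses the illustrative local.db.events table; the snapshot ID and timestamp are placeholders:

```sql
-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE local.db.events ADD COLUMNS (country STRING);

-- Partition evolution: change the partition spec; only newly written data uses the new layout
ALTER TABLE local.db.events ADD PARTITION FIELD days(event_ts);

-- Time travel: query an earlier snapshot by ID or by timestamp
SELECT * FROM local.db.events VERSION AS OF 123456789012345;
SELECT * FROM local.db.events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Rollback: revert the table to a previous snapshot via a Spark procedure
CALL local.system.rollback_to_snapshot('db.events', 123456789012345);
```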
Methods for Moving Data From Postgres to Apache Iceberg
- Method 1: Using Estuary Flow to Load Data from Postgres to Iceberg
- Method 2: Exporting Postgres Data to CSV via pg_dump and Using Spark to Load to Iceberg
Method 1: Using Estuary Flow to Load Data from Postgres to Iceberg
To streamline the data transfer from Postgres to Iceberg, you can opt for an automation tool like Estuary Flow, a unified data pipeline integration platform. It offers 200+ pre-built connectors to help you collect and consolidate data from various sources into common data destinations such as Snowflake, Databricks, or Apache Iceberg.
With Estuary Flow, you can ingest data from several sources into collections and then materialize them to multiple destinations simultaneously. The platform also allows you to reuse your data across various projects, saving the time and resources required to consolidate data.
Steps
Follow the steps below to move data from Postgres to Iceberg after fulfilling these prerequisites:
Prerequisites
- An Estuary Flow account.
- A PostgreSQL instance with logical replication enabled, including a capture user role, a replication slot, and a publication (a setup sketch follows this list).
- AWS S3 as the storage layer and AWS Glue as the catalog (alternatively, a REST catalog).
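If you still need to prepare the Postgres side, the SQL below is a rough sketch of that setup. The role, publication, and slot names (flow_capture, flow_publication, flow_slot) are examples only; check the connector’s documentation for the exact names and grants it expects:

```sql
-- Enable logical replication (requires a server restart)
ALTER SYSTEM SET wal_level = logical;

-- Create a capture role with replication privileges
CREATE USER flow_capture WITH REPLICATION PASSWORD 'secret';
GRANT pg_read_all_data TO flow_capture;  -- PostgreSQL 14+; otherwise grant SELECT per table

-- Create a publication and a logical replication slot for the capture
CREATE PUBLICATION flow_publication FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('flow_slot', 'pgoutput');
```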
Step 1: Configure Postgres as Source
- Sign in to your Estuary account.
- From the left-side menu of the dashboard, click the Sources tab. You will be redirected to the Sources page.
- Click the + NEW CAPTURE button and type PostgreSQL in the Search connectors field.
- You will see several options for PostgreSQL, including real-time and batch. Select the one that fits your requirements and click on the Capture button of the connector.
For this tutorial, let’s select the PostgreSQL real-time connector.
- On the connector’s configuration page, you need to enter all the essential fields, including:
- Name: Give a unique name to your capture.
- Server Address: This is the host or host:port to connect to the database.
- User: Enter your Postgres user name for authentication.
- Password: Provide the password for the given user name.
- After entering these details, click on NEXT > SAVE AND PUBLISH.
This connector leverages change data capture (CDC) to continuously capture modifications made in your PostgreSQL database into Flow collections, whether you need streaming or batch updates.
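If you want to confirm on the database side that CDC is wired up, the queries below inspect the publication and replication slot the capture relies on; run them as a suitably privileged role, and expect the names to match whatever you configured:

```sql
-- List publications and replication slots
SELECT pubname FROM pg_publication;
SELECT slot_name, plugin, active FROM pg_replication_slots;

-- Rough measure of how far a slot lags behind the current WAL position
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replication_lag
FROM pg_replication_slots;
```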
Step 2: Configure Iceberg as Destination
- After configuring the source connector, you will see a pop-up summarizing the details of the capture. Click on the MATERIALIZE COLLECTIONS button to start setting up Iceberg as a destination.
Alternatively, you can click on the Destinations tab on the left-side pane of the dashboard. You will be directed to the Destinations page; click the + NEW MATERIALIZATION button.
- On the Create Materialization page, enter Iceberg in the Search connectors field.
- Then, click the Materialization button of the Amazon S3 Iceberg connector to continue with the configuration process.
- On the Create Materialization page, enter all the required fields, such as:
- Name: Enter a unique name for the materialization.
- AWS Access Key ID: To access AWS services, provide your AWS Access Key ID.
- AWS Secret Access Key: Mention the Secret Access Key to access your AWS services.
- Bucket: Enter the name of your S3 bucket to write data files.
- Region: Specify the AWS region.
- Namespace: Enter the namespace for bound collection tables.
- The collections of your Postgres data added to your capture will be automatically linked to your materialization. If they haven’t been, you can link a capture manually: click the SOURCE FROM CAPTURE button in the Source Collections section, then select your Postgres data collection.
- Finally, click NEXT and SAVE AND PUBLISH to complete the configuration process.
This connector materializes delta updates of Flow collections into Iceberg tables using Amazon S3 for object storage and AWS Glue as the Iceberg catalog.
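Once the materialization is running, the resulting Iceberg tables are registered in AWS Glue, so any Glue-aware engine (such as Athena, Spark, or Trino) can query them. A minimal, hedged example with illustrative namespace and table names:

```sql
-- Read the materialized Iceberg table from a Glue-aware engine such as Athena
SELECT * FROM my_namespace.my_table LIMIT 10;
```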
Benefits of Estuary Flow
Here is an overview of essential features of Estuary Flow that are beneficial while moving data from Postgres to Iceberg:
- Change Data Capture (CDC): The real-time CDC feature of Estuary Flow helps you quickly capture changes made in Postgres to keep the data in sync within Iceberg, with end-to-end latency of under 100 milliseconds. Estuary Flow streams source data into the target system using transaction logs and incremental backfill.
- Transformation: With Estuary Flow, you can perform ETL or ELT data integration according to the requirements of your organizational workflows. For transformations, you can use streaming SQL or TypeScript (ETL) within Flow, or dbt (ELT) in your destination system.
- Multiple Deployment Options: You can deploy Estuary Flow in the way that best fits your requirements. The Public Deployment option is fully managed and well suited to small and medium-sized businesses. The Private Deployment option lets you run Flow in your own environment, giving you more control over your data pipeline development process.
Method 2: Exporting Postgres Data to CSV via pg_dump and Using Spark to Load to Iceberg
pg_dump is a command-line utility in PostgreSQL that enables you to create a backup of a Postgres database. To sync data from Postgres to Iceberg tables, you can export your Postgres data to a CSV file with the help of pg_dump and the COPY command. You can then load this CSV data into Iceberg tables using Apache Spark.
You can follow the below steps to move data from Postgres to Iceberg table via the CSV export/import method:
Step 1: Extract CSV Data from Postgres
- Connect to your Postgres database.
- On your local system, ensure you have access to the pg_dump tool and then open a terminal or command prompt. If you use a cloud-hosted Postgres database, run pg_dump against that instance instead.
Then, use the following code to extract data from a particular table in the Postgres database using pg_dump:
```bash
pg_dump --column-inserts --data-only --table=<table> <database> > yourfile_name.sql
```
- <table> is the Postgres table from which you want to extract data. You can enter the name of your table here.
- <database> is the name of the Postgres database containing the table from which you are exporting data.
- yourfile_name is the file name where you will save your exported data.
- The output of this command is an SQL file containing INSERT statements. To get the data as CSV instead, run the following COPY command in psql, which exports the table directly to a CSV file:
```sql
COPY (SELECT * FROM <table>) TO '/path/output.csv' WITH CSV HEADER;
```
- /path/output.csv is the destination file where the extracted data is saved. Replace it with an actual path; note that COPY writes on the database server, so if you need the file on your local machine, use psql’s \copy variant instead.
You can store these CSV files in your local system and easily access them to load data to Iceberg tables.
Step 2: Load CSV Data to Iceberg Using Apache Spark
To load data from CSV files into Apache Iceberg tables using Apache Spark, follow these steps:
- Launch the Spark shell, including the Iceberg library. You can use the following command to include the Iceberg Spark runtime package:
```bash
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
```
- You can then add catalogs that enable you to use SQL commands to manage tables and load them by name. Use the following command to create a path-based catalog for your tables.
```bash
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```
Replace the name local with your preferred catalog name and $PWD/warehouse with the directory where the catalog should store Iceberg table data and metadata.
- Now, load the CSV data exported from Postgres into an Iceberg table. The snippet below is PySpark, so run it in a pyspark session (or notebook) started with the same Iceberg package and catalog configuration:
```python
from pyspark.sql.functions import days

# Create the target database if it doesn't already exist
spark.sql("CREATE DATABASE IF NOT EXISTS nyc")

# Read the CSV file exported from Postgres
csv_df = spark.read.csv('/path/to/output.csv', header=True, inferSchema=True)

# Write it to an Iceberg table, partitioned by day
csv_df.writeTo("nyc.taxis") \
    .partitionedBy(days("tpep_pickup_datetime")) \
    .createOrReplace()
```
The explanation of the code is as follows:
- First, import the days partition transform from PySpark’s functions module. You can replace days with other transforms, such as months or hours, to partition the data differently.
- Create a database in your Spark session (if it doesn’t exist) where you can load the CSV data. You can replace the term nyc with your database name.
- After this, write the CSV data from PostgreSQL into an Iceberg table in this database. In place of nyc.taxis, enter your Iceberg table name, and replace tpep_pickup_datetime with the timestamp column you want to partition your Iceberg table by. To confirm the load, see the verification queries sketched below.
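As a quick sanity check (run through spark.sql() or the spark-sql shell), you can count the loaded rows and inspect Iceberg’s metadata tables; the nyc.taxis name matches the example above:

```sql
-- Verify the row count of the loaded table
SELECT count(*) FROM nyc.taxis;

-- Iceberg metadata tables show the snapshots and data files produced by the load
SELECT committed_at, operation, snapshot_id FROM nyc.taxis.snapshots;
SELECT file_path, record_count FROM nyc.taxis.files;
```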
This completes the manual Postgres to Iceberg transfer through CSV files and Apache Spark, giving you a simple way to move data between the two systems for improved analytics.
Limitations of Using CSV Export/Import to Move Data from Postgres to Iceberg
Some of the challenges of using the CSV export/import method to move data from Postgres to Iceberg are as follows:
- Time-Consuming and Effort-Intensive: Loading Postgres data to Iceberg tables via CSV files can be time-consuming as it involves multiple steps. The process also requires a good understanding of both systems. It involves manual efforts to prepare and transform the data, adding to the complexity and duration of the task.
- Shortcomings of Apache Spark: If you use Iceberg tables with Spark, you may encounter latency in data operations, as Spark processes data in micro-batches. Spark also depends on careful file management, which complicates the handling of large datasets and reduces overall data management efficiency.
- Inefficiency of pg_dump: The pg_dump tool can only handle small data objects; you must manage these manually. It also does not support the extraction of partial indices, leading to incomplete backups or the need for additional steps to recreate these indices after data transfers. Another major limitation is the inability of pg_dump to extract some types of metadata during the backup process.
Use Cases for PostgreSQL to Apache Iceberg Integration
The high-performing nature of Iceberg tables makes them a suitable data system for numerous use cases. Some of these are as follows:
- Data Lake Architectures: A data lake is a centralized repository where you can store both structured and unstructured data. Using Iceberg tables within data lakes enables reliable data storage, management, and retrieval. You can then use this data for various finance, healthcare, banking, or e-commerce operations.
- High-Scale Data Analytics: By loading data into Iceberg tables, you can analyze petabyte-scale datasets for big enterprises, financial institutions, or government agencies. This simplifies the data workflows of such organizations and makes large-scale analytics practical for real-world applications.
- Developing Real-time Software Applications: Iceberg tables allow you to query data faster. You can leverage this capability to build software applications that provide real-time services, such as online retail stores.
Conclusion
If you are looking for solutions for effective data management and analytics, Postgres to Iceberg replication offers significant advantages. This data integration can help you develop a highly scalable workflow infrastructure that enables you to query large volumes of datasets for better business insights.
You can use one of two methods for transferring data between Postgres and Iceberg. The first exports your data with pg_dump, a Postgres utility for creating data backups, and then loads it into Iceberg tables using Apache Spark. This method is effort-intensive, prone to errors, and inefficient for voluminous datasets.
An alternative for a Postgres to Iceberg sync is to use Estuary Flow, an efficient data movement platform that simplifies the data transfer through automation. You can quickly load Postgres data into Iceberg tables and utilize it for your operational needs.
Sign up for your Estuary account today and use it to develop robust data pipelines for creating high-performing applications!
FAQs
What are the challenges of using Iceberg tables?
One of the significant challenges of using Iceberg tables is their complexity and steeper learning curve. To overcome this, you can use managed versions of Iceberg tables, such as Amazon S3 Iceberg or Snowflake Iceberg tables. Other challenges include more limited support for certain data types and query operations, as well as a heavy reliance on metadata, which can cause errors if it is not kept up to date.
How does Apache Iceberg differ from Apache Parquet?
Apache Iceberg is a table format that can contain data files. On the other hand, Apache Parquet is a file format that represents how data is stored within a file.
Iceberg offers features, including schema evolution and time travel, which Parquet does not support. While Iceberg is ACID-compliant, Parquet does not support ACID properties.
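In practice, the two often work together: an Iceberg table typically stores its data as Parquet files and layers table-level metadata on top. A small Spark SQL sketch, reusing the illustrative local catalog from Method 2:

```sql
-- Iceberg writes Parquet data files by default; this table property makes that explicit
CREATE TABLE local.db.trips (id BIGINT, fare DOUBLE) USING iceberg
TBLPROPERTIES ('write.format.default' = 'parquet');

-- The files metadata table lists the underlying Parquet files Iceberg tracks
SELECT file_path, file_format FROM local.db.trips.files;
```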
About the author
With over 15 years in data engineering, the author is a seasoned expert in driving growth for early-stage data companies, focusing on strategies that attract customers and users. Their extensive writing provides insights that help companies scale efficiently and effectively in an evolving data landscape.