
Migrate Data from Redshift to Apache Iceberg: 2 Proven Methods

Discover two proven methods to migrate from AWS Redshift to Apache Iceberg: leverage Estuary Flow for real-time integration or use CSV with Apache Spark workflows.


Organizations need robust data storage and management systems that simplify data-driven work. Amazon Redshift is a popular system for enterprise data management and business intelligence. However, Redshift has limitations, including limited concurrency; while you can opt for concurrency scaling, it may impact the data warehouse's performance. Redshift is also associated with high usage costs, storage constraints, and a lack of built-in data versioning and time travel capabilities.

Apache Iceberg, on the other hand, is an open-source table format that supports robust concurrent operations. It is also simple to work with, since it offers APIs in multiple languages and features such as schema evolution and time travel.

In this article, we explore two proven methods for Redshift to Iceberg migration. One of the methods uses Estuary Flow, a real-time integration platform for automated data transfer from Redshift to Iceberg. The other method requires you to export Redshift data as a CSV file into the local system. You can then transfer the data from the CSV file to Iceberg tables using Apache Spark.

What is AWS Redshift?


AWS Redshift is a data warehousing solution offered by Amazon Web Services. It is a fully managed service that eliminates the need for you to manage infrastructure, allowing you to focus on important business aspects. With its columnar storage format and Massively Parallel Processing (MPP) architecture, Redshift enables you to process complex queries quickly and facilitate faster data analytics.

You can consolidate data from various sources into Redshift, including operational databases, streaming services, and enterprise applications. It can easily accommodate petabytes of data, making it a great data warehouse choice for near-real-time analytics and business intelligence operations on high-volume enterprise data.

Introduction to Apache Iceberg


Apache Iceberg is an open-source table format that helps you manage data files within data lakes or data meshes. It provides an abstraction layer over the data lake, letting you structure your data as tables and manage concurrent reads and writes safely.

You can integrate Iceberg tables into compute engines such as Spark, Hive, Flink, Trino, or PrestoDB. These tables function like SQL tables, letting you query and analyze petabyte-scale data using SQL commands.

Let’s look at some key features of Iceberg:

  • Schema Evolution: Iceberg supports schema evolution, allowing you to modify a database schema without affecting existing data. You can add, drop, rename, update, or reorder table columns while maintaining data integrity.
  • Hidden Partitioning: Partitioning divides a table into smaller parts by grouping similar rows to speed up queries. Iceberg derives partition values from column values automatically; the partitioning is “hidden” because you don’t need to create or maintain explicit partition columns. Iceberg also supports partition evolution as data volumes grow.
  • Time Travel and Rollback: With Iceberg’s time travel feature, you can access historical data snapshots to run reproducible queries. This lets you review and analyze past data without altering current records. To rectify an erroneous query execution, you can use the version rollback mechanism to reset Iceberg tables (see the sketch after this list).
  • Data Compaction: Iceberg allows you to compact data files in parallel using Spark. This reduces metadata overhead by merging smaller files into larger ones, helping enhance storage efficiency and query performance.
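To make these features concrete, here is a minimal, hedged sketch of schema evolution, time travel, and rollback run as Spark SQL from PySpark. It assumes an active SparkSession with an Iceberg catalog named local already configured (as in Method 2 below); the table name, snapshot ID, and timestamp are placeholders.

python
# Assumes an active SparkSession ("spark") with an Iceberg catalog named "local"
# and the Iceberg SQL extensions enabled. Names and IDs below are placeholders.

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE local.db.orders ADD COLUMN discount DOUBLE")

# Time travel: query the table as of an earlier snapshot or point in time.
spark.sql("SELECT * FROM local.db.orders VERSION AS OF 1234567890123456789").show()
spark.sql("SELECT * FROM local.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# Rollback: reset the table to a previous snapshot after a bad write.
spark.sql("CALL local.system.rollback_to_snapshot('db.orders', 1234567890123456789)")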

By transitioning from Redshift to Iceberg, businesses gain advanced features and cost savings while enabling high-performance data workflows.

For an in-depth exploration of Iceberg, take a look at this comprehensive article: Apache Iceberg Tutorial

Why Should You Migrate from Redshift to Iceberg?

  • Access to a Diverse Set of Software: While you can easily integrate Amazon Redshift with other AWS services, it has limited compatibility with external tools. Migrating to Iceberg tables provides you with more integration options with varied software and tools. This facilitates flexible data analysis and visualizations for your business needs.
  • Advanced Functionalities: Apache Iceberg offers various advanced features, including hidden partitioning, time travel, and schema evolution. These features make Iceberg tables suitable for effective data storage and management, providing you with enhanced capabilities beyond what is available in Redshift.
  • Cost-Effectiveness: Apache Iceberg is an open-source table format that is available at no cost. In contrast, Amazon Redshift operates on a pay-as-you-go pricing model. This makes an Iceberg-based stack a more affordable foundation for data analytics, especially for small and medium-sized enterprises.

2 Proven Methods to Migrate from Redshift to S3 Iceberg

Let’s look at two methods for migrating data from Redshift to Iceberg. 

  • Method 1: Using Estuary Flow to Migrate from Redshift to Iceberg
  • Method 2: Migration from Redshift to Iceberg Using CSV File and Apache Spark

Method 1: Using Estuary Flow to Migrate from Redshift to Iceberg

Estuary Flow is an effective real-time integration solution that helps simplify data migration from Redshift to Apache Iceberg. It offers a library of 200+ pre-built connectors that help you transfer data between several source and destination data systems.

Estuary Flow lets you load data from Redshift to Iceberg through ELT or ETL workflows. During integration, you can use SQLite (ETL), TypeScript (ETL), or dbt (ELT) to transform your data.

The steps mentioned below will help you migrate data from Redshift into Iceberg using Estuary Flow.

Prerequisites:

  • An Estuary account.
  • An Amazon Redshift account with credentials to connect to Redshift clusters and properly configured IAM roles.
  • An Iceberg-compatible destination setup, with an Amazon S3 bucket as the storage layer and AWS Glue (or a REST catalog) as the Iceberg catalog.

Step 1: Configure Amazon Redshift as Source

  • Sign in to your Estuary account. On the dashboard, click the Sources tab in the left pane to open the Sources page.
  • Click the + NEW CAPTURE button and enter Amazon Redshift in the Search connectors field.
  • Select the Amazon Redshift Batch connector and click on the Capture button.
  • You will be redirected to the connector’s configuration page, where you must enter all the required fields, including:
    • Name: Give a unique name to your capture.
    • Server Address: This is the host or host:port to connect to your database.
    • User: Provide the Redshift user name for authentication.
    • Password: Enter the password for the specified user name.
  • After providing all the details, click NEXT, followed by SAVE AND PUBLISH.

This connector captures data from your Amazon Redshift cluster and translates it into Flow collections.

Step 2: Configure Iceberg as Destination

  • After the source connector configuration, you will see a pop-up summarizing the details of the capture. Here, click the MATERIALIZE COLLECTIONS button to start configuring Iceberg as the destination.
  • Alternatively, you can click on the Destinations tab from the left pane of the dashboard. This will redirect you to the Destinations page.

  • Click on the + NEW MATERIALIZATION button and type Amazon S3 Iceberg in the Search connectors field.
  • The search result will display the Iceberg connector. Click the connector’s Materialization button to proceed with the configuration.
  • On the Create Materialization page, fill in the following details:
    • Name: Give a unique name for the materialization.
    • AWS Access Key ID: Enter the Access Key ID to access AWS services.
    • AWS Secret Access Key: Provide the Secret Access Key to utilize AWS services.
    • Bucket: Enter the S3 bucket name to write data files.
    • Region: Mention the AWS region.
    • Namespace: Enter the namespace of bound collection tables.
  • While configuring Iceberg as a destination, you can opt for the AWS Glue Catalog if you are an AWS root or IAM user. If you are using the REST Catalog, you must provide the name of the warehouse to which you want to connect. You should also provide the URI and credentials to connect to the catalog.
  • If you materialized your collections directly from capture creation, your Redshift collections will already be selected. Otherwise, you can link a source collection.

To do this, click the SOURCE FROM CAPTURE button in the Source Collections section. Then select your Redshift data capture.

  • Finally, click NEXT and SAVE AND PUBLISH to complete the configuration process.

This connector materializes delta updates of Flow collections into Iceberg tables using Amazon S3 for object storage and AWS Glue as the Iceberg catalog.

These steps will help you successfully migrate data from Redshift to Iceberg tables. 
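If you want to confirm the materialized tables from outside Estuary, one option is to query them with Spark using the same AWS Glue catalog and S3 warehouse. The sketch below is a hedged example, not part of the Estuary workflow: it assumes the Iceberg Spark runtime and Iceberg AWS bundle jars are on the classpath, your AWS credentials are configured, and the bucket, namespace, and table names (placeholders here) match your materialization.

python
from pyspark.sql import SparkSession

# Hedged verification sketch; the catalog alias "glue", bucket, namespace,
# and table names are placeholders for your own values.
spark = (
    SparkSession.builder
    .appName("verify-iceberg-materialization")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://your-bucket/your-prefix")
    .getOrCreate()
)

# Count rows in a table Estuary Flow materialized into your namespace.
spark.sql("SELECT COUNT(*) FROM glue.your_namespace.your_table").show()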

Try out Estuary Flow for free. If you have any questions, join our Slack group or contact us today!

Benefits of Estuary Flow

  • End-to-End Change Data Capture (CDC): Estuary Flow’s CDC capability tracks and captures changes made to the source data and reflects them in the destination system, with end-to-end latency of less than 100 milliseconds.
  • No-code Configuration: Estuary Flow simplifies data migration between sources and destinations. With hundreds of no-code connectors for apps, databases, and data warehouses, you can build data pipelines even with little or no coding knowledge.
  • Multiple Deployment Options: Estuary provides three deployment options: Public, Private, and Bring Your Own Cloud (BYOC). Public deployment is a fully managed service that you can set up quickly with minimal configuration. The Private deployment option, on the other hand, suits organizations that want Flow running within their own private infrastructure.

The third deployment mode is BYOC, which allows you to deploy Estuary in your own cloud environment, offering more flexibility and customization options.

Method 2: Migration from Redshift to Iceberg Using CSV File and Apache Spark

Another method for Redshift to Iceberg migration is a CSV export/import workflow. This involves extracting data from Redshift as a CSV file and then loading that data into Iceberg using Apache Spark.

The steps for this method are as follows:

Step 1: Extracting Data From Redshift

To export data from Redshift into a CSV file, you can use one of the following four options:

Option 1: Using UNLOAD Command

To export data from Redshift, you can use the UNLOAD command, which writes the query results as CSV files to an Amazon S3 bucket. Here’s an example command that you can run in the Amazon Redshift query editor.

plaintext
UNLOAD ('SELECT * FROM table_name')
TO 's3://amzn-s3-demo-bucket/unload/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

Here:

  • table_name: Name of the Redshift table from which you want to export data.
  • amzn-s3-demo-bucket/unload/: The S3 bucket and folder path where the CSV data file will be saved.
  • aws:iam::0123456789012:role/MyRedshiftRole: The IAM role ARN (Amazon Resource Name) that permits Redshift to access the S3 bucket.

Option 2: Using AWS SDK

Another way to extract data from Redshift is through the AWS SDK. You can interact with AWS using one of its SDKs for languages such as Python, Java, or JavaScript (Node.js). Here, we use Python to extract data from Redshift.

  • First, install boto3, the AWS SDK for Python, using the following command:
plaintext
pip install boto3
  • Now, configure boto3 and create a client. You can find the necessary details, such as your cluster’s region, in the Redshift console.
plaintext
import boto3
import pandas as pd

client = boto3.client(
    'redshift-data',
    region_name='us-east-2',
    aws_access_key_id='your-public-key',
    aws_secret_access_key='your-secret-key'
)

Here:

  • 'redshift-data': The service name of the Amazon Redshift Data API.
  • region_name: The AWS region where your Redshift clusters are located.
  • aws_access_key_id and aws_secret_access_key: The credentials to authenticate your requests to AWS.
  • Execute a SQL query on your Redshift cluster with the following:
plaintext
response = client.execute_statement(
    ClusterIdentifier='your-cluster',
    Database='your-database',
    DbUser='your-user',
    Sql='SELECT * FROM users;'  # Replace with your SQL query
)

In this command:

  • ClusterIdentifier: It is the identifier for your Redshift cluster.
  • Database: The name of the Redshift database.
  • DbUser: The database user’s name.
  • The response to the query will be in the form of a dictionary containing metadata. To check the status of your query, you should use describe_statement as shown below:
plaintext
query_id = response['Id']
status = client.describe_statement(Id=query_id)['Status']
print("Query status:", status)
  • Then, you can retrieve the query results using the following code:
plaintext
result = client.get_statement_result(Id=query_id)
  • To easily convert the query results into a Pandas DataFrame, you can use the following function:
plaintext
def redshift_to_dataframe(result):
    # Column names come from the query's metadata.
    column_labels = [col['label'] for col in result['ColumnMetadata']]
    data_rows = []
    for record in result['Records']:
        # Each field is a single-key dict such as {'stringValue': '...'}; take its value.
        row_data = [list(field.values())[0] for field in record]
        data_rows.append(row_data)
    return pd.DataFrame(data_rows, columns=column_labels)
  • Finally, you can call the function to create a DataFrame and save it as CSV using the below command:
plaintext
df = redshift_to_dataframe(result)
df.to_csv('your-file.csv', index=False)

This completes data retrieval from Redshift as a CSV file using the AWS SDK.

Option 3: Using the AWS Command Line Interface

  • First, install the AWS CLI; the installation steps vary depending on your operating system. Then, run the following command to execute a query against your Redshift cluster:
plaintext
aws redshift-data execute-statement \
    --cluster-identifier my-cluster \
    --database my_database \
    --secret-arn arn:aws:secretsmanager:us-west-2:123456789012:secret:my_secret \
    --sql "SELECT * FROM your_table"
  • After executing the above command, you will get an execution ID for the statement. Use this ID with the following command to retrieve the query results:
plaintext
aws redshift-data get-statement-result --id your_sql_id

Here, your_sql_id is the statement ID returned by execute-statement. By default, the AWS CLI returns results in JSON format, which you can convert into CSV using a tool such as jq.
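If you would rather stay in Python than use jq, the following hedged sketch converts the get-statement-result JSON (saved to a local file whose name is assumed here) into a CSV file:

python
import csv
import json

# Assumes you saved the CLI output first, for example:
#   aws redshift-data get-statement-result --id your_sql_id > result.json
# Note: large result sets may span multiple pages (NextToken).
with open("result.json") as f:
    result = json.load(f)

# Column labels come from the statement's metadata.
columns = [col["label"] for col in result["ColumnMetadata"]]

# Each record is a list of single-key dicts such as {"stringValue": "..."}.
rows = [[list(field.values())[0] for field in record] for record in result["Records"]]

with open("your-file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)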

Option 4: Using SQL Client

Another technique to extract Redshift data as a CSV file is to use a SQL client, such as SQL Workbench/J or DBeaver, on your local system. Here are the steps for data retrieval:

  • Connect to your Redshift database using the SQL client.
  • Query the desired table.
  • Use the SQL client’s built-in export functionality to save the query results as a CSV file.

You can store the CSV file containing your Redshift data in your local system.

Step 2: Loading Data to Iceberg Tables Using Apache Spark

  • To start loading data into Iceberg tables, launch a Spark shell with the Iceberg library included. Since the loading code in this step uses PySpark, start the pyspark shell with the Iceberg Spark runtime package:
plaintext
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1
  • In Iceberg, creating, renaming, or dropping a table is managed through a catalog, which also tracks the table's metadata. To use Iceberg tables in Apache Spark, you must first configure a catalog. Add the following options to the shell launch command above to set up an Iceberg catalog for your tables:
plaintext
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
--conf spark.sql.defaultCatalog=local

Here, local is the catalog name with tables saved in a directory named $PWD/warehouse. You can replace ‘local’ with your catalog name and $PWD/warehouse with the location where you want to store your tables. 
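Alternatively, if you run the load from a standalone PySpark script rather than an interactive shell, you can set the same options on the SparkSession builder. This is a minimal sketch mirroring the configuration above; the app name and warehouse path are placeholders.

python
from pyspark.sql import SparkSession

# Mirrors the shell flags above; adjust the catalog name and warehouse path.
spark = (
    SparkSession.builder
    .appName("redshift-csv-to-iceberg")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/path/to/warehouse")
    .config("spark.sql.defaultCatalog", "local")
    .getOrCreate()
)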

  • Finally, you can use the following code to load the CSV data exported from Redshift in Iceberg tables:
plaintext
from pyspark.sql.functions import days

# Create the target database if it doesn't already exist.
spark.sql("CREATE DATABASE IF NOT EXISTS your_database_name")

# Read the CSV file exported from Redshift.
csv_df = spark.read.csv("file:///path/to/your/file.csv", header=True, inferSchema=True)

# Write the data to an Iceberg table, partitioned by day of the datetime column.
csv_df.writeTo("your_database_name.your_table_name") \
    .partitionedBy(days("your_datetime_column")) \
    .createOrReplace()

The explanation for this code is as follows:

  • Import the necessary functions from PySpark’s functions module. Here, you can replace days with other partition transform functions, such as months or hours, to partition the data differently.
  • Create a database in your Spark session if it doesn’t already exist. Replace the database name your_database_name with the actual name of your database.
  • Once the database is created, you can start writing CSV data from Redshift into Iceberg tables within this database. Enter the name of your Iceberg table in place of your_table_name and your partition column in place of your_datetime_column.

This completes the Redshift to Iceberg migration using CSV files and Apache Spark.
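To sanity-check the migration, you can count the rows and inspect the snapshot Iceberg created for the write. A brief sketch, assuming the same Spark session and placeholder database and table names as above:

python
# Row count in the newly written Iceberg table (placeholder names).
spark.sql("SELECT COUNT(*) FROM your_database_name.your_table_name").show()

# Iceberg metadata tables: the .snapshots table records the commit for the load.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM your_database_name.your_table_name.snapshots
""").show()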

Limitations of Using CSV File and Apache Spark for Redshift to Iceberg Migration 

  • Effort Intensive: Migrating data from Redshift to Iceberg using CSV files requires a good understanding of both platforms. The method also involves manual data preparation and transformation, which can be complex and time-consuming.
  • High Memory Consumption by Apache Spark: Apache Spark is a resource-intensive system and requires substantial memory to perform computational operations. To ensure the availability of sufficient memory, you will have to invest in expensive hardware, which can increase your organization’s overall operational costs.
  • Lack of Real-time Capabilities: This integration technique lacks real-time capabilities due to the delays involved in extracting and loading the data. As a result, it is unsuitable for applications that require low latency, such as event monitoring or fraud detection.

Conclusion

Migrating data from Redshift to Iceberg can help you overcome Redshift's operational limitations. If you compare Iceberg vs Redshift, you will find that Iceberg can help create a robust system for storing and managing data within your enterprise. This makes a Redshift-Iceberg migration a viable solution.

To perform this data migration, you can opt for two methods. One method involves extracting Redshift data as a CSV file and transferring it to Iceberg using Apache Spark. This approach is time-consuming, complex, and lacks real-time capabilities.

Alternatively, you can use Estuary Flow, an efficient data movement platform, to migrate data from Redshift to Iceberg. Estuary automates the data transfer process, allowing you to quickly complete complex data-related tasks and streamline your business workflow.

Sign up for Estuary today to efficiently integrate data across multiple platforms and develop scalable data pipelines for cutting-edge applications!  

FAQs

Does Amazon Redshift support querying Apache Iceberg tables?

Yes, Amazon Redshift supports querying Apache Iceberg tables. Redshift allows you to utilize Redshift Spectrum or Redshift Serverless to query Iceberg tables in the AWS Glue Data Catalog.
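As a hedged illustration rather than an official recipe, you could register the Glue database that holds your Iceberg tables as an external schema and query it through the Redshift Data API used earlier; the cluster, database, schema, table names, and IAM role ARN below are all placeholders.

python
import boto3

# Placeholders throughout; AWS credentials are taken from your environment.
client = boto3.client("redshift-data", region_name="us-east-2")

# Register the Glue database containing the Iceberg tables as an external schema.
client.execute_statement(
    ClusterIdentifier="your-cluster",
    Database="your-database",
    DbUser="your-user",
    Sql=(
        "CREATE EXTERNAL SCHEMA IF NOT EXISTS iceberg_schema "
        "FROM DATA CATALOG DATABASE 'your_glue_database' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/YourSpectrumRole';"
    ),
)

# Query the Iceberg table through the external schema; fetch results with
# describe_statement / get_statement_result as shown in Method 2.
client.execute_statement(
    ClusterIdentifier="your-cluster",
    Database="your-database",
    DbUser="your-user",
    Sql="SELECT COUNT(*) FROM iceberg_schema.your_iceberg_table;",
)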

What are the challenges of using Apache Iceberg?

You may encounter some challenges while using Apache Iceberg tables. If you are not familiar with how Iceberg works, it can feel complex at first. Iceberg is also a table format rather than a real-time processing engine, so on its own it is not well suited to time-critical applications.

Is Apache Iceberg ACID compliant?

Yes, Apache Iceberg is ACID-compliant. Iceberg facilitates atomicity by supporting the addition and removal of data files in a single operation. It ensures data consistency through concurrent operations on the data. Iceberg also supports serializable isolation and durable transactions.

