
How to Migrate From Snowflake Native Tables to Open Catalog (Polaris) Iceberg Tables

Learn how to migrate Snowflake native tables to Polaris Iceberg for open storage, scalability, and interoperability. Step-by-step guide with PySpark code.


Overview

Migrating from a Snowflake-managed table to a Polaris Iceberg table requires a well-planned and structured approach. Snowflake’s native tables are proprietary, while Snowflake Open Catalog (Polaris) leverages Apache Iceberg for an open, scalable table format.

At a high level, the migration from Snowflake native tables to Polaris Iceberg tables involves three steps:

  1. Extracting data from Snowflake.
  2. Syncing it with an external Polaris catalog.
  3. Converting it to a Polaris internal catalog.

This guide walks through the technical details, including setting up external volumes, catalog integration, and the actual migration steps using PySpark.

Why Migrate to Polaris Iceberg?

Migrating to Polaris Iceberg tables provides several advantages:

  1. Open Storage Format: Avoids vendor lock-in by leveraging Apache Iceberg.
  2. Cost Efficiency: Optimizes storage and query performance.
  3. Interoperability: Enables seamless analytics across Spark, Presto, Trino, and Flink.
  4. Scalability: Supports large datasets with transactional integrity.

Prerequisites

Before migrating, ensure:

  • A Snowflake account with the necessary permissions
  • A Polaris account with external and internal catalog access
  • Apache Spark with Iceberg support configured (see the sketch after this list)
  • Access to an object storage bucket (S3, GCS, or equivalent)
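
To confirm the Spark prerequisite, the minimal bootstrap below pulls the Iceberg runtime from Maven and enables the Iceberg SQL extensions. It is a sketch, assuming Spark 3.5 with Scala 2.12 and Iceberg 1.5.0; adjust the package coordinates to match your cluster.

python
from pyspark.sql import SparkSession

# Sketch: assumes Spark 3.5 / Scala 2.12 / Iceberg 1.5.0.
# Swap the coordinates for the versions running on your cluster.
spark = (
    SparkSession.builder
    .appName("IcebergReadinessCheck")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# If the extension loaded, Iceberg DDL statements and procedures are available in Spark SQL.
print(spark.conf.get("spark.sql.extensions"))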

Step 1: Set Up Polaris External Catalog

First, create an external Polaris catalog to store the Snowflake table in an Iceberg-compatible format.

plaintext
-- Create an external catalog in Polaris
CREATE CATALOG polaris_external WITH (
    location = 's3://your-bucket/polaris_external/'
);

Step 2: Create a Snowflake Iceberg Table

Now, create an Iceberg table in Snowflake and enable continuous sync with Polaris.

plaintext
USE DATABASE sales_db;
USE SCHEMA transactions;

CREATE OR REPLACE ICEBERG TABLE order_history_ext (
    order_id INT,
    customer_id STRING,
    order_date TIMESTAMP,
    total_amount DECIMAL(18,2),
    status STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'polaris_external_vol'
BASE_LOCATION = 'order_history'
CATALOG_SYNC = 'sales_migration';

This ensures that every commit in Snowflake is reflected in Polaris.
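
The CREATE statement above references an external volume (polaris_external_vol) and a catalog integration (sales_migration) used by CATALOG_SYNC. If those objects don't exist yet, they can be created along the lines of the hedged sketch below; the bucket, IAM role, Open Catalog account URL, and OAuth credentials are placeholders, and the exact parameters depend on your cloud provider and Open Catalog configuration.

plaintext
-- Sketch only: external volume covering the bucket used for Iceberg data and metadata
CREATE OR REPLACE EXTERNAL VOLUME polaris_external_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'polaris-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://your-bucket/polaris_external/'
      STORAGE_AWS_ROLE_ARN = '<your-iam-role-arn>'
    )
  );

-- Sketch only: catalog integration that lets CATALOG_SYNC push commits to Open Catalog
CREATE OR REPLACE CATALOG INTEGRATION sales_migration
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (
    CATALOG_URI = 'https://<open-catalog-account>.snowflakecomputing.com/polaris/api/catalog'
    CATALOG_NAME = 'polaris_external'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client-id>'
    OAUTH_CLIENT_SECRET = '<client-secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;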

Step 3: Sync Data with Polaris External Catalog

Once the table is created, populate it with data from an existing Snowflake table.

plaintext
INSERT INTO order_history_ext
SELECT *
FROM sales_db.transactions.order_history
WHERE order_date >= '2023-01-01';
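
Before moving on, it's worth a quick sanity check that the row counts line up for the migrated date range. A simple comparison, run in Snowflake, could look like this:

plaintext
-- Row counts for the migrated date range should match
SELECT
  (SELECT COUNT(*) FROM sales_db.transactions.order_history
     WHERE order_date >= '2023-01-01') AS source_rows,
  (SELECT COUNT(*) FROM sales_db.transactions.order_history_ext) AS synced_rows;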

Step 4: Migrate to Polaris Internal Catalog Using PySpark

With the data synced to Polaris, use PySpark to register the external table into Polaris’ internal catalog.
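
The migration script below calls a helper, get_latest_metadata_json_file, to find the newest Iceberg metadata file for the synced table. That helper isn't part of any library; a hedged sketch using boto3 (assuming AWS credentials are already configured for the bucket) might look like this:

python
import boto3

def get_latest_metadata_json_file(bucket_name, prefix):
    """Return the key of the most recently written *.metadata.json under the prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    metadata_files = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".metadata.json"):
                metadata_files.append(obj)

    if not metadata_files:
        raise FileNotFoundError(
            f"No metadata.json files found under s3://{bucket_name}/{prefix}"
        )

    # The latest table commit corresponds to the most recently modified metadata file.
    return max(metadata_files, key=lambda o: o["LastModified"])["Key"]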

python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PolarisOrderMigration") \
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.polaris.type", "hadoop") \
    .config("spark.sql.catalog.polaris.warehouse", "s3a://your-bucket/polaris_internal/") \
    .getOrCreate()

# Switch to the Polaris catalog
spark.sql("USE CATALOG polaris")
spark.sql("SHOW NAMESPACES").show()

# Fetch the latest Iceberg metadata file for the synced table
bucket_name = 'your-bucket'
prefix = 'transactions/order_history_ext/metadata/'
latest_metadata = get_latest_metadata_json_file(bucket_name, prefix)

# Register the table in the Polaris internal catalog
spark.sql(f'''
CALL polaris.system.register_table(
    table => 'polaris.transactions.order_history',
    metadata_file => 's3a://{bucket_name}/{latest_metadata}'
)
''')

# Verify the migration
spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM polaris.transactions.order_history").show(10)

Step 5: Cleanup and Finalize

Once the data is successfully migrated, drop unnecessary Snowflake tables to finalize the migration.

plaintext
-- Drop the original Snowflake table
DROP TABLE sales_db.transactions.order_history;

-- Drop the external sync table
DROP TABLE sales_db.transactions.order_history_ext;

After deleting the old tables, you're good to go!

Optional: Expose the Polaris Table to Snowflake Using REST Integration

If you need Snowflake to access the Polaris internal catalog table, you can expose it back to Snowflake as an Iceberg table.

plaintext
CREATE OR REPLACE ICEBERG TABLE sales_migrated.transactions.order_history AS
SELECT * FROM polaris.transactions.order_history;
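
Note that the CTAS above copies the data into a Snowflake-managed table. If you want Snowflake to read the Polaris-managed table in place instead, an externally managed (read-only) Iceberg table is an option. A hedged sketch follows, assuming a catalog integration pointing at the Polaris internal catalog (shown here as sales_internal, a hypothetical name) and an external volume that covers the internal catalog's storage location:

plaintext
-- Sketch only: 'sales_internal' is a hypothetical catalog integration for the internal catalog
CREATE OR REPLACE ICEBERG TABLE sales_migrated.transactions.order_history
  EXTERNAL_VOLUME = 'polaris_external_vol'
  CATALOG = 'sales_internal'
  CATALOG_TABLE_NAME = 'order_history';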

Post-Migration Optimizations

After migration, optimize the Iceberg table for performance:

python
# Rewrite data files for optimization
spark.sql("CALL polaris.system.rewrite_data_files('polaris.transactions.order_history')")

# Add partitions (if needed)
spark.sql("ALTER TABLE polaris.transactions.order_history ADD PARTITION FIELD order_date")

# Configure RBAC permissions
spark.sql("GRANT SELECT ON polaris.transactions.order_history TO role data_analyst")
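
Beyond rewriting data files, regular table maintenance keeps Iceberg metadata small. Here is a hedged sketch using standard Iceberg Spark procedures; the cutoff timestamp is a placeholder to adjust for your retention policy.

python
# Expire snapshots older than a chosen cutoff (placeholder timestamp) to shrink metadata
spark.sql("""
    CALL polaris.system.expire_snapshots(
        table => 'polaris.transactions.order_history',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Remove data files that are no longer referenced by any snapshot
spark.sql("""
    CALL polaris.system.remove_orphan_files(
        table => 'polaris.transactions.order_history'
    )
""")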

Frequently Asked Questions (FAQs)

1. What is Polaris Iceberg in Snowflake?

Polaris is Snowflake’s Open Catalog, allowing users to store and manage Apache Iceberg tables externally while maintaining compatibility with Snowflake analytics.

2. How does Polaris differ from Snowflake native tables?

  • Polaris uses Apache Iceberg, which is an open table format.
  • Snowflake native tables are proprietary and locked into the Snowflake ecosystem.
  • Polaris supports multiple compute engines beyond Snowflake.

3. Can I use Polaris Iceberg tables with Trino and Presto?

Yes, Polaris Iceberg tables are fully compatible with Trino, Presto, Apache Spark, and Flink, making them ideal for a multi-engine architecture.

4. What are the cost benefits of moving to Polaris Iceberg?

Polaris Iceberg provides lower storage costs, better query optimization, and open storage flexibility, reducing vendor lock-in costs.

5. How do I ensure data integrity during migration?

  • Enable continuous sync between Snowflake and Polaris.
  • Perform data validation using SELECT COUNT(*) comparisons before switching.
  • Run checksum tests to validate data accuracy.

Conclusion

Migrating from Snowflake to Polaris Iceberg unlocks open storage, better scalability, and interoperability with other analytics engines. With continuous sync, PySpark migration, and post-migration optimizations, you get a fully operational Iceberg table ready for production workloads.
