Apache Iceberg Tutorial: The Ultimate Guide for Beginners

Learn what Apache Iceberg is and how to use it with our step-by-step Apache Iceberg tutorial. Master its features like schema evolution, hidden partitioning, and more.

Are you a data engineer looking to improve your skills and streamline your data infrastructure? Look no further. In this hands-on tutorial, we’ll explore Apache Iceberg, the revolutionary open table format transforming how data is managed in large-scale analytics environments. Whether you’re navigating schema evolution, optimizing partitioning strategies, or ensuring ACID compliance in your data lakes, this guide will equip you with practical insights and actionable steps to harness the full potential of Apache Iceberg.

From understanding its core capabilities to implementing best practices, you’ll gain the knowledge needed to elevate your data engineering workflows and master the intricacies of modern data lake management. Let's take a look at what makes Iceberg a game-changer for data engineers and why it’s becoming a must-have tool in data analytics.

What is Apache Iceberg?

Apache Iceberg is an open table format for large-scale analytics datasets in cloud environments. It addresses many of the limitations of traditional table formats: with features such as schema evolution, hidden partitioning, and ACID transactions, Iceberg makes data lakes easier to manage and ensures reliable, high-performance data processing.

For instance, organizing large datasets into a structured format can be challenging in a cloud storage setup like Amazon S3. With Iceberg, you can easily query this data using SQL while maintaining the scalability of a data lake. Iceberg is an excellent choice for organizations transitioning from traditional data warehouses to modern data lakehouses.

In simple terms, Apache Iceberg bridges the scalability of data lakes with the query capabilities of data warehouses, giving you the best of both worlds.

Brief History & Evolution of Apache Iceberg

Apache Iceberg began as an internal project at Netflix in 2017. It was designed to solve the company's challenges in managing its massive data lakes. At the time, existing solutions like Apache Hive and traditional partitioning methods were insufficient for Netflix's scale and required frequent manual intervention.

Recognizing the project's broader applicability, Netflix open-sourced Iceberg in 2018 and donated it to the Apache Software Foundation, where it entered the Apache Incubator and gained traction thanks to its innovative approach to table management. In May 2020, Apache Iceberg graduated to top-level project status, signifying its maturity and widespread adoption.

Today, Iceberg is supported by major data processing engines like Apache Spark, Flink, and Presto/Trino. Companies such as Apple, Netflix, and LinkedIn actively use it to manage their large-scale analytics workloads.

Core Features of Apache Iceberg

Apache Iceberg’s rich feature set makes it a game-changer for data lake management. Below, we dive deeper into its core features:

Schema Evolution

One of Iceberg's standout features is schema evolution. You can update a table's schema by adding, renaming, dropping, or reordering columns without rewriting the entire dataset, which is invaluable in dynamic environments where data requirements change frequently. Because these changes are tracked in metadata, existing queries keep working as the schema evolves.

Hidden Partitioning

Traditional table formats require explicit partitioning columns, which can lead to suboptimal query performance. Iceberg’s hidden partitioning automatically tracks and optimizes partitions without exposing partition columns to the end user. For instance, if you’re querying sales data partitioned by date, Iceberg will handle partition pruning internally, reducing query times.
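
For example, a partition transform is declared once at table-creation time, and queries never reference partition columns directly. A minimal sketch in Spark SQL, where the table and column names are illustrative and a configured Iceberg catalog named my_catalog is assumed:

python
# Hypothetical table; assumes a Spark session with an Iceberg catalog named "my_catalog".
spark.sql("""
    CREATE TABLE my_catalog.db.sales (
        id      BIGINT,
        amount  DOUBLE,
        sale_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(sale_ts))  -- hidden partition transform on the timestamp column
""")

# Queries filter on the raw column; Iceberg prunes partitions behind the scenes.
spark.sql("SELECT sum(amount) FROM my_catalog.db.sales WHERE sale_ts >= '2025-01-01'").show()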

ACID Transactions

Iceberg provides full ACID compliance, ensuring your data is consistent and reliable. This means you can perform concurrent read and write operations without risking data corruption—a crucial feature for enterprise-grade analytics.

Time Travel

With Iceberg’s time travel feature, you can query historical snapshots of your data. If a dataset was accidentally overwritten, you can roll back to a previous version and recover the lost data. This feature can be a lifesaver in data recovery scenarios or historical analysis.

Table Versioning

Iceberg maintains a complete history of table changes as snapshots, enabling you to audit changes or debug issues. Each snapshot records when the change was committed, which operation produced it, and which data files were added or removed.
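
For instance, this history can be inspected directly through Iceberg's built-in metadata tables; a quick sketch, assuming a table named my_catalog.db.table_name:

python
# Each row is a snapshot: when it was committed and which operation produced it.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM my_catalog.db.table_name.snapshots
""").show()

# The history table shows which snapshot was the table's current state at each point in time.
spark.sql("SELECT * FROM my_catalog.db.table_name.history").show()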

Multi-Cloud Support

Iceberg’s cloud-agnostic design allows it to work seamlessly with object storage solutions like Amazon S3, Google Cloud Storage, and Azure Blob Storage. This flexibility is critical for organizations with multi-cloud strategies.

Compression and Serialization

Iceberg supports various compression codecs to reduce data size in storage and in transit. Users can further improve performance and efficiency by choosing appropriate file and serialization formats and by tuning settings to match their data's characteristics.
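
For example, the compression codec used for new data files is controlled through table properties; a small sketch where the codec choice and table name are illustrative:

python
# Use Zstandard compression for future Parquet writes to this table.
spark.sql("""
    ALTER TABLE my_catalog.db.table_name
    SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd')
""")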

Understanding the Iceberg Architecture

[Figure: Apache Iceberg architecture. Source: dremio.com]

Apache Iceberg's architecture ensures scalability, reliability, and ease of use. It comprises three main components: metadata, data files, and catalogs.

Metadata

Iceberg uses metadata to store table schemas, snapshots, and partitioning information. Metadata is organized into manifest files and manifest lists. For example:

  • Manifest Files: Contain details about data files, including their paths, record counts, and partition values.
  • Manifest Lists: Act as a directory for all manifest files, enabling quick table scans.
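
Both layers can be inspected through Iceberg's metadata tables, as in this sketch (assuming a table named my_catalog.db.table_name):

python
# One row per manifest file tracked by the current snapshot.
spark.sql("SELECT path, added_data_files_count FROM my_catalog.db.table_name.manifests").show()

# One row per data file, including its location and record count.
spark.sql("SELECT file_path, record_count FROM my_catalog.db.table_name.files").show()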

Data Files

Iceberg stores data in immutable file formats like Parquet, Avro, or ORC. This immutability ensures consistency and simplifies table management. For instance, if you update a table, Iceberg creates new data files instead of modifying existing ones.

Catalogs

Catalogs are responsible for table discovery and management. Iceberg supports multiple catalog implementations, including Hive Metastore, AWS Glue, and REST-based catalogs. A catalog lets you interact with Iceberg tables using your preferred query engine.
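
Switching catalog implementations is largely a configuration change. As a hedged sketch, pointing Spark at a REST-based catalog might look like this (the endpoint is a placeholder, and the Iceberg Spark runtime package is assumed to be on the classpath):

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Iceberg REST catalog example")
    .config("spark.sql.catalog.rest_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest_catalog.type", "rest")                  # REST catalog implementation
    .config("spark.sql.catalog.rest_catalog.uri", "http://localhost:8181")  # placeholder endpoint
    .getOrCreate()
)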

Apache Iceberg Use Cases

Data Lakehouse

Apache Iceberg transforms traditional data lakes into structured and transactional data lakehouses. This allows organizations to run SQL-based analytics and machine learning workloads directly on their data lakes. For example, an e-commerce company can use Iceberg to analyze customer behavior in real-time while maintaining a clean, organized data lake.

Big Data Analytics

Iceberg’s compatibility with distributed processing engines makes it ideal for big data analytics. Companies handling petabytes of data can leverage Iceberg to execute complex queries efficiently.

Data Governance

Features like schema evolution, time travel, and partition pruning enable robust data governance. For example, financial institutions can use Iceberg to ensure compliance with data retention policies by retaining historical snapshots of critical datasets.

Implementing Apache Iceberg: A Step-by-Step Tutorial

Implementing Apache Iceberg involves several steps, from creating a catalog to creating and querying tables. Below is a detailed walkthrough:

Prerequisites

Before getting started, ensure you have:

  • A cloud storage solution (e.g., Amazon S3, Google Cloud Storage, or Azure Blob Storage).
  • A compatible processing engine like Apache Spark or Flink.

Step 1: Set Up a Catalog

The catalog is where Iceberg stores metadata about your tables, such as schema and partitioning information. How you configure it depends on your processing engine. For Apache Spark, a catalog backed by a Hive Metastore can be configured when creating the Spark session (make sure the Iceberg Spark runtime package is on Spark's classpath):

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Iceberg tutorial")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hive")
    .config("spark.sql.catalog.my_catalog.uri", "thrift://localhost:9083")
    .getOrCreate()
)

Step 2: Create an Iceberg Table

Create a new table with the following command:

sql
CREATE TABLE my_catalog.db.table_name (
    id   BIGINT,
    data STRING
)
USING iceberg;

Step 3: Insert Data into Apache Iceberg Table

Use your processing engine to insert data into the table. In PySpark:

python
df = spark.createDataFrame([(1, 'value1'), (2, 'value2')], ['id', 'data'])
df.write.format("iceberg").mode("append").saveAsTable("my_catalog.db.table_name")

Step 4: Query the Data

Retrieve data from the table using SQL:

sql
SELECT * FROM my_catalog.db.table_name;
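
Beyond appends and reads, Iceberg tables also support row-level SQL operations such as UPDATE, DELETE, and MERGE INTO from Spark. A hedged sketch of an upsert, assuming Iceberg's Spark SQL extensions are enabled and a staging table with the same columns exists:

python
spark.sql("""
    MERGE INTO my_catalog.db.table_name AS target
    USING my_catalog.db.table_name_updates AS source  -- hypothetical staging table
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET target.data = source.data
    WHEN NOT MATCHED THEN INSERT *
""")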

Best Tools and Platforms for Apache Iceberg Integrations

Apache Iceberg integrates seamlessly with major data processing engines and platforms. Here are some examples:

  • Estuary Flow: Enables real-time and batch data integration with no-code pipelines from hundreds of sources.
  • Apache Spark: Provides native support for Iceberg tables, enabling high-performance analytics.
  • Presto/Trino: Allows SQL-based querying over Iceberg tables, which is ideal for interactive analytics.
  • Apache Flink: Supports real-time streaming to and from Iceberg tables.
  • Snowflake: Supports Iceberg tables, allowing organizations to integrate Iceberg into their existing warehousing workflows and hybrid cloud architectures.

What Makes Apache Iceberg Unique?

1. Schema Evolution Without Downtime

One of Iceberg’s defining features is its ability to handle schema changes like adding, dropping, or renaming columns without rewriting the underlying data. This is crucial for dynamic data environments where schema changes are frequent.

Example: Adding a Column in PySpark

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder
    .appName("Iceberg Schema Evolution")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hadoop")
    .config("spark.sql.catalog.spark_catalog.warehouse", "s3://your-data-lake/warehouse")
    .getOrCreate()
)

# Load an Iceberg table
table = spark.read.format("iceberg").load("spark_catalog.default.my_table")

# Add a new column
updated_table = table.withColumn("new_column", lit("default_value"))

# Write back to the Iceberg table
updated_table.write.format("iceberg").mode("overwrite").save("spark_catalog.default.my_table")

Iceberg manages the metadata, so readers continue to query the table seamlessly even as the schema evolves.
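
In practice, adding a column can also be done as a pure metadata operation with ALTER TABLE, which avoids rewriting any data files; a short sketch using the same hypothetical table:

python
# Metadata-only schema change: existing data files are left untouched.
spark.sql("ALTER TABLE spark_catalog.default.my_table ADD COLUMNS (new_column STRING)")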

2. Time Travel Queries

Iceberg supports time travel, allowing users to query historical snapshots of the data. This is incredibly useful for debugging, auditing, and reproducibility.

Example: Querying a Snapshot by Timestamp

python
# Query a snapshot from a specific point in time.
# The "as-of-timestamp" read option expects milliseconds since the Unix epoch.
timestamp_ms = 1735732800000  # 2025-01-01 12:00:00 UTC

historical_data = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", timestamp_ms)
    .load("spark_catalog.default.my_table")
)
historical_data.show()

With time travel, you can quickly access a previous state of your data without restoring backups or managing separate versions.
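
You can also pin a read to an exact snapshot ID, for example one taken from the table's snapshots metadata table (the ID below is illustrative):

python
historical_by_id = (
    spark.read.format("iceberg")
    .option("snapshot-id", 1234567890123456789)  # hypothetical snapshot ID
    .load("spark_catalog.default.my_table")
)
historical_by_id.show()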

3. Partitioning Without Performance Bottlenecks

Traditional partitioning schemes (like Hive) can lead to performance bottlenecks due to too many small files. Iceberg introduces hidden partitioning, which abstracts partitioning from users, optimizes query planning, and eliminates these bottlenecks.

Example: Querying Optimized Partitions

python
# Assume the table is partitioned by a hidden transform of event_date
partitioned_data = spark.read.format("iceberg").load("spark_catalog.default.my_table")

# Iceberg prunes irrelevant partitions automatically during query planning
filtered_data = partitioned_data.filter("event_date = '2025-01-01'")
filtered_data.show()

Iceberg handles partition pruning under the hood, providing better performance without requiring manual partition management.
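
Iceberg also supports partition evolution, so the partitioning scheme can change over time without rewriting existing data. A hedged sketch, assuming Iceberg's Spark SQL extensions are enabled and an event_ts timestamp column exists:

python
# New data will be partitioned by day of event_ts; existing files keep their old layout.
spark.sql("ALTER TABLE spark_catalog.default.my_table ADD PARTITION FIELD days(event_ts)")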

4. ACID Transactions

Iceberg supports full ACID compliance, ensuring consistency and reliability during concurrent reads and writes. This makes it suitable for real-time use cases and collaborative environments.

Example: Writing Data with ACID Guarantees

python
# Append new data to the Iceberg table
new_data = spark.createDataFrame(
    [("user_1", "2025-01-10", 100.0)],
    ["user_id", "event_date", "amount"]
)

new_data.write.format("iceberg") \
    .mode("append") \
    .save("spark_catalog.default.my_table")

Even with multiple writers, Iceberg ensures the table remains consistent.

5. Data Deletion and Retention Policies

Iceberg provides efficient support for data deletion, including GDPR-compliant record-level deletes, and enables retention policies for cleaning up old data.

Example: Deleting Records in PySpark

python
# Enable row-level (merge-on-read) deletes
spark.sql(
    "ALTER TABLE spark_catalog.default.my_table "
    "SET TBLPROPERTIES ('write.delete.mode'='merge-on-read')"
)

# Execute a delete operation
spark.sql("DELETE FROM spark_catalog.default.my_table WHERE event_date < '2024-01-01'")

This ensures compliance with legal requirements and optimizes storage by purging outdated records.
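
Retention is typically enforced by expiring old snapshots, which lets Iceberg clean up the data files they reference once nothing current needs them. A sketch using Iceberg's expire_snapshots procedure (requires the Spark SQL extensions; the cutoff date is illustrative):

python
# Remove snapshots older than the cutoff and delete files no remaining snapshot references.
spark.sql("""
    CALL spark_catalog.system.expire_snapshots(
        table => 'default.my_table',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")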

6. Integration with the Open Data Ecosystem

Iceberg integrates seamlessly with many tools, including PySpark, Flink, Trino, Presto, and Hive. This flexibility makes it a great choice for modern data lake architectures.

Example: Reading Iceberg Tables in PySpark

python
data = spark.read.format("iceberg").load("spark_catalog.default.my_table")
data.show()

Its compatibility with multiple engines ensures that teams can choose the best tools for their workflows.

Comparing Apache Iceberg with Other Table Formats

When choosing a table format for your data lakehouse architecture, the decision often comes down to Apache Iceberg, Apache Hive, or Delta Lake. Each has its strengths and weaknesses, but Iceberg stands out as a modern, open standard built for large-scale, flexible, and reliable data processing. Here, we compare Iceberg with Hive and Delta Lake, focusing on key differentiators.

Iceberg vs. Delta Lake

Delta Lake emerged as a competitor to Apache Iceberg, offering similar modern table format features. However, Iceberg distinguishes itself in areas like multi-cloud compatibility and advanced metadata management:

  • Time Travel: Both Iceberg and Delta Lake support time travel, allowing users to query historical data states. This feature is critical for debugging, compliance, and recreating past reports.
  • Multi-Cloud Support: Iceberg shines with its open standard and is compatible across major cloud platforms and on-premises environments. Delta Lake, though robust, is tightly integrated with the Databricks ecosystem, limiting its flexibility in multi-cloud setups.
  • Performance: Iceberg uses a highly optimized metadata layer, which reduces query latency and improves scalability. Its metadata tree is designed for large-scale deployments, whereas Delta Lake’s metadata management can become a bottleneck as data volumes grow.

For a more detailed comparison of Iceberg and Delta Lake, check out this deep dive article: Iceberg vs Delta Lake

Iceberg vs. Hudi

Apache Hudi is another table format that competes with Iceberg and is designed primarily for real-time use cases. While both effectively manage large-scale data lakes, they target slightly different needs. Here’s a comparison:

| Feature | Iceberg | Hudi |
| --- | --- | --- |
| Primary Use Case | Batch & Streaming | Real-Time Updates |
| ACID Compliance | Full | Full |
| Schema Evolution | Seamless | Limited |
| Indexing | Metadata-Based | Built-In |
| Streaming Integration | Supported | Strong Focus |

  • Primary Use Case: Iceberg is versatile, excelling in batch and streaming workloads, making it a strong choice for general-purpose data lakehouse architectures. Hudi, on the other hand, focuses heavily on real-time data ingestion and update scenarios, such as CDC (Change Data Capture) use cases.
  • ACID Compliance: Both Iceberg and Hudi provide full ACID compliance, ensuring consistency and reliability for concurrent operations. However, Hudi's design is particularly optimized for frequent, small updates, while Iceberg handles large-scale batch and streaming operations equally well.
  • Schema Evolution: Iceberg allows seamless schema evolution, such as renaming columns or adding new fields, without rewriting data. Hudi supports schema evolution but often requires additional configuration, which can lead to increased complexity in practice.
  • Indexing: Hudi includes built-in indexing to accelerate update and delete operations, which benefits workloads with frequent changes. Iceberg relies on efficient metadata and query optimization techniques, making it better suited for workloads prioritizing analytical performance.
  • Streaming Integration: While Iceberg supports streaming data with connectors like Apache Flink and Spark Structured Streaming, Hudi has a stronger real-time focus, with features like upserts and near-real-time ingestion as core capabilities.

For a more detailed comparison of Iceberg and Hudi, check out this deep dive article: Iceberg vs Hudi 

Performance Benefits of Apache Iceberg

Apache Iceberg provides several performance benefits:

Optimized Query Execution

Iceberg’s hidden partitioning eliminates manual partition column management, significantly improving query performance. For example, a query on a year-long sales dataset will automatically prune irrelevant partitions, reducing scan times.

Efficient Metadata Management

Iceberg’s scalable metadata ensures efficient querying of even petabyte-scale datasets. Unlike traditional formats, metadata operations in Iceberg remain fast regardless of table size.

Concurrent Workflows

With support for ACID transactions, Iceberg allows multiple users to read and write to the same table without conflicts. This is essential for collaborative analytics environments.

Challenges and Limitations of Apache Iceberg

While Apache Iceberg is powerful, it is not without its challenges. Understanding these limitations can help you plan your implementation effectively:

Complexity in Setup

Setting up Apache Iceberg requires familiarity with metadata catalogs, object storage configurations, and compatible processing engines. Organizations without existing expertise in these areas may face a steep learning curve.

Limited Native Tooling

While Iceberg integrates with major platforms like Apache Spark and Flink, native UI tools for managing Iceberg tables are still maturing. This can make troubleshooting and table maintenance more challenging for less technical users.

Performance Overhead for Small Tables

Although Iceberg is optimized for large datasets, it can introduce unnecessary overhead for smaller tables due to its metadata and file management processes. Simpler table formats may be more appropriate for use cases with small-scale datasets.

Compatibility Gaps

While Iceberg supports various engines and platforms, compatibility issues may arise depending on the versions or configurations used. For example, advanced features like time travel may not work seamlessly across all integrations.

Apache Iceberg Best Practices

Data Partitioning

Data partitioning is a critical technique for optimizing data processing with Apache Iceberg. Users can enhance query performance and simplify data management by dividing datasets into manageable subsets based on common attributes.

[Figure: Partition evolution visualized. Source: iceberg.apache.org]

Key Considerations for Partitioning

  • Partitioning Key: Select a partitioning key that aligns with your query patterns. Common keys include date, geographic location, and category; choose based on the data's characteristics and how it is queried.
  • Partitioning Strategies: Use partitioning strategies that match typical query patterns to ensure efficient data access and minimize bottlenecks during retrieval. Composite partitioning can combine multiple strategies for more intricate data organization.
  • Partition Size: Aim for partitions of similar size to optimize performance and prevent issues arising from uneven data distribution. Expected data growth and update frequency should also inform the chosen partitioning strategy.

Performance Optimization

Regularly monitoring and tuning data processing workflows is crucial for maintaining optimal performance.

  • Performance Testing: Conduct performance tests using realistic data volumes early in the project to refine data formats and access patterns.
  • Monitoring Tools: Use appropriate tools and metrics to monitor data operations and identify potential bottlenecks. Regular analysis of system performance will facilitate timely optimizations.
  • Query Tuning: Analyze query performance through profiling tools to identify inefficiencies. Modifying queries, execution plans, and schema can significantly enhance performance.
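
A common optimization is compacting many small files into fewer large ones. A sketch using Iceberg's rewrite_data_files procedure (requires the Spark SQL extensions; the table name is illustrative):

python
# Rewrite small data files into larger ones to reduce per-file overhead at query time.
spark.sql("CALL spark_catalog.system.rewrite_data_files(table => 'default.my_table')")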

Data Recovery and Rollbacks

Apache Iceberg's time travel feature offers robust data recovery options, enabling users to revert to previous snapshots when needed.

  • Snapshot Management: Maintain historical snapshots to support compliance and auditing requirements, allowing for easy tracking of data changes and restoration to known states.
  • Error Investigation: Use time travel queries to investigate errors by examining the dataset at specific points in time, which can help identify the root causes of issues in data processing pipelines.
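
For example, restoring a table to a known-good state can be done with Iceberg's rollback procedure (requires the Spark SQL extensions; the snapshot ID is illustrative):

python
# Point the table's current state back at an earlier snapshot without copying data.
spark.sql("""
    CALL spark_catalog.system.rollback_to_snapshot(
        table => 'default.my_table',
        snapshot_id => 1234567890123456789
    )
""")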

Data Archiving and Retention Strategies

Effective data lifecycle management is essential for optimizing storage usage. Employing partitioned archiving strategies allows for systematically organizing archived data based on criteria such as periods or categories.

This can streamline access and management of archived datasets while preserving the performance of active datasets. By following these best practices, organizations can leverage the full potential of Apache Iceberg, ensuring efficient data processing and valuable insights from their big data environments.

Future of Apache Iceberg

Apache Iceberg is continuously evolving, with active contributions from a growing community of developers and enterprises. Some anticipated developments include:

  • Enhanced Integration: More streamlined compatibility with popular BI tools and data orchestration platforms.
  • Improved Streaming Support: Expanded features for real-time data ingestion and processing.
  • Native UI Tools: Development of graphical interfaces for easier management and monitoring of Iceberg tables.
  • Broader Adoption: As more companies transition to data lakehouse architectures, Iceberg will likely become the de facto table format for large-scale analytics.

Conclusion

Apache Iceberg is a revolutionary table format that bridges the gap between traditional data lakes and modern data lakehouses. Its powerful features like schema evolution, hidden partitioning, and time travel make it an ideal choice for enterprises looking to optimize their analytics workflows.

By understanding its architecture, use cases, and integrations, you can unlock the full potential of Iceberg for your organization. Whether you’re an engineer working with big data or a decision-maker seeking scalable analytics solutions, Apache Iceberg offers the tools to future-proof your data strategy.

Ready to dive deeper? Start experimenting with Apache Iceberg today and discover how it can transform your data lake into a robust analytics engine.


About the author

Dani Pálma

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
