
Apache Iceberg Time Travel Guide: Snapshots, Queries & Rollbacks

Learn how to query historical data in Apache Iceberg using time travel, snapshots, and incremental reads with PySpark. Rollback, compare, and audit seamlessly.


Introduction

Apache Iceberg is a table format built for large-scale analytics. It brings ACID transactions, schema evolution, and time travel to data lakes. Unlike traditional data lakes, Iceberg tables maintain a complete history of changes, allowing users to query historical versions without restoring backups or creating new tables.

Time travel in Iceberg enables:

  • Point-in-time queries: Retrieve data as it existed at any past snapshot.
  • Change tracking: Compare historical data versions to analyze trends.
  • Rollback and recovery: Revert to an older snapshot to fix errors.
  • Audit and compliance: Ensure data integrity by querying past states.

This guide explores how to use time travel with Apache Iceberg in PySpark, demonstrating real-world scenarios for data engineers.

Just getting started with Apache Iceberg? Check out our beginner's guide here.

Setting up an Iceberg Table

To demonstrate Iceberg's time travel, let's create a table and insert some sample data.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Iceberg Time Travel") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "hdfs://namenode:9000/warehouse") \
    .getOrCreate()

# Create an Iceberg table
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.db.orders (
        order_id BIGINT,
        customer STRING,
        amount DOUBLE,
        created_at TIMESTAMP
    )
    USING iceberg
""")

# Insert some initial data
spark.sql("""
    INSERT INTO iceberg.db.orders VALUES
    (1, 'Alice', 100.0, '2024-02-01 12:00:00'),
    (2, 'Bob', 200.0, '2024-02-02 13:00:00')
""")
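
To make time travel interesting, the table needs more than one snapshot, so let's commit a second write; every insert, update, or delete on an Iceberg table creates a new snapshot. The extra row below is illustrative.

python
# Insert a second batch so the table has at least two snapshots to compare
spark.sql("""
    INSERT INTO iceberg.db.orders VALUES
    (10, 'Carol', 150.0, '2024-02-02 15:00:00')
""")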

Now, let's check the available snapshots:

python
spark.sql("SELECT snapshot_id, committed_at FROM iceberg.db.orders.snapshots").show()

Querying Historical Snapshots in Apache Iceberg

Once we have snapshots, we can query past versions of the data using either snapshot IDs or timestamps.

Querying by snapshot ID

python
snapshot_id = "1234567890"  # Replace with an actual snapshot ID

df = spark.read.format("iceberg") \
    .option("snapshot-id", snapshot_id) \
    .load("iceberg.db.orders")
df.show()

Querying by timestamp

python
from datetime import datetime

# The as-of-timestamp option expects milliseconds since the Unix epoch,
# so convert the human-readable timestamp first
query_time = "2024-02-02 12:00:00"
as_of_millis = int(datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S").timestamp() * 1000)

df = spark.read.format("iceberg") \
    .option("as-of-timestamp", as_of_millis) \
    .load("iceberg.db.orders")
df.show()

This returns the dataset as it existed at the specified time.
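
Spark also accepts an equivalent SQL form for time travel, which avoids the epoch-millisecond conversion; a quick sketch (the snapshot ID is illustrative):

python
# SQL time travel: TIMESTAMP AS OF takes a timestamp string, VERSION AS OF a snapshot ID
spark.sql("SELECT * FROM iceberg.db.orders TIMESTAMP AS OF '2024-02-02 12:00:00'").show()
spark.sql("SELECT * FROM iceberg.db.orders VERSION AS OF 1234567890").show()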

Incremental Data Retrieval with Iceberg Snapshots

Iceberg supports incremental reads, allowing efficient processing of only new or changed records.

python
start_snapshot_id = "1234567890"  # Replace with actual snapshot IDs
end_snapshot_id = "1234567891"

df = spark.read.format("iceberg") \
    .option("start-snapshot-id", start_snapshot_id) \
    .option("end-snapshot-id", end_snapshot_id) \
    .load("iceberg.db.orders")
df.show()

This retrieves only the data changes between the two snapshots.

Using Tags and Branches for Snapshot Management in Iceberg

Iceberg supports tags and branches for managing data history.

Creating a tag

A tag is a named pointer to a specific snapshot.

python
spark.sql("ALTER TABLE iceberg.db.orders CREATE TAG quarterly_backup RETAIN 365 DAYS")

Now, we can query data using the tag:

python
spark.sql("SELECT * FROM iceberg.db.orders VERSION AS OF 'quarterly_backup'").show()

Creating a branch

A branch is an independently writable line of the table's history: a named version of the dataset that can evolve separately from the main branch.

python
spark.sql("ALTER TABLE iceberg.db.orders CREATE BRANCH experimental")

We can insert data into the branch without affecting the main table:

python
spark.sql("""
    INSERT INTO iceberg.db.orders.branch_experimental VALUES
    (3, 'Charlie', 300.0, '2024-02-03 14:00:00')
""")

Compare the branch with the main table:

python
df_main = spark.sql("SELECT * FROM iceberg.db.orders")
df_branch = spark.sql("SELECT * FROM iceberg.db.orders.branch_experimental")

# Rows present on the experimental branch but not on main
df_branch.exceptAll(df_main).show()

This helps in A/B testing or sandboxing new changes.
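
If the experiment proves out, newer Iceberg releases also provide a fast_forward procedure to publish the branch back to main; a minimal sketch, assuming the procedure is available in your Iceberg version:

python
# Fast-forward the main branch to the head of the experimental branch
spark.sql("CALL iceberg.system.fast_forward('db.orders', 'main', 'experimental')")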

Rolling Back Changes

Mistakes happen. Iceberg makes it easy to roll back to a previous snapshot.

python
previous_snapshot_id = 1234567890  # Replace with an actual snapshot ID

# Roll the table back to a specific snapshot with the rollback_to_snapshot procedure
spark.sql(f"CALL iceberg.system.rollback_to_snapshot('db.orders', {previous_snapshot_id})")

This reverts the table to an earlier state without losing history.
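
Because a rollback only moves the table's current pointer, the full lineage remains queryable. One way to confirm what happened is to inspect the history metadata table:

python
# Each row records when a snapshot became current and whether it is an
# ancestor of the table's current state
spark.sql("""
    SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
    FROM iceberg.db.orders.history
    ORDER BY made_current_at
""").show(truncate=False)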

Real-World Use Cases of Apache Iceberg Time Travel

Apache Iceberg's time travel functionality is invaluable in multiple real-world scenarios. Here are some practical use cases where data engineers and analysts leverage this feature:

1. Auditing and Compliance

Many industries, such as finance, healthcare, and e-commerce, require maintaining historical data for regulatory compliance and auditability. Iceberg simplifies this process by allowing users to query data as it existed at any point in time without complex ETL processes or additional snapshot storage.

Example:
A financial institution needs to verify transactions from six months ago to ensure compliance with regulatory requirements. Instead of restoring backups or replaying logs, they can query the table as it existed at that time:

python
from datetime import datetime

# Convert the audit timestamp to epoch milliseconds for the as-of-timestamp option
query_time = "2023-08-01 00:00:00"
as_of_millis = int(datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S").timestamp() * 1000)

df = spark.read.format("iceberg") \
    .option("as-of-timestamp", as_of_millis) \
    .load("iceberg.db.transactions")
df.show()

This enables seamless data auditing without disrupting the live dataset.

2. Debugging and Data Quality Validation

Unexpected data discrepancies can arise due to bad transformations, late-arriving data, or schema changes. Time travel allows teams to investigate past data states to pinpoint when and how an issue occurred.

Example:
A data engineer notices that revenue numbers for a specific day look off. By comparing snapshots, they can identify whether incorrect records were introduced in a specific ingestion job.

python
start_snapshot_id = "1234567890"  # Replace with the snapshot IDs around the suspect ingestion job
end_snapshot_id = "1234567891"

df = spark.read.format("iceberg") \
    .option("start-snapshot-id", start_snapshot_id) \
    .option("end-snapshot-id", end_snapshot_id) \
    .load("iceberg.db.sales")
df.show()

This makes root cause analysis much faster, eliminating the need to reconstruct past datasets manually.

3. Machine Learning Model Training on Historical Data

Data scientists often need to train models on historical datasets rather than the latest version of the data. Iceberg’s time travel allows ML teams to query past snapshots to create reproducible datasets.

Example:
An ML team training a fraud detection model needs transaction data as it existed last year to maintain consistency across different training iterations.

python
from datetime import datetime

# Convert the training cutoff to epoch milliseconds for the as-of-timestamp option
query_time = "2023-02-01 00:00:00"
as_of_millis = int(datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S").timestamp() * 1000)

df = spark.read.format("iceberg") \
    .option("as-of-timestamp", as_of_millis) \
    .load("iceberg.db.transactions")
df.show()

This ensures that models can be trained on point-in-time datasets without interference from recent data changes.
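
To keep a training set reproducible even after old snapshots are expired, one option is to pin the chosen snapshot with a long-retention tag and have training jobs read the tag; a sketch with an illustrative snapshot ID and tag name:

python
# Tag the training snapshot so it survives routine snapshot expiration
training_snapshot_id = 1234567890  # Replace with the snapshot used for training
spark.sql(f"""
    ALTER TABLE iceberg.db.transactions
    CREATE TAG fraud_training_2023_02 AS OF VERSION {training_snapshot_id}
    RETAIN 730 DAYS
""")

# Training jobs can then reference the tag instead of a raw timestamp
spark.sql("SELECT * FROM iceberg.db.transactions VERSION AS OF 'fraud_training_2023_02'").show()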

4. Data Recovery and Rollback

Accidental deletions or data corruption can have severe consequences. Iceberg’s rollback feature enables restoring previous table versions instantly.

Example:
A data engineering team accidentally deletes customer orders from a production table. Instead of manually recovering data from backups, they roll back to a previous snapshot.

python
previous_snapshot_id = 1234567890  # Replace with the last known good snapshot ID

# Restore the table to its pre-deletion state with the rollback_to_snapshot procedure
spark.sql(f"CALL iceberg.system.rollback_to_snapshot('db.orders', {previous_snapshot_id})")

This avoids downtime and ensures data integrity without manual intervention.

Limitations & Challenges of Iceberg Time Travel

While Apache Iceberg’s time travel feature is powerful, it comes with some trade-offs and challenges. Understanding these limitations helps in designing better Iceberg-based data architectures.

1. Storage Overhead from Retained Snapshots

Every Iceberg snapshot maintains metadata and references to previous data files. If snapshots are not managed properly, storage costs can grow significantly.

Solution: Implement a snapshot retention policy to remove unnecessary historical snapshots.

python
from datetime import datetime, timedelta

# Expire snapshots older than 30 days with the expire_snapshots procedure
cutoff = (datetime.now() - timedelta(days=30)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"CALL iceberg.system.expire_snapshots(table => 'db.orders', older_than => TIMESTAMP '{cutoff}')")

2. Query Performance on Large Datasets

Querying historical data requires scanning additional metadata and older file versions, which can impact performance if not optimized.

Solution:

  • Use partitioning and hidden partitioning to minimize the data scanned (see the sketch after this list).
  • Enable metadata caching in Iceberg to speed up queries.
  • Use incremental queries instead of full snapshot retrieval where possible.
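
As a concrete example of the partitioning point above, a table can declare a hidden partition on its event timestamp so that both current and time travel queries prune files by day; a minimal sketch reusing the orders schema from earlier (the table name is illustrative):

python
# Hidden partitioning: Iceberg derives days(created_at) automatically, so
# queries filtering on created_at prune files without a separate partition column
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.db.orders_partitioned (
        order_id BIGINT,
        customer STRING,
        amount DOUBLE,
        created_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(created_at))
""")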

3. Schema Evolution Considerations

Iceberg supports schema evolution, but querying historical data can introduce challenges if the schema has changed.

Example Problem:
A column was renamed from customer_name to client_name, but a time travel query referencing customer_name might fail.

Solution:

  • Maintain proper schema versioning and update queries accordingly.
  • Use column aliases in downstream queries (for example, SELECT client_name AS customer_name) to preserve backward compatibility.

python
# The rename is a metadata-only change; update downstream time travel queries to reference client_name
spark.sql("ALTER TABLE iceberg.db.orders RENAME COLUMN customer_name TO client_name")

4. Compatibility with Other Query Engines

While Apache Iceberg supports multiple query engines (Spark, Trino, Presto, Hive), not all engines support full time travel features.

Solution:

  • Ensure the query engine used fully supports Iceberg time travel before implementing it in production.
  • Validate that rollback operations behave consistently across different compute platforms.

Conclusion

Apache Iceberg’s time travel is a game-changer for data engineering. It provides:

  • Fast, efficient historical queries
  • Incremental data processing
  • Seamless rollback and versioning
  • Tagging and branching for experimentation

By leveraging Iceberg's time travel, you can make your data lake reliable, traceable, and auditable.

Want to simplify Iceberg-based CDC pipelines? Explore Estuary Flow for real-time data streaming.
