
Introduction
Apache Iceberg is a table format built for large-scale analytics. It brings ACID transactions, schema evolution, and time travel to data lakes. Unlike traditional data lakes, Iceberg tables maintain a complete history of changes, allowing users to query historical versions without restoring backups or creating new tables.
Time travel in Iceberg enables:
- Point-in-time queries: Retrieve data as it existed at any past snapshot.
- Change tracking: Compare historical data versions to analyze trends.
- Rollback and recovery: Revert to an older snapshot to fix errors.
- Audit and compliance: Ensure data integrity by querying past states.
This guide explores how to use time travel with Apache Iceberg in PySpark, demonstrating real-world scenarios for data engineers.
Just getting started with Apache Iceberg? Check out our beginner's guide here.
Setting up an Iceberg Table
To demonstrate Iceberg's time travel, let's create a table and insert some sample data.
```python
from pyspark.sql import SparkSession

# The Iceberg SQL extensions enable the ALTER TABLE and CALL statements used later.
spark = SparkSession.builder \
    .appName("Iceberg Time Travel") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "hdfs://namenode:9000/warehouse") \
    .getOrCreate()
```
```python
# Create an Iceberg table
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.db.orders (
        order_id BIGINT,
        customer STRING,
        amount DOUBLE,
        created_at TIMESTAMP
    )
    USING iceberg
""")

# Insert some initial data
spark.sql("""
    INSERT INTO iceberg.db.orders VALUES
        (1, 'Alice', 100.0, '2024-02-01 12:00:00'),
        (2, 'Bob', 200.0, '2024-02-02 13:00:00')
""")
```
Now, let's check the available snapshots:
```python
spark.sql("SELECT snapshot_id, committed_at FROM iceberg.db.orders.snapshots").show()
```
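Snapshot IDs are long numeric values, so rather than copying one by hand you can capture the latest ID programmatically. A small sketch, ordering by the committed_at column from the snapshots metadata table above:

```python
# Grab the most recent snapshot ID for use in later time travel queries
latest_snapshot_id = spark.sql("""
    SELECT snapshot_id FROM iceberg.db.orders.snapshots
    ORDER BY committed_at DESC LIMIT 1
""").first()["snapshot_id"]
print(latest_snapshot_id)
```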
Querying Historical Snapshots in Apache Iceberg
Once we have snapshots, we can query past versions of the data using either snapshot IDs or timestamps.
Querying by snapshot ID
```python
snapshot_id = "1234567890"  # Replace with an actual snapshot ID from the snapshots table
df = spark.read.format("iceberg") \
    .option("snapshot-id", snapshot_id) \
    .load("iceberg.db.orders")
df.show()
```
Querying by timestamp
```python
from datetime import datetime

# as-of-timestamp expects milliseconds since the Unix epoch, not a timestamp string
query_time = "2024-02-02 12:00:00"
query_time_ms = str(int(datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S").timestamp() * 1000))

df = spark.read.format("iceberg") \
    .option("as-of-timestamp", query_time_ms) \
    .load("iceberg.db.orders")
df.show()
```
This returns the dataset as it existed at the specified time.
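If you prefer SQL, Spark's TIMESTAMP AS OF clause accepts a human-readable timestamp directly, which avoids the epoch-millisecond conversion above:

```python
spark.sql("""
    SELECT * FROM iceberg.db.orders TIMESTAMP AS OF '2024-02-02 12:00:00'
""").show()
```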
Incremental Data Retrieval with Iceberg Snapshots
Iceberg supports incremental reads, allowing efficient processing of only new or changed records.
```python
start_snapshot_id = "1234567890"  # Replace with actual IDs
end_snapshot_id = "1234567891"

df = spark.read.format("iceberg") \
    .option("start-snapshot-id", start_snapshot_id) \
    .option("end-snapshot-id", end_snapshot_id) \
    .load("iceberg.db.orders")
df.show()
```
This retrieves only the rows added between the two snapshots. Note that Iceberg's incremental reads currently support append commits only; snapshot ranges that include overwrite or delete operations are not supported.
Using Tags and Branches for Snapshot Management in Iceberg
Iceberg supports tags and branches for managing data history.
Creating a tag
A tag is a named pointer to a specific snapshot.
```python
spark.sql("ALTER TABLE iceberg.db.orders CREATE TAG quarterly_backup RETAIN 365 DAYS")
```
Now, we can query data using the tag:
```python
spark.sql("SELECT * FROM iceberg.db.orders VERSION AS OF 'quarterly_backup'").show()
```
Creating a branch
A branch is an independent, writable line of table history, similar to a branch in Git.
```python
spark.sql("ALTER TABLE iceberg.db.orders CREATE BRANCH experimental")
```
We can insert data into the branch without affecting the main table:
```python
spark.sql("""
    INSERT INTO iceberg.db.orders.branch_experimental VALUES
        (3, 'Charlie', 300.0, '2024-02-03 14:00:00')
""")
```
Compare the branch with the main table:
```python
df_main = spark.sql("SELECT * FROM iceberg.db.orders")
df_branch = spark.sql("SELECT * FROM iceberg.db.orders.branch_experimental")

# Rows present on the branch but not on main
df_branch.exceptAll(df_main).show()
```
This helps in A/B testing or sandboxing new changes.
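If the experiment proves out, the branch can be promoted to main. A sketch using Iceberg's fast_forward procedure, assuming an Iceberg release that ships it:

```python
# Fast-forward the main branch to the head of the experimental branch
spark.sql("CALL iceberg.system.fast_forward('db.orders', 'main', 'experimental')")

# Clean up the branch once merged
spark.sql("ALTER TABLE iceberg.db.orders DROP BRANCH experimental")
```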
Rolling Back Changes
Mistakes happen. Iceberg makes it easy to roll back to a previous snapshot.
```python
# Roll back via Iceberg's Spark procedure
previous_snapshot_id = "1234567890"  # Replace with an actual snapshot ID
spark.sql(f"CALL iceberg.system.rollback_to_snapshot('db.orders', {previous_snapshot_id})")
```
This reverts the table to an earlier state without losing history.
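If you know the point in time rather than the snapshot ID, the rollback_to_timestamp procedure offers the same safety net:

```python
# Roll back to the last snapshot committed before the given time
spark.sql("CALL iceberg.system.rollback_to_timestamp('db.orders', TIMESTAMP '2024-02-02 00:00:00')")
```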
Real-World Use Cases of Apache Iceberg Time Travel
Apache Iceberg's time travel functionality is invaluable in multiple real-world scenarios. Here are some practical use cases where data engineers and analysts leverage this feature:
1. Auditing and Compliance
Many industries, such as finance, healthcare, and e-commerce, require maintaining historical data for regulatory compliance and auditability. Iceberg simplifies this process by allowing users to query data as it existed at any point in time without complex ETL processes or additional snapshot storage.
Example:
A financial institution needs to verify transactions from six months ago to ensure compliance with regulatory requirements. Instead of restoring backups or replaying logs, they can query the table as it existed at that time:
```python
# as-of-timestamp expects epoch milliseconds (conversion as shown earlier)
query_time_ms = str(int(datetime.strptime("2023-08-01 00:00:00", "%Y-%m-%d %H:%M:%S").timestamp() * 1000))

df = spark.read.format("iceberg") \
    .option("as-of-timestamp", query_time_ms) \
    .load("iceberg.db.transactions")
df.show()
```
This enables seamless data auditing without disrupting the live dataset.
2. Debugging and Data Quality Validation
Unexpected data discrepancies can arise due to bad transformations, late-arriving data, or schema changes. Time travel allows teams to investigate past data states to pinpoint when and how an issue occurred.
Example:
A data engineer notices that revenue numbers for a specific day look off. By comparing snapshots, they can identify whether incorrect records were introduced in a specific ingestion job.
```python
start_snapshot_id = "1234567890"  # Replace with actual snapshot IDs
end_snapshot_id = "1234567891"

df = spark.read.format("iceberg") \
    .option("start-snapshot-id", start_snapshot_id) \
    .option("end-snapshot-id", end_snapshot_id) \
    .load("iceberg.db.sales")
df.show()
```
This makes root cause analysis much faster, eliminating the need to reconstruct past datasets manually.
3. Machine Learning Model Training on Historical Data
Data scientists often need to train models on historical datasets rather than the latest version of the data. Iceberg’s time travel allows ML teams to query past snapshots to create reproducible datasets.
Example:
An ML team training a fraud detection model needs transaction data as it existed last year to maintain consistency across different training iterations.
```python
# as-of-timestamp expects epoch milliseconds (conversion as shown earlier)
query_time_ms = str(int(datetime.strptime("2023-02-01 00:00:00", "%Y-%m-%d %H:%M:%S").timestamp() * 1000))

df = spark.read.format("iceberg") \
    .option("as-of-timestamp", query_time_ms) \
    .load("iceberg.db.transactions")
df.show()
```
This ensures that models can be trained on point-in-time datasets without interference from recent data changes.
4. Data Recovery and Rollback
Accidental deletions or data corruption can have severe consequences. Iceberg’s rollback feature enables restoring previous table versions instantly.
Example:
A data engineering team accidentally deletes customer orders from a production table. Instead of manually recovering data from backups, they roll back to a previous snapshot.
```python
previous_snapshot_id = "1234567890"  # Replace with an actual snapshot ID
spark.sql(f"CALL iceberg.system.rollback_to_snapshot('db.orders', {previous_snapshot_id})")
```
This avoids downtime and ensures data integrity without manual intervention.
Limitations & Challenges of Iceberg Time Travel
While Apache Iceberg’s time travel feature is powerful, it comes with some trade-offs and challenges. Understanding these limitations helps in designing better Iceberg-based data architectures.
1. Storage Overhead from Retained Snapshots
Every Iceberg snapshot maintains metadata and references to previous data files. If snapshots are not managed properly, storage costs can grow significantly.
Solution: Implement a snapshot retention policy to remove unnecessary historical snapshots.
```python
# Expire snapshots older than a cutoff date (replace with your retention window);
# expired snapshots can no longer be time traveled to.
spark.sql("CALL iceberg.system.expire_snapshots('db.orders', TIMESTAMP '2024-01-05 00:00:00')")
```
2. Query Performance on Large Datasets
Querying historical data requires scanning additional metadata and older file versions, which can impact performance if not optimized.
Solution:
- Use partitioning and hidden partitioning to minimize the data scanned (see the sketch after this list).
- Enable metadata caching in Iceberg to speed up queries.
- Use incremental queries instead of full snapshot retrieval where possible.
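To illustrate the first point: hidden partitioning declares a transform of an existing column, so Iceberg can prune files whenever queries filter on created_at, with no separate partition column to manage. A minimal sketch using a hypothetical orders_partitioned table:

```python
# Partition by day of created_at; readers filtering on created_at
# automatically benefit from partition pruning.
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.db.orders_partitioned (
        order_id BIGINT,
        customer STRING,
        amount DOUBLE,
        created_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(created_at))
""")
```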
3. Schema Evolution Considerations
Iceberg supports schema evolution, but querying historical data can introduce challenges if the schema has changed.
Example Problem:
A column was renamed from customer_name to client_name. In recent Iceberg releases, time travel queries in Spark resolve against the snapshot's schema rather than the current one, so a query written for the current schema (client_name) can fail against snapshots taken before the rename, while older queries referencing customer_name fail against the current table.
Solution:
- Maintain proper schema versioning and update queries accordingly.
- Use aliases to ensure backward compatibility (a sketch follows below).
```python
# The metadata-only rename that introduced the schema change;
# Iceberg tracks columns by ID, so no data files are rewritten.
spark.sql("ALTER TABLE iceberg.db.orders RENAME COLUMN customer_name TO client_name")
```
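One way to apply the alias advice is a compatibility view that maps the new name back to the old one, so legacy queries keep working. A sketch using the hypothetical column names from the example above:

```python
# Hypothetical compatibility view: expose client_name under its old name
spark.sql("""
    CREATE OR REPLACE TEMP VIEW orders_compat AS
    SELECT order_id, client_name AS customer_name, amount, created_at
    FROM iceberg.db.orders
""")
```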
4. Compatibility with Other Query Engines
While Apache Iceberg supports multiple query engines (Spark, Trino, Presto, Hive), not all engines support full time travel features.
Solution:
- Ensure the query engine used fully supports Iceberg time travel before implementing it in production.
- Validate that rollback operations behave consistently across different compute platforms.
Conclusion
Apache Iceberg’s time travel is a game-changer for data engineering. It provides:
- Fast, efficient historical queries
- Incremental data processing
- Seamless rollback and versioning
- Tagging and branching for experimentation
By leveraging Iceberg's time travel, you can make your data lake reliable, traceable, and auditable.
Want to simplify Iceberg-based CDC pipelines? Explore Estuary Flow for real-time data streaming.

About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.