Apache Iceberg is a modern table format that handles massive datasets in cloud-native environments. When combined with S3, Iceberg offers a powerful solution for scalable, real-time data processing and analytics. This tutorial guides you through setting up S3 tables with Apache Iceberg, showcasing the potential for managing structured data in distributed systems.
Why Use S3 Tables for Apache Iceberg?
Apache Iceberg provides features like schema evolution, time travel, and ACID transactions, making it an ideal choice for large-scale analytics workloads. By integrating it with S3, you gain:
- Scalability: Seamlessly handle petabytes of data.
- Cost-Effectiveness: Pay only for the storage and compute you use.
- Flexibility: Use various AWS analytics tools like Athena and Redshift.
Most importantly, there’s no need to manage any Iceberg catalog, as AWS takes care of that for you! Let’s take a look at what you’ll need to get started.
Prerequisites
Before starting, ensure the following:
- An AWS account with appropriate permissions.
- AWS CLI installed and configured.
- Spark installed on your local machine or an EMR cluster.
- A basic understanding of Iceberg and S3 concepts.
Step 1: Set Up an S3 Table Bucket
Setting up an S3 table bucket is almost identical to setting up a standard S3 bucket.
Create a Table Bucket
- Open the Amazon S3 Console.
- Select your desired region and create a bucket following these rules:
  - Names must be unique within your AWS account and the selected Region.
  - Use lowercase letters, numbers, and hyphens.
  - Avoid special characters or spaces.
Example bucket name: iceberg-demo-bucket
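You can also create the table bucket from the AWS CLI if you prefer scripting the setup. This is a minimal sketch using the s3tables commands in a recent AWS CLI version; swap in your own bucket name and Region:
aws s3tables create-table-bucket \
--name iceberg-demo-bucket \
--region us-east-1
The command returns the table bucket's ARN, which you'll need later if you configure an Iceberg catalog outside of EMR.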
Integrate with AWS Analytics
To enable querying via AWS tools like Athena or Redshift, turn on the integration with AWS analytics services for your table bucket. This registers its tables with the AWS Glue Data Catalog through AWS Lake Formation, and you can manage the integration from the bucket's properties.
Step 2: Launch Spark with Iceberg Support
To interact with S3 Tables from other services, AWS provides a client catalog JAR, s3-tables-catalog-for-iceberg.jar. You can build it from the AWS Labs GitHub repository or download it directly from Maven Central. If you run Spark outside of EMR, you must configure your Iceberg session to load this JAR.
We’ll use EMR for this tutorial, as it already has this library built-in; we only need to flip a configuration switch.
Create an EMR Cluster
Use the following AWS CLI command to set up an EMR cluster:
aws emr create-cluster \
--release-label emr-7.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name IcebergCluster \
--instance-type m5.xlarge \
--instance-count 3 \
--log-uri s3://your-log-bucket/ \
--service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole
Sample configurations.json:
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  }
]
This configuration snippet tells the EMR cluster that we plan to use it with S3 Iceberg tables.
Initialize Spark
SSH into your EMR primary node and start a Spark session:
spark-shell \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
--conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.demo.warehouse=s3://iceberg-demo-bucket/ \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Note that if you're running a local spark-shell instead of EMR, you'll need to add the S3 Tables catalog JAR mentioned above to Spark's classpath (for example via --jars or --packages) before executing this command.
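For reference, a locally launched session usually points the Iceberg catalog at the table bucket's ARN through the S3 Tables catalog implementation. The snippet below is only a sketch: the Maven coordinates, catalog class, versions, and the account ID in the ARN are assumptions and placeholders, so check the AWS Labs repository README for the exact values before running it:
spark-shell \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
--conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.demo.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
--conf spark.sql.catalog.demo.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions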
Step 3: Create and Populate an Iceberg Table
Now, let’s look at some everyday operations practitioners might use when interacting with Iceberg tables.
Create a Namespace and Table
Define a namespace (sales_db) and a table (orders) using Spark SQL:
plaintextspark.sql("CREATE NAMESPACE IF NOT EXISTS demo.sales_db")
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.sales_db.orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
total_amount DOUBLE
) USING iceberg
"""
)
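A quick way to confirm that the namespace and table were created is to list and describe them from the same session:
spark.sql("SHOW TABLES IN demo.sales_db").show()
spark.sql("DESCRIBE TABLE demo.sales_db.orders").show()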
Insert Data into the Table
Manually insert data for quick testing:
plaintextspark.sql("""
INSERT INTO demo.sales_db.orders VALUES
(1, 101, '2023-01-15', 150.00),
(2, 102, '2023-01-16', 200.50),
(3, 103, '2023-01-17', 75.25)
""")
Load Data from External Files
For real-world scenarios, you’ll likely load data from external files like Parquet:
val dataPath = "s3://path-to-data/orders.parquet"
val ordersDF = spark.read.parquet(dataPath)

// The orders table already exists, so simply append the new rows
ordersDF.writeTo("demo.sales_db.orders").append()
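The DataFrameWriterV2 API can also create an Iceberg table directly from a DataFrame; in that case, options such as using and tableProperty go on the create path rather than the append path. A short sketch, with orders_from_parquet as a hypothetical table name:
// Creates (or replaces) a new Iceberg v2 table from the DataFrame
ordersDF.writeTo("demo.sales_db.orders_from_parquet").using("iceberg").tableProperty("format-version", "2").createOrReplace()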
Step 4: Querying Data with SQL
Iceberg enables seamless querying using Spark or AWS analytics tools. Let’s start with some Spark SQL examples:
View All Orders
plaintextspark.sql("SELECT * FROM demo.sales_db.orders").show()
Filter by Date Range
plaintextspark.sql("""
SELECT * FROM demo.sales_db.orders
WHERE order_date BETWEEN '2023-01-15' AND '2023-01-16'
""").show()
Perform Aggregations
plaintextspark.sql("""
SELECT customer_id, COUNT(order_id) AS order_count, SUM(total_amount) AS total_spent
FROM demo.sales_db.orders
GROUP BY customer_id
""").show()
Maintenance Support for Amazon S3 Tables
Amazon S3 Tables offers robust maintenance operations to optimize table performance and manage storage effectively. These maintenance features include Compaction, Snapshot Management, and Unreferenced File Removal. These options are enabled for all tables by default but can be configured or disabled using maintenance configuration files. Configuration changes require specific permissions, such as s3tables:GetTableMaintenanceConfiguration and s3tables:PutTableMaintenanceConfiguration.
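Before changing anything, you can inspect a table's current maintenance settings with the AWS CLI. A sketch, assuming the s3tables CLI commands and a placeholder table bucket ARN:
aws s3tables get-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--namespace sales_db \
--name orders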
Compaction
Compaction improves query performance by combining smaller objects into fewer, larger objects based on a target file size. The process also applies row-level deletes, ensuring data consistency. Key details include:
- Default Target File Size: 512MB, configurable to suit specific access patterns.
- Supported File Types: Only available for Apache Parquet file types.
- Configuration Scope: This can only be configured at the table level.
- Additional Costs: Incurs costs as detailed in Amazon S3 pricing.
Limitations
- It only supports Parquet file types.
- It doesn’t support certain data types (e.g., Fixed) or compression types (e.g., Brotli, LZ4).
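To adjust the compaction target file size, the table-level put command takes a JSON value. Treat the JSON shape below as an assumption to verify against the S3 Tables API reference:
aws s3tables put-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--namespace sales_db \
--name orders \
--type icebergCompaction \
--value '{"status": "enabled", "settings": {"icebergCompaction": {"targetFileSizeMB": 256}}}'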
Snapshot Management
Snapshot management governs the retention and expiration of table snapshots. It ensures that older snapshots are removed to optimize storage, with configurations based on:
- Minimum Snapshots to Keep: Default is 1.
- Maximum Snapshot Age: Default is 120 hours.
Expired snapshots result in:
- Deletion of Noncurrent Objects: These are marked and deleted based on the NoncurrentDays property.
- Time Travel Limitations: Once metadata is deleted, time travel queries for expired snapshots are no longer possible.
Limitations
- Snapshot management overrides any retention policies configured in metadata.json or via SQL commands if conflicting values exist.
- Snapshot deletions are irreversible; recovering noncurrent objects requires AWS Support intervention.
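Snapshot retention follows the same pattern with its own maintenance type; again, verify the exact JSON keys against the API reference:
aws s3tables put-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--namespace sales_db \
--name orders \
--type icebergSnapshotManagement \
--value '{"status": "enabled", "settings": {"icebergSnapshotManagement": {"minSnapshotsToKeep": 3, "maxSnapshotAgeHours": 168}}}'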
Unreferenced File Removal
This feature identifies and deletes objects no longer referenced by table snapshots, optimizing storage. Key configurations:
- Expire Days: Objects older than the ExpireDays property are marked as noncurrent (default: 3 days).
- Noncurrent Days: Noncurrent objects are permanently deleted after this period (default: 10 days).
Limitations
- It is only configurable at the table bucket level.
- Deletes are irreversible and incur additional costs.
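Because this feature is configured per table bucket rather than per table, the related commands operate at the bucket level. A sketch, again assuming the s3tables CLI and a placeholder ARN:
aws s3tables get-table-bucket-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket
A matching put-table-bucket-maintenance-configuration command covers changes to the unreferenced file removal settings.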
Getting Data into Apache Iceberg
Data engineers have many options to get data into Iceberg data lakes. Check out our guide for an overview.
Estuary Flow offers a powerful solution for streaming data integration into Apache Iceberg. Its Iceberg Materialization Connector allows you to seamlessly load real-time and batch data into Iceberg tables, ensuring your data lakehouse remains up-to-date and query-ready.
Here’s how Estuary Flow simplifies and enhances your Apache Iceberg workflows:
1. Unified Streaming and Batch Processing
With Estuary Flow, you can handle streaming and historical batch data in a single pipeline. Flow ensures consistent schema management and efficient writes to Iceberg tables, whether you're processing high-velocity streams or backfilling historical datasets.
2. Seamless Schema Evolution
Schema evolution in Iceberg can be challenging, especially when dealing with dynamic data sources. Estuary Flow automatically handles schema changes from upstream systems, ensuring compatibility with Iceberg’s metadata layers without manual intervention.
3. Enterprise Support
Estuary Flow can be deployed in any networking environment. With private deployments, your data never leaves your network, and there is no compromise in performance.
With Estuary Flow, teams can focus on extracting insights from their data rather than managing complex data pipelines. Whether building a streaming data lakehouse or modernizing your batch workflows, Flow's Iceberg Materialization Connector ensures your data is always in sync and ready for action.
Check out this video to see how easy it is to set up an end-to-end data flow for Apache Iceberg:
Streaming Data Lakehouse Tutorial: MongoDB to Apache Iceberg
Conclusion
Apache Iceberg with S3 provides a robust foundation for modern data engineering. Whether building real-time pipelines or simplifying your data lake architecture, the combination of Iceberg and S3 unlocks unparalleled flexibility and performance.
Experiment with these examples, and consider integrating AWS services like Glue or Athena to enhance your setup further.
Explore the Apache Iceberg documentation or AWS's S3 Tables Guide for more details.
About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.