Apache Iceberg is a modern table format that handles massive datasets in cloud-native environments. When combined with S3, Iceberg offers a powerful solution for scalable, real-time data processing and analytics. This tutorial guides you through setting up S3 tables with Apache Iceberg, showcasing the potential for managing structured data in distributed systems.
Why Use S3 Tables for Apache Iceberg?
Apache Iceberg provides features like schema evolution, time travel, and ACID transactions, making it an ideal choice for large-scale analytics workloads. By integrating it with S3, you gain:
- Scalability: Seamlessly handle petabytes of data.
- Cost-Effectiveness: Pay only for the storage and compute you use.
- Flexibility: Use various AWS analytics tools like Athena and Redshift.
Most importantly, there’s no need to manage any Iceberg catalog, as AWS takes care of that for you! Let’s take a look at what you’ll need to get started.
Prerequisites
Before starting, ensure the following:
- An AWS account with appropriate permissions.
- AWS CLI installed and configured.
- Spark installed on your local machine or an EMR cluster.
- A basic understanding of Iceberg and S3 concepts.
Step 1: Set Up an S3 Table Bucket
Setting up an S3 table bucket is almost identical to setting up a standard S3 bucket.
Create a Table Bucket
- Open the Amazon S3 Console.
- Select your desired region and create a bucket following these rules:
  - Names must be unique within your AWS account and the selected Region.
  - Use lowercase letters, numbers, and hyphens.
  - Avoid special characters or spaces.
Example bucket name: iceberg-demo-bucket
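You can also create the table bucket from the AWS CLI if you prefer scripting the setup. This is a minimal sketch using the s3tables commands in a recent AWS CLI version; swap in your own bucket name and Region:
aws s3tables create-table-bucket \
--name iceberg-demo-bucket \
--region us-east-1
The command returns the table bucket's ARN, which you'll need later if you configure an Iceberg catalog outside of EMR.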
Integrate with AWS Analytics
To enable querying via AWS tools like Athena or Redshift, turn on the integration with AWS analytics services for your table bucket. This registers its tables with the AWS Glue Data Catalog through AWS Lake Formation, and you can manage the integration from the bucket's properties.
Step 2: Launch Spark with Iceberg Support
To interact with S3 Tables from other services, AWS provides a client catalog JAR, s3-tables-catalog-for-iceberg.jar. You can build it from the AWS Labs GitHub repository or download it directly from Maven Central. If you run Spark outside of EMR, you must configure your Iceberg session to load this JAR.
We’ll use EMR for this tutorial, as it already has this library built-in; we only need to flip a configuration switch.
Create an EMR Cluster
Use the following AWS CLI command to set up an EMR cluster:
aws emr create-cluster \
--release-label emr-7.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name IcebergCluster \
--instance-type m5.xlarge \
--instance-count 3 \
--log-uri s3://your-log-bucket/ \
--service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole
Sample configurations.json:
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  }
]
This configuration snippet tells the EMR cluster that we plan to use it with S3 Iceberg tables.
Initialize Spark
SSH into your EMR primary node and start a Spark session:
spark-shell \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
--conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.demo.warehouse=s3://iceberg-demo-bucket/ \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Note that if you're running a local spark-shell instead of EMR, you'll need to add the S3 Tables catalog JAR mentioned above to Spark's classpath (for example via --jars or --packages) before executing this command.
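For reference, a locally launched session usually points the Iceberg catalog at the table bucket's ARN through the S3 Tables catalog implementation. The snippet below is only a sketch: the Maven coordinates, catalog class, versions, and the account ID in the ARN are assumptions and placeholders, so check the AWS Labs repository README for the exact values before running it:
spark-shell \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
--conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.demo.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
--conf spark.sql.catalog.demo.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions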
Step 3: Create and Populate an Iceberg Table
Now, let’s look at some everyday operations practitioners might use when interacting with Iceberg tables.
Create a Namespace and Table
Define a namespace (sales_db) and a table (orders) using Spark SQL:
plaintextspark.sql("CREATE NAMESPACE IF NOT EXISTS demo.sales_db")
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.sales_db.orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
total_amount DOUBLE
) USING iceberg
"""
)
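A quick way to confirm that the namespace and table were created is to list and describe them from the same session:
spark.sql("SHOW TABLES IN demo.sales_db").show()
spark.sql("DESCRIBE TABLE demo.sales_db.orders").show()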
Insert Data into the Table
Manually insert data for quick testing:
plaintextspark.sql("""
INSERT INTO demo.sales_db.orders VALUES
(1, 101, '2023-01-15', 150.00),
(2, 102, '2023-01-16', 200.50),
(3, 103, '2023-01-17', 75.25)
""")
Load Data from External Files
For real-world scenarios, you’ll likely load data from external files like Parquet:
val dataPath = "s3://path-to-data/orders.parquet"
val ordersDF = spark.read.parquet(dataPath)

// The orders table already exists, so simply append the new rows
ordersDF.writeTo("demo.sales_db.orders").append()
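The DataFrameWriterV2 API can also create an Iceberg table directly from a DataFrame; in that case, options such as using and tableProperty go on the create path rather than the append path. A short sketch, with orders_from_parquet as a hypothetical table name:
// Creates (or replaces) a new Iceberg v2 table from the DataFrame
ordersDF.writeTo("demo.sales_db.orders_from_parquet").using("iceberg").tableProperty("format-version", "2").createOrReplace()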
Step 4: Querying Data with SQL
Iceberg enables seamless querying using Spark or AWS analytics tools. Let’s start with some Spark SQL examples:
View All Orders
plaintextspark.sql("SELECT * FROM demo.sales_db.orders").show()
Filter by Date Range
plaintextspark.sql("""
SELECT * FROM demo.sales_db.orders
WHERE order_date BETWEEN '2023-01-15' AND '2023-01-16'
""").show()
Perform Aggregations
plaintextspark.sql("""
SELECT customer_id, COUNT(order_id) AS order_count, SUM(total_amount) AS total_spent
FROM demo.sales_db.orders
GROUP BY customer_id
""").show()
Maintenance Support for Amazon S3 Tables
Amazon S3 Tables offers robust maintenance operations to optimize table performance and manage storage effectively. These maintenance features include Compaction, Snapshot Management, and Unreferenced File Removal. These options are enabled for all tables by default but can be configured or disabled using maintenance configuration files. Configuration changes require specific permissions, such as s3tables:GetTableMaintenanceConfiguration and s3tables:PutTableMaintenanceConfiguration.
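Before changing anything, you can inspect a table's current maintenance settings with the AWS CLI. A sketch, assuming the s3tables CLI commands and a placeholder table bucket ARN:
aws s3tables get-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--namespace sales_db \
--name orders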
Compaction
Compaction improves query performance by combining smaller objects into fewer, larger objects based on a target file size. The process also applies row-level deletes, ensuring data consistency. Key details include:
- Default Target File Size: 512MB, configurable to suit specific access patterns.
- Supported File Types: Only available for Apache Parquet file types.
- Configuration Scope: This can only be configured at the table level.
- Additional Costs: Incurs costs as detailed in Amazon S3 pricing.
Limitations
- It only supports Parquet file types.
- It doesn’t support certain data types (e.g., Fixed) or compression types (e.g., Brotli, LZ4).
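To adjust the compaction target file size, the table-level put command takes a JSON value. Treat the JSON shape below as an assumption to verify against the S3 Tables API reference:
aws s3tables put-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--namespace sales_db \
--name orders \
--type icebergCompaction \
--value '{"status": "enabled", "settings": {"icebergCompaction": {"targetFileSizeMB": 256}}}'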
Snapshot Management
Snapshot management governs the retention and expiration of table snapshots. It ensures that older snapshots are removed to optimize storage, with configurations based on:
- Minimum Snapshots to Keep: Default is 1.
- Maximum Snapshot Age: Default is 120 hours.
Expired snapshots result in:
- Deletion of Noncurrent Objects: These are marked and deleted based on the NoncurrentDays property.
- Time Travel Limitations: Once metadata is deleted, time travel queries for expired snapshots are no longer possible.
Limitations
- Snapshot management overrides any retention policies configured in metadata.json or via SQL commands if conflicting values exist.
- Snapshot deletions are irreversible; recovering noncurrent objects requires AWS Support intervention.
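Snapshot retention follows the same pattern with its own maintenance type; again, verify the exact JSON keys against the API reference:
aws s3tables put-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket \
--namespace sales_db \
--name orders \
--type icebergSnapshotManagement \
--value '{"status": "enabled", "settings": {"icebergSnapshotManagement": {"minSnapshotsToKeep": 3, "maxSnapshotAgeHours": 168}}}'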
Unreferenced File Removal
This feature identifies and deletes objects no longer referenced by table snapshots, optimizing storage. Key configurations:
- Expire Days: Objects older than the ExpireDays property are marked as noncurrent (default: 3 days).
- Noncurrent Days: Noncurrent objects are permanently deleted after this period (default: 10 days).
Limitations
- It is only configurable at the table bucket level.
- Deletes are irreversible and incur additional costs.
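Because this feature is configured per table bucket rather than per table, the related commands operate at the bucket level. A sketch, again assuming the s3tables CLI and a placeholder ARN:
aws s3tables get-table-bucket-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/iceberg-demo-bucket
A matching put-table-bucket-maintenance-configuration command covers changes to the unreferenced file removal settings.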
Getting Data into Apache Iceberg
Data engineers have many options to get data into Iceberg data lakes. Check out our guide for an overview.
Estuary Flow offers a powerful solution for streaming data integration into Apache Iceberg. Its Iceberg Materialization Connector allows you to seamlessly load real-time and batch data into Iceberg tables, ensuring your data lakehouse remains up-to-date and query-ready.
Here’s how Estuary Flow simplifies and enhances your Apache Iceberg workflows:
1. Unified Streaming and Batch Processing
With Estuary Flow, you can handle streaming and historical batch data in a single pipeline. Flow ensures consistent schema management and efficient writes to Iceberg tables, whether you're processing high-velocity streams or backfilling historical datasets.
2. Seamless Schema Evolution
Schema evolution in Iceberg can be challenging, especially when dealing with dynamic data sources. Estuary Flow automatically handles schema changes from upstream systems, ensuring compatibility with Iceberg’s metadata layers without manual intervention.
3. Enterprise Support
Estuary Flow can be deployed in any networking environment. With private deployments, your data never leaves your network, and there is no compromise in performance.
With Estuary Flow, teams can focus on extracting insights from their data rather than managing complex data pipelines. Whether building a streaming data lakehouse or modernizing your batch workflows, Flow's Iceberg Materialization Connector ensures your data is always in sync and ready for action.
Check out this video to see how easy it is to set up an end-to-end data flow for Apache Iceberg:
Streaming Data Lakehouse Tutorial: MongoDB to Apache Iceberg
Conclusion
Apache Iceberg with S3 provides a robust foundation for modern data engineering. Whether building real-time pipelines or simplifying your data lake architecture, the combination of Iceberg and S3 unlocks unparalleled flexibility and performance.
Experiment with these examples, and consider integrating AWS services like Glue or Athena to enhance your setup further.
Explore the Apache Iceberg documentation or AWS's S3 Tables Guide for more details.
About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.