Getting Started with S3 Tables for Apache Iceberg

Learn how to set up S3 Tables for Apache Iceberg and unlock scalable, real-time data processing with AWS. Follow this step-by-step guide to get started.

Apache Iceberg is a modern table format that handles massive datasets in cloud-native environments. When combined with S3, Iceberg offers a powerful solution for scalable, real-time data processing and analytics. This tutorial guides you through setting up S3 tables with Apache Iceberg, showcasing the potential for managing structured data in distributed systems.

Why Use S3 Tables for Apache Iceberg?

Apache Iceberg provides features like schema evolution, time travel, and ACID transactions, making it an ideal choice for large-scale analytics workloads. By integrating it with S3, you gain:

  1. Scalability: Seamlessly handle petabytes of data.
  2. Cost-Effectiveness: Pay only for the storage and compute you use.
  3. Flexibility: Use various AWS analytics tools like Athena and Redshift.

Most importantly, there’s no need to manage any Iceberg catalog, as AWS takes care of that for you! Let’s take a look at what you’ll need to get started.

Prerequisites

Before starting, ensure the following:

  • An AWS account with appropriate permissions.
  • AWS CLI installed and configured.
  • Spark installed on your local machine or an EMR cluster.
  • A basic understanding of Iceberg and S3 concepts.
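
Optionally, you can confirm that your CLI credentials are wired up before going further. Here's a minimal sketch using boto3, which reads the same credentials you set up with aws configure:

python
import boto3

# Verify that the configured credentials resolve to the expected account.
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(f"Authenticated as {identity['Arn']} in account {identity['Account']}")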

Step 1: Set Up an S3 Table Bucket

Setting up an S3 table bucket is almost identical to creating a standard S3 bucket.

Create a Table Bucket

  1. Open the Amazon S3 Console.
  2. Select your desired region and create a bucket following these rules:
    • Names must be unique within your AWS account in the chosen Region.
    • Use lowercase letters, numbers, and hyphens.
    • Avoid special characters or spaces.

Example bucket name: iceberg-demo-bucket
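
If you prefer scripting over the console, the S3 Tables API also exposes a CreateTableBucket operation. The sketch below is an illustration using the boto3 s3tables client (available in recent boto3 releases); the parameter and response field names reflect the S3 Tables API reference, so double-check them against your SDK version.

python
import boto3

# Hedged sketch: create the table bucket programmatically instead of via the console.
s3tables = boto3.client("s3tables", region_name="us-east-1")
response = s3tables.create_table_bucket(name="iceberg-demo-bucket")

# The returned ARN is handy later when pointing an Iceberg catalog at this bucket.
print(response["arn"])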

Integrate with AWS Analytics

To enable querying via AWS tools like Athena or Redshift, ensure that your bucket supports table-based analytics. You can manage this integration in the bucket's properties.

Step 2: Launch Spark with Iceberg Support

To interact with S3 Tables from different engines, AWS provides a Maven JAR called s3-tables-catalog-for-iceberg.jar. You can build this client catalog JAR from the AWS Labs GitHub repository or download it directly from Maven Central. If you run Spark outside of EMR, you must configure your Iceberg session to load this JAR.

We’ll use EMR for this tutorial, as it already has this library built-in; we only need to flip a configuration switch.

Create an EMR Cluster

Use the following AWS CLI command to set up an EMR cluster:

bash
aws emr create-cluster \
  --release-label emr-7.5.0 \
  --applications Name=Spark \
  --configurations file://configurations.json \
  --region us-east-1 \
  --name IcebergCluster \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --log-uri s3://your-log-bucket/ \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole

Sample configurations.json:

json
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  }
]

This configuration snippet lets the EMR cluster know that we’re planning on utilizing it with S3 Iceberg tables.
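
The create-cluster command returns a cluster ID. If you'd like to script the wait for the cluster to become ready before connecting, here is a small boto3 sketch (the cluster ID below is a placeholder):

python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Placeholder: use the ClusterId printed by `aws emr create-cluster`.
cluster_id = "j-XXXXXXXXXXXXX"

# Block until the cluster reaches a ready (RUNNING/WAITING) state, then print its status.
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
print(emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"])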

Initialize Spark

SSH into your EMR primary node and start a Spark session:

bash
spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.2.0 \
  --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.demo.warehouse=s3://iceberg-demo-bucket/ \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Note that if you’re running a local spark-shell instead of EMR, you’ll need to add the s3-tables-catalog-for-iceberg JAR mentioned above to Spark’s classpath (for example, via the --jars option) before executing this command.
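
For reference, here is a rough PySpark equivalent of the same session setup, pointed directly at a table bucket. This is a sketch, not an exact configuration from AWS: the S3TablesCatalog class name comes from the AWS Labs client catalog project, and the warehouse ARN (Region, account ID, bucket name) is a placeholder you would replace with your own.

python
from pyspark.sql import SparkSession

# Sketch: a PySpark session configured against an S3 table bucket.
# Assumes the Iceberg runtime and the s3-tables-catalog-for-iceberg JARs are on the
# classpath (bundled on EMR 7.5+ with iceberg-defaults enabled; supplied manually elsewhere).
spark = (
    SparkSession.builder
    .appName("s3-tables-iceberg-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    # Placeholder ARN: arn:aws:s3tables:<region>:<account-id>:bucket/<table-bucket-name>
    .config("spark.sql.catalog.demo.warehouse",
            "arn:aws:s3tables:us-east-1:123456789012:bucket/iceberg-demo-bucket")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)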

Step 3: Create and Populate an Iceberg Table

Now, let’s look at some everyday operations practitioners might use when interacting with Iceberg tables.

Create a Namespace and Table

Define a namespace (sales_db) and a table (orders) using Spark SQL:

python
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.sales_db")

spark.sql("""
CREATE TABLE IF NOT EXISTS demo.sales_db.orders (
  order_id BIGINT,
  customer_id BIGINT,
  order_date DATE,
  total_amount DOUBLE
)
USING iceberg
""")

Insert Data into the Table

Manually insert data for quick testing:

python
spark.sql("""
INSERT INTO demo.sales_db.orders VALUES
  (1, 101, '2023-01-15', 150.00),
  (2, 102, '2023-01-16', 200.50),
  (3, 103, '2023-01-17', 75.25)
""")

Load Data from External Files

For real-world scenarios, you’ll likely load data from external files like Parquet:

python
dataPath = "s3://path-to-data/orders.parquet"
ordersDF = spark.read.parquet(dataPath)

(ordersDF.writeTo("demo.sales_db.orders")
    .using("iceberg")
    .tableProperty("format-version", "2")
    .append())

Step 4: Querying Data with SQL

Iceberg enables seamless querying using Spark or AWS analytics tools. Let’s start with some Spark SQL examples:

View All Orders

python
spark.sql("SELECT * FROM demo.sales_db.orders").show()

Filter by Date Range

python
spark.sql("""
SELECT * FROM demo.sales_db.orders
WHERE order_date BETWEEN '2023-01-15' AND '2023-01-16'
""").show()

Perform Aggregations

python
spark.sql("""
SELECT customer_id,
       COUNT(order_id) AS order_count,
       SUM(total_amount) AS total_spent
FROM demo.sales_db.orders
GROUP BY customer_id
""").show()

Maintenance Support for Amazon S3 Tables

Amazon S3 Tables offers robust maintenance operations to optimize table performance and manage storage effectively. These maintenance features include Compaction, Snapshot Management, and Unreferenced File Removal. They are enabled for all tables by default but can be configured or disabled using maintenance configuration files. Configuration changes require specific permissions, such as s3tables:GetTableMaintenanceConfiguration and s3tables:PutTableMaintenanceConfiguration.

Compaction

Compaction improves query performance by combining smaller objects into fewer, larger objects based on a target file size. This process also applies row-level deletes, ensuring data consistency. Key details include:

  • Default Target File Size: 512MB, configurable to suit specific access patterns.
  • Supported File Types: Only available for Apache Parquet file types.
  • Configuration Scope: This can only be configured at the table level.
  • Additional Costs: Incurs costs as detailed in Amazon S3 pricing.
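
As an illustration of how such a configuration change might look, here is a hedged boto3 sketch that lowers the compaction target file size for a single table. The s3tables client, operation, and parameter names reflect my reading of the S3 Tables API reference and may not match your SDK version exactly, and the table bucket ARN is a placeholder; verify everything against the current boto3 documentation before relying on it.

python
import boto3

# Hedged sketch: adjust compaction's target file size for one table.
# Requires the s3tables:PutTableMaintenanceConfiguration permission.
s3tables = boto3.client("s3tables", region_name="us-east-1")

s3tables.put_table_maintenance_configuration(
    tableBucketARN="arn:aws:s3tables:us-east-1:123456789012:bucket/iceberg-demo-bucket",  # placeholder
    namespace="sales_db",
    name="orders",
    type="icebergCompaction",
    value={
        "status": "enabled",
        "settings": {"icebergCompaction": {"targetFileSizeMB": 256}},
    },
)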

Limitations

  • It only supports Parquet file types.
  • It doesn’t support certain data types (e.g., Fixed) or compression types (e.g., Brotli, LZ4).

Snapshot Management

Snapshot management governs the retention and expiration of table snapshots. It ensures that older snapshots are removed to optimize storage, with configurations based on:

  • Minimum Snapshots to Keep: Default is 1.
  • Maximum Snapshot Age: Default is 120 hours.

Expired snapshots result in:

  • Deletion of Noncurrent Objects: These are marked and deleted based on the NoncurrentDays property.
  • Time Travel Limitations: Once metadata is deleted, time travel queries for expired snapshots are no longer possible.

Limitations

  • Snapshot management overrides any retention policies configured in metadata.json or via SQL commands if conflicting values exist.
  • Snapshot deletions are irreversible; recovering noncurrent objects requires AWS Support intervention.

Unreferenced File Removal

This feature identifies and deletes objects no longer referenced by table snapshots, optimizing storage. Key configurations:

  • Expire Days: Objects older than the ExpireDays property are marked as noncurrent (default: 3 days).
  • Noncurrent Days: Noncurrent objects are permanently deleted after this period (default: 10 days).

Limitations

  • It is only configurable at the table bucket level.
  • Deletes are irreversible and incur additional costs.

Getting Data into Apache Iceberg

Data engineers have many options to get data into Iceberg data lakes. Check out our guide for an overview.

Estuary Flow offers a powerful solution for streaming data integration into Apache Iceberg. Its Iceberg Materialization Connector allows you to seamlessly load real-time and batch data into Iceberg tables, ensuring your data lakehouse remains up-to-date and query-ready.

Here’s how Estuary Flow simplifies and enhances your Apache Iceberg workflows:

1. Unified Streaming and Batch Processing

With Estuary Flow, you can handle streaming and historical batch data in a single pipeline. Flow ensures consistent schema management and efficient writes to Iceberg tables, whether you're processing high-velocity streams or backfilling historical datasets.

2. Seamless Schema Evolution

Schema evolution in Iceberg can be challenging, especially when dealing with dynamic data sources. Estuary Flow automatically handles schema changes from upstream systems, ensuring compatibility with Iceberg’s metadata layers without manual intervention.

3. Enterprise Support

Estuary Flow can be deployed in any networking environment. With private deployments, your data never leaves your network, and there is no compromise in performance.

With Estuary Flow, teams can focus on extracting insights from their data rather than managing complex data pipelines. Whether building a streaming data lakehouse or modernizing your batch workflows, Flow's Iceberg Materialization Connector ensures your data is always in sync and ready for action.

Check out this video to see how easy it is to set up an end-to-end data flow for Apache Iceberg:

Streaming Data Lakehouse Tutorial: MongoDB to Apache Iceberg

Conclusion

Apache Iceberg with S3 provides a robust foundation for modern data engineering. Whether building real-time pipelines or simplifying your data lake architecture, the combination of Iceberg and S3 unlocks unparalleled flexibility and performance.

Experiment with these examples, and consider integrating AWS services like Glue or Athena to enhance your setup further.

Explore the Apache Iceberg documentation or AWS's S3 Tables Guide for more details.

