
Moving data from MySQL to Apache Iceberg is essential for organizations building scalable, cost-efficient, and analytics-ready data architectures. While MySQL excels at transactional workloads, it lacks the flexibility and scalability needed for big data analytics. Apache Iceberg, a high-performance table format for data lakes, solves this by enabling efficient querying, schema evolution, time-travel queries, and ACID compliance.
This guide covers two real-time MySQL to Iceberg data replication methods:
- Using Estuary Flow – A fully managed, real-time solution with no-code setup.
- Manual Method – A DIY approach using open-source tools like Debezium, Kafka, and Spark.
Why Move Data from MySQL to Iceberg?
Challenges with MySQL for Analytics:
- Scalability Limitations: MySQL isn’t optimized for big data workloads.
- Slow Query Performance: Complex queries slow down MySQL and impact production systems.
- Lack of Schema Evolution: Schema changes on large tables can mean slow, disruptive ALTER TABLE migrations.
- High Storage Costs: Storing large datasets in MySQL can be expensive.
Benefits of Iceberg:
- Optimized for Analytics: Works seamlessly with Spark, Trino, and Flink.
- Schema Evolution: Add, remove, or modify columns without breaking queries.
- Partitioning & Time Travel: Improves query performance and lets you query historical table snapshots (see the example after this list).
- Cost-Effective Storage: Supports object storage like Amazon S3, reducing costs.
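To make the time-travel benefit concrete, here is a minimal PySpark sketch, assuming a Spark session already configured with an Iceberg catalog (Spark 3.3+ syntax); the table name and timestamp are placeholders:

```python
# Illustrative only: query an Iceberg table as it existed at an earlier point in time.
# "prod_db.my_table" and the timestamp are placeholder values.
historical = spark.sql(
    "SELECT * FROM prod_db.my_table TIMESTAMP AS OF '2024-01-01 00:00:00'"
)
historical.show()

# Each write creates a snapshot; the snapshots metadata table lists them for auditing.
spark.sql("SELECT snapshot_id, committed_at FROM prod_db.my_table.snapshots").show()
```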
Organizations unlock faster analytics, better data governance, and cost savings by moving data from MySQL to Iceberg.
Method 1: Connecting MySQL to Iceberg Using Estuary Flow
Estuary Flow simplifies moving MySQL to Iceberg with real-time Change Data Capture (CDC). This allows you to replicate MySQL tables into Apache Iceberg on Amazon S3 with sub-second latency, ensuring your Iceberg tables remain up to date.
Step 1: Configure MySQL as the Source
- Log in to your Estuary Flow account.
- In the left navigation pane, click "Sources" and then "+ NEW CAPTURE" to add a new source connector.
- Search for MySQL in the connectors list and click the "Capture" button.
- Fill in the required details under the Capture Details section:
- Name: A unique name for this MySQL source.
- Host: The hostname or IP address of your MySQL database.
- Port: Default is 3306, unless specified otherwise.
- Database Name: The name of your MySQL database.
- Username & Password: Credentials to access the database.
- Enable Binlog: Ensure MySQL’s binary logging is enabled for CDC (a quick way to check is sketched after these steps).
- Click "NEXT > SAVE AND PUBLISH" to complete the MySQL source configuration.
Step 2: Configure Amazon S3 Iceberg as the Destination
- Once the MySQL capture is complete, a pop-up window with capture details appears. Click "MATERIALIZE COLLECTIONS" to begin setting up the destination.
- Alternatively, go to "Destinations" in the Estuary Flow dashboard and click "+ NEW MATERIALIZATION."
- Search for "Iceberg" in the connectors field and select the Amazon S3 Iceberg connector.
- Fill in the required fields:
- Name: A unique name for your Iceberg destination.
- AWS Access Key ID & Secret Access Key: IAM credentials for accessing Amazon S3.
- Bucket: The S3 bucket where Iceberg tables will be stored.
- Region: The AWS region where the S3 bucket exists.
- Namespace: A namespace for Iceberg tables.
- Choose Iceberg Catalog Type:
- AWS Glue Catalog (requires Glue permissions for your IAM user).
- REST Catalog (requires Catalog URI, credentials, and warehouse name).
- Link MySQL Capture to Iceberg Destination:
- Verify that the MySQL collections are linked automatically. If not, click "SOURCE FROM CAPTURE" to connect them.
- Click "NEXT > SAVE AND PUBLISH" to complete the destination setup.
Finalizing the MySQL to Iceberg Integration
Once configured, Estuary Flow will:
- Continuously capture MySQL changes and apply them to Iceberg.
- Ensure real-time data updates with sub-100ms latency.
- Automatically handle schema changes and optimize data format.
Benefits of Using Estuary Flow for MySQL to Iceberg
- Real-Time CDC with Low Latency – Captures MySQL changes and updates Iceberg in real-time, ensuring data is always fresh and consistent.
- No-Code, Easy Setup – Set up MySQL-to-Iceberg replication in minutes with a user-friendly interface—no coding required.
- Automatic Schema Evolution – Detects schema changes in MySQL and automatically updates Iceberg tables, eliminating manual adjustments.
- Reliability & Exactly-Once Consistency – Ensures zero duplicates or data loss with fault-tolerant processing and automatic retries.
- Scalability & Performance – Handles high-throughput ingestion of thousands of events per second, with no need for Kafka or Spark tuning.
- Built-In Monitoring & Security – Get real-time dashboards, alerts, encryption, and access controls, ensuring smooth operations with enterprise-grade security.
Method 2: Manual Approach to Load MySQL to Iceberg
The manual approach to loading data from MySQL to Iceberg involves using a combination of tools or custom code to extract the data, transfer it, and load it into an Iceberg table. This method gives you complete control over the process but is significantly more complex and time-consuming. Typically, a manual solution will include steps for initial batch export and a strategy for ongoing change data capture if you need continuous updates.
Step 1: Export MySQL Data
- Option 1: Using mysqldump – Export each table as a tab-delimited data file plus a SQL schema file (the output directory must be writable by the MySQL server) with:

```bash
mysqldump --tab=/path/to/output mydatabase mytable
```
- Option 2: Using Apache Spark or Pandas – Read MySQL data via JDBC connector for distributed processing.
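For Option 2, a hedged PySpark sketch of a parallel JDBC read is shown below; the URL, credentials, and partition bounds are placeholders, and the MySQL Connector/J driver must be available on the Spark classpath. The resulting DataFrame can be written straight to Iceberg in Step 3.

```python
# Sketch of a parallel JDBC read from MySQL with Spark.
# URL, credentials, and partition bounds are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql.example.com:3306/mydatabase")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "mytable")
    .option("user", "readonly_user")
    .option("password", "********")
    .option("partitionColumn", "id")   # assumes a numeric, indexed column
    .option("lowerBound", 1)
    .option("upperBound", 10000000)
    .option("numPartitions", 8)        # parallel reads; tune to table size and MySQL capacity
    .load()
)
```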
Step 2: Set Up Iceberg Environment
- Choose Storage – Store Iceberg data in Amazon S3, HDFS, or Azure Data Lake.
- Set Up Catalog – Use AWS Glue, Hive Metastore, or Apache Nessie to track Iceberg tables.
- Create Iceberg Tables – Use Spark SQL or Hive to define tables:

```sql
CREATE TABLE prod_db.my_table (id INT, name STRING, event_date DATE)
USING iceberg
PARTITIONED BY (event_date);
```
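Before this CREATE TABLE will run, the Spark session needs an Iceberg catalog. Below is a minimal sketch using a Hadoop catalog on S3; the catalog name (lake), warehouse bucket, and runtime version are placeholders, you would substitute Glue, Hive, Nessie, or REST settings depending on the catalog chosen above, and S3 access also requires the appropriate filesystem jars and AWS credentials for your environment.

```python
# Sketch: SparkSession wired to an Iceberg catalog on S3 (Hadoop catalog shown).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("mysql-to-iceberg")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")  # placeholder version
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")                    # or hive / glue / rest
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# With a named catalog like this, the table is created and queried as
# lake.prod_db.my_table (or wire Iceberg into Spark's session catalog to
# keep the two-part name used above).
```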
Step 3: Load Data into Iceberg
- Using Apache Spark – Read CSVs and write to Iceberg:
```python
df = spark.read.option("header", "true").csv("s3://bucket/export/mytable.csv")
df.write.format("iceberg").mode("append").save("prod_db.my_table")
```
- Using Trino or Flink – Both engines can insert data into Iceberg tables with SQL.
Step 4: Set Up CDC for Real-Time Updates (Optional but Complex)
- Capture MySQL Binlog – Use Debezium or another MySQL CDC connector to track row-level changes.
- Stream Data via Kafka – Use Kafka to queue updates from MySQL.
- Apply Changes to Iceberg:
- Kafka Connect with Iceberg Sink – Writes directly to Iceberg.
- Flink or Spark Structured Streaming – Processes MySQL change events in real-time.
- Handle Deletes & Updates – Use UPSERT/MERGE operations in Iceberg to sync data changes.
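As a condensed sketch of the Spark Structured Streaming path: it assumes Debezium is already publishing change events to a Kafka topic, Spark 3.3+, the spark-sql-kafka package, and the Iceberg SQL extensions from the session configuration above. The topic name, simplified event schema, and table identifier are placeholders.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Simplified change-event schema; real Debezium envelopes carry before/after/op/ts fields.
change_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("op", StringType()),   # c = insert, u = update, d = delete
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")       # placeholder
    .option("subscribe", "mysql.mydatabase.mytable")         # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), change_schema).alias("e"))
    .select("e.*")
)

def upsert_batch(batch_df, batch_id):
    # MERGE each micro-batch so updates and deletes are applied, not just appended.
    batch_df.createOrReplaceTempView("changes")
    batch_df.sparkSession.sql("""
        MERGE INTO lake.prod_db.my_table t
        USING changes c ON t.id = c.id
        WHEN MATCHED AND c.op = 'd' THEN DELETE
        WHEN MATCHED THEN UPDATE SET t.name = c.name
        WHEN NOT MATCHED AND c.op != 'd' THEN INSERT (id, name) VALUES (c.id, c.name)
    """)

(events.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/mytable")  # placeholder
    .start())
```

The foreachBatch-plus-MERGE pattern is what keeps updates and deletes in sync; a plain append-only sink would simply accumulate change events.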
Step 5: Monitoring & Optimization
- Validate Data Sync – Compare row counts & schemas between MySQL and Iceberg (a small sketch follows these steps).
- Tune Performance – Optimize Kafka retention, batch commit frequency, and file compaction to improve efficiency.
- Continuous Maintenance – Manually manage Kafka, Debezium, Flink/Spark, and schema evolution.
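Here is a small sketch of the validation and compaction steps above, with placeholder connection details and table names; rewrite_data_files is Iceberg's built-in compaction procedure, available when the SQL extensions are enabled.

```python
# Placeholder connection details and table names throughout.
mysql_count = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql.example.com:3306/mydatabase")
    .option("dbtable", "mytable")
    .option("user", "readonly_user")
    .option("password", "********")
    .load()
    .count()
)
iceberg_count = spark.table("lake.prod_db.my_table").count()
print(f"MySQL rows: {mysql_count} | Iceberg rows: {iceberg_count}")  # counts may lag briefly during CDC

# Compact the small files that frequent streaming commits leave behind.
spark.sql("CALL lake.system.rewrite_data_files(table => 'prod_db.my_table')")
```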
Limitations of the Manual Method
- High Complexity & Development Effort – Requires expertise in MySQL binary logs, Debezium, Kafka, and Iceberg. Setup can take weeks and demands extensive testing.
- Heavy Maintenance Burden – You must monitor and troubleshoot Kafka brokers, CDC connectors, and infrastructure 24/7. Scaling requires constant tuning.
- Slower Data Updates – Achieving real-time replication is challenging without careful Kafka and Flink optimizations. Delays can occur if tuning isn’t perfect.
- No Automatic Schema Handling – Schema changes in MySQL aren’t automatically applied to Iceberg, requiring manual updates to schemas and transformation logic.
- Lack of UI & Debugging Tools – Troubleshooting data issues is difficult with fragmented logs across MySQL, Kafka, and processing jobs. There is no single-pane monitoring.
- Hidden Costs – Open-source tools may be free, but infrastructure costs (Kafka, Flink, storage) and engineering time add up, making this approach expensive in the long run.
Estuary Flow provides a more straightforward, scalable, and cost-effective alternative for most use cases!
Use Cases for MySQL to Iceberg Replication
Replicating MySQL to Apache Iceberg isn’t just a technical upgrade—it enables real-time analytics, scalability, and cost savings across industries. Here are some key use cases where MySQL to Iceberg replication delivers measurable benefits.
1. Real-Time BI with MySQL & Iceberg
- Use Case: To optimize decision-making, an e-commerce company needs real-time insights into customer behavior, sales trends, and inventory levels.
- How It Helps:
- MySQL struggles with complex analytical queries, slowing down performance.
- Iceberg enables faster reporting and BI dashboards using Trino, Presto, or Spark.
- Supports real-time CDC, ensuring instant data availability for analytics teams.
2. Offloading OLTP Workloads & Reducing MySQL Load
- Use Case: A fintech company handling millions of daily transactions must offload analytical queries from MySQL to avoid slowing down operational systems.
- How It Helps:
- Iceberg enables running heavy analytical queries on historical transaction data without impacting MySQL performance.
- Supports time-travel queries, allowing financial teams to analyze historical records efficiently.
- Scales effortlessly to handle massive financial datasets while maintaining data integrity.
3. Machine Learning & AI Pipelines
- Use Case: A healthcare provider wants to use AI models to predict patient readmission rates based on historical patient data.
- How It Helps:
- Iceberg provides a scalable data lake for storing large ML training datasets.
- Supports batch and streaming data ingestion, ensuring ML models remain up-to-date.
- Enables seamless data transformation for feature engineering before training.
Conclusion: The Best Way to Sync MySQL Data to Apache Iceberg
Replicating MySQL to Apache Iceberg unlocks scalable, cost-efficient, and high-performance analytics—but how you move the data matters.
- The manual approach offers complete control but is complex, time-consuming, and requires ongoing maintenance with Kafka, Debezium, and Flink.
- Estuary Flow simplifies the process with real-time CDC, no-code setup, automated schema evolution, and seamless scalability, making it the best choice for fast, reliable data pipelines.
Which Method Should You Choose?
1️⃣ Use Estuary Flow for a hassle-free, automated MySQL-to-Iceberg pipeline with real-time updates and minimal setup.
2️⃣ Go manual only if you require highly customized CDC pipelines and have a dedicated data engineering team to manage them.
Next Steps
Try Estuary Flow – Sign up and configure your first pipeline.
About the author
Emily is a software engineer and technical content creator with an interest in developer education. She has held Developer Relations roles in FinTech and is always learning something new.