
Google BigQuery to Iceberg Tables: Best Methods and Tools

This guide explains how to move data from BigQuery to Apache Iceberg, covering automated sync options, manual export workflows, and key considerations for analytics and lakehouse use cases.


You can move data from Google BigQuery to Apache Iceberg using two practical approaches: an automated pipeline that continuously syncs data, or a manual export-and-load workflow using native BigQuery exports and Apache Spark. The right method depends on how frequently your data changes, how much operational complexity you can support, and whether Iceberg tables need to stay up to date over time.

Apache Iceberg is often used alongside BigQuery to reduce long-term storage costs, enable multi-engine analytics, and support large-scale historical analysis. Estuary is the Right-Time Data Platform that allows teams to automate data movement from BigQuery into Iceberg tables, with control over when data moves, including near real-time or batch delivery.

Key Takeaways

  • There are two common ways to move data from BigQuery to Iceberg: an automated integration using Estuary, or a manual workflow using BigQuery exports and Apache Spark.

  • Automated pipelines are best for ongoing analytics and production workloads, where Iceberg tables must remain synchronized with BigQuery data over time.

  • Manual export-based approaches are better suited for one-time migrations or archival use cases, but require more engineering effort and infrastructure management.

  • Apache Iceberg enables capabilities that complement BigQuery, including open table storage, schema evolution, time travel, and support for multiple query engines.

  • The choice depends on data freshness, cost predictability, and operational ownership, not just dataset size.

BigQuery to Apache Iceberg Data Movement Options: How to Choose the Right Method

There are multiple ways to move data from Google BigQuery to Apache Iceberg, but in practice most teams choose between an automated incremental pipeline or a manual export-and-load workflow. The right approach depends on how frequently your BigQuery data changes, how fresh Iceberg data needs to be, and how much operational complexity your team can support.

In most real-world architectures, BigQuery is not fully replaced. Instead, data is copied or synchronized into Iceberg to support long-term storage, multi-engine analytics, or cost optimization. For this reason, it is more accurate to think in terms of data movement or offloading, rather than a one-time migration.

BigQuery to Iceberg: Method Comparison

| Criteria | Automated Data Movement (Estuary) | Manual Export and Load (BigQuery + Spark) |
|---|---|---|
| Primary use case | Ongoing data synchronization | One-time or periodic data offloading |
| Data freshness | Incremental, poll-based (near real-time to batch) | Batch only |
| Setup effort | Low | Medium to high |
| Ongoing maintenance | Minimal | Requires scripting and job management |
| Schema changes | Automatically propagated | Manual handling required |
| Infrastructure required | Managed by platform | Spark environment required |
| Failure recovery | Built-in state tracking | Custom retry logic required |
| Best suited for | Production analytics and lakehouse pipelines | Backfills and archival use cases |

When to Use Automated BigQuery to Iceberg Data Movement

An automated approach is best when BigQuery tables change over time and Iceberg tables are expected to stay up to date without repeated manual exports.

This pattern is common when:

  • BigQuery is used for ingestion or transformation
  • Iceberg serves as a long-term analytical or lakehouse layer
  • Data must be queried by multiple engines such as Spark, Trino, or Flink
  • Teams want predictable operations without maintaining export scripts

Estuary supports this pattern by periodically querying BigQuery tables, using cursor columns to detect new or updated rows, and materializing those results into Iceberg tables on object storage. This approach aligns with BigQuery’s execution model and avoids assumptions about log-based CDC, which BigQuery does not provide.

When Manual Export and Spark-Based Loading Makes Sense

A manual export-based approach is more appropriate for controlled, batch-oriented scenarios.

This method is typically used when:

  • Data needs to be moved once or infrequently
  • BigQuery tables are relatively static
  • Teams already operate Spark infrastructure
  • Full control over export and load logic is required

In this workflow, BigQuery data is exported to cloud storage using native SQL commands and then loaded into Iceberg tables using Apache Spark. While flexible, this approach requires additional effort to manage schema evolution, retries, and job orchestration.

Key Factors to Consider Before Choosing a Method

Before deciding how to move data from BigQuery to Iceberg, consider:

  • Update frequency: Frequently changing tables benefit from automated incremental sync.
  • Operational ownership: Automated pipelines reduce long-term maintenance.
  • Cost predictability: Repeated full exports can increase BigQuery scan and egress costs.
  • Long-term architecture: Iceberg is often part of a shared lakehouse rather than a replacement for BigQuery.

Suggested read: How to Load Data into Apache Iceberg

Best Tools to Move Data from BigQuery to Apache Iceberg

Several tools and approaches can be used to move data from Google BigQuery into Apache Iceberg tables. In practice, these options fall into two categories: managed data movement platforms that automate incremental sync, and custom export-and-load pipelines built using BigQuery and Spark.

The best tool depends on how often your BigQuery data changes, how fresh Iceberg data needs to be, and how much operational complexity your team is willing to manage.

Estuary (Automated Incremental Data Movement from BigQuery to Iceberg)

Estuary is the Right-Time Data Platform used to automate data movement from BigQuery into Apache Iceberg tables stored on object storage. Instead of relying on log-based CDC, Estuary’s BigQuery connector works by periodically executing queries against BigQuery tables and capturing new or updated rows into collections.

Incrementality is achieved by configuring cursor columns, such as update timestamps or monotonically increasing IDs. On each poll, Estuary queries only rows with cursor values greater than the last observed value, making it well suited for large analytical tables that change over time.

Estuary is best suited for:

  • Ongoing synchronization between BigQuery and Iceberg
  • Lakehouse architectures where Iceberg serves as long-term storage
  • Teams that want to avoid maintaining custom export jobs
  • Use cases where near real-time to batch updates are acceptable

Key characteristics:

  • Poll-based incremental data capture aligned with BigQuery’s execution model
  • Automatic schema propagation from BigQuery to Iceberg
  • Writes to Apache Iceberg tables on object storage using a catalog such as AWS Glue or a REST catalog
  • Minimal operational overhead compared to custom pipelines

Because Estuary manages state, polling, and delivery internally, it is commonly used when Iceberg tables must remain accurate and continuously updated without manual intervention.

Apache Spark with Native BigQuery Exports (Manual Pipeline)

Apache Spark is often used when teams want full control over how BigQuery data is exported and loaded into Iceberg.

In this approach:

  • Data is exported from BigQuery using the EXPORT DATA SQL command, typically in Parquet format
  • Exported files are stored in cloud storage such as Google Cloud Storage
  • Spark reads the exported files and writes them into Iceberg tables using an Iceberg catalog

This method is best suited for:

  • One-time migrations or historical backfills
  • Periodic archival of BigQuery data
  • Teams with existing Spark infrastructure and expertise

While flexible, this approach requires engineers to manage:

  • Export scheduling
  • Schema evolution
  • Spark job configuration
  • Failure handling and retries

As a result, it is generally not ideal for keeping Iceberg tables continuously in sync with BigQuery.

BigQuery-Native and Cloud-Native Tooling (Limited Scope)

BigQuery-native tools, such as scheduled queries and export jobs, can be combined with cloud services to move data into Iceberg. These tools are typically used as building blocks rather than complete solutions.

They are best suited for:

  • Simple batch exports
  • Infrequent data movement
  • Low operational requirements

However, these approaches still require Spark or similar engines to write Iceberg tables and do not provide built-in handling for incremental updates or schema changes.

Tool Selection Summary

For most teams:

  • Use Estuary when you need an automated, low-maintenance way to keep Iceberg tables synchronized with BigQuery data.
  • Use Spark-based pipelines when performing one-time data movement or when full control over the export and load process is required.
  • Use BigQuery-native exports alone only for simple batch scenarios where Iceberg is not continuously updated.

Step-by-Step: Automated BigQuery to Apache Iceberg Data Movement Using Estuary

This method uses Estuary to automate incremental data movement from Google BigQuery into Apache Iceberg tables stored on object storage. It is best suited for teams that want Iceberg tables to stay up to date over time without building and maintaining custom export scripts or Spark jobs.

Estuary’s BigQuery connector operates by periodically executing queries against BigQuery tables and capturing new or updated rows using cursor columns, which aligns with BigQuery’s execution model and avoids assumptions about log-based CDC.

Prerequisites

Before setting up the pipeline, ensure the following are in place:

  • An active Estuary account
  • A Google Cloud project with BigQuery enabled
  • A Google Cloud service account with the following roles:
    • BigQuery User
    • BigQuery Data Viewer
  • A service account key in JSON format
  • An object storage bucket for Iceberg tables (such as Amazon S3)
  • An Iceberg catalog, such as AWS Glue or a REST-based catalog
  • Credentials with permission to write to the Iceberg storage and catalog

Step 1: Configure BigQuery as the Source

  • Log in to your Estuary account.
  • From the left navigation pane of the Estuary page, click the Sources option.
  • Click the + NEW CAPTURE button on the Sources page and search for BigQuery using the Search connectors field.
  • When the BigQuery connector appears in the search results, click the connector’s Capture button.
  • You will be redirected to the BigQuery connector configuration page; enter a unique source capture name in the Name field under the Capture Details section.
  • Expand the Endpoint Config section and fill in the mandatory fields, including:
    • Project ID: The Google Cloud project ID associated with your BigQuery dataset(s).
    • Service Account JSON: The service account key, in JSON format, used to authenticate with BigQuery.
    • Dataset: The name of your BigQuery dataset.

For each selected table, configure cursor columns to enable incremental data capture. Common cursor choices include update timestamps or monotonically increasing identifiers. When a cursor is defined, Estuary queries only rows with cursor values greater than the last observed value on each polling cycle.

Once the configuration is complete, save and publish the capture.

Step 2: Configure Apache Iceberg as the Destination

Estuary Flow provides Amazon S3-managed Apache Iceberg tables as a destination for faster analytics. Once your BigQuery data capture is complete, a pop-up window with the capture details appears. Click the MATERIALIZE COLLECTIONS option in this window to begin configuring the destination end of the data pipeline.

Alternatively, on the Estuary Flow dashboard, click Destinations > + NEW MATERIALIZATION. On the Create Materialization page, begin your setup using the following steps:

  • Search for Iceberg in the Search connectors field; an Amazon S3 Iceberg connector will appear. Using this connector, you can materialize delta updates from Flow collections into Apache Iceberg tables. The connector leverages Amazon S3 for object storage and AWS Glue as the catalog.

Delta updates contain only the changes made to the data since the previous update. These updates are batched within Flow, converted into Parquet files, and then appended to the Iceberg tables according to your defined schedule.

  • Click the connector’s Materialization button to proceed with your destination configuration.
  • On the Create Materialization page, fill all the necessary fields, including:
    • Name: A unique name for your materialization.
    • AWS Access Key ID: The Access Key ID to access AWS services.
    • AWS Secret Access Key: A secret access key to access AWS services.
    • Bucket: The name of the S3 bucket into which the data will be written.
    • Region: The AWS region where your S3 bucket exists.
    • Namespace: A namespace for bound collection tables.
  • In the Catalog section, you can choose either REST or AWS GLUE as the Iceberg catalog. 
    • The AWS Glue Catalog requires that your IAM user have the permissions to access AWS Glue.
    • The REST Catalog requires the following additional configuration:
      • The Catalog URI
      • Credentials to connect to the catalog
      • The name of the Warehouse to connect with
  • Under the Source Collections section, verify that the captured data from the BigQuery source is automatically linked to your materialization. If it is not, you can manually link the capture by clicking the SOURCE FROM CAPTURE button.
  • Click NEXT > SAVE AND PUBLISH to finish your destination configuration.

Estuary will begin materializing data from BigQuery into Iceberg tables according to the configured polling intervals.

How Incremental Updates Work

Estuary captures BigQuery data incrementally by:

  • Running SELECT queries against source tables at configurable intervals
  • Filtering rows using cursor columns
  • Tracking the highest observed cursor value between polls
  • Writing only new or updated rows to Iceberg tables

This approach allows Iceberg tables to remain synchronized with BigQuery data without repeatedly exporting full tables.
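
To illustrate the pattern, the sketch below shows the kind of cursor-filtered query a poll-based capture issues against BigQuery. This is a conceptual example using the google-cloud-bigquery Python client, not Estuary's internal implementation; the `orders` table and `updated_at` cursor column are hypothetical.

```python
# Conceptual sketch of cursor-based polling, not Estuary's internal code.
# Assumes a hypothetical `orders` table with an `updated_at` cursor column.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

def poll_new_rows(last_cursor_value):
    """Fetch only rows whose cursor value is greater than the last one observed."""
    query = """
        SELECT *
        FROM `your-project-id.your_dataset.orders`
        WHERE updated_at > @cursor
        ORDER BY updated_at
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("cursor", "TIMESTAMP", last_cursor_value)
        ]
    )
    rows = list(client.query(query, job_config=job_config).result())
    # The highest observed cursor value becomes the starting point for the next poll.
    new_cursor = max((row["updated_at"] for row in rows), default=last_cursor_value)
    return rows, new_cursor
```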

How Data Is Written to Iceberg

Captured data is:

  • Converted into columnar files
  • Written to object storage
  • Committed to Iceberg tables using snapshot-based metadata

Iceberg features such as schema evolution and time travel remain available, and tables can be queried by engines such as Spark, Trino, Flink, or Redshift Spectrum.
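
As a quick illustration of snapshot-based metadata and time travel, the sketch below lists a table's snapshots and queries an earlier one. It assumes a Spark session with an Iceberg catalog named `iceberg` and the `analytics.your_table` identifiers used elsewhere in this guide; all names are placeholders.

```python
# Sketch: listing snapshots and time-traveling to one of them with Spark SQL.
# Assumes the `iceberg` catalog and `analytics.your_table` identifiers used
# in this guide; both are placeholders for your own names.
snapshots = spark.sql(
    "SELECT snapshot_id, committed_at FROM iceberg.analytics.your_table.snapshots"
)
snapshots.show()

# Pick a snapshot ID from the listing (illustrative) and query the table as of that snapshot.
snapshot_id = snapshots.first()["snapshot_id"]
historical_df = spark.sql(
    f"SELECT * FROM iceberg.analytics.your_table VERSION AS OF {snapshot_id}"
)
historical_df.show(5)
```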

When This Method Is the Best Choice

Use this approach when:

  • BigQuery tables change regularly
  • Iceberg is used as a long-term analytical or lakehouse layer
  • You want to minimize operational overhead
  • Incremental updates are sufficient (near real-time to batch)

This method is commonly used for analytics pipelines, historical data offloading, and shared data lake architectures.

Step-by-Step: Manual BigQuery to Apache Iceberg Data Movement Using BigQuery Exports and Apache Spark

This method moves data from Google BigQuery to Apache Iceberg using a batch-based export and load workflow. It is best suited for one-time data movement, historical backfills, or periodic archival use cases where continuous synchronization is not required.

Unlike automated pipelines, this approach requires manual orchestration and infrastructure management, but provides full control over how data is exported, transformed, and written to Iceberg.

Prerequisites

Before starting, ensure you have:

  • A Google Cloud project with BigQuery enabled
  • Permissions to run BigQuery export jobs
  • A cloud storage bucket (typically Google Cloud Storage) for exported files
  • A Spark environment (Dataproc, Databricks, EMR, or self-managed Spark)
  • Apache Iceberg libraries compatible with your Spark version
  • An Iceberg catalog, such as:
    • AWS Glue Catalog
    • Hadoop catalog
    • REST catalog

Step 1: Export Data from BigQuery to Cloud Storage

BigQuery provides a native EXPORT DATA statement that allows you to export query results directly to cloud storage in columnar formats such as Parquet, which are compatible with Iceberg.

Example export command:

```sql
EXPORT DATA OPTIONS(
  uri='gs://your-bucket/bigquery-export/your_table_*.parquet',
  format='PARQUET',
  overwrite=true
) AS
SELECT * FROM dataset_name.table_name;
```

This operation:

  • Executes a snapshot query against the table
  • Writes the results as Parquet files to Google Cloud Storage
  • Produces a static view of the data at export time

Each export job handles one query result, so multiple tables require separate export operations.
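
Because each export job handles a single query result, teams offloading several tables often script the exports. The sketch below loops over a hypothetical list of tables using the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are placeholders.

```python
# Sketch: running one EXPORT DATA job per table with the BigQuery Python client.
# All identifiers below are placeholders for your own project and tables.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
tables = ["orders", "customers", "events"]  # hypothetical tables to offload

for table in tables:
    export_sql = f"""
        EXPORT DATA OPTIONS(
          uri='gs://your-bucket/bigquery-export/{table}_*.parquet',
          format='PARQUET',
          overwrite=true
        ) AS
        SELECT * FROM `your-project-id.dataset_name.{table}`;
    """
    # Each export runs as a regular query job; result() blocks until it completes.
    client.query(export_sql).result()
    print(f"Exported {table}")
```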

Step 2: Configure Apache Spark with Iceberg

To load data into Iceberg, Spark must be configured with Iceberg extensions and a catalog.

Example Spark configuration:

```bash
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hadoop \
  --conf spark.sql.catalog.iceberg.warehouse=s3://your-bucket/iceberg-warehouse
```

You can replace the Hadoop catalog with AWS Glue or a REST catalog, depending on your environment.
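
If you prefer configuring Spark in code rather than on the spark-sql command line, the same settings can be applied when building a PySpark session. The sketch below mirrors the CLI example above; the runtime package version and warehouse path are assumptions that may need adjusting for your Spark version and storage layout.

```python
# Sketch: the same Iceberg catalog settings applied programmatically in PySpark.
# The package version and warehouse path mirror the CLI example and may need
# to be adjusted for your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bigquery-to-iceberg-load")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hadoop")
    .config("spark.sql.catalog.iceberg.warehouse", "s3://your-bucket/iceberg-warehouse")
    .getOrCreate()
)
```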

Step 3: Load Exported Data into Iceberg Tables

Spark can read the exported Parquet files directly from cloud storage without downloading them locally.

Example Spark SQL or PySpark workflow:

```python
# Create the target namespace in the Iceberg catalog, then load the exported
# Parquet files and write them as an Iceberg table.
spark.sql("CREATE DATABASE IF NOT EXISTS iceberg.analytics")

df = spark.read.parquet("gs://your-bucket/bigquery-export/your_table_*.parquet")

df.writeTo("iceberg.analytics.your_table") \
    .using("iceberg") \
    .createOrReplace()
```

At this stage:

  • The data is written into an Iceberg table
  • Iceberg metadata tracks snapshots and file locations
  • The table becomes queryable by Spark and other compatible engines

Optional optimizations include:

  • Defining explicit schemas instead of relying on inference
  • Applying partition transforms such as days(timestamp_column), as shown in the sketch after this list
  • Running compaction jobs to merge small files
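
As a sketch of the partition-transform optimization above, the load can be rewritten to partition the Iceberg table by day. The `event_ts` column is a placeholder for a timestamp column in your data.

```python
# Sketch: writing the Iceberg table with a daily partition transform instead of
# the default unpartitioned layout. `event_ts` is a hypothetical timestamp column.
from pyspark.sql.functions import days

df = spark.read.parquet("gs://your-bucket/bigquery-export/your_table_*.parquet")

(
    df.writeTo("iceberg.analytics.your_table")
    .using("iceberg")
    .partitionedBy(days("event_ts"))
    .createOrReplace()
)
```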

Step 4: Validate and Maintain the Iceberg Table

After loading data:

  • Validate row counts against BigQuery, as shown in the sketch after this list
  • Run sample queries to confirm schema correctness
  • Monitor file sizes and metadata growth
  • Schedule compaction if data volumes are large
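
A minimal validation sketch is shown below. It compares row counts between the BigQuery source and the Iceberg copy, assuming the Spark session and table identifiers from the earlier steps and the google-cloud-bigquery client; all names are placeholders.

```python
# Sketch: comparing row counts between the BigQuery source and the Iceberg copy.
# Assumes the identifiers used earlier in this guide; adjust them to your environment.
from google.cloud import bigquery

bq_client = bigquery.Client(project="your-project-id")
bq_count = list(
    bq_client.query(
        "SELECT COUNT(*) AS n FROM `your-project-id.dataset_name.table_name`"
    ).result()
)[0]["n"]

iceberg_count = spark.table("iceberg.analytics.your_table").count()

print(f"BigQuery rows: {bq_count}, Iceberg rows: {iceberg_count}")
if bq_count != iceberg_count:
    raise ValueError("Row counts do not match; re-check the export and load steps.")
```

Note that counts can legitimately differ if the BigQuery table changed after the export snapshot was taken.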

Because this method is batch-based, any new data added to BigQuery after the export requires a new export and load cycle.

Limitations of the Manual Export-and-Load Approach

This approach introduces several trade-offs:

  • Batch-only behavior: Iceberg tables are not continuously updated
  • Manual orchestration: Scheduling, retries, and monitoring must be handled by engineers
  • Schema drift risk: Changes in BigQuery schemas require manual updates
  • Higher operational overhead: Spark infrastructure must be managed

For these reasons, this method is typically used for migrations or archival workflows rather than production synchronization.

Operational Considerations and Best Practices for Moving Data from BigQuery to Apache Iceberg

Moving data from Google BigQuery to Apache Iceberg is not just a matter of exporting and loading tables. To ensure correctness, performance, and long-term maintainability, teams need to account for cost behavior, schema evolution, partitioning strategy, and ongoing table maintenance.

The following best practices apply whether you use an automated pipeline or a manual export-based approach.

Understand BigQuery Scan and Export Costs

BigQuery pricing is largely driven by data scanned during query execution. Exporting data using EXPORT DATA or incremental polling queries still incurs scan costs.

Best practices:

  • Avoid frequent full-table exports for large tables
  • Use incremental approaches when tables change regularly
  • Limit exports to required columns rather than SELECT * when possible
  • Be mindful of repeated scans caused by poorly chosen cursor columns

Automated incremental pipelines typically reduce scan costs over time compared to repeated full exports.
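
Before running a large export or polling query, a dry run can show how many bytes it would scan, which translates directly into query cost. The sketch below uses the google-cloud-bigquery client; the table and columns are placeholders.

```python
# Sketch: estimating scanned bytes with a BigQuery dry run before exporting.
# A dry run reports the bytes a query would process without executing it.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT order_id, updated_at FROM `your-project-id.dataset_name.orders`",
    job_config=job_config,
)
print(f"This query would scan {job.total_bytes_processed / 1e9:.2f} GB")
```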

Choose Cursor Columns Carefully for Incremental Data Movement

When moving data incrementally from BigQuery, cursor columns determine which rows are captured on each run.

Recommended cursor characteristics:

  • Monotonically increasing values
  • Update timestamps when available
  • Creation timestamps for append-only tables
  • Composite cursors only when necessary

Avoid:

  • Non-deterministic columns
  • Columns that can decrease or be updated retroactively
  • Cursors that require scanning large portions of the table repeatedly

Well-chosen cursors significantly reduce query cost and improve reliability.

Plan for Schema Evolution Early

BigQuery schemas often change over time as new fields are added or data types evolve.

Best practices:

  • Validate how new columns are handled in Iceberg
  • Avoid incompatible type changes such as string-to-integer
  • Track nullable versus non-nullable fields
  • Monitor schema changes in BigQuery datasets proactively

Iceberg supports schema evolution, but not all changes are backward compatible without careful planning.
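
On the Iceberg side, additive changes can usually be applied with simple DDL. The sketch below assumes the Spark setup from the manual method and hypothetical column names; the second statement shows a safe type widening, which Iceberg supports, in contrast to incompatible changes such as string-to-integer.

```python
# Sketch: additive schema evolution when a new column appears in the BigQuery source.
# Column names and types are hypothetical; requires the Iceberg Spark extensions.
spark.sql("""
    ALTER TABLE iceberg.analytics.your_table
    ADD COLUMN discount_code STRING
""")

# Widening a numeric type (e.g., int -> bigint) is a compatible change.
spark.sql("""
    ALTER TABLE iceberg.analytics.your_table
    ALTER COLUMN order_count TYPE bigint
""")
```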

Design an Effective Partitioning Strategy in Iceberg

Partitioning has a major impact on Iceberg query performance and storage efficiency.

Recommended practices:

  • Use low-cardinality columns such as dates or regions
  • Prefer Iceberg partition transforms like days(timestamp_column)
  • Avoid over-partitioning, which increases metadata overhead
  • Base partitioning on query patterns, not source table structure

Iceberg’s hidden partitioning allows you to change partition layouts later without rewriting data.
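
For example, a partition spec can be evolved with DDL alone; existing files keep their old layout while new writes use the new spec. The sketch below assumes the Spark setup from the manual method and hypothetical column names.

```python
# Sketch: evolving the partition spec without rewriting existing data files.
# Requires Iceberg's Spark SQL extensions; `event_ts` is a placeholder column and
# `region` is assumed to be an existing partition field.
spark.sql("""
    ALTER TABLE iceberg.analytics.your_table
    ADD PARTITION FIELD days(event_ts)
""")

spark.sql("""
    ALTER TABLE iceberg.analytics.your_table
    DROP PARTITION FIELD region
""")
```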

Manage File Sizes and Compaction

Incremental data movement often produces many small files, which can degrade query performance.

Best practices:

  • Periodically compact small files into larger ones
  • Target file sizes in the 256 MB to 512 MB range for analytics
  • Schedule compaction as a background maintenance task
  • Monitor snapshot and metadata growth over time

Regular compaction improves scan efficiency and reduces planning overhead.
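
Iceberg ships a Spark stored procedure for compaction. The sketch below assumes the `iceberg` catalog from the manual method and targets roughly 512 MB files, matching the range above; the table name is a placeholder.

```python
# Sketch: compacting small files with Iceberg's rewrite_data_files procedure.
# Assumes the `iceberg` catalog configured earlier; the table name is a placeholder.
spark.sql("""
    CALL iceberg.system.rewrite_data_files(
        table => 'analytics.your_table',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```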

Account for Cross-Cloud Data Movement

In many architectures, BigQuery data is exported to Google Cloud Storage while Iceberg tables are stored on Amazon S3 or another object store.

Considerations:

  • Network egress costs between cloud providers
  • Latency for large exports
  • Security and credential management across clouds

For large datasets, cross-cloud data movement can significantly impact cost and performance.

Validate Data After Movement

After moving data into Iceberg, validation is essential.

Recommended checks:

  • Compare row counts between BigQuery and Iceberg
  • Validate sample records for type and value correctness
  • Run representative analytical queries to confirm performance
  • Monitor query plans and scanned file sizes

Early validation prevents subtle issues from propagating into downstream analytics.

Monitor and Maintain Iceberg Tables Over Time

Once data is in Iceberg, ongoing maintenance is required.

Best practices:

  • Monitor snapshot growth and metadata size
  • Clean up expired snapshots when appropriate
  • Review access controls on object storage and catalog services
  • Track query performance across different engines

Iceberg tables are durable, but they benefit from periodic housekeeping.
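
For example, old snapshots can be cleaned up with Iceberg's expire_snapshots procedure. The sketch below assumes the `iceberg` catalog from the manual method; the 7-day retention window is illustrative, not a recommendation.

```python
# Sketch: expiring snapshots older than a retention window to limit metadata growth.
# The 7-day cutoff is an illustrative choice; pick a window that fits your recovery needs.
from datetime import datetime, timedelta

cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")

spark.sql(f"""
    CALL iceberg.system.expire_snapshots(
        table => 'analytics.your_table',
        older_than => TIMESTAMP '{cutoff}'
    )
""")
```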

Conclusion

Moving data from Google BigQuery to Apache Iceberg is a common pattern for teams that want more control over long-term storage costs, support multiple analytics engines, or build an open lakehouse architecture. In most cases, this process is not a full migration away from BigQuery, but a deliberate way to offload or synchronize data into Iceberg for broader analytical use.

There are two practical approaches to achieve this. An automated, incremental pipeline is best suited for production analytics and long-lived data platforms where Iceberg tables need to stay up to date as BigQuery data changes. A manual export-and-load workflow is more appropriate for one-time data movement, historical backfills, or archival use cases where continuous updates are not required.

For teams that want to minimize operational overhead and avoid maintaining custom export jobs, automated data movement provides a more reliable and scalable solution. For teams that need full control and already operate Spark infrastructure, manual exports remain a viable option. The right choice depends on data freshness requirements, cost considerations, and long-term ownership of the pipeline.

To understand how Estuary automates data integration and enhances your business productivity, feel free to connect with Estuary experts.


FAQs

    How do I move data from BigQuery to Apache Iceberg?

    Data can be moved from Google BigQuery to Apache Iceberg using either an automated incremental pipeline or a manual batch-based workflow. An automated approach uses a data movement platform to periodically query BigQuery tables, capture new or updated rows, and write them into Iceberg tables on object storage. A manual approach involves exporting data from BigQuery using the EXPORT DATA command and then loading the exported files into Iceberg tables using Apache Spark. The automated method is better suited for ongoing analytics, while the manual method is typically used for one-time or infrequent data movement.

    Which method is best for keeping Iceberg tables in sync with BigQuery?

    The best approach depends on how often the data changes and how the Iceberg tables are used. For production analytics and shared lakehouse environments, an automated incremental pipeline is usually the best option because it keeps Iceberg tables synchronized with BigQuery over time and reduces operational effort. For historical backfills or archival use cases, a manual export-and-load process can be sufficient, though it requires more engineering effort to manage.

    Why use Apache Iceberg alongside BigQuery?

    Iceberg is often chosen when teams want to separate storage from compute, reduce long-term storage costs, support multiple query engines, or enable features such as schema evolution and time travel. BigQuery remains a strong option for serverless analytics and interactive querying, but Iceberg provides greater flexibility for modern data lake and lakehouse architectures.
