
Google BigQuery to Iceberg Tables: Best Methods and Tools

Simplify data movement from Google BigQuery to Iceberg tables using effective techniques for efficiency, cost savings, and scalability.


Google BigQuery is a fully managed data warehouse that lets your business run fast, SQL-based queries on large datasets. However, as data volumes grow, you may run into certain difficulties with BigQuery, including rising storage costs for historical or infrequently used data and limitations when running complex analytical queries.

Moving data from BigQuery to Apache Iceberg tables can improve query performance even with vast amounts of historical or current data. Apache Iceberg is a high-performance table format built for efficient data storage and optimized to manage analytical workloads at any scale across distributed systems.

In this guide, we’ll show you how to move data from Google BigQuery to Apache Iceberg efficiently using two methods: the Automated Method with Estuary Flow and the Manual Method with custom scripts. These approaches cater to different technical needs, offering both ease of use and flexibility.

A Quick Overview of BigQuery


BigQuery is a multi-cloud, serverless, AI-powered data analytics platform. It helps you transform large datasets into meaningful business insights. BigQuery provides a single workspace consisting of SQL, a notebook, and a natural language-based canvas interface. With this unified platform, you can streamline data workflows from integration and transformation to visualization and machine learning model development and deployment.

One of the main strengths of BigQuery is its real-time querying capabilities through continuous queries. These queries are the SQL statements that allow you to transform, replicate, and analyze incoming data in BigQuery instantly. This ensures your insights are always current and actionable.

In terms of data security, BigQuery supports managing various data types with fine-grained access controls, ensuring secure and efficient data handling. Its built-in governance features further enhance this by providing robust tools to monitor, control, and enforce policies across your datasets.

A Quick Overview of Apache Iceberg


Apache Iceberg is an open-source table format for handling large-scale datasets, including petabyte-scale data. With an open specification, Iceberg ensures compatibility with multiple data processing engines, such as Apache Spark, Trino, Flink, Presto, Hive, and Impala. These engines help you read and write to the same table simultaneously using familiar SQL syntax, preventing data inconsistencies or corruption.

Here are a few other key features of Apache Iceberg:

  • Full Schema Evolution: Apache Iceberg supports schema evolution, allowing you to add, drop, rename, update, or reorder columns efficiently, ensuring existing data remains unaffected. This approach reduces the need to rewrite the entire table, saving time and computational resources.
  • Time Travel: The time-travel feature helps you query and analyze historical versions of a table. You can access historical data snapshots and run reproducible queries, so you get consistent results even if the data has since changed (see the short sketch after this list).
  • Flexible SQL Support: Iceberg supports flexible SQL commands, allowing you to merge new data, update existing rows, and perform deletes. It offers two options for optimizing performance: rewriting data files to enhance read performance or using delete deltas for quicker updates.
  • Hidden Partitioning: Hidden partitioning takes over the complex and error-prone task of generating partition values for rows in a table. It also optimizes query performance by automatically skipping unnecessary partitions during data access, so you do not need to manage partition details or apply extra filters to your queries. You can also update table layouts dynamically if data or query patterns change.
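
To make the schema evolution and time travel features above concrete, here is a minimal spark-shell sketch. It assumes an Iceberg catalog named local is already configured (as in the manual method later in this guide) and uses a hypothetical table local.db.events; replace these names and the placeholder snapshot ID with your own.

scala
// Schema evolution: add a column in place; existing data files are not rewritten.
spark.sql("ALTER TABLE local.db.events ADD COLUMNS (country string)")

// Inspect the snapshots Iceberg has recorded for the table.
spark.sql("SELECT snapshot_id, committed_at, operation FROM local.db.events.snapshots").show()

// Time travel: read the table as of a specific snapshot ID taken from the listing above.
spark.read
  .option("snapshot-id", 1234567890123456789L) // placeholder snapshot ID
  .format("iceberg")
  .load("local.db.events")
  .show()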

Suggested read: How to load Data into Apache Iceberg

2 Methods to Move Data From Google BigQuery to Apache Iceberg

To efficiently move data from BigQuery to Iceberg tables, you can use either of the following methods:

  • Automated Method: BigQuery to Iceberg Integration Using Estuary Flow
  • Manual Method: BigQuery to Iceberg Replication Using Custom Scripts and the Google Cloud Console

Method 1: Connecting BigQuery to Iceberg Using Estuary Flow

Estuary Flow is a real-time data integration platform that facilitates faster data movement and transformation. It provides an intuitive interface to develop and automate ETL/ELT pipelines in minutes. Using Estuary Flow, you can develop batch pipelines for analytical use cases or streaming pipelines for Ops and AI, all with millisecond latency.

Here are some of the features of the platform:

  • Change Data Capture (CDC): Estuary Flow supports end-to-end CDC, helping you track changes in the source schema and copy them to your destination in real time. With the CDC approach, you can stream data across platforms with sub-100 millisecond latency for real-time data transfers.
  • No-Code Connectors: With over 200 pre-built connectors, Estuary Flow allows you to automate ETL workflows across multiple sources and destinations. These connectors minimize manual intervention in the pipeline setup.
  • Multi-Cloud Deployment: Estuary Flow offers various deployment modes to meet your organization’s needs. The different modes include public deployment for hosted solutions, private deployment for full infrastructure control, and BYOC (Bring Your Own Cloud) for scalable cloud configurations.
  • Efficient Data Transformation: Using Estuary Flow, you can create and apply transformations for your ETL pipelines using Streaming SQL or TypeScript. With dbt integration, Estuary Flow enables you to transform the data right after the initial data sync.

Let’s see how you can move data from BigQuery to Iceberg using Estuary Flow. Before starting the data integration, ensure the following prerequisites are in place:

  • An active Estuary Flow account.
  • A Google Cloud Project with BigQuery enabled. 
  • A Google Cloud Service Account with the BigQuery User and BigQuery Data Viewer roles in your GCP project.
  • A Service Account Key for authentication. 
  • An Amazon S3 bucket to write files to Apache Iceberg tables.
  • An IAM user with read/write privileges to the S3 bucket.
  • The IAM user's access key and secret access key.

Step 1: Configure BigQuery as the Source

  • Log in to your Estuary Flow account.
  • From the left navigation pane of the Estuary page, click the Sources option.
  • Click the + NEW CAPTURE button on the Sources page and search for BigQuery using the Search connectors field.
  • When the BigQuery connector appears in the search results, click the connector’s Capture button.
  • You will be redirected to the BigQuery connector configuration page; enter a unique source capture name in the Name field under the Capture Details section.
  • Expand the Endpoint Config section and fill in the mandatory fields, including:
    • Project ID: The Google Cloud Project ID associated with the BigQuery dataset(s).
    • Service Account JSON: JSON credentials to authorize the service account. 
    • Dataset: The name of your BigQuery dataset.
  • After you specify all the required information, click NEXT > SAVE AND PUBLISH.

The above steps will help you complete the BigQuery source configuration successfully. This setup enables the connector to capture data from BigQuery and convert it into a data collection.

Step 2: Configure Amazon S3 Iceberg as the Destination

Estuary Flow provides Amazon S3-managed Apache Iceberg tables as a destination for faster analytics. Once your BigQuery data capture is complete, a pop-up window with the capture details appears. Click the MATERIALIZE COLLECTIONS option in this window to begin configuring the destination end of the data pipeline.

Alternatively, on the Estuary Flow dashboard, click Destinations > + NEW MATERIALIZATION. On the Create Materialization page, begin your setup using the following steps:

  • Search for Iceberg in the Search connectors field; an Amazon S3 Iceberg connector will appear. Using this connector, you can materialize delta updates from Flow collections into Apache Iceberg tables. The connector leverages Amazon S3 for object storage and AWS Glue as the catalog.

Delta updates refer to changes made to data since the previous modification. These updates are batched within Flow, transformed into Parquet files, and then appended to the Iceberg tables according to your defined schedule.
  • Click the connector’s Materialization button to proceed with your destination configuration.
  • On the Create Materialization page, fill in all the necessary fields, including:
    • Name: A unique name for your materialization.
    • AWS Access Key ID: The Access Key ID to access AWS services.
    • AWS Secret Access Key: A secret access key to access AWS services.
    • Bucket: The name of the S3 bucket into which you need to write the data.
    • Region: The AWS region where your S3 bucket exists.
    • Namespace: A namespace for bound collection tables.
  • In the Catalog section, you can choose either REST or AWS GLUE as the Iceberg catalog. 
    • The AWS Glue Catalog requires that your IAM user have the permissions to access AWS Glue.
    • The REST Catalog requires the following additional configuration:
      • The Catalog URI
      • Credentials to connect to the catalog
      • The name of the Warehouse to connect with
  • Under the Source Collections section, verify that the data captured from the BigQuery source has been automatically added to your materialization. If not, you can manually link the capture by clicking the SOURCE FROM CAPTURE button.
  • Click NEXT > SAVE AND PUBLISH to finish your destination configuration.

The Amazon S3 Iceberg connector will materialize data from your BigQuery Flow collection into your Iceberg table, completing the BigQuery to Iceberg integration.
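
Once the materialization is running, you can verify the output by pointing any Iceberg-compatible engine at the same AWS Glue catalog and S3 warehouse. The spark-shell snippet below is a minimal sketch, not part of the Estuary Flow setup itself; the catalog name glue, the warehouse path, and the namespace and table names are placeholder assumptions you should replace with your own values, and it expects the Iceberg Spark runtime and AWS bundle JARs on the classpath plus AWS credentials in your environment.

scala
import org.apache.spark.sql.SparkSession

// Hypothetical verification session: read the Iceberg table that was
// materialized into S3, using AWS Glue as the catalog.
val spark = SparkSession.builder()
  .appName("VerifyIcebergMaterialization")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.glue.warehouse", "s3://your-bucket/your-prefix") // placeholder bucket/prefix
  .getOrCreate()

// your_namespace is the Namespace from the materialization config and
// your_table is the collection-derived table name -- both placeholders.
spark.sql("SELECT * FROM glue.your_namespace.your_table LIMIT 10").show()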

Method 2: Connecting BigQuery to Apache Iceberg Using Custom Scripts and Google Cloud Console

Manually transferring data from Google BigQuery to Apache Iceberg without third-party connectors requires significant time and effort. It involves a series of configuration steps to extract data from BigQuery, standardize it, and load it into an Iceberg table.

Here’s how you can manually move data from BigQuery into an Iceberg table.

Prerequisites:

  • A GCP account with BigQuery enabled.
  • A local or cloud environment, such as a Hadoop cluster, to run Apache Iceberg.
  • Necessary permissions to read from BigQuery and write to the Iceberg table.
  • Apache Spark (or a similar Iceberg-compatible tool) installed to extract the exported data and load it into Iceberg.
  • JDK 8 or above installed.
  • Apache Iceberg configured by importing the Iceberg library into your Spark environment.

Step 1: Extract Data from BigQuery Using SQL

  1. Sign into your GCP account.
  2. Navigate to the BigQuery page from the Google Cloud console sidebar.
  3. In the BigQuery console, locate the dataset and table you want to export.
  4. Open the Query Editor and run the following SQL command to export your table to a Google Cloud Storage bucket in Parquet format, which is compatible with Iceberg.
sql
EXPORT DATA OPTIONS(
  uri='gs://bucket_name/prefix_name-*.parquet',
  format='PARQUET',
  overwrite=true
) AS
SELECT * FROM dataset_name.table_name;

The parameters in the query are:

  • EXPORT DATA: A SQL command used in Google BigQuery to export query results to a specified location, typically in a Google Cloud Storage (GCS) bucket.
  • OPTIONS: A clause that allows you to provide additional parameters to customize how to execute the export process.
  • uri: Specifies the destination path in GCS where the exported files will be stored.
  • format: The file format for the exported data. Options include CSV, JSON, AVRO, and PARQUET.
  • overwrite: Determines whether existing files in the specified GCS path should be replaced. The value should be set to true if you need to overwrite any files with the same name or prefix in the destination. Otherwise, set it to false, and the export will fail if files with the same name or prefix already exist.
  • SELECT * FROM dataset_name.table_name: A SELECT statement to extract data from a table in the BigQuery dataset.

Make sure to replace dataset_name.table_name with your dataset and table name. Also, specify the appropriate GCS URI in the uri field.

  5. Click the Run button and verify that the files have been exported to your GCS bucket successfully.

Step 2: Download Data from GCS

  1. Install and configure the gsutil command-line tool to connect with GCS.
  2. Download the exported files to your local PC or directly to a machine where Apache Iceberg is configured.
shell
gsutil cp gs://bucket_name/prefix_name-*.parquet /path/to/local/directory

Step 3: Set Up the Apache Iceberg Table

  1. Run the following code in the spark-shell to start a Spark session configured with a local Iceberg catalog. Make sure the Iceberg Spark runtime JAR is on the shell's classpath.
scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Iceberg")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/path/to/iceberg/warehouse")
  .getOrCreate()
  2. You can now create an Iceberg table using Spark SQL:
scala
spark.sql("CREATE TABLE IF NOT EXISTS local.db.table_name (id bigint, data string) USING iceberg")

Step 4: Load Data into Apache Iceberg Table

  1. Load the downloaded Parquet files into a Spark DataFrame.
scala
val parquetDf = spark.read.parquet("/path/to/local/directory/prefix_name-*.parquet")
  2. Append the data to the Iceberg table.
scala
parquetDf.write.format("iceberg").mode("append").save("local.db.table_name")
  3. Check that the data has been successfully written to the Iceberg table.
scala
spark.read.format("iceberg").load("local.db.table_name").show()

Step 5: Clean Up Resources

Once the data is in the Iceberg table, clean up the temporary files created along the way, such as the exported Parquet files in your GCS bucket and the local copies, to avoid unnecessary storage costs.

Limitations of the Manual Method

  • Lack of Real-time Integration: When exporting data from BigQuery to Iceberg, any new data added to BigQuery after the export job starts will not be included in that export. To include the new data, you would need to start another export job after the new data has been loaded. The associated delays make it challenging to maintain an up-to-date Iceberg table in sync with BigQuery.
  • One Table per Export Job: The process restricts the export of one BigQuery table per job. You cannot export data from multiple BigQuery tables in a single export job. This can complicate the migration to Iceberg if multiple tables are involved, as you will need separate export jobs for each table. 

Use Cases for BigQuery to Apache Iceberg Data Transfer

  • Advanced Analytics: Gaining insights from large datasets can be complex or costly within BigQuery alone. Replicating data to Iceberg allows you to perform complex queries and large-scale analysis on your BigQuery data to extract more detailed insights.
  • Long-term Data Retention: BigQuery is most suitable for short-term analytics and real-time queries. When transferring data from BigQuery to Iceberg tables, you can achieve long-term data retention and efficient querying for large datasets, especially for historical data analysis.
  • Cost Optimization: You can optimize storage costs by moving infrequently accessed data from BigQuery to Iceberg. Iceberg’s columnar storage, schema evolution, and partitioned tables help lower the cost of durable storage and analytics. This makes it a more effective solution than BigQuery for large datasets stored over extended periods.

Conclusion

When evaluating BigQuery vs. Iceberg for your organizational needs, consider factors like cost, performance, and workload type to determine the best solution. For petabyte-scale advanced data processing and analytics, moving to Iceberg is a better choice. This integration helps you reduce storage costs and gain deeper insights from your data than BigQuery alone provides.

This comprehensive guide highlighted two approaches to moving data from BigQuery to Iceberg. The manual method requires human intervention and running code scripts, which can be time-consuming. However, with an automated real-time integration platform like Estuary Flow, you can efficiently move the data with minimal effort.

To understand how Estuary Flow automates data integration and enhances your business productivity, feel free to connect with Estuary experts.

FAQs

Why would I need to transfer data from Google BigQuery to Apache Iceberg?

Transferring data from Google BigQuery to Apache Iceberg can be beneficial for cost-effective persistent storage, efficient big data analytics, and long-term data management.

Does BigQuery support Iceberg Tables?

Yes. BigQuery supports Iceberg tables, also known as BigQuery Iceberg tables. This integration offers a foundation for building open-format lakehouses on Google Cloud. It delivers the same fully managed experience as BigQuery while storing data in user-owned storage buckets.

How does Apache Iceberg improve performance compared to BigQuery for data storage?

Apache Iceberg improves performance by using its columnar storage format, partitioned tables, effective metadata management, and compatibility with multiple distributed processing engines. 


About the author

Jeffrey Richman

With over 15 years in data engineering, Jeffrey is a seasoned expert in driving growth for early-stage data companies, focusing on strategies that attract customers and users. His extensive writing provides insights to help companies scale efficiently and effectively in an evolving data landscape.
