Estuary

How to Easily Connect Microsoft SQL Server to BigQuery

Explore seamless integration steps for connecting Microsoft SQL Server to BigQuery effortlessly.

Picture of Jeffrey Richman
Jeffrey Richman
How to Easily Connect Microsoft SQL Server to BigQuery
Share this article

In the world of data engineering, it's common to work with multiple databases to manage and store data. Microsoft SQL Server and Google BigQuery are two examples of popular databases that offer unique features and benefits. While SQL Server is an on-premises database management system, BigQuery is a cloud-based data warehouse that provides more scalability and better performance.

In many cases, you may need to connect these two databases to integrate or migrate data from SQL Server to BigQuery, or vice versa. However, connecting these databases can be challenging and requires technical expertise. So, to ease up the process, this article provides you with a step-by-step guide on how you can connect Microsoft SQL Server to Google BigQuery rather seamlessly.

First, you will explore the key features of Microsoft SQL Server and Google BigQuery. Thereafter, you will dive into the specifics of connecting them. So keep reading!

What is Microsoft SQL Server?

Microsoft SQL Server is a relational database management system (RDBMS) developed by Microsoft Corporation. It is a powerful database that provides a variety of features for managing data, including data storage, retrieval, and analysis. SQL Server is widely used in enterprise applications and is known for its high-performance capabilities.

Some of the key features of Microsoft SQL Server include:

  • Robust security features to protect sensitive data. 
  • Support for multiple programming languages, including T-SQL, .NET, and Java. 
  • Integration with Microsoft Office and other Microsoft products.
  • Scalability and high availability through clustering and replication. 
  • Tools for data analysis and business intelligence, such as SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS)

With its advanced features and capabilities, Microsoft SQL Server is a popular choice for data engineers and developers who work with complex data structures and require a high level of data security. Yet, BigQuery has some advantages worth exploring. Read on to the next section. 

What is BigQuery?

Google BigQuery is a cloud-based data warehouse that provides a fully managed and scalable solution for storing and analyzing large datasets. It is designed for businesses that need to process vast amounts of data quickly and efficiently. BigQuery is built on top of the Google Cloud Platform, which provides a secure and reliable infrastructure for data storage and processing.

Some of the key features of BigQuery include:

  1. Scalability: BigQuery can handle petabytes of data and can process billions of rows in seconds.
  2. Speed: Queries are executed quickly due to the distributed architecture of BigQuery.
  3. Security: Data is encrypted both at rest and in transit, and access controls can be configured to restrict who can view and modify data.
  4. Integration: BigQuery integrates with other Google Cloud Platform services, such as Google Cloud Storage and Google Data Studio.
  5. SQL support: BigQuery supports standard SQL and has built-in functions for working with JSON and nested data.

BigQuery has scalability, speed, and security features that make it a popular choice for data engineers who need to process and analyze large datasets quickly and efficiently. You might be wondering what advantages BigQuery has over Microsoft SQL Server. Well, the next section is about that.

Advantages of BigQuery Over SQL Server

While SQL Server is a powerful database management system, BigQuery has some unique advantages that make it an attractive option for data engineers. Here are some of the advantages of BigQuery over SQL Server:

Scalability: BigQuery is designed to handle large datasets and can scale seamlessly as data volumes increase. SQL Servers, on the other hand, may require additional hardware or configuration changes to accommodate large datasets.

Cost: BigQuery is a cloud-based service that is billed based on usage—it is more cost-effective than managing on-premises SQL Server hardware and software.

Performance: BigQuery uses a distributed architecture that can process queries faster than SQL Server in many cases, especially for large datasets.

Flexibility: BigQuery supports standard SQL and has built-in functions for working with JSON and nested data. It is more flexible than SQL Server in these cases.

Integration: BigQuery integrates with other Google Cloud Platform services, such as Google Cloud Storage and Google Data Studio, which can streamline data engineering workflows.

While both SQL Server and BigQuery have their strengths and weaknesses, understanding the unique advantages of each can help you make informed decisions about which database to use for any specific use case. At this point, the question you most probably have in your mind is, how can you connect SQL Server to BigQuery? This is answered next.

How to Connect SQL Server to BigQuery

Connecting SQL Server to BigQuery requires several steps. Here's a high-level overview of the process:

  • Step 1: Set up a Google Cloud Platform account and create a new project. 
  • Step 2: Enable the BigQuery API in your Google Cloud Platform project.
  • Step 3: Create a service account with the necessary permissions to access your BigQuery data.
  • Step 4: Generate a private key for the service account and download it in JSON format.
  • Step 5: Install the Google Cloud SDK on your SQL Server machine.
  • Step 6: Use the Google Cloud SDK to authenticate your service account and configure the SDK with your project ID.
  • Step 7: Install the BigQuery ODBC driver on your SQL Server machine.
  • Step 8: Configure an ODBC data source for BigQuery in SQL Server.
  • Step 9: Test the connection to ensure that SQL Server can communicate with BigQuery.

Each of these steps requires some configuration and set-up that is rather taxing to execute, but once complete, you'll be able to use SQL Server to query and analyze data stored in BigQuery. For more detailed instructions and step-by-step guides, consult the official Google Cloud Platform documentation

Mind you, there are other methods and technologies to connect SQL Server to BigQuery, which you will learn a bit later. But before then, another question you might have is, why do you need to connect SQL Server to BigQuery? The next section will explore some common use cases for connecting SQL Server to BigQuery 

Common Use Cases for Connecting SQL Server to BigQuery

Connecting SQL Server to BigQuery can enable data engineers to take advantage of the unique strengths of each database system. For example:

Offloading data: SQL Server may not be optimized for processing large datasets, but with BigQuery, you can offload large datasets to the cloud for faster processing and analysis.

Data migration: If your organization is transitioning from SQL Server to BigQuery, connecting the two databases can facilitate a smooth migration process.

Data federation: If your organization has data stored in multiple locations, connecting SQL Server to BigQuery can enable federated queries that span both systems.

Data warehousing: BigQuery is a popular option for building data warehouses, and connecting SQL Server to BigQuery can enable you to build hybrid data warehouse solutions that take advantage of both databases.

Data integration: Connecting SQL Server to BigQuery enables you to integrate data from disparate sources and perform complex data transformations and joins.

By connecting SQL Server to BigQuery, every data engineer can leverage the unique strengths of each database system to build powerful data engineering solutions. 

Nevertheless, can this connection get easier? There are a couple of technologies that ensure the process is effectively easy—the next section takes you through them, so keep reading. 

Other Methods for SQL Server to BigQuery Integration

There are several technologies that you can use to connect Microsoft SQL Server to Google BigQuery. Here are some popular options:

  • BigQuery ODBC Driver: The BigQuery ODBC driver is a popular option for connecting SQL Server to BigQuery. It allows you to configure an ODBC data source on your SQL Server machine that connects to BigQuery over the internet. The driver supports standard SQL-92 syntax and can be used with most ODBC-compliant applications, including SQL Server Management Studio.
  • BigQuery Connector for SQL Server Integration Services (SSIS): If you're using SQL Server Integration Services (SSIS) to extract, transform, and load (ETL) data, the BigQuery Connector for SSIS allows you to connect to BigQuery directly from SSIS. It provides a set of SSIS components that enable you to perform data transfers, transformations, and validations between SQL Server and BigQuery.
  • Cloud Data Fusion: Cloud Data Fusion is a fully-managed, cloud-native data integration service that enables you to build and manage ETL pipelines for ingesting and processing data from various sources, including SQL Server and BigQuery. It provides you with a graphical interface for designing, deploying, and monitoring data pipelines, as well as pre-built connectors for various data sources and sinks.
  • Apache Beam: Apache Beam is an open-source, unified programming model that you can use to build batch and streaming data processing pipelines. It provides you with a set of language-specific SDKs for building data processing pipelines in Java, Python, Go, and other languages. You can use Apache Beam to build data pipelines that connect SQL Server to BigQuery and perform complex data transformations and aggregations.
  • Estuary Flow: Estuary Flow is a powerful technology that allows you to connect SQL Server to BigQuery in a breeze. It is a cloud-native data integration platform that you can use to move data between various sources and targets seamlessly, including from SQL Server to BigQuery and vice versa. 

Estuary Flow equips you with a simple, intuitive interface for configuring data pipelines, and it can handle both batch and real-time data integration. Additionally, Estuary Flow provides you with advanced capabilities such as data mapping, transformation, and enrichment, making it an ideal choice for complex data integration scenarios. 

With Estuary Flow, you can quickly and efficiently create data pipelines that move data from SQL Server to BigQuery and vice versa, to help streamline data integration workflows and improve overall data quality.

Nevertheless, each of the technologies has its strengths and weaknesses, though. So it's important to evaluate each option based on your specific use case and requirements. Therefore, the next section dedicates itself to examining some challenges that come with each option and the solution to them.

How to set up Estuary Flow for SQL Server to BigQuery

Step 1. Configure SQL Server for CDC

In SQL Server, change data capture utilizes the SQL Server Agent to log insertions, updates, and deletions occurring in a table. So, it makes these data changes accessible to be easily consumed using a relational format.

There are some prerequisites required to enable CDC in SQL Server. You can use this init.sql script to automate the creation of these objects, just make sure to update the variables before executing it.

plaintext
USE <database>; -- Enable CDC for the database. EXEC sys.sp_cdc_enable_db; -- Create user and password for use with the connector. CREATE LOGIN flow_capture WITH PASSWORD = 'secret'; CREATE USER flow_capture FOR LOGIN flow_capture; -- Grant the user permissions on the CDC schema and schemas with data. -- This assumes all tables to be captured are in the default schema, `dbo`. -- Add similar queries for any other schemas that contain tables you want to capture. GRANT SELECT ON SCHEMA :: dbo TO flow_capture; GRANT SELECT ON SCHEMA :: cdc TO flow_capture; -- Create the watermarks table and grant permissions. CREATE TABLE dbo.flow_watermarks(slot INTEGER PRIMARY KEY, watermark TEXT); GRANT SELECT, INSERT, UPDATE ON dbo.flow_watermarks TO flow_capture; -- Enable CDC on tables. The below query enables CDC the watermarks table ONLY. -- You should add similar query for all other tables you intend to capture. EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'flow_watermarks', @role_name = 'flow_capture';

The init.sql script configures your database with Change Data Capture (CDC) functionality enabled. It also creates a login and user named flow_capture with the necessary permissions. The script also creates a table named flow_watermarks, which is an internal table used by Flow to keep track of the state of the ongoing replication.

Step 2. Create SQL Server Capture in Estuary Flow

Head over to the Estuary Flow dashboard and create a new SQL Server capture.

create a new SQL Server capture.png

During the endpoint configuration, fill out the connection details according to your SQL Server environment. After pressing next, in the following section, you can configure how the incoming data should be represented in Flow as collections.

Captures run continuously: as soon as new documents are made available at the endpoint resources, Flow validates their schema and adds them to the appropriate collection.

Estuary Flow writes all change data into collections, which are append-only durable logs similar to a WAL. Like replication, Estuary Flow transactionally guarantees change data, including the modified chunks. 

Collections are a real-time data lake. Documents in collections are stored indefinitely in your cloud storage bucket (or may be managed with your regular bucket lifecycle policies). This means that the full historical content of a collection is available to support future data operations and perform backfills without going back to the source.

Step 3. Configure BigQuery Materialization

As this is an ELT dataflow, there’s no need to do any transformations before the data lands in BigQuery. Head over to the Destinations page to create a new materialization and search for the BigQuery connector.

Configure BigQuery Materialization.png

To use this connector, you'll need:

  • A new Google Cloud Storage bucket in the same region as the BigQuery destination dataset.
  • A Google Cloud service account with a key file generated and the following roles:
    • roles/bigquery.dataEditor on the destination dataset
    • roles/bigquery.jobUser on the project with which the BigQuery destination dataset is associated
    • roles/storage.objectAdmin on the GCS bucket created above

To configure your service account, complete the steps available in the connector documentation. After all the details are in place, take a look at the Update Delay parameter.

The Update Delay parameter in Estuary materializations offers a flexible approach to data ingestion scheduling. It represents the amount of time the system will wait before it begins materializing the latest data.

For example, if an update delay is set to 2 hours, the materialization task will pause for 2 hours before processing the latest available data. This delay ensures that data is not pulled in immediately after it becomes available. For this tutorial, let’s set it to 0s, as in zero seconds, in order to demonstrate the flow in real-time.

Press Next, then Save and Publish in the top right corner to provision the materialization connector and data from SQL will start to be replicated into BigQuery immediately with a complete backfill.

update delay parameter.png

Estuary Flow vs other technologies for SQL Server to BigQuery connection

While the aforementioned technologies provide solid options for connecting Microsoft SQL Server to Google BigQuery, Estuary Flow offers several advantages that make it an attractive choice for the majority of data engineers.

BigQuery ODBC Driver

BigQuery ODBC driver is a popular choice for connecting SQL Server to BigQuery. However, it has several limitations that Estuary Flow can overcome. For example, configuring the driver can be complex and time-consuming, and it may not be compatible with all ODBC-compliant applications. Additionally, the driver can be slow when transferring large volumes of data between SQL Server and BigQuery.

Estuary Flow addresses these limitations by providing you with a simple, user-friendly interface for configuring the connection between SQL Server and BigQuery. Its built-in connectors for SQL Server and BigQuery eliminate the need for manual configuration and ensure compatibility with both platforms. Estuary Flow also uses optimized transfer protocols to maximize the speed of data transfer between the two platforms, ensuring that large volumes of data can be transferred quickly and efficiently.

BigQuery Connector for SQL Server Integration Services (SSIS)

The BigQuery Connector for SSIS is a good choice if you are using SSIS for ETL tasks. However, it can be complex and time-wasting to set up and configure—sometimes it may not even be compatible with all versions of SQL Server and SSIS.

Estuary Flow simplifies the process of connecting SQL Server to BigQuery by providing you with a drag-and-drop interface for building data pipelines. Its visual pipeline builder allows you to quickly and easily create data integration workflows that move data between SQL Server and BigQuery, without the need for complex scripting or manual configuration.

Cloud Data Fusion

While Cloud Data Fusion is a powerful tool for building and managing data integration pipelines in the cloud, it can be expensive and may not be suitable for smaller-scale data integration tasks.

Estuary Flow offers a cost-effective and flexible alternative to Cloud Data Fusion. Its pay-as-you-go pricing model means that you only pay for the resources you use. Hence it is cost-effective for smaller-scale data integration tasks. Additionally, Estuary Flow has a user-friendly interface and built-in connectors for SQL Server and BigQuery. They allow you to set up and configure data integration pipelines without the need for specialized expertise.

Apache Beam

Apache Beam is a robust open-source tool for building data processing pipelines. But it requires a significant amount of development effort to set up and configure, and may not be suitable for you if you aren't comfortable with programming.

For this, Estuary Flow provides a no-code alternative to Apache Beam for data engineers who want to build data integration pipelines between SQL Server and BigQuery. Its drag-and-drop interface and visual pipeline builder enable you to create data integration workflows smoothly without the need for specialized programming expertise.

Key Takeaways

So far you have learned that connecting Microsoft SQL Server to Google BigQuery,  albeit complex and time-consuming, can be made seamless with the right tools and technologies.

Before that, you explored both Microsoft SQL Server and Google BigQuery–what each is and their various strengths. You equally answered the question, "Why connect SQL servers to BigQuery," and also explored some use cases for such a connection, including offloading data, data migration, data federation, data warehousing, and data integration.

Then you went on to understand the advantages and disadvantages of each method and technology for SQL Server to BigQuery connection and how you can make informed decisions about the best approach to employ for any specific use case. Additionally, you learned how to leverage Estuary Flow to streamline and automate data integration workflows–to free up valuable time and resources for other critical tasks. 

You realized that whether you're looking to migrate data from SQL Server to BigQuery, or simply looking to integrate data between the two platforms, there are many powerful tools and techniques available to help you achieve your goals.

And to ensure that your data integration workflows are efficient, reliable, and scalable, both now and in the future, you must stay up to date with the latest developments in data integration technology and best practices. 

Estuary Flow has a growing community of data engineers and developers who share their experiences on Slack—you too are encouraged to share your own experience of the Estuary Flow platform on the public channel. If command-line interfaces tickle your fancy more, Estuary Flow documentation has lots of information on that, too.

Start streaming your data for free

Build a Pipeline

Author

Author's Avatar
Jeffrey Richman

Popular Articles

Streaming Pipelines.
Simple to Deploy.
Simply Priced.
$0.50/GB of data moved + $.14/connector/hour;
50% less than competing ETL/ELT solutions;
<100ms latency on streaming sinks/sources.