About connector stories
Estuary maintains a repository of open-source data connectors. These connectors are dual-licensed under Apache 2.0 and MIT and follow an open-source standard. You can read more about our philosophy on connectors here.
Though a wide community of developers will be essential to the repository’s growth, we’re kicking things off on our own development team by building the connectors we find most critical for integrating the modern data stack. In this series, we talk to those developers about their process and the insights they’ve gained.
Spotlight on Estuary’s Snowflake and BigQuery connectors
Snowflake and Google BigQuery are undoubtedly two of the biggest names in cloud data warehousing today. For modern organizations across industries, data warehouses play a non-negotiable role in the data stack. They provide a type of scalable data storage that’s ideally suited to power analytical workflows. In addition to being great environments for exploring large datasets, warehouses can be integrated with other analytical tools to operationalize your data.
Both BigQuery and Snowflake are known for their serverless design and relative ease of use. Both have particular strengths and weaknesses, which are outside the scope of this article (though many articles on that subject exist; here’s a good one from Poplin). Regardless of which warehouse you use, integrating it fully with your data pipeline is critical. That’s why Estuary prioritized building Snowflake and BigQuery materialization connectors.
Today we’re talking with the two senior software engineers who built them to get an inside look at their process.
Flow drives connectors differently than other data integration platforms. In light of that, how does building materialization connectors for Flow differ from these other systems?
Phil (builder of Snowflake materialization):
There are a few specific things that Flow does differently. Firstly, Flow doesn’t just export data, but instead materializes it. That means that existing database rows will be updated, whereas an export will only insert new rows. This means that Flow materialization connectors like `materialize-snowflake` need to load records from the database so that Flow can reduce them with new data that’s been added to the collection.
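To make that load-then-reduce cycle concrete, here is a minimal in-memory sketch. The `Row` type, the last-write-wins merge, and all names are illustrative assumptions for the example, not the actual `materialize-snowflake` implementation; real reductions follow a collection's reduction annotations rather than a simple field merge.

```go
package main

import "fmt"

// Row is a simplified stand-in for a materialized database row.
type Row map[string]any

// materialize sketches the cycle described above: for each incoming
// document, the connector first loads any existing row for that key,
// reduces (merges) the new fields into it, and stores the result back.
func materialize(table map[string]Row, key string, doc Row) {
	existing, ok := table[key] // "load" phase
	if !ok {
		existing = Row{}
	}
	for k, v := range doc { // "reduce": a toy last-write-wins merge
		existing[k] = v
	}
	table[key] = existing // "store" phase: an update, not a fresh insert
}

func main() {
	table := map[string]Row{}
	materialize(table, "user-1", Row{"name": "Ada"})
	materialize(table, "user-1", Row{"visits": 3})
	fmt.Println(table["user-1"]["name"], table["user-1"]["visits"]) // Ada 3
}
```

The key contrast with a plain export is the load step: a second document for `user-1` updates the existing row instead of inserting a new one.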
Another big difference is that Flow is a transactional system with end-to-end exactly-once guarantees. This is actually required in order to ensure the correctness of updates when errors are encountered. But it’s also totally awesome because it enables Flow to be used in a huge range of scenarios where typical ETL and ELT tools can’t. It also means that Flow’s materialization protocol needs to support transactions, while still supporting materializing to systems with weaker guarantees. This makes Flow’s materialization connector protocol somewhat more complex than other connector protocols out there. To help tame that complexity, we built some common abstractions to help materialize to SQL databases.
Lastly, Flow of course does everything in real time. You don’t schedule export jobs like you would with other systems. Instead, data is inserted or updated in the database incrementally, as soon as it’s committed in Flow. That’s fantastic, but it also presents some special challenges when working with cloud data warehouses like Snowflake or BigQuery. OLTP databases like PostgreSQL have absolutely no problem processing high volumes of insert and update statements with low latency. But Snowflake, like other OLAP systems, wasn’t built for that. To get reasonable throughput on updates, you need to chunk things up into pretty large transactions. For example, materializing a high-volume Flow collection into Snowflake will often use transactions with hundreds of thousands of rows.
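As a rough illustration of that batching, here is a hedged Go sketch of a buffer that flushes many rows as a single large operation. The threshold, row type, and flush behavior are invented for the example; a real connector commits at Flow's transaction boundaries, which naturally group many documents together.

```go
package main

import "fmt"

// batcher accumulates rows and flushes them together once a size
// threshold is reached, trading latency for far fewer round-trips.
type batcher struct {
	rows    []string
	maxRows int
	flushed int // number of flush (i.e. bulk-statement) round-trips
}

func (b *batcher) add(row string) {
	b.rows = append(b.rows, row)
	if len(b.rows) >= b.maxRows {
		b.flush()
	}
}

func (b *batcher) flush() {
	if len(b.rows) == 0 {
		return
	}
	// In a real connector this would execute one bulk INSERT or MERGE.
	b.flushed++
	b.rows = b.rows[:0]
}

func main() {
	b := &batcher{maxRows: 100000}
	for i := 0; i < 250000; i++ {
		b.add(fmt.Sprintf("row-%d", i))
	}
	b.flush()              // flush the remainder
	fmt.Println(b.flushed) // 3 round-trips instead of 250,000
}
```

With per-statement overhead dominating in an OLAP warehouse, collapsing 250,000 rows into three bulk operations is the difference between keeping up and falling hopelessly behind.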
All that awesomeness does mean that Flow materialization connectors are a little different from typical batch-oriented “export” connectors. It’s a bit of a conceptual hurdle for a connector developer, but it’s still possible for a full-featured SQL materialization connector to be implemented in roughly 600 lines of Go code.
Though they’re similar technologies on a surface level, Snowflake and BigQuery each have their own idiosyncrasies. How did those come up as you built the connectors?
Zach (builder of BigQuery materialization):
An interesting part of BigQuery development and usage is its large-scale nature. As Phil alluded to, BigQuery isn't really a tool you'd use to back any sort of application requiring real-time responsiveness. It's designed to handle very large datasets, and setting up and running any operation incurs a minimum overhead, so to speak, which is not a trivial amount of time.
The interesting bit is the scalability, though. While an operation over, say, tens of thousands of rows might take ten seconds to complete, an operation over orders of magnitude more data might not take noticeably longer. As long as you're mindful of this (and patient), you can churn through massive amounts of data in relatively little time compared to a typical database.
SQL systems are all like this, in my experience. I’ve built connectors for lots of databases, and I can confirm that there is no such thing as “standard SQL” in reality. The dialects are all different, and I’ve learned to approach every new database with a healthy amount of skepticism and distrust, even the really good ones.
Snowflake’s biggest surprise (so far!) is its `PRIMARY KEY` keyword. Their docs will tell you that Snowflake does not ensure that primary keys are actually unique. It will happily let you insert any number of rows all with the same primary key, and let you query all of them. Uniqueness constraints must be enforced by the application. (If you’re using Estuary’s Snowflake connector, we do this for you.) But that’s not what I found so surprising. What really surprised me was that Snowflake does automatically add (and enforce) a `NOT NULL` constraint on every `PRIMARY KEY` column.
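To illustrate what "enforced by the application" can look like, here is a small, hypothetical Go sketch that deduplicates rows on their declared primary key before they would be inserted. The `row` type and the last-write-wins policy are assumptions for the example, not the connector's actual logic.

```go
package main

import "fmt"

// row models a record headed for a Snowflake table whose PRIMARY KEY
// is declared but, as noted above, not enforced by Snowflake itself.
type row struct {
	ID    string // the declared (but unenforced) primary key
	Value string
}

// dedupeByKey keeps one row per key, letting a later row replace an
// earlier duplicate, so the batch we insert is unique on ID.
func dedupeByKey(rows []row) []row {
	index := map[string]int{}
	var out []row
	for _, r := range rows {
		if i, ok := index[r.ID]; ok {
			out[i] = r // replace the earlier duplicate in place
			continue
		}
		index[r.ID] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []row{{"a", "v1"}, {"b", "v1"}, {"a", "v2"}}
	fmt.Println(dedupeByKey(rows)) // [{a v2} {b v1}]
}
```

Without a step like this, Snowflake would happily store both `{"a", "v1"}` and `{"a", "v2"}` under the same primary key.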
In a similar vein, what are some headaches other engineers might avoid by using these connectors in their pipelines, or problems they might solve?
One of the coolest things about Flow is that it provides end-to-end exactly-once processing guarantees out of the box, without making you jump through hoops. This allows Flow to materialize views that are fully correct, consistent, and fault-tolerant, which means you can build extremely robust analytics pipelines by default. You don't need to do anything special to opt into transactional guarantees and fault tolerance; it's just the default.
Another really nice thing Flow gives you is the ability to backfill data. This is surprisingly useful! For example, you could start out materializing two separate collections into two separate Snowflake tables, then later change the catalog to join the data in Flow and materialize it as a single table, and the new table will contain all the same data. It also lets you materialize the same data into several different systems simultaneously, so you can always use whichever system best serves your use case. You could, for instance, take the same collection, materialize it to both BigQuery and Snowflake, and see which gives you the best balance of price and performance.
The whole Flow ecosystem removes much of the headache of building custom ETL, or paying for other expensive tools, to get data to many different places. Being able to maintain it all as a GitOps-style catalog also takes a lot of the pain out of moving large amounts of data: you describe what you want to happen, apply it, and let it do its thing.
Any big surprises or lessons you took away from this project? Advice for engineers hoping to build real-time connectors in the Flow ecosystem?
These systems aren’t magically fast. They just have designs that tend to balance the tradeoffs of storing lots of data and allowing you to perform ad-hoc queries with reasonable efficiency. Any database can perform poorly under some circumstances, and even the most “scalable” systems have their limits. Snowflake, like other cloud data warehouses, is not super efficient at updates, and it took us a while and a couple of tries to find the right approach for updating data efficiently. There’s still some more work we could do there to improve update performance.
I think something to be wary of when using BigQuery is that it isn't really designed to be updated a lot; it's built from the ground up to be append-biased. So be mindful of that when materializing data into BigQuery. The connector supports a mode where everything is kept in sync using updates, but it also supports an append-only delta mode. I'd suggest users understand the difference and, given their use case, pick the method that best fits what they want to accomplish. The update mode can get expensive, since an update can require processing the entire table's worth of data for every update operation; for very large tables this might become cost-prohibitive.
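To make that contrast concrete, here is a toy Go sketch of the two modes. The type names and merge policy are illustrative assumptions only, and real BigQuery costs depend on how the warehouse executes merges.

```go
package main

import "fmt"

// doc is a stand-in for a Flow document with a key and a value.
type doc struct {
	Key   string
	Value int
}

// applyUpdateMode merges each document into the table by key, keeping
// one row per key. In BigQuery, a merge like this can mean
// reprocessing the whole table, which is where the cost comes from.
func applyUpdateMode(table map[string]int, docs []doc) {
	for _, d := range docs {
		table[d.Key] = d.Value
	}
}

// applyDeltaMode simply appends every document, leaving any reduction
// to downstream queries. Appends are the cheap path in BigQuery.
func applyDeltaMode(log []doc, docs []doc) []doc {
	return append(log, docs...)
}

func main() {
	docs := []doc{{"a", 1}, {"a", 2}, {"b", 5}}

	table := map[string]int{}
	applyUpdateMode(table, docs)
	fmt.Println(len(table)) // 2 rows: the latest value per key

	var log []doc
	log = applyDeltaMode(log, docs)
	fmt.Println(len(log)) // 3 rows: the full history of documents
}
```

Update mode gives you a query-ready, one-row-per-key view at higher write cost; delta mode writes cheaply but pushes reduction work onto whoever reads the table.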
How is building a connector for cloud data warehouses different from operational databases like PostgreSQL?
As I mentioned earlier, cloud data warehouses just aren't built for low-latency processing of a ton of updates. There are a few reasons for this. One is that there's a significant per-statement overhead: updating a single row in a big table might take several seconds, while updating several thousand rows might not take noticeably longer. So we can't use the naive approach of executing a separate `SELECT` and `UPDATE` statement for each Flow document; there's no way it would keep up with a high-volume collection.
Instead, we try to batch things up into as few expensive operations as possible. For example, every `SELECT` statement requires a table scan in Snowflake, since there are no indexes. So instead of executing a separate `SELECT` for each key we want to load, we insert all the keys into a temporary table. Then, once we know all the keys that will participate in the transaction, we execute a single query that joins the keys in the temp table to the actual data. Rather than performing a separate table scan per key, Snowflake does one table scan that returns all the rows for that transaction.
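As a rough sketch of that temp-table pattern, the following hypothetical Go helper builds the two statements involved. The table and column names (`load_keys_tmp`, `key`, `document`) and the `?` placeholders are invented for illustration and don't reflect the connector's real SQL.

```go
package main

import (
	"fmt"
	"strings"
)

// buildLoadQuery returns two statements: a bulk insert of all the
// transaction's keys into a temporary table, and a single join that
// loads every matching row with one table scan instead of one scan
// per key.
func buildLoadQuery(targetTable string, keys []string) (insertKeys, load string) {
	placeholders := make([]string, len(keys))
	for i := range keys {
		placeholders[i] = "(?)"
	}
	insertKeys = fmt.Sprintf(
		"INSERT INTO load_keys_tmp (key) VALUES %s;",
		strings.Join(placeholders, ", "))
	load = fmt.Sprintf(
		"SELECT t.key, t.document FROM %s AS t JOIN load_keys_tmp AS k ON t.key = k.key;",
		targetTable)
	return insertKeys, load
}

func main() {
	ins, load := buildLoadQuery("my_table", []string{"k1", "k2", "k3"})
	fmt.Println(ins)
	fmt.Println(load)
}
```

However many keys the transaction touches, the load cost stays at one bulk insert plus one scan-and-join, which is exactly the shape of work an OLAP engine handles well.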