Organizations are generating large amounts of data to understand their customers and business better. As a result, there’s a growing demand for data platforms that can handle and process large volumes of data. And when the platform can provide real-time insights, it’s an added advantage. Snowflake is one such cloud-based platform with increasing popularity. It has the ability to handle massive amounts of data and provide near-instantaneous query results. Snowflake CDC is a crucial part of Snowflake that helps capture and stream real-time data changes from various sources into a central location.
This article will explain all about Change Data Capture, Snowflake, and data streams. You’ll also see how to set up Snowflake CDC with streams.
We’ll wrap up with some pointers for setting up CDC pipelines into Snowflake from external systems.
What is Snowflake?
Snowflake is a fully-managed cloud-based data warehousing platform that provides a Platform-as-a-Service (PaaS) to store, manage, and analyze large amounts of data. The Snowflake Data Warehouse is designed around a proprietary SQL Engine and a cloud-optimized architecture.
You can use Snowflake only to deploy on public cloud infrastructure like AWS, Azure, and GCP, not as a private cloud-hosted or on-premises solution. However, since Snowflake is a fully-managed service, you won’t need any hardware, software, or in-house teams to install, configure, manage, and maintain data infrastructure. Any upgrade, management, or maintenance is handled by Snowflake.
Key Features of Snowflake
Here are some of the key features of Snowflake that make it among the most popular data warehousing solutions in the market:
- Scalability: Snowflake separates its storage and compute capabilities for better performance, allowing you to scale each component independently. This means you can pay only for the services used.
- Semi-Structured Data Support: Unlike traditional warehouses that handle structured data, Snowflake has the ability to handle semi-structured and unstructured data. It can handle a wide variety of data types, including Parquet, Avro, JSON, and more.
- Security: Snowflake offers several security features that help you keep your data safe and secure. These include multi-factor authentication, role-based access control, OAuth, and more.
- Standard and Extended SQL Support: Snowflake provides support for most of the SQL DDL and DML statements for querying data. This means that apart from the standard Insert, Update, and Delete statements, it also supports transactions, stored procedures, and statistical aggregate functions.
- Database Replication and Failover: You can replicate a Snowflake database and sync across multiple Snowflake accounts in different regions. You can also configure databases for failover to specific Snowflake accounts. This helps ensure business continuity and enhances disaster recovery.
What is Change Data Capture (CDC)?
Change Data Capture (CDC) is a process that captures data changes in a database and delivers them to other systems in real time. CDC is among the best ways to enable real-time data migration and replication. CDC helps maintain accuracy and consistency across all systems.
CDC works by capturing changes made to the database at the transaction level. Whenever a change is made to a table, CDC captures the change from the database’s transaction or write-ahead log. This change table contains the data before and after the change, along with metadata like transaction ID, timestamp, etc. The change data is then delivered to other systems in real time.
To be able to perform efficient data analytics, you require access to real-time data streams. CDC helps implement this by providing near real-time movement of data. It processes the data as soon as new database events occur. With CDC, you can send the latest data updates to BI tools and also keep team members up to date in near real-time.
What is Snowflake CDC?
Snowflake executes CDC with the use of streams. Snowflake CDC is based on log-based data replication. It captures changes in a source database and propagates them to a target database using a change stream. A stream takes logical snapshots of a source object, like external tables, tables, or views. Streams can also record the information of DML changes to source objects in Snowflake. The change stream is a log of all the source database modifications, including updates, inserts, and deletes. As new changes occur, the change stream is continuously updated.
The Snowflake CDC process begins with the creation of a CDC-enabled table in the source database. This table contains all the data that needs to be replicated and is configured to capture changes using Snowflake CDC. After the table is configured, any changes made to it are captured automatically and sent to the target database in real time.
After the creation of a stream for a table, additional columns are added to the source table. These new columns store metadata that tracks data changes in the source table. Any data changes tracked by a stream must be consumed or moved to permanent storage within the set retention period. If it doesn’t, the changes captured by the stream will no longer be accessible. And a new stream will be created to track changes from that point forward.
What are Snowflake Streams?
A Snowflake stream, also known as a table stream, is a feature of the Snowflake data warehouse platform. It allows you to capture and track changes in data sources in real time. A Snowflake stream tracks all DML changes made to the source table rows and stores the metadata of each change.
The stream takes an initial snapshot of every row of the source object, like external tables, tables, or underlying tables. It initializes an offset, a point in time, as the object’s current transactional version. The stream’s change tracking system records DML-change information after this snapshot is taken. These change records reflect the state of a row before and after the change. Information about the change will involve the source object’s column structure and additional metadata columns for describing each change event.
A stream itself doesn’t contain any table data. It only stores the offset of the source object and leverages the source object’s version history to return CDC records. The additional columns providing the metadata include:
METADATA$ACTION: This column’s value indicates the DML operation (INSERT, DELETE) that was recorded. The UPDATE statement is represented by two row entries for INSERT and DELETE.
METADATA$ISUPDATE: This column value indicates if the row entry was in an UPDATE statement. If the entries were part of an UPDATE, the value is TRUE, else it’s FALSE.
METADATA$ROW_ID: This is the unique and immutable row ID that can be used to track changes to particular rows over time.
Types of Snowflake Streams
There are three stream types in Snowflake, based on the metadata recorded by each. Here are the three types of Snowflake streams:
- Standard: This type of stream is supported for streams on views, tables, and directory tables. Also known as a delta stream, it tracks all DML changes in the source object. This includes tracking updates, inserts, deletes, and table truncates. By performing a join on the deleted and inserted rows in a change set, it returns the net change in a row called row-level delta.
If you insert a row and then delete it between two transactional points of time, the standard stream won’t return it as part of a query.
- Append-only: The append-only stream type is supported for streams on directory tables, standard tables, and views. It only tracks row inserts, not updates or deletes.
For example, let’s say you insert ten rows in a table but delete five of those rows. Any query to an append-only stream will return all ten rows. Since append-only streams return only the appended rows, they consume fewer resources. Hence, they’re more performant than standard streams, especially for ELT and similar scenarios that exclusively depend on row inserts.
- Insert-only: The insert-only streams are supported for streams on external tables only. Similar to the append-only streams, these streams also track row inserts only, not deletes.
Consider an external table that references files on cloud storage locations. File 1 is removed from the referenced location in between any two offsets, and File 2 is added. Then, the stream will return the rows of records of File 2. Unlike in standard tables, Snowflake can’t access historical records for files in cloud storage. Moreover, appends may not trigger auto-refresh of the external table metadata.
Setting Up Snowflake Change Data Capture With Streams
To get started with Change Data Capture Snowflake, you must first log in to Snowflake Web-based UI or Snow SQL. Here’s an example of creating a table to store the names and fees paid by some gym members.
Note: Snowflake Find and Replace is an essential Snowflake function. It’s a column transformation function that replaces substrings. The REPLACE function works only on columns or values that are a string datatype.
Step 1: You must create a database in Snowflake. Run the following command to do this:
plaintextcreate or replace database CDC_DEMO; Use CDC_DEMO;
Step 2: You’ll need to create a source table. You can create a source table, members, by running the following command:
plaintextcreate or replace table members ( id number(8) NOT NULL, name varchar(255) default NULL, fee number(3) NULL );
Step 3: Now, you must create a stream to track the changes of date in the
members table. This stream will track changes made to the table.
plaintextcreate or replace stream member_check on table members;
Step 4: Create a signup table that stores the dates when the gym members joined:
plaintextcreate or replace table signup ( id number(8), dt date );
Step 5: Populate the source table with dummy data using the command:
plaintextinsert into members (id, name, fee) values (1,’Joe’,0), (2,’Jane’, 0), (3,’George’,0), (4,’Betty’,0), (5,’Sally’,0);
Similarly, populate the signup table with values:
plaintextinsert into signup values (1,’2018-01-01’), (2,’2018-02-15), (3,’2018-05-01), (4,’2018-07-16), (5,’2018-08-21);
Step 6: The stream will record the inserted rows. To view the change log,
plaintextselect * from member_check;
Step 7: Apply a $90 fee to those members who joined the gym after the end of the free trial period:
plaintextmerge into members m using ( select id, dt from signup s where datadiff (day, ‘2018-08-15’::date, s.dt::date) < -30) s on m.id = s.id when matched then update set m.fee = 90;
To view the result:
select * from members;
As you can see, the stream records the updated FEE column with inserts instead of updates indicated by delete and insert entries. This is because the stream contents haven’t been consumed yet.
With the following command, you’ll be able to see the metadata columns:
plaintextselect * from member_check;
Step 8: Next, create a production table that will contain the details of the members.
plaintextcreate or replace table members_prod ( id number(8) not NULL, name varchar(255) default NULL, fee number(3) NULL );
Populate the production table with the first stream data batch.
plaintextinsert into members_prod(id,name,fee) select id, name, fee from member_check where METADATA$ACTION = ‘INSERT’;
The stream position advances.
plaintextselect * from member_check;
To access and lock the stream, use
Step 9: To increase the paying members’ fee,
plaintextupdate members set fee = fee + 15 where fee > 0;
These changes won’t be visible since the stream object’s change interval is from the current offset to the current point of time of the transaction. It means the beginning time of the transaction.
plaintextselect * from member_check;
Commit the changes with
Now, the changes will surface since the stream object uses the current transactional time as the end point of the change interval. The changes are also included in the source table.
select * from member_check
What About Change Data Capture into Snowflake?
Snowflake Change Data Capture is among the most impressive features that make Snowflake an excellent warehousing option. Snowflake CDC is especially helpful when millions of records are transacted on a daily basis, and you want to update only the modified ones. Snowflake CDC and the MERGE command will update the destination without eating up your time and resources.
But discerning readers might have noticed that Snowflake native CDC only works when the source and target tables both in Snowflake. That’s not always what you need.
Here’s a common use case that might sound familiar: You have an upstream data system, like a transactional or NoSQL database, or even a hefty CRM, and you’d like to use CDC to plumb that data into Snowflake in real time.
From there, you can use Snowflake for what it does best: performing analytical workflows on an always-up-to-date view of your data.
To do this, you’ll need an external change data capture tool that can perform CDC on your source system and move that data into Snowflake.
You can find a roundup of CDC tools here… but we (of course) recommend Estuary Flow. With Flow, you can set up CDC to Snowflake from systems like:
… in a no-code UI.
Your first pipeline is free — register here!