Relational Databases, Snowflake, and Primary Key Constraints
Snowflake handles primary key constraints a little differently than other relational databases such as PostgreSQL. This article will cover the following:
- What is a relational database?
- What are database constraints & what purpose they serve
- Primary key constraints in Snowflake
- How to handle lack of enforcement
- Two ways to approach autoincrementing primary keys in Snowflake
What is a relational database?
A relational database is a method of organizing data through predefined relationships. A table in a relational database always has rows and columns.
Snowflake is a cloud-hosted relational database and warehousing tool that allows you to query data with SQL. Companies implement tools like Snowflake into their tech stack because it supports a more cohesive data strategy and accessibility. This is particularly relevant in downstream analytics use cases, and gives various teams appropriate access to company data in one centralized place.
What are database constraints?
Constraints are an essential part of enforcing data integrity and consistency across various entities and attributes. They are declarative rules, which means any data inserted must be adhered to. These ensure each row in the table is uniquely identifiable, and allow connections to be made across different tables. Other than primary key, commonly enforced constraints include unique key, foreign key, check, and not null.
Snowflake Primary Key Constraints
Here’s where things shift a bit: Although Snowflake supports constraints, it does not enforce them like PostgreSQL, MySQL, or similar do. The one exception to this is in non-hybrid tables (more on what a hybrid table is below). In Snowflake, constraints are used mostly for data modeling and compatibility purposes. This could look like ingesting client data into your Snowflake environment, and using the client’s constraints as a checkpoint to verify compatibility and cross-team alignment.
To illustrate this point, see the example below. Here, a table is created with two columns. The first,
PRODUCT_ID, is an integer column with custom precision and scale, and primary key constraints. The second column,
CUSTOMER_ID, is an integer column with standard precision and scale, and has a unique constraint.
If you were to insert duplicate values into the
PRODUCT_ID column using the second query, it would work. Essentially this means a Snowflake user can create a table with a primary key constraint, but it is not enforced because duplicate data can be inserted into the same column without issue.
plaintext--Create test table --PRODUCT_ID; number (6,6) with primary key constraint --CUSTOMER_ID; with unique constraint CREATE OR REPLACE TABLE TEST_DB. TEST_SCHEMA. TEST_TABLE ( PRODUCT_ID NUMBER (6,6) PRIMARY KEY, CUSTOMER_ID INT UNIQUE ); --Insert duplicate values into PRODUCT_ID column INSERT INTO TEST_DB.TEST_SCHEMA. TEST_TABLE (PRODUCT_ID, CUSTOMER_ID) VALUES (1, 'A'), (1, 'B');
One reason Snowflake doesn’t enforce this may be for performance reasons. One benefit of Snowflake is that it can ingest 4 uncompressed Terabytes of data in 6 minutes or less. This performance would not be possible if enforcement was enabled. For more detailed information on this, reference this article written by Allen Wong, a Snowflake Sales Engineer.
How to Enforce Primary Key Constraints in Snowflake
There are a couple ways to handle Snowflake’s lack of enforcement around primary key constraints in non-hybrid tables.
- Create or improve your testing protocol within Snowflake via SQL queries.
- Implementing an additional service like Snowflake’s Unistore, or some other third-party tool.
You could also approach it further downstream in your stack, using a tool like dbt. Here are a few questions to consider:
- How important is it to make sure these constraints are enforced? Meaning, what would the larger impact be if these remain informative only?
- If this is deemed important, what resources are available to research different approaches, workarounds, or strategy? Do we have the bandwidth to perform a proof of concept for potentially multiple solutions? What about testing?
- If we don’t have the resources, do we have the financial bandwidth to implement a new product or service to handle this on our behalf?
- What existing tools are available that could be helpful here?
The list above is not exhaustive and is meant to serve as a starting point as you consider what strategy makes best given your circumstance.
If you determine you need constraints in Snowflake to be enforced, you can approach this in a few ways.
Via Additional Testing
If you’re seeing incorrect primary key values in your data within Snowflake, this is a great conversation to have with other engineers on your team. This could be with your upstream platform engineers, who maintain the source database and applications. This could be with your Analytics Engineers, who may own dbt and testing performed there. It could be a problem your own Data Engineering team needs to own. In reality, if you need to enforce constraints at the Snowflake level, it’s probably going to involve a little bit of everyone.
A better understanding of how data is handled at various points in any given process ensures that everyone who interacts with said data can be confident in its usability and accuracy. Again, the solution here is dependent upon what your Engineering team looks like.
Although over a decade old, author Sam Bendayan offers general information on testing constraints that is still highly relevant today. Check it out in this article here.
Via Snowflakes Unistore
As noted by Wong, Snowflake now offers Unistore, which offers a hybrid table that can enforce column constraints as mentioned above. This article written by Carl Perry of Snowflake offers an in-depth analysis of Unistore, hybrid tables, and constraints.
More of a visual learner? Check out this YouTube video produced by Snowflake on the same topic.
If your team is already a Snowflake client, and needs an out-of-the-box solution to handle constraints, Unistore may be an appropriate solution. However, it may not be. There are a number of considerations to be made prior to integrating any new service, particularly cost, performance, documentation, enterprise support and potential alternatives or workarounds.
Author Marc Staimer offers a critical, but thoughtful discussion around Snowflake’s approach to this use case in this article.
Snowflake Primary Key Autoincrement
There are some situations where you may need to add an autoincrement column (or identity key) in your Snowflake table. An autoincrement generates and inserts a unique number when a new record is inserted into a table. Often, this is the primary key field.
If you need your primary key column to include this type of functionality, there are a couple ways to do this. First is by creating a sequence. The second is by including identity syntax on the column itself.
It’s important to understand that autoincrement and identity are synonymous, but are executed in different ways. The second thing to keep in mind is that both autoincrement and identity can only be used on columns with numeric data types. Meaning, if you have a primary key column that is non-numeric, neither of these are viable options.
Method #1: Create a Sequence
One way to handle a Snowflake primary key autoincrement requirement is through creating a sequence. A sequence is essentially an object separate from the table itself, which allows you to apply its functionality across both sessions and statements. It’s typically used to generate values for a primary key, or any column that requires a unique value.
This is a useful approach when you have a table with many columns. One benefit to this is when you have your own sequence, you have access to next value in said sequence. This is also a good approach when an existing table needs to be modified to include an autoincrement requirement. This is because when you create a sequence, you have access to the following value in said sequence. This is particularly helpful while backfilling data, and you need to add the next incremental value.
For detailed information on how to create and use sequences, see this documentation from Snowflake.
Method #2: Add Identity Syntax on Column
You can also implement auto-increment functionality by including identity syntax on the necessary columns while creating your table. This approach is different than creating a sequence because the identity syntax is part of the table itself, rather than being an object.
For more information on how to add identity syntax, see this documentation from Snowflake.
Despite ever-changing business requirements and increased complexity, one thing is for certain: primary key constraints will always be an important part of data integrity, and have to be properly handled. Whether your organization decides using them as informative-only, or needs to implement their enforcement at the Snowflake level, there is a solution.
Do you have a use case you’re working on & would like to run it by a team of data experts? Reach out to the Estuary Team on Slack.