
What is DuckLake? The New Open Table Format Explained

DuckLake is the latest open table format. Learn how it stands out from Iceberg, Delta Lake, and others, and try it out for yourself.


Ready to take your data to the next level? Open table formats are becoming the de facto way to manage large datasets for analytics, combining the best of both worlds: the structure and queryability of a warehouse with the flexibility and scale of a data lake.

But table formats and lakehouses are relatively new, and the top existing formats haven't necessarily gotten everything right yet. With their file-based metadata, updating data can be slow and can create mazes of manifest lists to wade through.

Not with DuckLake. A smart new table format, DuckLake rethinks some of the design decisions that have been holding table formats back.

So whether you’re just starting to look for ways to organize your data lake, or you’ve tried every table format on offer and want something better, DuckLake makes working with data smooth sailing.

This article is an all-in-one guide, covering DuckLake essentials, features, and implementation. We’ll even take a peek into the future at the end. 

What is DuckLake?

DuckLake is an open table format specification designed for data lakes from the creators of the open-source database DuckDB.

Great! So, what’s a table format?

In short, a table format lets you organize the files in a data lake in a table-like way without actually using a database. It gives you all the freedom of a data lake while retaining structured metadata to help with data discoverability. Table formats are what transform a data lake into a data lakehouse.

Certain structure and functionality have become standard with table formats, and DuckLake is no exception:

  • Storage is written using the Parquet file format, which efficiently compresses data while including data types
  • This object storage is managed by a metadata and catalog layer
  • ACID compliance ensures transactions behave as expected, mitigating conflicts
  • Features like schema evolution account for changing data

Whether you use Iceberg, Delta Lake, or DuckLake, all of these aspects are fairly common across the board.

But there is one thing that makes DuckLake stand out.

How DuckLake Differs from Apache Iceberg, Delta Lake, and Hudi

So, let’s address the duck in the room: who needs another table format when there’s already Apache Iceberg? Or Delta Lake? Or Hudi? But DuckLake isn’t just a duck-themed clone of an existing table format. It renovates the entire idea of a table format, patching up weaknesses and increasing efficiency.

It does this by storing metadata in a SQL database.

“Hold on,” you might say. “Isn’t the point of data lakes and lakehouses that they aren’t rigidly SQL-based?”

And you’d be right. But the SQL database is only for the metadata. The object storage remains untouched.

The database simply keeps track of the files that are available and their versioning—everything that the manifest files would usually cover in other table formats. Since this information is stored in a database rather than a collection of individual files, it can be far more efficient to look up the files in your object storage. And you don’t even need a separate catalog to manage all those metadata files.

Anyone who’s tried comparing catalogs to home in on the best option may already consider this a substantial improvement.

Switching to a database doesn’t just improve read efficiency (though that’s already a huge benefit). You no longer need to reconstruct snapshots by compiling manifest files, you don’t need to write whole new manifest files and manifest lists for every little change, and you certainly don’t need to look up specific Parquet file paths to write a query.
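To make that concrete, here is a minimal sketch of the kind of lookup the catalog database performs when planning a query. The table and column names are illustrative stand-ins, not the exact schema from the DuckLake specification; the point is that file planning becomes a plain SQL query rather than a walk through manifest files.

```sql
-- Illustrative stand-ins for DuckLake's catalog tables, not the real schema.
-- Which Parquet files make up the 'orders' table as of snapshot 42?
SELECT f.path, f.record_count
FROM data_files AS f
WHERE f.table_name = 'orders'
  AND f.begin_snapshot <= 42
  AND (f.end_snapshot IS NULL OR f.end_snapshot > 42);
```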

It’s like DuckLake is a standard modern table format… but with a superpower.

DuckLake vs Other Table Formats: Feature Comparison Table

 

| Feature | DuckLake | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| ACID compliant | Yes | Yes | Yes |
| Schema evolution | Yes | Yes | Yes (using merge) |
| Time travel | Yes | Yes | Yes |
| Metadata storage | SQL database | Serialized JSON files | File-based transaction log |
| Catalog | Built-in | Options including AWS Glue, REST API catalogs such as Apache Polaris, Unity Catalog, etc. | Options including AWS Glue, Unity Catalog, etc. |
| File operations | Just on Parquet data files | Parquet data files, JSON metadata files, Avro manifest files and manifest lists | Parquet data files, JSON transaction log files |
| Speed | Fast metadata lookups | Limited by extensive file I/O | Limited by extensive file I/O |

DuckLake Core Components and Architecture

We’ve covered how DuckLake is different by implementing a database for metadata management. What does that look like in practice?

The overall architecture still reflects that of other table formats: there’s a data layer of Parquet files and a metadata layer with an overarching catalog. But in this case, the metadata resides in a database with the catalog built in.

Data, metadata, and database/catalog layers in the DuckLake table format

So, what flavor is this database? DuckDB, PostgreSQL, MySQL? The answer is that it can be any of those, or more. DuckLake can work with any SQL-based database. And you can easily switch to a different database if your requirements change.

It may seem like you’re simply trading the catalog wars for the database wars (which are, at least, an old, familiar battleground). But catalogs were always a separate entity bolted onto a table format to make it work. They integrated through specialty catalog APIs and were overall less accessible.

Databases are a known quantity. And DuckDB works perfectly well as a default. You just don’t have to be locked in to DuckDB if you’d prefer to use a different system.

It’s the same story for data storage. You can use S3, Google Cloud Storage, Azure Blob Storage, whatever object storage you want.
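For example, using DuckDB’s ducklake extension you can keep the catalog in PostgreSQL while the data files live in S3. The connection details and bucket below are placeholders, and the exact ATTACH options are worth double-checking against the DuckLake documentation for your backend:

```sql
-- Placeholders for host, database name, and bucket; assumes S3 credentials
-- are already configured (e.g., via a DuckDB secret).
INSTALL ducklake;
INSTALL postgres;

-- Metadata catalog in PostgreSQL, data files in S3.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS my_ducklake
    (DATA_PATH 's3://my-bucket/ducklake-data/');
```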

The point is to be flexible enough to fit your existing architecture and know-how, not to add confusion and complexity into the mix. This preference for simplicity over complexity is a core part of DuckLake’s design philosophy.

DuckLake Design Philosophy

DuckLake was designed to solve problems, with several core principles in mind. From the DuckLake manifesto, these include simplicity, scalability, and speed.

Simplicity

We’ve already touched on how DuckLake simplifies the complex. Traversing a file system to look up metadata to find the correct storage file is complex. A SQL query is not. Databases are familiar and their transactional patterns have long been hammered out.

DuckLake makes metadata management more intuitive and streamlined.

Scalability

The move away from databases to table formats and data lakehouses was driven in part by scalability in the first place. So it’s tempting to wonder whether a database-driven table format can keep up.

Of course, data storage itself remains in object storage. It’s only the metadata that lives in a database, and metadata can be minuscule compared to the actual data.

Add to that the fact that you can always migrate to a different database for your metadata if your existing one starts to feel the squeeze, and you’re basically set to scale however you need.

Speed

Databases are designed for efficient transactions. Compared to complex file I/O, sometimes across numerous small manifest files, a database query is not only going to be simpler, but much faster.

This speed can also reduce conflicts when multiple transactions try to interact with the same data, improving efficiency further.

Besides these three S’s, DuckLake seems to have another guiding principle: no vendor lock-in. Every step of the way, DuckLake was designed to avoid proprietary formats so that you always own your own data, whether that’s Parquet files in your object store of choice or metadata in your own database. You don’t need to depend on proprietary technology.

It’s gratifying to see more technology embracing this open mindset. Learning about DuckLake, I was struck by the similarities with Estuary’s values, making the complex simple and easily scalable while focusing on real-time speed without vendor lock-in.

DuckLake Technical Features

Enough with the philosophy. Time to get a little technical. Modern table formats tend to come with a host of standard features and capabilities. DuckLake is no exception, and isn’t afraid to improve on the standard where possible.

These features can broadly be broken out into a couple of categories, focusing on performance and data management.

DuckLake Features: Transactions and Performance

ACID-Compliant Transactional Support

ACID transactions for databases

Atomicity, Consistency, Isolation, and Durability (ACID) define the expected behavior for reliable transactions. This covers concepts like no partial transactions and concurrent transactions not interfering with each other. Things that you’re glad your bank’s database implements.

DuckLake guarantees full ACID compliance, unlike traditional data lakes.

Cross-Table Operations

People usually think of databases and other data stores as containing many tables, but table formats don’t always handle transactions that span them. DuckLake supports multi-table ACID transactions over all managed schemas, tables, and their content.
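As a quick sketch (the tables here are hypothetical), a multi-table write is just a standard SQL transaction: both changes land in the same snapshot, or neither does.

```sql
-- Hypothetical tables: both writes commit together or not at all.
BEGIN TRANSACTION;
INSERT INTO my_ducklake.orders VALUES (1001, 'widget', 3);
UPDATE my_ducklake.inventory
   SET stock = stock - 3
 WHERE item = 'widget';
COMMIT;
```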

Reduced File I/O

Database queries instead of file traversal mean that query performance is faster and more efficient. DuckLake spends as little time as possible in the critical path.

Concurrent Transactions

Because transactions are more efficient, conflicts (and the need to resolve them) are reduced. Add the fact that standard SQL databases, unlike file systems, are built for concurrency, and DuckLake can easily handle many more concurrent transactions than file-based metadata can.

DuckLake Features: Data Management

DuckLake table format features

Schema Evolution

Data changes. Companies grow, adding new features to track. Regulations are put in place, limiting the types of data you can store. Being able to flexibly update your schemas to keep pace is just good sense.

DuckLake not only supports full schema evolution (adding, changing, and dropping columns), but also multi-table schema evolution, even allowing you to update columns in the metadata without needing to rewrite data files.
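In practice, that’s ordinary DDL against the attached catalog (the table and columns below are hypothetical); the change is recorded in the metadata database rather than by rewriting Parquet files.

```sql
-- Hypothetical table and columns: schema changes are plain DDL.
ALTER TABLE my_ducklake.customers ADD COLUMN loyalty_tier VARCHAR;
ALTER TABLE my_ducklake.customers RENAME COLUMN phone TO phone_number;
ALTER TABLE my_ducklake.customers DROP COLUMN fax;
```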

Time Travel and Snapshot Isolation

It’s important to know your history. Isolated snapshots allow you to track discrete changes and travel back in time, to see the exact state of your data in the past.

DuckLake can handle tons of snapshots because of their compact format in the metadata, and because DuckLake allows snapshots to refer to discrete parts of a Parquet file. This means you can travel back in time with impunity without needing to prune older snapshots that could still offer valuable insights.
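Here’s a sketch of what that looks like in DuckDB’s ducklake extension; the table name, version number, and timestamp are placeholders, and the AT clause is the snapshot-query syntax as I understand it from the docs.

```sql
-- Placeholders: query the hypothetical orders table as of an earlier snapshot...
SELECT * FROM my_ducklake.orders AT (VERSION => 3);

-- ...or as of a point in time.
SELECT * FROM my_ducklake.orders AT (TIMESTAMP => TIMESTAMP '2025-06-01 00:00:00');
```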

Hidden Partitioning

A huge benefit of Iceberg is its hidden partitioning, which automatically produces and optimizes partition values. Well, DuckLake can do that, too. DuckLake handles partition pruning behind the scenes so you can simply write a WHERE clause without caring about specific partitions.
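For example, with a hypothetical events table partitioned on its timestamp column, the query below is all you write; DuckLake prunes the partitions that can’t match the predicate.

```sql
-- Hypothetical table: no partition columns or file paths appear in the query.
SELECT count(*) AS signups
FROM my_ducklake.events
WHERE event_type = 'signup'
  AND event_time >= TIMESTAMP '2025-01-01'
  AND event_time <  TIMESTAMP '2025-02-01';
```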

Incremental Scans

Sometimes what’s changed is what matters most. DuckLake allows you to specify two snapshots and only retrieve data that changed between the two.
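The ducklake extension exposes this as a change-feed table function; the sketch below assumes a function along the lines of ducklake_table_changes, so treat the name and arguments as assumptions to verify against the docs.

```sql
-- Assumed function name and arguments; verify against the ducklake docs.
-- Rows inserted, updated, or deleted between snapshots 3 and 5 of 'orders'.
SELECT *
FROM ducklake_table_changes('my_ducklake', 'main', 'orders', 3, 5);
```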

Encryption and Data Security

Sensitive data isn’t always a good mix with data lakes. Zero-trust principles can help implement and inform rigorous security practices. DuckLake’s contribution is the ability to encrypt all data files written to your object storage, with the catalog database managing keys.

All of these attributes help make DuckLake a fully-featured modern table format. Learn more in DuckLake’s documentation.

How to Implement DuckLake in Your Data Architecture

Ready to try out DuckLake? Thankfully, implementation can be a snap. We’ll cover a more production-oriented setup in a bit, but it’s worth noting that you can test out a simple setup right from your own laptop.

To create a local instance, all you need to do is download and start up a copy of DuckDB, version 1.3.0 or higher. Then run the commands:

```sql
INSTALL ducklake;
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake;
```

That’s all there is to it. You can start creating tables, inserting data, and querying using your local DuckLake instance. See DuckDB’s guide for more.
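For example (the catalog name comes from the ATTACH command above; the table is arbitrary):

```sql
-- Create a table in the DuckLake catalog, load a couple of rows, and query them.
CREATE TABLE my_ducklake.demo (id INTEGER, name VARCHAR);
INSERT INTO my_ducklake.demo VALUES (1, 'duck'), (2, 'goose');
SELECT * FROM my_ducklake.demo;
```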

When to Use DuckLake

Before jumping into DuckLake in production, it’s first wise to consider how you plan to use it.

Any table format working with file storage, even DuckLake, is best off used for analytical workloads. Even with DuckLake’s addition of a metadata database that cuts down on transaction conflicts, these are not transactional systems. Your application-backing workhorse is still better off being a standard database like PostgreSQL, MySQL, or MongoDB. These systems are optimized for writes and can handle high-traffic scenarios for your application.

But given an analytical use case, DuckLake is in its element, especially if you plan on complex data operations. DuckLake excels at handling frequent schema changes and time travel scenarios, particularly if you need to store many historical snapshots.

It’s also a great choice if you value customization without the headaches. If you’re looking for a straightforward setup, the default DuckDB-driven implementation hosted by MotherDuck works well. On the other hand, if you’re comfortable digging in and shuffling things around to meet your exact specifications, you can choose your object storage, metadata database, and—coming soon—your compute engine, even in flexible multi-engine environments.

For now, though, let’s stick with the default path for an actual implementation example.

MotherDuck: Hosted DuckDB and DuckLake


Open table formats are great. They foster a community that comes together to solve problems and provide input on real use cases. But, similar to open-source databases, you don’t always want to set up and manage your own instance, even if you can. This is where cloud services come in.

MotherDuck is a cloud warehouse service specifically designed for DuckDB and DuckLake. The platform lets you scale effortlessly, even seamlessly transitioning from standard storage to DuckLake as your data grows.

In keeping with its goal of super-fast analytics queries, MotherDuck offers features like DuckDB extension support and (my favorite) a SQL FixIt feature that intelligently suggests corrections for incorrectly formatted queries.

With MotherDuck, DuckLake setup can be as simple as one line:

```sql
CREATE DATABASE my_ducklake (TYPE ducklake);
```

Not to mention that using a hosted service often makes it a breeze to get data into DuckLake in the first place as well. Data pipeline tools like Estuary offer hundreds of easy-to-use connectors so you can start stocking your lake with data from all your SaaS, database, and message streaming sources.

Connecting DuckLake to Your Data Pipelines with Estuary Flow

Estuary: Real-Time ETL Platform

You can easily integrate your new MotherDuck-hosted DuckLake instance with the rest of your data architecture using Estuary. Estuary is a real-time and batch ETL platform that simplifies data movement. With hundreds of in-house, open-core connectors, you can easily wire up data captures that then materialize to your DuckLake lakehouse.

To do so, simply create a new MotherDuck materialization in Estuary. You’ll need to provide configuration details such as:

  • MotherDuck Service Token: Used to authenticate your account.
  • Database: The name of your DuckLake database (such as my_ducklake).
  • S3 or GCS Staging Bucket: Name, region, and access keys for a storage bucket that Estuary will use as a staging area to load data into MotherDuck.
    • Note: MotherDuck should also have access to this bucket!

See a step-by-step guide in How to Load Streaming Data into DuckLake. Or watch MotherDuck’s webinar with Estuary to see the integration in action:

No additional configuration is required to use Estuary’s MotherDuck connector for DuckLake versus DuckDB: it simply depends on the type of database you create in MotherDuck.

Migrating Between Other Table Formats and DuckLake

Hoping to replace your existing table format rather than starting fresh? There are plans to implement seamless Apache Iceberg imports and exports. For now, metadata-only migrations are possible: DuckLake’s data and positional deletion files are compatible with Iceberg’s. That’s the beauty of open, standardized formats like Parquet.

If you know the Parquet data files you want to transfer and are willing to start fresh with the metadata, Estuary can help with the migration. You can export your Parquet files to a cloud storage solution of your choice. Estuary’s cloud storage source connectors (Amazon S3, Google Cloud Storage, and Azure Blob Storage) can parse these Parquet files and prepare the data to be sent to any destination. You can then materialize the parsed data collections to a MotherDuck DuckLake instance.
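If you’d rather sketch the simplest version of that export step by hand in DuckDB (bucket, paths, and table names below are placeholders), it’s standard COPY and read_parquet:

```sql
-- Placeholders throughout: bucket, paths, and table names.
-- 1. Export an existing table's data to Parquet in object storage.
COPY (SELECT * FROM old_catalog.orders)
  TO 's3://my-bucket/export/orders.parquet' (FORMAT PARQUET);

-- 2. Load that Parquet data into a DuckLake-managed table.
CREATE TABLE my_ducklake.orders AS
  SELECT * FROM read_parquet('s3://my-bucket/export/orders.parquet');
```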

Or, if you’re going the other way and migrating away from DuckLake, you can save your data to Apache Iceberg instead. Why would you want to do this? It’s unclear. But knowing you can is understandably comforting: it means you can always retrieve your data if the situation changes.

The Future of DuckLake: Roadmap and Upcoming Features

DuckLake is already a fully featured modern table format that cleverly solves standard table format inefficiencies, but this is only the beginning. Here’s a preview of what’s yet to come from MotherDuck’s roadmap:

Expanded Import/Export Capabilities

We already mentioned that direct Iceberg import and export is planned. DuckDB’s roadmap seems to indicate that Delta Lake compatibility is planned as well. DuckDB extensions for Iceberg and Delta Lake will still be supported and improved on, so you should be able to switch between top table formats with ease.

Multi-Engine Support

While you can choose your own database and cloud storage when using DuckLake, DuckDB is currently the only compute engine option. Apache Spark support is coming soon (funded by MotherDuck!), along with Ray and other compute frameworks. These options will further DuckLake’s customizability while retaining smart defaults that just work.

Automated Data Management

Developers tend to appreciate programming languages with automatic garbage collection. Why should it be any different for data structures? MotherDuck will soon make working with DuckLake that much simpler by implementing automatic data compaction and garbage collection.

Efficient File Writes with Data Inlining

DuckLake already works efficiently with files, reducing file I/O time and eliminating the metadata manifest file structure in favor of a database. But high-traffic streaming scenarios can dramatically balloon the number of files in a lakehouse, no matter how efficiently a format manages them. DuckLake’s experimental data inlining feature holds small changes in the catalog database until enough rows accumulate to write a Parquet file, so you don’t end up with a mess of individual, tiny files for each change. MotherDuck also plans to support this feature soon.
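As far as I can tell, the ducklake extension exposes this as an option at ATTACH time; the option name below is an assumption to verify before relying on it.

```sql
-- Assumption: the experimental inlining threshold is configured at ATTACH time.
-- Changes smaller than the limit stay in the catalog database instead of
-- producing a tiny Parquet file each time.
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake
    (DATA_INLINING_ROW_LIMIT 10);
```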

Get Involved with DuckLake

DuckLake is fully open-source: both the DuckLake specification itself and the DuckDB extension for DuckLake are released under the MIT license. So if you want to get in on the ground floor and help steer the course of this exciting new table format, now’s a great time to get involved.

You can find the DuckDB extension on GitHub and join in on the discussion. Or you can start implementing your own engine following the specification—DuckLake is completely vendor-neutral.

In the meantime, keep up with the latest advancements in the data world by following Estuary on LinkedIn or subscribing to our YouTube channel. We’re always interested to see what the folks at MotherDuck and DuckDB come up with next!
