Understanding the modern data stack, and why open-source matters
Every organization’s data goes through a complex journey every single day. It’s collected, moved, transformed, explored, and analyzed to inform business decisions. Along the way, it must be safeguarded from errors and checked for quality.
As you might expect, there’s a huge array of platforms and tools designed to cover every step of this process. The “modern data stack” is one framework used to conceptualize how different data tools work together to allow a complete data journey.
But there’s more to it than that.
In this post, we’ll dive into what exactly a data stack is, what makes one modern, and how data stacks will continue to modernize in the coming years.
We’ll also cover an important decision you’ll need to make as you design your stack: whether to use open-source components.
What is the modern data stack?
To get at the definition of the modern data stack, it helps to take a quick tour of the evolution of the term.
We begin with the overarching grandparent term: the technology stack, or tech stack. A tech stack is a combination of different technology systems used to build a product or service. This term generally applies to software development, but can be further categorized by specific functional area.
This leads us to our next term: the data stack. A data stack is a type of tech stack designed to facilitate the storage, access, and management of data.
Finally, we get to the modern data stack. What exactly makes a data stack modern? That depends on who you ask, but one of the biggest indicators is the cloud-based architecture of its components, specifically data warehouses and SaaS.
Older data stacks were limited in speed and scale. Analytics were slow because they pulled from data storage systems that weren’t optimized for that purpose. The data pipelines that connected other systems were typically ad-hoc, prone to breakage, and introduced latency. But most importantly, everything was on-premises. Even advances in storage and tools, like the advent of OLAP data warehouses, couldn’t remove the limitations that came with using physical servers.
What we know as the “modern data stack” came about in the mid-2010s, as cloud-based technology became the status quo. It wasn’t an overnight change. First, Amazon Web Services, Google Cloud, and Microsoft Azure made general-purpose cloud storage widely available. Then came solutions (like the analytics-focused Vertica) that you could self-host either on-premises or on a cloud platform of your choosing.
The next logical step was specialized, fully cloud-native platforms. Amazon Redshift broke ground in this space as the first cloud-native OLAP database. A 2020 article on dbt’s blog makes the case that the launch of Redshift is what allowed most products that define today’s “modern data stack” to flourish. Regardless of the exact catalyst, we can all agree that most of the components of a modern data stack are fully cloud-based or follow the SaaS model.
Combining all of this, we can finally define the modern data stack.
A modern data stack is a combination of different data systems used by organizations to facilitate the storage, access, and management of data in a cloud-based environment.
There are other desirable characteristics of a modern data stack, but these haven’t been universally achieved, and the tooling is still actively being developed. We’ll discuss these more in a bit.
Basic components of a modern data stack
Modern data stack tools combine a variety of functions, but there isn’t a set checklist of roles each piece must fulfill. The systems a given organization uses will depend on its goals for its data and business.
What’s more, data technology is a dynamic and evolving space. Sometimes, a single system can fulfill multiple roles. For example, an organization that uses a real-time data integration platform to ingest data may not need a data orchestration system.
For this reason, we’ll break down the essential categories of modern data stack tools: things for which you’ll likely have a dedicated system. Then, we’ll cover other important components that are more likely to be covered by other systems, or can be omitted in certain cases.
- Data warehouse: Your cloud-based, OLAP storage designed to power analytics.
  - Popular options: Redshift, BigQuery, Snowflake
  - Popular option with similar characteristics: Databricks
- Ingestion: Connects data sources to the data warehouse and other components of the stack. You can think of this as the “EL” in “ELT.”
  - Popular options: Fivetran, Airbyte, Meltano
  - Up-and-coming option: Estuary Flow
- Transformation: Applies queries, joins, and similar modifications to raw data. This creates a user-friendly version of your data that’s ready to be operationalized, or leveraged. Sometimes, this functionality is included in your ingestion system, but it helps to have another environment for exploratory analysis.
  - Popular option: dbt
- Business intelligence or data visualization: Where data is analyzed to produce business-applicable insights in the form of metrics, visuals, and reports.
  - Popular options: Looker, Mode, Tableau, Preset, Superset, Thoughtspot
- Operationalization, or “reverse ETL”: Moves transformed data into BI or other SaaS tools, where it is put to work. The tool used for ingestion may already cover this.
  - Popular options: Census, Hightouch, Rudderstack
- Observability and monitoring: Tracks logs and metrics to give your team insights into data health, system behavior, and, sometimes, the path data takes through your stack.
  - Popular options: Monte Carlo, Observe.ai, Splunk, Datadog, Datakin
- Orchestration: Executes jobs and manages the data lifecycle and movement throughout different components of the stack.
  - Popular options: Airflow, Prefect, Dagster, Astronomer
- Metadata management: Maintains a central repository of metadata across the stack, which other components can pull from.
  - Popular options: OpenMetadata, Informatica, MANTA
Note that the above is not meant to be an exhaustive list of valuable modern data stack companies and tools! It’s simply a list of some names you might recognize.
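To make the ingestion and transformation roles more concrete, here’s a minimal sketch of the “EL, then T” pattern in Python. Treat it as an illustration under some loud assumptions: an in-memory SQLite database stands in for a cloud data warehouse, a hard-coded list stands in for a source system, and every name in it (`raw_orders`, `orders_by_customer`, `extract`, `load`, `transform`, `run_pipeline`) is hypothetical rather than taken from any of the tools listed above.

```python
import sqlite3

# Hypothetical source records. A real ingestion tool would pull these
# from an API, a database, or an event stream.
SOURCE_ROWS = [
    ("o-1", "alice", 30.0),
    ("o-2", "bob", 12.5),
    ("o-3", "alice", 7.5),
]


def extract():
    """E: read raw records from the source (here, a constant list)."""
    return list(SOURCE_ROWS)


def load(conn, rows):
    """L: land the records in the warehouse without reshaping them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def transform(conn):
    """T: build an analysis-ready table from the raw one, inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS orders_by_customer")
    conn.execute(
        "CREATE TABLE orders_by_customer AS "
        "SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total_amount "
        "FROM raw_orders GROUP BY customer"
    )
    conn.commit()


def run_pipeline():
    """Run extract, load, and transform in order; return the modeled rows."""
    conn = sqlite3.connect(":memory:")  # assumption: stand-in for a cloud warehouse
    load(conn, extract())
    transform(conn)
    return conn.execute(
        "SELECT customer, order_count, total_amount "
        "FROM orders_by_customer ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

In this sketch, a BI tool would read from `orders_by_customer`, and an orchestrator’s job would be to make sure `load` runs before `transform` each time the pipeline executes.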
Below is a visual of how a stack’s components might interact. Again, this varies widely, so it’s just an example.
The future of the modern data stack
We’re entering a new wave of innovation for the data stack. Platforms and solutions are proliferating at a dizzying pace.
We’ve more or less solved the problem of big data storage, and SaaS is a common business model. Now, new platforms can refine the basics and further specialize.
Looking into all these new platforms can get overwhelming. As you research, focus on the themes of change in the industry — specifically, how they affect your organization and how a given tool can help you incorporate positive change.
Here are a couple of these important industry trends:
- Real-time data: Most of the current data ingestion, reverse ETL, and orchestration tools rely on a batch data paradigm. That is, they run at an interval and introduce some amount of time lag. However, data delays are becoming less acceptable with every passing year, and at the same time, the tools for painless real-time data integration are finally within reach. As time goes on, we’ll see real-time become the standard.
- Data democratization: Working with data systems has always required a high degree of technical expertise. As a result, data traditionally lived in a silo, where only a small group of specialists were able to meaningfully manage it. As data becomes more important across organizations, it should be usable and manageable by different types of professionals and different user groups. That’s what data democratization is about: data as a shared resource, accessible to all. Getting to this point is the industry’s next frontier.
To read more about these trends, see our posts on Estuary’s real-time vision and data mesh.
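To put a rough number on the time lag that batch tools introduce, here’s a small back-of-the-envelope sketch in Python. It rests on simplifying assumptions that we’re making up for illustration: batch runs fire at exact multiples of a fixed interval, processing itself takes no time, and an event that arrives exactly at a run time is picked up by that run. The function names `batch_lag` and `average_lag` are hypothetical.

```python
import math


def batch_lag(event_time, interval):
    """Seconds an event waits until the next scheduled batch run picks it up.

    Assumes runs happen at t = interval, 2 * interval, ... and that an
    event arriving exactly at a run time is included in that run.
    """
    next_run = math.ceil(event_time / interval) * interval
    return next_run - event_time


def average_lag(event_times, interval):
    """Mean waiting time across all events for a given batch interval."""
    lags = [batch_lag(t, interval) for t in event_times]
    return sum(lags) / len(lags)


if __name__ == "__main__":
    # One event per minute over an hour (times in seconds).
    events = [60 * i for i in range(1, 61)]
    for interval in (3600, 900):
        print(f"interval={interval}s average lag={average_lag(events, interval)}s")
```

With evenly spaced events like these, the average delay comes out a little under half the batch interval, so shortening the interval reduces the lag but never removes it. That residual delay is the gap real-time integration aims to close.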
Benefits of open-source in the modern data stack
As you build out your stack and choose between different components, you’ll notice varied price points and levels of usability. And some tools are completely open-source and free to use.
The debate between open-source and closed-source software is much older than the modern data stack. There are valid advantages and disadvantages to both.
Here’s a quick rundown of the arguments for each:
Advantages of open-source:
- It’s free! You won’t be paying a bill to use it.
- More contributors add a range of features and cover more use-cases.
- Enhancements won’t get bottlenecked by a single engineering team.
- The tool isn’t dependent on the welfare or survival of a specific company.
Advantages of closed-source:
- You get what you pay for: closed-source products can be more user-friendly. Compared with open-source tools, you’re less likely to incur a hidden cost in engineering time on your side of the equation. And you probably won’t have to pay for self-hosting, either.
- Quality guarantees: closed-source products are sometimes seen as more stable and trustworthy.
Fortunately, there’s a way to get the best of both worlds, and that’s what we’re seeing in many of the new data systems today.
Many of these tools are built and overseen by a company, but are at least partially open-source. This can mean certain components are open-source, but other features are paid. Often, the code is open-source for anyone who’s able to host the platform on their own infrastructure, but paid, hosted options are also available.
Some companies also use a Business Source License (BSL). This means that the code is open to external contributors, but can only be used freely in non-production environments. That is, you’re free to change things yourself, but if you want to use the code for your own business, you still have to pay. Here are two examples of these approaches:
- dbt offers an open-source, self-hosted version of their product, as well as paid tiers.
- Estuary Flow’s runtime is licensed under BSL. Its plug-in connectors are completely open-source and compatible with Airbyte, a platform known for its many open-source connectors.
Let’s take another look at all the popular products listed above. The bolded products are either completely open-source or have some open-source component.
- Data warehouse: Redshift, BigQuery, Snowflake, Databricks
- Ingestion: Fivetran, Airbyte, Meltano, Estuary Flow
- Transformation: dbt
- Business intelligence or data visualization: Mode, Tableau, Preset, Superset, Thoughtspot
- Operationalization, or “reverse ETL”: Census, Hightouch, Rudderstack
- Observability and monitoring: Monte Carlo, Observe.ai, Splunk, Datadog, Datakin
- Orchestration: Airflow, Prefect, Dagster, Astronomer
- Metadata management: OpenMetadata, Informatica, MANTA
This means you can more or less build a high-quality, completely open-source modern data stack, with one notable exception: none of the popular data warehouses are open-source (except for Databricks’ Delta Lake, which isn’t actually a warehouse). For an example of how you might build a completely open-source stack, see this article in Towards Data Science.
It’s unlikely you have that specific goal, though. The important thing here is that the varied approaches to open-sourcing allow flexibility and cost savings across your stack.
You can save money where it makes sense for your team by leveraging open-source offerings in that area. For other functionalities, paid options will save you valuable time, as well as the trouble of self-hosting. And a community of open-source contributors working on the codebases that underlie many products makes those products more feature-rich and valuable.
Data flexibility and scale for your business goals
The “modern data stack” isn’t something that’s easy to define. It’s an umbrella term with a few key characteristics:
- Allows you to store, access, and manage data.
- Made up of mix-and-match components, so it’s flexible.
- Has a cloud-based architecture, so it’s scalable.
A modern data stack can be open-source, paid, or a combination. Though technical, it’s a business tool, so design your stack with organizational goals in mind.
If a real-time data foundation is one of those goals, but it’s proved too challenging in the past, Estuary Flow was designed for you.
Flow is a DataOps platform that provides end-to-end data integration. It covers many of the functionalities we’ve mentioned: ingestion, transformation, and reverse ETL.
Learn more on our website, docs, or GitHub.
To try Estuary Flow for your open-source modern data stack, sign up for the free tier here.
Keywords: modern data stack, open-source