For raw data to turn into actionable business insight, it must move between multiple systems. The process of moving data from point A to point B in general is called data integration. And two popular data integration paradigms are called ETL and ELT. But what’s the difference between ETL and ELT? Obviously, we swapped a couple of letters, but it must go deeper than that, right?
Right. There’s a lot more to ETL vs ELT than the order of letters and the processing steps they represent. In data integration, timing is everything, so there are some serious implications when you switch things up.
In this article, I’ll break down…
- What ETL and ELT are, anyway.
- The practical differences between ETL and ELT: advantages, disadvantages, and where you’re likely to encounter each “in the wild.”
- Hybrid architectures and how to get the best of both worlds.
Let’s dive in.
ETL (Extract, Transform, Load)
ETL describes the three major steps of a data integration mechanism (aka a data pipeline). These are:
- Extract. Data is extracted, usually from multiple sources — APIs, applications, webhooks, sensors, etc.
- Transform: Data is formatted, structured, enriched, or otherwise modified for reporting and answering analytical questions.
- Load: Data is loaded into a final storage platform, such as a physical database or cloud storage.
For a pipeline to be considered ETL, the steps have to happen in this order. Transformation must happen before loading.
Let’s illustrate ETL with a common use case.
Suppose you need to ingest data into a data warehouse. Once in the data warehouse, your team plans to perform analysis. There’s a data science project planned and your growing squad of analytics engineers can’t wait to explore the data in dbt.
But let’s say the data you’re bringing into the data warehouse is loosely structured… or not structured at all. It could be amorphous collections of user data from Firestore, or IOT sensor data coming in through Kafka.
It’s not described by a schema. It’s not in any sort of useful or predictable shape. Before analysis can happen, you need to transform this data: manipulate it into a known format — a table, in this case. You may also join multiple datasets or trim unneeded data.
You use a processing engine to perform these transformations before loading the data into the warehouse. That way, as soon as the data lands, your team can be assured it’s usable for analysis.
Advantages and Disadvantages of ETL
ETL is an older method than ELT. That’s because in the early days of data integration, its advantages were non-negotiable. And they’re still incredibly important today. I’m discussing these ETL advantages in terms of a data warehouse, because it’s the most common destination, but they’re pretty universal.
- Protects the destination system from corrupt or unusable data. ETL pipelines will almost always transform messy data into the correct form. If they can’t, they’ll usually halt, keeping the mess outside the warehouse.
- Guarantees data in the warehouse is ready for use. Often, multiple teams rely on a central source of truth to run analyses across a company. With ETL, there’s no potential for confusion about what data is clean and ready for use (it all is!)
- Saves data warehouse storage and compute resources. When you pre-aggregating and filter data, less of it needs to be stored. And executing the transformations in a separate environment saves data warehouse query costs. Sometimes, you can find a more cost-effective way to transform data outside the warehouse.
But ETL has some painful disadvantages. These became obvious with the rise of big data across industries. These disadvantages include:
- Hard to scale. In traditional data pipeline architectures, transformations don’t perform well at high volume.
- Adds latency. Unless you’re using a sophisticated on-the-fly transformation (more on that in a bit), transformations always extend the data’s time en route to the destination. Factor in the struggling performance at scale, and this latency increases.
- Limits analytical possibilities. Transformations like joins, filters, and aggregations, make some operations impossible that would be possible with the raw data.
ETL Use Cases and Examples
You’re likely to encounter traditional ETL when…
- The incoming data is highly unstructured, and the final data format is certain. For example, you’re using a webhook to collect user data from several online store, and you know your analysis team needs to perform standard marketing research.
- Latency isn’t a big concern. For example, you only update your models weekly and use them to perform long-term planning, not instant decision-making.
- Data warehouse governance is strict. You’re an enterprise with important data management standards. Every dataset in your warehouse must have a schema that meets several criteria.
Not all organizations fall into these categories. That’s why, as the rise of cloud computing unlocked unlimited scaling and transformation demands got more complex, engineering teams came up with a new alternative.
ELT (Extract, Load, Transform)
ELT takes those same three data integration steps and flips them.
- Extract. Data is extracted, usually from multiple sources — APIs, applications, webhooks, sensors, etc.
- Load: Data is loaded into the storage platform: often a cloud data lake, which has a lenient structure and cheap scaling.
- Transform: Data is formatted, structured, enriched, or otherwise modified on an as-needed basis.
Again, we’ll use an example to illustrate this. Say you plan to train a machine learning model on a massive amount of social media data from various platforms. While you don’t have a rigid schema for that data, you know more or less what to expect. The data can include media attachments and you plan to store it in a cloud data lake.
In this case, your main priority is to make a huge volume of data available for analysis fast. And you want the data in its raw form for your model. You can always transform it later from the data lake for other purposes.
Advantages and Disadvantages of ELT
The main advantages of ELT are:
- Fast and easy loading. No need to worry about engineering a fast transformation process before loading. You can pipe the data straight into storage and get access immediately.
- One less staging area. Because the transformations happen on top of your storage, there’s no need for a separate staging area. This simplifies your architecture.
- Unlimited possibilities for your data. You have the data in its purest form. Without a pre-load transform, you haven’t given up any options for future transformations and analysis.
However, that last advantage comes with a catch, which leads us to ELT’s disadvantages:
- Risk of creating a “data swamp.” Sometimes, the unstructured nature of the raw data makes it impossible to parse, extract, or query — you may even forget what you have in storage. Such a pool of raw but ultimately unusable data is often called a “data swamp.”
- Post-load transforms can be expensive. This depends on the data and storage system you’re using. But in some cases, running transformations on stored raw data in a platform like a data warehouse can get pricey. This is especially true if you find your team repeating the same basic transformations over and over.
ELT Use Cases and Examples
You’ll likely run into ELT when…
- The source and destination system are the same or the source is highly structured. For example, moving data between two SQL databases.
- Someone’s performing big data analytics.
- The destination is a less-structured storage type, like a NoSQL database.
Key Differences Between ETL and ELT
Use cases aside, let’s take a look at the tangible differences between ETL and ELT processes and components.
Order of steps
This is obvious but key. In ETL, transformation occurs before loading the data into storage. In ELT, transformation happens after data is loaded. The other differences are byproducts of this distinction.
Data processing requirements.
In an ETL pipeline, processing happens before loading. This means that the processing component is completely independent of your storage system. ETL transformation can be batch or real-time. Either way, it needs its own staging area, which you either build or get from an ETL vendor.
Common processing agents include Apache Spark and Flink. These are powerful systems, but they can be expensive and require experienced engineers to operate. Alternatively, an ETL vendor might provide you with processing tools.
By contrast, ELT processing happens on top of the data storage platform — the data warehouse or data lake. This simpler architecture doesn’t require a staging area for transformations, and you can transform data with a much simpler method, SQL.
Often in ELT, what we call operational transformations (joins, aggregations, and cleaning — the “T” in ELT) happen at the same time as analytical transformations (complex analysis, data science, etc). In other words, you’re adding a prerequisite step to any analytical workflow, but if you set this up well, it can be fairly straightforward for analysts to do.
The “T” in ELT can come in other forms, too. For instance, you might clean up data with a pre-computed query and use the resulting view to power analysis. Be careful, though: this can get expensive in a data warehouse environment!
Data storage requirements
ETL usually requires less data storage, and offers you more control and predictability over the amount of data stored. That’s why ETL was the go-to method in the days before cloud storage. ETL is designed for more traditional, rigid data storage that will only accept data adhering to a schema. Traditional relational databases and data warehouses are common storage systems, but more modern, flexible data storage systems can be used, too.
ELT requires a large, highly scalable storage solution, so cloud storage is pretty much the only option. It must also be permissive and flexible. Data lakes, certain warehouses, and hybrid architectures are popular in ELT pipelines.
When to Use ETL vs ELT
Unsure whether ETL or ELT is the right choice? Go through this quick cheat sheet.
Can you store data in the cloud?
If not, stick with ETL to keep data volume manageable.
Do you have strict schema requirements for stored data?
…as in an RDBMS? You’ll only be able to meet them with ETL.
If your storage system isn’t strict (like a NoSQL database or data lake), ELT is possible.
What does your data look like?
If your data is unstructured by nature and can’t be re-shaped to fit a conventional schema or table architecture without compromising it, use ELT.
If your data will conform happily to a schema — especially if it might be more useful in that form — use ETL.
Do multiple stakeholders use the data, and do their plans align?
If the basic transformations used by one team will compromise another team’s work, either use ELT or use ETL with lighter transformations.
What’s your budget?
Finding the most economical data pipeline is a big factor, and there’s no right answer. Take into account…
- Whether you’re using a pipeline vendor (ELT or ETL).
- Cost of storage and querying in the storage system (ELT).
- Cost of the transformation engine of choice (ETL).
How quick do you need your data?
Most (but not all) ETL processes introduce latency prior to loading.
Other Patterns: Modern ETL and More
Despite ELT being a newer paradigm and getting a lot of buzz in the 2010s, it’s not inherently better than ETL. In fact, it brings new issues to the scene.
The division of ETL vs ETL implies a binary: that you must choose the lesser of two evils.
Fortunately, the evolution of data integration didn’t stop there. Many data integrations today are hard to label, and definitely don’t fall under the definitions of ETL or ELT.
Let’s use Estuary Flow as an example.
We call Flow a real-time ETL platform. But in reality, it’s a bit more complex. Here’s how it works. Note where I’ve labeled multiple steps E, T, and L:
- Data is captured from a source using a connector (E).
- A basic descriptive schema is applied, but the data isn’t transformed.
- Captured data is stored in Flow collections, real-time data lakes in cloud storage (L).
- Optionally, you can add transformations like joins and aggregations (T).
- Collections are materialized to the destination. This is a two-step operation:
- Data is transformed to fit the schema of the destination system, and basic reductions are applied automatically (T).
- Data is incrementally loaded to the destination (L).
The final acronym for Flow is EL(T)TL, though I’m sure some of my engineer colleagues would argue that’s an oversimplification.
There are also multiple ways to look at this process. From the perspective of the Flow data collection, it’s ELT. But from the perspective of the final destination, it’s ETL.
Whatever it is, by going beyond the simple dichotomy of ETL vs ELT, Flow proves that it’s possible to create a data pipeline that:
- Transforms and loads in real-time.
- Delivers orderly, analysis-ready data.
- Is affordable and intuitive to use.
There are endless possibilities for data integration architectures. Two major methods are ETL and ELT.
ETL (extract, transform, load) prioritizes orderliness over flexibility. It encourages more efficient storage but can limit the possibilities for analysis. ELT (extract, load, transform) prioritizes flexibility over orderliness. It allows full freedom for analysis and easy scaling but can cause major data governance issues if not handled perfectly.
Many modern data integration platforms find a middle ground. Hopefully by now, you have some ideas for how to strike a balance in your architecture. If you’re looking for some inspiration, check out Estuary Flow’s docs or check out the code — we develop in the open.