
When collecting data, enterprises are often unsure how to get it to the right place at the right time. After all, some systems require instant updates, while others need data that is complete and validated before it ever arrives. Unfortunately, most data pipeline architectures force you to choose between batch and real-time processing, as if those were the only two options.
This isn’t a tooling problem, though. It’s a design flaw in how data movement is architected across systems. Traditional pipelines were created for simpler setups, where data flowed from one point (the data source) to another (the destination), was processed on a schedule, and was used primarily for reporting. Today, however, data drives daily operations, not just analytics, and that distinction is crucial for complex systems where both timing and reliability matter.
Luckily, there's a third way: Estuary, which treats data movement as shared infrastructure, where each system consumes data at its own pace.
Key Takeaways
Traditional data pipelines don't scale well; they multiply your problems.
A data movement layer lets you collect data once and deliver it everywhere.
Change Data Capture (CDC) reduces source database load while enabling real-time data sync.
Stateful infrastructure means no more painful backfills or lost data.
The Pipeline Trap: Why Traditional Data Pipelines Don’t Scale
Data pipelines originally had one job: to let people analyze data by moving it from A to B. Back in the early days of data warehousing, this worked perfectly. The routine was predictable. You’d pull data from your operational database, transform it into the right format, and load it into the warehouse overnight. By the time you logged in the next morning, the reports were ready. For a while, that was all businesses needed.
Eventually, the number of systems increased and databases multiplied. Software-as-a-Service (SaaS) tools became ubiquitous, and analytics went from being used primarily for reporting purposes to supporting day-to-day operations. However, the same pipeline-based approach was used for data integration. Even today, many organizations continue to design and implement data pipelines as individual, point-to-point, single-use solutions.
Although each pipeline appears simple enough on its own, together they form a complex, interdependent system that requires continuous maintenance. And it doesn’t really matter whether they are hand-built as scripts, scheduled in workflow orchestrators like Airflow, or defined in ETL tools; each of these pipelines comes with ongoing costs: development, scheduling, monitoring, retrying, and repairing failures.
Over time, these activities add load to source systems, increase the number of schema mappings that must be maintained, generate more alerts, and, ultimately, lead to a proliferation of pipelines that require constant attention. Data engineers end up spending most of their time maintaining these pipelines instead of working on higher-value problems.
And that, my friends, is the pipeline trap.
Why the Pipeline Model Fails at Scale
The most important reason this data pipeline architecture can't scale is the sheer number of connections required as the system grows. In a traditional point-to-point architecture, every source system must be connected to each potential destination system.
If an organization has N sources and M destinations, it ends up with N × M pipelines. For example, 5 sources and 3 destinations already require 15 pipelines. Adding just one more destination means creating 5 more extraction jobs that read directly from production databases.
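A quick back-of-the-envelope sketch (illustrative only, not tied to any specific stack) shows how differently the two approaches grow:

```python
# Illustrative sketch: point-to-point pipelines grow as N x M,
# while connections to a shared movement layer grow as N + M.
def point_to_point(sources: int, destinations: int) -> int:
    return sources * destinations

def shared_layer(sources: int, destinations: int) -> int:
    return sources + destinations

for n, m in [(5, 3), (5, 4), (10, 6)]:
    print(f"{n} sources x {m} destinations: "
          f"{point_to_point(n, m)} pipelines vs "
          f"{shared_layer(n, m)} connections to a shared layer")
# 5 sources x 3 destinations: 15 pipelines vs 8 connections to a shared layer
# 5 sources x 4 destinations: 20 pipelines vs 9 connections to a shared layer
# 10 sources x 6 destinations: 60 pipelines vs 16 connections to a shared layer
```

Point-to-point connections grow multiplicatively, while connections to a shared layer grow additively; that difference is the entire argument of this article in two functions.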
But wait, there’s more!
Operational Load Increases
Each pipeline reads from the source independently, which increases the load on source systems. Operational databases see higher query counts, CPU usage, and I/O. As a result, teams often need to negotiate query windows or throttle jobs to keep them from interfering with the normal operation of user-facing applications.
Failure Rates Surge
Since pipelines are tightly coupled to source schemas, a single change (such as renaming a column or switching its type) can cause multiple pipelines to fail simultaneously. Repairing the failure requires manual intervention: updating the schema mappings in each affected pipeline and re-running the failed jobs. This process introduces friction between application and data teams and slows down deployments.
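To make the failure mode concrete, here's a deliberately minimal, hypothetical sketch: a transform that hard-codes a source column name breaks the moment that column is renamed upstream, and so does every other pipeline that makes the same assumption.

```python
# Hypothetical extract-and-transform step that assumes a fixed source schema.
# If the application team renames "customer_email" to "email_address",
# every pipeline that hard-codes the old name fails on its next run.
def transform(rows: list[dict]) -> list[dict]:
    return [
        {
            "id": row["id"],
            "email": row["customer_email"].lower(),  # KeyError after the rename
        }
        for row in rows
    ]

rows = [{"id": 1, "email_address": "Ada@Example.com"}]  # schema already changed upstream
try:
    transform(rows)
except KeyError as missing:
    print(f"pipeline failed: source column {missing} no longer exists")
```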
Retry/Backfill Activities Become Painful
Most pipelines don't keep a permanent history of changes. Once data is moved, the pipeline forgets about it. If a destination fails, or a logic error is discovered weeks later, the pipeline can't replay the history. Engineers have to re-extract the data from the source system, which puts additional pressure on production.
Information Turns Stale
Batch pipelines are based on a schedule. Even when they're run frequently, there are inevitable gaps between when an event occurs and when it becomes available in other locations. In today's world of real-time applications, those gaps absolutely matter, and as systems continue to grow, the traditional pipeline model won’t adapt. In fact, it’s much more likely to break.
The problem isn’t just the delayed data. Different systems have different expectations around freshness and correctness. Optimizing exclusively for batch or real-time forces teams to make trade-offs that don’t reflect how the business actually works. Data has to arrive at the right time for each consumer, with clear guarantees regarding its freshness and completeness.
What Is a Data Movement Layer?
A data movement layer represents a radically different way of viewing data integration. Rather than treating data movement as a series of discrete tasks, it treats it as a shared service.
At a high level, the data movement layer sits between all data sources and all data consumers. It's important to recognize that data doesn’t always flow in a single direction. While the analytical warehouse continues to be the main consumer of data, operational systems, SaaS applications, and downstream applications rely on the same underlying data; reverse ETL use cases, operational dashboards, feature stores, and AI systems depend on access to it, often with different timing requirements.
All these can be supported by the data movement layer, without the need to create a separate pipeline for each destination.
Sources are read once for capture, and destinations consume data from the layer at whatever speed they require. Real-time systems can process changes as they happen, while others may prioritize data quality (consistency, aggregation, and the like) and historical completeness. The layer is not limited to a single latency model; it delivers data in real time wherever a use case requires it, and it retains data as it moves, creating a continuous, persistent history.
At its core, the data movement layer is a very simple concept: collect data once, and make it available everywhere. This fundamentally changes the way the system is structured. Instead of numerous discrete point-to-point pipelines, there is a common substrate where data flows continuously into the layer and then out to multiple destinations. Adding a new destination never requires touching the source again.
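One way to picture this (as a mental model only, not a description of any product's internals) is an append-only log that is written once and read by many consumers, each keeping track of its own position:

```python
# Minimal mental model: one captured stream, many consumers, independent cursors.
# This illustrates the idea, not how any particular product stores data.
class MovementLayer:
    def __init__(self):
        self.log = []            # durable, append-only history of events
        self.cursors = {}        # consumer name -> next offset to read

    def capture(self, event: dict) -> None:
        self.log.append(event)   # the source is written to exactly once

    def register(self, consumer: str, from_start: bool = True) -> None:
        # A new destination can backfill from history without touching the source.
        self.cursors[consumer] = 0 if from_start else len(self.log)

    def read(self, consumer: str, max_events: int = 100) -> list[dict]:
        start = self.cursors[consumer]
        batch = self.log[start:start + max_events]
        self.cursors[consumer] = start + len(batch)
        return batch

layer = MovementLayer()
layer.capture({"op": "insert", "table": "orders", "id": 1})
layer.register("warehouse")          # batch consumer, reads when it wants
layer.register("search_index")       # low-latency consumer, reads continuously
print(layer.read("search_index"))    # each consumer advances at its own pace
```

Adding `search_index`, or any other consumer, never touches the source; it simply gets its own cursor into the shared history.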
Using Change Data Capture (CDC) for Real-Time Data Transfer
Instead of periodically querying tables using SQL statements, CDC reads the transaction log of operational databases. Inserts, updates, and deletes are captured as a stream of events as they occur.
Not only is this method efficient, but it also minimizes the load on operational databases.
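As a concrete, simplified illustration, PostgreSQL exposes its write-ahead log through logical replication, which is what most CDC tools build on. The sketch below uses psycopg2 and assumes a database configured for logical replication, plus a replication slot named cdc_slot created with a logical decoding plugin such as wal2json; the connection details and the callback are placeholders.

```python
# Sketch of tailing PostgreSQL's transaction log via logical replication.
# Assumes: wal_level=logical and an existing slot "cdc_slot" created with
# a logical decoding output plugin (e.g. wal2json).
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    host="db.example.com", dbname="inventory", user="replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)

def on_change(msg):
    # Each message carries inserts/updates/deletes emitted by the decoding plugin.
    print(msg.payload)
    # Acknowledge progress so the database can recycle WAL behind this position.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(on_change)  # blocks, streaming changes as they are committed
```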
The data movement layer supports real-time data transfers by design. Batch processing becomes a tool to replay historical data, not a different form of processing.
While CDC addresses the problem of efficiently capturing changes, it doesn't inherently provide solutions for the issues of durability, replayability, and coordination among multiple users. Without a persistent layer to store and manage the flow of changes, teams still need to integrate multiple tools to manage backfills, failures, and changing requirements.
Data Infrastructure vs. Middleware
The key difference between data infrastructure and middleware lies in state: middleware moves data temporarily, while infrastructure maintains a durable, authoritative record of what happened.
Traditionally, middleware and ETL tools were designed to move data, not manage it. Therefore, they operate as stateless conduits. Data comes in, it’s processed, and then it goes out. Once it’s gone, the system loses track of it.
Lack of Statefulness: A Liability at Scale
If a stateless system fails, recovery relies solely on retries or on re-reading data. Once data is lost or logic changes, the only option is to go back to the source, which can be costly, potentially dangerous, and sometimes even impossible.
Middleware is incapable of determining what really happened, as it maintains no historical record of the events. It cannot reconcile differences between systems or provide reliable playback of the past.
The data movement layer, on the other hand, is stateful infrastructure. It maintains a durable record of everything that flows through it, and the event stream is treated as an authoritative record, not a temporary message. As such, it enables replay, time travel, and safe reprocessing of events.
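Conceptually (again, an illustration rather than a specific product API), replay is just re-reading the retained history, for example after fixing a bug in a transformation:

```python
# Conceptual replay: because history is retained, a corrected transformation
# can be re-run over past events without re-extracting from the source.
history = [
    {"op": "insert", "order_id": 1, "amount_cents": 1999},
    {"op": "insert", "order_id": 2, "amount_cents": 505},
]

def transform_v1(event):  # buggy: integer division truncates the cents
    return {"order_id": event["order_id"], "amount": event["amount_cents"] // 100}

def transform_v2(event):  # fixed: preserves exact currency values
    return {"order_id": event["order_id"], "amount": event["amount_cents"] / 100}

# Reprocess the full history with the corrected logic; the source is never touched.
repaired = [transform_v2(e) for e in history]
print(repaired)  # [{'order_id': 1, 'amount': 19.99}, {'order_id': 2, 'amount': 5.05}]
```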
Estuary is built as this type of infrastructure.
How Estuary Works
Estuary uses CDC to collect data from multiple sources into a data movement layer. The data is stored in cloud object storage, which is low cost, highly scalable, and keeps a full history of changes. From there, Estuary delivers data to many different destinations, including warehouses, search systems, operational databases, vector stores, and SaaS tools.
When a team adds a new destination, Estuary doesn't need to read from the source again. It can backfill the destination using the stored history.
Since every destination reads from the same historical record, data can safely flow both into analytics systems and back into operational systems. Teams use this to sync enriched data into application databases, power customer-facing tools, and feed AI and machine learning systems that need up-to-date context.
This is the key difference between tools and infrastructure. Tools move data from one place to another, whereas infrastructure ensures that data is always available, correct, and ready to use.
The Impact of Using a Data Movement Layer
Switching to a data movement layer transforms the way your team works. Instead of constantly fixing broken jobs, engineers can now create the data flow once and let the system handle retries and durability.
Let's use the same example as above: a team with 5 source systems and 3 destinations. A typical pipeline-based architecture means 15 different pipelines to build, monitor, and support. Adding another destination adds 5 more pipelines, and even more jobs to fix whenever a schema changes or a pipeline fails.
For the sake of simplicity, let's assume each pipeline takes 1 week of engineering effort to learn the relevant APIs, build a robust integration between the source and destination, and thoroughly test both happy and unhappy paths. Each new system adds more than a month of effort, and that only grows over time (not to mention the maintenance and refactoring triggered by breaking API changes in the middle of the night).
With the data movement layer, the same 5 source systems are configured once. Destinations receive data independently from the layer, and adding a new destination doesn’t require changes to the source systems or recreating the existing flows. Configuration of a new consumer may take just a few days instead of several weeks.
This also opens the door for innovation. Since Estuary keeps a complete record of all data received, teams can test new transformations or connect new destinations without affecting their production deployments. They don't need to write backfill scripts or query the source systems again.
In addition, the production environment becomes safer. Source databases are read only once, which makes the load predictable. Teams no longer have to worry about a heavy extraction job taking down their operational systems.
Data availability eventually becomes the norm. Teams stop asking if they can get data and focus on how they can use it. A shared layer delivers data to each system exactly when needed, without building additional pipelines. Timing debates are a thing of the past.
By combining real-time capture with replayable streams, Estuary delivers data safely and on time, eliminating the trade-off between speed and safety.
| | Pipeline-based Architecture | Data Movement Layer |
|---|---|---|
| Complexity | Every source must connect to every destination. | Data flows into a shared layer for all consumers. |
| Effort | Approx. 1 week per pipeline; more than a month when adding new systems. | Configuring a new consumer takes days instead of weeks, with no need to recreate flows. |
| Load | Sources are read multiple times by different jobs, potentially disrupting operations. | Lighter: sources are read only once, making load predictable and safer. |
| New connections | Require backfill scripts and re-querying sources. | A complete historical record backfills new destinations without touching the source. |
| Maintenance | Fixing broken jobs, monitoring alerts, schema refactoring. | Engineers configure flows once; retries and durability are handled automatically. |
End of the Pipeline Era
Pipelines are good for small, simple systems where some latency is acceptable. However, they break down as organizations grow and data demands increase.
A data movement layer treats data transfers as core infrastructure and ensures durability and speed by design. With Estuary, once data is collected, it can be served to any consumer. Brittle pipelines are replaced with a scalable and shared foundation.
Stop maintaining pipelines. Use Estuary to move data at the right time, by design.
FAQs
Do I have to replace my existing Snowflake or Databricks setup?
What happens to my data freshness if I switch to a shared infrastructure?
Can a data movement layer support both my real-time AI features and my weekly batch reports?
If I add a new destination system, do I have to re-extract all my historical data from the source?

About the author
I’m a Data Engineer who likes understanding how data moves and why things break. I’m always looking for answers and testing new technologies and ideas on real problems. The small IT business that I operate keeps my focus on real-world issues and needs. I really enjoy writing and sharing my knowledge, especially when it helps others make sense of complex topics.