
Most batch ETL pipelines don't cause a commotion when they fail. They fail gradually, in ways that are easy to ignore until the cost of ignoring them is bigger than the cost of fixing them. If you're running batch jobs that worked fine two years ago and now feel increasingly fragile, the answer isn’t always that the data has changed. Sometimes the shape of the workload has, and batch ETL is starting to buckle under it.
Here are five signs that what worked before isn't going to keep working, and what each one is actually telling you.
Quick Summary: Signs Your Batch ETL is Failing
- Scaling Limit: Batch windows colliding with morning queries indicate that runtime, which grows with data volume, no longer fits the window you allocated.
- Resource Contention: Application latency spikes during extraction mean ETL is competing with production for IOPS.
- The "Trust" Gap: Frequent "Full Refreshes" signal that incremental logic (like updated_at columns) is no longer reliable.
- Compliance Drift: Batch pipelines often miss hard deletes, creating "Ghost Data" that violates privacy regulations like GDPR.
- Opportunity Cost: If the data team spends more than 50% of its time on maintenance and firefighting, the architecture is the bottleneck.
Sign 1: Batch Window Overlap and Linear Scaling Limits
Are your batch windows starting to collide? When you set up a nightly ETL job, you probably had hours of headroom. The job ran in 20 minutes, finished by 2 AM, and was done before the morning queries hit the warehouse. Now the same job takes three hours, finishes at 5:30 AM, and is starting to compete with the first dashboard queries of the day.
This is the most visible sign of batch breakdown, and it's a function of how batch scales. Batch jobs do all their work at once, so the runtime grows with the data. Linear growth in data volume produces linear growth in batch duration, and eventually the job no longer fits in the window you allocated for it.
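To make the scaling argument concrete, here is a minimal sketch that projects when a job stops fitting its window. It assumes runtime is strictly proportional to data volume and that volume compounds at a fixed monthly rate. The 20-minute starting runtime comes from the example above; the 8% monthly growth rate, the 4-hour window, and the function name `months_until_overrun` are made-up values for illustration, not measurements.

```python
def months_until_overrun(start_minutes, monthly_growth, window_minutes):
    """Return the first month in which the projected runtime exceeds the window.

    Simplifying assumptions: runtime scales linearly with data volume,
    and data volume compounds at a fixed monthly rate.
    """
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    runtime = start_minutes
    month = 0
    while runtime <= window_minutes:
        runtime *= 1 + monthly_growth
        month += 1
    return month


# Hypothetical inputs: 20-minute job, 8% monthly data growth, 4-hour window.
months = months_until_overrun(20, 0.08, 240)
print(f"Projected overrun after {months} months")
```

Plugging in your own runtime history gives a rough date for when optimization alone stops being enough. Real jobs rarely scale this cleanly, so treat the result as an estimate.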
The usual response is to optimize the job, which buys you another six months at best. But that's a band-aid: the underlying pattern hasn't changed. The data is still growing, and the next optimization is harder than the last. Eventually you run out of optimizations and have to either accept the longer window or change the architecture.
What this sign is telling you: your data volume has outgrown the batch model. Faster batch jobs won’t work; it's time to consider moving to a model where work is distributed continuously rather than concentrated in a window.
Sign 2: Production Database Latency and Extraction-Induced IOPS Contention
Is your source database slowing down during extraction? Batch ETL pulls data by querying the source. On small databases with light query loads, this is fine. But it quickly becomes a problem on busy production databases.
The pattern is recognizable: every night at 1 AM, application latency spikes. The on-call engineer eventually traces it back to the ETL extraction job competing with the application for database resources. Sometimes the team negotiates a window with engineering, or they add read replicas. Maybe they try to stagger the jobs.
The batch extraction is essentially a load problem on the source. The bigger the table, the more the extraction costs. The more frequent the extraction, the more often the cost is paid. Teams that hit this sign have usually tried every workaround short of changing the architecture.
Consider a similar bind that pushed LOVESPACE to look at change data capture (CDC). They were running operations and analytics on the same SQL Server database, and every analytical query competed with the operational workload. Their warehouse team had to ask analysts to stop running queries so the system could keep up. The data was current, but operations were suffering for it.
What this sign is telling you: your extraction model is putting load on a system that wasn't designed to handle it. Clever scheduling can't save you here. Your model needs to read from a place that doesn't impact the operational system, which for log-based CDC means the database's transaction log.
Sign 3: Defensive Full Refreshes and the "Incremental Trust Gap"
Are full refreshes happening more often than they should? A full refresh is when the pipeline reloads the entire dataset rather than just the changes. In a healthy data architecture, full refreshes happen for specific reasons: a destination rebuild, a schema migration, or a recovery from a known data quality issue.
In an unhealthy one, full refreshes happen because nobody trusts the incremental load anymore. The "last updated timestamp" logic has missed updates before. The merge logic has produced duplicates. Schema changes have broken the incremental path. Eventually the team starts running full refreshes "just to be safe," and what was supposed to be an incremental pipeline is just a daily full reload.
When this happens, you've lost the benefits of incremental loading. You're paying the cost of a full refresh every day while still calling it incremental. The data is technically correct, but you're moving far more data than you need to, and the source database absorbs the burden every time.
What this sign is telling you: your incremental logic isn't reliable enough to trust, and the team has been working around it rather than fixing it. The fix is a capture mechanism, such as log-based CDC, that doesn't depend on application-level conventions like updated_at columns.
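One common way the incremental path loses trust can be sketched in a few lines. This is an illustration only, using an in-memory SQLite database; the `orders` table, its columns, and the `extract_since` helper are hypothetical and not taken from any particular pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", 100), (2, "new", 100)],
)


def extract_since(conn, watermark):
    """Query-based incremental pull: rows whose updated_at is past the watermark."""
    return conn.execute(
        "SELECT id, status FROM orders WHERE updated_at > ? ORDER BY id",
        (watermark,),
    ).fetchall()


watermark = 100  # timestamp recorded by the last successful run

# A well-behaved writer bumps updated_at along with the change.
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 150 WHERE id = 1")
# A manual fix or backfill script changes the row but forgets the timestamp.
conn.execute("UPDATE orders SET status = 'cancelled' WHERE id = 2")

changed = extract_since(conn, watermark)
print(changed)  # [(1, 'shipped')] -- the change to order 2 is never picked up
```

Nothing in the query is wrong. The pipeline simply depends on every writer following the timestamp convention, and it has no way to detect when one does not.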
Sign 4: Missed Hard Deletes and Compliance/GDPR Risk
Are you missing deletes and not sure when? This type of failure mode does the most damage because nobody notices.
Most batch ETL pipelines don't capture deletes. The mechanism that pulls "rows that changed since last run" can see new rows and updated rows, but a deleted row just isn't there to query. Unless the team has built explicit delete handling, usually with soft deletes using an is_deleted column, the destination keeps records that no longer exist in the source.
The damage typically shows up months in. A customer count that nobody can quite reconcile with the source. Inventory totals that drift further from reality with every monthly close. Eventually the compliance team asks why records that were deleted six months ago are still showing up in the warehouse, and the data team realizes the pipeline has been retaining everything the source has dropped, because nobody ever told it to do otherwise.
This is a particular problem now because privacy regulations like GDPR require that deletions actually propagate. A pipeline that retains deleted records isn't just a data quality issue; it's a compliance issue.
What this sign is telling you: your capture mechanism doesn't have visibility into what's actually happening in the source database. Hard deletes leave no trace in the source for a query-based pipeline to find. Log-based CDC captures deletes as first-class events because the deletion is recorded in the transaction log.
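The missed-delete problem is easy to reproduce. The sketch below uses an in-memory SQLite database as the source and a plain dictionary as a stand-in for the warehouse; the `customers` table and the `incremental_sync` helper are hypothetical names chosen for illustration.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at INTEGER)"
)
src.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", 10), (2, "b@example.com", 10)],
)

warehouse = {}  # stand-in for the destination table, keyed by primary key


def incremental_sync(src, warehouse, watermark):
    """Copy rows changed since the watermark and return the new watermark."""
    rows = src.execute(
        "SELECT id, email, updated_at FROM customers WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    for id_, email, updated_at in rows:
        warehouse[id_] = email
        watermark = max(watermark, updated_at)
    return watermark


watermark = incremental_sync(src, warehouse, 0)  # initial load
src.execute("DELETE FROM customers WHERE id = 2")  # hard delete in the source
watermark = incremental_sync(src, warehouse, watermark)  # sees nothing to do

source_ids = {row[0] for row in src.execute("SELECT id FROM customers")}
ghost_ids = sorted(set(warehouse) - source_ids)
print(ghost_ids)  # [2] -- still in the warehouse, gone from the source
```

The final key comparison is one way to audit an existing pipeline for ghost records. It needs a full scan of keys on both sides, so it works as a periodic check rather than a fix for the capture mechanism.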
Sign 5: High Maintenance Overhead and Data Team Burnout
Is the team spending more time fixing the pipeline than building on it? This is the meta-sign, and it's usually the one that finally forces the conversation.
A healthy data team spends most of its time building things: new pipelines, new models, new analyses. An unhealthy data team spends most of its time keeping existing pipelines from breaking, often dealing with issues like:
- Schema changes broke the loader.
- The incremental logic missed updates.
- The batch window blew through.
- The destination ran out of space.
Each issue gets resolved and the team moves on, but a week later something else goes wrong.
Cosuno hit a version of this before adopting Estuary. Their previous data movement was expensive and unreliable, the kind of pipeline that needed regular intervention and produced regular surprises. As Maximilian Seifert at Cosuno described it, switching to a system that "just works" cut their data movement costs in half and eliminated the incidents that had been pulling engineering time away from product work.
Data teams in this situation start to realize their day-to-day is mostly reactive maintenance. The pipeline is consuming the team rather than enabling it. New analytics requests get queued behind firefighting, and trust in the data layer erodes because everyone has a bad data horror story. By the time leadership notices, the team has begrudgingly accepted the pipeline is a problem to manage until the end of days, rather than infrastructure to build on.
What this sign is telling you: the architecture isn't matching the workload anymore, and the team is paying the difference in their own time. Instead of throwing more engineers at the existing pipeline, consider investing in an architecture that reduces the number of failure modes the team has to handle in the first place.
Batch ETL vs. Real-Time CDC: A Technical Comparison
| Feature | Batch ETL (Query-Based) | Real-Time CDC (Log-Based) |
|---|---|---|
| Source Impact | High (Periodic heavy queries) | Low (Reads transaction logs) |
| Latency | High (Minutes to Hours) | Sub-second |
| Delete Handling | Difficult (Misses hard deletes) | Native (Captures delete events) |
| Reliability | Depends on updated_at columns | Guaranteed by the log |
| Complexity | Simple to start, hard to scale | Requires initial log-access setup |
The Solution: Moving from Batch ETL to Continuous Change Data Capture (CDC)
If you recognize more than one of these signs, the core issue is usually the same: batch ETL was the right architecture for a smaller, simpler workload, but the architecture hasn’t kept up as the workload has grown.
Change data capture is the most common alternative for the operational data that drives most of these problems. Instead of pulling data on a schedule, log-based CDC reads from the database's transaction log and ships only what changed, in real time. This eliminates the batch window. The source load diminishes because the log is being written anyway, and deletes get captured because they're in the log. Teams stop having to do full refreshes as a workaround because the incremental path is reliable.
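On the destination side, consuming a change stream can be sketched as follows. The event shape here (`op`, `key`, `row`) is a hypothetical simplification for illustration, not the format of any specific CDC tool or database log, and the dictionary stands in for a destination table.

```python
def apply_change_events(destination, events):
    """Apply ordered change events to a destination keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            destination[key] = event["row"]
        elif op == "delete":
            # Deletes arrive as explicit events, so the row can be removed.
            destination.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return destination


# Hypothetical events, in the order they were committed at the source.
events = [
    {"op": "insert", "key": 1, "row": {"email": "a@example.com"}},
    {"op": "insert", "key": 2, "row": {"email": "b@example.com"}},
    {"op": "update", "key": 1, "row": {"email": "a2@example.com"}},
    {"op": "delete", "key": 2},
]

warehouse = apply_change_events({}, events)
print(warehouse)  # {1: {'email': 'a2@example.com'}}
```

Because every change, including the delete, is an explicit event, the destination never has to infer what happened by comparing timestamps. A production consumer would also need to handle ordering, retries, and schema changes, which this sketch leaves out.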
The shift isn't risk-free. CDC has its own operational model, with its own idiosyncrasies to evaluate: capture method, delivery semantics, backfill handling, schema evolution, and pricing. But the alternative for teams hitting these signs is to keep optimizing a pipeline that's running out of valid optimizations.
The teams who recognize this transition early plan for it. The teams who don't end up making the change anyway, just under more pressure and on a worse timeline.
Future-Proofing Your Data Architecture
Batch ETL isn't broken in some abstract sense. It's an architecture that fits a specific kind of workload. The problem is that a lot of workloads end up growing out of it. The five signs above are how that growth shows up: longer batch windows, source database contention, defensive full refreshes, missing deletes, and a team that spends more time maintaining the pipeline than building on it.
If you're seeing any of these signs, the next step is to figure out what's actually causing them, and what the right architectural response is. For most teams, the answer involves moving the operational data layer off batch and onto continuous capture. The teams who do this well treat it as an architecture upgrade, not a tooling decision, and they plan it deliberately rather than reactively.
Ready to leave the batch window behind?
If you’re seeing these signs in your own pipelines, it might be time for an architectural upgrade. You can try Estuary for free or check out our documentation to see how we handle log-based CDC for your specific database.
FAQs
Do I need a complex streaming setup like Kafka to run CDC?
When is batch ETL still the better choice?
How does CDC handle "hard deletes" differently than batch?

About the author
Emily is an engineer and technical content creator with an interest in developer education. At Estuary, she works with data pipelines for both streaming and batch data and finds satisfaction in transforming a mess of information into usable data. Previous roles familiarized her with FinTech data and working closely with REST APIs.




