
Introduction: A Pipeline That Worked… Until It Didn’t
At first, everything looked like it was working.
You had real-time data flowing in from a fleet of IoT devices: GPS coordinates, engine diagnostics, temperature readings, even driver behavior metrics. You wired it all up with Kafka Connect and Debezium, piped it through Spark for aggregation, and pushed the results into Snowflake for downstream analytics.
The initial proof of concept checked every box. The dashboards updated in near real time. Everyone was impressed.
Then things got real.
As data volume ramped up to millions of events per hour, new problems surfaced. A truck in a remote zone would reconnect hours later, flooding the system with backlogged events that arrived too late to make it into the right aggregation windows. Minor changes in firmware — like an added sensor or renamed field — would quietly break parts of the pipeline. Your Snowflake MERGE operations, once fast and smooth, started to lag, spike in cost, or silently fail under pressure.
What started as a promising setup began to feel fragile.
To fix it, the team added more glue: more Spark jobs, more retry logic, more checks around schema handling, more infrastructure to maintain. And yet, late-arriving events still caused duplicates or missing data. Schema drift kept sneaking through. Snowflake continued to bottleneck under stress.
It’s a story that plays out often in logistics, manufacturing, and any industry dealing with high-volume IoT streams. The tools you're using might be powerful on their own, but stitched together at scale, they reveal their limits.
In this post, we’ll break down the three core pain points behind real-time pipeline failures at scale — and walk through what a resilient solution actually looks like. You'll see how companies in the supply chain space are moving past brittle architecture to build robust pipelines that just work — even when the real world doesn’t.
Problem #1: Late-Arriving Data Breaks Aggregations
In logistics and IoT systems, data doesn't always show up on time.
A delivery truck goes out of range for two hours, then reconnects and pushes a burst of telemetry that technically belongs to earlier time windows. Or an overseas shipment uploads sensor data in bulk after arriving at a port with stable connectivity. Either way, you're left with the same problem: events arrive late, but your system has already moved on.
This kind of delayed ingestion causes a domino effect:
- Missing or incomplete aggregations in your data warehouse
- Duplicate inserts when retries or backfill logic isn’t perfect
- Out-of-order records that throw off analytics and alerts
- Extra logic to catch and reconcile discrepancies downstream
Even if you're using event timestamps in Spark or other processing tools, the windowing logic can easily break when the stream includes a mix of late, early, and duplicate records. Teams often try to patch this with buffering, retries, or batch reprocessing — which adds latency and complexity without fully solving the problem.
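To see why the standard approach only goes so far, here's a minimal Spark Structured Streaming sketch of event-time windowing with a watermark. The Kafka topic, field names, and the two-hour tolerance are illustrative assumptions, not a reference setup.

```python
# Minimal sketch: event-time windows with a watermark in Spark Structured Streaming.
# Topic, schema, paths, and thresholds are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fleet-telemetry").getOrCreate()

telemetry_schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_ts", TimestampType()),   # when the device recorded the reading
    StructField("engine_temp", DoubleType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "fleet-telemetry")      # hypothetical topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), telemetry_schema).alias("e"))
    .select("e.*")
)

hourly_avg = (
    events
    # Tolerate up to 2 hours of lateness; anything older is dropped without error.
    .withWatermark("event_ts", "2 hours")
    .groupBy(F.window("event_ts", "1 hour"), "device_id")
    .agg(F.avg("engine_temp").alias("avg_engine_temp"))
)

query = (
    hourly_avg.writeStream
    .outputMode("append")                        # emits a window only once the watermark closes it
    .format("parquet")
    .option("path", "/data/hourly_avg")          # placeholder sink
    .option("checkpointLocation", "/chk/hourly_avg")
    .start()
)
```

A truck that reconnects after three hours falls outside that watermark, so its readings never make it into the aggregate, and there is no error to tell you so.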
And if you're syncing that stream into a destination like Snowflake using MERGE operations, late-arriving events might never land in the correct rows at all. The data simply gets skipped or overwritten, leaving gaps in visibility and business logic.
What you need is a pipeline that knows how to handle lateness as a first-class concept — not an afterthought. That means respecting event time, deduplicating intelligently, and supporting backfill alongside streaming updates without conflict.
In the next section, we’ll look at another issue that often creeps in just as things are scaling: schema drift.
Problem #2: Schema Drift from Firmware Updates
In a perfect world, the structure of your IoT data would stay consistent. But in the real world, devices evolve. Your hardware team pushes a firmware update, adds a new sensor, or renames a field for clarity. Suddenly, the shape of the incoming data changes, and your carefully tuned pipeline hits a wall.
This is schema drift. And when you're dealing with JSON or semi-structured data from IoT devices, it's not just likely — it's inevitable.
Even small changes can create big problems downstream:
- A new temperature sensor adds an unexpected field
- A renamed property causes type mismatches in your warehouse schema
- Nested structures change slightly and fail validation during ingestion
- Aggregation logic breaks because required fields are missing or in a new format
Teams often reach for tools like Avro or Protobuf with a schema registry to enforce structure. But these systems are rigid by design. They expect schemas to change slowly, with careful coordination across teams. That doesn't work when your hardware team is shipping weekly firmware updates or experimenting with different payload formats.
The more dynamic your device ecosystem, the harder it becomes to maintain schema stability across your pipeline. As a result, you're left writing more validation logic, building version-aware transformers, or backfilling data when a breaking change slips through.
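To make the failure mode concrete, here's a small sketch using Python's jsonschema library: a strict schema written for last month's firmware rejects this month's payload outright, while a looser variant tolerates the new field. The field names and the cargo_temp addition are hypothetical.

```python
# Illustrative only: how a strict JSON Schema breaks on a firmware-added field,
# and how a looser variant tolerates the drift. Field names are hypothetical.
from jsonschema import Draft7Validator

strict_schema = {
    "type": "object",
    "properties": {
        "device_id": {"type": "string"},
        "event_ts": {"type": "string", "format": "date-time"},
        "engine_temp": {"type": "number"},
    },
    "required": ["device_id", "event_ts", "engine_temp"],
    "additionalProperties": False,   # rejects any field the old firmware didn't send
}

# New firmware adds a cargo temperature sensor.
payload = {
    "device_id": "truck-042",
    "event_ts": "2024-05-01T09:30:00Z",
    "engine_temp": 92.4,
    "cargo_temp": 4.1,               # unexpected field -> validation error
}

errors = list(Draft7Validator(strict_schema).iter_errors(payload))
print([e.message for e in errors])   # reports the unexpected 'cargo_temp' property

# A tolerant variant accepts the new field but still enforces the core contract.
tolerant_schema = {**strict_schema, "additionalProperties": True}
assert not list(Draft7Validator(tolerant_schema).iter_errors(payload))
```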
What’s missing is a way to adapt quickly and safely. You need schema enforcement, yes — but you also need flexibility. A pipeline that supports evolution without constant rewrites. One that validates data before it breaks things and can gracefully handle new or changed fields without sacrificing stability.
In the next section, we'll explore what happens when these upstream challenges collide with the performance limits of your data warehouse.
Problem #3: Snowflake MERGE Operations Don't Scale
Snowflake is a powerful analytics platform, but it's not built to handle high-frequency upserts from streaming data at massive scale. And yet, for many teams, the go-to pattern for syncing real-time IoT data into Snowflake is the MERGE operation.
At low volume, this works fine. You use Kafka Connect or Spark Streaming to push updates into a staging table, then run a MERGE to deduplicate records and sync into your production table.
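For reference, that pattern usually looks something like the sketch below: a deduplicating MERGE from a staging table into the production table, run from Python with the Snowflake connector. Table names, column names, and connection details are placeholders.

```python
# The conventional staging-table MERGE pattern, sketched with the Snowflake
# Python connector. Table, column, and connection details are placeholders.
import snowflake.connector

MERGE_SQL = """
MERGE INTO telemetry AS t
USING (
    -- keep only the latest staged row per device/timestamp
    SELECT device_id, event_ts, engine_temp, cargo_temp
    FROM staging_telemetry
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY device_id, event_ts ORDER BY load_ts DESC
    ) = 1
) AS s
ON t.device_id = s.device_id AND t.event_ts = s.event_ts
WHEN MATCHED THEN UPDATE SET
    engine_temp = s.engine_temp,
    cargo_temp  = s.cargo_temp
WHEN NOT MATCHED THEN INSERT (device_id, event_ts, engine_temp, cargo_temp)
    VALUES (s.device_id, s.event_ts, s.engine_temp, s.cargo_temp);
"""

with snowflake.connector.connect(
    account="your_account", user="loader", password="...",
    warehouse="LOAD_WH", database="FLEET", schema="RAW",
) as conn:
    with conn.cursor() as cur:
        cur.execute(MERGE_SQL)
        cur.execute("TRUNCATE TABLE staging_telemetry")  # reset staging for the next batch
```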
But as the number of records grows from thousands to millions per hour, this approach starts to buckle.
Here's what usually happens:
- MERGE performance degrades as table size increases
- Clustering and partitioning become a full-time tuning project
- Storage costs spike due to versioning and intermediate tables
- Streaming ingestion slows down while MERGE catches up
- Failures start to occur during high-traffic periods or backfill jobs
Teams try to solve this by adding more infrastructure: larger warehouses, smarter partitioning, scheduled retries, and external orchestration tools like Airflow. But all of this adds complexity, cost, and operational risk.
Even if you get the MERGE pattern working under pressure, you're still fundamentally limited by how Snowflake handles upserts. It's not optimized for high-throughput, low-latency event streams coming from dozens or hundreds of edge devices. You're treating your analytics warehouse like an operational database, and that can lead to performance walls you can't tune your way out of.
What you really need is a way to sync real-time data into Snowflake without constantly battling the limits of MERGE. A system that understands change data capture, supports exactly-once semantics, and doesn't rely on periodic batch jobs to clean things up.
In the next section, we'll take a look at why patching all these problems with more Spark jobs or custom glue code only makes things worse.
Why Patching It With More Spark Isn’t the Answer
When late-arriving data, schema drift, and warehouse limitations start piling up, the default instinct is to patch things with more code. Add another Spark job. Write another script. Set up a cron job to reprocess missing records. Layer in retry logic and hope for the best.
At first, this works. You feel productive. You're solving problems.
But over time, you're not fixing the pipeline — you're just building more of it.
Each new Spark job comes with orchestration overhead, scheduling logic, infrastructure provisioning, monitoring, and failure handling. You need to define schemas, manage dependencies, and make sure changes upstream don't break everything downstream. When schema drift happens again, you end up rewriting transformation logic just to keep things from falling apart.
Before long, the system becomes fragile. You're spending more time maintaining the pipeline than delivering value from the data itself.
And adding even more tools doesn't help. Some teams try to glue everything together with Flink, Airflow, or dbt. These tools are powerful, but they also assume a level of operational maturity and resourcing that many teams simply don't have. More moving parts mean more coordination, slower iteration, and higher risk every time something changes.
The real problem isn't your team or your tools. It's that the architecture wasn't designed for this kind of real-time complexity in the first place.
What’s needed is a shift toward systems that are built to handle streaming, evolution, and scale from the ground up. Systems that treat event time, schema changes, and data consistency as core features — not edge cases.
In the next section, we'll look at what that kind of architecture actually looks like, and what to prioritize if you're building for resilience instead of repair.
What a Resilient, Real-Time IoT Pipeline Should Look Like
When you’re building a pipeline that ingests millions of IoT events from devices spread across the globe, fragility isn’t an option. You need an architecture that embraces unpredictability: delayed messages, evolving schemas, and fluctuating volumes, all without breaking.
So what does that actually look like?
It starts with event-time awareness
A resilient pipeline doesn’t just accept incoming data as it arrives. It understands when that data was actually generated. Event-time processing allows the system to correctly place late-arriving records into the right logical windows, even if they come in hours late. That means accurate aggregations and clean analytics — without duct tape logic or reprocessing jobs.
It handles schema evolution gracefully
When devices update firmware or add new sensors, the pipeline should adapt without breaking. That means automatic schema detection, validation, and the ability to work with both existing and new structures. A rigid schema system forces rework. A flexible one moves with your data.
It supports both backfill and streaming
You should be able to load historical data and stream new data in the same pipeline, without conflict. Backfill jobs shouldn’t overwrite or duplicate your live stream. A resilient system keeps these paths consistent and aligned.
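One common way to keep the two paths aligned is to make every write idempotent on a natural key, so a replayed backfill can never create duplicates. A minimal in-memory sketch of the idea, with hypothetical field names (a real sink would apply the same rule in the warehouse):

```python
# Idempotent upsert keyed on (device_id, event_ts): replaying a backfill over
# the same store as the live stream converges to one row per reading.
# In-memory illustration only, with hypothetical field names.
from typing import Dict, Tuple

Key = Tuple[str, str]          # (device_id, event_ts)
store: Dict[Key, dict] = {}

def upsert(record: dict) -> None:
    key = (record["device_id"], record["event_ts"])
    store[key] = record        # last write wins; re-delivery is harmless

live_stream = [
    {"device_id": "truck-042", "event_ts": "2024-05-01T09:00:00Z", "engine_temp": 91.0},
]
backfill = [
    # the same reading arrives again during a historical backfill
    {"device_id": "truck-042", "event_ts": "2024-05-01T09:00:00Z", "engine_temp": 91.0},
    {"device_id": "truck-042", "event_ts": "2024-05-01T08:00:00Z", "engine_temp": 88.5},
]

for rec in live_stream + backfill:
    upsert(rec)

assert len(store) == 2         # no duplicates: history and stream agree
```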
It scales without constant tuning
You shouldn’t need to rebuild your architecture just because your fleet doubled. Whether you’re processing thousands or millions of events per hour, the system should scale predictably and efficiently — ideally without manual resource tuning, job retries, or growing infrastructure costs.
It delivers consistent, accurate results
Exactly-once delivery should be the default, not a luxury. This ensures that no matter how complex your sync logic or how variable your data sources are, your warehouse, lake, or dashboard reflects the true state of the world.
And it doesn’t require a team of operators to keep it alive
Most importantly, a resilient pipeline reduces operational load. It doesn’t depend on constant babysitting, custom orchestration, or a graveyard of failed jobs. It frees your team to focus on insights, not infrastructure.
In the next section, we’ll show how Estuary Flow brings all of these principles together into a real-world solution that’s already powering supply chain and logistics pipelines today.
How Estuary Flow Solves This
Estuary Flow is built from the ground up to handle exactly the kinds of challenges that real-time IoT pipelines face at scale. From late-arriving data to evolving device schemas to overloaded warehouses, Flow provides a fully managed, streaming-first architecture that simplifies the entire process without sacrificing flexibility or performance.
Here’s how it works in practice.
Ingest real-time data with built-in late arrival handling
Estuary connects to your data sources — including MQTT, Kafka, Postgres, MySQL, MongoDB, and more — and captures changes in real time using change data capture (CDC) where available. For IoT data, Flow can ingest from message brokers like MQTT or Kafka and store records in a schema-aware, versioned collection.
Every record includes system-managed metadata like flow_published_at, which captures the true event time. This enables late-arriving records to be correctly ordered, deduplicated, and processed without breaking your aggregations or introducing duplicates downstream.
Handle schema drift with confidence
Flow collections enforce JSON schemas but also support schema evolution. That means if your device payload changes — a new sensor field is added or an optional field disappears — you can update the schema without breaking the pipeline.
Schema evolution in Flow is safe, transparent, and version-controlled. You can update schemas manually through the UI or CLI, or enable automatic discovery when appropriate. If needed, derivations (written in SQL or TypeScript) can shape or filter incoming data to match downstream requirements, even when the source structure varies.
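Flow derivations themselves are written in SQL or TypeScript; purely as an illustration of the kind of reshaping they perform, here is a Python sketch that maps a firmware-dependent payload onto a stable downstream shape. The payload layout and field names are hypothetical, and this is not Flow's derivation syntax.

```python
# Not Flow derivation syntax -- just an illustration of the reshaping a
# derivation can perform. Payload layout and field names are hypothetical.
def reshape(doc: dict) -> dict:
    """Flatten a firmware-dependent payload into a stable downstream shape."""
    sensors = doc.get("sensors", {})
    return {
        "device_id": doc["device_id"],
        "event_ts": doc["event_ts"],
        # newer firmware nests readings under "sensors"; older firmware sends them flat
        "engine_temp": sensors.get("engine_temp", doc.get("engine_temp")),
        "cargo_temp": sensors.get("cargo_temp"),     # None until the sensor exists
    }

old_firmware = {"device_id": "truck-042", "event_ts": "2024-05-01T09:00:00Z", "engine_temp": 91.0}
new_firmware = {"device_id": "truck-042", "event_ts": "2024-05-01T10:00:00Z",
                "sensors": {"engine_temp": 92.4, "cargo_temp": 4.1}}

assert reshape(old_firmware)["engine_temp"] == 91.0
assert reshape(new_firmware)["cargo_temp"] == 4.1
```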
Combine backfill and streaming in a single pipeline
Flow supports both initial backfills and continuous streaming updates from the same source. There’s no need to manage two separate pipelines or run costly reprocessing jobs. Backfill ensures that historical data is captured and stored, while streaming ensures you never miss a change going forward — even when devices reconnect after being offline.
Deliver exactly-once results to Snowflake and other destinations
Flow materializations let you sync data to destinations like Snowflake, BigQuery, Delta Lake, and more. The protocol guarantees consistency with exactly-once semantics, even under high load.
If you're using Snowflake, Flow avoids the need for expensive MERGE operations. Instead, it intelligently writes updates using streaming-aware logic, reducing compute costs and improving reliability. This is especially useful in supply chain use cases where visibility must be both accurate and timely.
Already trusted in logistics and supply chain environments
Estuary Flow is used by logistics and supply chain teams to track inventory, shipments, and fleet telemetry in real time. It supports high-volume, global data ingestion while keeping costs predictable and operational effort low.
Whether you're syncing millions of events per hour from a distributed fleet or managing real-time inventory across regions, Flow adapts to your data and scales with your operations.
Deploy in Your Own Cloud for Lower Latency and More Control
Logistics and IoT pipelines often span hybrid environments. Some data originates from private VPCs, edge networks, or on-prem infrastructure, where pushing everything to a public cloud isn't ideal.
Estuary Flow supports a Bring Your Own Cloud (BYOC) deployment model that lets you run the data plane inside your own cloud or private network. The connector that captures and processes data runs close to your source systems, reducing the number of network hops and avoiding cloud egress bottlenecks.
This setup delivers key benefits:
- Lower latency for time-sensitive workloads like fleet tracking or equipment monitoring
- Reduced operational risk from VPNs, NATs, and unpredictable network routes
- Better compliance and security by keeping data within your environment until you're ready to move it
You still get the simplicity of Estuary’s managed control plane, including the UI, catalog, versioning, and orchestration. But the heavy lifting happens where your data lives, giving you full control and real-time performance that centralized services can't match.
You can learn more about how Estuary helps supply chain organizations here.
Example Architecture for IoT Ingestion with Estuary Flow
To bring all of this together, let’s walk through what a resilient, real-time IoT data pipeline looks like using Estuary Flow. This setup is optimized for late-arriving events, schema evolution, and high-throughput ingestion across cloud or hybrid environments.
Data Flow Overview
1. IoT Devices
Sensors installed in vehicles, containers, or warehouses emit telemetry data such as location, temperature, and engine status. These events are typically sent via MQTT, Kafka, or HTTP to a message broker or edge gateway (a minimal publisher sketch follows this overview).
2. Message Broker or Source System
Depending on your setup, you might capture data directly from:
- An MQTT broker (for edge devices and sensors)
- A Kafka topic (if using stream aggregation internally)
- A database like PostgreSQL or MongoDB (for device metadata or command logs)
3. Estuary Flow Capture
Flow connects to these sources and begins capturing data in real time.
- If deployed via BYOC, the capture connector runs close to the source in your own VPC
- Events are timestamped and validated against JSON schemas
- Late-arriving events are correctly processed using event-time metadata
4. Flow Collections
Captured records are written to a versioned, schema-enforced collection within Estuary.
- Supports automatic schema evolution
- Enables real-time preview and inspection
- Maintains consistency even as devices or payload formats change
5. Optional Derivations
You can apply SQL or TypeScript transformations to filter, join, or reshape incoming data before materialization. This is helpful when dealing with complex sensor payloads or when syncing to multiple destinations.
6. Materializations to Analytics Tools
The final output is written to one or more destinations:
- Snowflake for warehousing and BI dashboards
- Delta Lake or Iceberg for open table formats and data lake integration
- BigQuery, Redshift, ClickHouse, or Elasticsearch depending on your analytics stack
Materializations in Flow use exactly-once delivery and intelligent upsert logic. There’s no need for custom MERGE operations or scheduled batch jobs.
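To ground step 1, here is a minimal device-side publisher sketch using the paho-mqtt client. The broker address, topic, and payload fields are illustrative assumptions.

```python
# Minimal device-side telemetry publisher (step 1 of the overview).
# Broker host, topic, and payload fields are illustrative assumptions.
import json
import time

import paho.mqtt.client as mqtt

# paho-mqtt 1.x constructor; 2.x additionally takes a CallbackAPIVersion argument.
client = mqtt.Client(client_id="truck-042")
client.connect("broker.example.internal", 1883)   # hypothetical MQTT broker
client.loop_start()                               # background network loop

while True:
    reading = {
        "device_id": "truck-042",
        "event_ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "engine_temp": 92.4,
        "lat": 37.77, "lon": -122.42,
    }
    # QoS 1: the broker acknowledges delivery, so a flaky connection retries rather than drops
    client.publish("fleet/telemetry/truck-042", json.dumps(reading), qos=1)
    time.sleep(10)
```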
Conclusion
Building a real-time IoT pipeline often starts out simple — until late-arriving events, schema drift, and warehouse performance issues start pulling it apart. The complexity creeps in slowly. You patch it with more Spark jobs, retries, and transformations. But over time, it turns into a fragile, high-maintenance system that demands more from your team than it gives back.
The truth is, most pipelines aren’t broken because of one big issue. They’re broken because the architecture wasn’t designed to handle real-world messiness at scale. IoT environments are inherently unpredictable. Devices go offline. Data shows up late. Payloads change. Latency matters.
Estuary Flow was built to make this kind of chaos manageable.
With built-in support for late data, schema evolution, exactly-once delivery, and hybrid deployment models like BYOC, Flow replaces brittle glue code with a resilient foundation. You get real-time ingestion, backfill, and materialization in one consistent system — without stitching together five different tools.
If you’re working with high-volume IoT data in logistics, manufacturing, or any system where timing and accuracy matter, it might be time to rethink your approach.
Flow isn’t just another tool. It’s a new way to build pipelines that don’t break when reality hits.
👉 Explore how Estuary supports real-time data in supply chain and logistics
👉 Start building with Estuary Flow for free

About the author
Team Estuary is a group of engineers, product experts, and data strategists building the future of real-time and batch data integration. We write to share technical insights, industry trends, and practical guides.
