Real-time data streaming and processing is now a key value-add for businesses.
Unlike batch processing, real-time processing handles data as it is generated and received. Because the data is produced continuously, it brings a lot of complexity and pitfalls that we need to be aware of as Data Engineers.
Imagine you’re a Data Engineer working at a gigantic e-commerce site. Most of the business needs that you’ll be required to power will need data joins, aggregations, and many other complex operations. Imagine this data is generated in a streaming manner; how would you go about answering business questions?
In this article, we’ll look at how you can work with streaming data to perform these complex operations: using stateful streaming.
We’ll explore important concepts, tools that you can use, and challenges you might face!
Bounded and unbounded data
Data generated in a (streaming) system generally comes in two forms:
Bounded Data
Bounded data has a defined start and a defined end.
This is the sort of data that can be ingested in full and then processed; basically batch processing. It’s easier to work with, especially at low volumes, but as the velocity and size of data increases, bounded data can also be processed in a streaming manner.
Countries of the world and user demographics in a system are examples of bounded streams.
Unbounded Data
On the other hand, unbounded data has a defined start but no defined end. Essentially, this is an infinite stream of data.
Unbounded data is what we usually mean by streaming data. It's more complex to process because we have to handle events continuously as they arrive, account for data that may be late (but that we probably don't want to wait for) or that arrives out of the expected order, among many other complexities.
Clickstream events and IoT device pings are examples of unbounded streams.
Stateless vs stateful stream processing
Figure: Stateful vs stateless streaming (by Harleen)
When working with both unbounded and bounded streams of data, there are generally two ways to work with the events received:
- Process each event as received
- Process each event as received, taking into account history/context (other received and/or processed events)
With the first workflow, we have no idea about other events; we receive an event and process it as is. But with the second, we store information about other events, i.e. the state, and use this information to process the current event!
Therefore, the first workflow is stateless streaming, while the second is stateful streaming!
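The distinction can be sketched in plain Python (no real streaming framework here; the event fields and running-total logic are illustrative assumptions):

```python
# Stateless: each event is transformed in isolation, with no memory of others.
def process_stateless(event):
    return {"user": event["user"], "amount_usd": event["amount_cents"] / 100}

# Stateful: a running per-user total is kept across events -- that's the state.
class RunningTotal:
    def __init__(self):
        self.totals = {}  # state: user -> cumulative amount in cents

    def process(self, event):
        user = event["user"]
        self.totals[user] = self.totals.get(user, 0) + event["amount_cents"]
        return {"user": user, "total_cents": self.totals[user]}

events = [
    {"user": "a", "amount_cents": 500},
    {"user": "b", "amount_cents": 250},
    {"user": "a", "amount_cents": 100},
]

stateful = RunningTotal()
for e in events:
    print(process_stateless(e), stateful.process(e))
```

The stateless function would produce the same output for an event no matter when it arrives; the stateful processor's output for user "a"'s second event depends on having seen the first one.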
The limitations of stateless stream processing
The lack of context when processing new events — that is, not being able to relate this event with other events — means you lack the ability to do quite a bit:
- You can’t aggregate your data
- You can’t join or compare data streams
- You can’t identify patterns
- You’re limited in use cases
It’s clear that stateless streaming has limitations that might make it hard to answer your business questions.
Similar to the backfill problem that we explored before, being able to process your events in the context of other events (or in our previous case, working with both real-time and historical data) proves to be very useful.
Example use cases of stateful stream processing
The ability to process the current event in the context of other events comes very much in handy in the following example cases:
Fraud and anomaly detection
In transactional systems, detecting fraudulent and anomalous behavior in real time is critical.
Anomalies (and fraud) are behaviors that are outside the norm… “the norm” being previous habits in the system!
Therefore, when processing the current transaction, you need to be able to compare its attributes in real time with previous habits!
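A minimal sketch of this idea, assuming a simple rule of thumb (flag a transaction that exceeds some multiple of the user's running average; the threshold and state layout are illustrative, not how a real fraud system works):

```python
class FraudDetector:
    def __init__(self, factor=3.0):
        self.factor = factor
        self.state = {}  # state: user -> (count, total) of past amounts

    def process(self, user, amount):
        count, total = self.state.get(user, (0, 0.0))
        # Compare the current transaction against the user's history (the
        # state) before folding it into that history.
        anomalous = count > 0 and amount > self.factor * (total / count)
        self.state[user] = (count + 1, total + amount)
        return anomalous

detector = FraudDetector()
for amount in (20, 25, 22):            # a normal spending pattern
    detector.process("alice", amount)
print(detector.process("alice", 500))  # far above the running mean
```

Without the stored state, every transaction would look the same; it's the history that makes the 500 stand out.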
ML/AI systems
In the age of immediate responses, we expect on-the-fly recommendations when making purchases from Amazon, customer support to answer our questions as quickly and accurately as possible, our social media feeds to recommend the best content always.
These systems need to infer the best response based on your current interactions vs previous interactions. For instance, Amazon will recommend other items to buy based on your current cart contents and items you’ve viewed before.
Device (IoT) monitoring
When working with IoT devices, monitoring their health becomes quite important.
A device's health can be defined as: within a given window of time, the device sends a number of pings that is no more than 5% below the previous window's count. Basically, if our device sent 100 pings in the last hour and we receive fewer than 95 in the following hour, we have a problem.
In this case, we need to store the state of the previous hour to process the current hour!
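This check can be sketched as follows (a simplification: we assume each hourly window's ping count is handed to us once the window closes, and keep only the previous count as state):

```python
class DeviceHealth:
    def __init__(self, max_drop=0.05):
        self.max_drop = max_drop
        self.prev_count = None  # state: ping count of the previous window

    def close_window(self, count):
        """Called once per hourly window with that window's ping count.

        Returns True if the device looks healthy."""
        healthy = (
            self.prev_count is None
            or count >= (1 - self.max_drop) * self.prev_count
        )
        self.prev_count = count  # the current window becomes the new state
        return healthy

monitor = DeviceHealth()
print(monitor.close_window(100))  # no previous window yet -> True
print(monitor.close_window(96))   # 96 >= 95 -> True
print(monitor.close_window(80))   # 80 < 0.95 * 96 -> False
```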
Important stateful stream processing concepts
Timely stream processing
Let’s think through the scenario above where you are monitoring the health of IoT devices.
If we’re processing this data in a batch manner, this monitoring would be relatively straightforward (with a lot of caveats of course). We can run a query every hour that compares data from this hour vs the previous.
But, you’re very likely not going to be using batch processing here. You therefore need a way to compare a continuously flowing stream of pings across windows of time.
This is a common paradigm of stateful stream processing where time plays a key role, hence timely stream processing.
Let’s look at some important concepts.
Windowing
Windowing is the mechanism by which we take an infinite stream of data and create bounded batches, most commonly based on time (though not always; you can, for example, window by the number of events received).
Depending on our use case, we can define different types of windows to discretize our data. Some common ones are:
Tumbling windows
These are non-overlapping, fixed-size windows. For our use case above, you would go for a one-hour tumbling window.
Figure: Tumbling windows (by Apache Flink)
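A sketch of tumbling-window assignment in plain Python (timestamps in seconds; the window size and sample data are illustrative). Each event falls into exactly one window, keyed by the window's start time:

```python
from collections import defaultdict

def tumbling_window_start(ts, size=3600):
    # Align the timestamp down to the nearest window boundary.
    return ts - (ts % size)

def count_per_window(events, size=3600):
    counts = defaultdict(int)
    for ts in events:
        counts[tumbling_window_start(ts, size)] += 1
    return dict(counts)

# Hourly windows start at 0, 3600, 7200, ...
pings = [10, 1800, 3599, 3600, 5000, 7300]
print(count_per_window(pings))  # {0: 3, 3600: 2, 7200: 1}
```

Because the windows don't overlap, every ping is counted exactly once, which is what we want for the hour-over-hour health check above.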
Sliding windows
Say you wanted instead to compute a rolling count of pings received every 30 minutes.
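One common way to do this, sketched here under the assumption of one-hour windows that advance every 30 minutes: consecutive windows overlap, so a single event can belong to several windows.

```python
from collections import defaultdict

def sliding_window_starts(ts, size=3600, slide=1800):
    """All window start times whose [start, start + size) interval covers ts."""
    last_start = ts - (ts % slide)  # latest window that could contain ts
    starts = []
    s = last_start
    while s > ts - size:
        starts.append(s)
        s -= slide
    return sorted(starts)

def rolling_counts(events, size=3600, slide=1800):
    counts = defaultdict(int)
    for ts in events:
        # Unlike a tumbling window, each event lands in size // slide windows.
        for start in sliding_window_starts(ts, size, slide):
            counts[start] += 1
    return dict(counts)

pings = [3700, 4000, 5500]
print(rolling_counts(pings))  # {1800: 2, 3600: 3, 5400: 1}
```

Note that the ping at 5500 is counted in both the window starting at 3600 and the one starting at 5400; that overlap is exactly what gives you a rolling view every 30 minutes.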