Fast JSON Processing in Real-time Systems: simdjson and Zero-Copy Design
Discover how Estuary Flow handles massive data volumes by leveraging simdjson and a unique Combiner to optimize real-time JSON parsing and document merging.
Efficiently processing JSON data has historically been a common bottleneck in data-intensive applications, especially in real-time streaming environments.
At Estuary, we are building a data movement platform specialized in exactly these challenges: supporting high-throughput, low-latency data pipelines. To get there, we needed an approach that moves past traditional JSON decoding methods.
JSON parsing and processing can be a significant bottleneck when handling large volumes of data due to the overhead of converting the document structure into a usable in-memory format. While this overhead may be manageable in some batch-processing environments, it becomes a critical performance concern for platforms processing tens of terabytes of data daily.
So, we took a different approach: by leveraging simdjson for parsing and introducing a unique Combiner for efficient document merging, Estuary Flow achieves exceptional performance in real-time JSON processing.
Why JSON Processing Is Traditionally Slow
Parsing JSON is generally very resource-intensive. While it is human-readable, substantial processing is required to convert it into a usable, structured format. This complexity arises from the following two key steps in the parsing process:
- Lexical Analysis: The JSON text is scanned to identify tokens such as strings, numbers, objects, and arrays.
- Structural Conversion: These tokens are organized into an in-memory representation, typically a tree structure or hash table, which applications can then traverse and manipulate.
Each of these steps introduces overhead. Lexical analysis demands CPU cycles to decode and validate the JSON structure, while the structural conversion consumes memory and may require expensive allocations. Additionally, garbage collection in managed languages like Java and Python can exacerbate latency, as frequent memory allocations and deallocations put extra pressure on the runtime.
Batch vs. Streaming
In batch-processing environments, the overhead of JSON parsing is often amortized across large datasets processed in bulk. Here, the latency incurred during parsing is less noticeable, as the system is designed to process data in chunks with no hard real-time constraints.
In contrast, real-time systems like Estuary Flow require instantaneous processing of small, continuous data streams. The need for low-latency responses turns JSON parsing overhead into a significant bottleneck. For example, consider a high-throughput pipeline handling tens of thousands of JSON messages per second. Each millisecond spent parsing can cascade into latency spikes, queue backlogs, and degraded system performance.
Why Traditional Parsers Fall Short
Most conventional JSON libraries, such as Jackson (Java) or json (Python), use a two-step approach:
- Decode JSON Text: Read the JSON string and convert it into tokens.
- Build a Data Structure: Populate an intermediate object representation (e.g., dictionaries, arrays).
This method has a few inherent inefficiencies:
- Data Copying: The same JSON data is read, tokenized, and copied multiple times in memory, wasting both CPU cycles and RAM.
- Allocation Overhead: Dynamic memory allocations are frequent, slowing execution and increasing garbage collection pressure.
- Parsing Complexity: Validating JSON structure in real-time adds computational load, especially for large or deeply nested objects.
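To make these inefficiencies concrete, here is roughly what the conventional two-step flow looks like in C++ with a DOM-style library (nlohmann/json, used here purely as an illustration):

#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    std::string text = R"({"order_id": 98765, "customer": {"name": "John Doe"}})";

    // Lexing, validation, and allocation of the full object tree happen up front,
    // whether or not most of the document is ever read.
    nlohmann::json doc = nlohmann::json::parse(text);

    // Every access walks the intermediate structure, and extracting the string
    // copies it out of that structure.
    std::string name = doc["customer"]["name"].get<std::string>();
    std::cout << name << "\n";
}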
These limitations necessitated a fundamentally different approach to JSON processing for Estuary Flow.
How simdjson Transforms JSON Processing
Simdjson, a high-performance JSON parser, utilizes SIMD (Single Instruction, Multiple Data) instructions to accelerate lexical analysis dramatically. SIMD enables modern CPUs to process multiple data points simultaneously, processing many characters in parallel during tokenization and significantly reducing JSON parsing time.
Traditional parsers often operate on one character at a time, performing sequential checks to validate the structure and extract tokens. Simdjson instead reads data in chunks (e.g., 64 bytes at a time) and performs bulk operations, such as:
- Validating brackets and braces in parallel.
- Identifying string boundaries and numeric values across multiple characters simultaneously.
- Quickly rejecting malformed JSON by validating key structural elements in a single pass.
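As a rough illustration of that first stage, here is a deliberately scalar sketch that classifies a 64-byte block into a bitmask of structural characters; the real library computes masks like this (plus masks for quotes and escapes) for a whole block at once using SIMD instructions:

#include <cstddef>
#include <cstdint>
#include <string_view>

// Returns a 64-bit mask with bit i set when byte i of the block is a
// structural character. This scalar loop stands in for what simdjson does
// with a handful of vector instructions per 64-byte block.
uint64_t structural_mask(std::string_view block) {
    uint64_t mask = 0;
    for (std::size_t i = 0; i < block.size() && i < 64; ++i) {
        const char c = block[i];
        if (c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',') {
            mask |= (uint64_t{1} << i);
        }
    }
    return mask;
}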
Simdjson’s design is pretty revolutionary. Instead of creating an intermediate in-memory representation of the JSON, simdjson maps the raw JSON bytes into a memory view that applications can directly query. This approach avoids unnecessary data duplication and enables quick traversal of JSON attributes.
Applications like Flow interact directly with the raw JSON bytes, eliminating the need for costly encoding/decoding steps. In addition to this, the JSON data remains unmodified, reducing the risk of errors and maintaining consistency across processing stages.
Zero-copy access
Zero-copy access fundamentally changes how JSON data is processed. By bypassing intermediate structures, we can:
- Eliminate Parsing Overhead: Operations like reading, indexing, and validating fields occur directly on the byte array.
- Reduce Memory Footprint: Instead of storing redundant copies of the JSON, the system operates on the original data, leaving more memory available for other tasks.
- Enable Lightning-Fast Access: We use binary search to locate fields within parsed JSON documents, which is O(log n), where n is the number of properties at that level of nesting (see the sketch below).
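As a sketch of that lookup, assume the properties at one nesting level are stored sorted by name together with the byte offset of each value (illustrative types, not Flow's internals):

#include <algorithm>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// (field name, byte offset of its value in the raw document), sorted by name.
using FieldIndex = std::vector<std::pair<std::string_view, std::size_t>>;

// Binary search over the sorted field names: O(log n) in the number of
// properties at this nesting level.
std::optional<std::size_t> find_field(const FieldIndex& fields, std::string_view name) {
    auto it = std::lower_bound(
        fields.begin(), fields.end(), name,
        [](const auto& entry, std::string_view target) { return entry.first < target; });
    if (it != fields.end() && it->first == name) return it->second;
    return std::nullopt;  // field not present at this level
}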
Simdjson’s zero-copy access shines with complex, nested JSON structures. For example, consider the following JSON representing an e-commerce order:
{
  "order_id": 98765,
  "customer": {
    "id": 12345,
    "name": "John Doe"
  },
  "items": [
    {"product_id": 54321, "quantity": 2},
    {"product_id": 67890, "quantity": 1}
  ]
}
Traditional parsers require multiple decoding layers to access nested fields like customer.name or items[1].product_id. simdjson, however, enables direct field lookups with minimal overhead, ensuring blazing-fast performance even with deeply nested objects.
Simdjson directly maps the raw JSON bytes into memory, which is indexed for fast lookup. Accessing a nested field like customer.name proceeds as follows:
- Precomputed Index: During the parsing phase, simdjson identifies the byte offsets for keys and values and stores them in a lightweight index structure.
- Direct Lookup: To access customer.name, the application queries the index, which points directly to the byte offset of the value "John Doe" within the memory-mapped JSON.
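Flow's runtime drives this through its own document types, but the access pattern itself is easy to show with simdjson's C++ On-Demand API (a minimal, illustrative sketch):

#include <cstdint>
#include <iostream>
#include <string_view>
#include "simdjson.h"
using namespace simdjson;

int main() {
    padded_string json = R"({
      "order_id": 98765,
      "customer": { "id": 12345, "name": "John Doe" },
      "items": [
        {"product_id": 54321, "quantity": 2},
        {"product_id": 67890, "quantity": 1}
      ]
    })"_padded;

    ondemand::parser parser;
    ondemand::document doc = parser.iterate(json);

    // The string_view points into the parser's buffer: the value is not copied.
    std::string_view name = doc["customer"]["name"];
    std::cout << "customer.name = " << name << "\n";

    // Numeric fields are decoded directly from the underlying bytes on demand.
    for (ondemand::object item : doc["items"]) {
        uint64_t product_id = item["product_id"];
        uint64_t quantity = item["quantity"];
        std::cout << product_id << " x " << quantity << "\n";
    }
}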
Once the value is parsed into an ArchivedNode or HeapNode (Flow's intermediate document representations), we can extract fields by referencing the underlying bytes instead of copying them.
A key component in this process is the combiner, which acts as a buffer for documents within a given transaction. If only a small number of documents need to be processed, they can be stored entirely in memory using HeapNodes. However, when the transaction size exceeds a defined threshold (currently 256MB), the combiner automatically spills over to a local SSD, storing documents as ArchivedNodes.
This approach not only optimizes memory usage but also ensures that large transactions can be processed efficiently without exceeding memory constraints.
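In outline, the spill-over policy behaves something like the following hypothetical sketch (class name, file name, and the line-per-document format are illustrative; the threshold matches the 256MB mentioned above):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

class SpillingBuffer {
public:
    explicit SpillingBuffer(std::size_t threshold_bytes = 256 * 1024 * 1024)
        : threshold_(threshold_bytes) {}

    void add(const std::string& doc) {
        // Cross the threshold once, then keep appending to local disk.
        if (!spilled_ && in_memory_bytes_ + doc.size() > threshold_) {
            spill_to_disk();
        }
        if (spilled_) {
            spill_file_ << doc << '\n';   // analogous to ArchivedNode storage on SSD
        } else {
            in_memory_bytes_ += doc.size();
            heap_docs_.push_back(doc);    // analogous to HeapNode storage in memory
        }
    }

private:
    void spill_to_disk() {
        spill_file_.open("combiner_spill.tmp");
        for (const auto& doc : heap_docs_) spill_file_ << doc << '\n';
        heap_docs_.clear();
        in_memory_bytes_ = 0;
        spilled_ = true;
    }

    std::size_t threshold_;
    std::size_t in_memory_bytes_ = 0;
    bool spilled_ = false;
    std::vector<std::string> heap_docs_;
    std::ofstream spill_file_;
};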
The Combiner: Efficient Document Merging
While simdjson handles parsing, the true uniqueness of our JSON processing lies in the combiner. The combiner is a custom-built mechanism that efficiently merges JSON documents with the same key, enabling the handling of streaming updates.
The combiner is configured with an array of JSON Pointers, which it uses to extract the key from each incoming document. This key determines how documents are grouped and merged, ensuring that the aggregation and deduplication within a streaming workload are as efficient as possible. The combiner optimizes real-time data processing while maintaining consistency across transactions.
How the Combiner Works
The combiner executes three steps:
- Key Identification: Each incoming JSON document is associated with a unique key.
- In-Memory Consolidation: Documents with matching keys are merged in memory, combining their fields or updating existing ones.
- Downstream Commit: Once merged, the consolidated document is committed downstream for further processing or storage.
This approach minimizes redundancy by ensuring that updates to the same entity are processed together. By consolidating updates in memory, the combiner reduces write amplification and improves system throughput.
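A minimal sketch of that shape, assuming nlohmann/json, a single JSON Pointer key of /order_id, and a naive deep-merge standing in for Flow's schema-driven reductions (covered below):

#include <iostream>
#include <map>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Deep-merge `update` into `base`: objects merge recursively, arrays are
// appended, and scalars are overwritten by the update.
void merge_into(json& base, const json& update) {
    if (base.is_object() && update.is_object()) {
        for (auto& [k, v] : update.items()) {
            if (base.contains(k)) merge_into(base[k], v);
            else base[k] = v;
        }
    } else if (base.is_array() && update.is_array()) {
        base.insert(base.end(), update.begin(), update.end());
    } else {
        base = update;
    }
}

int main() {
    std::map<std::string, json> combiner;  // keyed in-memory buffer

    auto add = [&](const json& doc) {
        // Key Identification: extract the key via a JSON Pointer.
        std::string key = doc.at(json::json_pointer("/order_id")).dump();
        // In-Memory Consolidation: merge into any existing document for that key.
        auto [it, inserted] = combiner.try_emplace(key, doc);
        if (!inserted) merge_into(it->second, doc);
    };

    add(json::parse(R"({"order_id": 98765, "items": [{"product_id": 54321, "quantity": 2}]})"));
    add(json::parse(R"({"order_id": 98765, "items": [{"product_id": 11111, "quantity": 3}]})"));

    // Downstream Commit: one consolidated document per key is emitted.
    std::cout << combiner["98765"].dump(2) << "\n";
}

After both calls, the combiner holds a single document for key 98765 whose items array contains both products.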
Why the Combiner Is Unique
So why is this actually useful? The combiner enables Estuary Flow to:
- Handle high-throughput streaming data with ease.
- Provide consistent updates for entities without additional processing overhead.
- Optimize performance in scenarios with frequent partial updates, such as e-commerce orders or IoT device telemetry.
Let’s take a look at some practical examples.
Advanced Use Case: Parsing and Merging Nested JSON
Consider again the JSON from earlier representing an e-commerce order:
{
  "order_id": 98765,
  "customer": {
    "id": 12345,
    "name": "John Doe"
  },
  "items": [
    {"product_id": 54321, "quantity": 2},
    {"product_id": 67890, "quantity": 1}
  ]
}
The Combiner can efficiently merge updates to such nested JSON structures. For example:
- Incoming Update: An update adds a new item to the order.
{
  "order_id": 98765,
  "items": [
    {"product_id": 11111, "quantity": 3}
  ]
}
- Merge Operation: The Combiner merges the update with the existing document, resulting in:
{
  "order_id": 98765,
  "customer": {
    "id": 12345,
    "name": "John Doe"
  },
  "items": [
    {"product_id": 54321, "quantity": 2},
    {"product_id": 67890, "quantity": 1},
    {"product_id": 11111, "quantity": 3}
  ]
}
But wait, there’s more!
Beyond just merging JSON documents, the combiner applies configurable reduction strategies defined within the JSON schema. This approach provides far more flexibility than a simple “JSON Merge,” allowing users to control how fields are aggregated, overwritten, or combined based on their specific needs.
Reduction strategies range from accumulating numeric values and preserving the most recent update to concatenating arrays or applying other transformations, all defined in a schema-driven way.
In the future, we plan to expand this capability even further by supporting custom WASM modules. This will allow users to define entirely custom transformation logic, enabling advanced use cases such as domain-specific data normalization, filtering, and enrichment directly executed within the combiner.
Example: Real-Time IoT Sensor Data Aggregation
Consider a real-time IoT sensor monitoring system where multiple temperature sensors send periodic JSON updates for various locations. Instead of overwriting or storing every single update, the Combiner applies reduction strategies to aggregate the data in a meaningful way.
Incoming JSON updates from the same sensor over time:
{
  "sensor_id": "sensor-001",
  "location": "warehouse-7",
  "temperature_readings": [22.5]
}

{
  "sensor_id": "sensor-001",
  "location": "warehouse-7",
  "temperature_readings": [23.1, 21.9]
}

{
  "sensor_id": "sensor-001",
  "location": "warehouse-7",
  "temperature_readings": [22.0]
}
Reduction strategy: calculating average
Instead of storing every individual reading, the combiner applies an average reduction strategy:
- Numeric Accumulation: Maintain a running sum and count of temperature values.
- Array Concatenation: Store the readings for historical tracking.
- Computed Aggregates: Calculate min, max, and average temperature.
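Conceptually, the reduction boils down to something like this small sketch (illustrative names, not Flow's implementation):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

struct TemperatureSummary {
    double min = std::numeric_limits<double>::infinity();
    double max = -std::numeric_limits<double>::infinity();
    double sum = 0.0;
    std::size_t count = 0;
    std::vector<double> recent_readings;

    // Fold one incoming update into the running summary.
    void reduce(const std::vector<double>& readings) {
        for (double r : readings) {
            min = std::min(min, r);        // minimize
            max = std::max(max, r);        // maximize
            sum += r;                      // running sum
            ++count;                       // running count
            recent_readings.push_back(r);  // append for historical tracking
        }
    }

    double average() const { return count ? sum / count : 0.0; }
};

int main() {
    TemperatureSummary summary;
    summary.reduce({22.5});
    summary.reduce({23.1, 21.9});
    summary.reduce({22.0});
    std::cout << summary.min << " " << summary.max << " "
              << summary.average() << "\n";  // 21.9 23.1 22.375
}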
Merged JSON Output (after reduction):
{
  "sensor_id": "sensor-001",
  "location": "warehouse-7",
  "temperature_summary": {
    "min": 21.9,
    "max": 23.1,
    "average": 22.375,
    "recent_readings": [22.5, 23.1, 21.9, 22.0]
  }
}
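For reference, the average follows directly from the four readings: (22.5 + 23.1 + 21.9 + 22.0) / 4 = 89.5 / 4 = 22.375.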
In Flow, these reductions could be expressed along the following lines (a sketch, not a drop-in spec):
collections:
  sensors/temperature_aggregation:
    schema:
      type: object
      required: [sensor_id, location]
      properties:
        sensor_id:
          type: string
          description: "Unique identifier for the sensor."
        location:
          type: string
          description: "Physical location of the sensor."
        temperature_summary:
          type: object
          properties:
            min:
              type: number
              description: "Minimum recorded temperature."
              reduce: { strategy: minimize }
            max:
              type: number
              description: "Maximum recorded temperature."
              reduce: { strategy: maximize }
            sum:
              type: number
              description: "Accumulated sum of temperature readings."
              reduce: { strategy: sum }
            count:
              type: integer
              description: "Total number of temperature readings."
              reduce: { strategy: sum }
            average:
              type: number
              description: "Computed average temperature."
              reduce:
                strategy: sum
                with:
                  expression: temperature_summary.sum / temperature_summary.count
            recent_readings:
              type: array
              items: { type: number }
              description: "Temperature readings for historical tracking."
              reduce:
                strategy: append
    key: [/sensor_id, /location]
Each sensor_id + location acts as a unique key, meaning readings from the same sensor at the same location are aggregated.
Reduction strategies used:
- minimize → Keeps the lowest recorded temperature.
- maximize → Keeps the highest recorded temperature.
- sum → Maintains a running total for temperature values and count.
- average → Computed dynamically using the sum and count.
- append → Appends new readings to recent_readings for historical reference.
Bringing It All Together
Through the integration of simdjson’s SIMD-powered parsing and zero-copy design, coupled with object storage optimizations (more details in a future blog post), Estuary redefines JSON processing for high-throughput, real-time pipelines. By eliminating traditional bottlenecks, this approach empowers the platform to handle massive data volumes with unparalleled speed and efficiency, unlocking new possibilities for data-intensive applications.
About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.