Fast JSON Processing in Real-time Systems: simdjson and Zero-Copy Design

Discover how Estuary Flow handles massive data volumes by leveraging simdjson and a unique Combiner to optimize real-time JSON parsing and document merging.

Dani Pálma
Phil Fried, Engineer

Efficiently processing JSON data has historically been a common bottleneck in data-intensive applications, especially in real-time streaming environments.

At Estuary, we are building a data movement platform specialized in exactly these challenges: supporting high-throughput, low-latency data pipelines. That meant we needed an approach that moves beyond traditional JSON decoding methods.

JSON parsing and processing can be a significant bottleneck when handling large volumes of data due to the overhead of converting the document structure into a usable format in memory. While this overhead may be manageable in some batch-processing environments, it becomes a critical performance concern for platforms processing tens of terabytes of data daily.

So, we took a different approach: by leveraging simdjson for parsing and introducing a unique Combiner for efficient document merging, Estuary Flow achieves exceptional performance in real-time JSON processing.

Why JSON Processing Is Traditionally Slow

Parsing JSON is generally very resource-intensive. While it is human-readable, substantial processing is required to convert it into a usable, structured format. This complexity arises from the following two key steps in the parsing process:

  1. Lexical Analysis: The JSON text is scanned to identify tokens such as strings, numbers, objects, and arrays.
  2. Structural Conversion: These tokens are organized into an in-memory representation, typically a tree structure or hash table, which applications can then traverse and manipulate.

Each of these steps introduces overhead. Lexical analysis demands CPU cycles to decode and validate the JSON structure, while the structural conversion consumes memory and may require expensive allocations. Additionally, garbage collection in managed languages like Java and Python can exacerbate the latency, as frequent memory allocations and deallocations constrain the runtime.
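
To make that overhead concrete, here is a minimal Python sketch using the standard library's json module (referenced again below); the payload is invented and the exact byte counts will vary by interpreter:

python
import json
import sys

raw = b'{"order_id": 98765, "items": [{"product_id": 54321, "quantity": 2}]}'

# Lexical analysis + structural conversion: the standard library scans the
# text and builds a tree of Python dicts, lists, and strings, allocating a
# new object for every container and string it encounters.
doc = json.loads(raw)

# The in-memory representation typically dwarfs the raw payload. (Numbers
# vary by interpreter; this only counts the top-level containers.)
print(len(raw))                                      # size of the raw JSON text
print(sys.getsizeof(doc))                            # the top-level dict alone
print(sum(sys.getsizeof(v) for v in doc.values()))   # plus its immediate values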

Batch vs. Streaming

In batch-processing environments, the overhead of JSON parsing is often amortized across large datasets processed in bulk. Here, the latency incurred during parsing is less noticeable, as the system is designed to process data in chunks with no hard real-time constraints.

In contrast, real-time systems like Estuary Flow require instantaneous processing of small, continuous data streams. The need for low-latency responses turns JSON parsing overhead into a significant bottleneck. For example, consider a high-throughput pipeline handling tens of thousands of JSON messages per second. Each millisecond spent parsing can cascade into latency spikes, queue backlogs, and degraded system performance.

Why Traditional Parsers Fall Short

Most conventional JSON libraries, such as Jackson (Java) or json (Python), use a two-step approach:

  1. Decode JSON Text: Read the JSON string and convert it into tokens.
  2. Build a Data Structure: Populate an intermediate object representation (e.g., dictionaries, arrays).

This method has a few inherent inefficiencies, illustrated in the sketch after this list:

  • Data Copying: The same JSON data is read, tokenized, and copied multiple times in memory, wasting both CPU cycles and RAM.
  • Allocation Overhead: Dynamic memory allocations are frequent, slowing execution and increasing garbage collection pressure.
  • Parsing Complexity: Validating JSON structure in real-time adds computational load, especially for large or deeply nested objects.
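
As a rough illustration of the copying and allocation costs listed above, this short Python sketch traces the memory allocated while decoding a small, invented payload with the standard library; the measured numbers will differ between runs and interpreters:

python
import json
import tracemalloc

raw = b'{"customer": {"name": "John Doe"}, "status": "shipped"}'

tracemalloc.start()
doc = json.loads(raw)           # tokenize, validate, and copy into new objects
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Every key and value is decoded out of the input buffer into freshly
# allocated Python objects, so the allocations made while decoding typically
# exceed the size of the raw payload itself.
print(f"raw payload: {len(raw)} bytes, allocated while decoding: ~{peak} bytes")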

These limitations necessitated a fundamentally different approach to JSON processing for Estuary Flow.

How simdjson Transforms JSON Processing

Simdjson, a high-performance JSON parser, uses SIMD (Single Instruction, Multiple Data) instructions to dramatically accelerate lexical analysis. SIMD lets modern CPUs operate on multiple data points with a single instruction, so many characters are examined in parallel during tokenization, which significantly reduces parsing time.

[Image: illustration of SIMD processing. Source: http://ftp.cvut.cz]

Traditional parsers often operate on one character at a time, performing sequential checks to validate the structure and extract tokens. Simdjson instead reads data in chunks (e.g., 64 bytes at a time) and performs bulk operations, such as:

  • Validating brackets and braces in parallel.
  • Identifying string boundaries and numeric values across multiple characters simultaneously.
  • Quickly rejecting malformed JSON by validating key structural elements in a single pass.
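
To give a feel for this chunk-at-a-time classification, here is a conceptual Python sketch in which NumPy's vectorized comparisons stand in for SIMD instructions; it loosely mimics the structural-character bitmaps simdjson builds, but it is not simdjson's actual implementation:

python
import numpy as np

# A chunk of JSON padded to 64 bytes, roughly the block size a SIMD-based
# parser reads at a time.
chunk = b'{"id": 1, "tags": ["a", "b"], "ok": true}'.ljust(64)
data = np.frombuffer(chunk, dtype=np.uint8)

# Classify all 64 bytes "at once": one boolean per byte marking structural
# characters and quotes. Real SIMD code does this with byte-wise compare
# instructions that produce a 64-bit mask in a few cycles.
structural = np.isin(data, np.frombuffer(b'{}[]:,', dtype=np.uint8))
quotes = data == ord('"')

# Pack the per-byte booleans into bitmaps, loosely mirroring the structural
# index simdjson builds in its first pass.
print(np.packbits(structural))
print(np.packbits(quotes))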

Simdjson’s design is pretty revolutionary. Instead of creating an intermediate in-memory representation of the JSON, simdjson maps the raw JSON bytes into a memory view that applications can directly query. This approach avoids unnecessary data duplication and enables quick traversal of JSON attributes.

Applications like Flow interact directly with the raw JSON bytes, eliminating the need for costly encoding/decoding steps. In addition to this, the JSON data remains unmodified, reducing the risk of errors and maintaining consistency across processing stages.

Zero-copy access

Zero-copy access fundamentally changes how JSON data is processed. By bypassing intermediate structures, we can:

  1. Eliminate Parsing Overhead: Operations like reading, indexing, and validating fields occur directly on the byte array.
  2. Reduce Memory Footprint: Instead of storing redundant copies of the JSON, the system operates on the original data, leaving more memory available for other tasks.
  3. Lightning-Fast Access: We use binary search to locate fields within parsed JSON documents, which is O(log n), where n is the number of properties at that level of nesting (illustrated in the sketch after this list).
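
As a rough illustration of that binary-search lookup (the property names and byte offsets below are hypothetical, and this is not Flow's actual data structure), consider:

python
from bisect import bisect_left

# Hypothetical parsed object: properties stored sorted by key, each entry
# pointing at an (offset, length) into the underlying byte buffer. The
# offsets are made up for illustration.
properties = [
    ("customer", (20, 45)),
    ("items", (67, 80)),
    ("order_id", (2, 16)),
]
keys = [name for name, _ in properties]

def locate(field):
    # Binary search over the sorted keys: O(log n) in the number of
    # properties at this level of nesting.
    i = bisect_left(keys, field)
    if i < len(keys) and keys[i] == field:
        return properties[i][1]
    return None

print(locate("items"))     # (67, 80)
print(locate("missing"))   # None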

Simdjson’s zero-copy access shines with complex, nested JSON structures. For example, consider the following JSON representing an e-commerce order:

json
{
  "order_id": 98765,
  "customer": {
    "id": 12345,
    "name": "John Doe"
  },
  "items": [
    {"product_id": 54321, "quantity": 2},
    {"product_id": 67890, "quantity": 1}
  ]
}

Traditional parsers require multiple decoding layers to access nested fields like customer.name or items[1].product_id. simdjson, however, enables direct field lookups with minimal overhead, ensuring blazing-fast performance even with deeply nested objects.

Simdjson directly maps the raw JSON bytes into memory, which is indexed for fast lookup. Accessing a nested field like customer.name proceeds as follows:

  1. Precomputed Index: During the parsing phase, simdjson identifies the byte offsets for keys and values and stores them in a lightweight index structure.
  2. Direct Lookup: To access customer.name, the application queries the index, which points directly to the byte offset of the value "John Doe" within the memory-mapped JSON.
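
For readers who want to experiment with this access pattern from Python, the pysimdjson bindings wrap the simdjson library and expose similarly lazy lookups. The sketch below is illustrative of the access pattern only and is not Flow's internal code:

python
import simdjson  # the pysimdjson bindings around the simdjson library

raw = b'''{ "order_id": 98765,
            "customer": { "id": 12345, "name": "John Doe" },
            "items": [ {"product_id": 54321, "quantity": 2},
                       {"product_id": 67890, "quantity": 1} ] }'''

parser = simdjson.Parser()

# parse() builds simdjson's lightweight index over the raw bytes; values are
# only materialized into Python objects when they are actually accessed.
doc = parser.parse(raw)

print(doc["customer"]["name"])        # John Doe
print(doc["items"][1]["product_id"])  # 67890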

Once the value is parsed into an ArchivedNode or HeapNode (Flow's intermediate document representations), we can extract fields by referencing the underlying bytes instead of copying them.

A key component in this process is the combiner, which acts as a buffer for documents within a given transaction. If only a small number of documents need to be processed, they can be stored entirely in memory as HeapNodes. However, when the transaction size exceeds a defined threshold (currently 256MB), the combiner automatically spills over to a local SSD, storing documents as ArchivedNodes.

This approach not only optimizes memory usage but also makes sure that large transactions can be efficiently processed without exceeding memory constraints.
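
A toy sketch of that buffer-then-spill pattern might look like the following; the class and the tiny demo threshold are invented stand-ins for Flow's HeapNode/ArchivedNode handling, shown only to illustrate the idea:

python
import json
import tempfile

class SpillingBuffer:
    """Toy sketch of the buffer-then-spill pattern: documents accumulate in
    memory and overflow to a temporary file on local disk once a size
    threshold is crossed. The structures here are invented stand-ins for
    Flow's HeapNode / ArchivedNode handling."""

    def __init__(self, threshold=256 * 1024 * 1024):  # 256MB, as described above
        self.threshold = threshold
        self.bytes_buffered = 0
        self.in_memory = []        # "HeapNode"-style: documents held in RAM
        self.spill_file = None     # "ArchivedNode"-style: documents on local SSD

    def add(self, doc):
        encoded = json.dumps(doc).encode()
        self.bytes_buffered += len(encoded)
        if self.bytes_buffered > self.threshold:
            if self.spill_file is None:
                self.spill_file = tempfile.TemporaryFile()
            self.spill_file.write(encoded + b"\n")
        else:
            self.in_memory.append(doc)

buf = SpillingBuffer(threshold=1024)   # tiny threshold, just for demonstration
buf.add({"order_id": 98765, "items": []})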

The Combiner: Efficient Document Merging

While simdjson handles parsing, the true uniqueness of our JSON processing lies in the combiner. The combiner is a custom-built mechanism that efficiently merges JSON documents with the same key, enabling the handling of streaming updates.

The combiner is configured with an array of JSON Pointers, which it uses to extract the key from each incoming document. This key determines how documents are grouped and merged, ensuring that the aggregation and deduplication within a streaming workload are as efficient as possible. The combiner optimizes real-time data processing while maintaining consistency across transactions.

How the Combiner Works

The combiner executes three steps:

  1. Key Identification: Each incoming JSON document is associated with a unique key.
  2. In-Memory Consolidation: Documents with matching keys are merged in memory, combining their fields or updating existing ones.
  3. Downstream Commit: Once merged, the consolidated document is committed downstream for further processing or storage.

This approach minimizes redundancy by making sure that the updates to the same entity are processed together. By consolidating updates in memory, the combiner reduces write amplification and improves system throughput.
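
The following minimal Python sketch illustrates those three steps with a plain shallow merge; the helper names are made up, and Flow's real combiner applies the schema-driven reduction strategies described below rather than a simple dictionary update:

python
def resolve_pointer(doc, pointer):
    # Resolve a JSON Pointer such as "/order_id" against a document.
    node = doc
    for token in pointer.lstrip("/").split("/"):
        node = node[token]
    return node

def combine(documents, key_pointers):
    # Toy combiner: group documents by their extracted key and shallow-merge
    # later documents over earlier ones.
    merged = {}
    for doc in documents:
        # 1. Key identification via the configured JSON Pointers.
        key = tuple(resolve_pointer(doc, p) for p in key_pointers)
        # 2. In-memory consolidation of documents sharing that key.
        if key in merged:
            merged[key].update(doc)
        else:
            merged[key] = dict(doc)
    # 3. Downstream commit: one consolidated document per key.
    return list(merged.values())

updates = [
    {"order_id": 98765, "status": "pending"},
    {"order_id": 98765, "status": "shipped"},
    {"order_id": 11111, "status": "pending"},
]
print(combine(updates, key_pointers=["/order_id"]))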

Why the Combiner Is Unique

So why is this actually useful? The combiner enables Estuary Flow to:

  • Handle high-throughput streaming data with ease.
  • Provide consistent updates for entities without additional processing overhead.
  • Optimize performance in scenarios with frequent partial updates, such as e-commerce orders or IoT device telemetry.

Let’s take a look at some practical examples.

Advanced Use Case: Parsing and Merging Nested JSON

Consider the following JSON representing an e-commerce order:

json
{
  "order_id": 98765,
  "customer": {
    "id": 12345,
    "name": "John Doe"
  },
  "items": [
    {"product_id": 54321, "quantity": 2},
    {"product_id": 67890, "quantity": 1}
  ]
}

The Combiner can efficiently merge updates to such nested JSON structures. For example:

  1. Incoming Update: An update adds a new item to the order.

json
{
  "order_id": 98765,
  "items": [
    {"product_id": 11111, "quantity": 3}
  ]
}

  2. Merge Operation: The Combiner merges the update with the existing document, resulting in:

json
{
  "order_id": 98765,
  "customer": {
    "id": 12345,
    "name": "John Doe"
  },
  "items": [
    {"product_id": 54321, "quantity": 2},
    {"product_id": 67890, "quantity": 1},
    {"product_id": 11111, "quantity": 3}
  ]
}

But wait, there’s more!

Beyond just merging JSON documents, the combiner applies configurable reduction strategies defined within the JSON schema. This approach provides far more flexibility than a simple “JSON Merge,” allowing users to control how fields are aggregated, overwritten, or combined based on their specific needs.

Reduction strategies range from accumulating numeric values, preserving the most recent update, and concatenating arrays, to applying other transformations, all defined in a schema-driven way.
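
Conceptually, you can think of this as a small table of per-field reduce functions selected by the schema; the sketch below is illustrative only, and the strategy names shown are examples rather than an exhaustive or authoritative list:

python
# An illustrative strategy table -- not Flow's reduction engine, and not an
# exhaustive list of its strategies.
REDUCERS = {
    "sum":           lambda old, new: old + new,
    "minimize":      min,
    "maximize":      max,
    "lastWriteWins": lambda old, new: new,
    "append":        lambda old, new: old + new,   # array concatenation
}

def reduce_field(strategy, old, new):
    return REDUCERS[strategy](old, new)

print(reduce_field("sum", 10, 5))               # 15
print(reduce_field("append", [22.5], [23.1]))   # [22.5, 23.1]
print(reduce_field("maximize", 21.9, 23.1))     # 23.1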

In the future, we plan to expand this capability even further by supporting custom WASM modules. This will allow users to define entirely custom transformation logic, enabling advanced use cases such as domain-specific data normalization, filtering, and enrichment directly executed within the combiner.

Example: Real-Time IoT Sensor Data Aggregation

Consider a real-time IoT sensor monitoring system where multiple temperature sensors send periodic JSON updates for various locations. Instead of overwriting or storing every single update, the Combiner applies reduction strategies to aggregate the data in a meaningful way.

Incoming JSON updates from different sensors:

json
{ "sensor_id": "sensor-001", "location": "warehouse-7", "temperature_readings": [22.5] }
{ "sensor_id": "sensor-001", "location": "warehouse-7", "temperature_readings": [23.1, 21.9] }
{ "sensor_id": "sensor-001", "location": "warehouse-7", "temperature_readings": [22.0] }

Reduction strategy: calculating average

Instead of storing every individual reading, the combiner applies an average reduction strategy:

  • Numeric Accumulation: Maintain a running sum and count of temperature values.
  • Array Concatenation: Store the readings for historical tracking.
  • Computed Aggregates: Calculate min, max, and average temperature.

Merged JSON Output (after reduction):

json
{
  "sensor_id": "sensor-001",
  "location": "warehouse-7",
  "temperature_summary": {
    "min": 21.9,
    "max": 23.1,
    "average": 22.375,
    "recent_readings": [22.5, 23.1, 21.9, 22.0]
  }
}

In Flow, these reductions could be expressed with a collection spec along these lines:

yaml
collections:
  sensors/temperature_aggregation:
    schema:
      type: object
      required: [sensor_id, location]
      properties:
        sensor_id:
          type: string
          description: "Unique identifier for the sensor."
        location:
          type: string
          description: "Physical location of the sensor."
        temperature_summary:
          type: object
          properties:
            min:
              type: number
              description: "Minimum recorded temperature."
              reduce: { strategy: minimize }
            max:
              type: number
              description: "Maximum recorded temperature."
              reduce: { strategy: maximize }
            sum:
              type: number
              description: "Accumulated sum of temperature readings."
              reduce: { strategy: sum }
            count:
              type: integer
              description: "Total number of temperature readings."
              reduce: { strategy: sum }
            average:
              type: number
              description: "Computed average temperature."
              reduce:
                strategy: sum
                with:
                  expression: temperature_summary.sum / temperature_summary.count
            recent_readings:
              type: array
              items: { type: number }
              description: "Temperature readings for historical tracking."
              reduce:
                strategy: append
    key: [/sensor_id, /location]

Each sensor_id + location acts as a unique key, meaning readings from the same sensor at the same location are aggregated.

Reduction strategies used:

  • minimize → Keeps the lowest recorded temperature.
  • maximize → Keeps the highest recorded temperature.
  • sum → Maintains a running total for temperature values and count.
  • average → Computed dynamically using the sum and count.
  • append → Appends new temperature readings to the array for historical reference.
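
To tie the numbers together, here is a plain-Python sketch that applies these reductions to the three sensor updates and reproduces the summary shown above (min 21.9, max 23.1, average 22.375); it is a worked illustration, not Flow's runtime:

python
updates = [
    {"temperature_readings": [22.5]},
    {"temperature_readings": [23.1, 21.9]},
    {"temperature_readings": [22.0]},
]

summary = {"min": float("inf"), "max": float("-inf"),
           "sum": 0.0, "count": 0, "recent_readings": []}

for update in updates:
    readings = update["temperature_readings"]
    summary["min"] = min(summary["min"], *readings)   # minimize
    summary["max"] = max(summary["max"], *readings)   # maximize
    summary["sum"] += sum(readings)                   # sum
    summary["count"] += len(readings)                 # sum (as a count)
    summary["recent_readings"] += readings            # append

summary["average"] = summary["sum"] / summary["count"]
print(summary["min"], summary["max"], summary["average"])
# 21.9 23.1 22.375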

Bringing It All Together

Through the integration of simdjson’s SIMD-powered parsing and zero-copy design, coupled with object storage optimizations (more details in a future blog post), Estuary redefines JSON processing for high-throughput, real-time pipelines. By eliminating traditional bottlenecks, this approach empowers the platform to handle massive data volumes with unparalleled speed and efficiency, unlocking new possibilities for data-intensive applications.

About the authors

Dani Pálma

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.

Phil Fried, Engineer
