Estuary

JSON Unnesting: Simplifying Complex Data

Frequently work with web data and NoSQL? Clean up your complex JSON structures for simpler reporting and analysis down the line.

Simplify your data structure with JSON Unnesting and depth selection
Share this article

Modern data sources often store data in a nested JSON format, which provides flexibility at the application layer. This means new fields can be easily added without a rigid schema.

However, when it comes to data analysis, this flexibility falters.

Analytics teams often prefer to completely flatten nested structures to make querying easier. The "let's unpack everything, put it in a table, and the BI tool will use it anyway" approach may seem practical in the short term. But over time, reporting complexity increases, data warehousing costs rise, hard-to-detect aggregation errors emerge and for analytics teams, and this structure becomes a nightmare.

Choosing the right JSON unnesting depth impacts BI performance, query costs, and the long-term sustainability of the data. The level to which flattening is done is an important decision, as it determines the level of detail (for instance, order, product, or discount) at which the data will be analyzed.

Estuary makes this decision operational by giving you control over how deeply nested JSON documents are unnested during the ingestion phase. This way, simpler and more consumable schemas for BI can be generated, without unnecessary row duplication and complex downstream SQL transformations.

Understanding Unnesting Depth in Nested JSON

The fundamental problem with Nested JSON in the analytics world is that the hierarchical structure of data doesn't perfectly match the tabular structure. Here’s a simple example:

json language-jsx
{ "order_id": 101, "items": [ { "product_id": 1, "price": 100, "discounts": [ {"type": "coupon", "amount": 10}, {"type": "seasonal", "amount": 5} ] }, { "product_id": 2, "price": 200, "discounts": [ {"type": "coupon", "amount": 20} ] } ] }

This structure has three different levels:

  • Order level
  • Item level
  • Discount level

If we flatten only the items area in Snowflake:

sql language-jsx
SELECT o.order_id, i.value:product_id::INT AS product_id, i.value:price::FLOAT AS price FROM orders o, LATERAL FLATTEN(input => o.items) i;

It reduces the data to an item grain, and each row becomes a product.

Here’s the example output:

order_idproduct_idprice
1011100
1012200

However, if we flatten the discounts field as well:

sql language-jsx
SELECT o.order_id, i.value:product_id::INT AS product_id, d.value:type::STRING AS discount_type, d.value:amount::FLOAT AS discount_amount FROM orders o, LATERAL FLATTEN(input => o.items) i, LATERAL FLATTEN(input => i.value:discounts) d;

Each row corresponds to a product + discount combination. As the depth increases, the number of rows also increases.

The example output here would be:

order_idproduct_iddiscount_typediscount_amount
1011coupon10
1011seasonal5
1012coupon20

Notice how the row count increases as we unnest deeper. Each additional depth changes the analytical grain of the table and may introduce duplication if aggregations are not adjusted accordingly.

Unnesting Depth and Analytical Grain

As we can see from the examples above, flattening depth changes not only the number of rows but also the analytical grain of the table. Grain refers to what a row represents. That is, in a table flattened to the item level, each row represents a product, whereas in a table flattened to the discount level, each row represents a product-discount combination.

These differences may seem small, but they have serious consequences in analytics.

For example, let's say we want to calculate total revenue based on orders. If the data is flattened to the discount level, the same product may be divided into multiple rows. Simply using SUM(price) will give an incorrect result. As you can see, even in a simple example, errors like this are easy to make, and the risk grows as the data gets more complex.

This only goes to show that more detail doesn’t necessarily mean better results. In other words, the choice of unnesting depth is actually a decision guided by both business needs and technical preferences. Before starting our analysis, we should ask ourselves:

At what level of detail do we want to analyze the data, and what are our business needs?

Our decision regarding depth should be shaped based on our answer to this question.

The Hidden Trade-Offs of Recursive Flattening

Recursive flattening makes all nested areas accessible, but as we mentioned earlier, this isn't always the right approach from an analytical standpoint, as it can create complexity.

The first and most visible effect is increased cardinality. Each nested array multiplies the number of rows. This can lead to unnecessary row duplication and queries failing to return, with high-volume data, in particular.

In addition, analytical complexity increases. As the grain size drops, aggregations become more sensitive. Simple SUM and COUNT operations are susceptible to duplicates, and additional transformations may be required.

Finally, schema complexity increases, too. When all rarely used sub-details are opened up, BI models become unnecessarily large, and sustainability becomes difficult.

Therefore, we shouldn't take the easy way out and directly apply recursive flattening. Choosing the correct unnesting depth requires conscious decisions tailored to our business problem.

A Practical Decision Framework for Choosing Depth

QuestionIf your answer is…Here’s the suggested action.
Which grain will the reporting be done on?Order-level reporting is sufficient.Avoid flattening nested arrays, or keep unnesting at a minimal level.
Is reporting item-based?Product performance will be analyzed.Flatten only to the item level.
Does the nested field have high cardinality?YesAvoid recursive flattening; consider modeling it as a separate table.
Will this lower-level detail be frequently queried?NoPreserve the nested structure or store it in a separate model.
Is the row increase after flattening acceptable?NoPrefer shallower unnesting.

This framework transforms unnesting depth from a purely technical preference into a business-driven modeling decision. The next step is to consistently apply this decision during the ingestion phase.

Enabling Controlled JSON Unnesting with Estuary

Now that we've determined the correct grain using the decision framework, another important step is to ensure this decision is applied consistently across the system. If JSON transformations are handled only through downstream SQL queries, each team may implement its own flattening logic, which will lead to duplicated work and inconsistent models over time.

Estuary addresses this by allowing teams to control how nested JSON fields are materialized during ingestion. During the materialization phase, users can define how deeply nested fields should be expanded into columns. This should make the transformation a centralized configuration rather than a collection of ad-hoc SQL queries.

Estuary provides several field selection modes that determine how nested JSON structures are expanded during materialization:

  • Required only: select only required fields
  • Depth 1: select all top-level fields
  • Depth 2: select fields at the first and the second nesting levels
  • Unlimited depth: select all nested fields

These modes expand nested JSON objects into additional columns while still preserving one row per original record. As a result, nested data is easier to query in BI tools without changing the underlying record structure.

Breaks down columns ingested at each level of field selection
Figure: Visualizing Field Depth Selection in Estuary

On the specification side, this control can be configured directly in the materialization definition:

yaml language-jsx
bindings: - source: acmeCo/example/collection resource: { table: example_table } fields: recommended: 1

This configuration selects fields up to the specified nesting depth during materialization. Deeper nested fields are not expanded into columns unless explicitly configured.

Specific fields can also be included or excluded when needed:

yaml language-jsx
fields: recommended: 1 require: importantNestedField: {} exclude: - rarelyUsedField

If you need to expand arrays into multiple records (for example, turning each item in an order into its own row), Estuary can handle this for you through derivations, which allow users to create new collections using transformation logic.

Thanks to this approach, we can move the unnesting decision from the query level to the ingestion layer. Teams no longer have to write a separate LATERAL FLATTEN; in Snowflake. The depth decision becomes a centralized and governance-friendly configuration.

As a result, Estuary transforms unnesting depth from an ad-hoc SQL preference into a modeling decision defined and standardized at the ingestion-layer level. This means simpler schemas, more predictable grain, and a more sustainable analytics architecture for BI.

Conclusion

Nested JSON is an inevitable part of modern data architecture. Sure, it’s flexible, but if you don’t handle it right, that flexibility can come back to bite you.

How deeply you unnest JSON is actually a pretty big deal. It has impact on the table's grain, query performance, and the sustainability of the BI model. Recursive flattening isn't always the right course of action. The smarter move would be to figure out what level of detail you actually need for analysis and stick to it across your system.

Estuary makes this easy by letting you control the unnesting depth right at the ingestion stage. As a result, your data is moved and structured at the right level while also being consumable for BI and sustainable in the long term.

FAQs

    Why is nested JSON problematic for analytics and BI tools?

    Hierarchical data doesn't map cleanly to tabular structure. Flattening this data causes row duplication, aggregation errors, and growing query complexity over time.
    Grain is what a single row represents. Flattening depth changes the grain: flatten to item level and each row is a product; flatten to discount level and each row is a product-discount combination. Getting this wrong breaks aggregations.
    If the data is flattened to the discount level, the same product may be divided into multiple rows. In this case, a simple SUM(price) result will give an incorrect result.
    Each nested array multiplies the number of rows. This can lead to unnecessary row duplication and queries failing to return, especially with high-volume data.
    Without a centralized unnesting decision, each team implements their own flattening logic. This leads to duplicated work and inconsistent models.

Start streaming your data for free

Build a Pipeline

About the author

Picture of Buse Şenol
Buse ŞenolAI Engineer and Data Scientist

AI Engineer and Data Scientist with a passion for building scalable machine learning systems and robust data architectures. I combine technical expertise with a strategic mindset to solve complex data challenges, focusing on real-time integration and predictive analytics.

Streaming Pipelines.
Simple to Deploy.
Simply Priced.
$0.50/GB of data moved + $.14/connector/hour;
50% less than competing ETL/ELT solutions;
<100ms latency on streaming sinks/sources.