11 min read

Last updated: June 24, 2024

How Estuary Flow does Schema Evolution

Understand what schema evolution is, and discover how Estuary Flow manages it.

Dani Pálma Head of Data Engineering Marketing

Share this article

Estuary Flow is a platform built for real-time data. It captures, transforms, and materializes data between systems. One of its advanced features is schema evolution, which enables data flows to adapt to changes in source schemas without disrupting the entire pipeline.

This article clarifies how Estuary Flow manages schema evolution to ensure your data flows remain strong and operational even in the event of changes.

Understanding Schema Evolution

Schema evolution in data systems refers to the process of accommodating changes in the schema of a data collection. This feature is crucial because it prevents data flow from failing due to mismatched components when a schema changes.

Changing schemas is unavoidable, so the best approach is to prepare, especially if you have little control over upstream systems. If you work with data, you have probably experienced an unexpected schema change in a source system, which caused some level of chaos.

Here are a few examples of events that could lead to schema changes:

Business Requirements: New business initiatives, regulatory requirements, or changes in market conditions often necessitate updates to the data schema to capture new types of information or to modify existing data structures.
Technological Advancements: For instance, migrating from a noSQL database to a traditional relational database may require significant schema modifications to leverage the new system’s capabilities.
Data Quality Improvements: Organizations might update their data models to include more accurate, comprehensive, and normalized data representations, which necessitates schema updates.
Integration of New Data Sources: Integrating data from new external sources or third-party systems often requires schema adjustments to harmonize different data formats and structures.

Challenges in Handling Schema Changes

Software engineers and data practitioners often find dealing with schema changes to be a challenging task. Although necessary, there are several common pain points associated with it.

Dependency Management: A change in one part of the schema can have cascading effects throughout the pipeline, making it difficult to manage and predict all impacts.
Data Consistency and Integrity: Schema changes can lead to data anomalies, missing values, or misinterpretation of data if not handled properly.
Backward Compatibility: Schema changes need to be carefully managed to ensure that new schema versions can coexist with older ones, allowing for seamless access to both current and historical data.
Automation and Tooling: Effective tooling is required to detect schema changes, propagate updates across the data pipeline, and validate the changes without disrupting the data flow. Building and maintaining such tools require substantial expertise and resources.

In the rest of the article, we’ll take a look at how Estuary Flow helps with these challenges and what you, as the user, should expect during schema changes.

Key Concepts in Flow

Before diving into schema evolution, it’s essential to understand some basic concepts in Flow.

Captures

A capture is how Flow ingests data from an external source. Every Data Flow starts with a capture. Captures run continuously: as soon as new documents are made available at the endpoint resources, Flow validates their schema and adds them to the appropriate collection.

Collections

Real-time datasets are stored as groups of continually updating JSON documents.

All collections in Flow have an associated JSON schema against which documents are validated every time they're written or read. Schemas are critical to how Flow ensures the integrity of your data. Flow validates your documents to ensure that bad data doesn't make it into your collections — or worse, into downstream data products!

JSON schema is a flexible standard for representing structure, invariants, and other constraints over your documents. Schemas can be very permissive, highly exacting, or somewhere in between.

Flow pauses catalog tasks when documents don't match the collection schema, alerting you to the mismatch and allowing you to fix it before it creates a bigger problem.

Materializations

Materializations read data from collections and write to destination systems.

Collections, captures, and materializations together form a complete data flow. Any changes to the schema of a collection can impact these components, making schema evolution necessary.

How Automatic Schema Evolution Works in Estuary Flow

We’ve seen a few theoretical reasons for schema changes, but on a technical level, what actually triggers a schema evolution? Schema evolution is necessary when there are changes in the collection's specifications, which include the collection's key, schema, and logical partitions. Such changes might arise due to:

An ALTER TABLE statement in the source database.
Introduction of new unstructured data with a different schema.
Manual changes to the collection's logical partitions.

Without schema evolution, these changes could cause data flow failures, thankfully, Estuary Flow's schema evolution feature allows you to update the entire data flow in response to these changes, ensuring continued operation.

Here’s how Estuary Flow handles various schema changes coming from upstream databases:

New Table Added to Database

Imagine you have a CDC pipeline that is configured to replicate data from two tables called users and transactions. After a while, you’d like a third table, called products to be included in the replication process as well.

To enable this functionality, ensure the Automatically add new collections option is enabled when configuring the Capture which fills up with data. If this setting is enabled and the capture can discover the new table, it will be automatically added to the list of bindings.

Flow will automatically add the new table to the destination, perform an initial backfill, and switch to streaming.

New Column Added to Table

A fairly common occurrence is when a source database table you are replicating gets a new column. In most cases, you’d also want this new column to be replicated in your destination table as soon as possible.

To achieve this, enable the Automatically keep schemas up to date option. This way, whenever Flow detects the new column in an existing binding, it will add it to the destination, and include its values in future data.

Column Removed from Table

Flow does not drop columns in the destination to preserve historical data. No new data is added to the removed column, but existing data remains.

Column Definition Changed

Changing a column’s type, or adding certain constraints to it, such as NOT NULL can be a breaking change If the existing data in the collection doesn’t conform to the changes.

The Breaking changes re-version collections option determines how to respond when the discovered updates would cause a breaking change to the collection. If true, it will trigger an evolution of the incompatible collection(s) to prevent failures.

Enhanced Control Over Incompatible Schema Changes

While detecting incompatible schema changes is crucial, deciding on the response is equally important. Real-world data pipelines have complex requirements that can’t be managed by a single boolean flag like the one above.

To address this, Estuary Flow introduces the onIncompatibleSchemaChange field in materialization specs, providing granular control over responses to incompatible schema changes. This field can be set at the top level of a materialization spec or within each binding. If not specified at the binding level, the top-level setting applies by default. The onIncompatibleSchemaChange field offers four options:

backfill (default if unspecified): Increments the backfill counter for affected bindings, recreating the destination resources to fit the new schema and backfilling them.
disableBinding: Disables the affected bindings, requiring manual intervention to re-enable and resolve the incompatible fields.
disableTask: Disables the entire materialization, necessitating human action to re-enable and address the incompatible fields.
abort: Halts any automated action, leaving the resolution decision to a human.

These behaviors are triggered only when an automated action detects an incompatible schema change. Manual changes via the UI will ignore onIncompatibleSchemaChange. This feature can be configured using flowctl or the "Advanced specification editor".

With the introduction of onIncompatibleSchemaChange, the evolveIncompatibleCollections field's behavior will eventually be simplified. Currently, this boolean controls responses to incompatible schema changes across all captured collections. In the future, it will only apply to collections needing complete re-creation, primarily when changing the collection key or logical partitioning configuration.

Handling of Inferred Schemas

Inferred schemas are directly added to your collection specs under $defs with a key of flow://inferred-schema. This allows you to customize other parts of the read schema while clearly seeing the inferred schema being used for each collection.

This extends to derivations as well. By including "$ref": "flow://inferred-schema" in the collection’s readSchema, you can leverage inferred schemas with derivations. Estuary Flow's automation will periodically update the collection spec to inline the actual inferred schema as it evolves.

Collections with frequent inferred schema changes will be checked more often, while those with less frequent changes will be checked less frequently, with a maximum interval of every two hours. This approach ensures that inferred schemas remain up-to-date while balancing the frequency of checks based on the collection's activity.

Manually Triggering Schema Evolution

When you attempt to publish a change to the schema of a collection in the Flow dashboard, you might encounter an error indicating a breaking change. Here’s what happens next:

Error Notification: You receive an error message indicating the issue.
Apply Changes: You can click the Apply button to trigger schema evolution. This updates all necessary specifications to keep your data flow functioning.
Review and Publish: After applying changes, you review and publish your draft.

If you have Automatically keep schemas up to date enabled on a capture with the Breaking changes re-version collections option selected, breaking changes will trigger automatic schema evolution. This is meant to provide the least friction in terms of human involvement.

When you publish a new schema via the UI, Flow detects two different cases. One is when a materialization reports a field as "unsatisfiable", and the other is if the collection key or logical partitioning have changed. In the first case, we'll offer to the user to re-backfill the materializations, and in the second case we'll offer to also re-create the collections.

To visualize these two options:

Materialize Data to a New Resource: The system updates the affected materialization bindings, recreates the resource (e.g., a database table), and backfills it from the beginning. This approach is simpler and used in most cases.

Re-create the Flow Collection with a New Name: A new collection is created with a numerical suffix (e.g., _v2). This new collection starts empty and is backfilled from the source. Captures and materializations referencing the old collection are updated to reference the new one, and their backfill counters are incremented. This approach is used for changes in the collection key or logical partitioning.

In both scenarios, the destination resource names remain unchanged, and only the specific bindings with incompatible changes are affected. Other bindings remain untouched and do not re-backfill.

Practical Example

Consider a collection schema structured as follows:

plaintextschema:
  type: object
  properties:
    id: { type: integer }
    foo: { type: string, format: date-time }
  required: [id]
key: [/id]

If you materialize this collection into a relational database table, it would look something like this:

plaintextCREATE TABLE my_table (
    id INTEGER PRIMARY KEY,
    foo TIMESTAMPTZ
);

Now, suppose you modify the collection specification by removing the format: date-time from foo:

plaintextschema:
  type: object
  properties:
    id: { type: integer }
    foo: { type: string }
  required: [id]
key: [/id]

You expect the materialized database table to look like this, and it will, but Estuary Flow will create a new table with the same name before backfilling all data into it instead of modifying the existing table.

plaintextCREATE TABLE my_table (
    id INTEGER PRIMARY KEY,
    foo TEXT
);

Limitations and Manual Actions

While Estuary Flow handles most schema changes automatically, some situations may require manual intervention. Always review the updated schema and ensure all components are aligned to prevent data loss and maintain data flow integrity.

Breaking schema changes are primarily caused by modifications to the collection schema, although changes to the collection key or logical partition can also trigger such issues in special cases. Typically, these changes necessitate updates to materializations, not captures. This is because new collection specifications are usually identified from the source, leading to simultaneous edits to the capture and the collection.

Conclusion

Schema evolution in Estuary Flow is a robust feature designed to handle dynamic changes in data schemas, ensuring your data flows remain uninterrupted and accurate.

With automatic updates combined with an understanding of when manual actions are necessary, you can maintain uninterrupted data integration across your systems. Whether dealing with new tables, columns, or changes in data types, Estuary Flow’s schema evolution capabilities provide a reliable solution for managing real-time data streams.

Share this article

Table of Contents

Start Building For Free

About the author

Dani PálmaHead of Data Engineering Marketing

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Daniel focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.