What Are Data Backfills? A Guide + How Estuary Makes Them Seamless

Missing data? Don’t panic. Data backfills let you refresh your pipelines so you can be confident in your data completeness.

A data engineer’s job involves the never-ending quest for pristine, complete data: orderly rows of information without gaps or user-entered errors. Whether you’re recommending a new show or tracking disaster data for mitigation and relief, current, clean data is essential.

The problem, of course, is that data is always changing. There’s always something new to learn. Or something important to add to your calculations that was previously glossed over.

To overcome such obstacles on their data journey, one of the most important tools an engineer can master is the process of backfilling data.

What Are Data Backfills?

In a nutshell, a data backfill is when you fill in data that’s missing from a database or other data asset.

Sounds simple enough.

So, why would data be missing in the first place? There are a number of reasons. Consider these scenarios.

Common Scenarios That Require Data Backfilling

  • Completely new data is discovered to be relevant to the existing data stack
    • For example: Say you perform analytics on complex, interconnected data, such as research for medical or sustainability use cases. New discoveries may require piping in data from previously disconnected sources.
  • A data pipeline only captures incremental data, so the destination doesn’t have the full historical data
    • For example: You set up a CDC integration to capture new changes as they’re occurring on your database to send downstream. Instead of just picking up new changes from the time you turn on the integration, you can backfill to make sure your destination includes the full historical context.
  • Data changes in a source system in a way the capture can’t detect, causing data to get out of sync
    • For example: Say you pull data from an API. You incrementally capture newly created objects from this source system, but when a resource changes, the API doesn’t update the object’s modified time. You won’t be able to catch this discrepancy unless you fully refresh your data.
  • Outages cause some data to be skipped
    • For example: If you’re building your data pipelines yourself, it’s tempting to build for best-case scenarios, and your pipeline may not gracefully handle cases where the source or destination system is temporarily unavailable. Or your pipeline itself might fail catastrophically. Any data that’s missed in the meantime will have to be backfilled if the pipeline can’t pick back up where it left off.

These different reasons for backfilling can also result in differences in how you backfill. The API with no indication of the last modified time may need to be fully refreshed to ensure your destination system is fully up-to-date with the source. On the other hand, you may be able to pinpoint a time range for temporary failures, and only backfill data that was created or modified within that time range.
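
To make that distinction concrete, here’s a minimal sketch of the two approaches against a relational source. The orders table, the updated_at column, and the named-parameter query style are assumptions for illustration, not a prescribed setup:

```python
# Hypothetical table and column names, shown for illustration only.
FULL_REFRESH_SQL = "SELECT * FROM orders"
BOUNDED_BACKFILL_SQL = """
    SELECT * FROM orders
    WHERE updated_at >= :window_start
      AND updated_at <  :window_end
"""

def backfill(source_conn, dest_conn, window_start=None, window_end=None):
    """Copy rows from the source into the destination.

    With no window, this is a full refresh: every row is re-read and
    re-written. With a window, only rows modified in that range are
    touched, which is far cheaper on large tables -- but it only works
    if updated_at is reliably maintained by the source.
    """
    cur = source_conn.cursor()
    if window_start is not None and window_end is not None:
        cur.execute(BOUNDED_BACKFILL_SQL,
                    {"window_start": window_start, "window_end": window_end})
    else:
        cur.execute(FULL_REFRESH_SQL)
    for row in cur:
        upsert(dest_conn, row)

def upsert(dest_conn, row):
    """Insert-or-update one row in the destination (driver-specific)."""
    ...
```

The full refresh guarantees the destination matches the source even when modification times are unreliable; the bounded version trades that guarantee for far less data movement.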

Why Data Completeness Matters in Real-Time Analytics & AI

Backfilling data can enhance analytics

Why backfill data? The general answer is obvious: to improve data completeness. But why is data completeness such a necessary quality, and why is it particularly crucial for modern data pipelines?

Improving a dataset’s completeness raises the data’s overall quality and supports its consistency and accuracy.

This in turn improves those attributes in any analysis performed against the dataset, resulting in more trustworthy reports.

Data trustworthiness becomes extraordinarily important in this age of automation, where datasets can be incomprehensibly large. No one’s going to manually review a spreadsheet report spanning 10 GB of data to ensure every entry is as expected, much less the terabytes some companies process daily. Data has become bigger than anyone can keep track of.

Data has also become faster than anyone can keep up with. As more data pipelines become real-time and continuous rather than simply compiling batch reports on a specified schedule, those pipelines can introduce more opportunities for downtime, silent failures, and data that therefore slips through the cracks. It’s bad enough that cobbled-together pipelines have all-too-common risks of outages; it’s worse when those outages happen silently, without anyone knowing that their data has become out of date.

These are obstacles to data completeness, and they can have unpredictable and alarming consequences when combined with technology that relies entirely on data to make its predictions: AI. AI models all too often become black boxes, and any missing data can be encoded right into the model as bias.

It’s not always easy to tell what data is missing, or whether you’re missing data from a handful of skipped rows or whether entire segments were overlooked when compiling data to begin with. But when that missing data is noticed, it’s important to incorporate it into the larger dataset, and backfill.

Key Considerations Before Running a Data Backfill

So, you’ve discovered a chunk of data that’s missing from your downstream systems and want to initiate a backfill. But you don’t want to rush into a haphazard solution. Getting the backfill wrong can cause more problems than it solves.

There are different types of backfills and different ways to run them. Some options will be more suitable for certain situations than others, and we’ll go over what to choose when a little later.

In general, before you start a backfill, you’ll want to consider:

  • What is the purpose of the backfill? Are you backfilling a whole new resource to load into your data architecture? A single table? A time-delimited number of rows?
  • Can downstream systems tolerate any downtime? Do they tightly rely on a certain database instance or table structure?
  • Do you have an estimate of what your cloud resource cost will be for this backfill? Are there ways you could cut down on that bill by choosing a different method of backfilling or limiting the amount of refreshed data?

These questions will help you determine which data backfill strategy is safest, most efficient, and most aligned with your infrastructure.

How to Minimize Cost and Downtime When Backfilling

Backfills can consume a lot of resources, including time, compute, storage, and ingress/egress costs. Because you’re refreshing a lot of data all at once, this can lead to an unexpectedly large cloud bill at the end of the process.
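
As a rough back-of-envelope check before kicking off a backfill, you can multiply the data you expect to move by your provider’s rates. The figures below are placeholders, not real pricing; the point is that bounding the backfill shrinks every term in the estimate:

```python
# Back-of-envelope backfill cost estimate. Every rate here is a
# hypothetical placeholder -- substitute your provider's actual pricing
# and your own row counts and runtime measurements.
ROWS_TO_BACKFILL = 50_000_000
AVG_ROW_BYTES = 200
EGRESS_PER_GB = 0.09       # $/GB moved out of the source, placeholder
WAREHOUSE_PER_HOUR = 2.00  # $/hour of destination compute, placeholder
EST_HOURS = 3              # expected runtime, placeholder

gb_moved = ROWS_TO_BACKFILL * AVG_ROW_BYTES / 1e9
estimate = gb_moved * EGRESS_PER_GB + EST_HOURS * WAREHOUSE_PER_HOUR
print(f"~{gb_moved:.1f} GB moved, rough cost ~${estimate:.2f}")
```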

Some backfills refresh data by dropping existing tables and recreating them wholesale. This may be necessary if your data has been corrupted or is otherwise so out of date that it’s of little use as-is. However, this type of backfill would not be ideal for:

  • Very large databases or tables (with TBs of data)
  • Situations where downtime on recreated destination tables would be unacceptable
  • Cases where the destination table contains historical data you’d like to keep that the source system no longer has access to

You can limit the resources you use for a backfill by adopting a more incremental approach or partial backfill strategy. Instead of starting all over from the beginning, try only updating resources:

  • Within a certain time range (requires accurate modified time fields)
  • Within a certain ID range (assumes rows outside the range will not have been modified)

These approaches can significantly reduce your cloud costs, limit system stress, and help preserve uptime without sacrificing data completeness.
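
For instance, here’s a minimal sketch of an ID-bounded backfill that works through the range in small batches rather than one giant query, so the source database isn’t hammered all at once. The orders table, batch size, and pause interval are illustrative assumptions:

```python
import time

BATCH_SIZE = 5_000    # tune to what the source can comfortably serve
PAUSE_SECONDS = 0.5   # brief pause between batches to limit source load

def backfill_id_range(source_conn, dest_conn, start_id, end_id):
    """Re-copy only rows whose IDs fall inside [start_id, end_id].

    Assumes rows outside the range were not modified, so the rest of
    the destination table can be left untouched (no drop-and-recreate).
    """
    cur = source_conn.cursor()
    for batch_start in range(start_id, end_id + 1, BATCH_SIZE):
        batch_end = min(batch_start + BATCH_SIZE - 1, end_id)
        cur.execute(
            "SELECT * FROM orders WHERE id BETWEEN ? AND ?",
            (batch_start, batch_end),
        )
        rows = cur.fetchall()
        if rows:
            write_to_destination(dest_conn, rows)  # merge, don't truncate
        time.sleep(PAUSE_SECONDS)

def write_to_destination(dest_conn, rows):
    """Upsert a batch into the destination (destination-specific)."""
    ...
```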

How Can Backfills Go Wrong?

By this point, it’s probably obvious that backfills can cause some headaches unless they’re properly planned for. With all the systems, settings, and configurations out there, there are a number of stumbling blocks you can run into.

The data backfill process can be more complex than it seems on the surface

Here are some of the most common issues to watch out for:

  • High cost: As mentioned, all the extra data movement and compute adds up. Depending on the systems you’re working with, costs can climb quickly and aren’t always transparent up front.
  • Resource intensive: Backfills can negatively impact your source system. If you’re backfilling an OLTP-OLAP pipeline, for example, re-reading all that data to send downstream can add stress to your transactional application database. If you can, you may want to time your backfill so it doesn’t coincide with high application traffic.
  • Potential downtime: If your backfill throughput really hammers your source system, or if you’re dropping and recreating your destination tables, you may encounter some downtime during backfilling. Trying to reduce downtime for one system may increase it for the other.
  • Outdated schemas: Especially if your backfill is to add entirely new swathes of data or if you’re backfilling to reset data that went off the rails, you may not want to use your old schemas. What should be the source of truth for the backfill and going forward? Will any of your data cause collisions if you try to start up the backfill as-is?
  • Stalling out: What do you do if your pipeline chokes in the middle of a backfill? Maybe it encounters an unexpected data type, can’t handle it, and crashes. Can you seamlessly pick back up in the middle? Can you start over? Do you need to roll back any changes first? (One safeguard is sketched just after this list.)
  • Human error: There’s always, of course, the perennial possibility of plain old human error. Identifying a section of missing data doesn’t necessarily mean that all the missing data was identified. Hand-built scripts may port the data incorrectly. Timezone mismatches, data type mismatches, SQL statements that forget the WHERE clause… Almost anything can happen.
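
A common safeguard against those last two pitfalls is to make the backfill resumable and idempotent: process the data in batches, record a checkpoint after each one, and write with upserts so a rerun can’t duplicate rows. Here’s a minimal sketch, with a hypothetical checkpoint file and caller-supplied fetch and upsert functions:

```python
import json
import os

CHECKPOINT_FILE = "backfill_checkpoint.json"  # hypothetical location

def load_checkpoint(default_start):
    """Return the next ID to process, so a crashed run can resume."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_id"]
    return default_start

def save_checkpoint(next_id):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_id": next_id}, f)

def resumable_backfill(fetch_batch, upsert_batch, start_id, end_id,
                       batch_size=5_000):
    """Process [start_id, end_id] in batches, checkpointing after each.

    fetch_batch(lo, hi) reads rows from the source; upsert_batch(rows)
    merges them into the destination. Because writes are upserts rather
    than blind inserts, re-running a partially completed batch is safe.
    """
    current = load_checkpoint(start_id)
    while current <= end_id:
        hi = min(current + batch_size - 1, end_id)
        rows = fetch_batch(current, hi)
        if rows:
            upsert_batch(rows)
        current = hi + 1
        save_checkpoint(current)
```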

Thankfully, you don’t have to reinvent the wheel every time you want to backfill some data. Pre-built solutions can dial down the complexity on a topic that should be (but often isn’t) straightforward.

How Estuary Simplifies and Automates Data Backfills

Estuary is an ETL pipeline platform that focuses on simplifying data movement. Smart defaults with automatic discovery and evolution make it a snap to set up and stay current with your data architecture. And multiple backfill types and options are embedded directly into the platform.

Here’s how Estuary helps streamline the backfill process:

Automatic Initial Backfill

Whenever you create a new connector in Estuary Flow, the platform automatically begins backfilling historical data before seamlessly switching to streaming new events. This all-in-one behavior means it’s simple to add new pieces to your data architecture whenever you need to.

The connector overview page shows how many tables have finished backfilling.

Capture details page, including the connector backfill status

Of note, Estuary applies automatic schema inference to new connectors. When connecting to a source that has its own schema, Estuary bases the inferred schema on that information. Estuary can also fully infer a schema for schema-less sources based on the types of data it’s reading.

Resetting Pipelines

To backfill existing data resources, your best bet is Estuary’s dataflow reset feature. A dataflow reset backfills everything in a pipeline so that all your resources end up with up-to-date data and schemas.

As time goes on, schemas can evolve. In Estuary, data types for fields can widen (such as moving from an integer to a number type) as Flow processes more data, but fields can’t completely switch data types. While this makes for smart defaults, the behavior isn’t always ideal when underlying data undergoes radical restructuring, or when bad data makes its way into the pipeline.
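
To picture what widening means in practice, here’s a toy sketch of the general idea. This is an illustration only, not Estuary’s actual schema inference code:

```python
# Toy illustration of type widening: "integer" can widen to "number"
# because every integer is a valid number, but unrelated types (say,
# "integer" and "string") have no single narrow type to widen into.
WIDENINGS = {
    frozenset({"integer"}): "integer",
    frozenset({"number"}): "number",
    frozenset({"integer", "number"}): "number",  # integer widens to number
}

def widen(observed_types: set[str]) -> str:
    """Return the narrowest single type covering everything observed."""
    key = frozenset(observed_types)
    if key in WIDENINGS:
        return WIDENINGS[key]
    # No clean widening exists; rather than silently switching types,
    # a real system should surface this for review.
    raise ValueError(f"cannot widen {observed_types} to one type")

print(widen({"integer"}))            # integer
print(widen({"integer", "number"}))  # number: the field widened over time
```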

With a dataflow reset, you aren’t beholden to old, outdated schemas. As your data is ingested fresh, schemas get recalculated and applied to brand new destination tables.

Ultimately, with one press of a button, dataflow reset will:

  • Reread data from the source system
  • Recalculate schemas associated with the source system
  • Replace collections with fresh source data
  • Recreate derivations (transformations) based on affected collections
  • Recreate destination tables across all associated materializations with the fresh data

It’s a hassle-free solution that will work with most use cases, refreshing data without manually tinkering with schemas or trying to hunt down all affected resources.

Flexible Options: Advanced Backfills

While dataflow reset can be massively convenient, there are times when completely clearing the slate and starting fresh isn’t the best option. Remember those downtime situations discussed earlier? Backfilling involves enough complexity that no single solution will ever perfectly fit all use cases, no matter how smart the default behavior is.

Estuary accounts for this by providing advanced backfill options: incremental backfill and materialization backfill.

Backfill options in Estuary: incremental backfill, materialization backfill, and dataflow reset
  1. Incremental backfill is an alternative source-level option to dataflow reset. Instead of resetting the entire pipeline, it simply refreshes data from the source system. These changes percolate through the pipeline as normal, merging into destination tables without dropping them. This can be useful if you’re unable to tolerate downtime on your destination tables.
  2. Materialization backfill focuses on the other end of the pipeline. Instead of starting at the source system, it simply takes existing collection data and recreates specified destination tables based on that data. This can be helpful if you’ve pinpointed a handful of problem tables that would benefit from a data refresh from existing collections.

If you’re unsure which option best fits your use case, you can find a backfill selection guide in our docs. In general, though:

  • Dataflow reset is for standard use cases and is a hassle-free way to update all related resources
  • Incremental backfill can cut out destination downtime, and is useful for certain other edge cases like reestablishing consistency with a replication slot
  • Materialization backfill can rebuild destination tables when you know existing collections are already correct and complete

Step-by-Step: How to Backfill a Pipeline Using Dataflow Reset

As indicated, Estuary suggests using a dataflow reset for most backfill use cases. To demonstrate just how streamlined it is to use a dataflow reset, let’s check out the process step by step.

Or, if you already know that a dataflow reset won’t work for you, you can find steps to perform advanced backfills in our docs. They’re just as simple to start up.

Prerequisites

This guide assumes you already have a complete pipeline set up with Estuary. Examples and instructions for setting up these resources in many different configurations are available; check out the tutorials section on our blog for more.

Here’s what you’ll need for today:

  • An Estuary account
  • An active capture for a source system
  • At least one collection associated with the capture
  • An active destination connector that materializes the collection(s)

Step 1: Begin the dataflow reset

It only takes a few steps to initiate a dataflow reset:

  1. Log into your Estuary account and navigate to the sources page.
  2. Find the capture you’d like to reset in the table of sources and select the Edit button.
  3. In the Target Collections section, click the Backfill button.
    • By default, the backfill mode will be set to Dataflow Reset.
Select between dataflow reset and incremental backfill when backfilling a source
  4. Save and Publish your changes.

Estuary will begin to re-extract source data, recalculate schemas, and drop and recreate connected resources.

Step 2: Monitor resources and confirm success

Once you perform the dataflow reset, you can check in on your resources. 

Capture Details

The capture details page includes the connector status, which provides your capture’s state, such as Capture started or Streaming change events. Particularly useful during backfills, it also notes how many bindings have finished their backfill. Once the connector status indicates that all # bindings are backfilled, the capture has completed its part in the process.

This page also includes connector usage stats, which can be a helpful gauge if you know how much data you’re expecting to re-extract.

Collection Details

Individual collection details pages will provide a data preview where you can check on the data being read in.

Switching to the Spec tab will also display the current collection schema, including inferred data types. This can help confirm that schema inference was reset correctly.

Materialization Details

The materialization details page (or pages, if your capture feeds multiple destinations) maps to the last step in the pipeline and the last step of the backfill process. Here, connector usage stats show how much data has been read by the materialization.

These dashboard details will help you track your resources as they go through the full dataflow reset process.

Summary

Backfills are an important way for data engineers to keep data orderly and fill in any gaps. Despite, or because of, that importance, they can be quite complex, with different considerations around data size, downtime tolerance, evolving data types, and human error.

An automated, pre-built tool to ease the process along can therefore be very beneficial, saving on time and costly mistakes. For example, Estuary is purpose-built to simplify data transfer. Backfills constitute an essential part of the platform, from automatic backfills on capture creation to dataflow reset’s all-in-one backfills with intelligent defaults.

Whether you want a backfill that’s as simple as clicking a button or advanced options that offer more fine-grained control for specific use cases, Estuary has a solution.

Try the platform out for free, discover our supportive community, and learn the latest when you follow us on LinkedIn.

FAQs

How can you backfill data?

You can backfill data by running a one-off script to copy the required data into your destination, by building a custom pipeline, or by using a specialty tool like Estuary Flow. Estuary provides several methods to backfill data depending on how much of the pipeline you wish to affect.

Do you have to manually rework schemas when backfilling?

If you use Estuary, no. Even if your data sources have changed dramatically, you can use a dataflow reset to get your pipeline in shape. The dataflow reset will draw up new schemas based on re-extracted source data so you don’t need to try to manually match up data types, and will recreate downstream resources for you.

Does a backfill have to refresh the entire dataset?

No. In fact, in some cases it can be far more resource-efficient and cost-effective to bound backfills to specific tables or timelines. For example, when a new data stream is added to a capture in Estuary, Estuary can automatically backfill that single binding. Incremental backfill and materialization-level backfill can also help limit the scope of the backfill.

About the author

Emily Lucek, Technical Content Creator

Emily is a software engineer and technical content creator with an interest in developer education. She has experience across Developer Relations roles from her FinTech background and is always learning something new.
