Estuary

How to Stream Data from DynamoDB with Unlimited Backfills

Real-time data pipeline crippled by DynamoDB's 24-hour limit on streams? Here's how we built a DynamoDB streams connector with unlimited backfills.

Jeffrey Richman

When you need to move data between systems in real time, integrating an event stream and a historical backfill is always a challenge. 

And when your source system is DynamoDB, things can get particularly complicated. 

We built a DynamoDB change data capture connector that allows us to avoid one of DynamoDB’s processing limitations — its 24-hour limit on stream retention.

In this article, we’ll take a look at that limitation, how it shows up in most DynamoDB connectors, and how we were able to get around it. 

What is DynamoDB? 

DynamoDB is Amazon’s fast-performing NoSQL database. Because it allows low-latency queries at any scale and is fully managed by Amazon, it’s a popular choice for large, real-time web applications. 

Though DynamoDB does have tables, it structures data as key-value pairs or documents. 

Put this all together, and you’ve got a high-scale, rapidly changing system full of loosely structured data. 

When it comes time to analyze that data with complex queries, DynamoDB itself is not the place to do so. You’ll need to get that data out. Common destinations include data warehouses, like Redshift, and distributed search engines, like Elasticsearch.

What are DynamoDB Streams?

When you’re moving data from DynamoDB to your analytical destination, you often can’t afford to fall behind the rapid changes that happen in DynamoDB.  

Fortunately, DynamoDB makes it easy to get low-latency updates through DynamoDB Streams. 

A DynamoDB Stream is a time-ordered log of change events that have happened on a given table. The Stream reflects these change events in near real time.

Once you’ve got a Stream set up, it’s up to you to connect your streaming provider of choice and move the data elsewhere.

Put another way, a DynamoDB Stream is the thing that allows DynamoDB change data capture (CDC), but it’s not a CDC pipeline unto itself. 
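
To make the shape of those change events concrete, here is a minimal Python sketch that flattens one stream record into a plain dictionary. The record layout (an eventName plus a dynamodb object holding Keys, NewImage, and SequenceNumber, with typed attribute values such as S and N) follows the DynamoDB Streams record format. The helper names plain_value and summarize_record and the sample record are made up for illustration, and only a handful of attribute types are handled.

```python
from decimal import Decimal


def plain_value(attr):
    """Convert one DynamoDB-typed attribute value, e.g. {"S": "abc"}, to plain Python."""
    ((type_tag, value),) = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # numbers are transported as strings
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [plain_value(v) for v in value]
    if type_tag == "M":
        return {k: plain_value(v) for k, v in value.items()}
    raise ValueError(f"unhandled attribute type: {type_tag}")


def summarize_record(record):
    """Reduce a stream record to its operation, key, and new item state (if any)."""
    change = record["dynamodb"]
    # NewImage is absent for REMOVE events, and for streams whose view type
    # does not include new images.
    new_image = change.get("NewImage")
    return {
        "op": record["eventName"],  # INSERT, MODIFY, or REMOVE
        "key": {k: plain_value(v) for k, v in change["Keys"].items()},
        "item": None if new_image is None
        else {k: plain_value(v) for k, v in new_image.items()},
        "sequence": change["SequenceNumber"],
    }


# Made-up sample record, for illustration only.
sample = {
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"user_id": {"S": "u-1"}},
        "NewImage": {"user_id": {"S": "u-1"}, "visits": {"N": "3"}},
        "SequenceNumber": "111",
    },
}
summary = summarize_record(sample)
```

A real pipeline would still need something to fetch these records and ship them to a destination, which is the part a CDC connector provides.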

DynamoDB Streams and Change Data Capture

Now that we’re talking about streaming data, we arrive at our big question:

We know how to capture new data on an ongoing basis, but what about the data that already existed in DynamoDB before we started the Stream?

Capturing historical data from DynamoDB is a straightforward task you can accomplish with a table scan. The tricky part is syncing up the exact moment at which the historical backfill ends and the real-time data stream begins in such a way that you create an accurate, up-to-date picture of every row.
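
The table scan itself can be sketched in a few lines of Python. This assumes a client object whose scan method behaves like the boto3 DynamoDB client: it accepts TableName and an optional ExclusiveStartKey, and returns a page with Items plus a LastEvaluatedKey whenever more pages remain. The function name scan_all_items and the FakeClient stand-in are hypothetical; the fake exists only so the sketch runs without AWS access.

```python
def scan_all_items(client, table_name):
    """Yield every item in a table by following Scan pagination."""
    kwargs = {"TableName": table_name}
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return  # no more pages
        kwargs["ExclusiveStartKey"] = last_key


class FakeClient:
    """Stand-in for a DynamoDB client so the sketch runs without AWS access.

    A real LastEvaluatedKey is the primary key of the last item read;
    this fake uses a simple list offset instead.
    """

    def __init__(self, items, page_size):
        self.items = items
        self.page_size = page_size

    def scan(self, TableName, ExclusiveStartKey=None):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["offset"]
        end = start + self.page_size
        page = {"Items": self.items[start:end]}
        if end < len(self.items):
            page["LastEvaluatedKey"] = {"offset": end}
        return page


fake = FakeClient([{"id": {"N": str(i)}} for i in range(5)], page_size=2)
scanned = list(scan_all_items(fake, "example-table"))
```

Nothing in this loop knows about changes made to the table while the scan is running, which is why the backfill has to be reconciled with the Stream.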

DynamoDB change data capture connectors can automate this problem away on your behalf. These third-party tools hook up to your DynamoDB instance and neatly sync the end of the table scan with the start of the data stream. If the Stream contains updates to rows that were captured as part of the backfill, they’ll pick up those updates, too, arriving at a current and complete record of the DynamoDB table in another system.

There’s only one problem.

Why Most DynamoDB Connectors Have a 24-Hour Backfill Limit

If you pay close attention to the documentation pages for most DynamoDB change data capture connectors, you’ll notice a limitation: they can only backfill data for 24 hours before switching over to streaming. 

This is an issue because, as you’ll remember, DynamoDB is commonly used for large web applications with real-time updates. That means the odds of a table needing more than 24 hours to backfill are pretty high.

So, why does that limitation exist in the first place?

DynamoDB Streams have a 24-hour retention limit. Once a record has been in the stream for 24 hours, it is removed. After all, DynamoDB Streams aren’t supposed to be long-lived storage.

But most DynamoDB connectors have to backfill before they switch to streaming (they can’t do both at once). 

To ensure a smooth transition between backfill and streaming, they need to initiate the DynamoDB Stream first, so it can start recording live changes. 

Next, the connector starts to backfill all the historical data. It makes a note of the time the backfill starts: let’s say 7:00.

Once the backfill is done, it switches over to reading the stream from the noted backfill start time (in our example, 7:00). This creates a complete copy of the table.
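
The backfill-then-stream sequence described above can be simulated in Python with in-memory data. This is a simplified sketch, not any particular connector's code: the function name backfill_then_stream is hypothetical, timestamps are plain minutes since midnight (so 7:00 is 420), and a value of None stands for a deletion.

```python
def backfill_then_stream(scanned_rows, stream_events, backfill_start):
    """Build a table copy: load the scan results, then replay the stream
    events recorded at or after the noted backfill start time."""
    table = dict(scanned_rows)  # key -> value as seen by the scan
    for timestamp, key, value in sorted(stream_events, key=lambda e: e[0]):
        if timestamp < backfill_start:
            continue  # happened before the scan began, so the scan saw it
        if value is None:
            table.pop(key, None)  # deletion
        else:
            table[key] = value
    return table


# The scan began at 7:00 (minute 420) and read "b" before it was updated.
scanned = {"a": 1, "b": 2}
events = [
    (415, "a", 1),     # before the backfill started: skipped
    (425, "b", 20),    # update the scan missed
    (430, "c", 3),     # row created during the backfill
    (435, "a", None),  # row deleted during the backfill
]
copy = backfill_then_stream(scanned, events, backfill_start=420)
```

The replay only works if every event from the noted start time onward is still available in the stream when the backfill finishes.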

But if the backfill takes longer than 24 hours, the events recorded at the start time will no longer be available in the stream. Rather than let you proceed with a gap in your data, most connectors don’t allow backfills that run longer than 24 hours.
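
The arithmetic behind that limit is simple enough to show in Python. In this sketch the function name cutover_is_safe and the example date are arbitrary choices for illustration; the only fact taken from the text is the 24-hour retention window.

```python
from datetime import datetime, timedelta

STREAM_RETENTION = timedelta(hours=24)  # DynamoDB Streams retention window


def cutover_is_safe(backfill_start, backfill_end):
    """Return True if stream records from backfill_start still exist
    when the backfill finishes at backfill_end."""
    oldest_retained = backfill_end - STREAM_RETENTION
    return backfill_start >= oldest_retained


start = datetime(2024, 1, 1, 7, 0)  # arbitrary example date, 7:00
short_backfill = cutover_is_safe(start, start + timedelta(hours=6))
long_backfill = cutover_is_safe(start, start + timedelta(hours=30))
```

With a 6-hour backfill the 7:00 events are still in the stream at cutover; with a 30-hour backfill they were dropped hours earlier.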

Simultaneous Streaming and Backfilling: Avoiding the 24-Hour Limit

Estuary Flow’s DynamoDB CDC connector does not have a 24-hour backfill limit. That’s because it’s able to ingest data from both the stream and the backfill scan at the same time. 

Reconciling the two data sources without a distinct point-in-time “switch” is not trivial, but Flow is able to handle the challenge without duplicate or erroneous data.

It does this through reductions: merging and de-duplicating data documents based on a JSON schema. 

These reductions happen with no input needed from you, but you can read about reductions and their processing guarantees in our docs. 

Say, for example, Flow notes an update to a specific record from the DynamoDB stream. Later, Flow runs into that same record as part of the backfill scan. Flow merges the two and keeps the state found in the stream: although it was captured first, it is the more up-to-date version.
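
The idea in that example can be illustrated with a small Python sketch. This is not Flow's actual reduction logic, which is driven by the JSON schema; it is a simplified stand-in that assumes every document is tagged with a hypothetical source field ("stream" or "backfill") and applies one rule: a backfill document never overrides a stream document for the same key.

```python
def reduce_documents(current, incoming):
    """Merge two documents for the same key.

    Stream documents always win over backfill documents; between two
    documents from the same source, the later arrival wins.
    """
    if current is None:
        return incoming
    if current["source"] == "stream" and incoming["source"] == "backfill":
        return current  # the scan reported an older state; keep the stream's
    return incoming


def materialize(arrivals):
    """Apply documents in arrival order and return the final value per key."""
    state = {}
    for doc in arrivals:
        state[doc["key"]] = reduce_documents(state.get(doc["key"]), doc)
    # Drop keys whose winning document is a deletion.
    return {k: d["value"] for k, d in state.items() if not d.get("deleted")}


# Made-up arrivals: stream and backfill documents interleaved in any order.
arrivals = [
    {"key": "a", "source": "stream", "value": {"qty": 5}},    # live update first
    {"key": "a", "source": "backfill", "value": {"qty": 1}},  # stale scan result
    {"key": "b", "source": "backfill", "value": {"qty": 2}},
    {"key": "b", "source": "stream", "value": {"qty": 9}},    # later live update
    {"key": "c", "source": "backfill", "value": {"qty": 7}},  # never changed
    {"key": "d", "source": "stream", "value": None, "deleted": True},
    {"key": "d", "source": "backfill", "value": {"qty": 4}},  # scanned pre-delete
]
result = materialize(arrivals)
```

Because the rule depends only on where each document came from, the outcome for a key is the same whichever side reaches it first, so no single switchover moment is needed.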

Completing Your Pipeline

Ingesting a real-time data feed with historical context from DynamoDB is no easy task. 

While DynamoDB change data capture connectors can make things easier, most of them create a race against the clock: a 24-hour backfill limit that caps the size of the tables you can capture.

If that limitation cripples your ability to analyze your DynamoDB tables, give Estuary Flow a try — our DynamoDB connector can simultaneously backfill and stream data, so there’s no limit to the size of the table you can capture.

Once you capture your DynamoDB tables (of any size) you can materialize that data in real-time to popular destinations, like Redshift and Elasticsearch. 

Between those two systems, you can transform your data on the fly using Flow’s derivations. They run in real time, are managed from the same dashboard, and replace the step you’d normally handle in AWS Lambda.

Of course, there’s no way to know if a tool is right for you without taking it for a test run. Head to the Estuary Flow web app to register; you can start for free.
