How to Validate Data Replication Pipelines With data-diff

Need to validate a data replication pipeline with the data-diff tool for the first time? Start here.

Keeping your application data safe and sound is critical to the success of your company. Losing only a fraction of your app data can frustrate your customers and cause them to turn away from using your product. 

Unfortunately, many kinds of incidents can lead to data loss, such as:

  • Hardware failures
  • Electrical outages
  • Cyber-attacks

…and many more. Implementing a data replication pipeline is one way to mitigate the impact of such incidents. Your application data can live on multiple data nodes across multiple geographical locations, so if the database on one node is damaged, the other running nodes can back it up. Keeping your data in multiple regions also boosts the performance of your app: users can access the database node closest to their location.

However, some data replication methods can create inconsistencies between the source and the replicated data. This problem has various causes, such as hard-deleting a row in the source database or failing to update the timestamp on an updated row.

You can monitor and address this problem by validating the replicated data. In this article, you’ll learn how to validate a data replication pipeline using data-diff.

What is a data replication pipeline?

A data replication pipeline is a series of steps that replicate data from one data source to multiple targets. Replicating data involves extracting data from a source system, transforming it, and loading it into the destination systems. The aim of a data replication pipeline is to achieve data consistency and availability across multiple locations or systems.

Data replication pipelines are also implemented to provide real-time or near-real-time access to data across multiple systems. This can include scenarios such as creating data backups or enabling high availability and disaster recovery.

You can build data replication pipelines yourself, or source a data pipeline tool from a vendor. 

Why validate data replication pipelines?

A high-quality data pipeline tool will take care of data validation on your behalf. But if you build and manage your own data pipelines, it’s critical that you validate your data replication processes.

Inaccurate data replication leads to mismatched data across data nodes, which can tremendously impact the customer experience. Imagine an e-commerce platform in which users from Singapore still see the price of a specific shoe as $200, while their friends in the USA see the updated price of $300.

By validating data replication pipelines, you can also detect performance issues related to the replication process or check the data replication readiness to recover databases from disaster events such as natural disasters, network outages, and cyberattacks.

What is data-diff?

Datafold, an automated data testing platform, created an open-source tool called data-diff that helps developers like you easily detect the differences between two databases.

data-diff is distributed as a Python package and supports a number of databases, including MySQL, PostgreSQL, and Snowflake.

data-diff can also integrate with your data replication pipeline through a CI tool, which allows you to check the replication process continuously, as sketched below.
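
For example, here’s a minimal sketch of what such a CI check might look like as a shell script. The SOURCE_URI, TARGET_URI, TABLE, and KEY variables are hypothetical placeholders supplied by your CI environment, and the script assumes data-diff’s default behavior of printing differing rows to stdout (empty output when the tables match):

plaintext
#!/usr/bin/env bash
# Fail the CI job if a replicated table diverges from its source.
# SOURCE_URI, TARGET_URI, TABLE, and KEY are placeholders set by the CI environment.
set -euo pipefail
DIFF_OUTPUT=$(data-diff "$SOURCE_URI" "$TABLE" "$TARGET_URI" "$TABLE" --key-columns "$KEY")
if [ -n "$DIFF_OUTPUT" ]; then
  echo "Replication mismatch detected:"
  echo "$DIFF_OUTPUT"
  exit 1
fi
echo "Source and target match."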

For simpler, smaller-scale data replication pipelines, it’s a great way to catch data replication problems.

Demo scenario

To better understand how you can validate data replication pipelines with data-diff, let’s use a hands-on demo scenario. You’ll create two separate machines and run PostgreSQL databases on these machines.

  • One primary node that acts as the main data source. The software application will only update the data on this database instance. The app will query data from both the primary node and the replica node to improve the application performance.
  • One replica node that acts as the backup data source. The replica node will continuously sync with the primary node if there are any changes in the primary database. The software application will not update data on the replica node.

Prepare the databases

To prepare the databases for the demo scenarios, you need to:

  • Prepare two Ubuntu machines to set up PostgreSQL databases on them. You can use real physical servers or virtual machines with the help of VirtualBox.
  • Install PostgreSQL on these two machines. Let’s use PostgreSQL version 14; a sample install sketch follows this list.
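
On Ubuntu, the installation might look like the following (assuming PostgreSQL 14 is available in your release’s package repositories; on older releases you may need to add the PostgreSQL apt repository first):

plaintext
# Run on both machines: install PostgreSQL 14 and start the service.
sudo apt update
sudo apt install -y postgresql-14
sudo systemctl enable --now postgresql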

Step 1: Update the listen_addresses on the primary database

From the primary node, run the following command to open the primary database’s configuration file:

plaintext
sudo nano /etc/postgresql/14/main/postgresql.conf

Uncomment the line for listen_addresses and update the content as below:

plaintext
listen_addresses = '*'

Step 2: Create a replica role on the primary database

Connect to psql on the primary node (for example, by running sudo -u postgres psql), then run the following command to create a new role named estuary with estuary as the password:

plaintext
CREATE ROLE estuary WITH REPLICATION PASSWORD 'estuary' LOGIN;

Add the following lines to the end of /etc/postgresql/14/main/pg_hba.conf to:

  • Allow your replica node to connect with the primary node.
  • Allow other nodes to connect with your primary node so that you can perform the data-diff test.
plaintext
host replication estuary 192.168.9.227/32 md5
host all all 0.0.0.0/0 md5
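
PostgreSQL only picks up these configuration changes after a reload or restart (the listen_addresses change in particular requires a restart), so restart the service on the primary node:

plaintext
# Apply the listen_addresses and pg_hba.conf changes.
sudo systemctl restart postgresql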

Step 3: Clean the current data on the replica database

Before setting up the replica database to sync with the primary database, you need to clean up its current data.
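
First, stop the PostgreSQL service on the replica node so that nothing is using the data directory while you delete it:

plaintext
# Run on the replica node before touching the data directory.
sudo systemctl stop postgresql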

To remove all current data on the replica node, run the following command:

plaintext
sudo -u postgres bash -c 'rm -rf /var/lib/postgresql/14/main/*'

Step 4: Set the replica database to connect with the primary database

From the replica node, run the following command to start syncing the replica database with the primary database (replace primary-ip-addr with your primary node’s IP address):

plaintext
sudo -u postgres pg_basebackup -h primary-ip-addr -p 5432 -U estuary -D /var/lib/postgresql/14/main/ -Fp -Xs -R

When prompted, enter the password for the estuary role, which is estuary too.
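
For reference: -Fp writes the backup in plain format, -Xs streams the write-ahead log while the backup runs, and -R writes the replication settings (standby.signal plus the primary connection info) so the node starts up as a replica. Once the backup finishes, start PostgreSQL on the replica again:

plaintext
# The replica connects to the primary and begins streaming changes.
sudo systemctl start postgresql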

To test whether you have successfully synced the two databases, run the following command on the primary database:

plaintext
sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"

You should see a similar result as below:

data-diff - Successfully set up the syncing between the two databases
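
The exact values will differ, but a healthy setup lists your replica’s address with a state of streaming, along these lines:

plaintext
  client_addr  |   state
---------------+-----------
 192.168.9.227 | streaming
(1 row)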

Now that you have successfully synced the two databases, let’s move on to validate the data between these two databases using data-diff.

Add data to the primary database

Now on the primary node, you have a database named estuary created. (If it doesn’t exist yet, you can create it with CREATE DATABASE estuary;.)

data-diff - estuary database is created on primary db

Checking the replica node, you also see the estuary database, which was automatically created to match the primary database.

data-diff - estuary database is created on replica db

From the primary node, run the following command to connect to the PostgreSQL database named estuary:

plaintext
sudo -u postgres psql -d estuary

Then run the following command to create a new table named bloggers:

plaintext
CREATE TABLE bloggers (
    user_id serial PRIMARY KEY,
    username VARCHAR ( 50 ) UNIQUE NOT NULL,
    password VARCHAR ( 50 ) NOT NULL,
    email VARCHAR ( 255 ) UNIQUE NOT NULL,
    created_on TIMESTAMP NOT NULL,
    last_login TIMESTAMP
);

Grant permissions for the estuary role to access the bloggers table, so that you can run data-diff later using the estuary role:

plaintext
GRANT ALL ON bloggers TO estuary;

Add some data to the table so you have something to validate:

plaintext
INSERT INTO bloggers(username,password,email,created_on,last_login) VALUES ('donaldle','donaldle','donaldle@gmail.com',current_timestamp,current_timestamp);
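
To double-check that streaming replication carried the row across, you can also query the replica node directly:

plaintext
# Run on the replica node; the inserted row should appear almost immediately.
sudo -u postgres psql -d estuary -c "SELECT user_id, username, email FROM bloggers;"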

You have now created the table and added some values to it. Let’s move on to see how to validate the data between the primary database and the replica database.

Validate the data between the two databases using data-diff

From another machine (not the machines that run the databases), install the data-diff tool by running the following command:

plaintext
pip install data-diff

Then install the PostgreSQL driver so that data-diff can access the PostgreSQL databases.

plaintext
pip install psycopg2
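
If pip fails to build psycopg2 from source (it needs the libpq headers and a C compiler), the prebuilt wheel is an easier alternative:

plaintext
# Alternative driver install: prebuilt binary wheel, no compilation needed.
pip install psycopg2-binary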

Success case

If your host machine runs Windows, run the following command to validate the data in the bloggers tables of the primary database and the replica database. (On Linux or macOS, the command is data-diff rather than data-diff.exe.)

plaintext
data-diff.exe postgresql://estuary:'estuary'@primary-ip:5432/estuary bloggers postgresql://estuary:'estuary'@replica-ip:5432/estuary bloggers --key-columns='user_id'

By default, data-diff uses id as the key column. Since your bloggers table doesn’t have an id column, you need to specify user_id via --key-columns instead.

Running the above command, you should see no differences reported. This means the data in the replica database matches the primary database.

data-diff - data is matched (short output)

If you want to see more details about the validation, you can add the --verbose flag.

plaintext
data-diff.exe postgresql://estuary:'estuary'@primary-ip:5432/estuary bloggers postgresql://estuary:'estuary'@replica-ip:5432/estuary bloggers --key-columns='user_id' --verbose

You should see output similar to the below:

data-diff - Validate data using data-diff with verbose option

Now that you have successfully validated the data between the two bloggers tables, let’s try to simulate a mismatched database and then validate the tables using data-diff.

Failed case

Let’s say the replica database loses its network connection to the primary during an outage while new rows are still being written to the primary. One way to simulate this divergence for the demo is to pause WAL replay on the replica and then insert a row on the primary (the pause is only a stand-in for a real outage):
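
plaintext
# On the replica node: pause WAL replay to simulate the lost connection.
sudo -u postgres psql -c "SELECT pg_wal_replay_pause();"
# On the primary node: insert a new row; it won't reach the paused replica.
sudo -u postgres psql -d estuary -c "INSERT INTO bloggers(username,password,email,created_on,last_login) VALUES ('janedoe','janedoe','janedoe@gmail.com',current_timestamp,current_timestamp);"
# To resume syncing later: sudo -u postgres psql -c "SELECT pg_wal_replay_resume();"

Once the two databases have diverged, run the data-diff command from your validation machine to compare the two tables: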

plaintext
data-diff.exe postgresql://estuary:'estuary'@primary-ip:5432/estuary bloggers postgresql://estuary:'estuary'@replica-ip:5432/estuary bloggers --key-columns='user_id' --verbose

You’d see output similar to the below, showing one mismatched row in the replica database (user_id 7 in this run; the exact id depends on how many rows you’ve inserted).

data-diff - Mismatched on data row number 7

Checking the primary database, you can see that the row with user_id 7 exists, but the replica database has not synced it yet.

data-diff - Checking data on the primary database

With the help of data-diff, you’re able to identify mismatched data when replicating between the primary database and the replica database.

Conclusion

Data replication is a critical step in protecting your data and keeping your applications reliable. If you prefer to build your own data pipelines, you’ll also need to validate the data replication process. That’s where a tool like data-diff comes in handy.

If you prefer to reduce the amount of manual intervention in your replication pipelines, and to let down your guard a bit knowing that your data pipeline platform is monitoring itself, a managed data pipeline platform might be a better fit.

Estuary Flow helps you build real-time data replication pipelines from a low-code UI. Flow is based on exactly-once processing to ensure data accuracy, and its dashboard alerts you to any data replication problems.

Learn how to use Flow to replicate data from databases like Postgres, MySQL, MongoDB, and many more.

Rather try it yourself? You can build your first pipeline for free!


About the author

Jeffrey Richman

With over 15 years in data engineering, Jeffrey is a seasoned expert in driving growth for early-stage data companies, focusing on strategies that attract customers and users. His extensive writing provides insights to help companies scale efficiently and effectively in an evolving data landscape.
