Keeping your application data safe and sound is critical to the success of your company. Losing only a fraction of your app data can frustrate your customers and cause them to turn away from using your product.
Unfortunately, there are many incidents that can lead to data loss, such as:
- Hardware failures
- Electrical outages
- Cyber-attacks
…and many more. Implementing a data replication pipeline is one way to mitigate the impact of such incidents. Your application data can live on multiple data nodes across multiple geographical locations, so if the database on one node is impacted, other running nodes can back it up. Keeping your data in multiple regions also boosts the performance of your app, since users can access the database node closest to their location.
However, some data replication methods can create inconsistencies between the source and the replicated data. This problem can have various causes, such as a hard delete of a row in the source database or a missing timestamp update on a modified row.
You can monitor and address this problem by validating the replicated data. In this article, you’ll learn how to validate a data replication pipeline using data-diff.
What is a data replication pipeline?
A data replication pipeline is a series of steps that replicate data from one source to multiple targets. Replicating data involves extracting data from a source system, transforming it, and loading it into the destination systems. The aim of a data replication pipeline is to achieve data consistency and availability across multiple locations or systems.
Data replication pipelines are also implemented to provide real-time or near-real-time access to data across multiple systems. This can include scenarios such as creating data backups or enabling high availability and disaster recovery.
You can build data replication pipelines yourself, or source a data pipeline tool from a vendor.
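To make the idea concrete, here is a deliberately naive, one-shot sketch of the extract-and-load part of such a pipeline. The hostnames, credentials, database, and table names are hypothetical placeholders; a real pipeline runs continuously and handles schema changes and conflicts.
# Extract one table from the source database and load it into the target.
# Assumes the table does not already exist in the target database.
pg_dump --table=users "postgresql://app_user:app_pass@source-host:5432/appdb" \
  | psql "postgresql://app_user:app_pass@target-host:5432/appdb"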
Why validate data replication pipelines?
A high-quality data pipeline tool will take care of data validation on your behalf. But if you build and manage your own data pipelines, it’s critical that you validate your data replication processes.
Inaccurate data replication leads to mismatched data across data nodes, which can severely impact the customer experience. Imagine an e-commerce platform in which users in Singapore still see the price of a specific shoe as $200, while their friends in the USA see the updated price of $300.
By validating data replication pipelines, you can also detect performance issues in the replication process and confirm that your replicas are ready to recover your databases from disaster events such as natural disasters, network outages, and cyberattacks.
What is data-diff?
Datafold is an automated data testing platform that created an open-source tool called data-diff, which helps developers like you easily detect the differences between two databases.
data-diff is distributed as a Python package and supports a number of databases, such as MySQL, PostgreSQL, and Snowflake.
data-diff can also be integrated into your data replication pipeline through a CI tool, which allows you to continuously check the data replication process.
For simpler, smaller-scale data replication pipelines, it’s a great way to avoid data replication problems.
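As a rough illustration of that CI integration, here is a minimal sketch of a shell step that fails a CI job whenever data-diff reports differing rows. The connection strings, table name, and key column are placeholders for your own setup.
# Compare the same table in the source and the replica; data-diff prints
# one line per differing row, so any output means a mismatch.
diff_output=$(data-diff \
  "postgresql://user:password@source-host:5432/appdb" users \
  "postgresql://user:password@replica-host:5432/appdb" users \
  --key-columns='id')

# Fail the CI job if any difference was reported.
if [ -n "$diff_output" ]; then
  echo "Replication mismatch detected:"
  echo "$diff_output"
  exit 1
fi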
Demo scenario
To better understand how you can validate data replication pipelines with data-diff, let’s use a hands-on demo scenario. You’ll create two separate machines and run PostgreSQL databases on these machines.
- One primary node that acts as the main data source. The software application will only update the data on this database instance. The app will query data from both the primary node and the replica node to improve the application performance.
- One replica node that acts as the backup data source. The replica node will continuously sync with the primary node if there are any changes in the primary database. The software application will not update data on the replica node.
Prepare the databases
To prepare the databases for the demo scenarios, you need to:
- Prepare two Ubuntu machines to set up PostgreSQL databases on them. You can use real physical servers or virtual machines with the help of VirtualBox.
- Install PostgreSQL on these two machines. Let’s use PostgreSQL version 14.
Step 1: Update the listen_addresses on the primary database
On the primary node, run the following command to open the configuration file so you can update the listen_addresses setting:
sudo nano /etc/postgresql/14/main/postgresql.conf
Uncomment the line for listen_addresses and update the content as below:
listen_addresses = '*'
Step 2: Create a replica role on the primary database
Connect to the database on the primary node (for example, with sudo -u postgres psql) and run the following command to create a new role named estuary with estuary as the password:
CREATE ROLE estuary WITH REPLICATION PASSWORD 'estuary' LOGIN;
Then add the following lines to the end of /etc/postgresql/14/main/pg_hba.conf (replacing 192.168.9.227 with your replica node’s IP address) to:
- Allow your replica node to connect with the primary node.
- Allow other nodes to connect with your primary node so that you can perform the data-diff test.
host replication estuary 192.168.9.227/32 md5
host all all 0.0.0.0/0 md5
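The listen_addresses change only takes effect after PostgreSQL restarts, so restart the service on the primary node before continuing. A quick sketch, assuming Ubuntu’s default systemd unit name:
sudo systemctl restart postgresql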
Step 3: Clean the current data on the replica database
Before setting up the replica database to sync with the primary database, you need to clean up its current data.
To remove all current data on the replica node, run the following command:
sudo -i
rm -r /var/lib/postgresql/14/main/*
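Note that the rm command above assumes PostgreSQL is not running on the replica node. If it is, stop the service first so the data directory isn’t in use; a sketch, again assuming Ubuntu’s default systemd unit name:
sudo systemctl stop postgresql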
Step 4: Set the replica database to connect with the primary database
From the replica database, run the following command to start syncing the replica database to the primary database:
sudo -u postgres pg_basebackup -h primary-ip-addr -p 5432 -U estuary -D /var/lib/postgresql/14/main/ -Fp -Xs -R
Then provide the password for the estuary role, which is also estuary.
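If you stopped the PostgreSQL service on the replica node earlier, start it again once pg_basebackup finishes so the replica begins streaming changes from the primary:
sudo systemctl start postgresql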
To test whether you have successfully synced the two databases, run the following command on the primary database:
sudo -u postgres psql
SELECT client_addr, state FROM pg_stat_replication;
You should see a result similar to the one below, showing the replica’s client_addr with a state of streaming:
Now that you have successfully synced the two databases, let’s move on to validate the data between these two databases using data-diff.
Add data to the primary database
By now you have a database named estuary on the primary node (if it doesn’t exist yet, create it from psql with CREATE DATABASE estuary;). Checking the replica node, you can see the estuary database there too, automatically replicated from the primary.
From the primary node, run the following commands to access the PostgreSQL database named estuary:
sudo -u postgres psql
\c estuary
Then run the following command to create a new table named bloggers:
CREATE TABLE bloggers (
user_id serial PRIMARY KEY,
username VARCHAR ( 50 ) UNIQUE NOT NULL,
password VARCHAR ( 50 ) NOT NULL,
email VARCHAR ( 255 ) UNIQUE NOT NULL,
created_on TIMESTAMP NOT NULL,
last_login TIMESTAMP
);
Grant the estuary role permission to access the bloggers table, so that you can run data-diff with the estuary role later:
GRANT ALL ON bloggers TO estuary;
Add some data to the table so that you have something to validate:
INSERT INTO bloggers(username,password,email,created_on,last_login)
VALUES ('donaldle','donaldle','donaldle@gmail.com',current_timestamp,current_timestamp);
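If you like, you can confirm that the row has reached the replica by querying the bloggers table on the replica node. A minimal check:
# Run on the replica node: list the rows that have been replicated so far.
sudo -u postgres psql -d estuary -c "SELECT user_id, username, email FROM bloggers;"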
You have now created the table with some values in it. Let’s move on to validating the data between the primary database and the replica database.
Validate the data between the two databases using data-diff
From a third machine (not the nodes running the databases), install the data-diff tool by running the following command:
pip install data-diff
Then install the PostgreSQL driver so that data-diff can access the PostgreSQL databases.
pip install psycopg2
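If the psycopg2 installation fails on your machine (it compiles against libpq and needs build tools), you can install the prebuilt wheel instead:
pip install psycopg2-binary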
Success case
If your host machine is running Windows, run the following command to validate the data in the bloggers tables of the primary database and the replica database (on Linux or macOS, use data-diff instead of data-diff.exe).
data-diff.exe postgresql://estuary:'estuary'@primary-ip:5432/estuary bloggers postgresql://estuary:'estuary'@replica-ip:5432/estuary bloggers --key-columns='user_id'
By default, the key-columns option of data-diff is id. Since you don’t have an id column in your bloggers table, you need to specify key-columns as user_id instead.
Running the above command, you will see no differences reported. This means the data in the replica database matches the primary database.
If you want to see more details about the validation, you can add the --verbose flag.
data-diff.exe postgresql://estuary:'estuary'@primary-ip:5432/estuary bloggers postgresql://estuary:'estuary'@replica-ip:5432/estuary bloggers --key-columns='user_id' --verbose
You should see output similar to the below:
Now that you have successfully validated the data between the two bloggers tables, let’s try to simulate a mismatched database and then validate the tables using data-diff.
Failed case
Let’s say the replica node lost its connection due to a network outage. After the connection is stable again, you run the data-diff command from the current machine to validate the data between the two tables:
data-diff.exe postgresql://estuary:'estuary'@primary-ip:5432/estuary bloggers postgresql://estuary:'estuary'@replica-ip:5432/estuary bloggers --key-columns='user_id' --verbose
You’d see output similar to the below, showing one mismatched data row, at user_id 7, in the replica database.
Checking the primary database, you see that the row with user_id 7 exists, but the replica database has not synced it yet.
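When you spot a mismatch like this, a quick follow-up check is to rerun the replication-status query from earlier on the primary node to see whether the replica has reconnected and resumed streaming:
# Run on the primary node: an empty result means the replica is not connected.
sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"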
With the help of data-diff, you are able to identify data mismatch issues when replicating data between the primary database and the replica database.
Conclusion
Data replication is a critical step to protect your data and keep your applications reliable. If you prefer to build your own data pipelines, you’ll also need to validate the data replication process. That’s where a tool like data-diff comes in handy.
If you prefer to reduce the amount of manual intervention in your replication pipelines, and to let down your guard a bit knowing that your data pipeline platform monitors itself, a managed data pipeline platform might be a better fit.
Estuary Flow helps you build real-time data replication pipelines from a low-code UI. Flow is based on exactly-once processing to ensure data accuracy, and its dashboard alerts you to any data replication problems.
Learn how to use Flow to replicate data from databases like Postgres, MySQL, MongoDB, and many more.
Rather try it yourself? You can build your first pipeline for free!