Last updated: March 5, 2026

Data Normalization: Types, Techniques & Examples [2026 Guide]

Learn data normalization across databases (1NF to 5NF) and machine learning (min-max, z-score, decimal scaling). Includes real examples, Python code, and formulas.

Jeffrey Richman Data Engineering & Growth Specialist

Poor data quality is one of the most common causes of broken analytics, failed ML models, and inconsistent application behavior. Data normalization is one of the most effective practices for preventing these problems. It applies to two related but distinct domains: database design, where normalization removes redundancy and enforces integrity, and machine learning, where normalization rescales feature values so algorithms can compare them fairly.

Data normalization is the process of structuring a database by eliminating redundancy, organizing data efficiently, and ensuring data integrity. It standardizes data across various fields, from databases to data analysis and machine learning, improving accuracy and consistency.

In this guide, we'll break down the concept of data normalization, explore its types and applications in databases and machine learning, and then look at the data anomalies it helps prevent.

What Is Data Normalization?

Data normalization is an important aspect of data management and analysis that plays a crucial role in both data storage and data analysis. It is a systematic approach to decompose data tables to eliminate redundant data and undesirable characteristics.

The primary goal of database normalization is to let you add, delete, and modify data without introducing inconsistencies. It aims to store each data item in only one place, which typically reduces storage requirements and improves the consistency and reliability of the system.

In databases, normalization organizes fields and tables. In data analysis and machine learning, it is a preprocessing step applied to the data before analysis or model training.

Who Needs Data Normalization?

Data normalization has applications in a wide array of fields and professions. Its ability to streamline data storage, reduce data input errors, and ensure consistency makes it an invaluable asset for anyone dealing with large datasets. Let’s discuss some of its use cases.

Data Normalization In Machine Learning

Data normalization is a standard preprocessing step in machine learning. ML engineers use it to bring features onto a comparable scale, so that features with large numeric ranges do not dominate the model's predictions.

Data Normalization In Research

Researchers, particularly those in the field of science and engineering, often use data normalization in their work. Whether they're dealing with experimental data or large datasets, normalization helps to simplify their data, making it easier to analyze and interpret. They use it to eliminate potential distortions caused by differing scales or units and ensure that their findings are accurate and reliable.

Data Normalization In Business

In the business world, data normalization is often used in business intelligence and decision-making. Business analysts use normalization to prepare data for analysis, helping them to identify trends, make comparisons, and draw meaningful conclusions.

This helps in more informed business decisions and strategies to drive growth and success. Normalization also improves data consistency, which results in better collaboration between different teams within the company.

Database Normal Forms Explained (1NF through 5NF)

Data normalization in databases is a multi-stage process that involves the application of a series of rules known as 'normal forms'. Each normal form represents a level of normalization and comes with its own set of conditions that a database should meet.

The process starts with the first normal form (1NF) and can go up to the fifth normal form (5NF). Each level builds on the previous one and addresses a specific type of data redundancy or anomaly. Let’s take a look at each one of them.

1. First Normal Form (1NF)

The first normal form (1NF) is the foundational step of data normalization. A database is in 1NF if:

It contains only atomic values (each field holds a single value, no lists or arrays).
Every record is unique and identified by a primary key.
There are no repeating groups of data within a row.

This stage removes multi-valued fields and repeating groups and ensures that each row in the database has a unique identifier, which makes the data easier to query consistently.

Example of 1NF Violation & Solution

Consider an E-commerce Order Table, used in online retail to track purchases. Customers often buy multiple products in one order, which leads to multiple values in a single field. This complicates retrieval and analysis since extracting individual products requires extra processing. Normalization restructures the data for efficient querying and ensures data integrity.

Order_ID	Customer_Name	Products Ordered
1001	John Doe	Laptop, Mouse
1002	Jane Smith	Phone, Headphones

The 'Products Ordered' column has multiple values (not atomic), violating 1NF.

Order_ID	Customer_Name	Products
1001	John Doe	Laptop
1001	John Doe	Mouse
1002	Jane Smith	Phone
1002	Jane Smith	Headphones

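Now each record stores a single atomic value, making the table 1NF compliant. Because an order can span several rows, each row is identified by the combination of Order_ID and Products rather than by Order_ID alone.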
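The same restructuring can be sketched in a few lines of Python. This is a minimal illustration rather than a production approach: the `orders` list and the `to_first_normal_form` helper are hypothetical names made up for this example, and the data is copied from the tables above.

```python
# Hypothetical in-memory copy of the unnormalized order table from the example.
orders = [
    {"Order_ID": 1001, "Customer_Name": "John Doe", "Products_Ordered": "Laptop, Mouse"},
    {"Order_ID": 1002, "Customer_Name": "Jane Smith", "Products_Ordered": "Phone, Headphones"},
]


def to_first_normal_form(rows):
    """Emit one row per (order, product) so every field holds a single value."""
    atomic_rows = []
    for row in rows:
        for product in row["Products_Ordered"].split(","):
            atomic_rows.append(
                {
                    "Order_ID": row["Order_ID"],
                    "Customer_Name": row["Customer_Name"],
                    "Product": product.strip(),
                }
            )
    return atomic_rows


orders_1nf = to_first_normal_form(orders)
for row in orders_1nf:
    print(row)
```

Each output row now carries exactly one product, and the pair (Order_ID, Product) is unique across rows.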

2. Second Normal Form (2NF)

A database reaches the second normal form (2NF) if:

It is already in 1NF.
All non-key attributes are fully functionally dependent on the primary key.
There are no partial dependencies, meaning no attribute should depend on just a part of a composite primary key.

Example of 2NF Violation & Solution

Consider a Student-Course Enrollment Table, where students enroll in multiple courses.

Student_ID	Course_ID	Student_Name	Course_Name
201	C101	Alice	Math
202	C102	Bob	Science

Student_Name depends only on Student_ID.
Course_Name depends only on Course_ID.
Neither depends on the full composite key (Student_ID, Course_ID), so both are partial dependencies that violate 2NF.

To bring the table to 2NF, we split it into three:

Student Table

Student_ID	Student_Name
201	Alice
202	Bob

Courses Table

Course_ID	Course_Name
C101	Math
C102	Science

Enrollment Table (Bridging Table)

Student_ID	Course_ID
201	C101
202	C102

Now, every non-key attribute is fully dependent on its respective primary key.
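As a rough sketch, the three tables can be created and queried with Python's built-in `sqlite3` module. The lowercase table and column names are assumptions chosen for this example. The final join shows that the original combined view can still be reconstructed from the split tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default
conn.executescript(
    """
    CREATE TABLE students (
        student_id   INTEGER PRIMARY KEY,
        student_name TEXT NOT NULL
    );
    CREATE TABLE courses (
        course_id   TEXT PRIMARY KEY,
        course_name TEXT NOT NULL
    );
    -- Bridging table: the composite key is the whole row, so no partial dependency is possible.
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students(student_id),
        course_id  TEXT NOT NULL REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    """
)

conn.executemany("INSERT INTO students VALUES (?, ?)", [(201, "Alice"), (202, "Bob")])
conn.executemany("INSERT INTO courses VALUES (?, ?)", [("C101", "Math"), ("C102", "Science")])
conn.executemany("INSERT INTO enrollments VALUES (?, ?)", [(201, "C101"), (202, "C102")])

# Rebuild the original four-column view by joining the three tables.
rows = conn.execute(
    """
    SELECT e.student_id, e.course_id, s.student_name, c.course_name
    FROM enrollments AS e
    JOIN students AS s ON s.student_id = e.student_id
    JOIN courses  AS c ON c.course_id = e.course_id
    ORDER BY e.student_id
    """
).fetchall()
print(rows)
```

Renaming a course now means updating a single row in `courses`, no matter how many students are enrolled in it.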

3. Third Normal Form (3NF)

The third normal form (3NF) is achieved if:

It is already in 2NF.
There are no transitive dependencies, meaning no non-key attribute depends on the primary key only through another non-key attribute.

Example of 3NF Violation & Solution

Consider an Employee Payroll Table, where a company tracks employee salaries and tax rates for payroll processing. Each employee belongs to a department, and salaries vary based on job roles. However, tax rates are not directly related to employees but rather to salary ranges. This introduces a transitive dependency, making the table inefficient for data retrieval and updates.

Employee_ID	Name	Department	Salary	Tax Rate
101	John	HR	5000	10%
102	Alice	IT	6000	20%

Tax Rate is dependent on Salary, not directly on Employee_ID.

To bring the table to 3NF, we split it into three:

Employee Table

Employee_ID	Name	Department
101	John	HR
102	Alice	IT

Salary Table

Employee_ID	Salary
101	5000
102	6000

Tax Table

Salary	Tax Rate
5000	10%
6000	20%

Now Tax Rate is stored once per Salary value in its own table, so it no longer depends on Employee_ID through Salary, and the transitive dependency is removed.
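One way to spot a dependency like Salary → Tax Rate is to test it against the data. The sketch below assumes a hypothetical helper named `determines`, and it adds a third employee (ID 103) that is not in the original table so the checks are less trivial. Sample data can only disprove a dependency, never prove one, so treat the result as a hint.

```python
def determines(rows, lhs, rhs):
    """Return True if equal values in the `lhs` columns always come with equal `rhs` values."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        # setdefault stores the first value seen for a key and returns it afterwards
        if seen.setdefault(key, value) != value:
            return False
    return True


payroll = [
    {"Employee_ID": 101, "Name": "John", "Department": "HR", "Salary": 5000, "Tax_Rate": "10%"},
    {"Employee_ID": 102, "Name": "Alice", "Department": "IT", "Salary": 6000, "Tax_Rate": "20%"},
    # Hypothetical extra row, added only for this illustration
    {"Employee_ID": 103, "Name": "Maria", "Department": "IT", "Salary": 5000, "Tax_Rate": "10%"},
]

key_to_salary = determines(payroll, ["Employee_ID"], ["Salary"])  # key -> non-key
salary_to_tax = determines(payroll, ["Salary"], ["Tax_Rate"])     # non-key -> non-key
salary_is_key = determines(payroll, ["Salary"], ["Employee_ID"])  # would Salary identify a row?

# The tax table from the 3NF solution keeps one entry per distinct salary.
tax_by_salary = {row["Salary"]: row["Tax_Rate"] for row in payroll}
print(key_to_salary, salary_to_tax, salary_is_key, tax_by_salary)
```

Here Employee_ID determines Salary and Salary determines Tax_Rate, while Salary does not identify an employee. That chain through a non-key column is the transitive dependency that 3NF removes.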

4. Beyond 3NF (BCNF, 4NF, 5NF)

While most databases are considered normalized after reaching 3NF, there are further stages of normalization, including:

Boyce-Codd Normal Form (BCNF)

A table is in BCNF if it is in 3NF and every determinant is a candidate key.

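It closes a gap that 3NF leaves open: a column that is not a candidate key determining part of a candidate key.
BCNF is stronger than 3NF, but the difference matters only in special cases, typically tables with several overlapping candidate keys.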

Fourth Normal Form (4NF)

A table is in 4NF if it is in BCNF and contains no non-trivial multi-valued dependencies other than those determined by a candidate key.

This is useful when handling many-to-many relationships.
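A small sketch of a multi-valued dependency, using made-up data: an employee's skills and spoken languages are independent of each other, so storing them in one table forces every combination to be listed. The names `employee_facts`, `skills`, and `languages` are hypothetical.

```python
# One table mixing two independent many-to-many facts about an employee.
employee_facts = {
    ("Alice", "Python", "English"),
    ("Alice", "Python", "French"),
    ("Alice", "SQL", "English"),
    ("Alice", "SQL", "French"),
}

# 4NF decomposition: one table per independent fact.
skills = {(employee, skill) for employee, skill, _ in employee_facts}
languages = {(employee, language) for employee, _, language in employee_facts}

# Joining the two smaller tables on the employee reproduces the original rows.
rejoined = {
    (employee, skill, language)
    for employee, skill in skills
    for other_employee, language in languages
    if employee == other_employee
}

print(len(employee_facts), len(skills) + len(languages), rejoined == employee_facts)
```

The decomposed form stores four rows across two tables and loses no information. Adding a third language would add one row to `languages` instead of one row per skill.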

Fifth Normal Form (5NF)

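A table is in 5NF if it is in 4NF and every join dependency in it is implied by its candidate keys.

It applies in rare scenarios where a table can only be stored without redundancy by splitting it into three or more smaller tables that are joined back together.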

Data Normalization in Data Analysis & Machine Learning

In data analysis and machine learning workflows, data normalization is a pre-processing step. It adjusts the scale of data and ensures that all variables in a dataset are on a similar scale. This uniformity is important as it prevents any single variable from overshadowing others.

For machine learning algorithms that rely on distances or gradient-based optimization, normalized data is especially important. Without it, features with large numeric ranges dominate distance calculations and can slow down or destabilize gradient descent. Putting features on a comparable scale generally helps these algorithms train reliably and improves the quality of the insights derived from the data.
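To make the effect on distance-based methods concrete, here is a small sketch with made-up data: four hypothetical people described by income and age. The helper names `min_max_scale` and `nearest` are assumptions for this example. It uses min-max scaling, which is described in the next section.

```python
import math

# Hypothetical people described by (annual income in dollars, age in years).
people = {
    "a": (50_000, 25),
    "b": (51_000, 65),
    "c": (58_000, 26),
    "d": (90_000, 70),
}


def min_max_scale(points):
    """Rescale each feature (column) to the range 0..1 across all points."""
    columns = list(zip(*points.values()))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        name: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))
        for name, point in points.items()
    }


def nearest(points, target):
    """Return the name of the point closest to `target` by Euclidean distance."""
    others = (name for name in points if name != target)
    return min(others, key=lambda name: math.dist(points[name], points[target]))


raw_neighbor = nearest(people, "a")
scaled_neighbor = nearest(min_max_scale(people), "a")
print(raw_neighbor, scaled_neighbor)  # prints: b c
```

On the raw values, person "b" looks closest to "a" because a 1,000-dollar income gap outweighs a 40-year age gap. After scaling, "c" is closest because both features contribute on comparable terms.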

3 Data Normalization Techniques & Formulas

Data analysis and machine learning use several techniques for normalizing data. Let’s discuss the 3 most commonly used methods.

Min-Max Normalization

This technique performs a linear transformation on the original data. Each value is replaced according to a formula that considers the minimum and maximum values of the data. The goal is to scale the data to a specific range, such as [0.0, 1.0]. The formula for min-max normalization is:

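$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Here x is an original value, min(x) and max(x) are the smallest and largest values of the feature, and x' is the normalized value in the range [0.0, 1.0].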

Z-Score Normalization

Also known as zero-mean normalization or standardization, this technique normalizes values based on the mean and standard deviation of the data. Each value is replaced by a score that indicates how many standard deviations it is from the mean. You can apply Z-score normalization using the following formula:

$$z = \frac{x - \mu}{\sigma}$$

Here x is an original value, μ is the mean of the feature, and σ is its standard deviation.

Decimal Scaling Normalization

This technique normalizes by moving the decimal point of the values of the data. Each value is divided by a power of 10 chosen from the maximum absolute value of the data, so that the results fall in the range of -1 to 1. The formula for this simple normalization technique is:

$$x' = \frac{x}{10^{j}}$$

Here j is the smallest integer such that max(|x'|) < 1.
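The three techniques can be written out by hand with only the Python standard library. This is a sketch on a small, made-up list of values. It assumes the population standard deviation for the Z-score, and for decimal scaling it assumes division by the smallest power of 10 that brings every absolute value below 1.

```python
import math
import statistics

values = [200.0, 400.0, 600.0, 800.0]  # hypothetical feature values

# Min-max: map the smallest value to 0 and the largest to 1.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score: subtract the mean, then divide by the population standard deviation.
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_scores = [(v - mean) / std for v in values]

# Decimal scaling: divide by the smallest power of 10 that makes every |value| < 1.
j = math.floor(math.log10(max(abs(v) for v in values))) + 1
decimal_scaled = [v / 10**j for v in values]

print(min_max)
print(z_scores)
print(decimal_scaled)
```

For these values the largest magnitude is 800, so `j` is 3 and every value is divided by 1000.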

3 Data Normalization Examples With Python Code

Let’s apply the normalization techniques discussed above to real-world data. This can help us uncover the tangible effects they have on data transformation. We will use the Iris dataset, which is a popular dataset in the field of machine learning. This dataset consists of 150 samples from 3 species of Iris flowers.

Here’s how you can import the data in Python:

```python
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
data = load_iris()
x = data['data']
feature_names = data['feature_names']

# Create a DataFrame from the data for better visual representation
df = pd.DataFrame(x, columns=feature_names)
```

Calling df.head() shows a sample of the dataset. It has four numeric features, all measured in centimeters: sepal length, sepal width, petal length, and petal width.

Min-Max Normalization Example

Min-Max normalization is a simple yet effective method to rescale features to a specific range, typically 0 to 1. Here is how you can perform Min-Max normalization using Python and Scikit-learn:

```python
from sklearn.preprocessing import MinMaxScaler

# Create the scaler
scaler = MinMaxScaler()

# Fit and transform the data
df_min_max_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```

After Min-Max normalization, every feature in the Iris dataset lies in the range 0 to 1, with each feature's smallest value mapped to 0 and its largest to 1.

Z-score Normalization Example

Z-score normalization, or standardization, centers the data with a mean of 0 and a standard deviation of 1. Here's an example of how to perform Z-score normalization:

```python
from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Fit and transform the data
df_standard_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```

After Z-score normalization, each feature of the Iris dataset has a mean of 0 and a standard deviation of 1, and every value is expressed as a number of standard deviations from its feature's mean.

Decimal Scaling Normalization Example

Decimal scaling normalization is particularly useful when the maximum absolute value of a feature is known. Here's a simple Python example of decimal scaling normalization:

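```python
import numpy as np

# Largest absolute value in each feature column
max_abs_val = np.max(np.abs(df.values), axis=0)

# Smallest power of 10 that brings every value in the column below 1 in absolute value
exponent = np.floor(np.log10(max_abs_val)) + 1
df_decimal_scaled = df / 10 ** exponent
```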

The decimal scaling normalization code above finds the largest absolute value in each feature, determines its order of magnitude, and divides the feature by the corresponding power of 10. In the Iris dataset every feature's maximum is below 10, so each column is divided by 10.

Min-max and Z-score normalization put all four features on a common scale. Decimal scaling only shifts the decimal point, so it helps most when features differ by orders of magnitude. Comparable scales help ensure that no single feature dominates the final result.

Data Anomalies: What They Are and How Normalization Prevents Them

What Are Data Anomalies?

Data anomalies refer to inconsistencies or errors that occur when you deal with stored data. These anomalies can compromise the integrity of the data and cause inaccuracies that do not reflect the real-world scenario the data is meant to represent.

In databases, anomalies are typically caused by redundancy or poor table construction. In data analysis and machine learning, anomalies can arise from missing values, incorrect data types, or unrealistic values.

Regardless of the context, anomalies can significantly impact the consistency and integrity of data. They can cause inaccurate analyses, misleading results, and poor decision-making. Therefore, identifying and addressing data anomalies is a crucial step in any data-driven process.

Causes and Effects of Data Anomalies

Data anomalies can be categorized based on their causes and their impact. They primarily affect databases and data analysis/machine learning systems, leading to inefficiencies and unreliable outputs. Each type can disrupt data consistency and affect operations.

Exploring Data Anomalies: A Focus on Databases, Data Analysis & Machine Learning

Data anomalies can originate from a range of sources, and their impact can vary, often causing substantial complications if not addressed properly. Let’s talk about 2 broad categories where these anomalies are most prevalent and can cause major issues.

1. Anomalies In Databases

When it comes to databases, 3 primary types of data anomalies result from update, insertion, and deletion operations.

Insertion anomalies: These occur when the addition of new data to the database is hindered because of the absence of other necessary data. This situation often arises in systems where specific dependencies between data elements exist.
Update anomalies: These types of anomalies happen when modifications to the data end up causing inconsistencies. This usually occurs when the same piece of data is stored in multiple locations, and changes aren't reflected uniformly across all instances.
Deletion anomalies: You encounter these anomalies when you unintentionally lose other valuable information while removing certain data. This typically happens when multiple pieces of information are stored together, and the deletion of one affects the others.
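The three anomalies above can be reproduced on a tiny denormalized table. The following is a sketch with hypothetical data, in which the course name is repeated on every enrollment row.

```python
# Hypothetical denormalized table: the course name is repeated on every enrollment row.
enrollments = [
    {"student": "Alice", "course_id": "C101", "course_name": "Math"},
    {"student": "Bob", "course_id": "C101", "course_name": "Math"},
    {"student": "Bob", "course_id": "C102", "course_name": "Science"},
]

# Update anomaly: renaming the course on one row leaves a stale copy on another.
enrollments[0]["course_name"] = "Mathematics"
names_for_c101 = {r["course_name"] for r in enrollments if r["course_id"] == "C101"}
print(names_for_c101)  # two conflicting names for the same course

# Deletion anomaly: removing the only C102 enrollment also erases
# the fact that course C102 exists and is called "Science".
enrollments = [r for r in enrollments if r["course_id"] != "C102"]
known_courses = {r["course_id"] for r in enrollments}
print("C102" in known_courses)

# Insertion anomaly: a new course with no students yet can only be recorded
# by leaving the student field empty or inventing a placeholder.
new_course = {"student": None, "course_id": "C103", "course_name": "History"}
print(new_course)
```

With separate student, course, and enrollment tables, each of these operations touches exactly one row in one table, so none of the three problems arises.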

While the above anomalies are mainly related to the operations in databases and their design flaws, understand that anomalies are not limited to these aspects alone. They can very well be present in the data itself and can be a source of misleading analysis and interpretations. Let’s discuss these next.

2. Anomalies In Data Analysis & Machine Learning

In data analysis and machine learning, data anomalies can manifest as discrepancies in the values, types, or completeness of data, which can significantly impact the outcome of analyses or predictive models. Let's examine some of the key anomalies that occur in this context:

Missing values: These happen when data is not available for certain observations or variables.
Incorrect data types: These anomalies occur when the data type of a variable does not match the expected data type. For example, a numeric variable might be recorded as a string.
Unrealistic values: This type of anomaly arises when variables contain values that are not physically possible or realistic. For example, a variable representing human age might contain a value of 200.
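A minimal sketch of how these three checks might look in code. The records, the `check_age` helper, and the accepted age range of 0 to 120 are all assumptions made for illustration.

```python
# Hypothetical records; the field names and the 0-120 age range are assumptions.
records = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "age": None, "income": 48000.0},  # missing value
    {"id": 3, "age": "41", "income": 61000.0},  # number stored as a string
    {"id": 4, "age": 200, "income": 39000.0},   # unrealistic value
]


def check_age(value, low=0, high=120):
    """Classify one age value as 'ok', 'missing', 'wrong_type', or 'unrealistic'."""
    if value is None:
        return "missing"
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "wrong_type"
    if not low <= value <= high:
        return "unrealistic"
    return "ok"


report = {record["id"]: check_age(record["age"]) for record in records}
print(report)
```

Checks like these are usually run before any scaling step, since a single unrealistic value can distort the minimum, maximum, mean, and standard deviation that the scaling formulas depend on.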

How Data Normalization Solves Data Anomalies

Data normalization plays a crucial role in preventing, managing, and resolving anomalies by structuring data efficiently and enforcing integrity constraints. By standardizing data structures, eliminating redundancies, and enforcing data integrity, normalization ensures that databases and datasets are reliable and efficient for processing.

The normal forms covered earlier in this guide are the structured process that prevents these issues in databases.

How Estuary Can Help With Data Normalization

Estuary is a real-time data platform that handles data ingestion, transformation, and normalization in a single pipeline. It supports SQL and TypeScript transformations (including AVG, MIN, MAX, and STDDEV functions), uses JSON Schema for automatic data validation, and processes data in real time so normalized datasets are always current.

Here are some of the key features of Estuary that can support data normalization:

Default annotations: It uses default annotations to prevent null values from being materialized to your endpoint system, ensuring data consistency.
Real-time transformations: Estuary supports SQL and TypeScript for data manipulation, including functions like AVG(), MIN(), MAX(), and STDDEV() that can be used for data normalization.
Projections: It uses projections to translate between the documents of a collection and a table representation. This feature is particularly useful when dealing with systems that model flat tables of rows and columns.
Logical partitions: It allows you to logically partition a collection, isolating the storage of documents by their differing values for partitioned fields. This can help improve the efficiency of data storage and retrieval.
Real-time data processing: It processes data in real time, which ensures that your normalized data is always up to date. This is particularly useful for applications that require immediate insights from the data.
Reductions: Estuary can merge multiple documents with a common key into a single document using customizable reduction strategies.
Schema management: It uses JSON Schema to define your data’s structure, representation, and constraints. This allows for robust data validation and ensures that your data is clean and valid before it's stored or processed.
Flexible data ingestion: Flow allows for the ingestion of data from a wide array of sources, including databases, cloud storage, and message queues. This flexibility makes it easier to bring in data from various sources for normalization.

Data normalization is a critical process in data management and analysis that ensures the integrity and reliability of data. However, the process can be complex and time-consuming, especially when dealing with large datasets and various types of anomalies.

This is where Estuary comes in. It facilitates seamless real-time data operations and ensures that your data is always up-to-date and ready for analysis. With features like schema management and support for data manipulation functions, Estuary can streamline the data normalization process.

So, if you're looking for a platform to simplify your data normalization process, you can explore Estuary for free by signing up here or reaching out to our team to discuss your specific needs.

FAQs

What is data normalization in simple terms?

Data normalization is the process of organizing data so that it is consistent, free of redundancy, and easy to use. In databases, it means restructuring tables to remove duplicate information and ensure each fact is stored in one place. In machine learning, it means rescaling numerical features so they share a comparable range, which helps algorithms treat each feature fairly.

What is the difference between 1NF, 2NF, and 3NF?

First normal form (1NF) requires that every field hold a single atomic value, with no lists or repeating groups, and that each record have a unique identifier. Second normal form (2NF) builds on 1NF by requiring that every non-key column depend on the entire primary key, not just part of a composite key. Third normal form (3NF) builds on 2NF by removing transitive dependencies, meaning no non-key column should depend on another non-key column. In practice, most databases are considered well-normalized once they reach 3NF.

What is the difference between database normalization and ML normalization?

Database normalization and machine learning normalization share a name but solve different problems. Database normalization restructures tables to eliminate redundancy and improve data integrity, using normal forms (1NF through 5NF). ML normalization rescales numerical feature values to a common range so that distance-based and gradient-based algorithms work correctly, using techniques like min-max scaling, z-score standardization, and decimal scaling. The two are applied independently of each other despite the shared term.

Do I always need to normalize data for machine learning?

Not always. Tree-based algorithms like decision trees, random forests, and gradient-boosted trees do not require normalization because they split features based on thresholds rather than distances. Algorithms that do benefit from normalization include k-nearest neighbors, support vector machines, neural networks, linear regression with regularization, and any algorithm using gradient descent. As a general rule, if your features are on very different scales (one column in dollars, another in percentages, another in counts), normalization usually improves model quality.

What are data anomalies in a database?

Data anomalies are inconsistencies that occur when a database is poorly designed. There are three main types: insertion anomalies happen when you cannot add a new record because other required data is missing; update anomalies happen when changing a value in one place leaves stale copies elsewhere; and deletion anomalies happen when removing one record accidentally removes other information you wanted to keep. Database normalization is designed specifically to prevent these three problems by ensuring each fact is stored in exactly one place.

Can over-normalization cause problems?

Yes. Highly normalized databases require more joins to reconstruct usable data, which can hurt query performance for read-heavy workloads like analytics. For this reason, analytics warehouses often use denormalized schemas (star schema, wide tables) that intentionally duplicate data to make queries faster. Transactional databases generally benefit from normalization (typically 3NF), while analytical warehouses often benefit from deliberate denormalization.
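The trade-off can be sketched with Python's built-in `sqlite3` module. The table names, column names, and amounts below are hypothetical. The normalized query needs a join at read time, while the denormalized `orders_wide` table answers the same question from a single table at the cost of repeating the customer name on every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'John Doe'), (2, 'Jane Smith');
    INSERT INTO orders VALUES (1001, 1, 1200.0), (1002, 2, 800.0), (1003, 1, 25.0);

    -- Denormalized copy for analytics: the customer name is duplicated on every order row.
    CREATE TABLE orders_wide AS
    SELECT o.order_id, c.customer_name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
    """
)

# Normalized schema: total per customer requires a join.
normalized = conn.execute(
    """
    SELECT c.customer_name, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.customer_name
    ORDER BY c.customer_name
    """
).fetchall()

# Denormalized schema: the same total comes from one table, with no join.
denormalized = conn.execute(
    """
    SELECT customer_name, SUM(amount)
    FROM orders_wide
    GROUP BY customer_name
    ORDER BY customer_name
    """
).fetchall()

print(normalized == denormalized, denormalized)
```

Both queries return the same totals. The wide table is simpler to read from, but a customer rename must now be applied to every matching row, which is the update anomaly that normalization was designed to avoid.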

About the author

Jeffrey Richman, Data Engineering & Growth Specialist

Jeffrey is a data engineering professional with over 15 years of experience, helping early-stage data companies scale by combining technical expertise with growth-focused strategies. His writing shares practical insights on data systems and efficient scaling.