Padhai Time

Normalization and Standardization

These techniques can improve a model's accuracy. They rescale the data to a common range so that no single attribute dominates the model's training simply because it is measured on a larger scale.

Suppose you are training your model on data that contains the height and salary of employees. Training will be influenced far more by the “Salary” attribute than by the “Height” attribute, because salary ranges from 25,000 to 1,50,000 INR whereas height ranges from 4.2 to 6.5 ft. The two attributes are on very different scales. Some models, such as KNN, calculate the distance between data points for training and prediction, so a typical change in height barely affects the distance whereas a typical change in salary changes it a lot.

Hence, for models whose algorithm is based on distance calculation, it is recommended to bring all attributes onto a common scale so that no attribute influences the training merely because of the magnitude of its values.
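To make this concrete, here is a small sketch of the effect. The three employees below are made-up values, and the minimum and maximum used for scaling are simply the ranges quoted in the example above; `euclidean` and `scale` are helper names chosen for this illustration.

```python
import numpy as np

# Hypothetical employees as (height in ft, salary in INR); values are made up
a = np.array([5.0, 50_000.0])
b = np.array([6.5, 50_000.0])  # same salary as a, but much taller
c = np.array([5.0, 51_000.0])  # same height as a, slightly higher salary


def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


# On the raw data, the small salary difference dwarfs the large height difference
dist_height_change = euclidean(a, b)  # 1.5
dist_salary_change = euclidean(a, c)  # 1000.0

# Rescale each attribute to [0, 1] using the ranges assumed in the text
mins = np.array([4.2, 25_000.0])
maxs = np.array([6.5, 150_000.0])


def scale(p):
    return (p - mins) / (maxs - mins)


# After scaling, the height difference is the larger of the two
scaled_height_change = euclidean(scale(a), scale(b))  # roughly 0.65
scaled_salary_change = euclidean(scale(a), scale(c))  # roughly 0.008

print(dist_height_change, dist_salary_change)
print(scaled_height_change, scaled_salary_change)
```

Before scaling, a difference of 1,000 INR counts for far more than a difference of 1.5 ft. After scaling, each attribute contributes in proportion to its own range.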

What is Normalization?

It is a scaling technique, also called Min-Max scaling. It rescales the values of a variable so that they lie between 0 and 1.

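The formula is simple:

X_scaled = (X - X_min) / (X_max - X_min)

where X_min and X_max are the minimum and maximum values of the variable.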

Let us implement this technique in Python to scale the data to the range 0 to 1.

Data: the Titanic dataset (titanic.csv). We use its Fare and Age columns.

Code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Load the Titanic data
data = pd.read_csv("titanic.csv")

# Plot the current distributions of the Fare and Age attributes
sns.displot(data['Fare'])
sns.displot(data['Age'])

# Scale both columns to the range [0, 1] and check the distributions again
min_max_scaler = MinMaxScaler()
data[['Fare', 'Age']] = min_max_scaler.fit_transform(data[['Fare', 'Age']])
sns.displot(data['Fare'])
sns.displot(data['Age'])

# Display the figures when running as a script
plt.show()

Output:

The charts show that after Min-Max scaling the shape of the distribution remains the same, but the scale changes. All the values now range between 0 and 1.
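As a quick sanity check, Min-Max scaling can also be done by hand: subtract the column minimum and divide by the column range. The sketch below does this on a tiny made-up column and compares the result with MinMaxScaler. The numbers are illustrative only and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A tiny made-up column of values (one feature, four rows)
values = np.array([[10.0], [20.0], [25.0], [50.0]])

# Min-Max scaling by hand: (X - X_min) / (X_max - X_min)
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same scaling with scikit-learn
sklearn_scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())          # 0, 0.25, 0.375, 1
print(sklearn_scaled.ravel())  # same values
```

The smallest value maps to 0, the largest to 1, and everything in between keeps its relative position.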

What is Standardization?

It is also a scaling technique. It shifts the mean of a variable to 0 and rescales its standard deviation to 1.

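Formula:

X_scaled = (X - mean) / standard deviation

where the mean and standard deviation are those of the variable being scaled.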

Important point to note:

Like Min-Max scaling, standardization rescales the data but does not change the shape of its distribution. So, if your original data is skewed, it remains skewed after the transformation.
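This point can be checked numerically. The sketch below builds a synthetic right-skewed column (random exponential values, not the Titanic data) and compares its skewness before and after both kinds of scaling. The variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (an assumption for illustration only)
rng = np.random.default_rng(0)
raw = pd.Series(rng.exponential(scale=30.0, size=1000))

# Standardization and Min-Max scaling done by hand
standardized = (raw - raw.mean()) / raw.std(ddof=0)
min_maxed = (raw - raw.min()) / (raw.max() - raw.min())

# Skewness measures the asymmetry of the distribution
skew_raw = raw.skew()
skew_standardized = standardized.skew()
skew_min_maxed = min_maxed.skew()

print(skew_raw, skew_standardized, skew_min_maxed)
```

All three skewness values agree up to floating-point error, because both techniques only shift and stretch the values. If the skew itself is a problem, a different kind of transformation is needed.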

Code:

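import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Reload the data so that we start from the original, unscaled values
data = pd.read_csv("titanic.csv")

# Create the standard scaler and apply it to both columns
standard_scaler = StandardScaler()
data[['Fare', 'Age']] = standard_scaler.fit_transform(data[['Fare', 'Age']])

# Plot the distributions after standardization
sns.displot(data['Fare'])
sns.displot(data['Age'])

# Display the figures when running as a script
plt.show()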

Output:

After applying standardization, the mean of each variable becomes 0 and its standard deviation becomes 1.
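The claim is easy to verify on a small made-up column. The sketch below standardizes by hand and with StandardScaler, then checks the resulting mean and standard deviation. The values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A tiny made-up column with mean 5 and (population) standard deviation 2
values = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Standardization by hand; np.std uses the population formula (ddof=0) by default
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same transformation with scikit-learn
scaled = StandardScaler().fit_transform(values)

scaled_mean = scaled.mean()
scaled_std = scaled.std()

print(scaled.ravel())           # -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2
print(scaled_mean, scaled_std)  # approximately 0 and 1
```

One detail to keep in mind: StandardScaler divides by the population standard deviation (ddof=0), whereas the pandas `.std()` method defaults to the sample standard deviation (ddof=1). Checking a standardized pandas column with `.std()` therefore gives a value slightly different from 1.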
