Padhai Time

Normalization and Standardization

These techniques are often needed to improve a model's accuracy. They rescale the data to a common range so that no attribute dominates the model's training merely because it is measured on a larger scale.

Suppose you are training a model on data that contains the height and salary of employees. Training will be influenced more by the “Salary” attribute than by the “Height” attribute, because salary ranges from 25,000 to 1,50,000 INR whereas height ranges from 4.2 to 6.5 ft. The scales of the two attributes are very different. Some models, such as KNN, compute distances between data points for training and prediction, so a change in height barely affects the distance whereas a change in salary does.

 

Hence, for models whose algorithm is based on distance calculation, it is recommended to bring all attributes onto a common scale so that no attribute influences training merely because of the magnitude of its values.
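To make the height and salary argument concrete, here is a minimal sketch using three made-up employees (the values of a, b and c are hypothetical). It compares Euclidean distances on the raw values and after rescaling each attribute to the 0 to 1 range, taking the ranges quoted above as the minimum and maximum of each attribute.

```python
import math

# Hypothetical employees as (height in feet, salary in INR).
a = (5.0, 50_000.0)
b = (6.5, 50_500.0)  # much taller than a, almost the same salary
c = (5.0, 52_000.0)  # same height as a, slightly higher salary

# Raw Euclidean distances: the salary axis dominates.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Ranges quoted in the text, assumed here to be each attribute's min and max.
HEIGHT_MIN, HEIGHT_MAX = 4.2, 6.5
SALARY_MIN, SALARY_MAX = 25_000.0, 150_000.0


def rescale(person):
    """Map both attributes to the 0-1 range."""
    height, salary = person
    return (
        (height - HEIGHT_MIN) / (HEIGHT_MAX - HEIGHT_MIN),
        (salary - SALARY_MIN) / (SALARY_MAX - SALARY_MIN),
    )


scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw values, c appears farther from a than b does, even though b differs by most of the height range. After rescaling, the ordering flips, which is the behaviour a distance-based model such as KNN would be sensitive to.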

 

What is Normalization?

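It is a scaling technique, also called Min-Max scaling. It rescales the values of a variable so that they lie between 0 and 1.

The formula is very simple:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Here x is an original value, and x_min and x_max are the smallest and largest values of that variable.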
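As a quick worked example, Min-Max scaling subtracts the minimum from each value and divides by the range (maximum minus minimum). The sketch below applies this by hand to a few made-up numbers; the helper name min_max_scale and the sample values are illustrative only.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the range is zero and the formula is undefined.
        raise ValueError("cannot scale a constant column")
    return [(v - lo) / (hi - lo) for v in values]


fares = [10.0, 20.0, 30.0, 50.0]  # made-up sample values
scaled = min_max_scale(fares)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else keeps its relative position in between.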

  

Let us implement this technique using Python to scale the data within a certain range.

Data:

The Titanic dataset (titanic.csv), using its Fare and Age columns.

Code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv("titanic.csv")

# Plot the current distributions of the Fare and Age attributes
sns.displot(data['Fare'])
sns.displot(data['Age'])

# Scale both columns to the 0-1 range and plot the distributions again
# (missing values, if any, are ignored when fitting and stay missing)
min_max_scaler = MinMaxScaler()
data[['Fare', 'Age']] = min_max_scaler.fit_transform(data[['Fare', 'Age']])
sns.displot(data['Fare'])
sns.displot(data['Age'])

# Show the figures when running as a script
plt.show()

  

Output:

[Figure: distribution plots of Fare and Age before and after Min-Max scaling]

The charts show that after Min-Max scaling the shape of each distribution remains the same, but the scale changes. All values now range between 0 and 1.

 

What is Standardization?

It is also a scaling technique. It shifts the mean of each variable to 0 and rescales its standard deviation to 1.

 

Formula:

$$z = \frac{x - \mu}{\sigma}$$

Here μ is the mean and σ is the standard deviation of the variable.
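As a small worked example, standardization subtracts the mean from each value and divides by the standard deviation. The sketch below does this by hand on made-up numbers. The helper name standardize is illustrative, and it uses the population standard deviation (dividing by N), which is also what scikit-learn's StandardScaler uses.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, i.e. divide by N
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


ages = [20.0, 30.0, 40.0, 50.0, 60.0]  # made-up sample values
z_scores = standardize(ages)

print([round(z, 3) for z in z_scores])
print("mean:", round(statistics.fmean(z_scores), 6))
print("std: ", round(statistics.pstdev(z_scores), 6))
```

Unlike Min-Max scaling, the result is not confined to a fixed range: values below the mean become negative and values above it become positive.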

Important point to note:

Like Min-Max scaling, Standardization rescales the data but does not change the shape of its distribution. So, if your original data is skewed, it remains skewed after the transformation.
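One way to check this claim is to compare the skewness of a sample before and after scaling. The sketch below uses a made-up right-skewed sample and a simple population skewness (the mean of the cubed z-scores); the helper names are illustrative. Both transformations are a shift followed by division by a positive constant, so the skewness should come out the same up to floating-point rounding.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up right-skewed sample: most values are small, one is very large.
sample = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 50.0]

skew_raw = skewness(sample)
skew_min_max = skewness(min_max_scale(sample))
skew_standard = skewness(standardize(sample))

print(f"raw: {skew_raw:.4f}")
print(f"min-max: {skew_min_max:.4f}")
print(f"standardized: {skew_standard:.4f}")
```

If the skew itself is a problem for the model, a separate transformation is needed; scaling alone will not remove it.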

 

Code:

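import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Reload the data so this example starts from the original values
data = pd.read_csv("titanic.csv")

# Standardize both columns to mean 0 and standard deviation 1
standard_scaler = StandardScaler()
data[['Fare', 'Age']] = standard_scaler.fit_transform(data[['Fare', 'Age']])

# Plot the distributions after scaling
sns.displot(data['Fare'])
sns.displot(data['Age'])
plt.show()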

Output:

[Figure: distribution plots of Fare and Age after Standardization]

After applying Standardization, the mean of each variable becomes 0 and its standard deviation becomes 1.
