The data we get for training a machine learning model often contains categorical columns. Some machine learning models cannot work with such columns directly, so they have to be handled beforehand.
For example:
Salary and Country are both independent features that we want to use in a machine learning model for some prediction. But as these two features are categorical in nature, we can’t simply feed them to the model. Some sort of encoding is required.
Now the question is:
Is it mandatory to encode categorical variables before modeling?
The answer is: not always.
Some machine learning algorithms handle categorical data by themselves, whereas the rest expect only numerical columns. CatBoost and LightGBM are models to which you can pass a data set containing both numerical and categorical data. You just need to tell the model which columns are categorical so that it handles them internally.
Models like Logistic Regression, Linear Regression and Random Forest, as implemented in common libraries such as scikit-learn, work only on numerical data, and XGBoost has traditionally had the same requirement (recent releases offer optional categorical support). Hence encoding is needed before using these algorithms.
Now that we understand why categorical data needs to be encoded, let us look at the techniques available for this purpose.
1) Manual Labelling:
We can create a dictionary which will map the categories to some numbers. This approach is simple and effective, as we know which number to assign for which category.
Code:
import pandas as pd
Employee = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
Salary = ['High', 'Low', 'Medium', 'High', 'High', 'Low', 'Medium', 'Medium', 'Low', 'Medium']
Country = ['India', 'US', 'US', 'US', 'India', 'US', 'India', 'India', 'India', 'US']
df = pd.DataFrame()
df['Employee'] = Employee
df['Salary'] = Salary
df['Country'] = Country
# Map each category to a number that reflects its rank
to_replace_map = {"High": 3, "Medium": 2, "Low": 1}
df['Salary_manually_mapped'] = df['Salary'].map(to_replace_map)
df
We have created a dictionary for this purpose. Salary “High” is replaced with the largest number, i.e. 3, and Salary “Low” is replaced with the smallest number, i.e. 1.
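One thing worth checking with a manual mapping is whether every category in the column actually appears in the dictionary. The sketch below assumes the mapping is applied with `Series.map`, which turns unmapped values into NaN; the value "Very High" is a made-up category used only to illustrate the check.

```python
import pandas as pd

# "Very High" is a hypothetical category that is missing from the mapping
salary = pd.Series(["High", "Low", "Medium", "Very High"], name="Salary")
salary_map = {"High": 3, "Medium": 2, "Low": 1}

# Series.map returns NaN for any value that is not a key of the dictionary
mapped = salary.map(salary_map)

# Collect the categories that were left unmapped so they can be reviewed
unmapped = salary[mapped.isna()].unique().tolist()
print(unmapped)
```

If the list is not empty, the dictionary needs to be extended before the column is passed to a model.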
2) Label Encoding:
The same task can be achieved automatically with Label Encoding. But it comes with its own limitations.
Code:
from sklearn.preprocessing import LabelEncoder
categ = ['Salary', 'Country']
label_encoded = ['Salary_label_encoded', 'Country_label_encoded']
le = LabelEncoder()
# apply() calls fit_transform once per column, so the encoder is refit for each column
df[label_encoded] = df[categ].apply(le.fit_transform)
df
Output:
As you can see, label encoding has been performed automatically on the Salary and Country columns. We didn’t have to specify the mapping manually. But there is a significant problem with this approach.
The encoding of the Country column is fine, as there are only two categories, but there is a problem with the Salary encoding. If you look closely, “High” has been replaced with 0, “Medium” with 2 and “Low” with 1, because LabelEncoder numbers the categories in alphabetical order.
This can mislead the model during training. The Salary column is ordinal in nature (High > Medium > Low), so the encoding should preserve that order; otherwise we lose useful information. The right encoding would be “High” to 3, “Medium” to 2 and “Low” to 1.
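To see where these numbers come from, the fitted encoder can be inspected. The following is a minimal sketch that rebuilds only the Salary values from the example; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

salary = ["High", "Low", "Medium", "High", "High",
          "Low", "Medium", "Medium", "Low", "Medium"]

le = LabelEncoder()
codes = le.fit_transform(salary)

# classes_ holds the categories in sorted order; the position is the assigned code
learned = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(learned)
```

The mapping follows the sorted order of the category names, not their meaning.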
Therefore, to fix this issue, you can either go with manual labelling, as shown in the first part, or with One Hot Encoding.
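A further option, not covered in this article, is scikit-learn's OrdinalEncoder, which accepts the category order explicitly. The sketch below is one possible way to use it; the column name "Salary_ordinal" and the "+ 1" shift (to match the Low=1, Medium=2, High=3 convention used earlier) are choices made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_salary = pd.DataFrame({
    "Salary": ["High", "Low", "Medium", "High", "High",
               "Low", "Medium", "Medium", "Low", "Medium"]
})

# One list of categories per encoded column, ordered from lowest to highest
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])

# fit_transform returns a 2-D float array with codes 0, 1, 2;
# flatten it, shift by 1 and cast to int to get 1, 2, 3
encoded = encoder.fit_transform(df_salary[["Salary"]]).ravel() + 1
df_salary["Salary_ordinal"] = encoded.astype(int)
print(df_salary)
```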
3) One Hot Encoding:
In this technique, a new column is added for each category and values will be 0 or 1 only.
Code:
import pandas as pd
# Keep the result in a new variable so that df still contains the original Salary column
df_onehot = pd.get_dummies(df, columns=['Salary'], prefix=['Salary_OneHot'], dtype=int)
df_onehot
This code creates as many new columns as there are distinct categories in the original column, drops the original column by default, and adds “Salary_OneHot” as a prefix to every new column.
There are 3 distinct categories in the Salary column, so 3 new columns are created. However, 2 columns carry the same information as 3: if a row is neither “Low” nor “Medium”, it must be “High”. We can drop the redundant column by setting one additional parameter, “drop_first=True”.
Code:
# drop_first=True removes the column for the first category in sorted order ("High")
df_onehot = pd.get_dummies(df, columns=['Salary'], prefix=['Salary_OneHot'], dtype=int, drop_first=True)
df_onehot
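Output:
The Salary column is replaced by two columns, Salary_OneHot_Low and Salary_OneHot_Medium. A row with 0 in both columns has a “High” salary.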
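scikit-learn's OneHotEncoder is an alternative to pd.get_dummies; unlike get_dummies, it can be fitted once on training data and then reused on new data. The following is a minimal sketch on the Salary values only; the column names are built by hand here to mirror the “Salary_OneHot” prefix, which is a choice made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

salary_df = pd.DataFrame({
    "Salary": ["High", "Low", "Medium", "High", "High",
               "Low", "Medium", "Medium", "Low", "Medium"]
})

# drop="first" removes the first category in sorted order, like drop_first=True
encoder = OneHotEncoder(drop="first")

# The default output is a sparse matrix; convert it to a dense array
encoded = encoder.fit_transform(salary_df[["Salary"]]).toarray()

# categories_[0] lists the sorted categories; skip the first one, which was dropped
kept = list(encoder.categories_[0][1:])
onehot = pd.DataFrame(
    encoded.astype(int),
    columns=[f"Salary_OneHot_{category}" for category in kept],
)
print(onehot.head())
```

The fitted encoder can later be applied to unseen rows with encoder.transform, which keeps the column layout consistent between training and prediction.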