
Hands-on Linear Regression Using Sklearn

In this tutorial, we'll use scikit-learn to fit a Linear Regression model and make predictions with it.

Problem Statement

The data set is available at https://www.kaggle.com/c/boston-housing. It can also be imported directly from scikit-learn, in versions before 1.2 (load_boston was removed in that release).

  • The Boston Housing dataset contains information about various houses in Boston through different parameters. 
  • There are 506 samples and 13 feature variables in this dataset. 

The objective is to predict house prices from the given features.

Let's get started with the hands-on exercise.

1. Importing Libraries

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

2. Loading and Overview of Data

Code

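# note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
print("Shape of boston data: ", boston.data.shape)
print(boston.feature_names)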

The returned Bunch object exposes, among others, the keys 'data', 'target', 'feature_names' and 'DESCR'. The data has 506 rows and 13 feature variables; note that this does not include the target variable. The column names are available in boston.feature_names. Details about the features and more information about the dataset can be printed with boston.DESCR.

Code

print(boston.DESCR)

Before applying any EDA or model, we convert the data to a pandas data frame by passing boston.data to pd.DataFrame. We also add the target variable from boston.target to the data frame as a PRICE column.

Code

# build the data frame, using the feature names as column labels
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
bos['PRICE'] = boston.target
bos.head()
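
The later steps refer to a feature matrix X and a target vector Y, which are taken from this data frame. A minimal sketch of that separation, using a tiny made-up frame (the values, the column subset, and the names `toy`, `X_toy` and `Y_toy` are illustrative assumptions, not rows from the Boston data):

```python
import pandas as pd

# Toy frame standing in for `bos`; the values are invented for illustration.
toy = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1],
    "LSTAT": [4.9, 12.4, 3.1],
    "PRICE": [24.0, 18.5, 33.2],
})

# Features: every column except the target. Target: the PRICE column.
X_toy = toy.drop(columns="PRICE")
Y_toy = toy["PRICE"]

print(X_toy.shape)  # (3, 2)
print(Y_toy.shape)  # (3,)
```

The same two lines applied to `bos` would give the X and Y used in the train and test split.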

3. Train and Test Split of Data

  1. We will split the data into 2 parts, i.e. we will use 80% of the data to build the model, and keep the remaining 20% unseen to validate how well the model generalizes.
  2. We will standardize all the input features so that they share the same scale. You can refer to the concepts of standardization and normalization in the Probability and Statistics module.

Code

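# separate the features from the target
X = bos.drop('PRICE', axis=1)
Y = bos['PRICE']
# hold out 20% of the rows as the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)
# fit the scaler on the training data only, then apply it to both splits
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)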
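
To make the split and the standardization step more concrete, here is a rough NumPy sketch of the same idea on synthetic data. The data, the variable names and the manual shuffling are assumptions made for illustration; this is not how scikit-learn implements these steps internally, only what they amount to conceptually.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic features on very different scales (illustrative only).
X = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(100, 2))

# Shuffle the row indices and hold out the last 20% as the test set.
idx = rng.permutation(len(X))
n_test = int(0.2 * len(X))
train_idx, test_idx = idx[:-n_test], idx[-n_test:]
X_tr, X_te = X[train_idx], X[test_idx]

# Learn the mean and standard deviation from the training rows only...
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

# ...and apply the same statistics to both splits.
X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma

print(X_tr_std.mean(axis=0).round(6))  # approximately 0 for each column
print(X_tr_std.std(axis=0).round(6))   # approximately 1 for each column
```

Reusing the training statistics for the test rows mirrors calling fit_transform on the training set and transform on the test set, so no information from the held-out data leaks into the scaling.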

4. Building the model using scikit-learn

Code

# creating the model
lin_reg_model = LinearRegression()
# fitting the model with train data
lin_reg_model.fit(X_train, Y_train)
# predicting on the held-out test data
Y_pred = lin_reg_model.predict(X_test)
# weights and intercept of the model features
optimal_W = lin_reg_model.coef_
optimal_b = lin_reg_model.intercept_
print("Optimal W: ", optimal_W)
print("Optimal intercept(bias): ", np.round(optimal_b, 3))
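
Linear regression chooses the weights and the intercept that minimize the sum of squared residuals. As a hedged illustration of what the fitted weights and intercept represent, the sketch below solves the same kind of least-squares problem with NumPy on synthetic data. The data, `true_w`, `true_b` and the other names are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data generated from known weights (illustrative only).
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is learned as an extra weight.
X_aug = np.column_stack([X, np.ones(len(X))])

# Solve the least-squares problem min ||X_aug @ theta - y||^2.
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = theta[:-1], theta[-1]

print("weights:", np.round(w_hat, 2))
print("intercept:", np.round(b_hat, 2))
```

The recovered weights and intercept should land close to the values used to generate the data, which is the role coef_ and intercept_ play for the fitted model above.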

Let us compute the evaluation metrics we discussed for linear regression: the mean squared error (MSE) and the root mean squared error (RMSE).

Code

# error and evaluation metrics of the model
error = Y_test - Y_pred
MSE = (1/X_test.shape[0]) * np.sum(error**2)
RMSE = np.sqrt(MSE)
print("MSE: ", np.round(MSE, 3))
print("RMSE: ", np.round(RMSE, 3))
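
As a small worked example of these two metrics, the sketch below wraps them in a helper and applies it to three made-up values. The function name `mse_rmse` and the numbers are hypothetical, chosen so the arithmetic can be checked by hand.

```python
import numpy as np

def mse_rmse(y_true, y_pred):
    """Return (MSE, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return mse, np.sqrt(mse)

# Errors are 1, 0 and -2, so MSE = (1 + 0 + 4) / 3.
mse, rmse = mse_rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(round(mse, 3), round(rmse, 3))  # 1.667 1.291
```

Because RMSE is the square root of MSE, it is expressed in the same units as the target, which makes it easier to relate to the house prices themselves.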

4. Visualizations

4.1 Plotting the predicted prices against the actual prices

plt.style.use('fivethirtyeight')
plt.figure(figsize = (20,6))
plt.subplot(121)
# actual prices on the x-axis, predicted prices on the y-axis
plt.plot(Y_test, Y_pred, 'ro')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")

4.2 Plotting the distribution of house prices

plt.style.use('fivethirtyeight')
plt.figure(figsize = (20,6))
plt.subplot(121)
# newer seaborn releases use bw_method and fill in place of bw and shade
sns.kdeplot(Y_pred, bw_method = 0.5, color = "r", fill = True)
plt.xlabel("Predicted Price")
plt.ylabel("Distribution")
plt.title("With Sklearn")

4.3 Plotting the error distribution

plt.style.use('fivethirtyeight')
plt.figure(figsize = (20,6))
plt.subplot(121)
# newer seaborn releases use bw_method and fill in place of bw and shade
sns.kdeplot(np.array(error), bw_method = 0.5, color = "r", fill = True)
plt.xlabel("Error = Actual - Predicted")
plt.ylabel("Error Distribution")
plt.title("With Sklearn")
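
A density plot is easier to read alongside a few summary numbers. The sketch below is one possible way to summarize an error array; the helper name `summarize_errors` and the toy residuals are assumptions for illustration, and in the tutorial the input would be the `error` values computed earlier.

```python
import numpy as np

def summarize_errors(errors):
    """Return mean, standard deviation and the 5th/50th/95th percentiles."""
    errors = np.asarray(errors, dtype=float)
    p5, p50, p95 = np.percentile(errors, [5, 50, 95])
    return {"mean": errors.mean(), "std": errors.std(),
            "p5": p5, "median": p50, "p95": p95}

# Made-up residuals for illustration only.
toy_errors = [-2.0, -1.0, 0.0, 1.0, 2.0]
summary = summarize_errors(toy_errors)
print(summary["mean"], summary["median"])
```

A mean and median close to zero, with percentiles roughly symmetric around it, would suggest the model is not systematically over- or under-predicting.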

5. Conclusion

The scatter plot of actual against predicted house prices shows how well our model is predicting. I hope this first hands-on tutorial on building a machine learning model was fun. To understand the problem better, you can also try different models on the same data; this may give better results as well as a deeper understanding.
