Padhai Time

Hands-on Linear Regression Using Sklearn

In this tutorial, we'll look at how to use scikit-learn to predict using a Linear Regression model.

Problem Statement

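The dataset is available at https://www.kaggle.com/c/boston-housing. It can also be loaded from scikit-learn itself with load_boston. Note that load_boston was removed in scikit-learn 1.2, so the code below needs an earlier version.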

The Boston Housing dataset contains information about various houses in Boston through different parameters.
There are 506 samples and 13 feature variables in this dataset.

The objective is to predict house prices using the given features.

Let's get started with the hands-on exercise.

1. Importing Libraries

%matplotlib inline

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

2. Loading and Overview of Data

Code

from sklearn.datasets import load_boston

boston = load_boston()

print("Shape of boston data: ",boston.data.shape)

print(boston.feature_names)

Output

The returned bunch contains the keys [‘data’, ‘target’, ‘feature_names’, ‘DESCR’]. The data has 506 rows and 13 feature variables. Notice that this doesn’t include the target variable. The column names are available in boston.feature_names. Details about the features and more information about the dataset can be seen by printing boston.DESCR.

Code

print(boston.DESCR)

Output

Before applying any EDA or model, we convert the data to a pandas DataFrame by passing boston.data to pd.DataFrame. We also add the target variable from boston.target to the DataFrame as a PRICE column.

Code

bos = pd.DataFrame(boston.data)

bos['PRICE'] = boston.target

bos.head()

Output
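As a minimal sketch of the conversion described above, the snippet below builds a DataFrame from a small toy array instead of the real dataset. The arrays data, feature_names and target are hypothetical stand-ins for boston.data, boston.feature_names and boston.target, and the numbers in them are made up for illustration. Passing columns= is an optional extra that labels the columns with the feature names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for boston.data, boston.feature_names, boston.target
# (toy values, not taken from the real dataset)
data = np.array([[0.1, 6.5],
                 [0.2, 5.9],
                 [0.3, 7.1]])
feature_names = ["CRIM", "RM"]
target = np.array([24.0, 21.6, 34.7])

# columns= replaces the default 0, 1, ... labels with the feature names
df = pd.DataFrame(data, columns=feature_names)

# the target becomes one more column, as with bos['PRICE'] above
df["PRICE"] = target

print(df.head())
```

With named columns, the features and the target can later be separated by name rather than by position.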

3. Train and Test Split of Data

We will split the data into two parts: 80% of the data is used to build the model, and the remaining 20% is held out as unseen data to check how well the model generalizes.
We then standardize all the input features so that they are on the same scale. The scaler is fitted on the training data only and then applied to the test data. You can refer to the concepts of standardization and normalization in the Probability and Statistics module.

Code

# separating the input features from the target

X = bos.drop('PRICE', axis = 1)

Y = bos['PRICE']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

print(X_train.shape)

print(X_test.shape)

print(Y_train.shape)

print(Y_test.shape)

Output
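To make the split and the standardization step more concrete, here is a small sketch on synthetic data. The arrays X_demo and y_demo are random values invented for this illustration. It checks the 80/20 shapes and shows that StandardScaler amounts to subtracting the training mean and dividing by the training standard deviation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed only for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.normal(size=100)

# 80% train, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# The same transformation by hand, using training statistics
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # ddof=0, matching StandardScaler
X_te_manual = (X_te - mu) / sigma

print(np.allclose(X_te_scaled, X_te_manual))
```

Using only training statistics keeps information about the test set out of the preprocessing step.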

4. Building the model using scikit-learn

Code

# loading the model

lin_reg_model = LinearRegression()

# fitting the model with train data

lin_reg_model.fit(X_train, Y_train)

# predicting on the 20% test data

Y_pred = lin_reg_model.predict(X_test)

# weights and intercept of the model features

optimal_W = lin_reg_model.coef_

optimal_b = lin_reg_model.intercept_

print("Optimal W: ",optimal_W)

print("Optimal intercept(bias): ",np.round(optimal_b,3))

Output
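To see what the weights and intercept represent, the sketch below fits LinearRegression on synthetic data and compares the result with an ordinary least-squares solution from NumPy. The data, true_w and true_b are assumptions made up for this example. It illustrates the idea only and does not describe how scikit-learn is implemented internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from known weights and bias (assumed values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y_demo = X_demo @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Fit with scikit-learn
model_demo = LinearRegression().fit(X_demo, y_demo)

# Least squares by hand: a column of ones lets the last
# coefficient play the role of the intercept
X_aug = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
w_ls, b_ls = solution[:-1], solution[-1]

print("sklearn W:", np.round(model_demo.coef_, 3))
print("lstsq   W:", np.round(w_ls, 3))
print("sklearn b:", np.round(model_demo.intercept_, 3))
print("lstsq   b:", np.round(b_ls, 3))
```

Both approaches should recover values close to true_w and true_b, since the noise added here is small.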

Let us evaluate the various metrics we discussed during linear regression.

Code

# error and evaluation metrics of the model

error = Y_test - Y_pred

MSE = (1/X_test.shape[0]) * np.sum(error**2)

RMSE = np.sqrt(MSE)

print("MSE: ",np.round(MSE,3))

print("RMSE: ",np.round(RMSE,3))

Output
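As a quick check of the formulas above, this sketch computes MSE and RMSE by hand on a few toy values and compares the MSE with mean_squared_error from sklearn.metrics. The arrays y_true and y_hat are made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy actual and predicted values (assumed for illustration)
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([25.0, 20.6, 32.7, 33.4])

# Manual computation: mean of the squared errors, then its square root
err = y_true - y_hat
mse_manual = np.mean(err ** 2)
rmse_manual = np.sqrt(mse_manual)

# The same MSE using scikit-learn
mse_sklearn = mean_squared_error(y_true, y_hat)

print("MSE (manual): ", np.round(mse_manual, 3))
print("MSE (sklearn):", np.round(mse_sklearn, 3))
print("RMSE:         ", np.round(rmse_manual, 3))
```

RMSE is in the same units as the target, which makes it easier to interpret than MSE.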

4. Visualizations

4.1 Plotting the predicted prices against the actual prices

plt.figure(figsize = (20,6))

plt.style.use('fivethirtyeight')

plt.subplot(121)

plt.plot(Y_test,Y_pred,'ro')

plt.xlabel("Actual Price")

plt.ylabel("Predicted Price")

Output
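One optional way to make such a plot easier to read is to add a y = x reference line, where a perfect prediction would fall. The sketch below uses toy values; actual and predicted are hypothetical stand-ins for Y_test and Y_pred, and the reference line is an addition not present in the plot above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy values standing in for Y_test and Y_pred (assumed for illustration)
actual = np.array([24.0, 21.6, 34.7, 33.4, 18.9])
predicted = np.array([25.1, 20.3, 31.9, 34.0, 19.5])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, color="r")

# Reference line y = x: points on it are predicted exactly
low = min(actual.min(), predicted.min())
high = max(actual.max(), predicted.max())
ax.plot([low, high], [low, high], color="k", linestyle="--")

ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
```

Points above the line are over-predicted and points below it are under-predicted.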

4.2 Plotting the distribution of house prices

plt.figure(figsize = (20,6))

plt.style.use('fivethirtyeight')

plt.subplot(121)

sns.kdeplot(Y_pred, bw = 0.5, color = "r", shade = True)

plt.xlabel("Predicted Price")

plt.ylabel("Distribution")

plt.title("With Sklearn")

Output

4.3 Plotting the error distribution

plt.figure(figsize = (20,6))

plt.style.use('fivethirtyeight')

plt.subplot(121)

sns.kdeplot(np.array(error), bw = 0.5, color = "r", shade = True)

plt.xlabel("Error = Actual - Predicted")

plt.ylabel("Error Distribution")

plt.title("With Sklearn")

Output

5. Conclusion

Plotting the actual house prices against the predicted house prices shows how well our model is predicting. I hope this first hands-on tutorial on building a machine learning model was fun. To understand the problem better, you can also try different models on it. Doing so may give better results as well as a deeper understanding of the data.
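As one possible starting point for trying different models, the sketch below compares LinearRegression with Ridge using RMSE. It runs on synthetic data from make_regression rather than the Boston dataset, and the choice of Ridge and of alpha=1.0 is an assumption made for illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumed stand-in for the housing data)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=13, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same preprocessing as in the tutorial: fit on train, apply to test
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# Candidate models to compare
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
}

rmse_by_model = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    preds = estimator.predict(X_te)
    rmse_by_model[name] = np.sqrt(mean_squared_error(y_te, preds))

for name, score in rmse_by_model.items():
    print(name, np.round(score, 3))
```

Because every model is scored on the same held-out test set, the RMSE values can be compared directly.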
