In this tutorial, we'll look at how to use scikit-learn to make predictions with a Linear Regression model.
The data set is available at https://www.kaggle.com/c/boston-housing. We can also load it directly from scikit-learn.
The objective is to predict house prices from the given features.
Let's get started with the hands-on exercise.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
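# note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston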
boston = load_boston()
print("Shape of boston data: ",boston.data.shape)
print(boston.feature_names)
There are 4 keys in the bunch: ['data', 'target', 'feature_names', 'DESCR']. The data has 506 rows and 13 feature variables. Notice that this doesn't include the target variable. The column names are available in boston.feature_names. Details about the features and more information about the dataset can be seen by printing boston.DESCR.
print(boston.DESCR)
We must convert this to a pandas data frame before applying any EDA or model, which we can do by passing boston.data to pd.DataFrame. We also add the target variable from boston.target to the dataframe as a PRICE column.
bos = pd.DataFrame(boston.data, columns = boston.feature_names)
bos['PRICE'] = boston.target
bos.head()
# separate the features (X) from the target (Y)
X = bos.drop('PRICE', axis = 1)
Y = bos['PRICE']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 42)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
# loading the model
lin_reg_model = LinearRegression()
# fitting the model with train data
lin_reg_model.fit(X_train, Y_train)
# predicting on the 33% of the data held out for testing
Y_pred = lin_reg_model.predict(X_test)
# weights and intercept of the model features
optimal_W = lin_reg_model.coef_
optimal_b = lin_reg_model.intercept_
print("Optimal W: ",optimal_W)
print("Optimal intercept(bias): ",np.round(optimal_b,3))
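One way to sanity-check the fitted parameters: because StandardScaler centres every training feature at zero, the intercept of an ordinary least-squares fit should equal the mean of the training targets. The sketch below illustrates this on synthetic data from make_regression. The synthetic data is an assumption, used only so that the snippet runs without the Boston dataset, and the `_scaled` variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption: 506 rows, 13 features)
X, Y = make_regression(n_samples=506, n_features=13, noise=10.0,
                       bias=22.5, random_state=42)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/std on the training split only
X_test_scaled = sc.transform(X_test)        # reuse the training statistics on the test split

model = LinearRegression().fit(X_train_scaled, Y_train)

# After scaling, every training column has (numerically) zero mean,
# so the fitted intercept matches the mean of the training targets.
print("Column means after scaling:", np.round(X_train_scaled.mean(axis=0), 6))
print("Intercept:", np.round(model.intercept_, 3))
print("Mean of Y_train:", np.round(Y_train.mean(), 3))
```

Note that the scaler is fitted on the training split only, so no information from the test split leaks into the model.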
Let us evaluate the various metrics we discussed during linear regression.
# error and evaluation metrics of the model
error = Y_test - Y_pred
MSE = (1/X_test.shape[0]) * np.sum(error**2)
RMSE = np.sqrt(MSE)
print("MSE: ",np.round(MSE,3))
print("RMSE: ",np.round(RMSE,3))
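The hand-computed metrics can be cross-checked against scikit-learn's own implementations. The following is a minimal sketch using small hypothetical arrays in place of Y_test and Y_pred (the numbers are made up for illustration and are not results from the model above). It also computes R², which is not used above but is a common companion metric.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values standing in for Y_test and Y_pred
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_hat = np.array([25.1, 20.9, 31.0, 30.2, 35.5])

# Manual computation, mirroring the formulas used above
error = y_true - y_hat
mse_manual = np.mean(error ** 2)   # same as (1/n) * sum(error**2)
rmse_manual = np.sqrt(mse_manual)

# Library computation for comparison
mse_sklearn = mean_squared_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)

print("Manual MSE: ", np.round(mse_manual, 3))
print("sklearn MSE:", np.round(mse_sklearn, 3))
print("RMSE:       ", np.round(rmse_manual, 3))
print("R^2:        ", np.round(r2, 3))
```

If the two MSE values disagree, the usual culprit is a shape mismatch between the two arrays that causes unintended broadcasting.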
4.1 Plotting the model fitted line on the output variable.
plt.figure(figsize = (20,6))
plt.style.use('fivethirtyeight')
plt.subplot(121)
plt.plot(Y_test,Y_pred,'ro')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
4.2 Plotting the distribution of house prices
plt.figure(figsize = (20,6))
plt.style.use('fivethirtyeight')
plt.subplot(121)
sns.kdeplot(Y_pred, bw_method = 0.5, color = "r", fill = True)
plt.xlabel("Predicted Price")
plt.ylabel("Distribution")
plt.title("With Sklearn")
4.3 Plotting the error distribution
plt.figure(figsize = (20,6))
plt.style.use('fivethirtyeight')
plt.subplot(121)
sns.kdeplot(np.array(error), bw_method = 0.5, color = "r", fill = True)
plt.xlabel("Error = Actual - Predicted")
plt.ylabel("Error Distribution")
plt.title("With Sklearn")
We can see how well our model is predicting by drawing a scatter plot of the original house prices against the predicted house prices. I hope this first hands-on tutorial on building a machine learning model was fun. To understand it better, you can also try different models on the same problem; doing so may give better results and will deepen your understanding of the problem.
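As a starting point for trying different models, here is a minimal sketch that fits three linear estimators on the same split and compares their test RMSE. It uses synthetic data from make_regression rather than the Boston data (an assumption, so that the snippet is self-contained), and the choice of Ridge and Lasso and their alpha values is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption: 506 rows, 13 features)
X, Y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

# Candidate models; the alpha values are arbitrary illustrative choices
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
    "Lasso(alpha=0.1)": Lasso(alpha=0.1),
}

rmse_by_model = {}
for name, estimator in candidates.items():
    # The pipeline fits the scaler on the training data only, then the estimator
    pipe = make_pipeline(StandardScaler(), estimator)
    pipe.fit(X_train, Y_train)
    error = Y_test - pipe.predict(X_test)
    rmse_by_model[name] = float(np.sqrt(np.mean(error ** 2)))

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.3f}")
```

Wrapping the scaler and the estimator in one pipeline keeps the preprocessing identical across models, so any difference in RMSE comes from the estimator itself.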