Case Study-2-House Sale Price Prediction

Linear Regression Case Study: House Sale Price Prediction

In this blog, you will find one more example of a linear regression case study. This one is about predicting the sale prices of properties based on important factors such as area, number of bedrooms, parking, etc.

Step1: Importing and Understanding Data.

 
import pandas as pd
import numpy as np

# Importing Housing.csv
housing = pd.read_csv('Housing.csv')

# Looking at the first five rows
housing.head()

Output:

   price  area  bedrooms  bathrooms  stories  mainroad  guestroom  basement  hotwaterheating  airconditioning  parking  prefarea  furnishingstatus
13300000  7420         4          2        3       yes         no        no               no              yes        2       yes         furnished
12250000  8960         4          4        4       yes         no        no               no              yes        3        no         furnished
12250000  9960         3          2        2       yes         no       yes               no               no        2       yes    semi-furnished
12215000  7500         4          2        2       yes         no       yes               no              yes        3       yes         furnished
11410000  7420         4          1        2       yes        yes       yes               no              yes        2        no         furnished

Step2: Data Preparation.

# Converting 'yes' to 1 and 'no' to 0
housing['mainroad'] = housing['mainroad'].map({'yes': 1, 'no': 0})
housing['guestroom'] = housing['guestroom'].map({'yes': 1, 'no': 0})
housing['basement'] = housing['basement'].map({'yes': 1, 'no': 0})
housing['hotwaterheating'] = housing['hotwaterheating'].map({'yes': 1,'no': 0})
housing['airconditioning'] = housing['airconditioning'].map({'yes': 1,'no': 0})
housing['prefarea'] = housing['prefarea'].map({'yes': 1, 'no': 0})
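
The six mappings above all follow the same pattern, so they can also be written as a loop. The following is a minimal sketch on a small hypothetical DataFrame (`toy`, which stands in for the real Housing.csv), assuming the columns contain only the strings 'yes' and 'no':

```python
import pandas as pd

# Hypothetical toy data standing in for Housing.csv
toy = pd.DataFrame({
    'mainroad': ['yes', 'no', 'yes'],
    'guestroom': ['no', 'no', 'yes'],
})

binary_cols = ['mainroad', 'guestroom']
yes_no = {'yes': 1, 'no': 0}

# Apply the same yes/no mapping to every binary column
for col in binary_cols:
    toy[col] = toy[col].map(yes_no)

# map() turns any value missing from the dictionary into NaN,
# so a quick null check catches unexpected spellings such as 'Yes'
assert not toy[binary_cols].isnull().values.any()
print(toy)
```

The null check is worth keeping when the same idea is applied to real data, because a silently introduced NaN would otherwise surface much later, during model fitting.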

Step3: Creating dummy variables for variable ‘furnishingstatus’ and dropping the first one.

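# dtype=int keeps the dummies as 0/1 integers; recent pandas versions return
# booleans by default, which the normalisation step later on cannot subtract
status = pd.get_dummies(housing['furnishingstatus'], drop_first=True, dtype=int)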
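
To see what `drop_first` does, here is a small sketch on a hypothetical Series (`furnishing`, not the real column). With three categories, only two dummy columns are kept, and the dropped category is represented by zeros in both. `dtype=int` is used here so the result is 0/1 integers.

```python
import pandas as pd

# Hypothetical sample of the three furnishing categories
furnishing = pd.Series(['furnished', 'semi-furnished', 'unfurnished', 'furnished'])

# Categories are sorted alphabetically, so 'furnished' is the one dropped
dummies = pd.get_dummies(furnishing, drop_first=True, dtype=int)
print(dummies)
```

A row with 0 in both `semi-furnished` and `unfurnished` is therefore a furnished house, which is why the third column would be redundant in a regression.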

Step4: Adding the results to the master dataframe.

housing = pd.concat([housing,status],axis=1)

Step5: Dropping the variable ‘furnishingstatus’.

housing.drop(['furnishingstatus'], axis=1, inplace=True)

Step6: Create new variables.

 
# Let us create the new metric and assign it to "areaperbedroom"
housing['areaperbedroom'] = housing['area']/housing['bedrooms']
# Metric: bathrooms per bedroom
housing['bbratio'] = housing['bathrooms']/housing['bedrooms']
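
As a quick sanity check, the two new metrics can be worked out by hand for the first two rows shown in Step 1. This is only an illustrative sketch; the helper name `derived_metrics` is made up for this example.

```python
def derived_metrics(area, bedrooms, bathrooms):
    """Return (area per bedroom, bathrooms per bedroom) for one house."""
    return area / bedrooms, bathrooms / bedrooms

# (area, bedrooms, bathrooms) taken from the first two rows of the Step 1 output
rows = [(7420, 4, 2), (8960, 4, 4)]
for area, bedrooms, bathrooms in rows:
    print(derived_metrics(area, bedrooms, bathrooms))
```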

Step7: Data normalization. (Rescaling features)

# Defining a normalisation function: centre on the mean, scale by the range
def normalize(x):
    return (x - np.mean(x)) / (max(x) - min(x))

# Applying normalize() to all columns
housing = housing.apply(normalize)
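
The function above is a form of mean normalization. A minimal pure-Python sketch of the same formula, applied to the `stories` values from the five rows shown in Step 1, illustrates its effect: the scaled values average to zero and span a range of exactly one.

```python
def normalize(values):
    """Mean normalisation: (x - mean) / (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]

# 'stories' values from the five rows shown in Step 1
stories = [3, 4, 2, 2, 2]
scaled = normalize(stories)
print(scaled)
```

Note that the response variable `price` is rescaled too, so everything the model reports later (coefficients, RMSE) is on this normalized scale.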

Step8: Separate the feature variables (X) from the response variable (y).

# Putting feature variable to X
X = housing[['area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'semi-furnished', 'unfurnished',
       'areaperbedroom', 'bbratio']]

# Putting response variable to y
y = housing['price']

Step9: Split X and y into X_train, X_test, y_train,y_test.

 
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7 ,test_size = 0.3, random_state = 100)
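
The 70/30 split explains the row counts that appear later: 381 training observations in the model summary and 164 points in the test plot. The sketch below reproduces that arithmetic. It assumes the dataset has 545 rows (381 + 164) and that the training count is rounded down while the test count is rounded up, which is my understanding of how scikit-learn treats fractional sizes.

```python
import math

n_samples = 545                 # assumption: 381 training rows + 164 test rows
train_size, test_size = 0.7, 0.3

n_train = math.floor(train_size * n_samples)  # training count rounded down (assumed)
n_test = math.ceil(test_size * n_samples)     # test count rounded up (assumed)
print(n_train, n_test)
```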

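Step10: Define a function to calculate the variance inflation factor (VIF), which is used to identify and remove redundant variables. For each variable, the VIF is 1/(1 - R²), where R² comes from regressing that variable on all the other variables.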

 
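# statsmodels is used inside vif_cal
import statsmodels.api as sm

# User-defined function for calculating the VIF value of each variable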
def vif_cal(input_data, dependent_col):
    vif_df = pd.DataFrame( columns = ['Var', 'Vif'])
    x_vars=input_data.drop([dependent_col], axis=1)
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        y=x_vars[xvar_names[i]] 
        x=x_vars[xvar_names.drop(xvar_names[i])]
        rsq=sm.OLS(y,x).fit().rsquared  
        vif=round(1/(1-rsq),2)
        vif_df.loc[i] = [xvar_names[i], vif]
    return vif_df.sort_values(by = 'Vif', axis=0, ascending=False, inplace=False)
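
The core of the function is the line `vif = 1/(1 - rsq)`. The short sketch below shows how quickly the VIF grows as a variable becomes more predictable from the others, and how to read a VIF back as an R² value. The helper names are made up for this illustration.

```python
def vif_from_rsq(rsq):
    """VIF of a variable whose regression on the other variables has R^2 = rsq."""
    return 1.0 / (1.0 - rsq)

def rsq_from_vif(vif):
    """Inverse relation: the R^2 implied by a given VIF."""
    return 1.0 - 1.0 / vif

for rsq in (0.0, 0.5, 0.8, 0.9):
    print(rsq, round(vif_from_rsq(rsq), 2))

# A VIF of 2.35 means roughly 57% of that variable's variance
# is explained by the other variables
print(round(rsq_from_vif(2.35), 3))
```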

Step11: Use RFE to reduce variables.

For this linear regression case study, we will use the scikit-learn library.

 
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

Step12: Run RFE with the number of variables to select set to 9.

 
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=9) # running RFE
rfe = rfe.fit(X_train, y_train)
print(rfe.support_) # Printing the boolean results
print(rfe.ranking_) 

Output:

[ True False  True  True  True False False  True  True False  True False
 False  True  True]
[1 3 1 1 1 4 6 1 1 2 1 7 5 1 1]

Step13: Get the variables selected by RFE.

 
col = X_train.columns[rfe.support_]
X_train_rfe = X_train[col] 
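
To make the boolean mask readable, it can be paired with the column names. The sketch below does this in plain Python, using the feature list from Step 8 and the `support_` and `ranking_` values printed in Step 12. Rank 1 marks a selected variable, and higher ranks mark variables that were eliminated earlier.

```python
from itertools import compress

columns = ['area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
           'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
           'parking', 'prefarea', 'semi-furnished', 'unfurnished',
           'areaperbedroom', 'bbratio']

# Values copied from the RFE output in Step 12
support = [True, False, True, True, True, False, False, True, True,
           False, True, False, False, True, True]
ranking = [1, 3, 1, 1, 1, 4, 6, 1, 1, 2, 1, 7, 5, 1, 1]

# Columns kept by RFE
selected = list(compress(columns, support))
print(selected)

# Eliminated columns, starting with the one removed first (highest rank)
eliminated = sorted(((r, c) for r, c in zip(ranking, columns) if r > 1), reverse=True)
print([c for _, c in eliminated])
```

The selected list matches the nine variables that appear in the model summary in Step 15.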

Step14: Building a model using statsmodels.

 
# Adding a constant variable 
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)
lm = sm.OLS(y_train,X_train_rfe).fit() # Run the model

Step15: Let’s see the summary of our linear model.

print(lm.summary())

Output:

                           OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.660
Model:                            OLS   Adj. R-squared:                  0.652
Method:                 Least Squares   F-statistic:                     80.14
Date:                Mon, 13 Apr 2020   Prob (F-statistic):           1.88e-81
Time:                        15:56:09   Log-Likelihood:                 369.54
No. Observations:                 381   AIC:                            -719.1
Df Residuals:                     371   BIC:                            -679.7
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               0.0034      0.005      0.704      0.482      -0.006       0.013
area                0.7022      0.130      5.421      0.000       0.447       0.957
bathrooms           0.1718      0.098      1.759      0.079      -0.020       0.364
stories             0.0814      0.019      4.321      0.000       0.044       0.118
mainroad            0.0647      0.014      4.470      0.000       0.036       0.093
hotwaterheating     0.1002      0.022      4.523      0.000       0.057       0.144
airconditioning     0.0776      0.011      6.806      0.000       0.055       0.100
prefarea            0.0631      0.012      5.286      0.000       0.040       0.087
areaperbedroom     -0.4095      0.143     -2.868      0.004      -0.690      -0.129
bbratio             0.1156      0.080      1.450      0.148      -0.041       0.272
==============================================================================
Omnibus:                       85.512   Durbin-Watson:                   2.108
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              273.429
Skew:                           0.998   Prob(JB):                     4.22e-60
Kurtosis:                       6.638   Cond. No.                         46.6
==============================================================================
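
Several numbers in the summary can be cross-checked from the others. The sketch below recomputes the adjusted R-squared, AIC and BIC from the reported R-squared, log-likelihood and observation count, using the standard textbook formulas. The inputs are the rounded values printed above, so the results agree only to the printed precision.

```python
import math

# Values read from the summary above
n_obs = 381
k_predictors = 9          # Df Model
r_squared = 0.660
log_likelihood = 369.54

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k_predictors - 1)

# AIC and BIC count the intercept as a parameter as well
n_params = k_predictors + 1
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(round(adj_r_squared, 3), round(aic, 1), round(bic, 1))
```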

Step16: Calculate the VIF value for the variables and check each one. If a variable's VIF is high (a common rule of thumb is a value above 5), you can remove that variable from the model and re-run the model.

vif_cal(input_data=housing.drop(['area','bedrooms','stories','basement','semi-furnished','areaperbedroom'], axis=1), dependent_col="price")

Output:

               Var   Vif
0        bathrooms  2.35
8          bbratio  2.19
5          parking  1.12
4  airconditioning  1.11
1         mainroad  1.10
6         prefarea  1.09
7      unfurnished  1.07
2        guestroom  1.06
3  hotwaterheating  1.04

Step17: At this point the VIF values are all low, and most p-values are below 0.05 (bathrooms and bbratio are the exceptions in the summary above), so we can proceed to test our model on the test data.

 
X_test_rfe = X_test[col]
# Adding a constant variable 
X_test_rfe = sm.add_constant(X_test_rfe)
# Making predictions
y_pred = lm.predict(X_test_rfe)

Step18: Evaluate the model, which we built for the linear regression case study.

 
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
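# Jupyter notebook magic; remove this line when running as a plain Python script
%matplotlib inline

c = list(range(1, len(y_test) + 1)) # generating index, one entry per test row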
fig = plt.figure() 
plt.plot(c,y_test, color="blue", linewidth=2.5, linestyle="-") #Plotting Actual
plt.plot(c,y_pred, color="red",  linewidth=2.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Housing Price', fontsize=16)                       # Y-label 

Output:

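[Figure: line plot titled 'Actual and Predicted', with Index on the x-axis and Housing Price on the y-axis; actual values in blue, predicted values in red]
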
Step19: Compute the Root Mean Square Error (RMSE) of our model on the test data.

import numpy as np
from sklearn import metrics
print('RMSE :', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Output:

RMSE : 0.108203525381
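
For reference, here is a minimal pure-Python sketch of what the RMSE calculation does, followed by how a normalized RMSE could be converted back to price units. The sample values and the `price_range` figure are hypothetical; the real range is max(price) - min(price) of the original, unscaled column.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical normalised prices and predictions
actual = [0.10, -0.05, 0.20]
predicted = [0.12, -0.01, 0.14]
print(rmse(actual, predicted))

# The normalisation divides by (max - min), and the mean shift cancels out
# in the errors, so multiplying by that range restores the original units
price_range = 10_000_000   # hypothetical value, not taken from the dataset
print(0.108 * price_range)
```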

Conclusion

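This linear regression case study predicted the sale price of a house. The final model, built on the nine variables selected by RFE, has an adjusted R-squared of 0.652 on the training data and an RMSE of about 0.108 on the test data. Because all columns were normalized in Step 7, this RMSE is on the normalized price scale, not in currency units.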
