Regression - Melb house prices
My first attempt at an ML regression problem
- Import python libraries
- Import data
- Exploratory Data Analysis
- Data cleaning
- Select and train model
- Fine tune model
- Final model
This is my first attempt at a regression ML problem. I have previously followed a tutorial based on A. Geron's book on a similar project; time to try it by myself. I am starting with regression because it is the simplest kind of problem and I am familiar with it.
I chose Melbourne house price data since I know the place well. Ideally I would have done Jakarta, but I am still learning web scraping.
House prices data are taken from the Kaggle competition: https://www.kaggle.com/anthonypino/melbourne-housing-market
Things I will attempt to do on this notebook:
- Use interactive visualisation for as many charts as possible.
- Use geopy. This works when given a full address, but it takes forever. Doing it at the suburb level was unsuccessful.
# import python libraries
from datetime import datetime, timedelta
import os
import glob
import wget
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import json
import plotly.express as px
import plotly.graph_objs as go
# for offline ploting
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
from IPython.display import HTML
os.chdir("C:/Users/Riyan Aditya/Desktop/ML_learning/Project9_regression_melb_house_prices")
df = pd.read_csv('MELBOURNE_HOUSE_PRICES_LESS.csv')
df_b = pd.read_csv('Melbourne_housing_FULL.csv')
The Kaggle repo indicates that this data is from August 2018.
First, let's see how many properties there are:
len(df)
And the average price is:
round(df.Price.mean(),0)
Wow, around $1M? They are expensive.
df.columns
df.info()
df.describe()
df.hist(bins=50, figsize=(20,15))
plt.show()
# calculate the correlation matrix
corr = df.corr()
corr
df.isna().sum()
OK, that is a lot of null values (about 25% of prices are missing). Let's remove all rows with missing prices; after all, we are only interested in property prices.
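As a quick sanity check (a small addition, not in the original notebook), the fraction of missing values per column confirms that roughly a quarter of the prices are missing:
# fraction of missing values per column
df.isna().mean().sort_values(ascending=False)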
# delete all rows with empty prices
df = df[df['Price'].notna()]
import math
bin_width= 10**5
nbins = math.ceil((df["Price"].max() - df["Price"].min()) / bin_width)
fig1 = px.histogram(df, x="Price", nbins=nbins, marginal = 'rug')
#fig.update_traces(xbins_size = 10**5)
#fig.show()
HTML(fig1.to_html(include_plotlyjs='cdn'))
Most house prices are between 0 and 2M, but there are several outliers; the most expensive is 11.2M!
Since the price histogram is right-skewed, the median is perhaps a better indicator than the mean (especially when looking at prices per suburb).
Average price
df.Price.mean()
Median price
df.Price.median()
df.Suburb.value_counts()
# combine address with suburb
df['full_address'] = df['Address']+" "+df['Suburb']+' Victoria Australia'
df['full_address'][0]
# additional library for geocode
from geopy.geocoders import Nominatim
from functools import partial
import time
# geolocator = Nominatim(user_agent="melb_address")
# geocode = partial(geolocator.geocode, language="en")
# aaa = df['full_address'][0:10].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
# aaa
Hmm. Running geocoding to find coordinates for all 64k houses would take forever.
Groupby per suburb
Note: I attempted to extract coordinates per suburb using geopy, but was unsuccessful. Instead, I extract the lat/long from a dataset provided by the Kaggle OP.
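For reference, a per-suburb geocoding attempt might look like the sketch below, using geopy's RateLimiter to respect Nominatim's usage limits. This is the approach that proved too slow and unreliable here, so it stays commented out and the Kaggle lat/long data is used instead.
# sketch only: geocode each unique suburb with a 1 s delay between requests
# from geopy.extra.rate_limiter import RateLimiter
# geolocator = Nominatim(user_agent="melb_suburbs")
# geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# suburbs = pd.Series(df['Suburb'].unique())
# locations = suburbs.apply(lambda s: geocode(s + ' Victoria Australia'))
# suburb_coords = pd.DataFrame({
#     'Suburb': suburbs,
#     'Lattitude': locations.apply(lambda loc: loc.latitude if loc else np.nan),
#     'Longtitude': locations.apply(lambda loc: loc.longitude if loc else np.nan)})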
df2 = df.copy()
df2 = df2[['Suburb','Price','Rooms','Distance','Postcode']]
df3 = df2.groupby('Suburb').median()
df3 = df3.reset_index()
df_b2 = df_b.groupby('Suburb')[['Lattitude','Longtitude']].median().reset_index()
df3 = df3.merge(df_b2, on ='Suburb', how = 'inner')
px.set_mapbox_access_token(open("aa.mapbox_token.txt").read())
fig2 = px.scatter_mapbox(df3, lat = 'Lattitude',lon = 'Longtitude',color ='Price',
color_continuous_scale=px.colors.sequential.algae,
size="Price",hover_name='Suburb',center = dict(lat=-37.80316 , lon =144.996430 ))
HTML(fig2.to_html(include_plotlyjs='cdn'))
Is there any correlation between price and distance to the CBD?
fig3 = px.scatter(df, y='Price' ,x='Distance', title="Distance vs Price")
fig3.update_layout(yaxis_title='Price ($)')
HTML(fig3.to_html(include_plotlyjs='cdn'))
It looks like the further you are from the CBD, the lower the house prices tend to be.
corr['Price']
There is some negative correlation between price and distance, but it is not strong.
bins = [0,1,2,3,4,5,200]
label = ['<1km','1-2km','2-3km','3-4km','4-5km','the rest']
df_05 = df.groupby(pd.cut(df.Distance, bins))['Price'].median()
plt.bar(label,df_05)
plt.ylabel('Median house price')
Interesting. Dwellings in the CBD (perhaps smaller apartments) are relatively cheaper. Note that dwelling size is not a factor here, just the median price per distance band.
df.Rooms.value_counts()
df_rooms = df.groupby('Rooms')['Price'].median()
fig4 = px.bar(df_rooms, title="Rooms vs Price")
fig4.update_layout(yaxis_title='Median price')
HTML(fig4.to_html(include_plotlyjs='cdn'))
Makes sense: as the number of rooms increases, the house becomes more expensive.
df.Type.value_counts()
We have three types of dwellings: houses, units and townhouses.
Do they differ in price?
df_types = df.groupby('Type')['Price'].median()
fig5 = px.bar(df_types, title="Type vs Price")
fig5.update_layout(yaxis_title='Median price')
HTML(fig5.to_html(include_plotlyjs='cdn'))
Houses seem to be the most expensive, followed by townhouses, then units.
fig6 = px.box(df,x='Type', y = 'Price' ,title="Type vs Price")
fig6.update_layout(yaxis_title='Price ($)')
HTML(fig6.to_html(include_plotlyjs='cdn'))
import math
bin_width= 10**5
nbins = math.ceil((df["Price"].max() - df["Price"].min()) / bin_width)
fig7 = px.histogram(df, x="Price", nbins=nbins, color='Type',marginal = 'rug',
title='Melbourne houses prices based on unit type')
#fig.update_traces(xbins_size = 10**5)
HTML(fig7.to_html(include_plotlyjs='cdn'))
df.Method.value_counts()
The method codes are:
- S: property sold
- SP: property sold prior
- PI: property passed in
- VB: vendor bid
- SA: sold after auction
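For more readable charts, these codes could optionally be mapped to descriptive labels. This is a small sketch introducing a hypothetical Method_label column, not used further below:
# optional: map method codes to readable labels for plotting
method_labels = {'S': 'sold', 'SP': 'sold prior', 'PI': 'passed in',
                 'VB': 'vendor bid', 'SA': 'sold after auction'}
df['Method_label'] = df['Method'].map(method_labels)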
df_methods = df.groupby('Method')['Price'].median()
fig8 = px.bar(df_methods, title="Method vs Price")
fig8.update_layout(yaxis_title='Median price')
HTML(fig8.to_html(include_plotlyjs='cdn'))
Interesting. There is no significant difference between the selling methods, except for vendor bid.
fig9 = px.box(df,x='Method', y = 'Price' ,title="Method vs Price")
fig9.update_layout(yaxis_title='Price ($)')
HTML(fig9.to_html(include_plotlyjs='cdn'))
df_councils = df.groupby('CouncilArea')[['Price','Distance']].median()
df_councils = df_councils.sort_values('Distance')
fig10 = px.bar(df_councils, y='Price', title="Council area vs Price (sorted by distance to CBD)")
fig10.update_layout(yaxis_title='Median price')
HTML(fig10.to_html(include_plotlyjs='cdn'))
Roughly, the further a council area is from the CBD, the cheaper its houses.
df_regions = df.groupby('Regionname')[['Price','Distance']].median()
fig11 = px.bar(df_regions, y='Price', title="Region vs Price")
fig11.update_layout(yaxis_title='Median price')
fig11.update_layout( xaxis={'categoryorder':'category ascending'})
HTML(fig11.to_html(include_plotlyjs='cdn'))
Again, the metropolitan regions seem to be more expensive.
There seems to be little need for data cleaning, since the data was cleaned by the OP (Tony Pino) before being uploaded to Kaggle.
Earlier, the rows with no prices were already removed.
Plan to prepare data for ML algo:
- Train/test split using stratified sampling; the number of rooms seems to be the relevant variable to stratify on.
- No imputing needed, since the only missing data was prices and rows with no prices have already been deleted.
- OneHotEncoding for categorical variables (type, selling method and council). Note that suburb has too many categories while region has too few, so council is chosen instead.
- Feature Scaling via StandardScaler
- Create Pipeline transformer
corr.Price.sort_values(ascending=False)
Looking at the correlations, Rooms is an important variable. However, Rooms has too many distinct values; we need to simplify it into 5 to 6 categories (based on the number of rooms).
Rooms_cat categories:
- 1: 1 room
- 2: 2 rooms
- 3: 3 rooms
- 4: 4 rooms
- 5: 5 rooms
- 6: 6 or more rooms
df['Rooms_cat'] = pd.cut(df["Rooms"],
bins=[0, 1, 2, 3, 4, 5, np.inf],
labels=[1, 2, 3, 4, 5, 6])
df.Rooms_cat.value_counts()
df.Rooms_cat.hist()
Create test set with 20% of the original data
# apply Stratified Shuffle Split
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(df, df['Rooms_cat']):
strat_train_set = df.iloc[train_index]
strat_test_set = df.iloc[test_index]
strat_test_set["Rooms_cat"].value_counts() / len(strat_test_set)
strat_train_set["Rooms_cat"].value_counts() / len(strat_train_set)
strat_train_set.shape, strat_test_set.shape
from plotly.subplots import make_subplots
fig12 = px.pie(strat_train_set,names='Rooms_cat',
color_discrete_sequence=px.colors.sequential.Plasma)
fig12.update_traces(hovertemplate = "Rooms_cat:%{label} <br>Amount: %{value} ")
fig13 = px.pie(strat_test_set,names='Rooms_cat',
color_discrete_sequence=px.colors.sequential.Teal)
fig13.update_traces(hovertemplate = "Rooms_cat:%{label} <br>Amount: %{value} ")
trace1 = fig12['data'][0]
trace2 = fig13['data'][0]
fig14 = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig14.add_trace(trace1, row=1, col=1)
fig14.add_trace(trace2, row=1, col=2)
fig14.update_layout(showlegend=False)
fig14.update_layout(title_text='Training vs test dataset breakdown'
,annotations =[dict(text='Training set', x=0.18, y=1.1, font_size=20, showarrow=False),
dict(text='Test set', x=0.82, y=1.1, font_size=20, showarrow=False)])
HTML(fig14.to_html(include_plotlyjs='cdn'))
The training and test sets now have the same proportions of each room category.
strat_train_set.columns
strat_test_set.columns
X_train = strat_train_set[['CouncilArea','Distance','Rooms_cat','Type','Method']]
y_train = strat_train_set.Price.copy()
X_test = strat_test_set[['CouncilArea','Distance','Rooms_cat','Type','Method']]
y_test = strat_test_set.Price.copy()
X_train.head()
Split to numerical and categorical to prepare for pipeline
# split to numerical and categorical to prepare for pipeline
X_train_num = X_train[['Rooms_cat','Distance']]
X_train_cat = X_train[['CouncilArea','Type','Method']]
Apply StandardScaler, since the Distance values are on a much larger scale than Rooms_cat.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scaler',StandardScaler())])
# apply the numerical pipeline as a trial
X_train_num_tr = num_pipeline.fit_transform(X_train_num)
X_train_num_tr
Regression requires numerical attributes, so convert all categorical variables (CouncilArea, Type, Method) via one-hot encoding.
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
# apply the categorical encoder as a trial
X_train_cat_tr = cat_encoder.fit_transform(X_train_cat)
X_train_cat_tr.toarray()
list(X_train_cat), list(X_train_num)
from sklearn.compose import ColumnTransformer
num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)
full_pipeline = ColumnTransformer([
("num", num_pipeline,num_attribs),
("cat",cat_encoder,cat_attribs)
])
X_train_prepared = full_pipeline.fit_transform(X_train)
X_train_prepared
X_train_prepared.shape
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, y_train)
Let's try the model on the first 5 training instances
# try the model on the first 5 training instances
some_data = X_train.iloc[:5]
some_labels = y_train.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions :", lin_reg.predict(some_data_prepared))
Compare against the actual values:
print("Labels:", list(some_labels))
Not too bad
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
X_train_predictions = lin_reg.predict(X_train_prepared)
lin_mse = mean_squared_error(X_train_predictions, y_train)
lin_rmse = np.sqrt(lin_mse)
print("RMSE =",lin_rmse)
lin_mae = mean_absolute_error(X_train_predictions, y_train)
print("MAE =",lin_mae)
An RMSE of ~380k is not too bad, considering most house prices are in the 1-2M range.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train_prepared, y_train)
X_train_pred_tree = tree_reg.predict(X_train_prepared)
tree_mse = mean_squared_error(X_train_pred_tree, y_train)
tree_rmse = np.sqrt(tree_mse)
print("RMSE =",tree_rmse)
The decision tree seems to have a lower training RMSE than linear regression.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, X_train_prepared, y_train, scoring = "neg_mean_squared_error", cv=5)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)
lin_scores = cross_val_score(lin_reg, X_train_prepared, y_train,scoring="neg_mean_squared_error", cv=5)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
Without cross-validation, training RMSE:
- Linear = 383k
- DT = 274k
With cross validation, RMSE:
- Linear = 384k
- DT = 331k
It seems the decision tree overfits the training set, whereas the linear model performs very similarly with and without cross-validation.
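To keep track of the results, the training and cross-validated RMSEs computed above can be collected into a small comparison table (a convenience summary, not in the original notebook):
# collect training vs cross-validated RMSE for the models tried so far
rmse_summary = pd.DataFrame({
    'train_rmse': [lin_rmse, tree_rmse],
    'cv_rmse': [lin_rmse_scores.mean(), tree_rmse_scores.mean()]},
    index=['LinearRegression', 'DecisionTreeRegressor'])
rmse_summary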
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=100, random_state = 42)
forest_reg.fit(X_train_prepared, y_train)
X_train_pred_rf = forest_reg.predict(X_train_prepared)
rf_scores = cross_val_score(forest_reg, X_train_prepared, y_train,scoring="neg_mean_squared_error", cv=5)
rf_rmse_scores = np.sqrt(-rf_scores)
display_scores(rf_rmse_scores)
The random forest has a cross-validated RMSE of about 321k.
from sklearn.svm import SVR
svm_reg = SVR(kernel='linear')
svm_reg.fit(X_train_prepared, y_train)
X_train_pred_svm = svm_reg.predict(X_train_prepared)
svm_scores = cross_val_score(svm_reg, X_train_prepared, y_train,scoring="neg_mean_squared_error", cv=5)
svm_rmse_scores = np.sqrt(-svm_scores)
display_scores(svm_rmse_scores)
An RMSE of 608k. Terrible.
Since the random forest performs best, let's fine-tune just the RF model.
from sklearn.model_selection import GridSearchCV
param_grid = [{'n_estimators':[3,10,30,50,100], 'max_features':[2,4,6,8]},
{'bootstrap':[False],'n_estimators':[3,30,100], 'max_features':[2,8]}]
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(X_train_prepared, y_train)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
The lowest RMSE found by the grid search is:
# lowest RMSE: mean_test_score is negative MSE, so its maximum gives the smallest error
np.sqrt(-max(cvres["mean_test_score"]))
Grid search helped a little
grid_search.best_params_
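Out of curiosity (an optional check, not part of the original notebook), the feature importances of the best estimator can be inspected against the pipeline's feature names; get_feature_names_out assumes a recent scikit-learn version (older versions use get_feature_names):
# inspect feature importances of the tuned random forest
feature_importances = grid_search.best_estimator_.feature_importances_
ohe = full_pipeline.named_transformers_['cat']
attributes = num_attribs + list(ohe.get_feature_names_out(cat_attribs))
sorted(zip(feature_importances, attributes), reverse=True)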
final_model = grid_search.best_estimator_
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
The final test error is similar to the cross-validation error; both are around 320k. This is not bad considering house prices are mostly in the 1-2M range.
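As an extra check (not in the original notebook, but a common final step in Geron's book), a 95% confidence interval for the test RMSE can be estimated from the squared errors:
# approximate 95% confidence interval for the generalisation RMSE
from scipy import stats
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(0.95, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))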