This project is based on Chapter 2 of a book by Aurélien Géron.
The author also provides a GitHub link for the notebook.

I followed the notebook, recreated it, and added annotations for my own understanding, but about 95% of the work follows Géron’s. My follow-along notebook is here.

Introduction

The dataset is the California Housing Prices dataset, based on the 1990 California census (see Figure below). The goal is to predict the median house price in each district using regression.

In summary, it is an end-to-end ML project:

Importing data from GitHub
Exploratory data analysis
Preparing data (imputing, dealing with categorical data, a custom transformer, and applying pipelines)
Selecting and training a model (the examples used: linear regression, decision tree, random forest and support vector regression)
Fine-tuning the model / hyperparameter adjustment (via GridSearchCV and RandomizedSearchCV)
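
The data preparation step above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the tiny DataFrame and its values are made up for illustration, and the column names (median_income, total_bedrooms, ocean_proximity) are assumed to match the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for the housing data (illustrative values only).
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 3.1],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0, 213.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND",
                        "INLAND", "NEAR OCEAN", "INLAND"],
})
labels = np.array([452600.0, 358500.0, 152100.0, 141300.0, 242200.0, 129900.0])

num_attribs = ["median_income", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numeric columns: fill missing values with the median, then standardise.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route numeric and categorical columns to their own transformers.
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

# 2 numeric columns + 3 one-hot columns (one per ocean_proximity category).
prepared = preprocess.fit_transform(housing)

# Chain preprocessing and the regressor so both are fitted together.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
model.fit(housing, labels)
predictions = model.predict(housing)
```

Wrapping the preprocessing and the model in one Pipeline means the same transformations are applied at training and prediction time, which is the main reason to use pipelines here.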

Results:

As seen above, the lowest RMSE comes from the random forest tuned with GridSearchCV. With this model, the RMSE on the test set is $47.7k.
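
For reference, a comparison of models by cross-validated RMSE can be sketched like this. It is only a sketch under assumptions: the data is synthetic (from make_regression), the model settings are arbitrary, and the ranking it prints says nothing about the housing results reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared housing features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn scorers are "higher is better", so the MSE comes back negated.
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    cv_rmse[name] = np.sqrt(-scores)

for name, rmse in cv_rmse.items():
    print(f"{name}: mean RMSE = {rmse.mean():.1f}, std = {rmse.std():.1f}")
```

Each model is trained 5 times here (once per fold), which is why training time grows quickly with more folds or heavier models.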

What I learnt

What can I say? I learnt a lot, as this was my first time going through the whole process.

Insights:

A purely random train-test split may not be good if the data is skewed. The test set should have a similar “sub-group” distribution to the full dataset, so use StratifiedShuffleSplit.
Experiment with attributes. Feature engineering may produce better features.
E.g., in this example, number_of_bedrooms and number_of_household didn’t correlate well with median_house_value. Once we created number_of_rooms/household, the new attribute was much more correlated with median_house_value.
Using SimpleImputer
Using OrdinalEncoder vs OneHotEncoder
Feature engineering via custom transformer
Using Feature Scaling such as StandardScaler
Sparse vs dense matrix. OneHotEncoder returns a sparse matrix by default, because a one-hot encoding consists mostly of 0’s.
Applying the above via pipelines
Using K-fold cross validation (cross_val_score). Beware of training time.
Hyperparameter tuning can be automated via GridSearchCV and RandomizedSearchCV
Sometimes, knowing the final RMSE isn’t enough. How do we know whether the model performs better than an already deployed one? It might be good to find the 95% confidence interval (CI) for the RMSE.
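
One way to get a 95% confidence interval for the RMSE is to build a t-interval on the mean of the squared errors and take the square root of its bounds. The sketch below assumes SciPy is available and uses made-up predictions; y_true and y_pred are hypothetical stand-ins for the test labels and the model's test predictions.

```python
import numpy as np
from scipy import stats

# Made-up labels and predictions (illustrative only, not the real results).
rng = np.random.default_rng(42)
y_true = rng.uniform(100_000, 500_000, size=1000)
y_pred = y_true + rng.normal(0, 50_000, size=1000)

squared_errors = (y_pred - y_true) ** 2
rmse = np.sqrt(squared_errors.mean())

# t-interval around the mean squared error, then back to RMSE units.
confidence = 0.95
low, high = np.sqrt(stats.t.interval(
    confidence,
    len(squared_errors) - 1,            # degrees of freedom
    loc=squared_errors.mean(),
    scale=stats.sem(squared_errors),    # standard error of the mean
))

print(f"RMSE = {rmse:,.0f}, 95% CI = [{low:,.0f}, {high:,.0f}]")
```

If the RMSE of the already deployed model falls well inside this interval, the new model's improvement may not be meaningful.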

What I found confusing in this tutorial:

I found it confusing for the author to rename housing to strat_train_set. I would prefer to keep the “train set” variable label throughout the exercise.
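
For context, the strat_train_set name comes from the stratified split. A minimal sketch of that split, keeping the strat_train_set name afterwards, could look like this. The income data is randomly generated, and the income_cat bins are an assumption chosen for illustration rather than taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Randomly generated, skewed incomes standing in for the real column.
rng = np.random.default_rng(42)
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)}
)

# Bucket the skewed attribute into categories to stratify on (assumed bins).
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# The category proportions in the test set should track the full dataset.
full_props = housing["income_cat"].value_counts(normalize=True).sort_index()
test_props = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_props, "test": test_props}))
```

From here, later steps can keep working on strat_train_set directly instead of switching to a different variable name.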

Next step?

Probably continue with the classification tutorial.