This project is based on Chapter 2 of a book by Aurelien Geron.
The author also provides a GitHub link for the notebook.
I followed the notebook, recreated it, and added annotations for my own understanding, but about 95% of the work follows Geron’s. My follow-along notebook is here.
Introduction
The dataset is the California Housing Prices dataset, based on the 1990 California census (see Figure below). The goal is to predict the median house price in each district using regression.

In summary, it is an end-to-end ML project:
- Importing data from GitHub
- Exploratory data analysis
- Preparing data
Includes: imputing, handling categorical data, a custom transformer, and applying pipelines
- Selecting and training a model
Models used: linear regression, decision tree, random forest, and support vector regression
- Fine-tuning the model (hyperparameter adjustment)
Via: grid search and randomised search
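To make the preparation and tuning steps concrete, here is a minimal sketch of how they can fit together in scikit-learn. It is not the notebook's code: the data is a small synthetic stand-in with made-up values, the column names are only illustrative, and the parameter grid is deliberately tiny.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (made-up values, not the real dataset).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "total_rooms": rng.integers(100, 5000, n).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
# Inject a few missing values so the imputer has something to fill.
X.loc[X.index[::25], "total_rooms"] = np.nan
y = 40_000 * X["median_income"] + rng.normal(0, 10_000, n)

num_cols = ["median_income", "total_rooms"]
cat_cols = ["ocean_proximity"]

# Numeric columns: fill gaps with the median, then standardise.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Route numeric and categorical columns through different transformers.
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(random_state=42)),
])

# Step names prefix the hyperparameters: "<step>__<param>".
param_grid = {"forest__n_estimators": [10, 30], "forest__max_features": [1, 2]}
search = GridSearchCV(model, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_rmse = -search.best_score_  # the scorer is negated, so flip the sign
print(search.best_params_, round(best_rmse))
```

Putting the preprocessing inside the pipeline means each cross-validation fold fits the imputer and scaler on its own training portion only, which avoids leaking information from the validation fold.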
Results:

As seen above, the lowest RMSE comes from the random forest tuned with grid search. With this model, the RMSE on the test set is $47.7k.
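A single test-set RMSE is a point estimate, so it helps to see how it is computed and how precise it is. The sketch below computes the RMSE and a rough 95% confidence interval using only the standard library. It uses a normal approximation on the mean squared error (an assumption on my part; it should be close to a t-based interval when the test set has thousands of rows). The function name `rmse_with_ci` and the data are made up for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def rmse_with_ci(y_true, y_pred, confidence=0.95):
    """Return (rmse, (low, high)) via a normal approximation on the MSE."""
    sq_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    mse = mean(sq_errors)
    # Standard error of the mean squared error.
    sem = stdev(sq_errors) / math.sqrt(len(sq_errors))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low = max(mse - z * sem, 0.0)  # an MSE cannot be negative
    high = mse + z * sem
    # Take square roots to get back to the units of the target (dollars).
    return math.sqrt(mse), (math.sqrt(low), math.sqrt(high))


# Made-up prices and predictions, not the notebook's results.
random.seed(0)
y_true = [random.uniform(100_000, 500_000) for _ in range(1000)]
y_pred = [t + random.gauss(0, 48_000) for t in y_true]

rmse, (low, high) = rmse_with_ci(y_true, y_pred)
print(f"RMSE = {rmse:,.0f}, 95% CI = ({low:,.0f}, {high:,.0f})")
```

If the interval of a new model overlaps heavily with the RMSE of a model already in production, the apparent improvement may just be noise.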
What I learnt
I learnt a lot, as this was my first time going through the whole process.
Insights:
- A purely random train-test split may not be good if the data is skewed. The test set should have a similar “sub-group” distribution to the full data set, so use StratifiedShuffleSplit
- Experiment with attributes. Feature engineering may produce more informative attributes
E.g. in this example, number_of_bedrooms and number_of_household didn’t correlate well with median_house_value. Once we created number_of_rooms/household, the new attribute was much more correlated with median_house_value
- Using SimpleImputer
- Using OrdinalEncoder vs OneHotEncoder
- Feature engineering via custom transformer
- Using Feature Scaling such as StandardScaler
- Sparse vs dense matrix. OneHotEncoder returns a sparse matrix because its output is mostly 0’s.
- Applying the above via a pipeline
- Using K-fold cross validation (cross_val_score). Beware of training time.
- Hyperparameter search can be automated via GridSearchCV and RandomizedSearchCV
- Sometimes, knowing the final RMSE isn’t enough. How do we know whether the model performs better than one already deployed? It might be good to find the 95% CI of the error
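The stratified split from the first insight above can be sketched as follows. This is an illustration on synthetic data, not the notebook's exact code: the income values are randomly generated, and the bin edges and the `income_cat` column name are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
# Skewed synthetic income column (made-up data).
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)}
)

# Bucket the continuous income into categories to stratify on.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(housing, housing["income_cat"]))
strat_train_set = housing.iloc[train_idx]
strat_test_set = housing.iloc[test_idx]

# Compare category proportions in the full data and in the test set.
full_props = housing["income_cat"].value_counts(normalize=True).sort_index()
test_props = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_props, "test": test_props}).round(3))
```

The two columns of proportions should be nearly identical, which is the point of stratifying: the test set keeps the same mix of income groups as the full data set.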
What I found confusing in this tutorial:
- I found it confusing that the author switches between the names housing and strat_train_set for the training data. I would prefer to keep a single “train set” variable name throughout the exercise.
Next step?
- Probably continue with the classification tutorial