This project is based on a YouTube video by Ken Jee.
The author also provided a link to Kaggle for his notebook.
It is built on the popular Titanic dataset on Kaggle, which is commonly used as an introduction to classification problems.

I followed the notebook, recreating it and adding annotations for my own understanding. My follow-along notebook is here.

Ken provided a good overview in each notebook section, as well as an introductory comment for each cell. I often read the comment, attempted the code myself first, then checked against his code afterwards. This was good practice for my learning.

Introduction

The Titanic dataset is commonly used as an introduction to Kaggle competitions. With this dataset, the goal is to build a model that predicts which passengers survived the Titanic shipwreck.

Coding process:

Result:

From the base models, the cross-validation scores were around 75 to 80%. The best single model appeared to be the Support Vector Classifier, and the ensemble methods performed better overall.
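The baseline cross-validation step can be sketched like this (a minimal sketch using scikit-learn; the synthetic features stand in for the preprocessed Titanic columns, which are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed Titanic features
# (891 rows, matching the size of Kaggle's training set)
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

# 5-fold cross-validation for a base Support Vector Classifier
svc = SVC(kernel="rbf")
scores = cross_val_score(svc, X, y, cv=5)
print(f"SVC mean CV accuracy: {scores.mean():.3f}")
```

The same `cross_val_score` call can be repeated for each base model to compare them on equal footing.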

The models were then tuned using GridSearch. Hyperparameter searches for models such as RandomForest can take ages to run. Overall, model performance improved slightly (1 to 3%) after tuning, with the Extreme Gradient Boosting model showing the largest improvement. Some voting classifiers were also built from combinations of the tuned models.
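A GridSearch tuning pass looks roughly like the following (a sketch with a deliberately tiny, hypothetical parameter grid; the notebook's actual grids were larger, which is why runs can take so long):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed Titanic features
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

# Illustrative grid only; each extra value multiplies the number of fits
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Grid search fits every parameter combination with cross-validation (here 2 × 2 × 5 = 20 fits), which is exactly why wide grids on RandomForest take a long time.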

In the end, the best-performing model on Kaggle's test dataset was the hard-voting ensemble of the tuned models, achieving a 79% score on the test set.
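Hard voting can be sketched as below (the particular estimator combination here is illustrative, not the exact set of tuned models from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed Titanic features
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

# Hard voting: each model casts one class vote; the majority wins
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC()),
    ],
    voting="hard",
)
scores = cross_val_score(voting, X, y, cv=5)
print(f"Hard-voting mean CV accuracy: {scores.mean():.3f}")
```

Because hard voting takes a majority of class labels rather than averaging probabilities, it can outperform any single member when the models make different kinds of mistakes.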

What I learnt

Insights:

Next step?

I am going to continue with Chapter 4 of Géron's book.