This project is based on a YouTube video by Ken Jee.
The author also provided a link to Kaggle for his notebook.
It is built on the popular Titanic dataset on Kaggle, which is commonly used as an introduction to classification problems.

I followed the notebook, recreating it and adding annotations for my own understanding. My follow-along notebook is here.

Ken provided a good overview in each notebook section, as well as an introductory comment for each cell. I often read the comment, attempted the code myself first, then checked against his code afterwards. This was good practice for my learning.

Introduction

The Titanic dataset is commonly used as an introduction to Kaggle competitions. With this dataset, the goal is to build a model that predicts which passengers survived the Titanic shipwreck.

Coding process:

Result:

From the base models, the cross-validation scores were around 75 to 80%. The best single model appeared to be the Support Vector Classifier, and the ensemble methods performed better overall.
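The baseline cross-validation step can be sketched like this (a minimal sketch using scikit-learn; the synthetic features stand in for the preprocessed Titanic columns, which are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed Titanic features
# (891 rows, matching the size of Kaggle's training set)
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

# 5-fold cross-validation for a base Support Vector Classifier
svc = SVC(kernel="rbf")
scores = cross_val_score(svc, X, y, cv=5)
print(f"SVC mean CV accuracy: {scores.mean():.3f}")
```

The same `cross_val_score` call can be repeated for each base model to compare them on equal footing.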

The models were then tuned using GridSearch. Hyperparameter searches for models such as RandomForest can take ages to run. Overall, model performance improved slightly (1 to 3%) after tuning, with the Extreme Gradient Boosting model showing the largest improvement. Some voting classifiers were also built from combinations of the tuned models.
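A GridSearch tuning pass looks roughly like the following (a sketch with a deliberately tiny, hypothetical parameter grid; the notebook's actual grids were larger, which is why runs can take so long):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed Titanic features
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

# Illustrative grid only; each extra value multiplies the number of fits
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Grid search fits every parameter combination with cross-validation (here 2 × 2 × 5 = 20 fits), which is exactly why wide grids on RandomForest take a long time.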

In the end, the best-performing model on Kaggle's test dataset was the hard-voting ensemble of the tuned models, achieving a 79% score on the test set.
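Hard voting can be sketched as below (the particular estimator combination here is illustrative, not the exact set of tuned models from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed Titanic features
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

# Hard voting: each model casts one class vote; the majority wins
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC()),
    ],
    voting="hard",
)
scores = cross_val_score(voting, X, y, cv=5)
print(f"Hard-voting mean CV accuracy: {scores.mean():.3f}")
```

Because hard voting takes a majority of class labels rather than averaging probabilities, it can outperform any single member when the models make different kinds of mistakes.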

What I learnt

Insights:

Next step?

I am going to continue with Chapter 4 of Géron's book.