The data is downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

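The dataset records whether credit card clients in Taiwan defaulted on their payment. The payment history covers April to September 2005, so the "default payment next month" flag refers to around October 2005.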

Interesting insights from the dataset:

  • No obvious correlation between age and probability of default
  • Defaulters tend to have a lower credit limit. This makes sense if they have a lower credit rating
  • No large difference in recent bill amounts between defaulters and non-defaulters
  • Defaulters made noticeably smaller recent payments (roughly half on average), perhaps indicating they struggle to pay
  • Males have a higher default ratio (24%) than females (21%)
  • Graduate school clients have a noticeably lower default ratio (19%), whereas university (24%) and high school (25%) are about the same
  • Married clients have a slightly higher default ratio (23%) than single clients (21%)

Summary of notebook workflow:

  • Download and import data from the UCI ML repository
  • EDA and insights
  • Train/test split via stratified sampling, since non-defaults far outnumber defaults
  • Create numerical and categorical pipelines for preprocessing (standard scaler, category clean-up and one-hot encoding)
  • Train 7 base models
  • Fine-tune 4 models (logistic regression, random forest, support vector classifier and extreme gradient boosting) via grid search
  • Fine-tuning did not yield much better results, so the final model is logistic regression (best cross-validation accuracy ~82%)
  • Evaluate on the test set: prediction accuracy ~82%

Import Python modules

import numpy as np
import pandas as pd
import os

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# for offline plotting
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
import plotly.express as px


from IPython.display import HTML

Download data

os.chdir("C:/Users/Riyan Aditya/Desktop/ML_learning/Project10_clf_cc_default")
df = pd.read_excel('default of credit card clients.xls', skiprows = [0])

df.head()
   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  default payment next month
0   1      20000    2          2         1   24      2      2     -1     -1  ...          0          0          0         0       689         0         0         0         0                           1
1   2     120000    2          2         2   26     -1      2      0      0  ...       3272       3455       3261         0      1000      1000      1000         0      2000                           1
2   3      90000    2          2         2   34      0      0      0      0  ...      14331      14948      15549      1518      1500      1000      1000      1000      5000                           0
3   4      50000    2          2         1   37      0      0      0      0  ...      28314      28959      29547      2000      2019      1200      1100      1069      1000                           0
4   5      50000    1          2         1   57     -1      0     -1      0  ...      20940      19146      19131      2000     36681     10000      9000       689       679                           0

5 rows × 25 columns

We have the following variables:

  • Default (1 = yes, 0 = no)
  • LIMIT_BAL = credit limit (New Taiwan dollars)
  • SEX = gender (1 = M, 2 = F)
  • EDUCATION = (1 = graduate school, 2 = uni, 3 = high school, 4 = others)
  • MARRIAGE = (1 = married, 2 = single, 3 = others)
  • AGE = age in years
  • PAY_0 to PAY_6 = repayment status, running from September (PAY_0) back to April (PAY_6). A positive number is how many months the payment is delayed (e.g. 3 means 3 months late)
  • BILL_AMT1 to BILL_AMT6 = bill amount (from September back to April)
  • PAY_AMT1 to PAY_AMT6 = amount paid (from September back to April)

Let's look at the first customer. It looks like she defaulted (SEX = 2, default = 1)

# let's see the first customer
df.loc[0]
ID                                1
LIMIT_BAL                     20000
SEX                               2
EDUCATION                         2
MARRIAGE                          1
AGE                              24
PAY_0                             2
PAY_2                             2
PAY_3                            -1
PAY_4                            -1
PAY_5                            -2
PAY_6                            -2
BILL_AMT1                      3913
BILL_AMT2                      3102
BILL_AMT3                       689
BILL_AMT4                         0
BILL_AMT5                         0
BILL_AMT6                         0
PAY_AMT1                          0
PAY_AMT2                        689
PAY_AMT3                          0
PAY_AMT4                          0
PAY_AMT5                          0
PAY_AMT6                          0
default payment next month        1
Name: 0, dtype: int64

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
ID                            30000 non-null int64
LIMIT_BAL                     30000 non-null int64
SEX                           30000 non-null int64
EDUCATION                     30000 non-null int64
MARRIAGE                      30000 non-null int64
AGE                           30000 non-null int64
PAY_0                         30000 non-null int64
PAY_2                         30000 non-null int64
PAY_3                         30000 non-null int64
PAY_4                         30000 non-null int64
PAY_5                         30000 non-null int64
PAY_6                         30000 non-null int64
BILL_AMT1                     30000 non-null int64
BILL_AMT2                     30000 non-null int64
BILL_AMT3                     30000 non-null int64
BILL_AMT4                     30000 non-null int64
BILL_AMT5                     30000 non-null int64
BILL_AMT6                     30000 non-null int64
PAY_AMT1                      30000 non-null int64
PAY_AMT2                      30000 non-null int64
PAY_AMT3                      30000 non-null int64
PAY_AMT4                      30000 non-null int64
PAY_AMT5                      30000 non-null int64
PAY_AMT6                      30000 non-null int64
default payment next month    30000 non-null int64
dtypes: int64(25)
memory usage: 5.7 MB

Great. No missing data

EDA

How many defaults?

df['default payment next month'].value_counts()
0    23364
1     6636
Name: default payment next month, dtype: int64

6,636 clients defaulted (22%). The classes are imbalanced, so stratified sampling will be used for the train/test split later.
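
Because the classes are imbalanced, accuracy needs a reference point. As a rough sketch using only the counts printed above, a trivial classifier that always predicts "no default" would already be right about 78% of the time, which is worth keeping in mind when reading the model accuracies later on.

```python
# Counts taken from the value_counts() output above
n_no_default = 23364
n_default = 6636
total = n_no_default + n_default

default_rate = n_default / total

# Accuracy of a trivial model that always predicts the majority class (0)
majority_baseline_accuracy = n_no_default / total

print(f"default rate: {default_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```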

Numerical and categorical data

# rename "default payment next month". Simply too long
df=df.rename(columns = {'default payment next month':'DEFAULT'})

df.columns
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'DEFAULT'],
      dtype='object')

Let's identify which columns are numerical and which are categorical

df_num = df[['LIMIT_BAL','AGE',
               'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
               'PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']]
df_cat = df[['DEFAULT','SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']]

Numerical data EDA

# plot histogram for all numerical data
for i in df_num.columns:
    plt.hist(df_num[i])
    plt.title(i)
    plt.show()
  • The distributions are strongly right-skewed. They might need a transformation (e.g. log) to normalise them.
  • The scales also differ (age in tens while bill amounts reach into the millions). They might need to be standardised
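
One common way to reduce right skew is a log transform. The sketch below is not part of the original workflow (the notebook only applies a standard scaler later); it uses synthetic, made-up amounts and a hand-written `skewness` helper to show the effect of `np.log1p`.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Synthetic right-skewed "payment amounts" (an assumption, not the real data)
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=8.0, sigma=1.0, size=5000)

# log1p handles zeros (log(1 + 0) = 0); it is undefined for values <= -1,
# so any negative amounts would need separate handling
log_amounts = np.log1p(amounts)

skew_before = skewness(amounts)
skew_after = skewness(log_amounts)
print(round(skew_before, 2), round(skew_after, 2))
```

A skewness near 0 indicates a roughly symmetric distribution, so the second number should be much closer to 0 than the first.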

Let's check correlation plot

sns.heatmap(df_num.corr())
<AxesSubplot:>

Bill amounts are highly correlated with each other. Perhaps this suggests they are cumulative (a recent bill includes the outstanding balance from earlier bills). Let's also assume that recent bills and payments matter more than older ones (i.e. BILL_AMT1 and BILL_AMT2 are the most important).

There is some correlation between paid amounts and bill amounts, which makes sense.

Age shows essentially no correlation with the other numerical variables.

How does default relate to some of these variables?

pd.pivot_table(df,index = 'DEFAULT', values = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','PAY_AMT1','PAY_AMT2','PAY_AMT3'])
               AGE     BILL_AMT1     BILL_AMT2      LIMIT_BAL     PAY_AMT1     PAY_AMT2     PAY_AMT3
DEFAULT
0        35.417266  51994.227273  49717.435670  178099.726074  6307.337357  6640.465074  5753.496833
1        35.725738  48509.162297  47283.617842  130109.656420  3397.044153  3388.649638  3367.351567

Some insights:

  • Average age is about the same for default vs non-default
  • Defaulters tend to have a lower credit limit. This makes sense if they have a lower credit rating
  • No large difference in recent bill amounts between defaulters and non-defaulters
  • Defaulters made noticeably smaller recent payments (roughly half on average), perhaps indicating they struggle to pay

Categorical data EDA

for i in df_cat.columns:
    sns.barplot(x=df_cat[i].value_counts().index,y=df_cat[i].value_counts()).set_title(i)
    plt.show()

Let's look at these categories relative to whether clients default or not

print(pd.pivot_table(df,index = 'DEFAULT',columns = 'SEX', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'EDUCATION', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'MARRIAGE', values = 'ID', aggfunc = 'count'))
print()
SEX         1      2
DEFAULT             
0        9015  14349
1        2873   3763

EDUCATION     0       1        2       3      4      5     6
DEFAULT                                                     
0          14.0  8549.0  10700.0  3680.0  116.0  262.0  43.0
1           NaN  2036.0   3330.0  1237.0    7.0   18.0   8.0

MARRIAGE   0      1      2    3
DEFAULT                        
0         49  10453  12623  239
1          5   3206   3341   84

Gender, education, marital status

print('male default ratio', round(2873/(2873+9015),2))
print('female default ratio',round(3763/(3763+14349),2))
male default ratio 0.24
female default ratio 0.21

print('graduate default ratio', round(2036/(2036+8549),2))
print('uni default ratio',round(3330/(3330+10700),2))
print('HS default ratio',round(1237/(1237+3680),2))
graduate default ratio 0.19
uni default ratio 0.24
HS default ratio 0.25

print('married default ratio', round(3206/(3206+10453),2))
print('single default ratio',round(3341/(3341+12623),2))
married default ratio 0.23
single default ratio 0.21
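
The ratios above are typed in by hand from the pivot tables. As a sketch of an alternative, the same kind of ratio can be computed directly with `groupby`, because the mean of a 0/1 column is the share of ones. The small frame below is made up for illustration and only reuses the notebook's column names.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the notebook
demo = pd.DataFrame({
    "SEX":     [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "DEFAULT": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# DEFAULT is 0/1, so its mean within each group is the default ratio
default_ratio_by_sex = demo.groupby("SEX")["DEFAULT"].mean()
print(default_ratio_by_sex.round(2))
```

On the real data, the equivalent call would presumably be `df.groupby('SEX')['DEFAULT'].mean()`, and likewise for `EDUCATION` and `MARRIAGE`.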

Insights:

  • There are more females than males
  • But males have a higher default ratio (24%) than females (21%)
  • Education contains the values 0, 5 and 6, while the attribute information only documents 1-4. Consider grouping 0, 4, 5 and 6 into 4 (others)
  • Graduate school clients have a noticeably lower default ratio (19%), whereas university (24%) and high school (25%) are about the same
  • Married clients have a slightly higher default ratio (23%) than single clients (21%)
  • Marriage also contains an undocumented value (0) besides 1, 2 and 3. Consider grouping 0 and 3 into 3 (others)

Recent late payment status

print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_0', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_2', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_3', values = 'ID', aggfunc = 'count'))
print()
PAY_0      -2    -1      0     1     2    3   4   5   6   7   8
DEFAULT                                                        
0        2394  4732  12849  2436   823   78  24  13   5   2   8
1         365   954   1888  1252  1844  244  52  13   6   7  11

PAY_2        -2      -1        0     1       2      3     4     5    6     7  \
DEFAULT                                                                        
0        3091.0  5084.0  13227.0  23.0  1743.0  125.0  49.0  10.0  3.0   8.0   
1         691.0   966.0   2503.0   5.0  2184.0  201.0  50.0  15.0  9.0  12.0   

PAY_2      8  
DEFAULT       
0        1.0  
1        NaN  

PAY_3      -2    -1      0   1     2    3   4   5   6   7   8
DEFAULT                                                      
0        3328  5012  13013   3  1850  102  32   9   9   5   1
1         757   926   2751   1  1969  138  44  12  14  22   2

There are many possible categories here. I am going to assume:

  • Negative values and 0 mean early or on-time payment
  • Positive values mean late payment

Late payment insights:

  • This makes sense: if the recent payment status shows a delay of several months, the client is more likely to default
  • However, although the proportion is smaller, some clients who paid on time or early (values -1 and -2) are still labelled as default. It is not clear from the data why
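
To make the late-payment pattern concrete, the counts from the PAY_0 pivot table above can be turned into a default rate per status value. This is only arithmetic on the printed numbers; the dictionary names are introduced here for illustration.

```python
# Counts copied from the PAY_0 pivot table above
non_default = {-2: 2394, -1: 4732, 0: 12849, 1: 2436, 2: 823,
               3: 78, 4: 24, 5: 13, 6: 5, 7: 2, 8: 8}
default = {-2: 365, -1: 954, 0: 1888, 1: 1252, 2: 1844,
           3: 244, 4: 52, 5: 13, 6: 6, 7: 7, 8: 11}

# Share of clients that default, for each PAY_0 value
default_rate = {
    status: default[status] / (default[status] + non_default[status])
    for status in sorted(non_default)
}

for status, rate in default_rate.items():
    print(f"PAY_0 = {status:>2}: {rate:.2f}")
```

The rate stays below 20% for values of 0 and under, and rises sharply once the status is 2 months late or more.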

Data prep

Plan:

  • Train/test split, using stratified sampling based on the proportion of default
  • Standardise the numerical variables
  • Lump the undocumented categories in education and marital status into "others"
  • Possibly also reduce the late payment categorical data to fewer categories

Train test split via Stratified Sampling

df.columns
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'DEFAULT'],
      dtype='object')

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)

for train_index, test_index in split.split(df, df['DEFAULT']):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]
strat_train_set['DEFAULT'].value_counts()/len(strat_train_set)
0    0.778792
1    0.221208
Name: DEFAULT, dtype: float64
strat_test_set['DEFAULT'].value_counts()/len(strat_test_set)
0    0.778833
1    0.221167
Name: DEFAULT, dtype: float64
strat_train_set.shape, strat_test_set.shape
((24000, 25), (6000, 25))

Great. Non-default is 78% and default is 22% in both the train and test sets

Discard attributes that I will not use
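
For a single split, `train_test_split` with the `stratify` argument is a shorter way to get the same kind of stratified split. The sketch below uses synthetic labels with a similar imbalance (an assumption, not the real data) and is not the code the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 78/22 imbalance (an assumption)
y = np.array([0] * 780 + [1] * 220)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_tr.mean(), y_te.mean())
```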

strat_train_set.columns
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'DEFAULT'],
      dtype='object')

For simplicity, I am going to use only the latest 3 months of repayment status, bill amount and payment amount (i.e. drop the columns with suffix 4, 5 and 6)

X_train = strat_train_set[['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0','PAY_2', 'PAY_3', 
                           'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
y_train = strat_train_set.DEFAULT.copy()

X_test = strat_test_set[['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0','PAY_2', 'PAY_3', 
                           'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
y_test = strat_test_set.DEFAULT.copy()

Apply standard scaler to standardise

The data needs to be split into numerical and categorical attributes to prepare for the pipeline.

# split to numerical and categorical to prepare for pipeline
X_train_num = X_train[['LIMIT_BAL','AGE','BILL_AMT1', 'BILL_AMT2','BILL_AMT3',
               'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
X_train_cat = X_train[['SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3']]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer

num_pipeline = Pipeline([('std_scaler',StandardScaler())])

X_train_num_tr = num_pipeline.fit_transform(X_train_num)

X_train_num_tr
array([[-0.05686623, -0.26455769,  1.50554693, ...,  0.58065737,
        -0.29033241, -0.29781997],
       [-0.13408117, -0.15580369, -0.69516453, ..., -0.34496923,
        -0.29033241, -0.29781997],
       [-1.21509034,  1.58426024, -0.55679958, ..., -0.34812752,
        -0.22708115, -0.23306877],
       ...,
       [-0.36572599, -1.24334365,  0.33595431, ..., -0.11912055,
        -0.19044381, -0.18270673],
       [ 1.48743258,  2.3455382 , -0.69516453, ..., -0.34825137,
        -0.24786865, -0.28377341],
       [ 1.02414294, -0.0470497 , -0.67821411, ..., -0.27288591,
        -0.27364316, -0.29781997]])
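
As a small illustration of what the scaler does, the sketch below checks `StandardScaler` against the formula z = (x - mean) / std on made-up credit limits. It also shows the pattern used later in the notebook: the statistics are learned from the training data and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up credit limits, not taken from the dataset
train = np.array([[20000.0], [50000.0], [90000.0], [120000.0]])
test = np.array([[70000.0]])

scaler = StandardScaler().fit(train)   # learn mean and std from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuse the training statistics

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
```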

Pipeline for categorical variable

We need a custom transformer to simplify the categorical data

Clean the undocumented categorical values

from sklearn.base import BaseEstimator, TransformerMixin

class CategoricalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, EDUCATION = True, MARRIAGE = True, PAY_0 = True,PAY_2 = True,PAY_3 = True):
        self.EDUCATION = EDUCATION
        self.MARRIAGE = MARRIAGE
        self.PAY_0 = PAY_0
        self.PAY_2 = PAY_2
        self.PAY_3 = PAY_3
    
    # return self, nothing else to do here
    def fit(self, X, y=None):
        return self
    
    # lump undocumented or rare codes into broader categories
    def transform(self, X, y=None):
        # work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()

        # clean education
        # education has values 0-6, but the attribute documentation only mentions 1-4
        # lump 0, 4, 5, 6 together into 4 (others)
        X.loc[X.EDUCATION >4 ,'EDUCATION'] = 4
        X.loc[X.EDUCATION <1 ,'EDUCATION'] = 4
        
        # clean marriage
        # marriage status has values 0-3, but the attribute documentation only mentions 1-3
        # lump 0 and 3 together into 3 (others)
        X.loc[X.MARRIAGE <1 ,'MARRIAGE'] = 3
        
        # clean PAY_0, PAY_2 and PAY_3
        # create 5 categories: 0, 1, 2, 3, 4
        # 0 means paid on time (the original 0 and negative values)
        # 1, 2, 3 mean 1, 2, 3 months late
        # 4 means 4 or more months late (the original 4 and above)
        X.loc[X.PAY_0 <1 ,'PAY_0'] = 0
        X.loc[X.PAY_0 >4 ,'PAY_0'] = 4
        X.loc[X.PAY_2 <1 ,'PAY_2'] = 0
        X.loc[X.PAY_2 >4 ,'PAY_2'] = 4
        X.loc[X.PAY_3 <1 ,'PAY_3'] = 0
        X.loc[X.PAY_3 >4 ,'PAY_3'] = 4

        return X
        

pd.options.mode.chained_assignment = None  # default='warn'

cat_custom_prep = CategoricalTransformer()
X_train_cat_prep = cat_custom_prep.transform(X_train_cat)

pd.options.mode.chained_assignment = 'warn'  # default='warn'

Ok, the data cleaning of those attributes was successful
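
The same clean-up rules can also be written without in-place assignment. The function below is a hypothetical alternative (the name `clean_categories` is introduced here, it is not in the notebook) that applies the same thresholds with `where` and `clip` and returns a cleaned copy, which avoids the chained-assignment warning altogether.

```python
import pandas as pd

def clean_categories(X):
    """Return a cleaned copy; the input frame is left untouched."""
    X = X.copy()
    # EDUCATION: anything outside 1-3 becomes 4 (others)
    X["EDUCATION"] = X["EDUCATION"].where(X["EDUCATION"].between(1, 3), 4)
    # MARRIAGE: values below 1 (the undocumented 0) become 3 (others)
    X["MARRIAGE"] = X["MARRIAGE"].where(X["MARRIAGE"] >= 1, 3)
    # PAY_*: zero and negative -> 0, more than 4 months late -> 4
    for col in ["PAY_0", "PAY_2", "PAY_3"]:
        X[col] = X[col].clip(lower=0, upper=4)
    return X

# Made-up rows covering the edge cases
demo = pd.DataFrame({
    "EDUCATION": [0, 1, 5, 6, 3],
    "MARRIAGE":  [0, 1, 2, 3, 1],
    "PAY_0":     [-2, -1, 0, 2, 8],
    "PAY_2":     [0, 1, 5, 3, -1],
    "PAY_3":     [4, 7, -2, 1, 0],
})
cleaned = clean_categories(demo)
print(cleaned)
```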

Apply OHE for categorical variables

X_train_cat_prep.head()
       SEX  EDUCATION  MARRIAGE  PAY_0  PAY_2  PAY_3
22788    2          2         2      2      2      3
29006    2          1         2      1      0      0
16950    1          2         1      1      2      0
22280    2          1         2      0      0      0
11346    2          1         2      1      0      0

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
X_train_cat_prep_1hot = cat_encoder.fit_transform(X_train_cat_prep)
X_train_cat_prep_1hot
<24000x24 sparse matrix of type '<class 'numpy.float64'>'
	with 144000 stored elements in Compressed Sparse Row format>
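
The 24 columns in the output above follow from the number of distinct values per cleaned column. The sketch below shows how `OneHotEncoder` assigns one output column per category, using two made-up columns. The `handle_unknown="ignore"` option is an addition that the notebook does not use; it makes the encoder output all zeros for a category it did not see during fitting instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two made-up categorical columns: SEX in {1, 2}, PAY_0 in {0, 1, 2, 3, 4}
X_cat = np.array([[1, 0], [2, 1], [1, 2], [2, 3], [2, 4], [1, 0]])

encoder = OneHotEncoder(handle_unknown="ignore")
X_1hot = encoder.fit_transform(X_cat)

# One output column per distinct value seen in each input column
n_per_column = [len(c) for c in encoder.categories_]
print(n_per_column, X_1hot.shape)

# Column count for the cleaned notebook data:
# SEX (2) + EDUCATION (4) + MARRIAGE (3) + PAY_0 (5) + PAY_2 (5) + PAY_3 (5)
expected_columns = 2 + 4 + 3 + 5 + 5 + 5
print(expected_columns)
```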

Create full pipeline for fit transform

from sklearn.compose import ColumnTransformer

num_attribs = ['LIMIT_BAL','AGE','BILL_AMT1', 'BILL_AMT2','BILL_AMT3',
               'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']
cat_attribs = ['SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3']

num_pipeline = Pipeline([('std_scaler',StandardScaler())])
cat_pipeline = Pipeline([ ('cat_custom',CategoricalTransformer()),
                        ('OHE', OneHotEncoder())])

full_pipeline = ColumnTransformer([
    ("num",num_pipeline, num_attribs),
    ("cat",cat_pipeline, cat_attribs)])

X_train_prepared = full_pipeline.fit_transform(X_train)
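
One possible refinement, not used in this notebook, is to put the preprocessing and the classifier into a single `Pipeline`. Cross-validation then refits the scaler and encoder on each training fold rather than on the whole training set. The sketch below uses a small synthetic frame with made-up values and a made-up target, reusing only a few of the notebook's column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=n),
    "AGE": rng.integers(21, 70, size=n),
    "SEX": rng.integers(1, 3, size=n),
    "PAY_0": rng.integers(0, 5, size=n),
})
# Made-up target: later payments make default more likely
y = (demo["PAY_0"] + rng.normal(0, 1, size=n) > 2.5).astype(int)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["LIMIT_BAL", "AGE"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["SEX", "PAY_0"]),
])
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=2000))])

# The whole pipeline is refit inside each of the 5 folds
scores = cross_val_score(model, demo, y, cv=5)
print(scores.mean().round(3))
```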

Model selection

Since we are predicting default or not (i.e. 0 or 1) and have labelled data, this is a binary classification problem.

The plan is to run 5-fold cross-validation with the following models as baselines, then tune the best ones:

  • Naive bayes
  • Logistic regression
  • Decision tree
  • K nearest neighbour
  • Random Forest
  • SVC
  • XGB

# import various sklearn ML library
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

Naive Bayes:

gnb = GaussianNB()
cv_gnb = cross_val_score(gnb,X_train_prepared,y_train,cv=5)
print(cv_gnb )
print(cv_gnb.mean().round(3))
[0.80166667 0.763125   0.80541667 0.80583333 0.80666667]
0.797

Logistic

lr = LogisticRegression(max_iter = 2000)
cv_lr = cross_val_score(lr,X_train_prepared,y_train,cv=5)
print(cv_lr )
print(cv_lr.mean().round(3))
[0.820625   0.826875   0.82041667 0.818125   0.81854167]
0.821

Decision tree

dt = tree.DecisionTreeClassifier(random_state = 42)
cv_dt = cross_val_score(dt,X_train_prepared,y_train,cv=5)
print(cv_dt )
print(cv_dt.mean().round(3))
[0.71229167 0.72041667 0.724375   0.71958333 0.72333333]
0.72

KNN

knn = KNeighborsClassifier()
cv_knn = cross_val_score(knn,X_train_prepared,y_train,cv=5)
print(cv_knn )
print(cv_knn.mean().round(3))
[0.7925     0.789375   0.79458333 0.79395833 0.78333333]
0.791

Random forest

rf = RandomForestClassifier(random_state = 42)
cv_rf = cross_val_score(rf,X_train_prepared,y_train,cv=5)
print(cv_rf )
print(cv_rf.mean().round(3))
[0.8125     0.814375   0.81229167 0.81333333 0.81333333]
0.813

Support vector classifier

svc = SVC()
cv_svc = cross_val_score(svc,X_train_prepared,y_train,cv=5)
print(cv_svc )
print(cv_svc.mean().round(3))
[0.81916667 0.8225     0.81916667 0.81958333 0.81458333]
0.819

XGB

xgb = XGBClassifier(random_state =1)
cv_xgb = cross_val_score(xgb,X_train_prepared,y_train,cv=5)
print(cv_xgb )
print(cv_xgb.mean().round(3))
[0.81125    0.81833333 0.81145833 0.816875   0.81375   ]
0.814

Summary

data_matrix = [["Model","cv score"],
         ['Naive Bayes',cv_gnb.mean().round(3)],
         ['Logistic regression',cv_lr.mean().round(3)],
         ['Decision Tree',cv_dt.mean().round(3)],
         ['K Nearest neighbour',cv_knn.mean().round(3)],
         ['Random Forest',cv_rf.mean().round(3)],
         ['Support vector classifier',cv_svc.mean().round(3)],
         ['Extreme Gradient boosting',cv_xgb.mean().round(3)]]
data_matrix
[['Model', 'cv score'],
 ['Naive Bayes', 0.797],
 ['Logistic regression', 0.821],
 ['Decision Tree', 0.72],
 ['K Nearest neighbour', 0.791],
 ['Random Forest', 0.813],
 ['Support vector classifier', 0.819],
 ['Extreme Gradient boosting', 0.814]]

# plot accuracy
plt.figure(figsize=(8, 4))
plt.plot([1]*5, cv_gnb, ".")
plt.plot([2]*5, cv_lr, ".")
plt.plot([3]*5, cv_dt, ".")
plt.plot([4]*5, cv_knn, ".")
plt.plot([5]*5, cv_rf, ".")
plt.plot([6]*5, cv_svc, ".")
plt.plot([7]*5, cv_xgb, ".")
plt.boxplot([cv_gnb,cv_lr,cv_dt,cv_knn,cv_rf,cv_svc,cv_xgb], 
            labels=("Naive Bayes","Logistic","Decision Tree","KNN","RF","SVC","XGB"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()

Logistic regression and SVC seem to perform best
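
The comparison above uses accuracy only. With imbalanced classes it can help to look at other metrics as well. The sketch below is an assumption about how that could be done, not something the notebook ran: it uses `cross_validate` with several scorers on synthetic data that has a similar class balance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced data (about 22% positives), standing in for the real set
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.78, 0.22], random_state=42)

models = {"Naive Bayes": GaussianNB(),
          "Logistic regression": LogisticRegression(max_iter=2000)}
metrics = ["accuracy", "recall", "roc_auc"]

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=metrics)
    # cross_validate stores each metric under the key "test_<metric>"
    results[name] = {m: cv[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```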

Fine tune model

Let's try to fine-tune logistic regression, RF, SVC and XGB

from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import RandomizedSearchCV

# Create a simple performance reporting function
def clf_performance(classifier,model_name):
    print(model_name)
    print('Best_score: '+str(classifier.best_score_))
    print('Best_parameters: '+str(classifier.best_params_))

lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
              'penalty' : ['l1', 'l2'],
              'C' : np.logspace(-4, 4, 20),
              'solver' : ['liblinear']}

clf_lr = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_lr = clf_lr.fit(X_train_prepared,y_train)
clf_performance(best_clf_lr,'Logistic Regression')
Fitting 5 folds for each of 40 candidates, totalling 200 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   23.6s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   24.8s finished
Logistic Regression
Best_score: 0.8210416666666667
Best_parameters: {'C': 0.23357214690901212, 'max_iter': 2000, 'penalty': 'l1', 'solver': 'liblinear'}

No visible improvement for Logistic regression
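
To judge whether a tuned score is really different from the baseline, it can help to compare the gain with the fold-to-fold variation. The sketch below runs on synthetic data with a smaller, hypothetical grid and reads both numbers from `cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, standing in for X_train_prepared and y_train
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=2000),
    param_grid={"C": np.logspace(-2, 2, 5)},  # hypothetical, smaller grid
    cv=5,
)
grid.fit(X, y)

results = grid.cv_results_
best = grid.best_index_
mean_scores = results["mean_test_score"]
spread = mean_scores.max() - mean_scores.min()

print("best mean accuracy:", round(mean_scores[best], 3))
print("fold std at best:  ", round(results["std_test_score"][best], 3))
print("spread across grid:", round(spread, 3))
```

If the spread across the grid is no larger than the standard deviation across folds, the hyperparameter has little measurable effect, which would be consistent with the "no visible improvement" observation.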

rf = RandomForestClassifier(random_state = 1)
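# note: max_features='auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
param_grid =  {'n_estimators': [100,300,500],
               'criterion':['gini','entropy'],
               'max_depth': [15, 20],
               'max_features': ['auto','sqrt', 10]}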
                                  
clf_rf = GridSearchCV(rf, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_rf = clf_rf.fit(X_train_prepared,y_train)
clf_performance(best_clf_rf,'Random Forest')
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 12.6min finished
Random Forest
Best_score: 0.81975
Best_parameters: {'criterion': 'gini', 'max_depth': 15, 'max_features': 'auto', 'n_estimators': 300}

Slight improvement for Random Forest

# svc = SVC()
# param_grid = tuned_parameters = [{'kernel': ['rbf'],'C': [.1, 1, 10]},
#                                  {'kernel': ['linear'], 'C': [.1, 1, 10]},
#                                  {'kernel': ['poly'], 'degree' : [3,5], 'C': [.1, 1, 10]}]
# clf_svc = GridSearchCV(svc, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
# best_clf_svc = clf_svc.fit(X_train_prepared,y_train)
# clf_performance(best_clf_svc,'SVC')

Lesson learnt: kernel SVMs scale poorly with the number of training samples, so the grid search was very slow on 24,000 rows and is commented out above.

And there was no improvement either

xgb = XGBClassifier(random_state = 1)

param_grid = {
    'n_estimators': [200,300,500],
    'max_depth': [None],
    'subsample': [0.3,0.5,0.8],
    'learning_rate':[0.5],
    'sampling_method': ['uniform']
}

clf_xgb = GridSearchCV(xgb, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_xgb = clf_xgb.fit(X_train_prepared,y_train)
clf_performance(best_clf_xgb,'XGB')
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
C:\Users\Riyan Aditya\Anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py:706: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  3.4min finished
XGB
Best_score: 0.7889583333333333
Best_parameters: {'learning_rate': 0.5, 'max_depth': None, 'n_estimators': 200, 'sampling_method': 'uniform', 'subsample': 0.8}

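No improvement with XGB either. The tuned score (0.789) is actually lower than the baseline with default settings (0.814), possibly because the fixed learning rate of 0.5 is too high.

It looks like grid search didn't really help. Logistic regression was best anyway, which is plausible for a binary (0/1) target like this one.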

Focus on LR

LR is the fastest to grid search. Can we improve it further?

lr = LogisticRegression()
param_grid = {'max_iter' : [100,1000,2000],
              'penalty' : ['l1', 'l2'],
              'C' : [0.1,0.3,0.5,0.7,0.9],
              'solver' : ['liblinear']}

clf_lr_v2 = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
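# fit the v2 search; the recorded output below (40 candidates, C taken from the
# logspace grid) came from refitting clf_lr by mistake
best_clf_lr_v2 = clf_lr_v2.fit(X_train_prepared,y_train)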
clf_performance(best_clf_lr_v2,'Logistic Regression')
Fitting 5 folds for each of 40 candidates, totalling 200 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    1.4s
Logistic Regression
Best_score: 0.8210416666666667
Best_parameters: {'C': 0.23357214690901212, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'liblinear'}
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   30.6s finished

Not much improvement

Save best model

import pickle

def pickle_model(model, filename):
    with open(filename, 'wb') as file:  
        pickle.dump(model, file)

#pickle_model(best_clf_lr_v2, 'best_clf_lr')
# pickle_model(best_clf_rf, 'best_clf_rf')
# pickle_model(best_clf_svc, 'best_clf_svc')
# pickle_model(best_clf_xgb, 'best_clf_xgb')
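
For completeness, here is a sketch of loading a pickled model back and checking that it behaves the same. The `load_model` helper, the toy model and the file name are made up for illustration. Pickle files should only be loaded from trusted sources, and a model is best loaded with the same scikit-learn version that saved it.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

def load_model(filename):
    """Counterpart to pickle_model: read a model back from disk."""
    with open(filename, "rb") as file:
        return pickle.load(file)

# Toy model on made-up data, standing in for the tuned classifier
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")  # hypothetical file name
    with open(path, "wb") as file:
        pickle.dump(model, file)
    restored = load_model(path)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```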

Test model

Let's evaluate the tuned logistic regression model on the test set

y_test.head()
6907     0
24575    0
26766    0
2156     1
3179     0
Name: DEFAULT, dtype: int64

X_test_prepared = full_pipeline.transform(X_test)

final_model = best_clf_lr_v2.best_estimator_
y_test_pred = final_model.predict(X_test_prepared)
y_test_pred_wproba = final_model.predict_proba(X_test_prepared)
y_test[0:5]
6907     0
24575    0
26766    0
2156     1
3179     0
Name: DEFAULT, dtype: int64
y_test_pred[0:5]
array([0, 0, 0, 0, 0], dtype=int64)
y_test_pred_wproba[0:5]
array([[0.85660147, 0.14339853],
       [0.8327953 , 0.1672047 ],
       [0.83491694, 0.16508306],
       [0.90460951, 0.09539049],
       [0.93294588, 0.06705412]])
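
The second column of `predict_proba` is the estimated probability of default. For a binary logistic regression, `predict` corresponds to a 0.5 cut-off on that probability. The sketch below applies a lower cut-off to the five probabilities printed above; the value 0.15 is purely illustrative, not a recommendation.

```python
import numpy as np

# Probability of default (second column above) for the first five test rows
p_default = np.array([0.14339853, 0.1672047, 0.16508306,
                      0.09539049, 0.06705412])

# Equivalent to predict(): flag a default when the probability reaches 0.5
pred_default_threshold = (p_default >= 0.5).astype(int)

# A lower, hypothetical cut-off flags more customers as likely defaulters
pred_lower_threshold = (p_default >= 0.15).astype(int)

print(pred_default_threshold, pred_lower_threshold)
```

A lower cut-off catches more defaulters at the cost of more false alarms. Note that the fourth customer, who did default, has a low predicted probability and is missed either way.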

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

accuracy_score(y_test, y_test_pred)
0.8168333333333333

About 82% of the test set was predicted correctly (accuracy 0.817), similar to the cross-validation result

confusion_matrix(y_test, y_test_pred)
array([[4428,  245],
       [ 854,  473]], dtype=int64)
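
Accuracy alone hides how the model does on the minority class. As a sketch, precision and recall for the default class can be derived directly from the confusion matrix above (rows are the true class, columns the predicted class).

```python
# Values copied from the confusion matrix above
tn, fp = 4428, 245   # true non-default: predicted 0, predicted 1
fn, tp = 854, 473    # true default:     predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted defaults, how many were real
recall = tp / (tp + fn)      # of the real defaults, how many were caught

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

So although accuracy is about 82%, the model identifies only a little over a third of the actual defaulters in the test set.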