Classification problem - credit card default prediction
My attempt to predict whether a person defaults on their credit card debt.
Data is downloaded from UCI ML database: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
The dataset is about whether credit card clients in Taiwan defaulted. I believe the data is from around October 2005.
Interesting insights from the dataset:
- No obvious correlation between age and probability of default
- Defaulters tend to have lower credit limits. Makes sense, perhaps they have lower credit ratings
- No significant difference in recent bill amounts between defaulters and non-defaulters
- Significantly lower recent payment amounts for defaulters, perhaps indicating they struggle to make payments
- Males have a higher default ratio (24%) than females (21%)
- Graduates have a significantly lower default ratio, whereas there is little difference between university and high school
- Married people have a slightly higher default ratio
Summary of notebook workflow:
- Download and import data from UCI ML repo
- EDA and insights
- Train/test split via stratified sampling, since there are far more non-defaults than defaults
- Create numerical and categorical pipelines for feature preparation (standard scaling, category clean-up and one-hot encoding)
- Train 7 base models
- Fine-tune 4 models (logistic regression, random forest, support vector and extreme gradient boosting) via grid search
- Fine-tuning did not yield much improvement => final model is logistic regression (best CV accuracy ~82%)
- Evaluate on the test set: prediction accuracy 82%
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# for offline ploting
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
import plotly.express as px
from IPython.display import HTML
os.chdir("C:/Users/Riyan Aditya/Desktop/ML_learning/Project10_clf_cc_default")
df = pd.read_excel('default of credit card clients.xls', skiprows = [0])
df.head()
We have the following variables:
- DEFAULT (1 = yes, 0 = no)
- LIMIT_BAL = total credit (NT dollar)
- SEX (1 = male, 2 = female)
- EDUCATION (1 = graduate school, 2 = university, 3 = high school, 4 = others)
- MARRIAGE (1 = married, 2 = single, 3 = others)
- AGE
- PAY_* = repayment status (September back to April). The number indicates how many months the payment is delayed (e.g. 3 means 3 months late)
- BILL_AMT* = bill amount (September back to April)
- PAY_AMT* = amount paid (September back to April)
Let's look at the first customer. Looks like they defaulted.
# lets see the first customer
df.loc[0]
df.info()
Great. No missing data
df['default payment next month'].value_counts()
Around 6,600 people (22%) defaulted. The proportion of defaults vs non-defaults is imbalanced, so consider stratified sampling later.
# rename "default payment next month". Simply too long
df=df.rename(columns = {'default payment next month':'DEFAULT'})
df.columns
Let's identify which variables are numerical and which are categorical.
df_num = df[['LIMIT_BAL','AGE',
'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
'PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']]
df_cat = df[['DEFAULT','SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']]
# plot histogram for all numerical data
for i in df_num.columns:
    plt.hist(df_num[i])
    plt.title(i)
    plt.show()
- The distributions are heavily right-skewed. They might need to be normalised.
- The scales also differ (age is in the tens while bills are in the millions). They might need to be standardised.
Let's check the correlation plot.
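To see why these are two separate concerns, here is a quick sketch on synthetic log-normal data (an assumption standing in for the real bill amounts): standardising only rescales and leaves the skew untouched, while a log transform actually changes the shape of the distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic right-skewed "bill amounts" (log-normal, like the histograms above)
bills = rng.lognormal(mean=10, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: mean of the cubed standardised values."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# standardising is an affine transform: scale changes, shape (skew) does not
standardised = (bills - bills.mean()) / bills.std()
# log1p changes the shape: a log-normal becomes (almost) normal
logged = np.log1p(bills)

print(round(skewness(bills), 2))         # strongly right-skewed
print(round(skewness(standardised), 2))  # identical skewness
print(round(skewness(logged), 2))        # close to 0
```

So StandardScaler alone (as used in the pipeline below) fixes the scale issue but not the skew.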
sns.heatmap(df_num.corr())
Bill amounts are highly correlated with each other. Perhaps this suggests they are cumulative (a recent bill is the sum of all outstanding bills). Let's also assume recent bills and recent payments are more significant than older ones (i.e. BILL_AMT1 matters more than BILL_AMT2).
There is some correlation between payment amounts and bill amounts, which makes sense.
Age shows no correlation with anything.
How about default vs some of the variables?
pd.pivot_table(df,index = 'DEFAULT', values = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','PAY_AMT1','PAY_AMT2','PAY_AMT3'])
Some insights:
- Average age is about the same for defaulters and non-defaulters
- Defaulters tend to have lower credit limits. Makes sense, perhaps they have lower credit ratings
- No significant difference in recent bill amounts between defaulters and non-defaulters
- Significantly lower recent payment amounts for defaulters, perhaps indicating they struggle to make payments
for i in df_cat.columns:
    sns.barplot(x=df_cat[i].value_counts().index, y=df_cat[i].value_counts()).set_title(i)
    plt.show()
Let's look at these categories relative to whether people default or not.
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'SEX', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'EDUCATION', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'MARRIAGE', values = 'ID', aggfunc = 'count'))
print()
print('male default ratio', round(2873/(2873+9015),2))
print('female default ratio',round(3763/(3763+14349),2))
print('graduate default ratio', round(2036/(2036+8549),2))
print('uni default ratio',round(3330/(3330+10700),2))
print('HS default ratio',round(1237/(1237+3680),2))
print('married default ratio', round(3206/(3206+10453),2))
print('single default ratio',round(3341/(3341+12623),2))
Insights:
- More females than males
- But males have a higher default ratio (24%) than females (21%)
- There are education categories 0, 4, 5 and 6, while the attribute documentation only defines 1-4. Consider grouping 0, 4, 5 and 6 into 4
- Graduates have a significantly lower default ratio, whereas there is little difference between university and high school
- Married people have a slightly higher default ratio
- Again, marriage has categories besides 1, 2 and 3. Consider grouping 0 and 3 into 3
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_0', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_2', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_3', values = 'ID', aggfunc = 'count'))
print()
OK, there are many possible categories here. I am going to assume:
- Negative values and 0 mean early or on-time payment
- Positive values mean late payment
Late payment insights:
- Makes sense: if the recent repayment status shows a delay of many months, you are more likely to default
- But, although it is a smaller proportion, how could people who pay their card on time (or early) still be categorised as defaulting? These are the -1 and -2 values in the DEFAULT = 1 row
Plan:
- Train/test split, using stratified sampling based on the proportion of defaults
- Normalise and standardise the numerical variables
- Lump the unknown categories in education and marriage status into "others"
- Possibly also reduce the late-payment categorical data into fewer categories
df.columns
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(df, df['DEFAULT']):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]
strat_train_set['DEFAULT'].value_counts()/len(strat_train_set)
strat_test_set['DEFAULT'].value_counts()/len(strat_test_set)
strat_train_set.shape, strat_test_set.shape
Great. Non-default is 78% and default is 22% in both the train and test sets.
strat_train_set.columns
For simplicity, I am only going to use the repayment status, bill amount and previous payment amount from the latest 3 months (i.e. drop the 4th, 5th and 6th columns).
X_train = strat_train_set[['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0','PAY_2', 'PAY_3',
'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
y_train = strat_train_set.DEFAULT.copy()
X_test = strat_test_set[['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0','PAY_2', 'PAY_3',
'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
y_test = strat_test_set.DEFAULT.copy()
We need to split into numerical and categorical attributes to prepare for the pipelines.
# split to numerical and categorical to prepare for pipeline
X_train_num = X_train[['LIMIT_BAL','AGE','BILL_AMT1', 'BILL_AMT2','BILL_AMT3',
'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
X_train_cat = X_train[['SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3']]
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer
num_pipeline = Pipeline([('std_scaler',StandardScaler())])
X_train_num_tr = num_pipeline.fit_transform(X_train_num)
X_train_num_tr
We will need to create a custom transformer to simplify the categorical data.
from sklearn.base import BaseEstimator, TransformerMixin
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, EDUCATION=True, MARRIAGE=True, PAY_0=True, PAY_2=True, PAY_3=True):
        self.EDUCATION = EDUCATION
        self.MARRIAGE = MARRIAGE
        self.PAY_0 = PAY_0
        self.PAY_2 = PAY_2
        self.PAY_3 = PAY_3

    # nothing to learn here, so fit just returns self
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # work on a copy so we do not mutate the caller's DataFrame
        X = X.copy()
        # clean education:
        # education has values 0-6 but the attribute documentation only mentions 1-4,
        # so lump 0, 5 and 6 together with 4 ("others")
        X.loc[X.EDUCATION > 4, 'EDUCATION'] = 4
        X.loc[X.EDUCATION < 1, 'EDUCATION'] = 4
        # clean marriage:
        # marriage has values 0-3 but the documentation only mentions 1-3,
        # so lump 0 together with 3 ("others")
        X.loc[X.MARRIAGE < 1, 'MARRIAGE'] = 3
        # clean PAY_0, PAY_2 and PAY_3 into 5 categories, 0 to 4:
        # 0 = paid on time or early (the original 0 and negative values)
        # 1-3 = that many months late
        # 4 = 4 or more months late (the original 4 and above)
        X.loc[X.PAY_0 < 1, 'PAY_0'] = 0
        X.loc[X.PAY_0 > 4, 'PAY_0'] = 4
        X.loc[X.PAY_2 < 1, 'PAY_2'] = 0
        X.loc[X.PAY_2 > 4, 'PAY_2'] = 4
        X.loc[X.PAY_3 < 1, 'PAY_3'] = 0
        X.loc[X.PAY_3 > 4, 'PAY_3'] = 4
        return X
pd.options.mode.chained_assignment = None # default='warn'
cat_custom_prep = CategoricalTransformer()
X_train_cat_prep = cat_custom_prep.transform(X_train_cat)
pd.options.mode.chained_assignment = 'warn' # default='warn'
OK, the data cleaning of those attributes was successful:
X_train_cat_prep.head()
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
X_train_cat_prep_1hot = cat_encoder.fit_transform(X_train_cat_prep)
X_train_cat_prep_1hot
from sklearn.compose import ColumnTransformer
num_attribs = ['LIMIT_BAL','AGE','BILL_AMT1', 'BILL_AMT2','BILL_AMT3',
'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']
cat_attribs = ['SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3']
num_pipeline = Pipeline([('std_scaler',StandardScaler())])
cat_pipeline = Pipeline([ ('cat_custom',CategoricalTransformer()),
('OHE', OneHotEncoder())])
full_pipeline = ColumnTransformer([
("num",num_pipeline, num_attribs),
("cat",cat_pipeline, cat_attribs)])
X_train_prepared = full_pipeline.fit_transform(X_train)
Since we are predicting default or not (i.e. 0 or 1) and have labelled data, this is a binary classification problem.
The plan is to run 5-fold cross-validation with the following models as baselines, then tune the best ones:
- Naive bayes
- Logistic regression
- Decision tree
- K nearest neighbour
- Random Forest
- SVC
- XGB
# import various sklearn ML library
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
Naive Bayes:
gnb = GaussianNB()
cv_gnb = cross_val_score(gnb,X_train_prepared,y_train,cv=5)
print(cv_gnb )
print(cv_gnb.mean().round(3))
Logistic
lr = LogisticRegression(max_iter = 2000)
cv_lr = cross_val_score(lr,X_train_prepared,y_train,cv=5)
print(cv_lr )
print(cv_lr.mean().round(3))
Decision tree
dt = tree.DecisionTreeClassifier(random_state = 42)
cv_dt = cross_val_score(dt,X_train_prepared,y_train,cv=5)
print(cv_dt )
print(cv_dt.mean().round(3))
KNN
knn = KNeighborsClassifier()
cv_knn = cross_val_score(knn,X_train_prepared,y_train,cv=5)
print(cv_knn )
print(cv_knn.mean().round(3))
Random forest
rf = RandomForestClassifier(random_state = 42)
cv_rf = cross_val_score(rf,X_train_prepared,y_train,cv=5)
print(cv_rf )
print(cv_rf.mean().round(3))
Support vector classifier
svc = SVC()
cv_svc = cross_val_score(svc,X_train_prepared,y_train,cv=5)
print(cv_svc )
print(cv_svc.mean().round(3))
XGB
xgb = XGBClassifier(random_state =1)
cv_xgb = cross_val_score(xgb,X_train_prepared,y_train,cv=5)
print(cv_xgb )
print(cv_xgb.mean().round(3))
data_matrix = [["Model","cv score"],
['Naive Bayes',cv_gnb.mean().round(3)],
['Logistic regression',cv_lr.mean().round(3)],
['Decision Tree',cv_dt.mean().round(3)],
['K Nearest neighbour',cv_knn.mean().round(3)],
['Random Forest',cv_rf.mean().round(3)],
['Support vector classifier',cv_svc.mean().round(3)],
['Extreme Gradient boosting',cv_xgb.mean().round(3)]]
data_matrix
# plot accuracy
plt.figure(figsize=(8, 4))
plt.plot([1]*5, cv_gnb, ".")
plt.plot([2]*5, cv_lr, ".")
plt.plot([3]*5, cv_dt, ".")
plt.plot([4]*5, cv_knn, ".")
plt.plot([5]*5, cv_rf, ".")
plt.plot([6]*5, cv_svc, ".")
plt.plot([7]*5, cv_xgb, ".")
plt.boxplot([cv_gnb,cv_lr,cv_dt,cv_knn,cv_rf,cv_svc,cv_xgb],
labels=("Naive Bayes","Logistic","Decision Tree","KNN","RF","SVC","XGB"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()
Seems like logistic regression and SVC are best.
Let's fine-tune logistic regression, RF, SVC and XGB.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Create a simple performance reporting function
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best_score: ' + str(classifier.best_score_))
    print('Best_parameters: ' + str(classifier.best_params_))
lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
'penalty' : ['l1', 'l2'],
'C' : np.logspace(-4, 4, 20),
'solver' : ['liblinear']}
clf_lr = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_lr = clf_lr.fit(X_train_prepared,y_train)
clf_performance(best_clf_lr,'Logistic Regression')
No visible improvement for Logistic regression
rf = RandomForestClassifier(random_state = 1)
param_grid = {'n_estimators': [100,300,500],
'criterion':['gini','entropy'],
'max_depth': [15, 20],
'max_features': ['auto','sqrt', 10]}
clf_rf = GridSearchCV(rf, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_rf = clf_rf.fit(X_train_prepared,y_train)
clf_performance(best_clf_rf,'Random Forest')
Slight improvement for Random Forest
# svc = SVC()
# param_grid = [{'kernel': ['rbf'], 'C': [.1, 1, 10]},
#               {'kernel': ['linear'], 'C': [.1, 1, 10]},
#               {'kernel': ['poly'], 'degree': [3, 5], 'C': [.1, 1, 10]}]
# clf_svc = GridSearchCV(svc, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
# best_clf_svc = clf_svc.fit(X_train_prepared,y_train)
# clf_performance(best_clf_svc,'SVC')
Lesson learnt: kernel SVMs are not suitable for large training sets with a large number of features.
And there was no improvement either.
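If a linear decision boundary would do, one scalable alternative worth noting is `LinearSVC`, whose liblinear solver handles large sample counts far better than kernel `SVC`. A minimal sketch on synthetic data (not the actual prepared training set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# synthetic stand-in with a roughly 78/22 class imbalance, like this dataset
X_demo, y_demo = make_classification(n_samples=5000, n_features=20,
                                     weights=[0.78, 0.22], random_state=42)

# linear SVM without the kernel trick: cheap enough to cross-validate quickly
lin_svc = LinearSVC(C=1.0, dual=False, max_iter=5000)
scores = cross_val_score(lin_svc, X_demo, y_demo, cv=5)
print(scores.mean().round(3))
```

This trades away non-linear kernels for tractable training time, which may be an acceptable compromise here given that the linear logistic regression is already the front-runner.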
xgb = XGBClassifier(random_state = 1)
param_grid = {
'n_estimators': [200,300,500],
'max_depth': [None],
'subsample': [0.3,0.5,0.8],
'learning_rate':[0.5],
'sampling_method': ['uniform']
}
clf_xgb = GridSearchCV(xgb, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_xgb = clf_xgb.fit(X_train_prepared,y_train)
clf_performance(best_clf_xgb,'XGB')
Not much improvement with XGB either
Looks like grid search didn't really help; logistic regression was best anyway, which kind of makes sense for a simple 0/1 prediction.
LR is the fastest to grid-search. Can we improve it?
lr = LogisticRegression()
param_grid = {'max_iter' : [100,1000,2000],
'penalty' : ['l1', 'l2'],
'C' : [0.1,0.3,0.5,0.7,0.9],
'solver' : ['liblinear']}
clf_lr_v2 = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_lr_v2 = clf_lr_v2.fit(X_train_prepared,y_train)
clf_performance(best_clf_lr_v2,'Logistic Regression')
Not much improvement
import pickle
def pickle_model(model, filename):
    with open(filename, 'wb') as file:
        pickle.dump(model, file)
#pickle_model(best_clf_lr_v2, 'best_clf_lr')
# pickle_model(best_clf_rf, 'best_clf_rf')
# pickle_model(best_clf_svc, 'best_clf_svc')
# pickle_model(best_clf_xgb, 'best_clf_xgb')
Let's evaluate our logistic regression v2 model on the test set.
y_test.head()
X_test_prepared = full_pipeline.transform(X_test)
final_model = best_clf_lr_v2.best_estimator_
y_test_pred = final_model.predict(X_test_prepared)
y_test_pred_wproba = final_model.predict_proba(X_test_prepared)
y_test[0:5]
y_test_pred[0:5]
y_test_pred_wproba[0:5]
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
accuracy_score(y_test, y_test_pred)
81% of the test set was predicted correctly, similar to the validation result.
confusion_matrix(y_test, y_test_pred)
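Since only ~22% of customers default, accuracy alone can flatter the model (always predicting "no default" would already score ~78%), so it is worth unpacking the confusion matrix into precision and recall for the default class. A sketch with hypothetical counts, not the actual matrix above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels and predictions, NOT the actual test-set results:
# 78 non-defaulters (8 wrongly flagged) and 22 defaulters (only 8 caught)
y_true = np.array([0] * 78 + [1] * 22)
y_pred = np.array([0] * 70 + [1] * 8 + [0] * 14 + [1] * 8)

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)     # share of actual defaulters we caught
precision = tp / (tp + fp)  # share of predicted defaulters that were real
accuracy = (tp + tn) / cm.sum()

print(cm)
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

In this made-up example the accuracy is a respectable 0.78 even though recall on defaulters is only about 0.36, which is why checking the confusion matrix (as above) matters for this problem.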