Classification problem - credit card default prediction
My attempt to predict whether a person defaults on their credit card debt.
Data is downloaded from UCI ML database: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
The dataset is about whether credit card clients in Taiwan defaulted. I believe the data is from around October 2005.
Interesting insights from the dataset:
- No obvious correlation between age and probability of default
- Defaulters tend to have lower credit limits. Makes sense, perhaps they have lower credit ratings
- No significant difference in recent bill amounts between defaulters and non-defaulters
- Significantly lower recent payment amounts for defaulters, perhaps indicating they struggle to make payments
- Males have a higher default ratio (24%) than females (21%)
- Graduates have a significantly lower default ratio, whereas there is little difference between university and high school
- Married people have a slightly higher default ratio
Summary of notebook workflow:
- Download and import data from UCI ML repo
- EDA and insights
- Train/test split via stratified sampling, since there are far more non-defaults than defaults
- Create numerical and categorical pipelines for feature preparation (standard scaling, category clean-up and one-hot encoding)
- Train 7 base models
- Fine-tune 4 models (logistic regression, random forest, support vector and extreme gradient boosting) via grid search
- Fine-tuning did not yield much improvement => final model is logistic regression (best CV accuracy ~82%)
- Evaluate on the test set: prediction accuracy 82%
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# for offline ploting
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
import plotly.express as px
from IPython.display import HTML
os.chdir("C:/Users/Riyan Aditya/Desktop/ML_learning/Project10_clf_cc_default")
df = pd.read_excel('default of credit card clients.xls', skiprows = [0])
df.head()
We have the following variables:
- DEFAULT (1 = yes, 0 = no)
- LIMIT_BAL = total credit (NT dollar)
- SEX (1 = male, 2 = female)
- EDUCATION (1 = graduate school, 2 = university, 3 = high school, 4 = others)
- MARRIAGE (1 = married, 2 = single, 3 = others)
- AGE
- PAY_* = repayment status (September back to April). The number indicates how many months the payment is delayed (e.g. 3 means 3 months late)
- BILL_AMT* = bill amount (September back to April)
- PAY_AMT* = amount paid (September back to April)
Let's look at the first customer. Looks like they defaulted.
# lets see the first customer
df.loc[0]
df.info()
Great. No missing data
df['default payment next month'].value_counts()
Around 6,600 people (22%) defaulted. The proportion of defaults vs non-defaults is imbalanced, so consider stratified sampling later.
# rename "default payment next month". Simply too long
df=df.rename(columns = {'default payment next month':'DEFAULT'})
df.columns
Let's identify which variables are numerical and which are categorical.
df_num = df[['LIMIT_BAL','AGE',
'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
'PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']]
df_cat = df[['DEFAULT','SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']]
# plot histogram for all numerical data
for i in df_num.columns:
    plt.hist(df_num[i])
    plt.title(i)
    plt.show()
- The distributions are heavily right-skewed. They might need to be normalised.
- The scales also differ (age is in the tens while bills are in the millions). They might need to be standardised.
Let's check the correlation plot.
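To see why these are two separate concerns, here is a quick sketch on synthetic log-normal data (an assumption standing in for the real bill amounts): standardising only rescales and leaves the skew untouched, while a log transform actually changes the shape of the distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic right-skewed "bill amounts" (log-normal, like the histograms above)
bills = rng.lognormal(mean=10, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: mean of the cubed standardised values."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# standardising is an affine transform: scale changes, shape (skew) does not
standardised = (bills - bills.mean()) / bills.std()
# log1p changes the shape: a log-normal becomes (almost) normal
logged = np.log1p(bills)

print(round(skewness(bills), 2))         # strongly right-skewed
print(round(skewness(standardised), 2))  # identical skewness
print(round(skewness(logged), 2))        # close to 0
```

So StandardScaler alone (as used in the pipeline below) fixes the scale issue but not the skew.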
sns.heatmap(df_num.corr())
Bill amounts are highly correlated with each other. Perhaps this suggests they are cumulative (a recent bill is the sum of all outstanding bills). Let's also assume recent bills and recent payments are more significant than older ones (i.e. BILL_AMT1 matters more than BILL_AMT2).
There is some correlation between payment amounts and bill amounts, which makes sense.
Age shows no correlation with anything.
How about default vs some of the variables?
pd.pivot_table(df,index = 'DEFAULT', values = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','PAY_AMT1','PAY_AMT2','PAY_AMT3'])
Some insights:
- Average age is about the same for defaulters and non-defaulters
- Defaulters tend to have lower credit limits. Makes sense, perhaps they have lower credit ratings
- No significant difference in recent bill amounts between defaulters and non-defaulters
- Significantly lower recent payment amounts for defaulters, perhaps indicating they struggle to make payments
for i in df_cat.columns:
    sns.barplot(x=df_cat[i].value_counts().index, y=df_cat[i].value_counts()).set_title(i)
    plt.show()
Let's look at these categories relative to whether people default or not.
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'SEX', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'EDUCATION', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'MARRIAGE', values = 'ID', aggfunc = 'count'))
print()
print('male default ratio', round(2873/(2873+9015),2))
print('female default ratio',round(3763/(3763+14349),2))
print('graduate default ratio', round(2036/(2036+8549),2))
print('uni default ratio',round(3330/(3330+10700),2))
print('HS default ratio',round(1237/(1237+3680),2))
print('married default ratio', round(3206/(3206+10453),2))
print('single default ratio',round(3341/(3341+12623),2))
Insights:
- More females than males
- But males have a higher default ratio (24%) than females (21%)
- There are education categories 0, 4, 5 and 6, while the attribute documentation only defines 1-4. Consider grouping 0, 4, 5 and 6 into 4
- Graduates have a significantly lower default ratio, whereas there is little difference between university and high school
- Married people have a slightly higher default ratio
- Again, marriage has categories besides 1, 2 and 3. Consider grouping 0 and 3 into 3
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_0', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_2', values = 'ID', aggfunc = 'count'))
print()
print(pd.pivot_table(df,index = 'DEFAULT',columns = 'PAY_3', values = 'ID', aggfunc = 'count'))
print()
OK, there are many possible categories here. I am going to assume:
- Negative values and 0 mean early or on-time payment
- Positive values mean late payment
Late payment insights:
- Makes sense: if the recent repayment status shows a delay of many months, you are more likely to default
- But, although it is a smaller proportion, how could people who pay their card on time (or early) still be categorised as defaulting? These are the -1 and -2 values in the DEFAULT = 1 row
Plan:
- Train/test split, using stratified sampling based on the proportion of defaults
- Normalise and standardise the numerical variables
- Lump the unknown categories in education and marriage status into "others"
- Possibly also reduce the late-payment categorical data into fewer categories
df.columns
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(df, df['DEFAULT']):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]
strat_train_set['DEFAULT'].value_counts()/len(strat_train_set)
strat_test_set['DEFAULT'].value_counts()/len(strat_test_set)
strat_train_set.shape, strat_test_set.shape
Great. Non-default is 78% and default is 22% in both the train and test sets.
strat_train_set.columns
For simplicity, I am only going to use the repayment status, bill amount and previous payment amount from the latest 3 months (i.e. drop the 4th, 5th and 6th columns).
X_train = strat_train_set[['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0','PAY_2', 'PAY_3',
'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
y_train = strat_train_set.DEFAULT.copy()
X_test = strat_test_set[['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0','PAY_2', 'PAY_3',
'BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
y_test = strat_test_set.DEFAULT.copy()
We need to split into numerical and categorical attributes to prepare for the pipelines.
# split to numerical and categorical to prepare for pipeline
X_train_num = X_train[['LIMIT_BAL','AGE','BILL_AMT1', 'BILL_AMT2','BILL_AMT3',
'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']]
X_train_cat = X_train[['SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3']]
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer
num_pipeline = Pipeline([('std_scaler',StandardScaler())])
X_train_num_tr = num_pipeline.fit_transform(X_train_num)
X_train_num_tr
We will need to create a custom transformer to simplify the categorical data.
from sklearn.base import BaseEstimator, TransformerMixin
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, EDUCATION=True, MARRIAGE=True, PAY_0=True, PAY_2=True, PAY_3=True):
        self.EDUCATION = EDUCATION
        self.MARRIAGE = MARRIAGE
        self.PAY_0 = PAY_0
        self.PAY_2 = PAY_2
        self.PAY_3 = PAY_3

    # nothing to learn here, so fit just returns self
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # work on a copy so we do not mutate the caller's DataFrame
        X = X.copy()
        # clean education:
        # education has values 0-6 but the attribute documentation only mentions 1-4,
        # so lump 0, 5 and 6 together with 4 ("others")
        X.loc[X.EDUCATION > 4, 'EDUCATION'] = 4
        X.loc[X.EDUCATION < 1, 'EDUCATION'] = 4
        # clean marriage:
        # marriage has values 0-3 but the documentation only mentions 1-3,
        # so lump 0 together with 3 ("others")
        X.loc[X.MARRIAGE < 1, 'MARRIAGE'] = 3
        # clean PAY_0, PAY_2 and PAY_3 into 5 categories, 0 to 4:
        # 0 = paid on time or early (the original 0 and negative values)
        # 1-3 = that many months late
        # 4 = 4 or more months late (the original 4 and above)
        X.loc[X.PAY_0 < 1, 'PAY_0'] = 0
        X.loc[X.PAY_0 > 4, 'PAY_0'] = 4
        X.loc[X.PAY_2 < 1, 'PAY_2'] = 0
        X.loc[X.PAY_2 > 4, 'PAY_2'] = 4
        X.loc[X.PAY_3 < 1, 'PAY_3'] = 0
        X.loc[X.PAY_3 > 4, 'PAY_3'] = 4
        return X
pd.options.mode.chained_assignment = None # default='warn'
cat_custom_prep = CategoricalTransformer()
X_train_cat_prep = cat_custom_prep.transform(X_train_cat)
pd.options.mode.chained_assignment = 'warn' # default='warn'
OK, the data cleaning of those attributes was successful:
X_train_cat_prep.head()
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
X_train_cat_prep_1hot = cat_encoder.fit_transform(X_train_cat_prep)
X_train_cat_prep_1hot
from sklearn.compose import ColumnTransformer
num_attribs = ['LIMIT_BAL','AGE','BILL_AMT1', 'BILL_AMT2','BILL_AMT3',
'PAY_AMT1','PAY_AMT2', 'PAY_AMT3']
cat_attribs = ['SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2', 'PAY_3']
num_pipeline = Pipeline([('std_scaler',StandardScaler())])
cat_pipeline = Pipeline([ ('cat_custom',CategoricalTransformer()),
('OHE', OneHotEncoder())])
full_pipeline = ColumnTransformer([
("num",num_pipeline, num_attribs),
("cat",cat_pipeline, cat_attribs)])
X_train_prepared = full_pipeline.fit_transform(X_train)
Since we are predicting default or not (i.e. 0 or 1) and have labelled data, this is a binary classification problem.
The plan is to run 5-fold cross-validation with the following models as baselines, then tune the best ones:
- Naive bayes
- Logistic regression
- Decision tree
- K nearest neighbour
- Random Forest
- SVC
- XGB
# import various sklearn ML library
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
Naive Bayes:
gnb = GaussianNB()
cv_gnb = cross_val_score(gnb,X_train_prepared,y_train,cv=5)
print(cv_gnb )
print(cv_gnb.mean().round(3))
Logistic
lr = LogisticRegression(max_iter = 2000)
cv_lr = cross_val_score(lr,X_train_prepared,y_train,cv=5)
print(cv_lr )
print(cv_lr.mean().round(3))
Decision tree
dt = tree.DecisionTreeClassifier(random_state = 42)
cv_dt = cross_val_score(dt,X_train_prepared,y_train,cv=5)
print(cv_dt )
print(cv_dt.mean().round(3))
KNN
knn = KNeighborsClassifier()
cv_knn = cross_val_score(knn,X_train_prepared,y_train,cv=5)
print(cv_knn )
print(cv_knn.mean().round(3))
Random forest
rf = RandomForestClassifier(random_state = 42)
cv_rf = cross_val_score(rf,X_train_prepared,y_train,cv=5)
print(cv_rf )
print(cv_rf.mean().round(3))
Support vector classifier
svc = SVC()
cv_svc = cross_val_score(svc,X_train_prepared,y_train,cv=5)
print(cv_svc )
print(cv_svc.mean().round(3))
XGB
xgb = XGBClassifier(random_state =1)
cv_xgb = cross_val_score(xgb,X_train_prepared,y_train,cv=5)
print(cv_xgb )
print(cv_xgb.mean().round(3))
data_matrix = [["Model","cv score"],
['Naive Bayes',cv_gnb.mean().round(3)],
['Logistic regression',cv_lr.mean().round(3)],
['Decision Tree',cv_dt.mean().round(3)],
['K Nearest neighbour',cv_knn.mean().round(3)],
['Random Forest',cv_rf.mean().round(3)],
['Support vector classifier',cv_svc.mean().round(3)],
['Extreme Gradient boosting',cv_xgb.mean().round(3)]]
data_matrix
# plot accuracy
plt.figure(figsize=(8, 4))
plt.plot([1]*5, cv_gnb, ".")
plt.plot([2]*5, cv_lr, ".")
plt.plot([3]*5, cv_dt, ".")
plt.plot([4]*5, cv_knn, ".")
plt.plot([5]*5, cv_rf, ".")
plt.plot([6]*5, cv_svc, ".")
plt.plot([7]*5, cv_xgb, ".")
plt.boxplot([cv_gnb,cv_lr,cv_dt,cv_knn,cv_rf,cv_svc,cv_xgb],
labels=("Naive Bayes","Logistic","Decision Tree","KNN","RF","SVC","XGB"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()
Seems like logistic regression and SVC are best.
Let's fine-tune logistic regression, RF, SVC and XGB.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Create a simple performance reporting function
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best_score: ' + str(classifier.best_score_))
    print('Best_parameters: ' + str(classifier.best_params_))
lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
'penalty' : ['l1', 'l2'],
'C' : np.logspace(-4, 4, 20),
'solver' : ['liblinear']}
clf_lr = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_lr = clf_lr.fit(X_train_prepared,y_train)
clf_performance(best_clf_lr,'Logistic Regression')
No visible improvement for Logistic regression
rf = RandomForestClassifier(random_state = 1)
param_grid = {'n_estimators': [100,300,500],
'criterion':['gini','entropy'],
'max_depth': [15, 20],
'max_features': ['auto','sqrt', 10]}
clf_rf = GridSearchCV(rf, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_rf = clf_rf.fit(X_train_prepared,y_train)
clf_performance(best_clf_rf,'Random Forest')
Slight improvement for Random Forest
# svc = SVC()
# param_grid = [{'kernel': ['rbf'], 'C': [.1, 1, 10]},
#               {'kernel': ['linear'], 'C': [.1, 1, 10]},
#               {'kernel': ['poly'], 'degree': [3, 5], 'C': [.1, 1, 10]}]
# clf_svc = GridSearchCV(svc, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
# best_clf_svc = clf_svc.fit(X_train_prepared,y_train)
# clf_performance(best_clf_svc,'SVC')
Lesson learnt: kernel SVMs are not suitable for large training sets with a large number of features.
And there was no improvement either.
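If a linear decision boundary would do, one scalable alternative worth noting is `LinearSVC`, whose liblinear solver handles large sample counts far better than kernel `SVC`. A minimal sketch on synthetic data (not the actual prepared training set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# synthetic stand-in with a roughly 78/22 class imbalance, like this dataset
X_demo, y_demo = make_classification(n_samples=5000, n_features=20,
                                     weights=[0.78, 0.22], random_state=42)

# linear SVM without the kernel trick: cheap enough to cross-validate quickly
lin_svc = LinearSVC(C=1.0, dual=False, max_iter=5000)
scores = cross_val_score(lin_svc, X_demo, y_demo, cv=5)
print(scores.mean().round(3))
```

This trades away non-linear kernels for tractable training time, which may be an acceptable compromise here given that the linear logistic regression is already the front-runner.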
xgb = XGBClassifier(random_state = 1)
param_grid = {
'n_estimators': [200,300,500],
'max_depth': [None],
'subsample': [0.3,0.5,0.8],
'learning_rate':[0.5],
'sampling_method': ['uniform']
}
clf_xgb = GridSearchCV(xgb, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_xgb = clf_xgb.fit(X_train_prepared,y_train)
clf_performance(best_clf_xgb,'XGB')
Not much improvement with XGB either
Looks like grid search didn't really help; logistic regression was best anyway, which kind of makes sense for a simple 0/1 prediction.
LR is the fastest to grid-search. Can we improve it?
lr = LogisticRegression()
param_grid = {'max_iter' : [100,1000,2000],
'penalty' : ['l1', 'l2'],
'C' : [0.1,0.3,0.5,0.7,0.9],
'solver' : ['liblinear']}
clf_lr_v2 = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_lr_v2 = clf_lr_v2.fit(X_train_prepared,y_train)
clf_performance(best_clf_lr_v2,'Logistic Regression')
Not much improvement
import pickle
def pickle_model(model, filename):
    with open(filename, 'wb') as file:
        pickle.dump(model, file)
#pickle_model(best_clf_lr_v2, 'best_clf_lr')
# pickle_model(best_clf_rf, 'best_clf_rf')
# pickle_model(best_clf_svc, 'best_clf_svc')
# pickle_model(best_clf_xgb, 'best_clf_xgb')
Let's evaluate our logistic regression v2 model on the test set.
y_test.head()
X_test_prepared = full_pipeline.transform(X_test)
final_model = best_clf_lr_v2.best_estimator_
y_test_pred = final_model.predict(X_test_prepared)
y_test_pred_wproba = final_model.predict_proba(X_test_prepared)
y_test[0:5]
y_test_pred[0:5]
y_test_pred_wproba[0:5]
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
accuracy_score(y_test, y_test_pred)
81% of the test set was predicted correctly, similar to the validation result.
confusion_matrix(y_test, y_test_pred)
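Since only ~22% of customers default, accuracy alone can flatter the model (always predicting "no default" would already score ~78%), so it is worth unpacking the confusion matrix into precision and recall for the default class. A sketch with hypothetical counts, not the actual matrix above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels and predictions, NOT the actual test-set results:
# 78 non-defaulters (8 wrongly flagged) and 22 defaulters (only 8 caught)
y_true = np.array([0] * 78 + [1] * 22)
y_pred = np.array([0] * 70 + [1] * 8 + [0] * 14 + [1] * 8)

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)     # share of actual defaulters we caught
precision = tp / (tp + fp)  # share of predicted defaulters that were real
accuracy = (tp + tn) / cm.sum()

print(cm)
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

In this made-up example the accuracy is a respectable 0.78 even though recall on defaulters is only about 0.36, which is why checking the confusion matrix (as above) matters for this problem.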