#MachineLearning #SupervisedLearning #Classification

By Billy Gustave

PeerLoanKart NBFC

Business Challenge/Requirement
PeerLoanKart is an NBFC (Non-Banking Financial Company) that facilitates peer-to-peer loans.
It connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people with a high probability of paying you back.

Goal:
Create a model that predicts whether a borrower will repay the loan.
We compare the accuracy, confusion matrix, and classification report of 3 different models.

Note:

- credit.policy: 1 if the customer meets the credit underwriting criteria of PeerLoanKart, and 0 otherwise
- purpose: Purpose of the loan ("credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", "all_other")
- int.rate: Interest rate of the loan, as a proportion (e.g., 11% -> 0.11). Borrowers judged to be more risky are assigned higher interest rates
- installment: The monthly installments owed by the borrower if the loan is funded
- log.annual.inc: The natural log of the self-reported annual income of the borrower
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income)
- fico: The FICO credit score of the borrower
- days.with.cr.line: The number of days the borrower has had a credit line
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle)
- revol.util: The borrower's revolving line utilization rate (amount of the credit line used relative to total credit available)
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments)
- not.fully.paid: This is the output field. Note that 1 means the borrower did not fully pay back the loan


Data Cleaning and Exploration

In [1]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv(r'loan_borowwer_data.csv')
In [3]:
df.head()
Out[3]:
credit.policy purpose int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid
0 1 debt_consolidation 0.1189 829.10 11.350407 19.48 737 5639.958333 28854 52.1 0 0 0 0
1 1 credit_card 0.1071 228.22 11.082143 14.29 707 2760.000000 33623 76.7 0 0 0 0
2 1 debt_consolidation 0.1357 366.86 10.373491 11.63 682 4710.000000 3511 25.6 1 0 0 0
3 1 debt_consolidation 0.1008 162.34 11.350407 8.10 712 2699.958333 33667 73.2 1 0 0 0
4 1 credit_card 0.1426 102.92 11.299732 14.97 667 4066.000000 4740 39.5 0 1 0 0
In [4]:
df.describe()
Out[4]:
credit.policy int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid
count 9578.000000 9578.000000 9578.000000 9578.000000 9578.000000 9578.000000 9578.000000 9.578000e+03 9578.000000 9578.000000 9578.000000 9578.000000 9578.000000
mean 0.804970 0.122640 319.089413 10.932117 12.606679 710.846314 4560.767197 1.691396e+04 46.799236 1.577469 0.163708 0.062122 0.160054
std 0.396245 0.026847 207.071301 0.614813 6.883970 37.970537 2496.930377 3.375619e+04 29.014417 2.200245 0.546215 0.262126 0.366676
min 0.000000 0.060000 15.670000 7.547502 0.000000 612.000000 178.958333 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 0.103900 163.770000 10.558414 7.212500 682.000000 2820.000000 3.187000e+03 22.600000 0.000000 0.000000 0.000000 0.000000
50% 1.000000 0.122100 268.950000 10.928884 12.665000 707.000000 4139.958333 8.596000e+03 46.300000 1.000000 0.000000 0.000000 0.000000
75% 1.000000 0.140700 432.762500 11.291293 17.950000 737.000000 5730.000000 1.824950e+04 70.900000 2.000000 0.000000 0.000000 0.000000
max 1.000000 0.216400 940.140000 14.528354 29.960000 827.000000 17639.958330 1.207359e+06 119.000000 33.000000 13.000000 5.000000 1.000000

Handling missing data

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB

No missing data
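The target is heavily imbalanced: per df.describe() above, the mean of not.fully.paid is about 0.16, so only ~16% of borrowers fall in class 1. A quick check worth running before modeling (a minimal sketch on the df already loaded):

# Class balance of the target: roughly 84% fully paid (0) vs 16% not (1)
print(df['not.fully.paid'].value_counts(normalize=True))

This imbalance matters when reading the per-class numbers in the classification reports later on.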

In [6]:
numerical_features = ['credit.policy','int.rate','installment','log.annual.inc','dti','fico','days.with.cr.line',
                      'revol.bal','revol.util','inq.last.6mths','delinq.2yrs','pub.rec']
categorical_features = ['purpose']
In [7]:
X = df.drop('not.fully.paid',axis=1)
y = df['not.fully.paid']

Converting categorical features

In [8]:
X_dummied = pd.get_dummies(X, columns=categorical_features, drop_first=True)
In [9]:
X_dummied.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 18 columns):
credit.policy                 9578 non-null int64
int.rate                      9578 non-null float64
installment                   9578 non-null float64
log.annual.inc                9578 non-null float64
dti                           9578 non-null float64
fico                          9578 non-null int64
days.with.cr.line             9578 non-null float64
revol.bal                     9578 non-null int64
revol.util                    9578 non-null float64
inq.last.6mths                9578 non-null int64
delinq.2yrs                   9578 non-null int64
pub.rec                       9578 non-null int64
purpose_credit_card           9578 non-null uint8
purpose_debt_consolidation    9578 non-null uint8
purpose_educational           9578 non-null uint8
purpose_home_improvement      9578 non-null uint8
purpose_major_purchase        9578 non-null uint8
purpose_small_business        9578 non-null uint8
dtypes: float64(6), int64(6), uint8(6)
memory usage: 954.2 KB

Train-Test-Split

In [10]:
from sklearn.model_selection import train_test_split
In [11]:
# testing data size at 30%
x_train, x_test, y_train, y_test = train_test_split(X_dummied, y, test_size = .30, random_state=101)

Handling highly correlated features

In [12]:
fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(x_train.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Out[12]:
[Figure: annotated correlation heatmap of the x_train features]

No highly correlated features
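The heatmap above is read by eye; the same conclusion can be checked programmatically. A minimal sketch (the 0.8 cutoff is an assumed threshold, not taken from the notebook):

# List feature pairs whose absolute correlation exceeds an assumed 0.8 cutoff
corr = x_train.corr().abs()
high = [(a, b, round(corr.loc[a, b], 2))
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if corr.loc[a, b] > 0.8]
print(high)  # expected: [] here, matching the heatmap reading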

Handling zero-variance features

In [13]:
from sklearn.feature_selection import VarianceThreshold
# drop features with zero variance (a single constant value)
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train)
kept_columns = x_train.columns[constant_filter.get_support()]
print(kept_columns)
print(len(kept_columns), len(x_train.columns))
Index(['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti',
       'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_educational',
       'purpose_home_improvement', 'purpose_major_purchase',
       'purpose_small_business'],
      dtype='object')
18 18

No zero-variance features

Handling Feature Importance

In [14]:
from sklearn.ensemble import RandomForestClassifier
In [15]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
features = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)
fig, ax = plt.subplots(figsize=(16,14))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

No features with 0 importance; we will use the original features
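The bar chart can also be summarized as text; a minimal sketch reusing the features and importances computed above:

# Print the importances in descending order to complement the plot
for name, score in sorted(zip(features, importances), key=lambda t: t[1], reverse=True):
    print(f'{name:30s} {score:.4f}')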

Prediction

Decision Tree Classifier

In [16]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
In [17]:
from sklearn.tree import DecisionTreeClassifier
In [18]:
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
# align test columns with the training column order
x_test = x_test[x_train.columns]
pred_y = dtc.predict(x_test)
accuracy_score(y_test, pred_y)
Out[18]:
0.7289491997216423

Report

In [19]:
print(classification_report(y_test, pred_y))
              precision    recall  f1-score   support

           0       0.86      0.82      0.84      2431
           1       0.19      0.24      0.22       443

    accuracy                           0.73      2874
   macro avg       0.52      0.53      0.53      2874
weighted avg       0.75      0.73      0.74      2874

Confusion Matrix

In [20]:
print(confusion_matrix(y_test, pred_y))
[[1988  443]
 [ 336  107]]
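The report's class-1 row can be recovered directly from this matrix (rows are true labels, columns are predictions, i.e. [[TN, FP], [FN, TP]]): precision = 107 / (107 + 443) ≈ 0.19 and recall = 107 / (107 + 336) ≈ 0.24, matching the classification report above. As a sketch:

# Unpack the 2x2 matrix into TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, pred_y).ravel()
print(tp / (tp + fp))  # class-1 precision: 107/550 ~ 0.19
print(tp / (tp + fn))  # class-1 recall:    107/443 ~ 0.24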

Random Forest Classifier

In [21]:
# Random Forest
rfr = RandomForestClassifier()
rfr.fit(x_train, y_train)
x_test = x_test[x_train.columns]
pred_y = rfr.predict(x_test)
accuracy_score(y_test, pred_y)
Out[21]:
0.8444676409185804

Report

In [22]:
print(classification_report(y_test, pred_y))
              precision    recall  f1-score   support

           0       0.85      0.99      0.92      2431
           1       0.41      0.02      0.04       443

    accuracy                           0.84      2874
   macro avg       0.63      0.51      0.48      2874
weighted avg       0.78      0.84      0.78      2874

Confusion Matrix

In [23]:
print(confusion_matrix(y_test, pred_y))
[[2418   13]
 [ 434    9]]

Support Vector Machine

In [24]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
In [25]:
# SVC
svc = SVC()
svc.fit(x_train, y_train)
x_test = x_test[x_train.columns]
pred_y = svc.predict(x_test)
accuracy_score(y_test, pred_y)
Out[25]:
0.8462073764787752

Report

In [26]:
print(classification_report(y_test, pred_y))
              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       1.00      0.00      0.00       443

    accuracy                           0.85      2874
   macro avg       0.92      0.50      0.46      2874
weighted avg       0.87      0.85      0.78      2874

Confusion Matrix

In [27]:
print(confusion_matrix(y_test, pred_y))
[[2431    0]
 [ 442    1]]

Standardized SVC

The unscaled SVC above predicts class 0 for nearly every borrower (a single positive prediction in the matrix), so we retry after standardizing the features, since SVMs are sensitive to feature scale.

In [28]:
scaler = StandardScaler().fit(x_train)
x_train_transformed = scaler.transform(x_train)
x_test_transformed = scaler.transform(x_test[x_train.columns])
In [29]:
svc = SVC()
svc.fit(x_train_transformed, y_train)
pred_y = svc.predict(x_test_transformed)
accuracy_score(y_test, pred_y)
Out[29]:
0.8462073764787752

Report

In [30]:
print(classification_report(y_test, pred_y))
              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.56      0.01      0.02       443

    accuracy                           0.85      2874
   macro avg       0.70      0.50      0.47      2874
weighted avg       0.80      0.85      0.78      2874

Confusion Matrix

In [31]:
print(confusion_matrix(y_test, pred_y))
[[2427    4]
 [ 438    5]]
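To read the comparison the goal asks for at a glance, the three fitted models can be scored in one loop. A minimal sketch, assuming dtc, rfr, and the scaled svc from the cells above are still in scope; recall_score is an addition here, since with only ~16% positives, class-1 recall is more telling than raw accuracy:

from sklearn.metrics import recall_score

models = {'Decision Tree': (dtc, x_test),
          'Random Forest': (rfr, x_test),
          'SVC (scaled)': (svc, x_test_transformed)}
for name, (model, X_eval) in models.items():
    pred = model.predict(X_eval)
    print(f'{name}: accuracy={accuracy_score(y_test, pred):.3f}, '
          f'class-1 recall={recall_score(y_test, pred):.3f}')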

Contact Me

www.linkedin.com/in/billygustave

billygustave.com