#MachineLearning #SupervisedLearning #Classification

By Billy Gustave

PeerLoanKart NBFC

Business Challenge/Requirement
PeerLoanKart is an NBFC (Non-Banking Financial Company) that facilitates peer-to-peer loans.
It connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile shows a high probability of paying you back.

Goal:
Create a model that will help predict whether a borrower will pay the loan or not.
Compare the accuracy, confusion matrix and classification report of 3 different models.

Note:

- credit.policy: 1 if the customer meets the credit underwriting criteria of PeerLoanKart, and 0 otherwise
- purpose: Purpose of the loan ("credit_card","debt_consolidation","educational","home_improvement","major_purchase","small_business","all_other")
- int.rate: Interest rate of the loan, as a proportion (e.g. 11% -> 0.11). Borrowers judged to be more risky are assigned higher interest rates
- installment: The monthly installments owed by the borrower if the loan is funded
- log.annual.inc: The natural log of the self-reported annual income of the borrower
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income)
- fico: The FICO credit score of the borrower
- days.with.cr.line: The number of days the borrower has had a credit line
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle)
- revol.util: The borrower's revolving line utilization rate (amount of the credit line used relative to total credit available)
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments)
- not.fully.paid: This is the output field. Note that 1 means the borrower does not pay the loan back in full
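
To make the units in the list above concrete, here is a minimal sketch that converts two of the encoded fields back to more familiar quantities. The helper names are hypothetical (they are not part of the notebook); the sample values 0.1189 and 11.350407 come from the first row of the dataset.

```python
import math

def rate_to_percent(int_rate):
    """int.rate is stored as a proportion; convert it to a percentage."""
    return int_rate * 100

def log_income_to_income(log_annual_inc):
    """log.annual.inc is a natural log; exponentiate to recover the income."""
    return math.exp(log_annual_inc)

# Sample values from the first row of the dataset
print(round(rate_to_percent(0.1189), 2))
print(round(log_income_to_income(11.350407)))
```

So a log.annual.inc of about 11.35 corresponds to a self-reported income of roughly 85,000 per year.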


Data Cleaning and Exploration

import pandas as pd, numpy as np, matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r'loan_borowwer_data.csv')

df.head()

df.describe()

Handling missing data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB

No missing data: all 14 columns have 9578 non-null values

numerical_features = ['credit.policy','int.rate','installment','log.annual.inc','dti','fico','days.with.cr.line',
                      'revol.bal','revol.util','inq.last.6mths','delinq.2yrs','pub.rec']
categorical_features = ['purpose']

X = df.drop('not.fully.paid',axis=1)
y = df['not.fully.paid']

Converting categorical features

X_dummied = pd.get_dummies(X, columns=categorical_features, drop_first=True)

X_dummied.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 18 columns):
credit.policy                 9578 non-null int64
int.rate                      9578 non-null float64
installment                   9578 non-null float64
log.annual.inc                9578 non-null float64
dti                           9578 non-null float64
fico                          9578 non-null int64
days.with.cr.line             9578 non-null float64
revol.bal                     9578 non-null int64
revol.util                    9578 non-null float64
inq.last.6mths                9578 non-null int64
delinq.2yrs                   9578 non-null int64
pub.rec                       9578 non-null int64
purpose_credit_card           9578 non-null uint8
purpose_debt_consolidation    9578 non-null uint8
purpose_educational           9578 non-null uint8
purpose_home_improvement      9578 non-null uint8
purpose_major_purchase        9578 non-null uint8
purpose_small_business        9578 non-null uint8
dtypes: float64(6), int64(6), uint8(6)
memory usage: 954.2 KB
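
The output above has six purpose_* columns for seven loan purposes. With drop_first=True, pandas drops the first category in sorted order (here all_other), which then acts as the baseline. The following sketch shows this on a small toy frame; the values are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with three of the loan purposes
toy = pd.DataFrame({"purpose": ["credit_card", "all_other", "educational", "all_other"]})

dummies = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

# 'all_other' sorts first, so it is dropped and acts as the baseline:
# a row with every purpose_* column equal to 0 means purpose == 'all_other'
print(list(dummies.columns))
print(dummies)
```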

Train-Test-Split

from sklearn.model_selection import train_test_split

# testing data size at 30%
x_train, x_test, y_train, y_test = train_test_split(X_dummied, y, test_size = .30, random_state=101)
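
Only about 16% of the loans are not fully paid (the mean of not.fully.paid is 0.160054), so the classes are imbalanced. One optional variation, not used in this notebook, is to pass stratify so that both parts keep a similar class ratio. The sketch below uses synthetic labels as a stand-in for the real data.

```python
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 16 positives out of 100, similar to the dataset's class ratio
X_toy = [[i] for i in range(100)]
y_toy = [1] * 16 + [0] * 84

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)

# Share of positives in each part stays close to 0.16
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```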

Handling highly correlated features

fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(x_train.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)

<matplotlib.axes._subplots.AxesSubplot at 0x1ccf73fe2c8>

No highly correlated features
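
The heatmap is read by eye. As a complementary check, the same correlation matrix can be scanned programmatically. This is a minimal sketch on toy data; the function name correlated_pairs and the 0.8 threshold are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, corr) for pairs whose |correlation| exceeds threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no duplicates
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data: 'b' is almost a multiple of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
print(correlated_pairs(toy))
```

On the real data the call would be correlated_pairs(x_train); an empty list would agree with the visual reading of the heatmap.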

Handling constant (zero-variance) features

from sklearn.feature_selection import VarianceThreshold
# drop zero-variance features (columns that hold a single constant value)
x_train_num = x_train
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train_num)
# columns kept by the filter
print(x_train_num.columns[constant_filter.get_support()])
x_num = x_train_num[x_train_num.columns[constant_filter.get_support()]]
# number of features after vs. before filtering
print(len(x_num.columns), len(x_train.columns))

Index(['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti',
       'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_educational',
       'purpose_home_improvement', 'purpose_major_purchase',
       'purpose_small_business'],
      dtype='object')
18 18

No constant (zero-variance) features, so nothing is dropped
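
Since the filter removed nothing here, its effect is easier to see on a small example. The sketch below applies the same VarianceThreshold(threshold=0) to a toy matrix (made-up values) whose second column is constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (hypothetical): the second column is constant
X_toy = np.array([[1.0, 7.0, 0.0],
                  [2.0, 7.0, 1.0],
                  [3.0, 7.0, 0.0]])

selector = VarianceThreshold(threshold=0)
X_reduced = selector.fit_transform(X_toy)

print(selector.get_support())  # boolean mask of the columns that are kept
print(X_reduced.shape)         # the constant column is gone
```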

Handling Feature Importance

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
features = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)
fig, ax = plt.subplots(figsize=(16,14))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

No features with 0 importance, so we will use the original features

Prediction

Decision Tree Classifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = dtc.predict(x_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(y_test, pred_y)

0.7289491997216423

Report

print(classification_report(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.86      0.82      0.84      2431
           1       0.19      0.24      0.22       443

    accuracy                           0.73      2874
   macro avg       0.52      0.53      0.53      2874
weighted avg       0.75      0.73      0.74      2874

Confusion Matrix

print(confusion_matrix(y_test, pred_y))

[[1988  443]
 [ 336  107]]
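
The figures in the report can be recomputed from the confusion matrix, which helps when reading the two side by side. scikit-learn puts the true classes in rows and the predicted classes in columns. A minimal sketch using the decision tree numbers above:

```python
# Confusion matrix of the decision tree: rows = true class, columns = predicted class
cm = [[1988, 443],
      [336, 107]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)  # of the loans flagged as not fully paid, how many really were
recall_1 = tp / (tp + fn)     # of the loans really not fully paid, how many were flagged

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
```

These match the accuracy of about 0.73 and the class 1 precision and recall of 0.19 and 0.24 in the report.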

Random Forest Classifier

# RandomForest
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = rfr.predict(x_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(y_test, pred_y)

0.8444676409185804

Report

print(classification_report(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.85      0.99      0.92      2431
           1       0.41      0.02      0.04       443

    accuracy                           0.84      2874
   macro avg       0.63      0.51      0.48      2874
weighted avg       0.78      0.84      0.78      2874

Confusion Matrix

print(confusion_matrix(y_test, pred_y))

[[2418   13]
 [ 434    9]]
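
One way to put this accuracy in context is to compare it with a trivial baseline that predicts 0 (fully paid) for every borrower. The sketch below uses only the class counts from the support column of the report above.

```python
# Test-set class counts, taken from the support column of the report
n_paid, n_not_paid = 2431, 443

# Accuracy of a trivial model that predicts 0 (fully paid) for every borrower
baseline_accuracy = n_paid / (n_paid + n_not_paid)

rf_accuracy = 0.8444676409185804  # random forest accuracy reported above
print(round(baseline_accuracy, 4), rf_accuracy > baseline_accuracy)
```

The baseline reaches about 0.846, so accuracy alone says little on this dataset; the recall for class 1 (0.02 here) is the more informative number for an investor.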

Support Vector Machine

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# SVC
svc = SVC()
svc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = svc.predict(x_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(y_test, pred_y)

0.8462073764787752

Report

print(classification_report(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       1.00      0.00      0.00       443

    accuracy                           0.85      2874
   macro avg       0.92      0.50      0.46      2874
weighted avg       0.87      0.85      0.78      2874

Confusion Matrix

print(confusion_matrix(y_test, pred_y))

[[2431    0]
 [ 442    1]]

Standardized SVC

scaler = StandardScaler().fit(x_train)
x_train_transformed = scaler.transform(x_train)
x_test_transformed = scaler.transform(x_test[x_train.columns])

svc = SVC()
svc.fit(x_train_transformed,y_train)
pred_y = svc.predict(x_test_transformed)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(y_test, pred_y)

0.8462073764787752

Report

print(classification_report(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.56      0.01      0.02       443

    accuracy                           0.85      2874
   macro avg       0.70      0.50      0.47      2874
weighted avg       0.80      0.85      0.78      2874

Confusion Matrix

print(confusion_matrix(y_test, pred_y))

[[2427    4]
 [ 438    5]]
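
The scale-then-fit steps of the standardized SVC can also be bundled into a single scikit-learn pipeline, which keeps the scaler fitted on the training part only. This is a sketch on synthetic data from make_classification (an assumption, not the loan dataset), so the score it prints says nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan data (assumption, not the real dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=101)

# The scaler is fitted on the training part only, then reused for the test part
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))  # mean accuracy on the held-out part
```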

Output of df.head():

   credit.policy             purpose  int.rate  installment  log.annual.inc    dti  fico  days.with.cr.line  revol.bal  revol.util  inq.last.6mths  delinq.2yrs
0              1  debt_consolidation    0.1189       829.10       11.350407  19.48   737        5639.958333      28854        52.1               0            0
1              1         credit_card    0.1071       228.22       11.082143  14.29   707        2760.000000      33623        76.7               0            0
2              1  debt_consolidation    0.1357       366.86       10.373491  11.63   682        4710.000000       3511        25.6               1            0
3              1  debt_consolidation    0.1008       162.34       11.350407   8.10   712        2699.958333      33667        73.2               1            0
4              1         credit_card    0.1426       102.92       11.299732  14.97   667        4066.000000       4740        39.5               0            1

Output of df.describe():

       credit.policy     int.rate  installment  log.annual.inc          dti         fico  days.with.cr.line     revol.bal   revol.util  inq.last.6mths  delinq.2yrs      pub.rec  not.fully.paid
count    9578.000000  9578.000000  9578.000000     9578.000000  9578.000000  9578.000000        9578.000000  9.578000e+03  9578.000000     9578.000000  9578.000000  9578.000000     9578.000000
mean        0.804970     0.122640   319.089413       10.932117    12.606679   710.846314        4560.767197  1.691396e+04    46.799236        1.577469     0.163708     0.062122        0.160054
std         0.396245     0.026847   207.071301        0.614813     6.883970    37.970537        2496.930377  3.375619e+04    29.014417        2.200245     0.546215     0.262126        0.366676
min         0.000000     0.060000    15.670000        7.547502     0.000000   612.000000         178.958333  0.000000e+00     0.000000        0.000000     0.000000     0.000000        0.000000
25%         1.000000     0.103900   163.770000       10.558414     7.212500   682.000000        2820.000000  3.187000e+03    22.600000        0.000000     0.000000     0.000000        0.000000
50%         1.000000     0.122100   268.950000       10.928884    12.665000   707.000000        4139.958333  8.596000e+03    46.300000        1.000000     0.000000     0.000000        0.000000
75%         1.000000     0.140700   432.762500       11.291293    17.950000   737.000000        5730.000000  1.824950e+04    70.900000        2.000000     0.000000     0.000000        0.000000
max         1.000000     0.216400   940.140000       14.528354    29.960000   827.000000       17639.958330  1.207359e+06   119.000000       33.000000    13.000000     5.000000        1.000000


Contact Me

www.linkedin.com/in/billygustave

billygustave.com
