#MachineLearning #SupervisedLearning #Classification
By Billy Gustave
Business Challenge/Requirement
PeerLoanKart is an NBFC (Non-Banking Financial Company) which facilitates peer-to-peer loans.
It connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile shows a high probability of paying you back.
Goal:
Create a model that will help predict whether a borrower will repay the loan in full or not.
Compare the accuracy, confusion matrix and classification report of 3 different models.
Note:
- credit.policy: 1 if the customer meets the credit underwriting criteria of PeerLoanKart, and 0 otherwise
- purpose: Purpose of the loan ("credit_card","debt_consolidation","educational","major_purchase","small_business","all_other")
- int.rate: Interest rate of the loan, as a proportion (e.g., 11% -> 0.11). Borrowers judged to be more risky are assigned higher interest rates
- installment: The monthly installments owed by the borrower if the loan is funded
- log.annual.inc: The natural log of the self-reported annual income of the borrower
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income)
- fico: The FICO credit score of the borrower
- days.with.cr.line: The number of days the borrower has had a credit line
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle)
- revol.util: The borrower's revolving line utilization rate (amount of the credit line used relative to total credit available)
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments)
- not.fully.paid: This is the output field. Please note that 1 means borrower is not going to pay the loan completely
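Two of the fields above are stored in a transformed form: `log.annual.inc` is a natural log and `int.rate` is a proportion. A minimal sketch of converting them back to readable units, using made-up values for a single hypothetical borrower (not rows from the dataset):

```python
import math

# Hypothetical values for one borrower, chosen only for illustration.
log_annual_inc = 11.0   # natural log of the self-reported annual income
int_rate = 0.11         # interest rate stored as a proportion

annual_income = math.exp(log_annual_inc)   # invert the natural log
int_rate_percent = int_rate * 100          # proportion -> percent

print(f"annual income ~ {annual_income:,.0f}")
print(f"interest rate = {int_rate_percent:.0f}%")
```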
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r'loan_borowwer_data.csv')
df.head()
df.describe()
Handling missing data
df.info()
No missing data
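`df.info()` reports non-null counts per column, which has to be compared against the row count by eye. A small sketch of a more direct check, shown on a toy frame with one deliberately missing value (the toy data is an assumption, not part of the loan dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in "fico" (illustrative data only).
toy = pd.DataFrame({"fico": [700, 650, np.nan], "dti": [10.5, 20.1, 15.0]})

missing_counts = toy.isnull().sum()       # missing values per column
has_missing = bool(missing_counts.any())  # True if any column has gaps

print(missing_counts)
print("any missing:", has_missing)
```

On the real data the same two lines would be applied to `df`; all-zero counts confirm the conclusion above.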
numerical_features = ['credit.policy','int.rate','installment','log.annual.inc','dti','fico','days.with.cr.line',
'revol.bal','revol.util','inq.last.6mths','delinq.2yrs','pub.rec']
categorical_features = ['purpose']
X = df.drop('not.fully.paid',axis=1)
y = df['not.fully.paid']
Converting categorical features
X_dummied = pd.get_dummies(X, columns=categorical_features, drop_first=True)
X_dummied.info()
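To see what `drop_first=True` does, here is a small sketch on a toy frame with three of the `purpose` values (the rows are invented for illustration). One indicator column is created per category, and the first category in sorted order is dropped to avoid a redundant column:

```python
import pandas as pd

# Toy frame: one numeric column and the categorical "purpose" column.
toy = pd.DataFrame({
    "fico": [700, 650, 720],
    "purpose": ["credit_card", "educational", "all_other"],
})

# "all_other" sorts first, so it becomes the implicit baseline and is dropped.
toy_dummied = pd.get_dummies(toy, columns=["purpose"], drop_first=True)
print(list(toy_dummied.columns))
```

A row with zeros in every remaining `purpose_*` column therefore belongs to the dropped baseline category.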
Train-Test-Split
from sklearn.model_selection import train_test_split
# testing data size at 30%
x_train, x_test, y_train, y_test = train_test_split(X_dummied, y, test_size = .30, random_state=101)
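The split above is purely random. If `not.fully.paid` turns out to be imbalanced (the notebook does not show this; `y.value_counts()` would tell), passing `stratify` keeps the class ratio the same in both parts. A sketch on synthetic labels, assuming an 80/20 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 zeros and 20 ones (assumed ratio).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the share of each class in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```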
Handling highly correlated features
fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(x_train.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
No highly correlated features
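Reading a large annotated heatmap by eye is error-prone. As a complement, highly correlated pairs can be listed programmatically. The sketch below uses synthetic columns and an assumed cut-off of 0.9; neither the data nor the threshold comes from the notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

threshold = 0.9  # assumed cut-off for "highly correlated"
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```

Applied to `x_train`, an empty result would back up the conclusion drawn from the heatmap.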
Handling Constant features
from sklearn.feature_selection import VarianceThreshold
# zero variance (a single value repeated in every row)
x_train_num = x_train
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train_num)
print(x_train_num.columns[constant_filter.get_support()])
x_num = x_train_num[x_train_num.columns[constant_filter.get_support()]]
# compare the number of columns kept by the filter with the original number
print(len(x_num.columns), len(x_train.columns))
No constant features
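For reference, this is how `VarianceThreshold(threshold=0)` behaves when a zero-variance column is actually present. The toy frame and the column name `always_one` are hypothetical:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "fico": [700, 650, 720, 680],
    "always_one": [1, 1, 1, 1],  # zero variance: same value in every row
})

selector = VarianceThreshold(threshold=0).fit(toy)
mask = selector.get_support()        # True for columns that are kept

kept = list(toy.columns[mask])
dropped = list(toy.columns[~mask])
print("kept:", kept, "dropped:", dropped)
```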
Handling Feature Importance
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
features = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)
fig, ax = plt.subplots(figsize=(16,14))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
No features with 0 importance, so we will use the original features
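The bar chart can be backed by a numeric check. The sketch below ranks importances in a pandas Series and lists any that are exactly zero. It runs on synthetic data from `make_classification`, and the fixed `random_state` is an assumption added for reproducibility, since an unseeded forest gives slightly different importances on every run:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for x_train / y_train (illustrative only).
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Importances as a labelled, sorted Series.
ranked = pd.Series(forest.feature_importances_, index=X_toy.columns)
ranked = ranked.sort_values(ascending=False)
zero_importance = list(ranked[ranked == 0].index)

print(ranked)
print("zero-importance features:", zero_importance)
```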
Decision Tree Classifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = dtc.predict(x_test)
accuracy_score(y_test, pred_y)
Report
print(classification_report(y_test, pred_y))
Confusion Matrix
print(confusion_matrix(y_test, pred_y))
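For a binary target, scikit-learn lays the confusion matrix out as `[[TN, FP], [FN, TP]]`, with rows as true labels and columns as predictions. A short worked example on hand-made labels (not model output) shows how accuracy can look acceptable while most class-1 cases, the borrowers who do not fully pay, are missed:

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 1 = loan not fully paid.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_unpaid = tp / (tp + fn)      # share of true defaulters that were caught
precision_unpaid = tp / (tp + fp)   # share of flagged borrowers who defaulted

print(tn, fp, fn, tp)
print(accuracy, recall_unpaid, precision_unpaid)
```

Here accuracy is 0.6 but only 1 of the 4 class-1 cases is caught, which is why the report and the confusion matrix are worth reading alongside the accuracy score.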
Random Forest Classifier
# RandomForest
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = rfr.predict(x_test)
accuracy_score(y_test, pred_y)
Report
print(classification_report(y_test, pred_y))
Confusion Matrix
print(confusion_matrix(y_test, pred_y))
Support Vector Machine
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# SVC
svc = SVC()
svc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = svc.predict(x_test)
accuracy_score(y_test, pred_y)
Report
print(classification_report(y_test, pred_y))
Confusion Matrix
print(confusion_matrix(y_test, pred_y))
Standardized SVC
scaler = StandardScaler().fit(x_train)
x_train_transformed = scaler.transform(x_train)
x_test_transformed = scaler.transform(x_test[x_train.columns])
svc = SVC()
svc.fit(x_train_transformed,y_train)
pred_y = svc.predict(x_test_transformed)
accuracy_score(y_test, pred_y)
Report
print(classification_report(y_test, pred_y))
Confusion Matrix
print(confusion_matrix(y_test, pred_y))
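Since the goal is to compare the three models, it can help to collect their scores in one table. The following is a sketch under stated assumptions, not the notebook's own results: it uses synthetic data from `make_classification`, fixed seeds, and wraps the SVC in a `make_pipeline` with `StandardScaler` so that scaling is fitted on the training part only. The dictionary keys and the `recall_class_1` column name are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (illustrative only).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "scaled_svc": make_pipeline(StandardScaler(), SVC()),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "recall_class_1": recall_score(y_te, y_hat),  # recall for label 1
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

Replacing the synthetic arrays with `x_train`, `x_test`, `y_train` and `y_test` would give the same side-by-side view for the real data.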