#MachineLearning #SupervisedLearning #Classification

By Billy Gustave

Horse Survival

Goal :
Attempting to predict the survival of a horse based on various observed medical conditions.
Data: horse.csv

Also comparing two classifiers:

DecisionTreeClassifier
RandomForestClassifier

#from sklearn.feature_selection import VarianceThreshold
#
#from sklearn.linear_model import LogisticRegression

#libraries
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# from feature_selector import FeatureSelector  # third-party helper, not used anywhere below
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Data Cleaning and Exploration

# changing dataframe setting to display all columns
pd.options.display.max_columns = 30

df = pd.read_csv('horse.csv')
df.head()

df.shape

(299, 28)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 28 columns):
surgery                  299 non-null object
age                      299 non-null object
hospital_number          299 non-null int64
rectal_temp              239 non-null float64
pulse                    275 non-null float64
respiratory_rate         241 non-null float64
temp_of_extremities      243 non-null object
peripheral_pulse         230 non-null object
mucous_membrane          252 non-null object
capillary_refill_time    267 non-null object
pain                     244 non-null object
peristalsis              255 non-null object
abdominal_distention     243 non-null object
nasogastric_tube         195 non-null object
nasogastric_reflux       193 non-null object
nasogastric_reflux_ph    53 non-null float64
rectal_exam_feces        197 non-null object
abdomen                  181 non-null object
packed_cell_volume       270 non-null float64
total_protein            266 non-null float64
abdomo_appearance        134 non-null object
abdomo_protein           101 non-null float64
outcome                  299 non-null object
surgical_lesion          299 non-null object
lesion_1                 299 non-null int64
lesion_2                 299 non-null int64
lesion_3                 299 non-null int64
cp_data                  299 non-null object
dtypes: float64(7), int64(4), object(17)
memory usage: 65.5+ KB

# checking the percentage of missing values in each variable
df.isnull().sum()/len(df)*100

surgery                   0.000000
age                       0.000000
hospital_number           0.000000
rectal_temp              20.066890
pulse                     8.026756
respiratory_rate         19.397993
temp_of_extremities      18.729097
peripheral_pulse         23.076923
mucous_membrane          15.719064
capillary_refill_time    10.702341
pain                     18.394649
peristalsis              14.715719
abdominal_distention     18.729097
nasogastric_tube         34.782609
nasogastric_reflux       35.451505
nasogastric_reflux_ph    82.274247
rectal_exam_feces        34.113712
abdomen                  39.464883
packed_cell_volume        9.698997
total_protein            11.036789
abdomo_appearance        55.183946
abdomo_protein           66.220736
outcome                   0.000000
surgical_lesion           0.000000
lesion_1                  0.000000
lesion_2                  0.000000
lesion_3                  0.000000
cp_data                   0.000000
dtype: float64

Missing values

Imputation techniques we will use

< 50% -> impute with mean or median (mode or 'most_frequent' for categorical)
> 50% -> use a 'Missing' flag indicator
> 95% -> remove feature

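No feature reaches the 95% threshold and only three exceed 50% (nasogastric_reflux_ph, abdomo_appearance, abdomo_protein), so imputation can be applied.
3 types of missing values:

- MCAR: Missing Completely At Random - missingness is unrelated to any value, observed or not; simple imputation does not introduce bias
- MAR:  Missing At Random - missingness depends only on observed values; imputation can be performed
- MNAR: Missing Not At Random (also written NMAR) - missingness depends on the unobserved values themselves (structured missing values); simple imputation can bias the result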
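
The thresholds listed above can be sketched as a small report that shows, per column, the missing percentage and the planned handling. This is only an illustration: the function name `missing_strategy`, its parameters, and the toy frame are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_strategy(data, flag_above=50, drop_from=95):
    """Return the missing percentage and the planned handling for each column."""
    pct = data.isnull().mean() * 100

    def plan(p):
        if p == 0:
            return "complete"
        if p >= drop_from:
            return "drop"
        if p > flag_above:
            return "flag"
        return "impute"

    return pd.DataFrame({"missing_pct": pct.round(2), "strategy": pct.apply(plan)})


# Toy frame standing in for horse.csv (values are made up for illustration)
toy = pd.DataFrame({
    "pulse": [66.0, 88.0, np.nan, 164.0],              # 25% missing
    "abdomo_protein": [np.nan, 2.0, np.nan, np.nan],   # 75% missing
    "surgery": ["no", "yes", "no", "yes"],             # complete
})
report = missing_strategy(toy)
print(report)
```

Running the same function on `df` would give a quick check of which rule each feature falls under before any imputation is applied.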


# features and target
X = df.drop(['hospital_number','outcome'],axis=1)
y = df.outcome

# removing features with 95% missing data:

abdomo_appearance = df.abdomo_appearance

def removing_high_missing(data, threshold=95):
    for column in data.columns:
        missing_ratio = data[column].isnull().sum()/len(data[column])*100
        if( missing_ratio >= threshold):
            data = data.drop([column], axis=1)
    return data

X = removing_high_missing(X)

# filling in categorical missing values:

categorical_features = ['surgery', 'age', 'temp_of_extremities', 'peripheral_pulse','mucous_membrane', 
                        'capillary_refill_time', 'pain', 'peristalsis','abdominal_distention', 'nasogastric_tube', 
                        'nasogastric_reflux', 'rectal_exam_feces', 'abdomen','abdomo_appearance', 'surgical_lesion', 'cp_data']

def fill_categorical_missing(data, column_name_list):
    for column in column_name_list:
        missing_ratio = data[column].isnull().sum()/len(data[column])*100
        if( missing_ratio <= 50):
            # mode (most frequent value); assign back rather than relying on chained inplace fillna
            data[column] = data[column].fillna(data[column].value_counts().index[0])
        elif(missing_ratio > 50 and missing_ratio < 95):
            data[column] = data[column].fillna('Missing') # flag
    return data

X = fill_categorical_missing(X, categorical_features)

# filling in continuous missing values:

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 26 columns):
surgery                  299 non-null object
age                      299 non-null object
rectal_temp              239 non-null float64
pulse                    275 non-null float64
respiratory_rate         241 non-null float64
temp_of_extremities      299 non-null object
peripheral_pulse         299 non-null object
mucous_membrane          299 non-null object
capillary_refill_time    299 non-null object
pain                     299 non-null object
peristalsis              299 non-null object
abdominal_distention     299 non-null object
nasogastric_tube         299 non-null object
nasogastric_reflux       299 non-null object
nasogastric_reflux_ph    53 non-null float64
rectal_exam_feces        299 non-null object
abdomen                  299 non-null object
packed_cell_volume       270 non-null float64
total_protein            266 non-null float64
abdomo_appearance        299 non-null object
abdomo_protein           101 non-null float64
surgical_lesion          299 non-null object
lesion_1                 299 non-null int64
lesion_2                 299 non-null int64
lesion_3                 299 non-null int64
cp_data                  299 non-null object
dtypes: float64(7), int64(3), object(16)
memory usage: 60.9+ KB

numerical_features_missings = ['rectal_temp','pulse','respiratory_rate','nasogastric_reflux_ph','packed_cell_volume',
                      'total_protein','abdomo_protein']
numerical_features = ['rectal_temp','pulse','respiratory_rate','nasogastric_reflux_ph','packed_cell_volume',
                      'total_protein','abdomo_protein','lesion_1','lesion_2','lesion_3']

count = 0
fig, axes = plt.subplots(1, 7, figsize=(21,4))

for column in numerical_features_missings:
    X[column].hist(ax=axes[count])
    axes[count].set_title(column)
    count += 1
plt.show()

We will use mean for 'rectal_temp' and median for the rest:

for column in numerical_features_missings:
    if column == 'rectal_temp':
        mean_value=X[column].mean()
        # assign back rather than relying on chained inplace fillna
        X[column] = X[column].fillna(mean_value)
    else:
        median_value=X[column].median()
        X[column] = X[column].fillna(median_value)
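
The notebook does not state why the median is preferred for most columns. One common reason, assumed here, is that the median is less sensitive to skew and outliers than the mean. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values: mostly small, one large outlier, two missing entries
skewed = pd.Series([1.0, 2.0, 2.0, 3.0, 50.0, np.nan, np.nan])

# The outlier pulls the mean up to 11.6, while the median stays at 2.0
mean_filled = skewed.fillna(skewed.mean())
median_filled = skewed.fillna(skewed.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

With the mean, the two missing entries are filled with a value larger than four of the five observed ones; with the median they are filled with a typical value.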

X.isnull().sum()/len(X)*100

surgery                  0.0
age                      0.0
rectal_temp              0.0
pulse                    0.0
respiratory_rate         0.0
temp_of_extremities      0.0
peripheral_pulse         0.0
mucous_membrane          0.0
capillary_refill_time    0.0
pain                     0.0
peristalsis              0.0
abdominal_distention     0.0
nasogastric_tube         0.0
nasogastric_reflux       0.0
nasogastric_reflux_ph    0.0
rectal_exam_feces        0.0
abdomen                  0.0
packed_cell_volume       0.0
total_protein            0.0
abdomo_appearance        0.0
abdomo_protein           0.0
surgical_lesion          0.0
lesion_1                 0.0
lesion_2                 0.0
lesion_3                 0.0
cp_data                  0.0
dtype: float64

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 26 columns):
surgery                  299 non-null object
age                      299 non-null object
rectal_temp              299 non-null float64
pulse                    299 non-null float64
respiratory_rate         299 non-null float64
temp_of_extremities      299 non-null object
peripheral_pulse         299 non-null object
mucous_membrane          299 non-null object
capillary_refill_time    299 non-null object
pain                     299 non-null object
peristalsis              299 non-null object
abdominal_distention     299 non-null object
nasogastric_tube         299 non-null object
nasogastric_reflux       299 non-null object
nasogastric_reflux_ph    299 non-null float64
rectal_exam_feces        299 non-null object
abdomen                  299 non-null object
packed_cell_volume       299 non-null float64
total_protein            299 non-null float64
abdomo_appearance        299 non-null object
abdomo_protein           299 non-null float64
surgical_lesion          299 non-null object
lesion_1                 299 non-null int64
lesion_2                 299 non-null int64
lesion_3                 299 non-null int64
cp_data                  299 non-null object
dtypes: float64(7), int64(3), object(16)
memory usage: 60.9+ KB

Encoding Categoricals

# dummies for features:
X = pd.get_dummies(X, columns=categorical_features,drop_first=True)

y.unique()

array(['died', 'euthanized', 'lived'], dtype=object)

#  mapping for target {'died':0,'euthanized':1,'lived':2}
y = y.map({'died':0,'euthanized':1,'lived':2})
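
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy column with three categories. The category values are chosen for the example only. The first category in sorted order is dropped, so it becomes the implicit baseline (all dummy columns equal to 0).

```python
import pandas as pd

# Toy column with three categories (illustrative values)
toy = pd.DataFrame({"pain": ["mild_pain", "depressed", "alert", "mild_pain"]})

# Categories are sorted alphabetically; 'alert' comes first and is dropped
encoded = pd.get_dummies(toy, columns=["pain"], drop_first=True)
print(encoded.columns.tolist())  # ['pain_depressed', 'pain_mild_pain']
```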

Train-Test-Split

# random_state guarantees the same split every time the program is run
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = .2, random_state=1)

Removing Constant Features

from sklearn.feature_selection import VarianceThreshold
# zero variance (a single constant value in the whole column)
x_train_num = x_train[numerical_features]
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train_num)
print(x_train_num.columns[constant_filter.get_support()])
x_train_num = x_train_num[x_train_num.columns[constant_filter.get_support()]]

Index(['rectal_temp', 'pulse', 'respiratory_rate', 'nasogastric_reflux_ph',
       'packed_cell_volume', 'total_protein', 'abdomo_protein', 'lesion_1',
       'lesion_2', 'lesion_3'],
      dtype='object')

No constant features: all 10 numerical columns are kept.
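
To see the filter actually remove something, here is a toy frame with a deliberately constant column. The column names and values are made up for the example.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "pulse": [66.0, 88.0, 40.0, 164.0],
    "constant": [1.0, 1.0, 1.0, 1.0],   # hypothetical zero-variance column
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy)

# get_support() returns a boolean mask of the columns that pass the filter
kept = toy.columns[selector.get_support()].tolist()
print(kept)  # ['pulse']
```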

Removing Highly correlated features

Threshold: 75%

# Correlation matrix for all independent vars
corrMatrix = x_train_num.corr()
allVars = corrMatrix.keys()

absCorrWithDep = []
for var in allVars:
    absCorrWithDep.append(abs(y.corr(x_train_num[var])))
# threshold setting
corrTol = 0.75

# for each column in the corr matrix
for col in corrMatrix:
    
    if col in corrMatrix.keys():
        thisCol = []
        thisVars = []
        temp = corrMatrix[col]
        
        # Store the corr with the dep var for fields that are highly correlated with each other
        for i in range(len(corrMatrix)):
            
            if abs(corrMatrix[col].iloc[i]) == 1.0 and col != corrMatrix.keys()[i]:
                thisCorr = 0
            else:
                thisCorr = (1 if abs(corrMatrix[col].iloc[i]) > corrTol else -1) * abs(temp[corrMatrix.keys()[i]])
            thisCol.append(thisCorr)
            thisVars.append(corrMatrix.keys()[i])
        
        mask = np.ones(len(thisCol), dtype = bool) # Initialize the mask
        
        ctDelCol = 0 # To keep track of the number of columns deleted
        
        for n, j in enumerate(thisCol):
            # Delete if (a) a var is correlated with others and does not have the best corr with dep,
            # or (b) completely corr with the 'col'
            mask[n] = not (j != max(thisCol) and j >= 0)
            
            if j != max(thisCol) and j >= 0:
                # Delete the column from the corr matrix
                corrMatrix.pop('%s' %thisVars[n])
                ctDelCol += 1
                
        # Delete the corresponding row(s) from the corr matrix
        corrMatrix = corrMatrix[mask]

len(corrMatrix.columns)

10

len(x_train_num.columns)

10

No highly correlated features
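
A shorter way to run a similar check is to scan the upper triangle of the absolute correlation matrix. This is a different and simpler technique than the loop above: it keeps the first column of each correlated pair instead of looking at the correlation with the target. The function name `correlated_columns` and the toy data are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd


def correlated_columns(data, threshold=0.75):
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = data.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with a
    "b": rng.normal(size=200),                            # independent noise
})

to_drop = correlated_columns(toy)
print(to_drop)
```

Applied to `x_train_num` with the same 0.75 threshold, an empty list would agree with the result above.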

fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(x_train_num.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)

<matplotlib.axes._subplots.AxesSubplot at 0x1fd8ddf5548>

x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239 entries, 161 to 37
Data columns (total 51 columns):
rectal_temp                         239 non-null float64
pulse                               239 non-null float64
respiratory_rate                    239 non-null float64
nasogastric_reflux_ph               239 non-null float64
packed_cell_volume                  239 non-null float64
total_protein                       239 non-null float64
abdomo_protein                      239 non-null float64
lesion_1                            239 non-null int64
lesion_2                            239 non-null int64
lesion_3                            239 non-null int64
surgery_yes                         239 non-null uint8
age_young                           239 non-null uint8
temp_of_extremities_cool            239 non-null uint8
temp_of_extremities_normal          239 non-null uint8
temp_of_extremities_warm            239 non-null uint8
peripheral_pulse_increased          239 non-null uint8
peripheral_pulse_normal             239 non-null uint8
peripheral_pulse_reduced            239 non-null uint8
mucous_membrane_bright_red          239 non-null uint8
mucous_membrane_dark_cyanotic       239 non-null uint8
mucous_membrane_normal_pink         239 non-null uint8
mucous_membrane_pale_cyanotic       239 non-null uint8
mucous_membrane_pale_pink           239 non-null uint8
capillary_refill_time_less_3_sec    239 non-null uint8
capillary_refill_time_more_3_sec    239 non-null uint8
pain_depressed                      239 non-null uint8
pain_extreme_pain                   239 non-null uint8
pain_mild_pain                      239 non-null uint8
pain_severe_pain                    239 non-null uint8
peristalsis_hypermotile             239 non-null uint8
peristalsis_hypomotile              239 non-null uint8
peristalsis_normal                  239 non-null uint8
abdominal_distention_none           239 non-null uint8
abdominal_distention_severe         239 non-null uint8
abdominal_distention_slight         239 non-null uint8
nasogastric_tube_significant        239 non-null uint8
nasogastric_tube_slight             239 non-null uint8
nasogastric_reflux_more_1_liter     239 non-null uint8
nasogastric_reflux_none             239 non-null uint8
rectal_exam_feces_decreased         239 non-null uint8
rectal_exam_feces_increased         239 non-null uint8
rectal_exam_feces_normal            239 non-null uint8
abdomen_distend_small               239 non-null uint8
abdomen_firm                        239 non-null uint8
abdomen_normal                      239 non-null uint8
abdomen_other                       239 non-null uint8
abdomo_appearance_clear             239 non-null uint8
abdomo_appearance_cloudy            239 non-null uint8
abdomo_appearance_serosanguious     239 non-null uint8
surgical_lesion_yes                 239 non-null uint8
cp_data_yes                         239 non-null uint8
dtypes: float64(7), int64(3), uint8(41)
memory usage: 40.1 KB

Removing 0 importance features

rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
features = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)
fig, ax = plt.subplots(figsize=(16,14))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

# removing lesion_3 (the zero-importance feature)
x_train = x_train.drop('lesion_3',axis=1)
x_test = x_test.drop('lesion_3',axis=1)
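
Instead of reading the zero-importance feature off the plot, the columns can be selected programmatically. The sketch below uses synthetic data with a constant column, which a tree can never split on; all names here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=100),
    "noise": rng.normal(size=100),
    "constant": np.zeros(100),   # can never be used for a split
})
toy_y = (toy_X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_X, toy_y)

# Pair each importance with its column name, then keep the exact zeros
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
zero_importance = importances[importances == 0].index.tolist()
print(zero_importance)
```

The resulting list can be passed to `drop(columns=...)` on both the train and the test frame. Importances vary between runs unless `random_state` is fixed.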

Modeling

DecisionTreeClassifier

No Cross validation

# DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = dtc.predict(x_test)
metrics.accuracy_score(y_test, pred_y)

0.5833333333333334

# RandomForest
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = rfr.predict(x_test)
metrics.accuracy_score(y_test, pred_y)

0.7166666666666667

With cross-validation (5-fold, on the full X and y)

from sklearn.model_selection import KFold, cross_val_score

# DecisionTreeClassifier
model = DecisionTreeClassifier()
result = cross_val_score(model, X, y, cv=5, scoring='accuracy')
result.mean()

0.6089265536723164

# RandomForest
model = RandomForestClassifier()
result = cross_val_score(model, X, y, cv=5, scoring='accuracy')
result.mean()

0.7223728813559321
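
To keep the held-out test split untouched, cross-validation can be restricted to the training rows. The sketch below shows the pattern on synthetic three-class data. The data, the variable names, and the use of `StratifiedKFold` are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the encoded horse features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3, random_state=1
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Shuffled, stratified folds drawn from the training rows only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
models = {
    "tree": DecisionTreeClassifier(random_state=1),
    "forest": RandomForestClassifier(random_state=1),
}
scores = {
    name: cross_val_score(model, x_tr, y_tr, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}
print(scores)
```

In the notebook this would correspond to passing `x_train` and `y_train` to `cross_val_score`, leaving `x_test` for a single final evaluation.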

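RandomForest gives better results than DecisionTree in both setups (0.72 vs 0.58 on the test split, 0.72 vs 0.61 with cross-validation). It is an ensemble of many DecisionTrees, each trained on a bootstrap sample, and combining their votes reduces the variance of a single tree.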

Improvements

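MCA (multiple correspondence analysis) could be applied to reduce the dimension of the categorical features.
Hyperparameter fine-tuning.
Test other classifiers.
Use better missing value handling methods such as KNNImputer or MICE, or a model that handles missing values natively such as XGBoost.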
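
As a rough idea of what the KNNImputer suggestion could look like, the sketch below fills numeric gaps from the two most similar rows. The toy values are loosely based on the first rows of the data with a few entries blanked out for the example. `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    "rectal_temp": [38.5, 39.2, np.nan, 39.1, 37.3],
    "pulse": [66.0, 88.0, 40.0, np.nan, 104.0],
    "respiratory_rate": [28.0, 20.0, 24.0, 84.0, 35.0],
})

# Each missing entry is replaced by the mean of that feature over the
# nearest rows, with distance computed on the features both rows have
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Because the method is distance-based, scaling the features first is usually advisable, and the imputer should be fitted on the training split only.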


Contact Me

www.linkedin.com/in/billygustave

billygustave.com

Output of df.head() (first five rows of horse.csv):

	surgery	age	hospital_number	rectal_temp	pulse	respiratory_rate	temp_of_extremities	peripheral_pulse	mucous_membrane	capillary_refill_time	pain	peristalsis	abdominal_distention	nasogastric_tube	nasogastric_reflux	nasogastric_reflux_ph	rectal_exam_feces	abdomen	packed_cell_volume	total_protein	abdomo_appearance	abdomo_protein	outcome	surgical_lesion	lesion_1	cp_data
0	no	adult	530101	38.5	66.0	28.0	cool	reduced	NaN	more_3_sec	extreme_pain	absent	severe	NaN	NaN	NaN	decreased	distend_large	45.0	8.4	NaN	NaN	died	no	11300	no
1	yes	adult	534817	39.2	88.0	20.0	NaN	NaN	pale_cyanotic	less_3_sec	mild_pain	absent	slight	NaN	NaN	NaN	absent	other	50.0	85.0	cloudy	2.0	euthanized	no	2208	no
2	no	adult	530334	38.3	40.0	24.0	normal	normal	pale_pink	less_3_sec	mild_pain	hypomotile	none	NaN	NaN	NaN	normal	normal	33.0	6.7	NaN	NaN	lived	no	0	yes
3	yes	young	5290409	39.1	164.0	84.0	cold	normal	dark_cyanotic	more_3_sec	depressed	absent	severe	none	less_1_liter	5.0	decreased	NaN	48.0	7.2	serosanguious	5.3	died	yes	2208	yes
4	no	adult	530255	37.3	104.0	35.0	NaN	NaN	dark_cyanotic	more_3_sec	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	74.0	7.4	NaN	NaN	died	no	4300	no