#MachineLearning #SupervisedLearning #Classification

By Billy Gustave

Horse Survival

Goal:
Predict whether a horse survives, based on various observed medical conditions.
Data: horse.csv

Also comparing two classifiers:

  • DecisionTreeClassifier
  • RandomForestClassifier
In [1]:
#from sklearn.feature_selection import VarianceThreshold
#
#from sklearn.linear_model import LogisticRegression
In [2]:
#libraries
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from feature_selector import FeatureSelector
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Data Cleaning and Exploration

In [3]:
# changing dataframe setting to display all columns
pd.options.display.max_columns = 30
In [4]:
df = pd.read_csv('horse.csv')
df.head()
Out[4]:
surgery age hospital_number rectal_temp pulse respiratory_rate temp_of_extremities peripheral_pulse mucous_membrane capillary_refill_time pain peristalsis abdominal_distention nasogastric_tube nasogastric_reflux nasogastric_reflux_ph rectal_exam_feces abdomen packed_cell_volume total_protein abdomo_appearance abdomo_protein outcome surgical_lesion lesion_1 lesion_2 lesion_3 cp_data
0 no adult 530101 38.5 66.0 28.0 cool reduced NaN more_3_sec extreme_pain absent severe NaN NaN NaN decreased distend_large 45.0 8.4 NaN NaN died no 11300 0 0 no
1 yes adult 534817 39.2 88.0 20.0 NaN NaN pale_cyanotic less_3_sec mild_pain absent slight NaN NaN NaN absent other 50.0 85.0 cloudy 2.0 euthanized no 2208 0 0 no
2 no adult 530334 38.3 40.0 24.0 normal normal pale_pink less_3_sec mild_pain hypomotile none NaN NaN NaN normal normal 33.0 6.7 NaN NaN lived no 0 0 0 yes
3 yes young 5290409 39.1 164.0 84.0 cold normal dark_cyanotic more_3_sec depressed absent severe none less_1_liter 5.0 decreased NaN 48.0 7.2 serosanguious 5.3 died yes 2208 0 0 yes
4 no adult 530255 37.3 104.0 35.0 NaN NaN dark_cyanotic more_3_sec NaN NaN NaN NaN NaN NaN NaN NaN 74.0 7.4 NaN NaN died no 4300 0 0 no
In [5]:
df.shape
Out[5]:
(299, 28)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 28 columns):
surgery                  299 non-null object
age                      299 non-null object
hospital_number          299 non-null int64
rectal_temp              239 non-null float64
pulse                    275 non-null float64
respiratory_rate         241 non-null float64
temp_of_extremities      243 non-null object
peripheral_pulse         230 non-null object
mucous_membrane          252 non-null object
capillary_refill_time    267 non-null object
pain                     244 non-null object
peristalsis              255 non-null object
abdominal_distention     243 non-null object
nasogastric_tube         195 non-null object
nasogastric_reflux       193 non-null object
nasogastric_reflux_ph    53 non-null float64
rectal_exam_feces        197 non-null object
abdomen                  181 non-null object
packed_cell_volume       270 non-null float64
total_protein            266 non-null float64
abdomo_appearance        134 non-null object
abdomo_protein           101 non-null float64
outcome                  299 non-null object
surgical_lesion          299 non-null object
lesion_1                 299 non-null int64
lesion_2                 299 non-null int64
lesion_3                 299 non-null int64
cp_data                  299 non-null object
dtypes: float64(7), int64(4), object(17)
memory usage: 65.5+ KB
In [7]:
# checking the percentage of missing values in each variable
df.isnull().sum()/len(df)*100
Out[7]:
surgery                   0.000000
age                       0.000000
hospital_number           0.000000
rectal_temp              20.066890
pulse                     8.026756
respiratory_rate         19.397993
temp_of_extremities      18.729097
peripheral_pulse         23.076923
mucous_membrane          15.719064
capillary_refill_time    10.702341
pain                     18.394649
peristalsis              14.715719
abdominal_distention     18.729097
nasogastric_tube         34.782609
nasogastric_reflux       35.451505
nasogastric_reflux_ph    82.274247
rectal_exam_feces        34.113712
abdomen                  39.464883
packed_cell_volume        9.698997
total_protein            11.036789
abdomo_appearance        55.183946
abdomo_protein           66.220736
outcome                   0.000000
surgical_lesion           0.000000
lesion_1                  0.000000
lesion_2                  0.000000
lesion_3                  0.000000
cp_data                   0.000000
dtype: float64

Missing values

Imputation techniques we will use

  • < 50% -> impute with mean or median (mode or 'most_frequent' for categorical)
  • > 50% -> use a 'Missing' flag indicator
  • > 95% -> remove feature
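The three thresholds above can be sketched as a small decision helper. The function name `choose_imputation` and its return labels are hypothetical, used only for illustration; the percentages in the usage lines are taken from the missing-value table above. Note that the later cells of this notebook apply the 'Missing' flag only to categorical columns and impute every numerical column with the mean or median.

```python
def choose_imputation(missing_pct, is_categorical):
    """Map a feature's percentage of missing values to one of the rules listed above."""
    if missing_pct > 95:
        return "drop"            # > 95% -> remove feature
    if missing_pct > 50:
        return "flag"            # > 50% -> 'Missing' flag indicator
    # otherwise impute: mode for categoricals, mean or median for numericals
    return "mode" if is_categorical else "mean_or_median"


print(choose_imputation(82.27, is_categorical=False))  # nasogastric_reflux_ph
print(choose_imputation(55.18, is_categorical=True))   # abdomo_appearance
print(choose_imputation(20.07, is_categorical=False))  # rectal_temp
```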

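Only three features (nasogastric_reflux_ph, abdomo_appearance, abdomo_protein) are missing more than 50% of their values and none exceeds 95%, so imputation can be applied.
3 types of missing values:

- MCAR: Missing Completely At Random - missingness is unrelated to any variable; simple imputation is least risky here
- MAR:  Missing At Random - missingness depends only on observed variables; imputation can be performed
- MNAR (also written NMAR): Missing Not At Random - missingness depends on the unobserved value itself; the gaps are structured and plain imputation can bias results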


In [8]:
# features and target
X = df.drop(['hospital_number','outcome'],axis=1)
y = df.outcome
In [9]:
# removing features with 95% missing data:
In [10]:
abdomo_appearance = df.abdomo_appearance
In [11]:
def removing_high_missing(data, threshold=95):
    for column in data.columns:
        missing_ratio = data[column].isnull().sum()/len(data[column])*100
        if( missing_ratio >= threshold):
            data = data.drop([column], axis=1)
    return data
In [12]:
X = removing_high_missing(X)
In [13]:
# filling in categorical missing values:
In [14]:
categorical_features = ['surgery', 'age', 'temp_of_extremities', 'peripheral_pulse','mucous_membrane', 
                        'capillary_refill_time', 'pain', 'peristalsis','abdominal_distention', 'nasogastric_tube', 
                        'nasogastric_reflux', 'rectal_exam_feces', 'abdomen','abdomo_appearance', 'surgical_lesion', 'cp_data']
In [15]:
def fill_categorical_missing(data, column_name_list):
    for column in column_name_list:
        missing_ratio = data[column].isnull().sum()/len(data[column])*100
        if missing_ratio <= 50:
            # impute with the most frequent value (mode);
            # assigning back avoids chained in-place fillna on a column
            data[column] = data[column].fillna(data[column].value_counts().index[0])
        elif missing_ratio < 95:
            # more than half missing: keep the gap as its own 'Missing' category
            data[column] = data[column].fillna('Missing')
    return data
In [16]:
X = fill_categorical_missing(X, categorical_features)
In [17]:
# filling in continuous missing values:
In [18]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 26 columns):
surgery                  299 non-null object
age                      299 non-null object
rectal_temp              239 non-null float64
pulse                    275 non-null float64
respiratory_rate         241 non-null float64
temp_of_extremities      299 non-null object
peripheral_pulse         299 non-null object
mucous_membrane          299 non-null object
capillary_refill_time    299 non-null object
pain                     299 non-null object
peristalsis              299 non-null object
abdominal_distention     299 non-null object
nasogastric_tube         299 non-null object
nasogastric_reflux       299 non-null object
nasogastric_reflux_ph    53 non-null float64
rectal_exam_feces        299 non-null object
abdomen                  299 non-null object
packed_cell_volume       270 non-null float64
total_protein            266 non-null float64
abdomo_appearance        299 non-null object
abdomo_protein           101 non-null float64
surgical_lesion          299 non-null object
lesion_1                 299 non-null int64
lesion_2                 299 non-null int64
lesion_3                 299 non-null int64
cp_data                  299 non-null object
dtypes: float64(7), int64(3), object(16)
memory usage: 60.9+ KB
In [19]:
numerical_features_missings = ['rectal_temp','pulse','respiratory_rate','nasogastric_reflux_ph','packed_cell_volume',
                      'total_protein','abdomo_protein']
numerical_features = ['rectal_temp','pulse','respiratory_rate','nasogastric_reflux_ph','packed_cell_volume',
                      'total_protein','abdomo_protein','lesion_1','lesion_2','lesion_3']
In [21]:
fig, axes = plt.subplots(1, 7, figsize=(21,4))

# one histogram per numerical feature that has missing values
for ax, column in zip(axes, numerical_features_missings):
    X[column].hist(ax=ax)
    ax.set_title(column)
plt.show()

We will use mean for 'rectal_temp' and median for the rest:

In [22]:
for column in numerical_features_missings:
    # mean for 'rectal_temp', median for every other numerical feature
    if column == 'rectal_temp':
        fill_value = X[column].mean()
    else:
        fill_value = X[column].median()
    # assigning back avoids chained in-place fillna on a column
    X[column] = X[column].fillna(fill_value)
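The choice between mean and median matters most when a column is skewed or has outliers. A small illustration with made-up values (they are not taken from horse.csv) shows how a single extreme reading pulls the mean but not the median:

```python
import pandas as pd

# Hypothetical pulse-like readings: one extreme value and two gaps
s = pd.Series([40.0, 44.0, 48.0, 52.0, 160.0, None, None])

mean_filled = s.fillna(s.mean())      # gaps become 68.8
median_filled = s.fillna(s.median())  # gaps become 48.0

print(s.mean())    # 68.8
print(s.median())  # 48.0
```

Here the mean-filled gaps land above four of the five observed values, while the median-filled gaps sit in the middle of the bulk of the data.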
In [23]:
X.isnull().sum()/len(X)*100
Out[23]:
surgery                  0.0
age                      0.0
rectal_temp              0.0
pulse                    0.0
respiratory_rate         0.0
temp_of_extremities      0.0
peripheral_pulse         0.0
mucous_membrane          0.0
capillary_refill_time    0.0
pain                     0.0
peristalsis              0.0
abdominal_distention     0.0
nasogastric_tube         0.0
nasogastric_reflux       0.0
nasogastric_reflux_ph    0.0
rectal_exam_feces        0.0
abdomen                  0.0
packed_cell_volume       0.0
total_protein            0.0
abdomo_appearance        0.0
abdomo_protein           0.0
surgical_lesion          0.0
lesion_1                 0.0
lesion_2                 0.0
lesion_3                 0.0
cp_data                  0.0
dtype: float64
In [24]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 26 columns):
surgery                  299 non-null object
age                      299 non-null object
rectal_temp              299 non-null float64
pulse                    299 non-null float64
respiratory_rate         299 non-null float64
temp_of_extremities      299 non-null object
peripheral_pulse         299 non-null object
mucous_membrane          299 non-null object
capillary_refill_time    299 non-null object
pain                     299 non-null object
peristalsis              299 non-null object
abdominal_distention     299 non-null object
nasogastric_tube         299 non-null object
nasogastric_reflux       299 non-null object
nasogastric_reflux_ph    299 non-null float64
rectal_exam_feces        299 non-null object
abdomen                  299 non-null object
packed_cell_volume       299 non-null float64
total_protein            299 non-null float64
abdomo_appearance        299 non-null object
abdomo_protein           299 non-null float64
surgical_lesion          299 non-null object
lesion_1                 299 non-null int64
lesion_2                 299 non-null int64
lesion_3                 299 non-null int64
cp_data                  299 non-null object
dtypes: float64(7), int64(3), object(16)
memory usage: 60.9+ KB

Encoding Categoricals

In [25]:
# dummies for features:
X = pd.get_dummies(X, columns=categorical_features,drop_first=True)
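To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame (the frame `toy` is made up; only the category names are borrowed from the `pain` column). The first category in sorted order is dropped and becomes the implicit baseline:

```python
import pandas as pd

toy = pd.DataFrame({"pain": ["mild_pain", "depressed", "severe_pain", "mild_pain"]})

# Categories sort as: depressed, mild_pain, severe_pain.
# drop_first=True removes the first one ('depressed').
encoded = pd.get_dummies(toy, columns=["pain"], drop_first=True)

print(list(encoded.columns))  # ['pain_mild_pain', 'pain_severe_pain']
```

A row with all dummy columns equal to 0 therefore means the baseline category, which is also why the 'Missing' flag of abdomo_appearance does not appear as its own column later on: it sorts first and is the dropped baseline.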
In [26]:
y.unique()
Out[26]:
array(['died', 'euthanized', 'lived'], dtype=object)
In [27]:
#  mapping for target {'died':0,'euthanized':1,'lived':2}
y = y.map({'died':0,'euthanized':1,'lived':2})

Train-Test-Split

In [28]:
# a fixed random_state makes the split reproducible every time the program is run
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = .2, random_state=1)
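As a quick check of what `random_state` does, the sketch below splits a made-up array twice with the same seed and compares the results. The toy arrays are assumptions for illustration only. With the 299 rows of this dataset, `test_size=.2` yields 60 test rows and 239 training rows, which matches the `x_train.info()` output further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

print(np.array_equal(a_test, b_test))  # True: same seed, same split
print(a_train.shape, a_test.shape)     # (8, 2) (2, 2)
```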

Removing Constant (zero-variance) features

In [29]:
from sklearn.feature_selection import VarianceThreshold
# zero variance (constant features: the same value in every row)
x_train_num = x_train[numerical_features]
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train_num)
print(x_train_num.columns[constant_filter.get_support()])
x_train_num = x_train_num[x_train_num.columns[constant_filter.get_support()]]
Index(['rectal_temp', 'pulse', 'respiratory_rate', 'nasogastric_reflux_ph',
       'packed_cell_volume', 'total_protein', 'abdomo_protein', 'lesion_1',
       'lesion_2', 'lesion_3'],
      dtype='object')

No constant features: all 10 numerical features are kept.
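Since the filter removes nothing on this dataset, a toy example may help show what it does when a constant column is present. The frame below is made up and the column `always_zero` is hypothetical:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "pulse": [66.0, 88.0, 40.0, 164.0],
    "always_zero": [0, 0, 0, 0],  # hypothetical constant column
})

selector = VarianceThreshold(threshold=0)  # keep features with variance > 0
selector.fit(toy)

kept = list(toy.columns[selector.get_support()])
print(kept)  # ['pulse']
```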

Removing Highly correlated features

Threshold: 75%

In [32]:
# Correlation matrix for all independent vars
corrMatrix = x_train_num.corr()
allVars = corrMatrix.keys()

absCorrWithDep = []
for var in allVars:
    absCorrWithDep.append(abs(y.corr(x_train_num[var])))
# threshold setting
corrTol = 0.75

# for each column in the corr matrix
for col in corrMatrix:
    
    if col in corrMatrix.keys():
        thisCol = []
        thisVars = []
        temp = corrMatrix[col]
        
        # Store the corr with the dep var for fields that are highly correlated with each other
        for i in range(len(corrMatrix)):
            
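            # .iloc makes the positional lookup explicit (integer keys on a labelled Series are deprecated)
            if abs(corrMatrix[col].iloc[i]) == 1.0 and col != corrMatrix.keys()[i]:
                thisCorr = 0
            else:
                thisCorr = (1 if abs(corrMatrix[col].iloc[i]) > corrTol else -1) * abs(temp[corrMatrix.keys()[i]])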
            thisCol.append(thisCorr)
            thisVars.append(corrMatrix.keys()[i])
        
        mask = np.ones(len(thisCol), dtype = bool) # Initialize the mask
        
        ctDelCol = 0 # To keep track of the number of columns deleted
        
        for n, j in enumerate(thisCol):
            # Delete if (a) a var is correlated with others and does not have the best corr with dep,
            # or (b) completely corr with the 'col'
            mask[n] = not (j != max(thisCol) and j >= 0)
            
            if j != max(thisCol) and j >= 0:
                # Delete the column from the corr matrix
                corrMatrix.pop('%s' %thisVars[n])
                ctDelCol += 1
                
        # Delete the corresponding row(s) from the corr matrix
        corrMatrix = corrMatrix[mask]
In [34]:
len(corrMatrix.columns)
Out[34]:
10
In [35]:
len(x_train_num.columns)
Out[35]:
10

No highly correlated features
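For comparison, a shorter way to screen for highly correlated pairs is to look only at the upper triangle of the absolute correlation matrix. This is a different, simpler technique than the loop above: it always drops the later column of a correlated pair and does not consider correlation with the target. The helper name `drop_highly_correlated` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame, threshold=0.75):
    """Drop every column whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only entries above the diagonal so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to "a"
})

reduced, dropped = drop_highly_correlated(toy)
print(dropped)                # ['b']
print(list(reduced.columns))  # ['a', 'c']
```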

In [36]:
fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(x_train_num.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fd8ddf5548>
In [41]:
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 239 entries, 161 to 37
Data columns (total 51 columns):
rectal_temp                         239 non-null float64
pulse                               239 non-null float64
respiratory_rate                    239 non-null float64
nasogastric_reflux_ph               239 non-null float64
packed_cell_volume                  239 non-null float64
total_protein                       239 non-null float64
abdomo_protein                      239 non-null float64
lesion_1                            239 non-null int64
lesion_2                            239 non-null int64
lesion_3                            239 non-null int64
surgery_yes                         239 non-null uint8
age_young                           239 non-null uint8
temp_of_extremities_cool            239 non-null uint8
temp_of_extremities_normal          239 non-null uint8
temp_of_extremities_warm            239 non-null uint8
peripheral_pulse_increased          239 non-null uint8
peripheral_pulse_normal             239 non-null uint8
peripheral_pulse_reduced            239 non-null uint8
mucous_membrane_bright_red          239 non-null uint8
mucous_membrane_dark_cyanotic       239 non-null uint8
mucous_membrane_normal_pink         239 non-null uint8
mucous_membrane_pale_cyanotic       239 non-null uint8
mucous_membrane_pale_pink           239 non-null uint8
capillary_refill_time_less_3_sec    239 non-null uint8
capillary_refill_time_more_3_sec    239 non-null uint8
pain_depressed                      239 non-null uint8
pain_extreme_pain                   239 non-null uint8
pain_mild_pain                      239 non-null uint8
pain_severe_pain                    239 non-null uint8
peristalsis_hypermotile             239 non-null uint8
peristalsis_hypomotile              239 non-null uint8
peristalsis_normal                  239 non-null uint8
abdominal_distention_none           239 non-null uint8
abdominal_distention_severe         239 non-null uint8
abdominal_distention_slight         239 non-null uint8
nasogastric_tube_significant        239 non-null uint8
nasogastric_tube_slight             239 non-null uint8
nasogastric_reflux_more_1_liter     239 non-null uint8
nasogastric_reflux_none             239 non-null uint8
rectal_exam_feces_decreased         239 non-null uint8
rectal_exam_feces_increased         239 non-null uint8
rectal_exam_feces_normal            239 non-null uint8
abdomen_distend_small               239 non-null uint8
abdomen_firm                        239 non-null uint8
abdomen_normal                      239 non-null uint8
abdomen_other                       239 non-null uint8
abdomo_appearance_clear             239 non-null uint8
abdomo_appearance_cloudy            239 non-null uint8
abdomo_appearance_serosanguious     239 non-null uint8
surgical_lesion_yes                 239 non-null uint8
cp_data_yes                         239 non-null uint8
dtypes: float64(7), int64(3), uint8(41)
memory usage: 40.1 KB

Removing 0 importance features

In [44]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
features = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)
fig, ax = plt.subplots(figsize=(16,14))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
In [47]:
# removing lesion_3, which shows zero importance in the plot above
x_train = x_train.drop('lesion_3',axis=1)
x_test = x_test.drop('lesion_3',axis=1)
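Instead of reading the zero-importance feature off the plot, the same selection can be done programmatically. The sketch below uses synthetic data, not horse.csv; the column names `signal`, `noise` and `constant` are hypothetical. A constant column can never be used in a split, so its importance is exactly 0.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=100),
    "noise": rng.normal(size=100),
    "constant": np.zeros(100),  # can never be split on
})
toy_y = (toy_X["signal"] > 0).astype(int)  # target depends on 'signal' only

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Pair each importance with its column name and list the zero-importance ones
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
zero_importance = importances[importances == 0].index.tolist()
print(zero_importance)

# Drop them from the training frame; the same list would be used for the test frame
toy_X_reduced = toy_X.drop(columns=zero_importance)
```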

Modeling

DecisionTreeClassifier

Without cross-validation

In [48]:
# DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = dtc.predict(x_test)
metrics.accuracy_score(y_test, pred_y)  # signature is (y_true, y_pred)
Out[48]:
0.5833333333333334
In [49]:
# RandomForest
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = rfr.predict(x_test)
metrics.accuracy_score(y_test, pred_y)  # signature is (y_true, y_pred)
Out[49]:
0.7166666666666667

With cross-validation (5-fold, on the full dataset X, y)

In [50]:
from sklearn.model_selection import KFold, cross_val_score
In [59]:
# DecisionTreeClassifier
model = DecisionTreeClassifier()
result = cross_val_score(model, X, y, cv=5, scoring='accuracy')
result.mean()
Out[59]:
0.6089265536723164
In [58]:
# RandomForest
model = RandomForestClassifier()
result = cross_val_score(model, X, y, cv=5, scoring='accuracy')
result.mean()
Out[58]:
0.7223728813559321

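RandomForest gives better results than DecisionTree here (about 0.72 versus 0.61 mean cross-validated accuracy) because it is an ensemble: it combines the votes of many decision trees, each trained on a bootstrap sample with a random subset of features, which reduces the variance of a single tree.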
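The cells above do not fix a seed for the models, so the scores change slightly from run to run. A minimal sketch of a more repeatable comparison, evaluating both classifiers on the same shuffled, stratified folds, could look like this. It runs on synthetic data from `make_classification`, not on horse.csv, and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real data
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# The same folds are reused for both models so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="accuracy")
    print(name, round(scores[name].mean(), 3), round(scores[name].std(), 3))
```

Reporting the standard deviation across folds next to the mean gives a rough sense of whether the gap between the two models is larger than the fold-to-fold noise.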

Improvements

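  • MCA (Multiple Correspondence Analysis) could be applied to reduce the dimensionality of the categorical features.
  • Hyperparameter tuning
  • Test other classifiers
  • Use better missing-value handling methods, such as KNNImputer or MICE, or a model such as XGBoost that handles missing values natively.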
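As a pointer for the last item, scikit-learn ships `KNNImputer` in `sklearn.impute`. The sketch below applies it to a tiny made-up array; the three columns loosely mimic rectal_temp, pulse and respiratory_rate, but the values and the choice of `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numerical data with two gaps
toy = np.array([
    [38.5, 66.0, 28.0],
    [39.2, 88.0, 20.0],
    [np.nan, 70.0, 26.0],
    [38.3, np.nan, 24.0],
])

# Each gap is filled with the average of that feature over the 2 nearest rows,
# where distance is computed on the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(toy)

print(np.isnan(filled).any())  # False
print(filled.shape)            # (4, 3)
```

KNNImputer works on numerical arrays only, so categorical columns would need to be encoded first.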

Contact Me

www.linkedin.com/in/billygustave

billygustave.com