#MachineLearning #SupervisedLearning #Classification

By Billy Gustave

Breast Cancer

Business Challenge/Requiment
John Cancer Hospital (JCH) is a leading cancer hospital in USA. It specializes in preventingbreast cancer.
Over the period of the last few years, JCH has collected breast cancer data from patients who came for screening/treatment.
However,this data has almost 30 attributes and is difficult to run and interpret the result.

Goal:

  • Classify observations diagnosis.
    Data: breast-cancer-data.csv

  • using PCA and GridSearchCV for hyper parameters tuning
  • Comapring PCA result to no PCA's

    Note:
    • Diagnosis (M = malignant, B = benign)
    • All feature values are recoded with four significant digits.
    • Class distribution: 357 benign, 212 malignant </small>

Data Exploration

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
In [2]:
df = pd.read_csv('breast-cancer-data.csv')
df.shape
Out[2]:
(569, 32)
In [3]:
df.head()
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 32 columns

In [4]:
df.groupby('diagnosis').size()
Out[4]:
diagnosis
B    357
M    212
dtype: int64

Data Cleaning

In [5]:
df.diagnosis = df.diagnosis.map({'M':1,'B':0})
In [6]:
# Features and Target
X = df.drop(['id','diagnosis'], axis=1)
y = df.diagnosis
X.shape
Out[6]:
(569, 30)
In [7]:
#df.drop('id',axis=1).groupby('diagnosis').hist(figsize=(15,15))
X.describe()
Out[7]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns

In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
id                         569 non-null int64
diagnosis                  569 non-null int64
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
dtypes: float64(30), int64(2)
memory usage: 142.4 KB

No Missing values

In [9]:
from sklearn.feature_selection import VarianceThreshold
# zero variance (unique values)
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X)
columns_to_remove = [name for name in X.columns if name not in X.columns[constant_filter.get_support()]]
print('Unique features: ', columns_to_remove)
Unique features:  []

No Features with Unique values

Hadling high correlation
Threshold 0.75

In [10]:
from feature_selector import FeatureSelector
fs = FeatureSelector(data = X, labels = y)
fs.identify_collinear(correlation_threshold=0.75)
keep = [name for name in X.columns if name not in fs.ops['collinear']]
keep
18 features with a correlation magnitude greater than 0.75.

Out[10]:
['radius_mean',
 'texture_mean',
 'smoothness_mean',
 'compactness_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'smoothness_se',
 'compactness_se',
 'symmetry_se',
 'symmetry_worst']
In [11]:
X_clean = X[keep]
In [12]:
fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(X_clean.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x19ad4b055c8>

Model Selection

In [13]:
# model libraries
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
In [14]:
models = []
models.append(('LR',LogisticRegression(solver = 'newton-cg')))
models.append(('NB',GaussianNB()))
models.append(('DTC',DecisionTreeClassifier()))
models.append(('RFC',RandomForestClassifier()))
models.append(('GBC',GradientBoostingClassifier()))
models.append(('SVC',SVC()))
models.append(('KNN',KNeighborsClassifier()))

Using Kfold and Cross Validation:

In [15]:
from sklearn.model_selection import cross_val_score, KFold
kfold = KFold(n_splits=10, random_state=17, shuffle=True)
In [16]:
names = []
scores = []
for name, model in models:
    score = cross_val_score(model, X_clean, y, cv=kfold, scoring='accuracy').mean()
    names.append(name)
    scores.append(score)
results  = pd.DataFrame({'Model': names,'Accuracy': scores})
results
Out[16]:
Model Accuracy
0 LR 0.906704
1 NB 0.901566
2 DTC 0.919142
3 RFC 0.945457
4 GBC 0.945395
5 SVC 0.889129
6 KNN 0.885652
In [17]:
# graph of performance
axis = sns.barplot(x = 'Model', y = 'Accuracy', data = results)
axis.set(xlabel='Classifier', ylabel='Accuracy')
for p in axis.patches:
    height = p.get_height()
    axis.text(p.get_x() + p.get_width()/2, height + 0.005, '{:1.4f}'.format(height), ha="center") 
    
plt.show()

Both Random Forest and Gradient Boosting have about same accuracy
We will use Gradient Boosting

Feature Selection

In [18]:
from sklearn.feature_selection import RFECV
In [19]:
gbc = GradientBoostingClassifier()
rfecv = RFECV(estimator=gbc, step=1, cv=kfold, scoring='accuracy')
rfecv.fit(X_clean,y)
Out[19]:
RFECV(cv=KFold(n_splits=10, random_state=17, shuffle=True),
      estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                           criterion='friedman_mse', init=None,
                                           learning_rate=0.1, loss='deviance',
                                           max_depth=3, max_features=None,
                                           max_leaf_nodes=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                                           min_samples_leaf=1,
                                           min_samples_split=2,
                                           min_weight_fraction_leaf=0.0,
                                           n_estimators=100,
                                           n_iter_no_change=None,
                                           presort='deprecated',
                                           random_state=None, subsample=1.0,
                                           tol=0.0001, validation_fraction=0.1,
                                           verbose=0, warm_start=False),
      min_features_to_select=1, n_jobs=None, scoring='accuracy', step=1,
      verbose=0)
In [20]:
plt.figure()
plt.title('Gradient Boosting CV score vs No of Features')
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
In [21]:
feature_importance = list(zip(X.columns, rfecv.support_))
new_features = []
for key,value in enumerate(feature_importance):
    if(value[1]) == True:
        new_features.append(value[0])
        
print(new_features)
['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se']
In [22]:
# Calculate accuracy scores 
X_clean_new = df[new_features]
initial_score = cross_val_score(gbc, X_clean, y, cv=kfold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(gbc, X_clean_new, y, cv=kfold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))
Initial accuracy : 0.9453947368421053 
Accuracy after Feature Selection : 0.9313909774436089 

Though the accuracy on filtered 1 is less, we will be using it because it runs faster and the difference is small.

Feature Engineering

PCA
Testing accuracy with PCA

In [23]:
from sklearn.decomposition import PCA
In [24]:
#Fitting the PCA algorithm with our Data
pca = PCA().fit(X_clean_new)

#Plotting the Cumulative Summation of the Explained Variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Pulsar Dataset Explained Variance')
plt.ylim([0.995, 1.01])
plt.xlim([0,7])
plt.grid()
plt.show()
In [25]:
pca = PCA(n_components=2).fit(X_clean_new)
X_cleam_new_Trnsf = pca.transform(X_clean_new)
# total explained variance ratio
pca.explained_variance_ratio_.sum()
Out[25]:
0.9998769734480429
In [26]:
# Calculate accuracy scores 
initial_score = cross_val_score(gbc, X_clean_new, y, cv=kfold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(gbc, X_cleam_new_Trnsf, y, cv=kfold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))
Initial accuracy : 0.9331453634085213 
Accuracy after Feature Selection : 0.8926691729323307 

Not better

PCA Standardized
Testing accuracy with PCA normalized

In [27]:
from sklearn.preprocessing import StandardScaler
In [28]:
scaler = StandardScaler().fit(X_clean_new)
X_clean_new_scaled = scaler.transform(X_clean_new)
In [29]:
#Fitting the PCA algorithm with our Data
pca = PCA().fit(X_clean_new_scaled)

#Plotting the Cumulative Summation of the Explained Variance
fig, ax = plt.subplots(figsize=(16,14))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Pulsar Dataset Explained Variance')
plt.xlim([1, X_clean_new_scaled.shape[1]])
plt.grid()
plt.show()
In [30]:
pca = PCA(n_components=5).fit(X_clean_new_scaled)
X_clean_new_scaled_Trnsf = pca.transform(X_clean_new_scaled)
# total explained variance ratio
pca.explained_variance_ratio_.sum()
Out[30]:
0.9535747688260753
In [31]:
# Calculate accuracy scores 
initial_score = cross_val_score(gbc, X_clean_new, y, cv=kfold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(gbc, X_clean_new_scaled_Trnsf, y, cv=kfold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))
Initial accuracy : 0.9313909774436089 
Accuracy after Feature Selection : 0.9419486215538848 

Reduced and normalized data is better.

Model fine tuning

In [32]:
from sklearn.model_selection import GridSearchCV
In [33]:
# parameters
param_grid = {'loss':['deviance','exponential'],
              'learning_rate': [0.01, 0.05, 0.1, 0.2],
              'max_depth':[2, 3, 5],
              'max_features':['log2','sqrt', None],
              'criterion': ['friedman_mse', 'mse'],
              'subsample':[0.25, 0.5, 1.0],
              'n_estimators':[10, 100, 200]}
In [34]:
gsearch = GridSearchCV(GradientBoostingClassifier(), param_grid=param_grid, cv=kfold, scoring='accuracy')
gsearch.fit(X_clean_new_scaled_Trnsf, y)
print(gsearch.best_params_)
gsearch.best_estimator_
{'criterion': 'mse', 'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 3, 'max_features': 'sqrt', 'n_estimators': 100, 'subsample': 0.5}
Out[34]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='mse', init=None,
                           learning_rate=0.1, loss='exponential', max_depth=3,
                           max_features='sqrt', max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=0.5, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
In [38]:
gbc_final = GradientBoostingClassifier(criterion='mse',
                                       learning_rate=0.1, 
                                       loss='exponential', 
                                       max_depth=3, 
                                       max_features='sqrt', 
                                       n_estimators=100, 
                                       subsample=0.5)
final_score = cross_val_score(gbc_final, X_clean_new_scaled_Trnsf, y, cv=kfold, scoring='accuracy').mean()
print("Final accuracy : {} ".format(final_score))
Final accuracy : 0.9419799498746867 

Note:

  • Can get better result with more grid search parameters but it also requires more computing power.
  • Not best practice to work on whole dataset, it better to do traing test split and use all transformation from train to test.

Contact Me

www.linkedin.com/in/billygustave

billygustave.com