#MachineLearning #SupervisedLearning #Classification
By Billy Gustave
Business Challenge/Requirement
John Cancer Hospital (JCH) is a leading cancer hospital in the USA. It specializes in preventing breast cancer.
Over the last few years, JCH has collected breast cancer data from patients who came in for screening or treatment.
However, this data has almost 30 attributes, which makes it difficult to run models and interpret the results.
Goal:
Classify each observation's diagnosis (malignant vs. benign).
Data: breast-cancer-data.csv
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
df = pd.read_csv('breast-cancer-data.csv')
df.shape
df.head()
df.groupby('diagnosis').size()
df.diagnosis = df.diagnosis.map({'M':1,'B':0})
# Features and Target
X = df.drop(['id','diagnosis'], axis=1)
y = df.diagnosis
X.shape
#df.drop('id',axis=1).groupby('diagnosis').hist(figsize=(15,15))
X.describe()
df.info()
No missing values.
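To make that explicit, a quick check (a minimal sketch; df.info() above already reports the non-null counts):
# total missing values across all columns; should print 0
df.isnull().sum().sum()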
from sklearn.feature_selection import VarianceThreshold
# zero variance (features with a single unique value)
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X)
columns_to_remove = X.columns[~constant_filter.get_support()].tolist()
print('Constant features: ', columns_to_remove)
No constant (single-value) features found.
Handling high correlation
Threshold: 0.75
# FeatureSelector comes from the third-party feature_selector package (not scikit-learn)
from feature_selector import FeatureSelector
fs = FeatureSelector(data = X, labels = y)
fs.identify_collinear(correlation_threshold=0.75)
keep = [name for name in X.columns if name not in fs.ops['collinear']]
keep
X_clean = X[keep]
fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(X_clean.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
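If the third-party feature_selector package is unavailable, a similar collinear filter can be built with pandas alone (a sketch; the 0.75 threshold matches the one above, though which member of a correlated pair gets dropped may differ from FeatureSelector's choice; to_drop and X_clean_alt are illustrative names):
# keep the upper triangle of the absolute correlation matrix,
# then drop any column correlated above the threshold with an earlier one
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
X_clean_alt = X.drop(columns=to_drop)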
# model libraries
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
models = []
models.append(('LR',LogisticRegression(solver = 'newton-cg')))
models.append(('NB',GaussianNB()))
models.append(('DTC',DecisionTreeClassifier()))
models.append(('RFC',RandomForestClassifier()))
models.append(('GBC',GradientBoostingClassifier()))
models.append(('SVC',SVC()))
models.append(('KNN',KNeighborsClassifier()))
Using KFold cross-validation:
from sklearn.model_selection import cross_val_score, KFold
kfold = KFold(n_splits=10, random_state=17, shuffle=True)
names = []
scores = []
for name, model in models:
    score = cross_val_score(model, X_clean, y, cv=kfold, scoring='accuracy').mean()
    names.append(name)
    scores.append(score)
results = pd.DataFrame({'Model': names,'Accuracy': scores})
results
# graph of performance
axis = sns.barplot(x = 'Model', y = 'Accuracy', data = results)
axis.set(xlabel='Classifier', ylabel='Accuracy')
for p in axis.patches:
    height = p.get_height()
    axis.text(p.get_x() + p.get_width()/2, height + 0.005, '{:1.4f}'.format(height), ha="center")
plt.show()
Both Random Forest and Gradient Boosting have about the same accuracy.
We will use Gradient Boosting.
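Since that choice rests on mean accuracy alone, comparing the per-fold score distributions is a useful sanity check (a sketch reusing the models list and kfold from above):
# boxplot of the ten per-fold accuracies for each classifier
fold_scores = [cross_val_score(m, X_clean, y, cv=kfold, scoring='accuracy') for _, m in models]
plt.figure(figsize=(10, 5))
plt.boxplot(fold_scores, labels=[name for name, _ in models])
plt.ylabel('Accuracy per fold')
plt.show()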
Feature Selection
from sklearn.feature_selection import RFECV
gbc = GradientBoostingClassifier()
rfecv = RFECV(estimator=gbc, step=1, cv=kfold, scoring='accuracy')
rfecv.fit(X_clean,y)
plt.figure()
plt.title('Gradient Boosting CV score vs No of Features')
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
# rfecv was fit on X_clean, so its support mask aligns with X_clean.columns (not X.columns)
new_features = [name for name, selected in zip(X_clean.columns, rfecv.support_) if selected]
print(new_features)
# Calculate accuracy scores
X_clean_new = df[new_features]
initial_score = cross_val_score(gbc, X_clean, y, cv=kfold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(gbc, X_clean_new, y, cv=kfold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))
Though the accuracy on the filtered feature set is slightly lower, we will use it because it runs faster and the difference is small.
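The speed claim is easy to verify (a rough sketch; exact timings are machine-dependent, and the labels are illustrative):
import time
# time 10-fold cross-validation on the full vs. the filtered feature set
for label, data in [('all kept features', X_clean), ('RFECV features', X_clean_new)]:
    start = time.perf_counter()
    cross_val_score(gbc, data, y, cv=kfold, scoring='accuracy')
    print(label, ':', round(time.perf_counter() - start, 2), 's')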
PCA
Testing accuracy with PCA
from sklearn.decomposition import PCA
#Fitting the PCA algorithm with our Data
pca = PCA().fit(X_clean_new)
#Plotting the Cumulative Summation of the Explained Variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance')
plt.title('Breast Cancer Data: Explained Variance')
plt.ylim([0.995, 1.01])
plt.xlim([0,7])
plt.grid()
plt.show()
# the first two components already capture nearly all of the unscaled variance
pca = PCA(n_components=2).fit(X_clean_new)
X_clean_new_transf = pca.transform(X_clean_new)
# total explained variance ratio
pca.explained_variance_ratio_.sum()
# Calculate accuracy scores
initial_score = cross_val_score(gbc, X_clean_new, y, cv=kfold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(gbc, X_clean_new_transf, y, cv=kfold, scoring='accuracy').mean()
print("Accuracy after PCA : {} ".format(fe_score))
Not better: PCA on the unscaled data does not improve accuracy, since the components are dominated by the few features with the largest raw variance.
PCA Standardized
Testing accuracy with PCA on standardized data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_clean_new)
X_clean_new_scaled = scaler.transform(X_clean_new)
#Fitting the PCA algorithm with our Data
pca = PCA().fit(X_clean_new_scaled)
#Plotting the Cumulative Summation of the Explained Variance
fig, ax = plt.subplots(figsize=(16,14))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance')
plt.title('Breast Cancer Data: Explained Variance')
plt.xlim([1, X_clean_new_scaled.shape[1]])
plt.grid()
plt.show()
pca = PCA(n_components=5).fit(X_clean_new_scaled)
X_clean_new_scaled_Trnsf = pca.transform(X_clean_new_scaled)
# total explained variance ratio
pca.explained_variance_ratio_.sum()
# Calculate accuracy scores
initial_score = cross_val_score(gbc, X_clean_new, y, cv=kfold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(gbc, X_clean_new_scaled_Trnsf, y, cv=kfold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))
The reduced, standardized data performs better.
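One caveat: the scaler and PCA above were fit on the full dataset before cross-validation, so a small amount of information leaks across folds. A Pipeline keeps the preprocessing inside each training fold (a sketch with the same settings; pipe and pipe_score are illustrative names):
from sklearn.pipeline import Pipeline
# scaler and PCA are re-fit on each training fold only
pipe = Pipeline([('scale', StandardScaler()),
                 ('pca', PCA(n_components=5)),
                 ('gbc', GradientBoostingClassifier())])
pipe_score = cross_val_score(pipe, X_clean_new, y, cv=kfold, scoring='accuracy').mean()
print("Leakage-free accuracy : {} ".format(pipe_score))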
from sklearn.model_selection import GridSearchCV
# parameters
# note: newer scikit-learn renames loss 'deviance' to 'log_loss' and criterion 'mse' to 'squared_error'
param_grid = {'loss':['deviance','exponential'],
              'learning_rate': [0.01, 0.05, 0.1, 0.2],
              'max_depth':[2, 3, 5],
              'max_features':['log2','sqrt', None],
              'criterion': ['friedman_mse', 'mse'],
              'subsample':[0.25, 0.5, 1.0],
              'n_estimators':[10, 100, 200]}
# 1,296 parameter combinations x 10 folds; n_jobs=-1 would parallelize the search
gsearch = GridSearchCV(GradientBoostingClassifier(), param_grid=param_grid, cv=kfold, scoring='accuracy')
gsearch.fit(X_clean_new_scaled_Trnsf, y)
print(gsearch.best_params_)
gsearch.best_estimator_
# rebuild the classifier with the best parameters found by the grid search
gbc_final = GradientBoostingClassifier(criterion='mse',
                                       learning_rate=0.1,
                                       loss='exponential',
                                       max_depth=3,
                                       max_features='sqrt',
                                       n_estimators=100,
                                       subsample=0.5)
final_score = cross_val_score(gbc_final, X_clean_new_scaled_Trnsf, y, cv=kfold, scoring='accuracy').mean()
print("Final accuracy : {} ".format(final_score))