#MachineLearning #SupervisedLearning #Classification

By Billy Gustave

Male/Female Voice Classifier

Using voice.csv data to classify voices as Male or Female

In [1]:
#libraries
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import VarianceThreshold
from feature_selector import FeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Data Cleaning and Exploration

In [2]:
df = pd.read_csv('voice.csv')
df.head()
Out[2]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm ... centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 ... 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 male
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 ... 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 male
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 ... 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 male
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 ... 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 male
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 ... 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 male

5 rows × 21 columns

Applying LabelEncoder to label:

  • Male: 1
  • Female: 0
In [3]:
df.label = LabelEncoder().fit_transform(df.label)
df.head()
Out[3]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm ... centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 ... 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 1
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 ... 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 1
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 ... 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 1
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 ... 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 1
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 ... 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 1

5 rows × 21 columns
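LabelEncoder assigns integers in alphabetical order of the class names, so 'female' becomes 0 and 'male' becomes 1. A minimal check of that mapping (not part of the original notebook), re-reading the raw labels:

# Verify the label encoding: classes_ is sorted alphabetically
le = LabelEncoder()
le.fit(pd.read_csv('voice.csv').label)
print(dict(zip(le.classes_, le.transform(le.classes_))))  # {'female': 0, 'male': 1}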

Goal:

  • Use the 20 acoustic features to predict 'label'
In [4]:
df.shape
Out[4]:
(3168, 21)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
meanfreq    3168 non-null float64
sd          3168 non-null float64
median      3168 non-null float64
Q25         3168 non-null float64
Q75         3168 non-null float64
IQR         3168 non-null float64
skew        3168 non-null float64
kurt        3168 non-null float64
sp.ent      3168 non-null float64
sfm         3168 non-null float64
mode        3168 non-null float64
centroid    3168 non-null float64
meanfun     3168 non-null float64
minfun      3168 non-null float64
maxfun      3168 non-null float64
meandom     3168 non-null float64
mindom      3168 non-null float64
maxdom      3168 non-null float64
dfrange     3168 non-null float64
modindx     3168 non-null float64
label       3168 non-null int32
dtypes: float64(20), int32(1)
memory usage: 507.5 KB
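df.info() already shows 3168 non-null entries in every column; an explicit one-line check (a small sketch, not in the original notebook) confirms there is nothing to impute:

# Total count of missing values across all columns
print(df.isnull().sum().sum())  # expected: 0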

Data:

  • observations: 3168
  • target: label
  • number of features: 20
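It is also worth confirming the two classes are balanced before splitting; a quick check (hedged: the Kaggle voice dataset is documented as an even male/female split):

# Class counts: 1 = male, 0 = female
print(df.label.value_counts())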
In [6]:
X = df.drop('label', axis=1)
y = df.label

Train-Test-Split

In [7]:
# testing data size at 20%
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size = .2, random_state=4)
x_train.shape
Out[7]:
(2534, 20)
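With balanced classes a plain random split works fine, but passing stratify=y guarantees the male/female ratio is identical in the train and test sets. A hedged variant:

# Stratified variant of the same split (keeps the class ratio in both sets)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4, stratify=y)
print(y_tr.mean(), y_te.mean())  # both close to 0.5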

No missing values; checking for zero-variance (constant) features:

In [8]:
# drop zero-variance (constant) features
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train)
print(x_train.columns[constant_filter.get_support()])
x_train = x_train[x_train.columns[constant_filter.get_support()]]
Index(['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
       'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
       'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx'],
      dtype='object')
In [9]:
x_train.var()
Out[9]:
meanfreq        0.000912
sd              0.000280
median          0.001356
Q25             0.002398
Q75             0.000575
IQR             0.001834
skew           18.663979
kurt        19023.395830
sp.ent          0.002012
sfm             0.031667
mode            0.005961
centroid        0.000912
meanfun         0.001061
minfun          0.000374
maxfun          0.000900
meandom         0.277841
mindom          0.004009
maxdom         12.448487
dfrange        12.441876
modindx         0.013649
dtype: float64

Note: the features are on very different scales, so their variances span several orders of magnitude and a raw low-variance cutoff is not a good fit for this data.
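One way to make a variance cutoff meaningful is to rescale every feature to [0, 1] first, so the variances become comparable; a sketch (MinMaxScaler is an assumption, not used in the original notebook):

# Rescale to [0, 1], then the variances can be compared on one scale
from sklearn.preprocessing import MinMaxScaler
scaled = pd.DataFrame(MinMaxScaler().fit_transform(x_train), columns=x_train.columns)
print(scaled.var().sort_values().head())  # smallest variances after scaling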

High correlation filter (threshold: 75%):
*Technique by Vishal Patel*

In [10]:
# Correlation matrix for all independent vars
corrMatrix = x_train.corr()
allVars = corrMatrix.keys()

absCorrWithDep = []
for var in allVars:
    absCorrWithDep.append(abs(y.corr(x_train[var])))
# threshold setting
corrTol = 0.75

# for each column in the corr matrix
for col in corrMatrix:
    
    if col in corrMatrix.keys():
        thisCol = []
        thisVars = []
        temp = corrMatrix[col]
        
        # Store the corr with the dep var for fields that are highly correlated with each other
        for i in range(len(corrMatrix)):
            
            if abs(corrMatrix[col].iloc[i]) == 1.0 and col != corrMatrix.keys()[i]:
                thisCorr = 0
            else:
                thisCorr = (1 if abs(corrMatrix[col].iloc[i]) > corrTol else -1) * abs(temp[corrMatrix.keys()[i]])
            thisCol.append(thisCorr)
            thisVars.append(corrMatrix.keys()[i])
        
        mask = np.ones(len(thisCol), dtype = bool) # Initialize the mask
        
        ctDelCol = 0 # To keep track of the number of columns deleted
        
        for n, j in enumerate(thisCol):
            # Delete if (a) a var is correlated with others and does not have the best corr with dep,
            # or (b) is completely correlated with 'col'
            mask[n] = not (j != max(thisCol) and j >= 0)
            
            if j != max(thisCol) and j >= 0:
                # Delete the column from the corr matrix
                corrMatrix.pop('%s' %thisVars[n])
                ctDelCol += 1
                
        # Delete the corresponding row(s) from the corr matrix
        corrMatrix = corrMatrix[mask]
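The loop above keeps, from each group of mutually correlated features, the one with the strongest links, following Vishal Patel's technique. A much shorter (but cruder) alternative is the common upper-triangle filter, which drops one feature from every pair whose absolute correlation exceeds the threshold without considering the target; a sketch run against the 20-column training set:

# Hedged alternative: flag one column of each pair with |corr| > 0.75
corr_abs = x_train.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.75).any()]
print(to_drop)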
In [11]:
print(corrMatrix.columns)
x_train = x_train[corrMatrix.columns]
fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(x_train.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Index(['meanfreq', 'sd', 'Q75', 'skew', 'sp.ent', 'mode', 'meanfun', 'minfun',
       'maxfun', 'meandom', 'mindom', 'modindx'],
      dtype='object')
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d082aa5408>

Filtering for feature importance:

In [12]:
x_2 = x_train
x_2.shape
Out[12]:
(2534, 12)
In [13]:
fs = FeatureSelector(data = x_train, labels = y_train)
fs.identify_zero_importance(task='classification', eval_metric='auc')
Training Gradient Boosting Model

Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[66]	valid_0's auc: 0.99256	valid_0's binary_logloss: 0.112121
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[171]	valid_0's auc: 0.999284	valid_0's binary_logloss: 0.0404421
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[160]	valid_0's auc: 0.997796	valid_0's binary_logloss: 0.0731871
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[134]	valid_0's auc: 0.997189	valid_0's binary_logloss: 0.0661349
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[128]	valid_0's auc: 0.997823	valid_0's binary_logloss: 0.062535
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[136]	valid_0's auc: 0.997658	valid_0's binary_logloss: 0.0772197
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[81]	valid_0's auc: 0.997217	valid_0's binary_logloss: 0.0652777
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[122]	valid_0's auc: 0.994544	valid_0's binary_logloss: 0.103064
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[103]	valid_0's auc: 0.998044	valid_0's binary_logloss: 0.0697494
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[185]	valid_0's auc: 0.999366	valid_0's binary_logloss: 0.0334895

0 features with zero importance after one-hot encoding.
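With no zero-importance features found, nothing gets dropped at this step. For completeness, Will Koehrsen's feature_selector keeps its findings in fs.ops and can plot the GBM importances; treat these attribute and method names as assumptions about that particular package version:

# Assumption: fs.ops and plot_feature_importances exist in this feature_selector version
print(fs.ops['zero_importance'])        # list of flagged features (empty here)
fs.plot_feature_importances(plot_n=12)  # plot GBM-based importances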

Feature Importance:

In [14]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
features = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Testing different models using those 12 features:

In [15]:
# LogisticRegression
logr = LogisticRegression()
logr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = logr.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[15]:
0.9053627760252366
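Logistic regression is sensitive to feature scale (and the solver may not fully converge on unscaled inputs), so wrapping it in a pipeline with StandardScaler is usually worth trying; a hedged sketch:

# Sketch: scale the features before logistic regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
logr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logr_scaled.fit(x_train, y_train)
print(metrics.accuracy_score(y_test, logr_scaled.predict(x_test)))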
In [16]:
# RandomForest
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = rfr.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[16]:
0.973186119873817
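Accuracy alone can hide class-specific errors; the confusion matrix and per-class report for the random forest give a fuller picture (a small sketch using the predictions above):

# Per-class view of the random forest results (0 = female, 1 = male)
print(metrics.confusion_matrix(y_test, pred_y))
print(metrics.classification_report(y_test, pred_y, target_names=['female', 'male']))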
In [17]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = nb.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[17]:
0.9148264984227129
In [18]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = svc.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[18]:
0.7208201892744479
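The weak SVC score is almost certainly a scaling issue: the RBF kernel's distances are dominated by wide-range features such as skew. Putting the same SVC behind a StandardScaler usually brings it close to the tree-based models; a hedged sketch:

# Sketch: the same SVC inside a scaling pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svc_scaled = make_pipeline(StandardScaler(), SVC())
svc_scaled.fit(x_train, y_train)
print(metrics.accuracy_score(y_test, svc_scaled.predict(x_test)))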

Random Forest classifier has the best accuracy: 97%

***Improvements:***

  • Pipeline GridSearchCV with SequentialFeatureSelector to further narrow the number of features and to select and fine-tune a better model (a rough sketch follows below).
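A rough outline of that improvement, using scikit-learn's SequentialFeatureSelector (available since version 0.24; mlxtend's SFS would work similarly); the estimators and parameter grid below are illustrative assumptions, not tuned values:

# Hedged sketch: feature selection + model tuning in one cross-validated search
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('sfs', SequentialFeatureSelector(RandomForestClassifier(), n_features_to_select=8)),
    ('clf', RandomForestClassifier()),
])
param_grid = {
    'sfs__n_features_to_select': [6, 8, 10],
    'clf__n_estimators': [100, 300],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)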

Contact Me

www.linkedin.com/in/billygustave

billygustave.com