#MachineLearning #SupervisedLearning #Classification

By Billy Gustave

Male/Female Voice Classifier

Using voice.csv data to classify voices as Male or Female

In [1]:
#libraries
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import VarianceThreshold
from feature_selector import FeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Data Cleaning and Exploration

In [2]:
df = pd.read_csv('voice.csv')
df.head()
Out[2]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm ... centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 ... 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 male
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 ... 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 male
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 ... 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 male
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 ... 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 male
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 ... 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 male

5 rows × 21 columns

Applying LabelEncoder to label:

  • Male: 1
  • Female: 0
In [3]:
df.label = LabelEncoder().fit_transform(df.label)
df.head()
Out[3]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm ... centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 ... 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 1
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 ... 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 1
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 ... 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 1
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 ... 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 1
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 ... 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 1

5 rows × 21 columns
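LabelEncoder assigns integers in alphabetical order of the class names, so 'female' becomes 0 and 'male' becomes 1. A minimal check of that mapping (not part of the original notebook), re-reading the raw labels:

# Verify the label encoding: classes_ is sorted alphabetically
le = LabelEncoder()
le.fit(pd.read_csv('voice.csv').label)
print(dict(zip(le.classes_, le.transform(le.classes_))))  # {'female': 0, 'male': 1}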

Goal:

  • Use the 20 acoustic features to predict 'label'
In [4]:
df.shape
Out[4]:
(3168, 21)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
meanfreq    3168 non-null float64
sd          3168 non-null float64
median      3168 non-null float64
Q25         3168 non-null float64
Q75         3168 non-null float64
IQR         3168 non-null float64
skew        3168 non-null float64
kurt        3168 non-null float64
sp.ent      3168 non-null float64
sfm         3168 non-null float64
mode        3168 non-null float64
centroid    3168 non-null float64
meanfun     3168 non-null float64
minfun      3168 non-null float64
maxfun      3168 non-null float64
meandom     3168 non-null float64
mindom      3168 non-null float64
maxdom      3168 non-null float64
dfrange     3168 non-null float64
modindx     3168 non-null float64
label       3168 non-null int32
dtypes: float64(20), int32(1)
memory usage: 507.5 KB
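df.info() already shows 3168 non-null entries in every column; an explicit one-line check (a small sketch, not in the original notebook) confirms there is nothing to impute:

# Total count of missing values across all columns
print(df.isnull().sum().sum())  # expected: 0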

Data:

  • observations: 3168
  • target: label
  • number of features: 20
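It is also worth confirming the two classes are balanced before splitting; a quick check (hedged: the Kaggle voice dataset is documented as an even male/female split):

# Class counts: 1 = male, 0 = female
print(df.label.value_counts())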
In [6]:
X = df.drop('label', axis=1)
y = df.label

Train-Test-Split

In [7]:
# testing data size at 20%
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size = .2, random_state=4)
x_train.shape
Out[7]:
(2534, 20)
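With balanced classes a plain random split works fine, but passing stratify=y guarantees the male/female ratio is identical in the train and test sets. A hedged variant:

# Stratified variant of the same split (keeps the class ratio in both sets)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4, stratify=y)
print(y_tr.mean(), y_te.mean())  # both close to 0.5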

No missing values; checking for zero-variance (constant) features:

In [8]:
# drop zero-variance (constant) features
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train)
print(x_train.columns[constant_filter.get_support()])
x_train = x_train[x_train.columns[constant_filter.get_support()]]
Index(['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
       'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
       'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx'],
      dtype='object')
In [9]:
x_train.var()
Out[9]:
meanfreq        0.000912
sd              0.000280
median          0.001356
Q25             0.002398
Q75             0.000575
IQR             0.001834
skew           18.663979
kurt        19023.395830
sp.ent          0.002012
sfm             0.031667
mode            0.005961
centroid        0.000912
meanfun         0.001061
minfun          0.000374
maxfun          0.000900
meandom         0.277841
mindom          0.004009
maxdom         12.448487
dfrange        12.441876
modindx         0.013649
dtype: float64

Note: the features are on very different scales, so their variances span several orders of magnitude and a raw low-variance cutoff is not a good fit for this data.
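One way to make a variance cutoff meaningful is to rescale every feature to [0, 1] first, so the variances become comparable; a sketch (MinMaxScaler is an assumption, not used in the original notebook):

# Rescale to [0, 1], then the variances can be compared on one scale
from sklearn.preprocessing import MinMaxScaler
scaled = pd.DataFrame(MinMaxScaler().fit_transform(x_train), columns=x_train.columns)
print(scaled.var().sort_values().head())  # smallest variances after scaling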

High correlation filter (threshold: 75%):
*Technique by Vishal Patel*

In [10]:
# Correlation matrix for all independent vars
corrMatrix = x_train.corr()
allVars = corrMatrix.keys()

absCorrWithDep = []
for var in allVars:
    absCorrWithDep.append(abs(y.corr(x_train[var])))
# threshold setting
corrTol = 0.75

# for each column in the corr matrix
for col in corrMatrix:
    
    if col in corrMatrix.keys():
        thisCol = []
        thisVars = []
        temp = corrMatrix[col]
        
        # Store the corr with the dep var for fields that are highly correlated with each other
        for i in range(len(corrMatrix)):
            
            if abs(corrMatrix[col].iloc[i]) == 1.0 and col != corrMatrix.keys()[i]:
                thisCorr = 0
            else:
                thisCorr = (1 if abs(corrMatrix[col].iloc[i]) > corrTol else -1) * abs(temp[corrMatrix.keys()[i]])
            thisCol.append(thisCorr)
            thisVars.append(corrMatrix.keys()[i])
        
        mask = np.ones(len(thisCol), dtype = bool) # Initialize the mask
        
        ctDelCol = 0 # To keep track of the number of columns deleted
        
        for n, j in enumerate(thisCol):
            # Delete if (a) a var is correlated with others and does not have the best corr with dep,
            # or (b) is completely correlated with 'col'
            mask[n] = not (j != max(thisCol) and j >= 0)
            
            if j != max(thisCol) and j >= 0:
                # Delete the column from the corr matrix
                corrMatrix.pop('%s' %thisVars[n])
                ctDelCol += 1
                
        # Delete the corresponding row(s) from the corr matrix
        corrMatrix = corrMatrix[mask]
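The loop above keeps, from each group of mutually correlated features, the one with the strongest links, following Vishal Patel's technique. A much shorter (but cruder) alternative is the common upper-triangle filter, which drops one feature from every pair whose absolute correlation exceeds the threshold without considering the target; a sketch run against the 20-column training set:

# Hedged alternative: flag one column of each pair with |corr| > 0.75
corr_abs = x_train.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.75).any()]
print(to_drop)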
In [11]:
print(corrMatrix.columns)
x_train = x_train[corrMatrix.columns]
fig, ax = plt.subplots(figsize=(16,14))
sns.heatmap(x_train.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Index(['meanfreq', 'sd', 'Q75', 'skew', 'sp.ent', 'mode', 'meanfun', 'minfun',
       'maxfun', 'meandom', 'mindom', 'modindx'],
      dtype='object')
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d082aa5408>

Filtering for feature importance:

In [12]:
x_2 = x_train
x_2.shape
Out[12]:
(2534, 12)
In [13]:
fs = FeatureSelector(data = x_train, labels = y_train)
fs.identify_zero_importance(task='classification', eval_metric='auc')
Training Gradient Boosting Model

Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[66]	valid_0's auc: 0.99256	valid_0's binary_logloss: 0.112121
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[171]	valid_0's auc: 0.999284	valid_0's binary_logloss: 0.0404421
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[160]	valid_0's auc: 0.997796	valid_0's binary_logloss: 0.0731871
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[134]	valid_0's auc: 0.997189	valid_0's binary_logloss: 0.0661349
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[128]	valid_0's auc: 0.997823	valid_0's binary_logloss: 0.062535
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[136]	valid_0's auc: 0.997658	valid_0's binary_logloss: 0.0772197
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[81]	valid_0's auc: 0.997217	valid_0's binary_logloss: 0.0652777
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[122]	valid_0's auc: 0.994544	valid_0's binary_logloss: 0.103064
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[103]	valid_0's auc: 0.998044	valid_0's binary_logloss: 0.0697494
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[185]	valid_0's auc: 0.999366	valid_0's binary_logloss: 0.0334895

0 features with zero importance after one-hot encoding.
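With no zero-importance features found, nothing gets dropped at this step. For completeness, Will Koehrsen's feature_selector keeps its findings in fs.ops and can plot the GBM importances; treat these attribute and method names as assumptions about that particular package version:

# Assumption: fs.ops and plot_feature_importances exist in this feature_selector version
print(fs.ops['zero_importance'])        # list of flagged features (empty here)
fs.plot_feature_importances(plot_n=12)  # plot GBM-based importances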

Feature Importance:

In [14]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
features = x_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Testing different models using those 12 features:

In [15]:
# LogisticRegression
logr = LogisticRegression()
logr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = logr.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[15]:
0.9053627760252366
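Logistic regression is sensitive to feature scale (and the solver may not fully converge on unscaled inputs), so wrapping it in a pipeline with StandardScaler is usually worth trying; a hedged sketch:

# Sketch: scale the features before logistic regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
logr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logr_scaled.fit(x_train, y_train)
print(metrics.accuracy_score(y_test, logr_scaled.predict(x_test)))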
In [16]:
# RandomForest
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = rfr.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[16]:
0.973186119873817
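Accuracy alone can hide class-specific errors; the confusion matrix and per-class report for the random forest give a fuller picture (a small sketch using the predictions above):

# Per-class view of the random forest results (0 = female, 1 = male)
print(metrics.confusion_matrix(y_test, pred_y))
print(metrics.classification_report(y_test, pred_y, target_names=['female', 'male']))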
In [17]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = nb.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[17]:
0.9148264984227129
In [18]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(x_train,y_train)
x_test = x_test[x_train.columns]
pred_y = svc.predict(x_test)
metrics.accuracy_score(pred_y, y_test)
Out[18]:
0.7208201892744479
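The weak SVC score is almost certainly a scaling issue: the RBF kernel's distances are dominated by wide-range features such as skew. Putting the same SVC behind a StandardScaler usually brings it close to the tree-based models; a hedged sketch:

# Sketch: the same SVC inside a scaling pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svc_scaled = make_pipeline(StandardScaler(), SVC())
svc_scaled.fit(x_train, y_train)
print(metrics.accuracy_score(y_test, svc_scaled.predict(x_test)))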

Random Forest classifier has the best accuracy: 97%

***Improvements:***

  • Pipeline GridSearchCV with SequentialFeatureSelector to further narrow the number of features and to select and fine-tune a better model (a rough sketch follows below).
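A rough outline of that improvement, using scikit-learn's SequentialFeatureSelector (available since version 0.24; mlxtend's SFS would work similarly); the estimators and parameter grid below are illustrative assumptions, not tuned values:

# Hedged sketch: feature selection + model tuning in one cross-validated search
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('sfs', SequentialFeatureSelector(RandomForestClassifier(), n_features_to_select=8)),
    ('clf', RandomForestClassifier()),
])
param_grid = {
    'sfs__n_features_to_select': [6, 8, 10],
    'clf__n_estimators': [100, 300],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)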

Contact Me

www.linkedin.com/in/billygustave

billygustave.com