#MachineLearning #UnsupervisedLearning #Clustering
By Billy Gustave
Business challenge/requirement
Lithion Power is the largest provider of electric vehicle (e-vehicle) batteries. It provides batteries on a rental model to e-vehicle drivers: a driver typically rents a battery for a day and then replaces it with a charged one from the company.
Lithion Power has a variable pricing model based on a driver's driving history, since the life of a battery depends on factors such as over-speeding and distance driven per day.
Goal
Group drivers into clusters based on their driving data (mean distance driven per day and mean percentage of time spent over-speeding) so that Lithion Power can build driver groups for its variable pricing model.
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
df = pd.read_csv('driver-data.csv')
df.shape
df.head()
Checking for missing values
df.info()
df.isnull().sum()  # note: df.isna() is simply an alias of df.isnull()
No missing values
Removing non-informative features
data = df.drop(['id'],axis=1)
data.head()
Data visualization
fig, ax = plt.subplots()
sns.heatmap(data.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Looks like the variables are independent (no strong correlation).
# scatter plot
fig, ax = plt.subplots(figsize=(16,14))
plt.grid()
sns.scatterplot(x='mean_dist_day', y='mean_over_speed_perc', data=data)
We can see from the graph that there are a few main groups/clusters.
Note: the elbow curve is another way to estimate the number of clusters.
KMeans
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
sse = {}
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(data)
    label = kmeans.labels_
    # Inertia: sum of squared distances of samples to their closest cluster center
    sse[k] = kmeans.inertia_
    # note: we do not write the labels back into data here, otherwise the
    # cluster column would leak into the features of the next fit
    sil_coeff = silhouette_score(data, label, metric='euclidean')
    print("For n_clusters={}, the Silhouette Coefficient is {}".format(k, sil_coeff))
plt.figure(figsize=(16,14))
plt.plot(list(sse.keys()),list(sse.values()))
plt.grid()
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
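As a side note, the elbow can also be located programmatically. Here is a minimal sketch using the third-party kneed package (an assumption; it is not part of the original notebook):
from kneed import KneeLocator  # pip install kneed
# the SSE curve is convex and decreasing, which is what KneeLocator expects here
kl = KneeLocator(list(sse.keys()), list(sse.values()), curve='convex', direction='decreasing')
print(kl.elbow)  # suggested number of clusters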
The elbow curve graph suggests a best number of clusters, but we will go with k = 4 since this grouping is a lot easier to explain.
k = 4
kmeans = KMeans(n_clusters=k, random_state=13).fit(data)
y_pred = kmeans.predict(data)
centroids = kmeans.cluster_centers_
centroids
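Once fitted, the model can assign a new driver to a group, which is what the pricing model ultimately needs. A quick sketch with made-up feature values:
# hypothetical new driver: 60 km mean daily distance, 15% of time over-speeding (made-up values)
new_driver = pd.DataFrame([[60.0, 15.0]], columns=['mean_dist_day', 'mean_over_speed_perc'])
print(kmeans.predict(new_driver))  # index of the cluster (pricing group) this driver falls into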
# plotting centroids
data['clusters'] = kmeans.labels_
sns.lmplot(x='mean_dist_day', y='mean_over_speed_perc', data=data, hue='clusters', height=14, aspect=1, fit_reg=False)
plt.grid()
plt.scatter(centroids[:,0],centroids[:,1],marker='*',c='black',s=200)
KMeans normalized
KMeans relies on Euclidean distance, so a feature on a larger scale dominates the clustering; standardizing the features first puts them on an equal footing.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = data.drop('clusters',axis=1)
X_scaled = scaler.fit_transform(X)
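As a quick sanity check (an addition, not in the original notebook), the scaled features should now be centered with unit variance:
# each scaled column should have mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))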
kmeans = KMeans(n_clusters=k, random_state=13).fit(X_scaled)
data['clusters'] = kmeans.labels_
# the new centroids live in scaled space; map them back to the original units before plotting
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
sns.lmplot(x='mean_dist_day', y='mean_over_speed_perc', data=data, hue='clusters', height=14, aspect=1, fit_reg=False)
plt.grid()
plt.scatter(centroids[:,0], centroids[:,1], marker='*', c='black', s=200)
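To quantify how much standardization changed the grouping (an addition, not in the original notebook), one option is the adjusted Rand index between the two label sets:
from sklearn.metrics import adjusted_rand_score
# 1.0 means the scaled and unscaled models produced identical partitions
print(adjusted_rand_score(y_pred, kmeans.labels_))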
Agglomerative Clustering
agc = AgglomerativeClustering(n_clusters=4).fit(X_scaled)
clusters = agc.labels_
# plot the cluster assignments
plt.figure(figsize=(16,14))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap="plasma")
plt.xlabel('mean_dist_day')
plt.ylabel('mean_over_speed_perc')
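Agglomerative clustering builds a hierarchy of merges, which a dendrogram can visualize. A minimal sketch using scipy (an addition; ward linkage matches AgglomerativeClustering's default linkage):
from scipy.cluster.hierarchy import dendrogram, linkage
# ward linkage on the scaled features, matching AgglomerativeClustering's default
Z = linkage(X_scaled, method='ward')
plt.figure(figsize=(16, 6))
dendrogram(Z, truncate_mode='lastp', p=20)  # collapse the tree to the last 20 merges
plt.xlabel('cluster size')
plt.ylabel('merge distance')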
We can conclude there are four categories of drivers, roughly the combinations of the two features: low or high daily distance, crossed with low or high over-speed percentage.
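A quick way to characterize those categories (again an addition to the notebook) is to average the features per cluster:
# mean feature values and sizes per agglomerative cluster
profile = X.assign(cluster=clusters).groupby('cluster').mean()
print(profile)
print(pd.Series(clusters).value_counts())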