#MachineLearning #UnsupervisedLearning #Clustering

By Billy Gustave

Lithion Power

Business challenge/requirement
Lithion Power is the largest provider of electric vehicle (e-vehicle) batteries. It supplies batteries to e-vehicle drivers on a rental model: a driver typically rents a battery for a day and then swaps it for a charged battery from the company.
Lithion Power wants a variable pricing model based on each driver's driving history, since the life of a battery depends on factors such as over-speeding and the distance driven per day.

Goal

  • Group drivers based on driving history/data
  • Compare different unsupervised learning models

Data Exploration

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
In [2]:
df = pd.read_csv('driver-data.csv')
df.shape
Out[2]:
(4000, 3)
In [3]:
df.head()
Out[3]:
id mean_dist_day mean_over_speed_perc
0 3423311935 71.24 28
1 3423313212 52.53 25
2 3423313724 64.54 27
3 3423311373 55.69 22
4 3423310999 54.58 25

Checking for missing values

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 3 columns):
id                      4000 non-null int64
mean_dist_day           4000 non-null float64
mean_over_speed_perc    4000 non-null int64
dtypes: float64(1), int64(2)
memory usage: 93.9 KB
In [5]:
df.isnull().sum()
Out[5]:
id                      0
mean_dist_day           0
mean_over_speed_perc    0
dtype: int64
In [6]:
df.isna().sum()
Out[6]:
id                      0
mean_dist_day           0
mean_over_speed_perc    0
dtype: int64

No missing values

Removing non-informative features

In [7]:
data = df.drop(['id'],axis=1)
data.head()
Out[7]:
mean_dist_day mean_over_speed_perc
0 71.24 28
1 52.53 25
2 64.54 27
3 55.69 22
4 54.58 25

Data visualization

In [8]:
fig, ax = plt.subplots()
sns.heatmap(data.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e35fc60388>

Looks like the variables are independent (no strong correlation).

In [9]:
# scatter plot
fig, ax = plt.subplots(figsize=(16,14))
plt.grid()
sns.scatterplot(data.mean_dist_day, data.mean_over_speed_perc)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e35fe10448>

We can see from the graph that there are a few well-separated groups/clusters.
Note: the elbow curve is another way to estimate the number of clusters.

Clustering

KMeans

In [10]:
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
In [11]:
sse = {}
for k in range(2,20):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(data)
    label = kmeans.labels_
    data['clusters'] = label
    # Inertia: Sum of distances of samples to their closest cluster center
    sse[k] = kmeans.inertia_ 
    sil_coeff = silhouette_score(data, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(k, sil_coeff))
plt.figure(figsize=(16,14))
plt.plot(list(sse.keys()),list(sse.values()))
plt.grid()
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
For n_clusters=2, The Silhouette Coefficient is 0.849026997876221
For n_clusters=3, The Silhouette Coefficient is 0.8231537468201255
For n_clusters=4, The Silhouette Coefficient is 0.5926362782704455
For n_clusters=5, The Silhouette Coefficient is 0.518627849370832
For n_clusters=6, The Silhouette Coefficient is 0.4976915962489952
For n_clusters=7, The Silhouette Coefficient is 0.4794286320734636
For n_clusters=8, The Silhouette Coefficient is 0.49221045038304795
For n_clusters=9, The Silhouette Coefficient is 0.4770246730671588
For n_clusters=10, The Silhouette Coefficient is 0.4781034937725309
For n_clusters=11, The Silhouette Coefficient is 0.49440350134329114
For n_clusters=12, The Silhouette Coefficient is 0.45066926846096556
For n_clusters=13, The Silhouette Coefficient is 0.5303734494421017
For n_clusters=14, The Silhouette Coefficient is 0.47538129633319465
For n_clusters=15, The Silhouette Coefficient is 0.5675304825013167
For n_clusters=16, The Silhouette Coefficient is 0.4786438868054198
For n_clusters=17, The Silhouette Coefficient is 0.48274746501445115
For n_clusters=18, The Silhouette Coefficient is 0.519587364865119
For n_clusters=19, The Silhouette Coefficient is 0.5147384448777217
Out[11]:
Text(0, 0.5, 'SSE')

The elbow curve gives another estimate of the best number of clusters (and the silhouette coefficient above peaks at k=2), but we will go with k=4 since that grouping is a lot easier to explain.
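One subtlety in the loop above: data['clusters'] = label writes the labels back into the frame before the silhouette score is computed, so from k=3 onward both the fit and the score include the previous iteration's cluster column. A cleaner variant (a minimal sketch that keeps the evaluation on the two raw features only) could look like this:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# score each k on the raw features only, without writing labels back into the frame
features = df[['mean_dist_day', 'mean_over_speed_perc']]
sse = {}
for k in range(2, 20):
    km = KMeans(n_clusters=k, random_state=42).fit(features)
    sse[k] = km.inertia_  # sum of squared distances to the closest centroid
    sil = silhouette_score(features, km.labels_, metric='euclidean')
    print("For n_clusters={}, the Silhouette Coefficient is {}".format(k, sil))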

In [12]:
k = 4
kmeans = KMeans(n_clusters=k, random_state=13).fit(data)
y_pred = kmeans.predict(data)
centroids = kmeans.cluster_centers_
centroids
Out[12]:
array([[ 49.98140338,   5.23425693,   6.1986326 ],
       [180.34311782,  10.52011494,   8.48850575],
       [177.83509615,  70.28846154,   8.57692308],
       [ 50.48482185,  32.55581948,   9.9239905 ]])
In [13]:
# plotting centroids
data['clusters'] = kmeans.labels_
sns.lmplot('mean_dist_day','mean_over_speed_perc',data=data, hue='clusters',height=14,aspect=1,fit_reg=False)
plt.grid()
plt.scatter(centroids[:,0],centroids[:,1],marker='*',c='black',s=200)
Out[13]:
<matplotlib.collections.PathCollection at 0x1e3609bc788>
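To sanity-check the grouping, it is also worth looking at how many drivers fall into each cluster (a quick follow-up sketch, not part of the original run):

# number of drivers assigned to each of the four clusters
print(data['clusters'].value_counts())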

KMeans on normalized data

In [14]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = data.drop('clusters',axis=1)
X_scaled = scaler.fit_transform(X)
In [15]:
kmeans = KMeans(n_clusters=k, random_state=13).fit(X_scaled)
data['clusters'] = kmeans.labels_
sns.lmplot('mean_dist_day','mean_over_speed_perc',data=data, hue='clusters',height=14,aspect=1,fit_reg=False)
plt.grid()
plt.scatter(centroids[:,0],centroids[:,1],marker='*',c='black',s=200)
Out[15]:
<matplotlib.collections.PathCollection at 0x1e360df5ec8>
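Note that the starred centroids plotted here are still the ones from the unscaled model. To mark the centroids of the model fitted on X_scaled, they would first have to be mapped back to the original units (a sketch, assuming the scaler and kmeans objects fitted in the two cells above):

# convert the scaled-space centroids back to the original feature units
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(centroids_original)
plt.scatter(centroids_original[:, 0], centroids_original[:, 1], marker='*', c='black', s=200)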

Agglomerative Clustering

In [16]:
agc = AgglomerativeClustering(n_clusters=4).fit(X_scaled)
clusters = agc.labels_
# plot the cluster assignments
plt.figure(figsize=(16,14))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap="plasma")
plt.xlabel('mean_dist_day')
plt.ylabel('mean_over_speed_perc')
Out[16]:
Text(0, 0.5, 'mean_over_speed_perc')
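Since one of the goals is to compare the unsupervised models, a simple comparison is to score both sets of labels with the silhouette coefficient on the same scaled features (a sketch, assuming the kmeans and agc objects fitted above):

# higher silhouette = tighter, better-separated clusters
print('KMeans (k=4, scaled):       ', silhouette_score(X_scaled, kmeans.labels_))
print('Agglomerative (k=4, scaled):', silhouette_score(X_scaled, agc.labels_))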

We can conclude there are four categories of drivers:

  • (short distance, low speed)
  • (long distance, low speed)
  • (long distance, high speed)
  • (short distance, high speed)
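As a rough illustration, these four names can be attached programmatically by thresholding each cluster centroid (a sketch; the distance cut-off of 100 and over-speed cut-off of 20 are eyeballed from the centroids above and are assumptions, not values from the original analysis):

# illustrative only: refit k=4 on the two raw features and name each cluster
from sklearn.cluster import KMeans

X_raw = df[['mean_dist_day', 'mean_over_speed_perc']]
km4 = KMeans(n_clusters=4, random_state=13).fit(X_raw)

names = {}
for i, (dist, speed) in enumerate(km4.cluster_centers_):
    names[i] = ('long distance, ' if dist > 100 else 'short distance, ') + \
               ('high speed' if speed > 20 else 'low speed')

segments = pd.Series(km4.labels_).map(names)
print(segments.value_counts())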

Contact Me

www.linkedin.com/in/billygustave

billygustave.com