#MachineLearning #UnsupervisedLearning #Clustering
By Billy Gustave
Business challenge/requirement
Lithion Power is the largest provider of electric vehicle (e-vehicle) batteries. It provides batteries on a rental model to e-vehicle drivers: a driver typically rents a battery for a day and then replaces it with a charged one from the company.
Lithion Power has a variable pricing model based on a driver's driving history, since the life of a battery depends on factors such as over-speeding and distance driven per day.
Goal
Group drivers into clusters based on their driving data (mean distance driven per day and mean percentage of time spent over-speeding) so that Lithion Power can build driver groups for its variable pricing model.
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
df = pd.read_csv('driver-data.csv')
df.shape
df.head()
Checking for missing values
df.info()
df.isnull().sum()  # note: df.isna() is simply an alias of df.isnull()
No missing values
Removing non-informative features
data = df.drop(['id'],axis=1)
data.head()
Data visualization
fig, ax = plt.subplots()
sns.heatmap(data.corr(), cmap='Reds', annot=True, linewidths=.5, ax=ax)
Looks like the variables are independent (no strong correlation).
# scatter plot
fig, ax = plt.subplots(figsize=(16,14))
plt.grid()
sns.scatterplot(x='mean_dist_day', y='mean_over_speed_perc', data=data)
We can see from the graph that there are a few main groups/clusters.
Note: the elbow curve is another way to estimate the number of clusters.
KMeans
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
sse = {}
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(data)
    label = kmeans.labels_
    # Inertia: sum of squared distances of samples to their closest cluster center
    sse[k] = kmeans.inertia_
    # note: we do not write the labels back into data here, otherwise the
    # cluster column would leak into the features of the next fit
    sil_coeff = silhouette_score(data, label, metric='euclidean')
    print("For n_clusters={}, the Silhouette Coefficient is {}".format(k, sil_coeff))
plt.figure(figsize=(16,14))
plt.plot(list(sse.keys()),list(sse.values()))
plt.grid()
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
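As a side note, the elbow can also be located programmatically. Here is a minimal sketch using the third-party kneed package (an assumption; it is not part of the original notebook):
from kneed import KneeLocator  # pip install kneed
# the SSE curve is convex and decreasing, which is what KneeLocator expects here
kl = KneeLocator(list(sse.keys()), list(sse.values()), curve='convex', direction='decreasing')
print(kl.elbow)  # suggested number of clusters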
The elbow curve graph suggests a best number of clusters, but we will go with k = 4 since this grouping is a lot easier to explain.
k = 4
kmeans = KMeans(n_clusters=k, random_state=13).fit(data)
y_pred = kmeans.predict(data)
centroids = kmeans.cluster_centers_
centroids
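Once fitted, the model can assign a new driver to a group, which is what the pricing model ultimately needs. A quick sketch with made-up feature values:
# hypothetical new driver: 60 km mean daily distance, 15% of time over-speeding (made-up values)
new_driver = pd.DataFrame([[60.0, 15.0]], columns=['mean_dist_day', 'mean_over_speed_perc'])
print(kmeans.predict(new_driver))  # index of the cluster (pricing group) this driver falls into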
# plotting centroids
data['clusters'] = kmeans.labels_
sns.lmplot(x='mean_dist_day', y='mean_over_speed_perc', data=data, hue='clusters', height=14, aspect=1, fit_reg=False)
plt.grid()
plt.scatter(centroids[:,0],centroids[:,1],marker='*',c='black',s=200)
KMeans normalized
KMeans relies on Euclidean distance, so a feature on a larger scale dominates the clustering; standardizing the features first puts them on an equal footing.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = data.drop('clusters',axis=1)
X_scaled = scaler.fit_transform(X)
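As a quick sanity check (an addition, not in the original notebook), the scaled features should now be centered with unit variance:
# each scaled column should have mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))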
kmeans = KMeans(n_clusters=k, random_state=13).fit(X_scaled)
data['clusters'] = kmeans.labels_
# the new centroids live in scaled space; map them back to the original units before plotting
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
sns.lmplot(x='mean_dist_day', y='mean_over_speed_perc', data=data, hue='clusters', height=14, aspect=1, fit_reg=False)
plt.grid()
plt.scatter(centroids[:,0], centroids[:,1], marker='*', c='black', s=200)
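To quantify how much standardization changed the grouping (an addition, not in the original notebook), one option is the adjusted Rand index between the two label sets:
from sklearn.metrics import adjusted_rand_score
# 1.0 means the scaled and unscaled models produced identical partitions
print(adjusted_rand_score(y_pred, kmeans.labels_))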
Agglomerative Clustering
agc = AgglomerativeClustering(n_clusters=4).fit(X_scaled)
clusters = agc.labels_
# plot the cluster assignments
plt.figure(figsize=(16,14))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap="plasma")
plt.xlabel('mean_dist_day')
plt.ylabel('mean_over_speed_perc')
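Agglomerative clustering builds a hierarchy of merges, which a dendrogram can visualize. A minimal sketch using scipy (an addition; ward linkage matches AgglomerativeClustering's default linkage):
from scipy.cluster.hierarchy import dendrogram, linkage
# ward linkage on the scaled features, matching AgglomerativeClustering's default
Z = linkage(X_scaled, method='ward')
plt.figure(figsize=(16, 6))
dendrogram(Z, truncate_mode='lastp', p=20)  # collapse the tree to the last 20 merges
plt.xlabel('cluster size')
plt.ylabel('merge distance')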
We can conclude there are four categories of drivers, roughly the combinations of the two features: low or high daily distance, crossed with low or high over-speed percentage.
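A quick way to characterize those categories (again an addition to the notebook) is to average the features per cluster:
# mean feature values and sizes per agglomerative cluster
profile = X.assign(cluster=clusters).groupby('cluster').mean()
print(profile)
print(pd.Series(clusters).value_counts())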