#MachineLearning #UnsupervisedLearning #Clustering

By Billy Gustave

Zoo Animal Class Clustering

Goal :

  • Predict the animal class using agglomerative clustering
  • Calculate the root-mean-squared error

Data Exploration

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
In [2]:
df = pd.read_csv('zoo.csv')
df.head()
Out[2]:
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize class_type
0 aardvark 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
1 antelope 1 0 0 1 0 0 0 1 1 1 0 0 4 1 0 1 1
2 bass 0 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 4
3 bear 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
4 boar 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1 1
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
animal_name    101 non-null object
hair           101 non-null int64
feathers       101 non-null int64
eggs           101 non-null int64
milk           101 non-null int64
airborne       101 non-null int64
aquatic        101 non-null int64
predator       101 non-null int64
toothed        101 non-null int64
backbone       101 non-null int64
breathes       101 non-null int64
venomous       101 non-null int64
fins           101 non-null int64
legs           101 non-null int64
tail           101 non-null int64
domestic       101 non-null int64
catsize        101 non-null int64
class_type     101 non-null int64
dtypes: int64(17), object(1)
memory usage: 14.3+ KB

No missing values
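That conclusion can be made explicit with a quick per-column check. A minimal sketch, using a toy frame in place of zoo.csv (the toy data here is an assumption, not from the notebook):

```python
import pandas as pd

# Toy frame standing in for zoo.csv (assumption: the same check applies to the real data)
toy = pd.DataFrame({'hair': [1, 0, 1], 'legs': [4, 0, 2]})

# Missing values per column; a grand total of 0 confirms the frame is complete
missing_per_column = toy.isnull().sum()
print(missing_per_column)
print('total missing:', int(missing_per_column.sum()))
```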

Checking Distribution

In [4]:
df.groupby('class_type').size()
Out[4]:
class_type
1    41
2    20
3     5
4    13
5     4
6     8
7    10
dtype: int64
In [5]:
sns.countplot(df.class_type)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x22e247d45c8>
In [6]:
# getting X and y
X = df.loc[:, 'hair':'catsize']
y = df.class_type - 1
In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Modeling and Prediction

With fine-tuning

In [8]:
# using n_clusters = 7 (one cluster per class)
from sklearn.cluster import AgglomerativeClustering
k = 7
agglo = AgglomerativeClustering(n_clusters=k, affinity='euclidean', linkage='average')
# note: fitting on the raw 0/1 features; the X_scaled computed above could be used here instead
y_pred = agglo.fit_predict(X)
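The choice k = 7 matches the number of classes, but it can also be sanity-checked on the linkage tree itself. A hedged sketch using SciPy's hierarchy module, on synthetic data standing in for the scaled zoo features (`X_demo` is an assumption, not from the notebook):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic continuous matrix standing in for X_scaled (30 animals x 16 features)
X_demo = rng.normal(size=(30, 16))

# Average-linkage tree with Euclidean distances, mirroring the estimator above;
# scipy.cluster.hierarchy.dendrogram(Z) would plot it for visual inspection
Z = linkage(X_demo, method='average', metric='euclidean')

# Cut the tree into 7 flat clusters, the equivalent of n_clusters=7
labels = fcluster(Z, t=7, criterion='maxclust')
print('clusters found:', len(np.unique(labels)))
```

Inspecting where the dendrogram's merge distances jump is a standard way to judge whether 7 is a natural cut for this data.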
In [9]:
from sklearn.metrics import mean_squared_error
# root-mean-squared error between the true class labels and the cluster labels
np.sqrt(mean_squared_error(y, y_pred))
Out[9]:
2.0990332522519517
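One caveat worth noting: AgglomerativeClustering numbers its clusters arbitrarily, so an error computed against the raw label ids depends on that numbering. A common remedy (a sketch, not part of the original notebook) is to relabel each cluster with the majority true class it contains before scoring:

```python
import numpy as np

def align_labels(y_true, y_pred):
    """Relabel each predicted cluster with the majority true class inside it."""
    aligned = np.empty_like(y_pred)
    for c in np.unique(y_pred):
        mask = y_pred == c
        # bincount/argmax picks the most frequent true class within cluster c
        aligned[mask] = np.bincount(y_true[mask]).argmax()
    return aligned

# Tiny illustration (assumed data): cluster ids 0/1 happen to correspond to classes 1/0
y_true_demo = np.array([1, 1, 1, 0, 0, 0])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1])
print(align_labels(y_true_demo, y_pred_demo))  # -> [1 1 1 0 0 0]
```

Computing the RMSE on `align_labels(y, y_pred)` instead of `y_pred` would give a score that no longer depends on the arbitrary cluster numbering.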

Contact Me

www.linkedin.com/in/billygustave

billygustave.com