In the previous tutorial we learned how to divide a dataset into the right number of clusters using K-means clustering. In this tutorial we will look at another type of clustering: hierarchical clustering. This algorithm uses a bottom-up approach to divide the dataset into clusters, so let's get started.
The hierarchical clustering algorithm works as follows:
- Make each data point a single-point cluster, giving N clusters
- Take the 2 closest data points and merge them into one cluster, leaving N-1 clusters
- Take the 2 closest clusters and merge them into one cluster, leaving N-2 clusters
- Repeat the previous step until only one cluster remains
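The merging loop above can be sketched in plain Python. This is an illustrative toy (single linkage on 1-D points, with a made-up `agglomerate` helper), not the implementation scikit-learn uses:

```python
# Minimal sketch of bottom-up (agglomerative) merging, assuming
# single linkage on a tiny 1-D dataset. Illustrative only.
def agglomerate(points, n_clusters):
    # Step 1: every point starts as its own single-point cluster (N clusters).
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the two closest clusters: smallest pairwise
        # distance between any of their members (single linkage).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair into one cluster (N -> N-1).
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1.0, 1.1, 5.0, 5.2, 9.0], 2))
# → [[1.0, 1.1], [5.0, 5.2, 9.0]]
```

Stopping the loop at `n_clusters = 1` reproduces the full bottom-up hierarchy described in the steps above.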
We are going to use the same problem as in the k-means clustering tutorial: a mall wants to identify types of customers based on their spending score and annual income, so that they can target the right category for various campaigns.
Choosing the right number of clusters
Since our dataset is the same as in the last tutorial, we already know the right number of clusters: 5. In hierarchical clustering, the right number of clusters can be found using dendrograms, but we will not go into that in depth right now. Knowing the number of clusters, we are ready to fit our dataset with the hierarchical clustering algorithm as below.
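As a quick preview of the dendrogram approach, scipy's `linkage` and `dendrogram` functions can plot the merge tree. The small `X_demo` array below is made up for illustration; with the real data you would pass the Income/Score array `X` loaded in the code that follows:

```python
# Preview of the dendrogram approach (covered in detail in a later post).
# X_demo is a tiny synthetic (income, score) array, for illustration only.
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

X_demo = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [19, 40], [20, 76]])
Z = sch.linkage(X_demo, method='ward')  # Ward linkage, matching the model below
sch.dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()
# Cutting the tree where the vertical gaps are largest suggests
# the number of clusters (5 for the mall dataset).
```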
# Hierarchical Clustering
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values
# Fitting Hierarchical Clustering to the dataset
from sklearn.cluster import AgglomerativeClustering
# Note: the distance parameter is named `metric` in scikit-learn >= 1.2;
# in older versions it was called `affinity` (removed in 1.4).
hc = AgglomerativeClustering(n_clusters = 5, metric = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(X)
# Visualising the clusters
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 10, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 10, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 10, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 10, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 10, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
- AgglomerativeClustering is a class from the sklearn.cluster module in Python.
- We pass 5 as the number of clusters, which we already know from the k-means clustering tutorial.
- metric (called affinity in scikit-learn versions older than 1.2) is the distance metric used to compute the linkage; its default value is 'euclidean'.
- The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pair of clusters that minimizes this criterion. Its default value is 'ward'.
- y_hc holds the predicted customer type for each customer, in the form of a cluster label.
- Finally, we plot the customers in each cluster with a different color.
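To get a feel for what `y_hc` contains, we can count how many customers fall into each cluster. The label array below is made up for illustration; with the real model you would pass the `y_hc` returned by `fit_predict` above:

```python
# Hypothetical follow-up: count customers per cluster from the label array.
# y_hc_demo stands in for the real y_hc produced by fit_predict.
import numpy as np

y_hc_demo = np.array([0, 1, 1, 2, 0, 4, 3, 2, 4, 0])
labels, counts = np.unique(y_hc_demo, return_counts=True)
for lab, cnt in zip(labels, counts):
    print(f"Cluster {lab + 1}: {cnt} customers")
```

Cluster ids are arbitrary (0 through 4 here); only the grouping itself is meaningful.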
Execute the above code and the final result will look like the plot below.
Observe the 5 clusters. They can be categorized as:
- red cluster: customers with a high annual income but a low spending score
- magenta cluster: customers with a low income and a low spending score
- blue cluster: customers with a moderate income and a moderate spending score
- cyan cluster: customers with a low income but a high spending score
- green cluster: customers with a high income and a high spending score
- Hierarchical clustering does not produce centroids, since it forms clusters by thresholding the dendrogram, which we will cover in a future tutorial.
Using these clusters from hierarchical clustering, the mall can create campaigns and target customers accordingly. I hope this article helped you understand hierarchical clustering. Happy Learning 🙂