In previous articles we learned to classify data sets whose categories we already knew. However, we may come across a data set that has no pre-defined classes. In that case we want to identify segments, or clusters, in the data using clustering methods from machine learning. This article deals with one such method, k-means clustering.
The k-means clustering algorithm works as follows:
- Choose the number of clusters k
- Select k random points as the initial centroids
- Assign each data point to its closest centroid
- Compute the new centroid of each cluster (the mean of its points) and move the centroid there
- Reassign each data point to its new closest centroid. If any reassignment took place, compute and place the new centroids again; otherwise, finish
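The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation scikit-learn uses: the function name `kmeans_sketch`, the toy data, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means following the steps listed above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Finish when no point changed its cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two well-separated groups of 2-D points (made up for this sketch)
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                  [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
```

The first three points should end up in one cluster and the last three in the other; which of the two gets index 0 depends on the random starting centroids.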
Because we are going to use Python libraries, we only need to provide the data set; the elbow method will help us choose the right number of clusters.
A mall wants to identify types of customers based on their spending score and annual income, so that it can target the right category in its campaigns.
You can download the data set from superdatascience.com/machine-learning/
Choosing the right number of clusters:
To identify the number of clusters we will use the ‘Elbow Method’. The code snippet follows.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values
# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
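wcss = []
for i in range(1, 26):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(X)
    # inertia_ is the WCSS for this number of clusters
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 26), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()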
In the above code,
- KMeans is a class in the sklearn.cluster module
- wcss (Within-Cluster Sum of Squares) is a list in which we store kmeans.inertia_, the sum of squared distances of samples to their closest cluster center.
- We then created a for loop that iterates from 1 to 25 and, in each iteration, ran the KMeans algorithm with the ‘k-means++’ initialization method and a random state of 0. We also appended kmeans.inertia_ to the wcss list to keep a record of the value for each number of clusters from 1 to 25. These values are used to plot the elbow graph.
- After executing the above code, a graph will be plotted as below
In the graph, we can see a curve that looks like an elbow. The point where the curve bends gives the optimum number of clusters for our data set, i.e. 5.
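To make the WCSS value behind the elbow plot more concrete, the sketch below computes it by hand and compares it with kmeans.inertia_. It runs on synthetic points rather than the mall data; the names `X_demo`, `model` and `wcss_manual`, and the explicit n_init = 10, are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the two feature columns (assumption: not the mall data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=center, scale=0.5, size=(30, 2))
                    for center in ([0, 0], [5, 5], [0, 5])])

model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X_demo)

# WCSS: squared distance from each point to the centroid of its own cluster, summed
own_centroids = model.cluster_centers_[model.labels_]
wcss_manual = ((X_demo - own_centroids) ** 2).sum()

print(wcss_manual, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding. WCSS keeps shrinking as clusters are added, which is why we look for the bend in the curve instead of its minimum.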
After choosing the right number of clusters, we are ready to fit K-Means to the data set and visualize the types of customers in it. To see the final results, execute the following code snippet.
# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(X)
# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 10, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 10, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 10, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 10, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 10, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 30, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
# Show the cluster labels defined above and display the figure
plt.legend()
plt.show()
- From the elbow method we have 5 clusters, and we passed the same number to the KMeans algorithm
- y_kmeans holds the predicted type of each customer as a cluster index
- Then we plotted the actual customers, with a different color for each type
Execute the above code and the final result will look like below
Observe the 5 clusters. They can be categorized as follows:
- red cluster: customers with high annual income but low spending score
- magenta cluster: customers with low income and low spending score
- blue cluster: customers with moderate income and moderate spending score
- cyan cluster: customers with low income but high spending score
- green cluster: customers with high income and high spending score
- yellow points are the centroids of the clusters
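Which cluster index ends up as which segment can vary between runs and library versions, so it may help to check the centroids numerically instead of relying on colors alone. The sketch below does this on synthetic data: the five segment centers, the names `X_demo`, `kmeans_demo` and `labels_demo`, and the "new customer" are all made up for illustration and are not values from the mall data set.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (income, spending score) pairs around five made-up segment centers;
# these stand in for the mall data and are NOT the real values
rng = np.random.default_rng(42)
segment_centers = [(25, 20), (25, 80), (55, 50), (90, 15), (90, 85)]
X_demo = np.vstack([rng.normal(loc=c, scale=4.0, size=(40, 2))
                    for c in segment_centers])

kmeans_demo = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0)
labels_demo = kmeans_demo.fit_predict(X_demo)

# One line per cluster: index, centroid coordinates, number of members
sizes = np.bincount(labels_demo, minlength=5)
for idx, (center, size) in enumerate(zip(kmeans_demo.cluster_centers_, sizes)):
    print(f"Cluster {idx + 1}: income ~{center[0]:.0f}k$, "
          f"score ~{center[1]:.0f}, customers: {size}")

# Assign a new, hypothetical customer to one of the fitted clusters
new_customer = np.array([[88.0, 90.0]])
print("New customer falls in cluster", kmeans_demo.predict(new_customer)[0] + 1)
```

On the real data you would pass X from the earlier snippet in place of `X_demo`. Reading the centroid coordinates this way gives a quick check on descriptions such as "high income, low spending score".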
Using this data from k-means clustering, the mall can create its campaigns and target customers accordingly. I hope this article helped you understand k-means clustering. Happy Learning 🙂