K-means clustering

I am a technology enthusiast and always up for challenges. Recently I have started getting my hands dirty with machine learning in Python, and I aspire to learn everything I can.

By Renuka Joshi

In previous articles we learned to classify data sets whose categories were known in advance. However, we may come across a data set that has no pre-defined classes. In that case we want to identify segments, or clusters, in the data set by using clustering methods in machine learning. This article deals with one of these methods, k-means clustering.

K-Means Clustering:

The K-Means clustering algorithm works as follows:

  • Choose the number of clusters k.
  • Select k random points as the initial centroids.
  • Assign each data point to its closest centroid.
  • Compute the new centroid of each cluster (the mean of the points assigned to it) and move the centroid there.
  • Reassign each data point to its new closest centroid. If any reassignment took place, compute and place the new centroids again; otherwise, finish.
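
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the scikit-learn implementation used later in the article; the function names (k_means, closest, squared_distance), the seed and max_iter parameters, and the toy points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centroids):
    """Index of the centroid nearest to `point` (ties go to the lower index)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` into k groups; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points at random as the initial centroids.
    centroids = rng.sample(points, k)
    # Step 3: assign each point to its closest centroid.
    assignment = [closest(p, centroids) for p in points]
    for _ in range(max_iter):
        # Step 4: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
        # Step 5: reassign; stop once no point changes cluster.
        new_assignment = [closest(p, centroids) for p in points]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centroids, assignment


# Toy data (invented for illustration): two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

On this toy input the three low points should end up in one cluster and the three high points in the other; which cluster gets index 0 depends on the random initial centroids.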

Since we are going to use Python libraries, we only need to provide the data-set: the library runs the steps above for us, and the Elbow method helps us identify the right number of clusters.


Problem Statement:

A mall wants to identify types of customers based on their spending score and annual income, so that it can target the right category of customers in its various campaigns.

Sample dataset:

[Figure: Mall data-set]

You can download the data-set from superdatascience.com/machine-learning/

Choosing the right number of clusters:

To identify the number of clusters we will use the ‘Elbow Method’ with the K-means clustering algorithm. Following is the code snippet.
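
The original listing is not shown here, so what follows is a minimal sketch of what such a snippet might look like rather than the author's exact code. It assumes scikit-learn is installed (and matplotlib, for the plot), uses synthetic points from make_blobs as a stand-in for the annual income and spending score columns of the mall data-set, and sets n_init=10 explicitly. The helper names compute_wcss and plot_elbow are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data (an assumption): 200 synthetic 2-D points in 5 groups.
# With the real data-set, X would hold the annual income and spending
# score columns loaded from the downloaded CSV file instead.
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=0)


def compute_wcss(X, max_k=25):
    """Return the WCSS (kmeans.inertia_) for k = 1 .. max_k."""
    wcss = []
    for k in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=0)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)  # sum of squared distances to centers
    return wcss


def plot_elbow(wcss):
    """Plot WCSS against the number of clusters and return the Axes."""
    # Imported here so the WCSS computation runs without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(1, len(wcss) + 1), wcss, marker="o")
    ax.set_title("Elbow Method")
    ax.set_xlabel("Number of clusters")
    ax.set_ylabel("WCSS")
    return ax


wcss = compute_wcss(X)
```

Calling plot_elbow(wcss) followed by matplotlib's plt.show() would display the elbow graph.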

In the above code,

  • KMeans is a class from the sklearn.cluster module.
  • wcss (Within-Cluster Sum of Squares) is a list in which we store kmeans.inertia_, the sum of squared distances of samples to their closest cluster center.
  • We then loop from 1 to 25 and, in each iteration, run the KMeans algorithm with the ‘k-means++’ initialization method and a random state of 0. We append kmeans.inertia_ to the wcss list to keep a record of the value for each number of clusters from 1 to 25. These values are used to plot the elbow graph.
  • After executing the above code, a graph will be plotted as below.
[Figure: Elbow Method]

In the graph, we can see a curve that looks like an elbow. The point where the curve bends, beyond which the WCSS decreases only slowly, gives the optimum number of clusters for our data-set, i.e. 5.
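
For reference, the WCSS value on the vertical axis of the elbow graph can be written as below. The symbols are notation introduced here for illustration: k is the number of clusters, C_j is the set of points assigned to cluster j, and \mu_j is the centroid of that cluster.

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Adding a cluster generally lowers this sum, which is why the curve keeps falling and we look for the bend rather than the minimum.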

After choosing the right number of clusters, we are ready to fit K-Means to the data-set and visualize the types of customers in it. To see the final results, execute the following code snippet.
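
As before, the original listing is not shown, so this is only a sketch of what it might look like. It assumes scikit-learn (and matplotlib, for the plot), reuses synthetic make_blobs points as a stand-in for the mall data, and sets n_init=10 explicitly. The plot_clusters helper is hypothetical; the colors follow the ones named later in the article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data (an assumption): with the real data-set, X would hold the
# annual income and spending score columns of the mall customers.
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=0)

# Fit K-Means with the 5 clusters suggested by the elbow method and get
# the cluster index (0..4) of every sample.
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)


def plot_clusters(X, labels, centers):
    """Scatter-plot each cluster in its own color, plus the centroids."""
    # Imported here so the clustering itself runs without matplotlib.
    import matplotlib.pyplot as plt

    colors = ["red", "blue", "green", "cyan", "magenta"]
    fig, ax = plt.subplots()
    for cluster, color in enumerate(colors):
        members = X[labels == cluster]
        ax.scatter(members[:, 0], members[:, 1], s=30, c=color,
                   label=f"Cluster {cluster + 1}")
    ax.scatter(centers[:, 0], centers[:, 1], s=200, c="yellow",
               edgecolors="black", label="Centroids")
    ax.set_xlabel("Annual income (stand-in feature 1)")
    ax.set_ylabel("Spending score (stand-in feature 2)")
    ax.legend()
    return ax
```

Calling plot_clusters(X, y_kmeans, kmeans.cluster_centers_) followed by plt.show() would draw the figure. Which color ends up on which group depends on the cluster indices K-Means happens to assign.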


  • From the elbow method we have 5 clusters, and we pass the same number to the KMeans algorithm.
  • y_kmeans holds the predicted type of each customer, in the form of a cluster index.
  • We then plot the actual customers, giving each type a different color.

Execute the above code and the final result will look like below.

[Figure: KMeans Clustering]

Observe the 5 clusters. They can be categorized as follows:

  • red cluster: Customers with high annual income but low spending score
  • magenta cluster: Customers with low income and low spending score
  • blue cluster: Customers with moderate income and moderate spending score
  • cyan cluster: Customers with low income but high spending score
  • green cluster: Customers with high income and high spending score
  • yellow points are the centroids of each cluster
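
Descriptions like these can be backed up by the average income and spending score of each cluster. The sketch below shows one way to compute them in plain Python; the summarize_clusters helper and the toy values are invented for illustration and are not taken from the mall data-set.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(points, labels):
    """Return {cluster: (mean income, mean spending score, size)}."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: (mean(p[0] for p in members),
                mean(p[1] for p in members),
                len(members))
        for label, members in sorted(groups.items())
    }


# Toy (income, spending score) pairs and cluster labels, made up for the demo.
points = [(15, 80), (17, 90), (85, 10), (95, 20), (50, 50)]
labels = [0, 0, 1, 1, 2]
summary = summarize_clusters(points, labels)
for cluster, (income, score, size) in summary.items():
    print(cluster, income, score, size)
```

With real results, points would be the customer rows and labels the y_kmeans values, so each cluster's averages can be compared against the overall averages.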

Using these results from K-means clustering, the mall can create its campaigns and target customers accordingly. I hope this article helped you understand K-means clustering. Happy Learning 🙂

