If you remember how your parents or grandparents used a sieve to segregate stones and other unwanted stuff from grains, then the good news is you already know the fundamentals of clustering.

Just as a sieve separates a mixture of grains and stones, with similar items collecting in different spots, clustering algorithms separate a mixture of data points into clusters of similar data points.

Source: carousell.com

Clustering is an unsupervised machine learning technique that tries to identify the underlying structure of the data.

There are several types of clustering. Three of them, K-Means clustering, Hierarchical clustering, and density-based clustering (DBSCAN), are widely used in industry.

In this article, I am writing about K-Means Clustering and covering the following:

What is Clustering?

How does K-Means Clustering work?

How to select the optimum number of K?

What is Clustering?

Let’s look at a real-life scenario.


Source: workersfirstnz.com

If we go to a supermarket, we can see that all the items or products are arranged in groups. For example, apples and oranges are arranged in the fruits section: apples are placed in one spot and oranges in another. The same applies to other products like oil, grains, rice, etc.

Source: Igor Ovsyannykov from Pixabay

What a supermarket does is create groups, or clusters, so that shoppers can pick items easily without wasting time. This is the concept of clustering.

Let’s define clustering now.

Clustering is an unsupervised machine learning technique that splits a dataset into groups of similar data points. Data points in the same group are similar to each other, and data points in different groups are dissimilar.

How does K-Means Clustering work?

K-Means clustering is a very popular and simple clustering technique. Its main objective is to group similar data points into clusters. Here, ‘K’ is the number of clusters, which is predefined.

Let’s take an example.

We have a dataset with three features (three variables) and a total of 200 observations. Let’s assume K = 3 (the number of clusters). As we know, clustering groups similar data points together, so our dataset is divided into three clusters, each containing similar points.

From the above figure, our dataset has been divided into three clusters: cluster 1, cluster 2 and cluster 3.

OK, but how does the K-Means algorithm know which points are similar in order to form the clusters?

For every cluster, the algorithm assigns a point called the centroid, which acts as the central point of that cluster. In the figure below, we can see the centroid of each cluster.

K-Means clustering is also called centroid-based clustering. If K = 5, then we get five centroids, and if K = 4, then we have four centroids.

In our example above, the number of clusters (K) is 3, so there are three centroids. Once the centroids are found, each data point is assigned to its nearest centroid. In this case, three centroids, C1, C2 and C3, were created, and every data point was assigned to the nearest of them to form the clusters. From the above picture, adding up the data points in all three clusters gives our total of 200 observations. Note that no data point belongs to two clusters.
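To make this concrete, here is a minimal sketch of the same idea in Python using scikit-learn. The data below is synthetic stand-in data (200 observations, three features), not the dataset from the figures, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the example: 200 observations, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# K is predefined; here K = 3, so K-Means finds three centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)            # the three centroids C1, C2, C3
print(np.bincount(kmeans.labels_))        # cluster sizes
print(np.bincount(kmeans.labels_).sum())  # 200 -> every point is in exactly one cluster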

Here, the important task is to minimize the sum of distances of the points to their respective centroids.

In simpler terms, the goal of K-Means clustering is to find the centroid of each cluster and assign every data point to its nearest centroid.

How do we find the Centroid?

Finding the centroids is not an easy task. It is an iterative process, and Lloyd’s algorithm is used to solve this problem.

The objective of K-Means clustering is to minimize the sum of squared distances between all points and the cluster center.
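As a concrete illustration, here is a small Python sketch of that objective (the sum of squared distances of every point to the centroid of its assigned cluster). The array values are made-up toy numbers, not taken from the article’s figures.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances of every point to the centroid of its cluster."""
    diffs = X - centroids[labels]      # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))

# Toy example: 4 points, 2 clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])
print(kmeans_objective(X, labels, centroids))
```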

Source: saedsayad.com

Let’s say we have two features, X1 and X2. X1 is plotted on the x-axis and X2 on the y-axis. Here, we need to identify clusters in these two features, and we don’t have any target label.


Step 1:

Cluster the data into K groups, where K is predefined. Assume K = 2; that means two centroids.

Place these two centroids at random places. 

Step 2:

Calculate the distance of each data point from the centroids and assign it to a cluster.

Distance measures help determine the similarity between two elements, and the chosen measure influences the shape of the clusters. Normally, the Euclidean distance is used. The Euclidean distance is the ordinary straight-line (shortest) distance between two points in Euclidean space.
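For example, the Euclidean distance between two points can be computed as below; the coordinates are made-up values for illustration.

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Straight-line (Euclidean) distance: sqrt((1-4)^2 + (2-6)^2) = 5.0
print(np.linalg.norm(p - q))
```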

In this step, we calculate the distance of each data point from these two centroids. For example, if a data point is nearer to the green centroid, it belongs to the green cluster; if it is nearer to the black centroid, it belongs to the black cluster.


To visualize the assignment, draw a line connecting the two centroids and then a perpendicular line through its midpoint. Data points on the left-hand side belong to the green cluster, and those on the right-hand side belong to the black cluster.

Step 3: Adjust each centroid so that it becomes the centre of its cluster

Adjust the centroids by calculating the mean of all the data points in each cluster. After the centroids move, some data points may now be nearer to a different centroid.
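For instance, the new position of a centroid is simply the mean of the points currently assigned to its cluster; a toy sketch with made-up points:

```python
import numpy as np

# Points currently assigned to one cluster (made-up values)
cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 0.0]])

# The adjusted centroid is the mean of these points
new_centroid = cluster_points.mean(axis=0)
print(new_centroid)   # [3. 2.]
```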

Step 4: Re-cluster every data point based on its distance from the new centroids

The assignment step is repeated: each data point moves to its nearest cluster, and then new centroids are calculated as in step 3. We can see the new centroids in the figure.


Step 5: Adjust the centroids again

This process keeps going until the centroid movements become negligible. The final groups are cluster 1 and cluster 2.
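Putting steps 1 to 5 together, here is a compact from-scratch sketch of Lloyd’s algorithm in Python. It is meant to illustrate the loop, not to replace a library implementation such as scikit-learn’s KMeans, and the toy data at the bottom is made up.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: place K centroids at random (here: K randomly chosen data points)
    centroids = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        # Steps 4-5: repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Toy usage on random 2-D data with two obvious groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)
```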

Below is an animation of K-Means clustering with three clusters. Reference: Shabal.in


Let’s calculate the clusters for a single feature X = {1, 2, 6, 7, 8, 10, 15, 17, 20}, assuming K = 3.

Steps:

  1. Assume K (number of clusters) = 3
  2. Randomly select three centroids
  3. Calculate the distance of each point from the centroids using the Euclidean distance, and put each point in the cluster of its nearest centroid
  4. Take the mean value of each cluster as its new centroid
  5. Repeat steps 3 and 4 until the cluster means stop changing



Iteration 1: the clusters are Cluster 1 = {1, 2}, Cluster 2 = {8, 10, 15, 17, 20}, Cluster 3 = {6, 7}

Calculate the new centroids by taking cluster means

C1 = (1+2)/2 = 1.5

C2 = (8+10+15+17+20)/5 = 14

C3 = (6+7)/2 = 6.5


Iteration 2: the clusters are Cluster 1 = {1, 2}, Cluster 2 = {15, 17, 20}, Cluster 3 = {6, 7, 8, 10}

Calculate the new centroids by taking cluster means

C1 = (1+2)/2 = 1.5

C2 = (15+17+20)/3 = 17.33

C3 = (6+7+8+10)/4 = 7.75

Cluster 1 has not changed, but cluster 2 and cluster 3 have changed, so we need to recalculate the centroids.


Iteration 3: the clusters are Cluster 1 = {1, 2}, Cluster 2 = {15, 17, 20}, Cluster 3 = {6, 7, 8, 10}

The clusters are the same in iterations 2 and 3 and there is no further movement, so we can conclude that the final clusters are Cluster 1 = {1, 2}, Cluster 2 = {15, 17, 20}, Cluster 3 = {6, 7, 8, 10}.

In iteration 1, the distances to centroid 2 and centroid 3 were the same for the data point 7, and we placed it in cluster 3. Had we placed it in cluster 2 instead, we would have reached the final answer in iteration 2.
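The same loop can be run in code. The worked example does not state which initial centroids were picked, so the values 2, 8 and 6 below are an assumption; with a different starting choice or a different tie-break for the point 7, the intermediate iterations can differ from those above, but the loop still converges to the same final clusters {1, 2}, {15, 17, 20} and {6, 7, 8, 10}.

```python
import numpy as np

# One-dimensional data from the worked example
X = np.array([1, 2, 6, 7, 8, 10, 15, 17, 20], dtype=float)

# Assumed initial centroids (not stated in the worked example)
centroids = np.array([2.0, 8.0, 6.0])

for iteration in range(1, 20):
    # Assign each value to its nearest centroid (Euclidean distance in 1-D)
    labels = np.abs(X[:, None] - centroids[None, :]).argmin(axis=1)

    # New centroid = mean of the values in each cluster
    new_centroids = np.array([X[labels == k].mean() for k in range(len(centroids))])

    print(f"Iteration {iteration}:",
          [sorted(X[labels == k].tolist()) for k in range(len(centroids))])

    if np.allclose(new_centroids, centroids):   # cluster means stopped changing
        break
    centroids = new_centroids
```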

How to select the value of K?

So far, we have assumed the value of K to be 2 or 3. But how do we select a K that is optimal for a given dataset? One of the commonly used methods is the Elbow method.

The Elbow method is a graphical comparison of the number of clusters (K) against the within-cluster sum of squares (WCSS). WCSS (sometimes written WSS) is the sum of the squared distances of the data points from their respective centroids, summed over all clusters:

WCSS = sum over clusters k of sum over points x_i in cluster k of (x_i − c_k)²

Here, c_k is the centroid of cluster k. So, when K = 3, there will be three clusters and thus three centroids contributing to the sum.

Basically, a range of K values suitable for the given dataset is chosen. For every K, the data is clustered into K clusters and the WCSS is calculated.

Now, plot K vs WCSS. The optimal K is the value at which the rate of change in WCSS levels off, that is, the K at which WCSS is already low and adding another cluster does not reduce it by much.

Based on the dataset and domain knowledge, we can choose a range for the K value. Let’s assume K = 1 to 10. This will result in 10 values of K and 10 corresponding WCSS values. Below is the plot between the two:

 

 

As we can see, the rate of decrease in WCSS slows sharply at K = 4 and then levels off. Thus, we can conclude that the optimal value of K for this dataset is 4.
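Here is a sketch of how such an elbow plot can be produced with scikit-learn. The data is synthetic (four generated blobs, so the elbow should fall near K = 4); with your own dataset, the shape of the curve will of course differ.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 natural groups
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

k_values = range(1, 11)
wcss = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # inertia_ is the within-cluster sum of squares

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```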

This is an overview of how K-Means clustering works and its principles.

