Created: Apr 30, 2025

K-Means Clustering

Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters.

It is a key task in exploratory data mining and is widely used in various fields, including:

  • Machine Learning

  • Pattern Recognition

  • Image Analysis

  • Bioinformatics

  • Computer Graphics


K-Means Clustering

The K-means clustering algorithm is one of the simplest unsupervised learning algorithms for solving clustering problems.

Suppose we are required to partition a given dataset into a certain number of clusters, say K clusters.

  1. We start by choosing K points arbitrarily as the centers of the clusters, one for each cluster.

  2. We then assign each of the given data points to the nearest center.

  3. We take the average of the data points assigned to a center and replace the center with that average.

  4. This is done for each of the centers.

  5. We repeat the process until the centers stop changing, i.e., until they converge to fixed points.

The data points nearest to the centers form the various clusters in the dataset. Each cluster is represented by the associated center.
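The steps above can be written out as a minimal plain-Python sketch (an illustration, not an optimized implementation; the function name `kmeans` and the convergence-by-exact-equality test are choices of this sketch). The data points and initial centers used here are the ones from the worked example that follows.

```python
def kmeans(points, centers, max_iter=100):
    """Repeat assign-and-average until the centers stop changing (Steps 2-5)."""
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center (Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            dists = [((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5
                     for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Steps 3-4: replace each center with the mean of its assigned points.
        new_centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
        # Step 5: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
centers, clusters = kmeans(points, centers=[(2, 1), (2, 3)])
```

Running this on the example data converges to the centers (2, 1.75) and (4.5, 4), matching the result derived step by step below.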


Example

Use the K-Means clustering algorithm to divide the following data into 2 clusters, and compute the representative data points (centroids) for each cluster.

| x | 1 | 2 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| y | 1 | 1 | 3 | 2 | 3 | 5 |

Plotting the data points

[Figure: Scatter plot of the six data points (1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5).]

1. For this problem the required number of clusters is 2, so K = 2.

2. We choose two of the data points arbitrarily as the initial cluster centers; here C1 = (2, 1) and C2 = (2, 3).

3. We compute the Euclidean distance of each of the given data points from the two cluster centers.

Iteration 1: Distance Table

| Data Point | Distance from C1 (2, 1) | Distance from C2 (2, 3) | Min Distance | Assigned Cluster |
|------------|-------------------------|-------------------------|--------------|------------------|
| (1, 1)     | 1                       | 2.24                    | 1            | 1                |
| (2, 1)     | 0                       | 2                       | 0            | 1                |
| (2, 3)     | 2                       | 0                       | 0            | 2                |
| (3, 2)     | 1.41                    | 1.41                    | 1.41         | 1                |
| (4, 3)     | 2.83                    | 2                       | 2            | 2                |
| (5, 5)     | 5                       | 3.61                    | 3.61         | 2                |

📌 Note:

The distances of (3, 2) from C1 and C2 are equal (both 1.41), so we assign it to Cluster 1 arbitrarily.
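The distance values in the table can be reproduced with Python's `math.dist` (Euclidean distance), taking (2, 1) and (2, 3) as the initial centers:

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
c1, c2 = (2, 1), (2, 3)  # initial centers for Iteration 1

table = []
for p in points:
    d1, d2 = round(dist(p, c1), 2), round(dist(p, c2), 2)
    # Assign each point to the cluster with the smaller distance;
    # ties are broken in favor of Cluster 1, as in the table.
    table.append((p, d1, d2, min(d1, d2), 1 if d1 <= d2 else 2))
```

Each tuple in `table` is one row: point, distance to C1, distance to C2, minimum distance, and assigned cluster.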


Cluster Division

Thus, the data is divided into two clusters.

  • Cluster 1 contains 3 data points: (1, 1), (2, 1), (3, 2)

  • Cluster 2 contains 3 data points: (2, 3), (4, 3), (5, 5)


Recalculation of Cluster Centers

Now, we recalculate the cluster centers by taking the mean of the points in each cluster:

  • New Center of Cluster 1
    = Average of (1, 1), (2, 1), (3, 2)

  • New Center of Cluster 2
    = Average of (2, 3), (4, 3), (5, 5)


  • New Center of Cluster 1 : (2, 1.33)

  • New Center of Cluster 2 : (3.67, 3.67)

We'll now compute the distances of all data points from these new centers and reassign them accordingly.
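The recalculated centers are just coordinate-wise means of each cluster, which can be checked with a small helper (the name `centroid` and the 2-decimal rounding are choices of this sketch):

```python
def centroid(cluster):
    # Coordinate-wise mean of a list of 2-D points, rounded to 2 decimals.
    n = len(cluster)
    return (round(sum(x for x, _ in cluster) / n, 2),
            round(sum(y for _, y in cluster) / n, 2))

cluster1 = [(1, 1), (2, 1), (3, 2)]   # Cluster 1 after Iteration 1
cluster2 = [(2, 3), (4, 3), (5, 5)]   # Cluster 2 after Iteration 1
```

`centroid(cluster1)` gives (2, 1.33) and `centroid(cluster2)` gives (3.67, 3.67), the new centers stated above.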

📌 Notes

Hyperparameter: In Machine Learning, a hyperparameter is a value set before the training process begins; by contrast, the values of other parameters are learned during training. In K-means, the number of clusters K is a hyperparameter.

Iteration 2: Distance Table

| Data Point | Distance from C1 (2, 1.33) | Distance from C2 (3.67, 3.67) | Min Distance | Assigned Cluster |
|------------|----------------------------|-------------------------------|--------------|------------------|
| (1, 1)     | 1.05                       | 3.77                          | 1.05         | 1                |
| (2, 1)     | 0.33                       | 3.14                          | 0.33         | 1                |
| (2, 3)     | 1.67                       | 1.80                          | 1.67         | 1                |
| (3, 2)     | 1.20                       | 1.80                          | 1.20         | 1                |
| (4, 3)     | 2.60                       | 0.75                          | 0.75         | 2                |
| (5, 5)     | 4.74                       | 1.89                          | 1.89         | 2                |

Cluster Division

Thus, the data is divided into two clusters.

  • Cluster 1 contains 4 data points: (1, 1), (2, 1), (2, 3), (3, 2)

  • Cluster 2 contains 2 data points: (4, 3), (5, 5)


Recalculation of Cluster Centers

Now, we recalculate the cluster centers by taking the mean of the points in each cluster:

  • New Center of Cluster 1
    = Average of (1, 1), (2, 1), (2, 3), (3, 2)

  • New Center of Cluster 2
    = Average of (4, 3), (5, 5)


  • New Center of Cluster 1 : (2, 1.75)

  • New Center of Cluster 2 : (4.5, 4)

Iteration 3: Distance Table

| Data Point | Distance from C1 (2, 1.75) | Distance from C2 (4.5, 4) | Min Distance | Assigned Cluster |
|------------|----------------------------|---------------------------|--------------|------------------|
| (1, 1)     | 1.25                       | 4.61                      | 1.25         | 1                |
| (2, 1)     | 0.75                       | 3.91                      | 0.75         | 1                |
| (2, 3)     | 1.25                       | 2.69                      | 1.25         | 1                |
| (3, 2)     | 1.03                       | 2.50                      | 1.03         | 1                |
| (4, 3)     | 2.36                       | 1.12                      | 1.12         | 2                |
| (5, 5)     | 4.42                       | 1.12                      | 1.12         | 2                |

Final Clusters (No Change Detected)

  • Cluster 1 still contains 4 data points: (1, 1), (2, 1), (2, 3), (3, 2)

  • Cluster 2 still contains 2 data points: (4, 3), (5, 5)

The assignments are identical to those of Iteration 2, so the centers no longer change and the algorithm terminates. The final centers, (2, 1.75) and (4.5, 4), are the representative data points (centroids) of the two clusters.
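Convergence can be verified directly: reassigning every point using the last centers leaves the clusters unchanged (a quick check, using `math.dist` for Euclidean distance):

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
c1, c2 = (2, 1.75), (4.5, 4)  # centers after Iteration 2

# Reassign each point to its nearest center; the labels match the
# previous iteration, so no center will move and the algorithm stops.
labels = [1 if dist(p, c1) <= dist(p, c2) else 2 for p in points]
```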