Created: Apr 30, 2025

K-Means Clustering

Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters.

It is a key task in exploratory data mining and is widely used in various fields, including:

  • Machine Learning

  • Pattern Recognition

  • Image Analysis

  • Bioinformatics

  • Computer Graphics


K-Means Clustering

The K-means clustering algorithm is one of the simplest unsupervised learning algorithms for solving clustering problems.

Suppose we are required to partition a given dataset into a certain number of clusters, say K clusters.

  1. We start by choosing K points arbitrarily as the centers of the clusters, one for each cluster.

  2. We then assign each of the given data points to the nearest center.

  3. We take the average of the data points assigned to a center and replace the center with that average.

  4. This is done for each of the centers.

  5. We repeat the process until the centers stop changing, i.e., until they converge to fixed points.

The data points nearest to the centers form the various clusters in the dataset. Each cluster is represented by the associated center.
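The steps above can be written out as a minimal plain-Python sketch (an illustration, not an optimized implementation; the function name `kmeans` and the convergence-by-exact-equality test are choices of this sketch). The data points and initial centers used here are the ones from the worked example that follows.

```python
def kmeans(points, centers, max_iter=100):
    """Repeat assign-and-average until the centers stop changing (Steps 2-5)."""
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center (Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            dists = [((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5
                     for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Steps 3-4: replace each center with the mean of its assigned points.
        new_centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
        # Step 5: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
centers, clusters = kmeans(points, centers=[(2, 1), (2, 3)])
```

Running this on the example data converges to the centers (2, 1.75) and (4.5, 4), matching the result derived step by step below.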


Example

Use the K-Means clustering algorithm to divide the following data into 2 clusters, and compute the representative data points (centroids) for each cluster.

| x | 1 | 2 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| y | 1 | 1 | 3 | 2 | 3 | 5 |

Plotting the data points

[Figure: Scatter plot of the six data points (1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5).]

1. For this problem the required number of clusters is 2, so K = 2.

2. We choose two of the data points arbitrarily as the initial cluster centers; here C1 = (2, 1) and C2 = (2, 3).

3. We compute the Euclidean distance of each of the given data points from the two cluster centers.

Iteration 1: Distance Table

| Data Point | Distance from C1 (2, 1) | Distance from C2 (2, 3) | Min Distance | Assigned Cluster |
|------------|-------------------------|-------------------------|--------------|------------------|
| (1, 1)     | 1                       | 2.24                    | 1            | 1                |
| (2, 1)     | 0                       | 2                       | 0            | 1                |
| (2, 3)     | 2                       | 0                       | 0            | 2                |
| (3, 2)     | 1.41                    | 1.41                    | 1.41         | 1                |
| (4, 3)     | 2.83                    | 2                       | 2            | 2                |
| (5, 5)     | 5                       | 3.61                    | 3.61         | 2                |

📌 Note:

The distances of (3, 2) from C1 and C2 are equal (both 1.41), so we assign it to Cluster 1 arbitrarily.
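The distance values in the table can be reproduced with Python's `math.dist` (Euclidean distance), taking (2, 1) and (2, 3) as the initial centers:

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
c1, c2 = (2, 1), (2, 3)  # initial centers for Iteration 1

table = []
for p in points:
    d1, d2 = round(dist(p, c1), 2), round(dist(p, c2), 2)
    # Assign each point to the cluster with the smaller distance;
    # ties are broken in favor of Cluster 1, as in the table.
    table.append((p, d1, d2, min(d1, d2), 1 if d1 <= d2 else 2))
```

Each tuple in `table` is one row: point, distance to C1, distance to C2, minimum distance, and assigned cluster.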


Cluster Division

Thus, the data is divided into two clusters.

  • Cluster 1 contains 3 data points: (1, 1), (2, 1), (3, 2)

  • Cluster 2 contains 3 data points: (2, 3), (4, 3), (5, 5)


Recalculation of Cluster Centers

Now, we recalculate the cluster centers by taking the mean of the points in each cluster:

  • New Center of Cluster 1
    = Average of (1, 1), (2, 1), (3, 2)

  • New Center of Cluster 2
    = Average of (2, 3), (4, 3), (5, 5)


  • New Center of Cluster 1 : (2, 1.33)

  • New Center of Cluster 2 : (3.67, 3.67)

We'll now compute the distances of all data points from these new centers and reassign them accordingly.
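The recalculated centers are just coordinate-wise means of each cluster, which can be checked with a small helper (the name `centroid` and the 2-decimal rounding are choices of this sketch):

```python
def centroid(cluster):
    # Coordinate-wise mean of a list of 2-D points, rounded to 2 decimals.
    n = len(cluster)
    return (round(sum(x for x, _ in cluster) / n, 2),
            round(sum(y for _, y in cluster) / n, 2))

cluster1 = [(1, 1), (2, 1), (3, 2)]   # Cluster 1 after Iteration 1
cluster2 = [(2, 3), (4, 3), (5, 5)]   # Cluster 2 after Iteration 1
```

`centroid(cluster1)` gives (2, 1.33) and `centroid(cluster2)` gives (3.67, 3.67), the new centers stated above.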

📌 Notes

Hyperparameter: In Machine Learning, a hyperparameter is a value set before the training process begins; by contrast, the values of other parameters are learned during training. In K-means, the number of clusters K is a hyperparameter.

Iteration 2: Distance Table

| Data Point | Distance from C1 (2, 1.33) | Distance from C2 (3.67, 3.67) | Min Distance | Assigned Cluster |
|------------|----------------------------|-------------------------------|--------------|------------------|
| (1, 1)     | 1.05                       | 3.77                          | 1.05         | 1                |
| (2, 1)     | 0.33                       | 3.14                          | 0.33         | 1                |
| (2, 3)     | 1.67                       | 1.80                          | 1.67         | 1                |
| (3, 2)     | 1.20                       | 1.80                          | 1.20         | 1                |
| (4, 3)     | 2.60                       | 0.75                          | 0.75         | 2                |
| (5, 5)     | 4.74                       | 1.89                          | 1.89         | 2                |

Cluster Division

Thus, the data is divided into two clusters.

  • Cluster 1 contains 4 data points: (1, 1), (2, 1), (2, 3), (3, 2)

  • Cluster 2 contains 2 data points: (4, 3), (5, 5)


Recalculation of Cluster Centers

Now, we recalculate the cluster centers by taking the mean of the points in each cluster:

  • New Center of Cluster 1
    = Average of (1, 1), (2, 1), (2, 3), (3, 2)

  • New Center of Cluster 2
    = Average of (4, 3), (5, 5)


  • New Center of Cluster 1 : (2, 1.75)

  • New Center of Cluster 2 : (4.5, 4)

Iteration 3: Distance Table

| Data Point | Distance from C1 (2, 1.75) | Distance from C2 (4.5, 4) | Min Distance | Assigned Cluster |
|------------|----------------------------|---------------------------|--------------|------------------|
| (1, 1)     | 1.25                       | 4.61                      | 1.25         | 1                |
| (2, 1)     | 0.75                       | 3.91                      | 0.75         | 1                |
| (2, 3)     | 1.25                       | 2.69                      | 1.25         | 1                |
| (3, 2)     | 1.03                       | 2.50                      | 1.03         | 1                |
| (4, 3)     | 2.36                       | 1.12                      | 1.12         | 2                |
| (5, 5)     | 4.42                       | 1.12                      | 1.12         | 2                |

Final Clusters (No Change Detected)

  • Cluster 1 still contains 4 data points: (1, 1), (2, 1), (2, 3), (3, 2)

  • Cluster 2 still contains 2 data points: (4, 3), (5, 5)

The assignments are identical to those of Iteration 2, so the centers no longer change and the algorithm terminates. The final centers, (2, 1.75) and (4.5, 4), are the representative data points (centroids) of the two clusters.
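Convergence can be verified directly: reassigning every point using the last centers leaves the clusters unchanged (a quick check, using `math.dist` for Euclidean distance):

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
c1, c2 = (2, 1.75), (4.5, 4)  # centers after Iteration 2

# Reassign each point to its nearest center; the labels match the
# previous iteration, so no center will move and the algorithm stops.
labels = [1 if dist(p, c1) <= dist(p, c2) else 2 for p in points]
```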