Hitesh Sahu
Hitesh SahuHitesh Sahu
  1. Home
  2. ›
  3. posts
  4. ›
  5. …

  6. ›
  7. 6 0 K mean

Loading ⏳
Fetching content, this won’t take long…


💡 Did you know?

🍯 Honey never spoils — archaeologists found 3,000-year-old jars still edible.

🍪 This website uses cookies

No personal data is stored on our servers however third party tools Google Analytics cookies to measure traffic and improve your website experience. Learn more

Loading ⏳
Fetching content, this won’t take long…


💡 Did you know?

🍌 Bananas are berries, but strawberries are not.
AI-Machine-Learning

  • AI-Machine-Learning Index

  • Machine Learning Learning Path

  • Machine Learning: Introduction and Core Algorithms

  • Linear Regression Explained: Single Variable and Multivariate Models with Gradient Descent

  • Evaluating a Hypothesis in Neural Networks

  • Bias-Variance Dilemma

  • Cost Function Regularization: Balancing Bias and Variance in Machine Learning Models

  • Polynomial Regression

  • Normal Equation in Linear Regression: Formula, Intuition, and Comparison with Gradient Descent

  • Logistic Regression for Classification: Concept, Sigmoid Function, Cost Function, and Implementation

  • Logistic Regression for Classification: Concept, Sigmoid Function, Cost Function, and Implementation

  • Support Vector Machines (SVM): Maximizing Margins for Robust Machine Learning Models

  • XGBoost (Extreme Gradient Boosting) Explained

  • Dimensionality Reduction in Machine Learning

  • Principal Component Analysis (PCA) Explained

  • t-SNE (t-distributed Stochastic Neighbor Embedding) Explained

  • K-Means Clustering

  • Anomaly Detection: Identifying Rare and Unusual Patterns in Data

  • Anomaly Detection Using Gaussian Distribution in Machine Learning

  • Anomaly Detection Using Multivariate Gaussian Distribution

  • Recommender Systems: Collaborative Filtering, Content-Based Filtering, and Hybrid Approaches

  • Collaborative Filtering: Building Recommender Systems with Feature Learning

  • Anomaly Detection: Identifying Rare and Unusual Patterns in Data

  • Large Scale Machine Learning: Training Models on Massive Datasets

  • Stochastic Gradient Descent (SGD): Efficient Optimization for Large Datasets

  • MapReduce for Large-Scale Machine Learning: Distributed Training at Scale

Cover Image for K-Means Clustering

K-Means Clustering

K-Means is a powerful unsupervised learning algorithm for clustering data into coherent subsets. It iteratively assigns points to the nearest centroid and updates centroids to minimize distortion, making it widely used in practice.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026

Share This on

← Previous

Principal Component Analysis (PCA) Explained

Next →

Anomaly Detection: Identifying Rare and Unusual Patterns in Data

K-Means Clustering

K- Mean is used to solves the clustering problem:

  • one of the most widely used clustering algorithms.
  • K-Means is an Iterative Algorithm

It is simple, fast, and widely used in practice.

KMean Working

Scikit Learn k-mean example:


from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
              
kmeans = KMeans(
    n_clusters=2,    # Number of Cluster
    random_state=0,  # Dont Starts with random centroid locations
    n_init="auto")   # automatically chooses the number of runs.
    .fit(X)          # Train Model

# print Cluster centers
kmeans.cluster_centers_

# print groups label for data items
kmeans.labels_

Clustering

Given an *unlabeled dataset automatically group the data into coherent subsets, called clusters.

Cluster analysis is a powerful unsupervised learning technique for discovering groups in data.

When K-Means Works Well

K-Means works best when:

  • Clusters are roughly spherical.
  • Clusters have similar sizes.
  • Features are properly scaled.

Feature scaling is important:

xj:=xj−μjσjx_j := \frac{x_j - \mu_j}{\sigma_j}xj​:=σj​xj​−μj​​

Without scaling, features with larger magnitude dominate distance calculations.

When K-Means Struggles

K-Means performs poorly when:

  • Clusters are non-convex.
  • Clusters have very different densities.
  • Data contains significant outliers.
  • Clusters are not well separated.

K-Mean Algorithm

K-Means works as follows:

Takes KKK and unlabeled data as input.

Given:

Training set
{x(1),x(2),...,x(m)}\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}{x(1),x(2),...,x(m)}

where x(i)∈Rnx^{(i)} \in \mathbb{R}^nx(i)∈Rn.

And number of clusters KKK

Repeats: {

  1. Assignment Step

    Assign points to nearest centroid.

  2. Centroid update step

    Move centroids μ\muμ to mean of assigned points.

    Minimizes Cost function (Distortion):

    J=1m∑i=1m∥x(i)−μc(i)∥2J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2J=m1​i=1∑m​​x(i)−μc(i)​​2

    }

Stops when assignments stabilize.

Output is cluster assignments for each example and cluster centroids.

Suppose we want to group data into K=2K = 2K=2 clusters.

K-Means Algorithm

Algorithm

Repeat for t=1,…,Tt = 1, \dots, Tt=1,…,T:

where T = number of runs

  1. Randomly initialize centroids
  2. Run K-Means to convergence
  3. Compute distortion:
J(t)=1m∑i=1m∥x(i)−μc(i)∥2J^{(t)} = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2J(t)=m1​i=1∑m​​x(i)−μc(i)​​2

Finally:

Choose clustering with smallest J(t)\text{Choose clustering with smallest } J^{(t)}Choose clustering with smallest J(t)

Input

  • KKK (number of clusters) = 2 in this case
  • Dataset {x(1),x(2),...,x(m)}\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}{x(1),x(2),...,x(m)}, where
    • x(i)∈Rnx^{(i)} \in \mathbb{R}^nx(i)∈Rn
    • mmm = number of examples

Example: 🍓🍡🍬🍭

Candies scatter on table with K=2K=2K=2 bowls to collect them.

1. Initialize Cluster centroids(μ\muμ) 🥣

Randmly put 2 Bowls on the table to start collecting candies.

Randomly initialize KKK cluster centroids

Total number of clusters

KKK

We have KKK centroids:

  • k∈{1,2,...,K}k \in \{1, 2, ..., K\}k∈{1,2,...,K}
  • μk\mu_kμk​ = centroid of cluster kkk

μ1,μ2,μ3…μk \mu_1, \mu_2, \mu_3 \dots \mu_kμ1​,μ2​,μ3​…μk​

where μk∈Rn\mu_k \in \mathbb{R}^nμk​∈Rn

  • μk\mu_kμk​ is a point in the same space as the data (i.e., Rn\mathbb{R}^nRn)

Choosing the Number of Clusters (K)

There is no universal rule for choosing KKK.

Proper Random Initialization

When running K-Means:

K<mK < mK<m

Where:

  • KKK = number of clusters
  • mmm = number of training examples

It does not make sense to choose K≥mK \ge mK≥m.

Recommended Initialization Method

  1. Randomly select KKK distinct training examples.
  2. Set initial centroids equal to those examples:
μ1=x(i1),μ2=x(i2),…,μK=x(iK)\mu_1 = x^{(i_1)}, \quad \mu_2 = x^{(i_2)}, \quad \dots, \quad \mu_K = x^{(i_K)}μ1​=x(i1​),μ2​=x(i2​),…,μK​=x(iK​)

where i1,…,iKi_1, \dots, i_Ki1​,…,iK​ are randomly chosen distinct indices.

This ensures centroids start within the data distribution.

Elbow Method

When we increase KKK, distortion JJJ decreases because we have more centroids to fit the data.

Idea is to find point where adding more clusters does not significantly reduce distortion.

Elbow Method

  1. Run K-Means for different values of KKK.
  2. Compute distortion JJJ for each.
  3. Plot JJJ vs KKK.
  4. Look for an “elbow” where improvement slows down.

Repeatedly Steps 2-3 until convergence {

2. 📏 Cluster Assignment Step

For each training example(x(i)x^{(i)}x(i)) where i=1,…,mi = 1, \dots, mi=1,…,m:

Assign x(i)x^{(i)}x(i) to the cluster with the closest centroid.

In Mathematical terms:

Index of Cluster c(i)c^{(i)}c(i)

Index of cluster is Centroid closest to x(i)x^{(i)}x(i) in {1,…,K}\{1, \dots, K\}{1,…,K}

Examples:

  • If x(i)x^{(i)}x(i) is closest to μ1\mu_1μ1​, then c(i)=1c^{(i)} = 1c(i)=1
  • If x(i)x^{(i)}x(i) is closest to μ2\mu_2μ2​, then c(i)=2c^{(i)} = 2c(i)=2

μc(i)\mu_c^{(i)}μc(i)​ = CLuster centroid of cluster

Value of centroid of cluster assigned to x(i)x^{(i)}x(i) is μc(i)\mu_{c^{(i)}}μc(i)​

  • If c(i)=1c^{(i)} = 1c(i)=1, then μc(i)=μ1\mu_{c^{(i)}} = \mu_1μc(i)​=μ1​
  • If c(i)=2c^{(i)} = 2c(i)=2, then μc(i)=μ2\mu_{c^{(i)}} = \mu_2μc(i)​=μ2​

Which can be expressed as:

c(i):=arg⁡min⁡k∈{1,…,K}∥x(i)−μk∥2c^{(i)} := \arg\min_{k \in \{1, \dots, K\}} \left\| x^{(i)} - \mu_k \right\|^2c(i):=argk∈{1,…,K}min​​x(i)−μk​​2

Formally:

c(i)=arg⁡min⁡k∥x(i)−μk∥2c^{(i)} = \arg\min_k \left\| x^{(i)} - \mu_k \right\|^2c(i)=argkmin​​x(i)−μk​​2

Where:

  • μk\mu_kμk​ = centroid of cluster kkk
  • c(i)c^{(i)}c(i) = cluster assigned to example iii

We typically use squared Euclidean distance.

Centrod Assignment

Example

Given

Centroids:

μ1=[12]\mu_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}μ1​=[12​] μ2=[−30]\mu_2 = \begin{bmatrix} -3 \\ 0 \end{bmatrix}μ2​=[−30​] μ3=[42]\mu_3 = \begin{bmatrix} 4 \\ 2 \end{bmatrix}μ3​=[42​]

Training example:

x(i)=[−12]x^{(i)} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}x(i)=[−12​]

Distance to μ1\mu_1μ1​

∥x(i)−μ1∥2=(−1−1)2+(2−2)2\| x^{(i)} - \mu_1 \|^2 = (-1 - 1)^2 + (2 - 2)^2∥x(i)−μ1​∥2=(−1−1)2+(2−2)2 =(−2)2+02=4= (-2)^2 + 0^2 = 4=(−2)2+02=4

Distance to μ2\mu_2μ2​

∥x(i)−μ2∥2=(−1+3)2+(2−0)2\| x^{(i)} - \mu_2 \|^2 = (-1 + 3)^2 + (2 - 0)^2∥x(i)−μ2​∥2=(−1+3)2+(2−0)2 =(2)2+(2)2=4+4=8= (2)^2 + (2)^2 = 4 + 4 = 8=(2)2+(2)2=4+4=8

Distance to μ3 \mu_3μ3​

∥x(i)−μ3∥2=(−1−4)2+(2−2)2\| x^{(i)} - \mu_3 \|^2 = (-1 - 4)^2 + (2 - 2)^2∥x(i)−μ3​∥2=(−1−4)2+(2−2)2 =(−5)2+02=25= (-5)^2 + 0^2 = 25=(−5)2+02=25

Assigning Cluster

  • Distance to μ1\mu_1μ1​: 4
  • Distance to μ2\mu_2 μ2​: 8
  • Distance to μ3\mu_3 μ3​: 25

The smallest distance is to μ1\mu_1μ1​ so c(i)=1c^{(i)} = 1c(i)=1


3. 🧿 Move Centroid Step

For each cluster kkk where k=1,…,Kk = 1, \dots, Kk=1,…,K:

  • KKK = total number of clusters
  • k∈{1,2,...,K}k \in \{1, 2, ..., K\}k∈{1,2,...,K}

Update centroid μk\mu_kμk​ to be the mean of all points assigned to it.

μk=1∣Ck∣∑i:c(i)=kx(i)\mu_k = \frac{1}{|C_k|} \sum_{i : c^{(i)} = k} x^{(i)}μk​=∣Ck​∣1​i:c(i)=k∑​x(i)

Where:

  • CkC_kCk​ = set of points assigned to cluster kkk
  • ∣Ck∣|C_k|∣Ck​∣ = number of points in cluster kkk

This moves the centroid to the center of its assigned points.

  • Each centroid becomes the mean of the points assigned to it.

⛔ Stop Condition

The algorithm converges when:

  • Cluster assignments c(i)c^{(i)}c(i) no longer change, or
  • Centroids μk\mu_kμk​ stop moving.

What If a Cluster Gets No Points?

If some centroid has zero assigned points:

  1. Eliminate the cluster (resulting in K−1K - 1K−1 clusters), or
  2. Reinitialize the centroid randomly

In practice, empty clusters are uncommon with reasonable initialization.


💰 Cost Function (Distortion)

Minimize th Distance between points and their assigned centroids.

J(c(1),…,c(m),μ1,…,μK)=1m∑i=1m∥x(i)−μc(i)∥2J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2J(c(1),…,c(m),μ1​,…,μK​)=m1​i=1∑m​​x(i)−μc(i)​​2

Or equivalently:

J(μ1,…,μK)=1m∑i=1m∥x(i)−μc(i)∥2J(\mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2J(μ1​,…,μK​)=m1​i=1∑m​​x(i)−μc(i)​​2

Where:

  • JJJ = cost function (distortion)
  • mmm = number of training examples
  • x(i)x^{(i)}x(i) = iii-th training example
  • μc(i)\mu_{c^{(i)}}μc(i)​ = centroid of the cluster assigned to x(i)x^{(i)}x(i)
  • c(i)c^{(i)}c(i) = index of cluster assigned to x(i)x^{(i)}x(i)

Each iteration consists of:

  1. Assignment step
    Minimizes JJJ with respect to c(i)c^{(i)}c(i) (cluster assignments).

  2. Centroid update step
    Minimizes JJJ with respect to μk\mu_kμk​ (centroid locations).

Since each step does not increase JJJ, the algorithm is guaranteed to converge.

Each iteration of K-Means guarantees that JJJ does not increase.

However, it may converge to a local minimum, not necessarily the global minimum.

A good solution:

  • Each natural cluster is captured by one centroid.

A bad solution:

  • Merge two true clusters into one
  • Split one true cluster into multiple parts
  • Assign very few points to some clusters

All of these correspond to higher distortion:

Jbad>JgoodJ_{\text{bad}} > J_{\text{good}}Jbad​>Jgood​

To reduce the risk of poor local minima:

← Previous

Principal Component Analysis (PCA) Explained

Next →

Anomaly Detection: Identifying Rare and Unusual Patterns in Data

AI-Machine-Learning/6-0-K-mean
Let's work together
+49 176-2019-2523
hiteshkrsahu@gmail.com
WhatsApp
Skype
Munich 🥨, Germany 🇩🇪, EU
Playstore
Hitesh Sahu's apps on Google Play Store
Need Help?
Let's Connect
Navigation
  Home/About
  Skills
  Work/Projects
  Lab/Experiments
  Contribution
  Awards
  Art/Sketches
  Thoughts
  Contact
Links
  Sitemap
  Legal Notice
  Privacy Policy

Made with

NextJS logo

NextJS by

hitesh Sahu

| © 2026 All rights reserved.