
Anomaly Detection Using Gaussian Distribution in Machine Learning

Learn how anomaly detection works using the Gaussian (normal) distribution. Understand how to model data probabilistically, estimate parameters, compute likelihoods, and identify outliers using threshold-based decision making in machine learning systems.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026

Anomaly Detection Using Gaussian Distribution

In this approach, we build an anomaly detection algorithm by modeling the probability of data using Gaussian distributions.

  1. Model each feature using a Gaussian distribution.
  2. Estimate $\mu$ and $\sigma^2$ from training data.
  3. Compute $p(x)$ for new examples.
  4. Flag examples where:
$$p(x) < \varepsilon$$

Low probability → likely anomaly.



🔔 Understanding Gaussian Distribution

Probability Density Function

The Gaussian probability density function is:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Where:

  • $x$ = a value of the random variable (one data point in the data set)
  • $\mu$ = mean: the average of all data points
  • $\sigma$ = standard deviation: how much the data varies from the mean
  • $\sigma^2$ = variance: the square of the standard deviation
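
The formula above can be sketched directly in Python. This is a minimal illustration, not code from the original post; the function name `gaussian_pdf` is an assumption made for this example.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) evaluated at x (hypothetical helper)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Normalizing constant 1 / (sqrt(2*pi) * sigma)
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    # Exponent -(x - mu)^2 / (2 * sigma^2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Height of the standard normal curve at its mean
peak = gaussian_pdf(0.0, mu=0.0, sigma=1.0)
```

The value returned is a density, not a probability of an exact value, so for small `sigma` it can be larger than 1.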

Shape of the Gaussian Distribution

The Gaussian curve has the following properties:

  • It is bell-shaped
  • It is symmetric around the mean
  • The total area under the curve equals 1

A random variable $x$ that follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is written:

$$x \sim \mathcal{N}(\mu, \sigma^2)$$

The symbol $\sim$ means “is distributed as”.

Effect of Parameters

The curve is fully defined by two parameters, $\mu$ and $\sigma^2$. So given a data set

$$x^{(1)}, x^{(2)}, \dots, x^{(m)}$$

our goal is to estimate $\mu$ and $\sigma^2$.

Standard case: μ = 0, σ = 1

This is the standard normal distribution.

  • Centered at 0
  • Moderate width

1. Mean (μ) ↔️

The average of all the data points

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

  • Controls the center of the distribution.
  • Changing μ shifts the curve left or right.

Example:

  • If μ = 0 → centered at 0
  • If μ = 3 → centered at 3

Effect of $\mu$ ↔️

| $\mu$ | $\sigma$ | Shape of the Curve |
|---|---|---|
| 0 | 1 | Standard bell curve |
| 3 | 1 | Shifted to the right, same shape as standard curve |
| -2 | 1 | Shifted to the left, same shape as standard curve |

2. Variance ($\sigma^2$) and Standard Deviation ($\sigma$) ↕️

The variance measures how far the data points are from the mean.

It is the average squared deviation from the mean. The standard deviation $\sigma$ is its square root.

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)^2$$

  • Controls the width of the curve.
  • Smaller $\sigma$ → narrower and taller curve
  • Larger $\sigma$ → wider and flatter curve

Since total area must equal 1:

  • If the curve gets wider → it becomes shorter
  • If the curve gets narrower → it becomes taller

Effect of $\sigma$ ↕️

| $\mu$ | $\sigma$ | Shape of the Curve |
|---|---|---|
| 0 | 1 | Standard bell curve |
| 0 | 0.5 | Narrower and taller curve |
| 0 | 2 | Wider and flatter curve |
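
The trade-off between width and height can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper names `peak_height` and `area_under_curve`, the integration range, and the step count are assumptions chosen for illustration.

```python
from statistics import NormalDist

def peak_height(sigma: float) -> float:
    """Height of the curve at its mean, which equals 1 / (sqrt(2*pi) * sigma)."""
    return NormalDist(mu=0.0, sigma=sigma).pdf(0.0)

def area_under_curve(sigma: float, lo: float = -20.0, hi: float = 20.0,
                     steps: int = 40_000) -> float:
    """Approximate the area with a midpoint Riemann sum over [lo, hi]."""
    dist = NormalDist(mu=0.0, sigma=sigma)
    dx = (hi - lo) / steps
    return sum(dist.pdf(lo + (i + 0.5) * dx) for i in range(steps)) * dx

# The three sigma values from the table above
for sigma in (0.5, 1.0, 2.0):
    print(sigma, round(peak_height(sigma), 4), round(area_under_curve(sigma), 4))
```

Halving $\sigma$ doubles the peak height while the area stays at 1, which is why a narrower curve must be taller.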

Note on 1/m vs 1/(m−1)

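In statistics, you may sometimes see:

$$\frac{1}{m-1}$$

instead of:

$$\frac{1}{m}$$

The $\frac{1}{m-1}$ version is the unbiased sample variance, while $\frac{1}{m}$ is the maximum likelihood estimate. In machine learning, we usually use $\frac{1}{m}$. When the data set is large, the difference is very small in practice.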
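
Python's standard library exposes both conventions, which makes the difference easy to see. The data values below are made up for illustration.

```python
import statistics

# Hypothetical measurements (m = 8)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_m = statistics.pvariance(data)          # divides by m
var_m_minus_1 = statistics.variance(data)   # divides by m - 1

# The two estimates differ by the factor m / (m - 1)
ratio = var_m_minus_1 / var_m
```

For this small sample the ratio is 8/7; as $m$ grows, the ratio approaches 1.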


Intuition (2D Case)

When $n = 2$:

  • Each feature has its own Gaussian distribution.
  • Their product forms a 3D probability surface.
  • High probability regions form an ellipse-shaped area, with axes parallel to the feature axes because the features are modeled independently.
  • Points outside that region have low probability and are flagged as anomalies.
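
A short derivation shows why the boundary of the high probability region is an ellipse when two independent Gaussians are multiplied. The threshold $\varepsilon$ is the one from the overview; the constant $c$ is introduced here only for illustration.

$$p(x) = p(x_1)\,p(x_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right)$$

Setting $p(x) = \varepsilon$, taking logarithms, and multiplying by $-2$ gives:

$$\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} = c, \qquad c = -2\ln\left(2\pi\sigma_1\sigma_2\,\varepsilon\right)$$

For $\varepsilon < \frac{1}{2\pi\sigma_1\sigma_2}$ the constant $c$ is positive, so this is an ellipse centered at $(\mu_1, \mu_2)$ with semi-axes $\sqrt{c}\,\sigma_1$ and $\sqrt{c}\,\sigma_2$. Points with $p(x) < \varepsilon$ lie outside it.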

Gaussian Anomaly Detection

Problem Setup

We are given:

  • An unlabeled training set of $m$ examples:

    $$x^{(1)}, x^{(2)}, \dots, x^{(m)}$$
  • Each example is a feature vector in $\mathbb{R}^n$

Examples:

  • Aircraft engine sensor data
  • User behavior features
  • System monitoring metrics

The goal is to determine whether a new example is normal or anomalous.

1. Training Phase 📚

  1. Choose relevant features.
  2. Compute $\mu_1, \dots, \mu_n$.
  3. Compute $\sigma_1^2, \dots, \sigma_n^2$.

Modeling $p(x)$

We model the probability of a data point using Gaussian distributions:

$$p(x) = p(x_1, x_2, \dots, x_n), \quad x \in \mathbb{R}^n$$

where $n$ is the number of features.

Each feature is modeled using a Gaussian distribution:

$$x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$$

The symbol $\sim$ means “is distributed as”.

This means the feature $x_j$ follows a Gaussian distribution with:

  • Mean $\mu_j$
  • Variance $\sigma_j^2$

1. Parameter Estimation

Given training data, we estimate parameters.

Mean $\mu_j$

For each feature $j$:

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

This is the average value of feature $j$.

Variance $\sigma_j^2$

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$

This measures how spread out the feature values are.
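
The two estimates can be written as explicit sums that mirror the formulas above. This is a minimal sketch; the function name `estimate_params` and the tiny training set are assumptions made for illustration, not data from the original post.

```python
def estimate_params(X):
    """Per-feature mean and variance (1/m convention).

    X is a list of m examples, each a list of n feature values.
    """
    m = len(X)
    n = len(X[0])
    # mu_j = (1/m) * sum_i x_j^(i)
    mu = [sum(x[j] for x in X) / m for j in range(n)]
    # sigma_j^2 = (1/m) * sum_i (x_j^(i) - mu_j)^2
    sigma2 = [sum((x[j] - mu[j]) ** 2 for x in X) / m for j in range(n)]
    return mu, sigma2

# Hypothetical training set: m = 3 examples, n = 2 features
X_train = [[17.0, 45.0], [18.0, 50.0], [19.0, 55.0]]
mu, sigma2 = estimate_params(X_train)
```

Each feature gets its own pair of parameters, so the model stores $2n$ numbers in total.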

2. Density estimation 🌌

Compute Probability of Examples

Probabilities are multiplicative for independent features. The joint density is:

$$p(x) = p(x_1, x_2, \dots, x_n)$$

We assume the features are independent, so:

$$p(x) = \prod_{j=1}^{n} p(x_j)$$

Here, $\prod$ denotes a product (multiplication over a range). Writing the parameters explicitly:

$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$

Where each feature probability is:

$$p(x_j; \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)$$

Example

For a 2-feature example:

Temperature

  • $x_1$ = 17.5

  • $p(x_1) = 0.0738$

Vibration Intensity

  • $x_2$ = 48
  • $p(x_2) = 0.02288$

To find the overall probability of this example:

$$p(x) = p(x_1) \times p(x_2)$$

Therefore:

$$p(x) = 0.0738 \times 0.02288$$

$$p(x) = 0.001688544$$

$$p(x) \approx 0.00169$$
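
The same product can be computed in code. With many features the product of small densities can underflow to zero, so a common workaround (not covered in the original post) is to sum logarithms instead. The function names and the 100-feature stress case below are assumptions for illustration.

```python
import math

def joint_density(feature_densities):
    """Multiply per-feature densities, assuming independent features."""
    return math.prod(feature_densities)

def joint_log_density(feature_densities):
    """Same quantity in log space: log p(x) = sum_j log p(x_j)."""
    return sum(math.log(p) for p in feature_densities)

# Values from the worked example above
p_x = joint_density([0.0738, 0.02288])
log_p_x = joint_log_density([0.0738, 0.02288])

# Hypothetical stress case: 100 features, each with density 1e-5
many = [1e-5] * 100
underflowed = joint_density(many)        # too small for a float, collapses to 0.0
still_usable = joint_log_density(many)   # a finite negative number
```

Because the logarithm is monotonic, comparing $\log p(x)$ with $\log \varepsilon$ gives the same decision as comparing $p(x)$ with $\varepsilon$.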


2. Making Predictions 🔎

Detection Phase

For a new example $x_{test}$:

Step 1: Compute probability

$$p(x_{test}) = \prod_{j=1}^{n} p(x_{test,j}; \mu_j, \sigma_j^2)$$

Step 2: Choose threshold $\varepsilon$

Decision Rule

Compare with $\varepsilon$.

$$\text{If } p(x_{test}) < \varepsilon \Rightarrow \text{Anomaly}$$

$$\text{If } p(x_{test}) \ge \varepsilon \Rightarrow \text{Normal}$$

Flag as anomaly if probability is low.
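
The training and detection phases can be put together in a short end-to-end sketch. Everything named here (`fit`, `density`, `is_anomaly`, the training data, and the value of `EPSILON`) is hypothetical and chosen only to illustrate the flow; the sketch also assumes every feature has nonzero variance.

```python
from math import prod
from statistics import NormalDist, fmean, pvariance

def fit(X_train):
    """Fit one Gaussian per feature using the 1/m variance estimate."""
    columns = list(zip(*X_train))
    return [NormalDist(fmean(col), pvariance(col) ** 0.5) for col in columns]

def density(model, x):
    """p(x) as the product of per-feature densities (independence assumed)."""
    return prod(dist.pdf(x_j) for dist, x_j in zip(model, x))

def is_anomaly(model, x, epsilon):
    """Decision rule: flag x when p(x) < epsilon."""
    return density(model, x) < epsilon

# Hypothetical training data: [temperature, vibration intensity]
X_train = [[17.0, 45.0], [18.0, 50.0], [19.0, 55.0], [18.0, 48.0], [18.0, 52.0]]
model = fit(X_train)
EPSILON = 1e-3  # hypothetical threshold

typical_flag = is_anomaly(model, [18.0, 50.0], EPSILON)   # point at the feature means
unusual_flag = is_anomaly(model, [25.0, 80.0], EPSILON)   # far from both means
```

In this sketch the point at the feature means is kept as normal and the distant point is flagged. In practice the threshold would be tuned for the application rather than fixed in advance.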


Key Takeaway

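The per-feature Gaussian model used here captures each feature's variance but, because it assumes independent features, not the correlation between them.

The multivariate Gaussian distribution models:

Feature Variance
+
Feature Correlation

which allows anomaly detection systems to detect unusual combinations of features, not just unusual individual values.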
