

Anomaly Detection Using Gaussian Distribution: Detecting Outliers with Probability Models

Learn how anomaly detection works using the Gaussian (normal) distribution. Understand how to model data probabilistically, estimate parameters, compute likelihoods, and identify outliers using threshold-based decision making in machine learning systems.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Gaussian Distribution / Normal Distribution

This understanding of the Gaussian distribution is essential before building the anomaly detection algorithm.

$x \sim \mathcal{N}(\mu, \sigma^2)$

This means the random variable $x$ follows a Gaussian distribution with:

  • Mean: $\mu$
  • Variance: $\sigma^2$

The symbol $\sim$ means "is distributed as".

Probability Density Function

The Gaussian probability density function is:

$$p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$
  • You do not need to memorize this formula.
  • It simply defines the bell-shaped curve.
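The bell-shaped curve above can be evaluated directly. Here is a minimal sketch of the density formula in plain Python (the function name `gaussian_pdf` is my own, not from the original post):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Probability density of x under N(mu, sigma2)."""
    return (1.0 / math.sqrt(2 * math.pi * sigma2)) * math.exp(-(x - mu) ** 2 / (2 * sigma2))

# Density of a standard normal at its mean: 1/sqrt(2*pi) ≈ 0.3989
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
```

Note that the curve is symmetric: `gaussian_pdf(1, 0, 1)` equals `gaussian_pdf(-1, 0, 1)`.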

Shape of the Gaussian Distribution

The Gaussian curve has the following properties:

  • It is bell-shaped
  • It is symmetric around the mean
  • The total area under the curve equals 1
Effect of Parameters

Given training examples $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, the curve is fully defined by two parameters, so our goal is to estimate:

1. Mean (μ)

  • Controls the center of the distribution.
  • Changing μ shifts the curve left or right.

Example:

  • If μ = 0 → centered at 0
  • If μ = 3 → centered at 3

Estimating the Mean

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

This is simply the average of the data.

2. Standard Deviation (σ)

This measures how far the data points typically are from the mean.

The variance $\sigma^2$ is the average squared deviation from the mean; the standard deviation $\sigma$ is its square root.

  • Controls the width of the curve.
  • Smaller σ → narrower and taller curve
  • Larger σ → wider and flatter curve

Since total area must equal 1:

  • If the curve gets wider → it becomes shorter
  • If the curve gets narrower → it becomes taller

Estimating the Variance

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$

Note on 1/m vs 1/(m−1)

In statistics, you may sometimes see:

$\frac{1}{m-1}$

instead of:

$\frac{1}{m}$

In machine learning, we usually use 1/m. When the dataset is large, the difference is very small in practice.
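The two estimators differ only in the divisor, which NumPy exposes through the `ddof` argument. A quick sketch on a toy dataset (the values are mine, chosen for clean arithmetic):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
m = len(x)

mu = x.mean()                 # (1/m) * sum of x
var_ml = x.var(ddof=0)        # 1/m      (ML estimate, the convention used here)
var_unbiased = x.var(ddof=1)  # 1/(m-1)  (unbiased estimate from statistics)

print(mu, var_ml, var_unbiased)  # 3.0 2.0 2.5
```

With only five points the two variances already differ by 25%; as $m$ grows the gap shrinks toward zero, which is why the choice rarely matters in machine learning.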

Examples

Case 1: μ = 0, σ = 1

This is the standard normal distribution.

  • Centered at 0
  • Moderate width

Case 2: μ = 0, σ = 0.5

  • Still centered at 0
  • Much narrower
  • Taller curve
  • Variance: $\sigma^2 = 0.25$

Case 3: μ = 0, σ = 2

  • Centered at 0
  • Much wider
  • Flatter curve
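The three cases above follow from the fact that the peak of the curve, at $x = \mu$, has height $\frac{1}{\sqrt{2\pi}\,\sigma}$, so halving $\sigma$ doubles the height. A small sketch (the helper `peak_height` is my own name):

```python
import math

def peak_height(sigma):
    # Maximum of the Gaussian PDF, reached at x = mu: 1 / (sqrt(2*pi) * sigma)
    return 1.0 / (math.sqrt(2 * math.pi) * sigma)

for sigma in (1.0, 0.5, 2.0):
    print(sigma, round(peak_height(sigma), 4))
# 1.0 0.3989
# 0.5 0.7979
# 2.0 0.1995
```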

Anomaly Detection Using Gaussian Distribution

In this approach, we build an anomaly detection algorithm by modeling the probability of data using Gaussian distributions.

  1. Model each feature using a Gaussian distribution.
  2. Estimate $\mu$ and $\sigma^2$ from training data.
  3. Compute $p(x)$ for new examples.
  4. Flag examples where $p(x) < \varepsilon$.

Low probability → likely anomaly.

Problem Setup

We are given:

  • An unlabeled training set of $m$ examples:

    $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
  • Each example is a feature vector in $\mathbb{R}^n$

Examples:

  • Aircraft engine sensor data
  • User behavior features
  • System monitoring metrics

The goal is to determine whether a new example is normal or anomalous.

📚 1. Training Phase

  1. Choose relevant features.
  2. Compute $\mu_1, \dots, \mu_n$.
  3. Compute $\sigma_1^2, \dots, \sigma_n^2$.

Modeling $p(x)$

We model the probability of a data point using Gaussian Distribution:

$$p(x) = p(x_1, x_2, \dots, x_n), \quad x \in \mathbb{R}^n$$

where $n$ is the number of features.

Each feature is modeled using a Gaussian distribution:

$$x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$$

That is, feature $x_j$ follows a Gaussian distribution with:

  • Mean $\mu_j$
  • Variance $\sigma_j^2$

Parameter Estimation

Given training data, we estimate parameters.

Mean

For each feature $j$:

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

This is the average value of feature $j$.

Variance

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$

This measures how spread out the feature values are.

Compute Probability of Examples

We assume the features are independent, so:

$$p(x) = \prod_{j=1}^{n} p(x_j)$$

Here, $\prod$ denotes a product (multiplication over a range). With the estimated per-feature parameters:

$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$

Where each feature probability is:

$$p(x_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)$$

This approach is called density estimation.
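The training phase and the density product can be sketched together in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function names `estimate_gaussian` and `density` and the toy data are my own:

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance (1/m convention) for an (m, n) data matrix."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # ddof=0 -> divides by m
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) for each row of X: product of the independent per-feature Gaussian densities."""
    coeff = 1.0 / np.sqrt(2 * np.pi * sigma2)
    exponent = np.exp(-(X - mu) ** 2 / (2 * sigma2))
    return np.prod(coeff * exponent, axis=1)

# Toy training set with two features
X_train = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 11.0]])
mu, sigma2 = estimate_gaussian(X_train)

# A point near the mean gets a much higher density than a far-away one
print(density(np.array([[2.0, 11.0], [8.0, 30.0]]), mu, sigma2))
```

In practice the product of many small densities underflows quickly, so real implementations usually sum log-densities instead of multiplying raw probabilities.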

🔎 2. Making Predictions

Detection Phase

For a new example $x_{\text{test}}$:

Step 1: Compute probability

$$p(x_{\text{test}}) = \prod_{j=1}^{n} p(x_{\text{test},j}; \mu_j, \sigma_j^2)$$

Step 2: Choose threshold $\varepsilon$

Decision Rule

Compare $p(x_{\text{test}})$ with $\varepsilon$:

$$\text{If } p(x_{\text{test}}) < \varepsilon \Rightarrow \text{Anomaly}$$
$$\text{If } p(x_{\text{test}}) \ge \varepsilon \Rightarrow \text{Normal}$$

Flag as anomaly if probability is low.
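The decision rule itself is a one-line comparison. A sketch, assuming the densities have already been computed; the threshold value here is arbitrary (in practice $\varepsilon$ is tuned on a labeled validation set):

```python
import numpy as np

EPSILON = 1e-3  # hypothetical threshold; tune on a labeled validation set

def predict_anomaly(p, epsilon=EPSILON):
    """Return True wherever the density falls below the threshold."""
    return p < epsilon

p_test = np.array([0.05, 2e-4, 0.8])
print(predict_anomaly(p_test))  # [False  True False]
```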


Intuition (2D Case)

When $n = 2$:

  • Each feature has its own Gaussian distribution.
  • Their product forms a 3D probability surface.
  • High probability regions form an ellipse-shaped area.
  • Points outside that region have low probability and are flagged as anomalies.
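The elliptical shape comes from the two features having different variances. A small self-contained sketch (parameters chosen by me for illustration) shows that a point two units away along the high-variance axis is more probable than one two units away along the low-variance axis:

```python
import numpy as np

# Two independent features with different spreads -> elliptical high-density region
mu = np.array([0.0, 0.0])
sigma2 = np.array([1.0, 4.0])  # feature 2 varies more

def p(x):
    coeff = 1.0 / np.sqrt(2 * np.pi * sigma2)
    return np.prod(coeff * np.exp(-(x - mu) ** 2 / (2 * sigma2)))

# Same distance from the mean, different probabilities
print(p(np.array([0.0, 2.0])) > p(np.array([2.0, 0.0])))  # True
```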