Bias-Variance Dilemma

Understanding the bias-variance tradeoff in machine learning, including the concepts of bias and variance, underfitting and overfitting, and strategies to balance model complexity for better generalization.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026



Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data.

🔬 Diagnosing Bias vs. Variance

When a model performs poorly, the key question is:

Is the problem bias or variance?

  • High bias → underfitting
  • High variance → overfitting

Our goal is to find the balance between the two.

Effect of Polynomial Degree $d$

As we increase the polynomial degree $d$:

Training error $J_{\text{train}}(\Theta)$:

  • Generally decreases
  • Higher-degree models fit the training data better

Cross-validation error $J_{\text{CV}}(\Theta)$:

  • Decreases initially
  • Then increases after some point
  • Forms a convex (U-shaped) curve

This behavior helps us diagnose bias vs. variance.
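These curves are easy to reproduce numerically. A minimal sketch (the dataset, split sizes, and degrees below are arbitrary illustrative choices, not from the post): fit polynomials of increasing degree $d$ and compare training error against held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + rng.normal(0, 0.2, x.size)     # true pattern + noise

x_train, y_train = x[:40], y[:40]              # training split
x_cv, y_cv = x[40:], y[40:]                    # cross-validation split

errors = {}                                    # degree -> (J_train, J_cv)
for d in [1, 3, 6, 12]:
    theta = np.polyfit(x_train, y_train, d)    # least-squares fit of degree d
    j_train = np.mean((np.polyval(theta, x_train) - y_train) ** 2)
    j_cv = np.mean((np.polyval(theta, x_cv) - y_cv) ** 2)
    errors[d] = (j_train, j_cv)
    print(f"d={d:2d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```

Training error falls monotonically with degree, while the held-out error bottoms out near the degree that matches the true pattern and then climbs again.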


🦎 High Bias (Underfitting)

The model is too simple to capture the underlying pattern of the data

Characteristics:

  • Model is too simple
  • Fails to capture structure in the data

Problem:

  • Poor training performance: $J_{\text{train}}(\Theta)$ is high
  • Poor test performance: $J_{\text{CV}}(\Theta)$ is high

And importantly:

$$J_{\text{CV}}(\Theta) \approx J_{\text{train}}(\Theta)$$

Interpretation:

  • The model performs poorly everywhere
  • Adding more data usually does not help much
  • Increasing model complexity may help

🪱 High Variance (Overfitting)

The model is too complex and starts fitting the training data perfectly, noise included.

  • Model can bend heavily to pass through every training point.

Characteristics:

  • Model is too complex
  • Fits noise in the training data

Problem:

  • Low training error (good training performance): $J_{\text{train}}(\Theta)$ is low
  • Poor generalization to new, unseen data: $J_{\text{CV}}(\Theta) \gg J_{\text{train}}(\Theta)$

Interpretation:

  • Model performs very well on training data
  • Performs poorly on unseen data
  • Large gap between training and validation error

Solutions

  1. Add a regularization term to penalize large parameters
  2. Reduce model complexity:
    • Reduce the number of features: manually select the important ones
    • Remove irrelevant variables
    • Use automated model selection methods

Visual Summary (Conceptual)

As the polynomial degree $d$ increases:

  • $J_{\text{train}}(\Theta)$ steadily decreases
  • $J_{\text{CV}}(\Theta)$ decreases, then increases

Low $d$ → high bias
High $d$ → high variance
Intermediate $d$ → good balance

Practical Diagnostic Rule

| Situation | $J_{\text{train}}$ | $J_{\text{CV}}$ | Diagnosis |
| --- | --- | --- | --- |
| Both high and similar | High | High | High bias |
| Large gap (train low, CV high) | Low | High | High variance |
| Both low and similar | Low | Low | Good fit |
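The diagnostic table translates directly into a small helper. The "high" threshold and the gap factor below are arbitrary illustrative choices; in practice both depend on your problem's error scale.

```python
def diagnose(j_train: float, j_cv: float, high: float = 1.0) -> str:
    """Classify a model's fit from training and cross-validation error."""
    if j_train >= high and j_cv >= high:
        return "high bias"        # both errors high and similar -> underfitting
    if j_train < high and j_cv >= 2 * j_train:
        return "high variance"    # large train/CV gap -> overfitting
    return "good fit"             # both errors low and similar

print(diagnose(1.8, 1.9))    # both high
print(diagnose(0.1, 0.9))    # large gap
print(diagnose(0.1, 0.12))   # both low
```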

Bias vs Variance Summary

| Concept | Meaning | Cause | Effect |
| --- | --- | --- | --- |
| High bias | Model too simple | Too few features | Underfitting |
| High variance | Model too complex | Too many features | Overfitting |

Regularization techniques

Used to reduce variance and address the problem of overfitting.

  • Instead of removing features, keep them all but reduce parameter sizes.

  • Regularization adds a penalty term to the cost function to discourage complexity.

  • Regularization helps prevent overfitting by keeping the model simpler.

  • The regularization parameter λ controls the strength of the penalty. A larger λ means more regularization.

Lasso vs Ridge

| Feature | Lasso (L1) | Ridge (L2) |
| --- | --- | --- |
| Penalty | Sum of absolute values $\sum \lvert \theta_j \rvert$ | Sum of squares $\sum \theta_j^2$ |
| Effect | Can shrink some coefficients exactly to 0 → feature selection | Shrinks coefficients but rarely to 0 |
| Use case | Many irrelevant features | Prevent overfitting, keep all features |


The idea:

  • Large weights → complex model
  • Small weights → smoother model

🔹 Lasso Regression (L1 Regularization)

Lasso: Cost = MSE + λ * sum(|θ|)

  • Lasso (L1) can shrink some coefficients to zero, effectively performing feature selection.

In standard linear regression, the cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

Lasso adds a penalty proportional to the sum of absolute values of the coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$

Where:

  • $\lambda$ = regularization strength
  • $|\theta_j|$ = absolute value of parameter $\theta_j$
  • $\theta_0$ (bias) is usually not penalized
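A minimal sketch of the feature-selection effect, assuming scikit-learn is available (the post itself names no library). Only the first two of six features actually influence $y$; the L1 penalty drives the irrelevant coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_theta = np.array([4.0, -3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_theta + rng.normal(0, 0.1, 200)

# sklearn's alpha plays the role of lambda in the cost above
model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)   # relevant coefficients shrink; irrelevant ones hit 0
```

Note that the surviving coefficients are also shrunk below their true values — the price paid for the penalty.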

🏔️ Ridge Regression (L2 Regularization)

Ridge: Cost = MSE + λ * sum(θ^2)

  • Ridge (L2) shrinks coefficients but does not set them to zero.

The standard linear regression cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

Ridge adds a penalty proportional to the sum of squared coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

Where:

  • $\lambda$ = regularization strength
  • $\theta_j$ = model parameters
  • $\theta_0$ (bias) is usually not penalized
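Unlike Lasso, the ridge cost has a closed-form minimizer, which makes the shrinkage easy to watch directly. A NumPy sketch (the scaling constants from the cost are folded into $\lambda$ here for brevity; the bias entry of the penalty matrix is zeroed so $\theta_0$ is not penalized, matching the note above):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.normal(size=m)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, m)   # true intercept 2, slope 3

X = np.column_stack([np.ones(m), x])        # design matrix with bias column
L = np.eye(2)
L[0, 0] = 0.0                               # do not penalize theta_0

def ridge_theta(lam: float) -> np.ndarray:
    """Closed-form ridge solution: (X^T X + lam * L)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

theta_ols = ridge_theta(0.0)                # ordinary least squares
theta_reg = ridge_theta(50.0)               # heavily regularized
print(theta_ols, theta_reg)                 # slope shrinks toward 0 as lambda grows
```

Increasing `lam` pulls the slope toward zero while the unpenalized intercept stays near the data mean.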

Key Insight

  • Bias is about model simplicity.
  • Variance is about model sensitivity to data.

Good model selection is about finding the degree $d$ that minimizes:

$$J_{\text{CV}}(\Theta)$$

while avoiding both underfitting and overfitting.
