Bias-Variance Dilemma

Understanding the bias-variance tradeoff in machine learning, including the concepts of bias and variance, underfitting and overfitting, and strategies to balance model complexity for better generalization.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026



Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data.

🔬 Diagnosing Bias vs. Variance

When a model performs poorly, the key question is:

Is the problem bias or variance?

  • High bias → underfitting
  • High variance → overfitting

Our goal is to find the balance between the two.

Effect of Polynomial Degree $d$

As we increase the polynomial degree $d$:

Training error $J_{\text{train}}(\Theta)$:

  • Generally decreases
  • Higher-degree models fit the training data better

Cross-validation error $J_{\text{CV}}(\Theta)$:

  • Decreases initially
  • Then increases after some point
  • Forms a convex (U-shaped) curve

This behavior helps us diagnose bias vs. variance.
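These curves are easy to reproduce numerically. A minimal sketch (the dataset, split sizes, and degrees below are arbitrary illustrative choices, not from the post): fit polynomials of increasing degree $d$ and compare training error against held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + rng.normal(0, 0.2, x.size)     # true pattern + noise

x_train, y_train = x[:40], y[:40]              # training split
x_cv, y_cv = x[40:], y[40:]                    # cross-validation split

errors = {}                                    # degree -> (J_train, J_cv)
for d in [1, 3, 6, 12]:
    theta = np.polyfit(x_train, y_train, d)    # least-squares fit of degree d
    j_train = np.mean((np.polyval(theta, x_train) - y_train) ** 2)
    j_cv = np.mean((np.polyval(theta, x_cv) - y_cv) ** 2)
    errors[d] = (j_train, j_cv)
    print(f"d={d:2d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```

Training error falls monotonically with degree, while the held-out error bottoms out near the degree that matches the true pattern and then climbs again.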


🦎 High Bias (Underfitting)

The model is too simple to capture the underlying pattern of the data

Characteristics:

  • Model is too simple
  • Fails to capture structure in the data

Problem:

  • Poor training performance: $J_{\text{train}}(\Theta)$ is high
  • Poor test performance: $J_{\text{CV}}(\Theta)$ is high

And importantly:

$$J_{\text{CV}}(\Theta) \approx J_{\text{train}}(\Theta)$$

Interpretation:

  • The model performs poorly everywhere
  • Adding more data usually does not help much
  • Increasing model complexity may help

🪱 High Variance (Overfitting)

The model is too complex and starts fitting the training data perfectly, noise included.

  • Model can bend heavily to pass through every training point.

Characteristics:

  • Model is too complex
  • Fits noise in the training data

Problem:

  • Low training error (good training performance): $J_{\text{train}}(\Theta)$ is low
  • Poor generalization to new, unseen data: $J_{\text{CV}}(\Theta) \gg J_{\text{train}}(\Theta)$

Interpretation:

  • Model performs very well on training data
  • Performs poorly on unseen data
  • Large gap between training and validation error

Solutions

  1. Add a regularization term to penalize large parameters
  2. Reduce model complexity:
    • Reduce the number of features: manually select the important ones
    • Remove irrelevant variables
    • Use automated model selection methods

Visual Summary (Conceptual)

As the polynomial degree $d$ increases:

  • $J_{\text{train}}(\Theta)$ steadily decreases
  • $J_{\text{CV}}(\Theta)$ decreases, then increases

Low $d$ → high bias
High $d$ → high variance
Intermediate $d$ → good balance

Practical Diagnostic Rule

| Situation | $J_{\text{train}}$ | $J_{\text{CV}}$ | Diagnosis |
| --- | --- | --- | --- |
| Both high and similar | High | High | High bias |
| Large gap (train low, CV high) | Low | High | High variance |
| Both low and similar | Low | Low | Good fit |
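The diagnostic table translates directly into a small helper. The "high" threshold and the gap factor below are arbitrary illustrative choices; in practice both depend on your problem's error scale.

```python
def diagnose(j_train: float, j_cv: float, high: float = 1.0) -> str:
    """Classify a model's fit from training and cross-validation error."""
    if j_train >= high and j_cv >= high:
        return "high bias"        # both errors high and similar -> underfitting
    if j_train < high and j_cv >= 2 * j_train:
        return "high variance"    # large train/CV gap -> overfitting
    return "good fit"             # both errors low and similar

print(diagnose(1.8, 1.9))    # both high
print(diagnose(0.1, 0.9))    # large gap
print(diagnose(0.1, 0.12))   # both low
```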

Bias vs Variance Summary

| Concept | Meaning | Cause | Effect |
| --- | --- | --- | --- |
| High bias | Model too simple | Too few features | Underfitting |
| High variance | Model too complex | Too many features | Overfitting |

Regularization techniques

Used to reduce variance and address the problem of overfitting.

  • Instead of removing features, keep them all but reduce parameter sizes.

  • Regularization adds a penalty term to the cost function to discourage complexity.

  • Regularization helps prevent overfitting by keeping the model simpler.

  • The regularization parameter λ controls the strength of the penalty. A larger λ means more regularization.

Lasso vs Ridge

| Feature | Lasso (L1) | Ridge (L2) |
| --- | --- | --- |
| Penalty | Sum of absolute values $\sum \lvert \theta_j \rvert$ | Sum of squares $\sum \theta_j^2$ |
| Effect | Can shrink some coefficients exactly to 0 → feature selection | Shrinks coefficients but rarely to 0 |
| Use case | Many irrelevant features | Prevent overfitting, keep all features |


The idea:

  • Large weights → complex model
  • Small weights → smoother model

🔹 Lasso Regression (L1 Regularization)

Lasso: Cost = MSE + λ * sum(|θ|)

  • Lasso (L1) can shrink some coefficients to zero, effectively performing feature selection.

In standard linear regression, the cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

Lasso adds a penalty proportional to the sum of absolute values of the coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$

Where:

  • $\lambda$ = regularization strength
  • $|\theta_j|$ = absolute value of parameter $\theta_j$
  • $\theta_0$ (bias) is usually not penalized
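A minimal sketch of the feature-selection effect, assuming scikit-learn is available (the post itself names no library). Only the first two of six features actually influence $y$; the L1 penalty drives the irrelevant coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_theta = np.array([4.0, -3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_theta + rng.normal(0, 0.1, 200)

# sklearn's alpha plays the role of lambda in the cost above
model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)   # relevant coefficients shrink; irrelevant ones hit 0
```

Note that the surviving coefficients are also shrunk below their true values — the price paid for the penalty.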

🏔️ Ridge Regression (L2 Regularization)

Ridge: Cost = MSE + λ * sum(θ^2)

  • Ridge (L2) shrinks coefficients but does not set them to zero.

The standard linear regression cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

Ridge adds a penalty proportional to the sum of squared coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

Where:

  • $\lambda$ = regularization strength
  • $\theta_j$ = model parameters
  • $\theta_0$ (bias) is usually not penalized
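Unlike Lasso, the ridge cost has a closed-form minimizer, which makes the shrinkage easy to watch directly. A NumPy sketch (the scaling constants from the cost are folded into $\lambda$ here for brevity; the bias entry of the penalty matrix is zeroed so $\theta_0$ is not penalized, matching the note above):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.normal(size=m)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, m)   # true intercept 2, slope 3

X = np.column_stack([np.ones(m), x])        # design matrix with bias column
L = np.eye(2)
L[0, 0] = 0.0                               # do not penalize theta_0

def ridge_theta(lam: float) -> np.ndarray:
    """Closed-form ridge solution: (X^T X + lam * L)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

theta_ols = ridge_theta(0.0)                # ordinary least squares
theta_reg = ridge_theta(50.0)               # heavily regularized
print(theta_ols, theta_reg)                 # slope shrinks toward 0 as lambda grows
```

Increasing `lam` pulls the slope toward zero while the unpenalized intercept stays near the data mean.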

Key Insight

  • Bias is about model simplicity.
  • Variance is about model sensitivity to data.

Good model selection is about finding the degree $d$ that minimizes:

$$J_{\text{CV}}(\Theta)$$

while avoiding both underfitting and overfitting.
