
Bias-Variance Dilemma

Understanding the bias-variance tradeoff in machine learning, including the concepts of bias and variance, underfitting and overfitting, and strategies to balance model complexity for better generalization.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data.

Bias

Error from erroneous assumptions in the learning algorithm.

  • High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

Variance

Error from sensitivity to small fluctuations in the training set.

  • High variance may result from an algorithm modeling the random noise in the training data (overfitting).

Bias vs Variance Summary

| Concept | Meaning | Cause | Effect |
| --- | --- | --- | --- |
| High Bias | Model too simple | Too few features | Underfitting |
| High Variance | Model too complex | Too many features | Overfitting |

Underfitting (High Bias)

Underfitting occurs when a model is too simple to capture the underlying pattern of the data, resulting in poor performance on both training and test data. If the real data has curvature, for example, a straight line cannot capture it; the model underfits.

Cause:

  • The model is too simple
  • It fails to capture structure in the data

Problem:

  • Poor training performance
  • Poor test performance
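A minimal plain-Python sketch of this (made-up data, closed-form least squares): fitting a straight line to quadratic data leaves a large training error, because the line cannot bend to follow the curve.

```python
# Underfitting sketch: fit y = slope*x + intercept to curved data
# via closed-form least squares; the training error stays large.
xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]          # true pattern is curved: y = x^2

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

preds = [slope * x + intercept for x in xs]
train_mse = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
print(train_mse)  # 2.8 — even the best line misses the curvature
```

Even on the data it was trained on, the best possible line has a mean squared error of 2.8: the classic high-bias signature.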

Overfitting (High Variance)

Overfitting happens when a model is too complex and starts fitting the training data perfectly, including noise, instead of capturing the real pattern.

  • Model can bend heavily to pass through every training point.

Cause:

  • The model is too complex
  • It captures noise in the training data as if it were a true pattern

Problem:

  • Low training error (good training performance) but poor test performance
  • Poor generalization to unseen data
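A minimal sketch of this in plain Python (made-up noisy data): a degree-4 polynomial interpolates every training point exactly, so training error is zero, yet it extrapolates badly just outside the training range.

```python
# Overfitting sketch: a degree-4 polynomial (Lagrange interpolation)
# passes through every noisy training point exactly.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.5, 3.2, 10.1, 14.9]   # roughly y = x^2 plus noise

def interpolate(x, xs, ys):
    """Evaluate the Lagrange polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Zero training error: the model memorizes every point, noise included.
train_mse = sum((interpolate(x, train_x, train_y) - y) ** 2
                for x, y in zip(train_x, train_y)) / len(train_x)
print(train_mse)   # 0.0

# But at x = 5, where the true pattern suggests ~25, the curve has
# bent so hard to hit the noisy points that it predicts about -2.
pred = interpolate(5.0, train_x, train_y)
print(pred)
```

Perfect training performance and a wildly wrong test prediction: the high-variance signature.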

Solutions

1. Reduce model complexity (e.g., use fewer features, simpler model)

  • Reduce Number of Features: Manually select important features
  • Use automated model selection methods
  • Remove irrelevant variables

This simplifies the model.

2. Use regularization techniques (e.g., Lasso, Ridge)

  • Instead of removing features, keep them all but reduce parameter sizes.

  • Regularization adds a penalty term to the cost function to discourage complexity.

  • Regularization helps prevent overfitting by keeping the model simpler.

  • The regularization parameter λ controls the strength of the penalty. A larger λ means more regularization.
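A minimal sketch of the effect of λ (plain Python, made-up data; 1-D ridge with no intercept and without the 1/2m scaling, so the closed-form solution stays simple): the minimizer of sum((θx − y)²) + λθ² is θ = Σxy / (Σx² + λ), so a larger λ shrinks the learned weight toward zero.

```python
# Regularization-strength sketch: in 1-D ridge regression without an
# intercept, theta = sum(x*y) / (sum(x^2) + lam). Larger lam => smaller theta.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                     # noiseless y = 2x

def ridge_theta(lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

weights = [ridge_theta(lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print(weights)   # shrinks monotonically: 2.0, 1.866..., 1.166..., 0.245...
```

With λ = 0 we recover the unregularized fit (θ = 2); as λ grows, the penalty dominates and θ is pulled toward zero.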

Lasso vs Ridge

| Feature | Lasso (L1) | Ridge (L2) |
| --- | --- | --- |
| Penalty | Sum of absolute values $\sum \lvert \theta_j \rvert$ | Sum of squares $\sum \theta_j^2$ |
| Effect | Can shrink some coefficients exactly to 0 → feature selection | Shrinks coefficients but rarely to 0 |
| Use Case | Many irrelevant features | Prevent overfitting, keep all features |


The idea:

  • Large weights → complex model
  • Small weights → smoother model

🔹 Lasso Regression (L1 Regularization)

Lasso: Cost = MSE + λ * sum(|θ|)

  • Lasso (L1) can shrink some coefficients to zero, effectively performing feature selection.

In standard linear regression, the cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Lasso adds a penalty proportional to the sum of absolute values of the coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$

Where:

  • $\lambda$ = regularization strength
  • $|\theta_j|$ = absolute value of parameter $\theta_j$
  • $\theta_0$ (bias) is usually not penalized
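A minimal plain-Python sketch of this cost for a one-feature hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ (made-up data; note $\theta_0$ is excluded from the penalty):

```python
# Lasso cost sketch: (1/2m) * sum of squared errors + lam * |theta1|.
# The bias theta0 is not penalized, matching the formula above.
def lasso_cost(theta0, theta1, xs, ys, lam):
    m = len(xs)
    mse_term = sum((theta0 + theta1 * x - y) ** 2
                   for x, y in zip(xs, ys)) / (2 * m)
    penalty = lam * abs(theta1)          # L1 penalty, bias excluded
    return mse_term + penalty

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                     # perfectly fit by theta1 = 2
cost = lasso_cost(0.0, 2.0, xs, ys, lam=0.5)
print(cost)   # residuals are zero, so cost = 0.5 * |2.0| = 1.0
```

Even a perfect fit pays the penalty term, which is exactly how regularization discourages large weights.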

🏔️ Ridge Regression (L2 Regularization)

Ridge: Cost = MSE + λ * sum(θ^2)

  • Ridge (L2) shrinks coefficients but does not set them to zero.

The standard linear regression cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Ridge adds a penalty proportional to the sum of squared coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

Where:

  • $\lambda$ = regularization strength
  • $\theta_j$ = model parameters
  • $\theta_0$ (bias) is usually not penalized
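One way to see why L1 zeroes coefficients while L2 only shrinks them is through the standard single-weight shrinkage updates (a plain-Python sketch of the proximal operators, not a full regression solver): ridge divides a weight by $(1 + \lambda)$, which never reaches zero, while Lasso soft-thresholds, setting any weight smaller than $\lambda$ exactly to zero.

```python
# Shrinkage sketch for a single weight w:
#   ridge:  w / (1 + lam)                  -> multiplicative shrink, never 0
#   lasso:  sign(w) * max(|w| - lam, 0)    -> soft-threshold, exactly 0 if small
def ridge_shrink(w, lam):
    return w / (1 + lam)

def lasso_shrink(w, lam):
    mag = max(abs(w) - lam, 0.0)
    return mag if w >= 0 else -mag

weights = [0.05, -0.3, 2.0]
lam = 0.1
print([ridge_shrink(w, lam) for w in weights])   # all remain nonzero
print([lasso_shrink(w, lam) for w in weights])   # smallest weight becomes 0.0
```

This is the mechanism behind the table above: Lasso performs feature selection by driving weak coefficients exactly to zero, while Ridge keeps every feature with a reduced weight.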
