Hitesh Sahu



Cost Function Regularization: Balancing Bias and Variance in Machine Learning Models

Learn how cost function regularization helps prevent overfitting in machine learning models by adding a penalty term to the cost function, controlling model complexity, and improving generalization performance.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Regularization 🛑

If a model is overfitting, we can reduce the influence of certain terms by increasing their cost. This discourages large weights.

Regularization balances:

  • Bias
  • Variance

Regularization techniques

Regularization is used to reduce variance and address overfitting:

  • Instead of removing features, keep them all but reduce parameter sizes.

  • Regularization adds a penalty term to the cost function to discourage complexity.

  • Regularization helps prevent overfitting by keeping the model simpler.

  • The regularization parameter λ controls the strength of the penalty. A larger λ means more regularization.


The idea:

  • Large weights → complex model
  • Small weights → smoother model

Regularization Term


General Regularized Cost Function

In standard linear regression, the cost function is:

Mean Squared Error

Measures how well the model fits the training data.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
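As a minimal sketch, the cost above can be computed with NumPy. The function name `mse_cost` and the three-point dataset are illustrative assumptions, and the first column of `X` is assumed to be the constant 1 that multiplies the bias term.

```python
import numpy as np

def mse_cost(theta, X, y):
    """Unregularized cost: J(theta) = 1/(2m) * sum((X @ theta - y)**2)."""
    m = len(y)
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * m)

# Tiny illustrative dataset; column 0 of X is the constant 1 for theta_0.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

perfect = mse_cost(np.array([0.0, 1.0]), X, y)  # h(x) = x fits every point
off = mse_cost(np.array([0.0, 0.0]), X, y)      # h(x) = 0 misses every point
print(perfect, off)
```

A perfect fit gives a cost of zero, and the cost grows with the squared size of the errors.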

We add a regularization term to it.

Regularization term

Penalizes large parameter values to prevent overfitting.

The Regularization term is:

$$\lambda \sum_{j=1}^{n} \theta_j^2$$

  • $\lambda$ is the regularization parameter that controls the strength of regularization.
  • This term penalizes large values of $\theta_j$, encouraging smaller weights and thus simpler models.

The penalized parameters are:

$$\theta_1, \dots, \theta_n$$

This explicitly excludes the bias term $\theta_0$.

  • Regularization runs from $j = 1$ to $n$
  • So $\theta_0$ is not penalized

Why Exclude $\theta_0$?

The bias term only shifts the hypothesis (or the decision boundary) up or down; it does not make the model more complex.

We do not want to shrink it toward zero.

Only the other parameters are regularized.

So the effective cost becomes:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

Training then solves $\min_\theta J(\theta)$.

A single summation over $j = 1$ to $n$ regularizes all the weights at once.

The bias parameter $\theta_0$ is not regularized.
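A small sketch of this regularized cost, assuming a NumPy design matrix whose first column is all ones. The name `regularized_cost` and the sample values are made up for illustration; the slice `theta[1:]` is what keeps $\theta_0$ out of the penalty.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """1/(2m) * sum of squared errors + lam * sum_{j>=1} theta_j**2."""
    m = len(y)
    residuals = X @ theta - y
    fit_term = float(residuals @ residuals) / (2 * m)
    penalty = lam * float(np.sum(theta[1:] ** 2))  # theta[0] (bias) is skipped
    return fit_term + penalty

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([5.0, 1.0])  # large bias, small weight

cost_no_reg = regularized_cost(theta, X, y, lam=0.0)
cost_reg = regularized_cost(theta, X, y, lam=0.5)
print(cost_no_reg, cost_reg)
```

Here the two costs differ only by `0.5 * 1.0**2`: the large bias value of 5 adds nothing to the penalty.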

Regularization Algos

Lasso vs Ridge

| Feature | Lasso (L1) 🔹 | Ridge (L2) |
|---|---|---|
| Penalty | Sum of absolute values $\sum \lvert \theta_j \rvert$ | Sum of squares $\sum \theta_j^2$ |
| Effect | Can shrink some coefficients exactly to 0 → feature selection | Shrinks coefficients but rarely to 0 |
| Use Case | Many irrelevant features | Prevent overfitting, keep all features |

1. Lasso Regression (L1 Regularization) 🔹

Lasso: Cost = MSE + λ * sum(|θ|)

  • Lasso (L1) can shrink some coefficients to zero, effectively performing feature selection.

Lasso adds a penalty proportional to the sum of absolute values of the coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$

Where:

  • $\lambda$ = regularization strength
  • $|\theta_j|$ = absolute value of parameter $\theta_j$
  • $\theta_0$ (bias) is usually not penalized
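To see why an absolute-value penalty can produce exact zeros, it helps to look at a simplified one-parameter problem rather than the full cost above. Assuming we minimize $\frac{1}{2}(\theta - z)^2 + \lambda|\theta|$ for a fixed value $z$, the minimizer is the soft-thresholding rule; with a squared penalty $\lambda\theta^2$ instead, the minimizer is $z / (1 + 2\lambda)$. The function names below are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (theta - z)**2 + lam * |theta| (one-parameter L1 problem)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5 * (theta - z)**2 + lam * theta**2 (one-parameter L2 problem)."""
    return z / (1.0 + 2.0 * lam)

z = np.array([3.0, 0.4, -0.2, -2.5])  # illustrative unpenalized values
lam = 0.5

lasso_theta = soft_threshold(z, lam)  # entries with |z| <= lam become exactly 0
ridge_theta = l2_shrink(z, lam)       # every entry is scaled down, none becomes 0
print(lasso_theta)
print(ridge_theta)
```

The L1 rule subtracts a fixed amount and clips at zero, while the L2 rule rescales, which is why only the former removes features outright.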

2. Ridge Regression (L2 Regularization) 🏔️

Ridge: Cost = MSE + λ * sum(θ^2)

  • Ridge (L2) shrinks coefficients but generally does not set them exactly to zero.

Ridge adds a penalty proportional to the sum of squared coefficients:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

Where:

  • $\lambda$ = regularization strength
  • $\theta_j$ = model parameters
  • $\theta_0$ (bias) is usually not penalized
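For a linear hypothesis $h_\theta(x) = \theta^T x$, setting the gradient of the Ridge cost above to zero gives a linear system, $(X^T X + 2m\lambda D)\,\theta = X^T y$, where $D$ is the identity matrix with its first diagonal entry set to 0 so that $\theta_0$ stays unpenalized. The factor $2m$ follows from this article's scaling of the cost; other texts scale the penalty differently. The sketch below solves that system on synthetic data; `ridge_fit`, `ridge_gradient`, and the data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + 2*m*lam*D) theta = X^T y, with D[0, 0] = 0 for the bias."""
    m, num_params = X.shape
    D = np.eye(num_params)
    D[0, 0] = 0.0  # do not penalize theta_0
    return np.linalg.solve(X.T @ X + 2 * m * lam * D, X.T @ y)

def ridge_gradient(theta, X, y, lam):
    """Gradient of 1/(2m) * ||X theta - y||^2 + lam * sum_{j>=1} theta_j^2."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += 2 * lam * theta[1:]
    return grad

# Synthetic data: a noisy line, fitted with bias, x and x**2 features.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
X = np.column_stack([np.ones_like(x), x, x**2])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=20)

theta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to least squares
theta_ridge = ridge_fit(X, y, lam=0.1)
print(theta_ols)
print(theta_ridge)
```

The gradient function is only there as a check: at the returned solution it should be numerically zero.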

Regularization Parameter $\lambda$

Regularization shrinks parameters: the larger $\lambda$ is, the more shrinkage you see.

Choosing $\lambda$ correctly is essential for good generalization.

The regularization parameter λ (lambda) controls the tradeoff between bias and variance.

| Lambda (λ) | Model Complexity | Bias | Variance |
|---|---|---|---|
| Very Small (0) | Very Complex | Low | High |
| Moderate | Balanced | Moderate | Moderate |
| Very Large | Very Simple | High | Low |

Larger $\lambda$ → stronger regularization

$\lambda \to \infty$ → all penalized parameters shrink to zero → model becomes too simple → underfitting

  • Parameter weights $\theta_j$ shrink toward zero
  • Reduces model complexity and makes the model more rigid (closer to a flat line)
  • Underfitting may occur
    • Bias increases
    • Variance decreases

Example:

$$\lambda = 1 \Rightarrow \theta = [13.01, 0.91]$$

Smaller $\lambda$ (as $\lambda \to 0$)

$\lambda \to 0$ → no regularization → model may overfit

Weaker regularization → less penalty → larger weights $\theta_j$

  • Parameter weights grow larger
  • The model becomes more complex and more flexible/curvy
  • Risk of overfitting
    • Variance increases
    • Bias decreases

Small $\lambda$ → low bias, high variance (overfitting)

Example:

$$\lambda = 0.01 \Rightarrow \theta = [81.01, 12.00]$$

What Happens If $\lambda = 0$?

  • No regularization is applied
  • The model may overfit
  • We revert to standard least squares / logistic regression
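The effect of $\lambda$ on the weights can be checked numerically. This is a sketch under assumptions: it uses a squared (Ridge) penalty with the bias left unpenalized, a degree-5 polynomial on synthetic data, and a hypothetical helper `ridge_fit` that solves the resulting linear system. None of the numbers come from the article.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize 1/(2m)*||X theta - y||^2 + lam * sum_{j>=1} theta_j^2 in closed form."""
    m, num_params = X.shape
    D = np.eye(num_params)
    D[0, 0] = 0.0  # bias is not penalized
    return np.linalg.solve(X.T @ X + 2 * m * lam * D, X.T @ y)

# Synthetic data and degree-5 polynomial features (x**0 is the bias column).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=15)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=15)
X = np.column_stack([x**d for d in range(6)])

lambdas = [0.0, 0.01, 1.0, 100.0]
norms = []
for lam in lambdas:
    theta = ridge_fit(X, y, lam)
    norms.append(float(np.sqrt(np.sum(theta[1:] ** 2))))
    print(f"lambda={lam:<6} norm of penalized weights = {norms[-1]:.4f}")
```

The size of the penalized weights never increases as $\lambda$ grows, and with a very large $\lambda$ the fit is close to a constant.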

How to Choose the Best λ

To select the optimal regularization parameter:

  1. Choose candidate λ values
  2. Train models for each λ
  3. Compute cross-validation error (without regularization)
  4. Select best λ + model
  5. Evaluate once on test set

1. Create Candidate Values

Example:

$$\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\}$$
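The candidate values in this example are 0 followed by 0.01 doubled repeatedly, so the grid can be generated rather than typed out. The variable name `candidate_lambdas` is an illustrative choice.

```python
# 0 (no regularization) plus 0.01 doubled ten times: 0.01, 0.02, ..., 10.24
candidate_lambdas = [0.0] + [round(0.01 * 2**k, 2) for k in range(11)]
print(candidate_lambdas)
```

Doubling covers several orders of magnitude with only a dozen training runs.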

2. Train Models

For each value of λ:

  • Train model parameters Θ
  • Possibly try different model complexities (degrees, architectures, etc.)

3. Compute Cross-Validation Error

Evaluate using:

$$J_{CV}(\Theta)$$

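Important:

  • Compute the cross-validation error without the regularization term
  • That means using λ = 0 in the cost only when evaluating; each Θ is still trained with its own λ

This ensures a fair comparison between models, because the penalty term would otherwise differ from one λ to the next.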

4. Select Best Combination

Choose the model and λ that produce the lowest cross-validation error.

5. Final Evaluation

Using the best:

  • Θ
  • λ

Evaluate on the test set:

$$J_{test}(\Theta)$$

This measures generalization performance.
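The five steps above can be sketched end to end. Everything specific here is an assumption made for illustration: the synthetic data, the 60/20/20 split, the degree-8 polynomial features, the Ridge penalty, and the helper names. The part that follows the procedure is the structure: train with each λ, score on the cross-validation set without the penalty, pick the lowest score, then evaluate once on the test set.

```python
import numpy as np

def poly_features(x, degree):
    """Columns x**0 (bias), x**1, ..., x**degree."""
    return np.column_stack([x**d for d in range(degree + 1)])

def ridge_fit(X, y, lam):
    """Minimize 1/(2m)*||X theta - y||^2 + lam * sum_{j>=1} theta_j^2."""
    m, num_params = X.shape
    D = np.eye(num_params)
    D[0, 0] = 0.0  # bias is not penalized
    return np.linalg.solve(X.T @ X + 2 * m * lam * D, X.T @ y)

def unregularized_error(theta, X, y):
    """Squared-error cost with no penalty term, used for J_CV and J_test."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data, split 60/20/20 into train / cross-validation / test.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=90)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=90)
degree = 8
X_train, y_train = poly_features(x[:54], degree), y[:54]
X_cv, y_cv = poly_features(x[54:72], degree), y[54:72]
X_test, y_test = poly_features(x[72:], degree), y[72:]

# Step 1: candidate values.
candidate_lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

# Steps 2 and 3: train with each lambda, score on the CV set without the penalty.
results = []
for lam in candidate_lambdas:
    theta = ridge_fit(X_train, y_train, lam)
    results.append((unregularized_error(theta, X_cv, y_cv), lam, theta))

# Step 4: keep the lambda and parameters with the lowest CV error.
best_cv_error, best_lam, best_theta = min(results, key=lambda r: r[0])

# Step 5: evaluate once on the held-out test set.
test_error = unregularized_error(best_theta, X_test, y_test)
print(f"best lambda = {best_lam}, J_cv = {best_cv_error:.4f}, J_test = {test_error:.4f}")
```

Because the test set plays no part in choosing λ, `test_error` is an honest estimate of how the chosen model generalizes.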


Example: Polynomial Hypothesis

Consider the function:

$$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

If we want the model to behave more like a quadratic function, we can reduce the influence of:

$$\theta_3 x^3 \quad \text{and} \quad \theta_4 x^4$$

Instead of removing these features, we modify the cost function.

Regularized Cost Function

We minimize:

$$\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000 \theta_3^2 + 1000 \theta_4^2$$

Effect of Large Penalty

Adding large penalty terms forces:

$$\theta_3 \approx 0 \quad \text{and} \quad \theta_4 \approx 0$$

This reduces the contribution of:

$$\theta_3 x^3 \quad \text{and} \quad \theta_4 x^4$$

As a result:

  • The hypothesis becomes smoother
  • Overfitting decreases
  • The curve behaves more like a quadratic function
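This example can be reproduced numerically. The sketch below assumes synthetic data drawn from a noisy quadratic and a hypothetical helper `fit_with_penalties` that minimizes the cost above in closed form, with a separate penalty weight per parameter (0 for $\theta_0, \theta_1, \theta_2$ and 1000 for $\theta_3, \theta_4$).

```python
import numpy as np

def fit_with_penalties(X, y, penalties):
    """Minimize 1/(2m)*||X theta - y||^2 + sum_j penalties[j] * theta_j**2."""
    m = len(y)
    return np.linalg.solve(X.T @ X + 2 * m * np.diag(penalties), X.T @ y)

# Synthetic data from a noisy quadratic, fitted with a degree-4 polynomial.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=1.0, size=30)
X = np.column_stack([x**d for d in range(5)])  # columns: 1, x, x^2, x^3, x^4

theta_free = fit_with_penalties(X, y, np.zeros(5))
theta_pen = fit_with_penalties(X, y, np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))

print("no penalty:   ", np.round(theta_free, 4))
print("with penalty: ", np.round(theta_pen, 4))
```

With the penalty in place, the last two entries of `theta_pen` are close to zero, so the fitted curve is effectively the quadratic part alone.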
