
Cost Function Regularization: Balancing Bias and Variance in Machine Learning Models

Learn how cost function regularization helps prevent overfitting in machine learning models by adding a penalty term to the cost function, controlling model complexity, and improving generalization performance.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


⚖️ Cost Function Regularization

If a model is overfitting, we can reduce the influence of certain terms by penalizing them in the cost function. This discourages large weights.

Regularization balances:

  • Bias
  • Variance

General Regularized Cost Function

We can regularize all parameters using a single summation:

$$\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

Where the regularization term is:

$$\lambda \sum_{j=1}^{n} \theta_j^2$$

  • $\lambda$ is the regularization parameter that controls the strength of regularization.
  • The summation runs over $j = 1$ to $n$, excluding $\theta_0$.
  • This term penalizes large values of $\theta_j$, encouraging smaller weights and thus simpler models.
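The regularized cost can be computed directly; here is a minimal NumPy sketch (the toy data and function name are illustrative; note the article's formula uses a plain $\lambda$ factor on the penalty, while some texts scale it by $\lambda/2m$):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized squared-error cost J(theta).

    X is assumed to carry a leading column of ones, so theta[0] is the
    intercept and is excluded from the penalty (j runs from 1 to n).
    """
    m = len(y)
    residuals = X @ theta - y
    data_term = (residuals @ residuals) / (2 * m)
    penalty = lam * np.sum(theta[1:] ** 2)  # skip theta_0
    return data_term + penalty

# Toy data: y = 2x, with an intercept column of ones
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])

print(regularized_cost(theta, X, y, lam=0.0))  # 0.0: perfect fit, no penalty
print(regularized_cost(theta, X, y, lam=1.0))  # 4.0: penalty = 1 * 2^2
```

With a perfect fit the data term vanishes, so any remaining cost is purely the weight penalty — this is exactly the pressure that pushes $\theta_j$ toward zero.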

Regularization Parameter $\lambda$

Regularization shrinks parameters: the more shrinkage you observe, the larger $\lambda$ is.

Choosing $\lambda$ correctly is essential for good generalization.

Lambda controls the curvature of the decision boundary.

Larger $\lambda$ → stronger regularization

$\lambda \to \infty$ → all parameters shrink to zero → model becomes too simple → underfitting

  • Parameter weights $\theta_j$ shrink toward zero
  • Reduces model complexity and makes the model rigid/linear
  • Underfitting may occur
    • Bias increases
    • Variance decreases

Example:

$\lambda = 1 \Rightarrow \theta = [13.01,\ 0.91]$

Smaller $\lambda$ (as $\lambda \to 0$)

$\lambda \to 0$ → no regularization → model may overfit

Weaker regularization → less penalty → large weights $\theta_j$

  • Parameter weights grow larger
  • The model becomes more complex and more flexible/curvy
  • Risk of overfitting
    • Variance increases
    • Bias decreases

Small λ → Low bias, high variance (overfitting)

Example:

$\lambda = 0.01 \Rightarrow \theta = [81.01,\ 12.00]$
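The shrinkage effect is easy to demonstrate with the closed-form ridge solution. A sketch on synthetic data (the data-generating weights and $\lambda$ values are illustrative assumptions; the intercept is omitted for simplicity, so every weight is penalized here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    print(lam, ridge_fit(X, y, lam))
```

Running this shows the weight vector's norm decreasing as $\lambda$ grows — the same pattern as the article's $\theta = [81.01, 12.00]$ versus $\theta = [13.01, 0.91]$ examples.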

What Happens If $\lambda = 0$?

  • No regularization is applied
  • The model may overfit
  • We revert to standard least squares / logistic regression

How to Choose the Best λ

To select the optimal regularization parameter:

  1. Choose candidate λ values
  2. Train models for each λ
  3. Compute cross-validation error (without regularization)
  4. Select best λ + model
  5. Evaluate once on test set

1. Create Candidate Values

Example:

$\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\}$

2. Train Models

For each value of λ:

  • Train model parameters Θ
  • Possibly try different model complexities (degrees, architectures, etc.)

3. Compute Cross-Validation Error

Evaluate using:

$J_{CV}(\Theta)$

Important:

  • Compute cross-validation error without regularization
  • That means use λ = 0 when evaluating

This ensures fair comparison between models.

4. Select Best Combination

Choose the model and λ that produce the lowest cross-validation error.

5. Final Evaluation

Using the best:

  • Θ
  • λ

Evaluate on the test set:

$J_{test}(\Theta)$

This measures generalization performance.
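The five steps above can be sketched end to end in NumPy. Everything about the data here is an illustrative assumption (a degree-6 polynomial feature map on synthetic 1-D data with a train/CV/test split); the candidate $\lambda$ grid is the one from the article:

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x, degree=6):
    # Hypothetical polynomial feature map: columns 1, x, x^2, ..., x^degree
    return np.column_stack([x ** d for d in range(degree + 1)])

x = rng.uniform(-1, 1, size=60)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=60)

# Step 0: split into train / cross-validation / test sets
X = features(x)
X_train, y_train = X[:30], y[:30]
X_cv, y_cv = X[30:45], y[30:45]
X_test, y_test = X[45:], y[45:]

def train(X, y, lam):
    # Regularized normal equations; theta_0 is not penalized
    n = X.shape[1]
    reg = lam * np.eye(n)
    reg[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def cost(theta, X, y):
    # Unregularized squared error -- i.e. lambda = 0 when evaluating
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

# Steps 1-4: train per lambda, score on the CV set, pick the best
lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
results = [(lam, cost(train(X_train, y_train, lam), X_cv, y_cv)) for lam in lambdas]
best_lam, best_cv = min(results, key=lambda t: t[1])

# Step 5: evaluate the chosen model once on the test set
theta_best = train(X_train, y_train, best_lam)
print("best lambda:", best_lam, "J_test:", cost(theta_best, X_test, y_test))
```

Note that `cost` deliberately contains no penalty term, matching step 3's rule that cross-validation (and test) error are computed with $\lambda = 0$.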


Example: Polynomial Hypothesis

Consider the function:

$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

If we want the model to behave more like a quadratic function, we can reduce the influence of:

$\theta_3 x^3 \quad \text{and} \quad \theta_4 x^4$

Instead of removing these features, we modify the cost function.

Regularized Cost Function

We minimize:

$$\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$$

Effect of Large Penalty

Adding large penalty terms forces:

$\theta_3 \approx 0 \quad \text{and} \quad \theta_4 \approx 0$

This reduces the contribution of:

$\theta_3 x^3 \quad \text{and} \quad \theta_4 x^4$

As a result:

  • The hypothesis becomes smoother
  • Overfitting decreases
  • The curve behaves more like a quadratic function
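This targeted penalty can be checked numerically. A sketch solving the regularized normal equations with per-parameter penalties on synthetic, truly quadratic data (all names and data choices here are illustrative; the factor $2\,\mathrm{diag}(p)$ comes from differentiating $p_j \theta_j^2$):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
# Ground truth is quadratic: 1 + 2x - 1.5x^2 plus noise
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=100)

X = np.column_stack([x**d for d in range(5)])  # features: 1, x, x^2, x^3, x^4
m = len(y)

# Per-parameter penalties: 1000 on theta_3 and theta_4 only, as in the article
penalty = np.array([0.0, 0.0, 0.0, 1000.0, 1000.0])

# Minimizer of (1/2m)||X theta - y||^2 + sum_j p_j theta_j^2 via normal equations
theta = np.linalg.solve(X.T @ X / m + 2 * np.diag(penalty), X.T @ y / m)
print(theta)  # theta[3] and theta[4] are driven near zero
```

Because the huge diagonal entries dominate the corresponding rows of the system, $\theta_3$ and $\theta_4$ land very close to zero while $\theta_0, \theta_1, \theta_2$ stay free to fit the quadratic signal.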