
Regularized Linear Regression

Learn how regularization helps prevent overfitting in linear regression by adding a penalty term to the cost function, modifying the gradient descent update rules, and improving model generalization.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


⚖️ Regularized Linear Regression

Regularization can be applied to both linear and logistic regression.

We first consider linear regression.


Gradient Descent with Regularization

We modify gradient descent so that the bias term $\theta_0$ is not penalized: $\theta_0$ is excluded from regularization.

Update Rules

Before regularization (standard gradient descent), repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

and for $j \in \{1,2,\dots,n\}$:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

With regularization, repeat until convergence:

For $\theta_0$ (unregularized):

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

For the remaining $j \in \{1,2,\dots,n\}$, add the regularization term:

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m}\theta_j \right]$$
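The update rules above can be sketched in NumPy as follows. This is my own illustration, not code from the course; the function and variable names are assumptions.

```python
import numpy as np

# Illustrative sketch (not from the post): one step of the regularized
# gradient-descent update, leaving theta[0] unpenalized.
def gradient_step(theta, X, y, alpha, lam):
    m = len(y)
    h = X @ theta                       # hypothesis h_theta(x) for all examples
    grad = X.T @ (h - y) / m            # unregularized gradient, all j
    reg = (lam / m) * theta             # (lambda/m) * theta_j penalty term
    reg[0] = 0.0                        # do not regularize theta_0
    return theta - alpha * (grad + reg)

# Toy data (y = 1 + x); X includes the bias column x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

theta_plain = np.zeros(2)
theta_reg = np.zeros(2)
for _ in range(5000):
    theta_plain = gradient_step(theta_plain, X, y, alpha=0.1, lam=0.0)
    theta_reg = gradient_step(theta_reg, X, y, alpha=0.1, lam=1.0)
```

With $\lambda = 0$ this reduces to standard gradient descent; a positive $\lambda$ shrinks $\theta_1$ toward zero while leaving $\theta_0$ untouched by the penalty.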

Simplified Update Rule

The update for $j \ge 1$ can be rearranged to:

$$\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Intuition

The factor

$$\left(1 - \alpha \frac{\lambda}{m}\right)$$

is slightly less than 1, so each update shrinks $\theta_j$ a little.

This is called weight decay.

The second term is exactly the same as standard gradient descent.
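A quick numeric check, using toy values of my own (not from the post), that the bracketed update and the weight-decay form compute the same thing:

```python
# Assumed toy values: current theta_j, its gradient term, and alpha, lambda, m.
m, alpha, lam = 4, 0.1, 0.5
theta_j = 2.0
grad_j = 0.3          # stands in for (1/m) * sum((h - y) * x_j)

# Bracketed form: theta_j - alpha * [grad_j + (lambda/m) * theta_j]
bracketed = theta_j - alpha * (grad_j + (lam / m) * theta_j)

# Weight-decay form: theta_j * (1 - alpha*lambda/m) - alpha * grad_j
decay = theta_j * (1 - alpha * lam / m) - alpha * grad_j

shrink = 1 - alpha * lam / m    # the weight-decay factor, here 0.9875 < 1
```

Both expressions produce the identical next value of $\theta_j$, which is why the two forms are interchangeable.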


Normal Equation with Regularization

Instead of iterative gradient descent, we can use the normal equation.

Without regularization:

$$\theta = (X^T X)^{-1} X^T y$$

With regularization (ridge regression):

$$\theta = (X^T X + \lambda L)^{-1} X^T y$$

This discourages large parameter values and reduces overfitting.

where the matrix $L$ is:

$$L = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} = \operatorname{diag}(0,1,1,\dots,1)$$

Properties:

  • Dimension: $(n+1) \times (n+1)$
  • $L$ is the identity matrix except that the top-left entry is 0.
  • The first diagonal entry is 0 because $\theta_0$ (the bias term) is not regularized.
  • The remaining diagonal entries are 1 because $\theta_1$ through $\theta_n$ are regularized.
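The closed-form solution above can be sketched as follows. This is a minimal illustration of my own (`ridge_normal_equation` is an assumed name), using `np.linalg.solve` instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y with L = diag(0, 1, ..., 1)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                      # leave the bias term theta_0 unregularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Toy data (y = 1 + x); X includes the bias column x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = ridge_normal_equation(X, y, lam=0.0)   # lam = 0 gives plain least squares
```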

Why Regularization Helps

If $m < n$, then $X^T X$ is non-invertible; if $m = n$, it may or may not be invertible, where:

  • $m$ = number of training examples
  • $n$ = number of features

With regularization, however, we add $\lambda L$ to $X^T X$:

$$X^T X + \lambda L$$

and the resulting matrix is invertible for any $\lambda > 0$. This also improves numerical stability.
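A small demonstration, with toy numbers of my own, that when there are fewer examples than parameters $X^T X$ is singular while $X^T X + \lambda L$ has full rank:

```python
import numpy as np

# One training example (m = 1) with a bias and two features (n = 2),
# so X^T X is 3x3 but has rank at most 1.
X = np.array([[1.0, 2.0, 3.0]])
A = X.T @ X
L = np.diag([0.0, 1.0, 1.0])           # no penalty on theta_0
B = A + 0.1 * L                        # lambda = 0.1

rank_A = np.linalg.matrix_rank(A)      # 1: singular, cannot be inverted
rank_B = np.linalg.matrix_rank(B)      # 3: full rank, invertible
```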


Key Takeaways

Regularization:

  • Prevents overfitting
  • Shrinks large weights
  • Does not penalize $\theta_0$ (the bias term)

Two Methods:

  1. Gradient Descent (iterative)
  2. Normal Equation (closed-form)
