
Regularized Linear Regression

Learn how regularization helps prevent overfitting in linear regression by adding a penalty term to the cost function, modifying the gradient descent update rules, and improving model generalization.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


⚖️ Regularized Linear Regression

Regularization can be applied to both linear and logistic regression.

We first consider linear regression.


Gradient Descent with Regularization

We modify gradient descent so that the bias term $\theta_0$ is not penalized: $\theta_0$ is excluded from regularization.

Update Rules

Before regularization (standard gradient descent), repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

and for $j \in \{1,2,\dots,n\}$:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

With regularization, repeat until convergence:

For $\theta_0$ (unregularized):

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

For the remaining $j \in \{1,2,\dots,n\}$, add the regularization term:

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m}\theta_j \right]$$
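The update rules above can be sketched in NumPy as follows. This is my own illustration, not code from the course; the function and variable names are assumptions.

```python
import numpy as np

# Illustrative sketch (not from the post): one step of the regularized
# gradient-descent update, leaving theta[0] unpenalized.
def gradient_step(theta, X, y, alpha, lam):
    m = len(y)
    h = X @ theta                       # hypothesis h_theta(x) for all examples
    grad = X.T @ (h - y) / m            # unregularized gradient, all j
    reg = (lam / m) * theta             # (lambda/m) * theta_j penalty term
    reg[0] = 0.0                        # do not regularize theta_0
    return theta - alpha * (grad + reg)

# Toy data (y = 1 + x); X includes the bias column x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

theta_plain = np.zeros(2)
theta_reg = np.zeros(2)
for _ in range(5000):
    theta_plain = gradient_step(theta_plain, X, y, alpha=0.1, lam=0.0)
    theta_reg = gradient_step(theta_reg, X, y, alpha=0.1, lam=1.0)
```

With $\lambda = 0$ this reduces to standard gradient descent; a positive $\lambda$ shrinks $\theta_1$ toward zero while leaving $\theta_0$ untouched by the penalty.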

Simplified Update Rule

The update for $j \ge 1$ can be rearranged to:

$$\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Intuition

The factor

$$\left(1 - \alpha \frac{\lambda}{m}\right)$$

is slightly less than 1, so each update shrinks $\theta_j$ a little.

This is called weight decay.

The second term is exactly the same as standard gradient descent.
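A quick numeric check, using toy values of my own (not from the post), that the bracketed update and the weight-decay form compute the same thing:

```python
# Assumed toy values: current theta_j, its gradient term, and alpha, lambda, m.
m, alpha, lam = 4, 0.1, 0.5
theta_j = 2.0
grad_j = 0.3          # stands in for (1/m) * sum((h - y) * x_j)

# Bracketed form: theta_j - alpha * [grad_j + (lambda/m) * theta_j]
bracketed = theta_j - alpha * (grad_j + (lam / m) * theta_j)

# Weight-decay form: theta_j * (1 - alpha*lambda/m) - alpha * grad_j
decay = theta_j * (1 - alpha * lam / m) - alpha * grad_j

shrink = 1 - alpha * lam / m    # the weight-decay factor, here 0.9875 < 1
```

Both expressions produce the identical next value of $\theta_j$, which is why the two forms are interchangeable.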


Normal Equation with Regularization

Instead of iterative gradient descent, we can use the normal equation.

Without regularization:

$$\theta = (X^T X)^{-1} X^T y$$

With regularization (ridge regression):

$$\theta = (X^T X + \lambda L)^{-1} X^T y$$

This discourages large parameter values and reduces overfitting.

where the matrix $L$ is:

$$L = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} = \operatorname{diag}(0,1,1,\dots,1)$$

Properties:

  • Dimension: $(n+1) \times (n+1)$
  • $L$ is the identity matrix except that the top-left entry is 0.
  • The first diagonal entry is 0 because $\theta_0$ (the bias term) is not regularized.
  • The remaining diagonal entries are 1 because $\theta_1$ through $\theta_n$ are regularized.
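The closed-form solution above can be sketched as follows. This is a minimal illustration of my own (`ridge_normal_equation` is an assumed name), using `np.linalg.solve` instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y with L = diag(0, 1, ..., 1)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                      # leave the bias term theta_0 unregularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Toy data (y = 1 + x); X includes the bias column x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = ridge_normal_equation(X, y, lam=0.0)   # lam = 0 gives plain least squares
```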

Why Regularization Helps

If $m < n$, then $X^T X$ is non-invertible; if $m = n$, it may or may not be invertible, where:

  • $m$ = number of training examples
  • $n$ = number of features

With regularization, however, we add $\lambda L$ to $X^T X$:

$$X^T X + \lambda L$$

and the resulting matrix is invertible for any $\lambda > 0$. This also improves numerical stability.
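A small demonstration, with toy numbers of my own, that when there are fewer examples than parameters $X^T X$ is singular while $X^T X + \lambda L$ has full rank:

```python
import numpy as np

# One training example (m = 1) with a bias and two features (n = 2),
# so X^T X is 3x3 but has rank at most 1.
X = np.array([[1.0, 2.0, 3.0]])
A = X.T @ X
L = np.diag([0.0, 1.0, 1.0])           # no penalty on theta_0
B = A + 0.1 * L                        # lambda = 0.1

rank_A = np.linalg.matrix_rank(A)      # 1: singular, cannot be inverted
rank_B = np.linalg.matrix_rank(B)      # 3: full rank, invertible
```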


Key Takeaways

Regularization:

  • Prevents overfitting
  • Shrinks large weights
  • Does not penalize $\theta_0$ (the bias term)

Two Methods:

  1. Gradient Descent (iterative)
  2. Normal Equation (closed-form)
