Hitesh Sahu


Normal Equation in Linear Regression: Formula, Intuition, and Comparison with Gradient Descent

Understand the Normal Equation in linear regression, its closed-form solution, mathematical formula, advantages, limitations, and how it compares to gradient descent for model optimization.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Normal Equation (Closed-Form Solution)

Instead of running many iterations of gradient descent, the Normal Equation computes θ in one step.

  • θ is calculated directly at the point where the cost function is minimal, using calculus rather than iterative optimization:
$$\theta = (X^T X)^{-1} X^T y$$
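
A brief sketch of where the formula comes from, assuming the usual least-squares cost for linear regression (the $\frac{1}{2m}$ scaling is an assumption here and does not change the minimizer): set the gradient of the cost to zero and solve for $\theta$.

$$J(\theta) = \frac{1}{2m}(X\theta - y)^T (X\theta - y)$$

$$\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0 \;\Rightarrow\; X^T X \theta = X^T y \;\Rightarrow\; \theta = (X^T X)^{-1} X^T y$$

The last step requires $X^T X$ to be invertible.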

Advantages

  • No learning rate required
  • Direct computation

Limitations

  • Computationally expensive when the number of features n is very large
  • Inverting the (n+1)×(n+1) matrix $X^T X$ is costly, roughly $O(n^3)$

Steps:

  • Construct the design matrix X from the feature columns and add a column of 1s as the first column
  • Construct the vector y from the target values
  • Calculate:

$$\theta = (X^T X)^{-1} X^T y$$
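
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `normal_equation` and the toy data are made up for this example, and the code solves the linear system $X^T X \theta = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same θ when $X^T X$ is invertible.

```python
import numpy as np

def normal_equation(features, y):
    """Return theta that minimizes squared error, via the normal equation.

    `features` has shape (m, n); `y` has shape (m,).
    (Hypothetical helper written for this example.)
    """
    m = features.shape[0]
    # Step 1: design matrix with a column of ones in the first column.
    X = np.hstack([np.ones((m, 1)), features])
    # Step 3: solve (X^T X) theta = X^T y instead of inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated from y = 1 + 2x (made up for illustration).
features = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # Step 2: target vector

theta = normal_equation(features, y)
print(theta)  # approximately [1. 2.]
```

Because the toy data lie exactly on a line, the recovered θ matches the intercept and slope used to generate them.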

Mean Normalization

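Feature scaling (including mean normalization) is not required for the Normal Equation method, because θ is computed directly rather than by iterative descent.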

Normal Equation vs Gradient Descent:

| Aspect | Gradient Descent | Normal Equation |
|---|---|---|
| Implementation | More involved; α has to be tuned and debugged | Convenient and simple to implement |
| Learning rate (α) | Must be chosen | Not needed |
| Feature scaling | Required | Not needed |
| Iterations | Many iterations required | None |
| Feature set of a million or more | Efficient even when n is huge, $O(kn^2)$ | Slow when n is huge; computing the inverse costs $O(n^3)$ |
| Complex learning algorithms | Can be used for more complex learning algorithms | Not supported |

Vectorized prediction for a single hypothesis: given a data set and the θ values, all predictions can be computed with one matrix-vector product, which is much faster than nested for loops.

Data Matrix × Parameter Vector = Prediction Vector

$h_\theta(x) = \theta_0 + \theta_1 x$
[1, x] × [θ vector] = [h(x)]
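
As a small sketch of this product in NumPy (the input values and parameters below are hypothetical, chosen only for illustration), each row of the data matrix is [1, x] and a single `@` produces every prediction:

```python
import numpy as np

# Hypothetical inputs and parameters, for illustration only.
x = np.array([1.0, 2.0, 3.0])
theta = np.array([-40.0, 0.25])  # [theta_0, theta_1]

# Data matrix: one row [1, x] per example.
X = np.column_stack([np.ones_like(x), x])

# Data Matrix x Parameter Vector = Prediction Vector
predictions = X @ theta

# The same result computed with an explicit loop, for comparison.
looped = np.array([theta[0] + theta[1] * xi for xi in x])
```

Both `predictions` and `looped` hold the same values; the vectorized form simply pushes the loop into optimized linear-algebra routines.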


Usage:

  • Faster prediction for multiple hypotheses at once, given a data set and several sets of θ values
  • Much faster than nested for loops

Data Matrix × Parameter Matrix = Prediction Matrix

  given $h_\theta(x) = \theta_0 + \theta_1 x$
  [1, x] × [θ matrix] = [h(x)], where each column of the θ matrix holds the parameters of one hypothesis
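
A minimal sketch of the multi-hypothesis case, assuming each column of the parameter matrix holds [θ0, θ1] for one hypothesis (the three hypotheses and inputs below are made up for illustration):

```python
import numpy as np

# Hypothetical inputs, for illustration only.
x = np.array([1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # shape (3, 2), rows are [1, x]

# Each column is one hypothesis: [theta_0, theta_1].
Theta = np.array([[0.0, 1.0, -1.0],
                  [1.0, 2.0,  0.5]])       # shape (2, 3)

# Data Matrix x Parameter Matrix = Prediction Matrix
# P[i, j] is the prediction of hypothesis j for example i.
P = X @ Theta
print(P.shape)  # (3, 3)
```

Column j of `P` equals `X @ Theta[:, j]`, so one matrix product replaces a loop over hypotheses and a loop over examples.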



Normal Equation with Regularization

Instead of iterative gradient descent, we can use the normal equation.

Without Regularization:

$$\theta = (X^T X)^{-1} X^T y$$

With Regularization (Ridge Regression):

$$\theta = (X^T X + \lambda L)^{-1} X^T y$$

This discourages large parameter values and reduces overfitting.

where the matrix $L$ is

$$L = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}$$

Properties:

  • It is the identity matrix except that the top-left element is 0.
  • Dimension: $(n+1) \times (n+1)$

The first diagonal entry is 0 because $\theta_0$ is not regularized.

The remaining diagonal entries are 1 because $\theta_1$ through $\theta_n$ are regularized.

$$L = \text{diag}(0, 1, 1, \dots, 1)$$

This ensures:

  • $\theta_0$ (the bias term) is not regularized
  • All other parameters are regularized
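
The regularized formula can be sketched in NumPy as below. The function name `regularized_normal_equation`, the toy data, and the value λ = 10 are assumptions made for this example; as a design choice the code solves the linear system rather than computing the inverse explicitly.

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y.

    X is the design matrix including the leading column of ones.
    (Hypothetical helper written for this example.)
    """
    # L = diag(0, 1, 1, ..., 1): identity with the top-left entry zeroed,
    # so the bias term theta_0 is left unregularized.
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Toy data generated from y = 1 + 2x (made up for illustration).
X = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_unreg = regularized_normal_equation(X, y, lam=0.0)
theta_reg = regularized_normal_equation(X, y, lam=10.0)
print(theta_unreg, theta_reg)
```

With λ = 0 the result is the ordinary least-squares fit; with λ = 10 the slope θ1 shrinks toward zero while θ0 is free to adjust, which is the shrinkage that discourages large parameter values.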

Why Regularization Helps

If $m < n$, then $X^T X$ is non-invertible (singular), because its rank can be at most $m$.
If $m = n$, $X^T X$ is still singular when $X$ includes the column of ones, since $X$ then has $n+1$ columns but only $n$ rows.

where

  • $m$ = number of training examples
  • $n$ = number of features

However, with regularization we add $\lambda L$ to $X^T X$:

$$X^T X + \lambda L$$

That makes the whole term invertible (for $\lambda > 0$).

This improves numerical stability.
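
A small numerical check of this claim, using randomly generated data with fewer examples than features (the sizes m = 3, n = 5, the random seed, and λ = 1 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5  # fewer training examples than features

# Design matrix with a leading column of ones: shape (m, n + 1).
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])

A = X.T @ X                      # (n+1) x (n+1), rank at most m
L = np.eye(n + 1)
L[0, 0] = 0.0                    # do not regularize theta_0
lam = 1.0
A_reg = A + lam * L

rank_A = np.linalg.matrix_rank(A)
rank_A_reg = np.linalg.matrix_rank(A_reg)
print(rank_A, rank_A_reg)
```

Here `rank_A` cannot exceed m, so `A` is singular, while `A_reg` has full rank n + 1 and can be used in the regularized normal equation.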

