
Backpropagation Intuition

Backpropagation is the algorithm used to compute the gradients of the cost function with respect to the parameters in a neural network. This post provides an intuitive understanding of how backpropagation works and why it is essential for training deep learning models.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Important Corrections

  • The output layer error term should be:
    $$\delta^{(4)} = a^{(4)} - y$$
  • The cost function term must include proper parentheses:
    $$J(\Theta) = -\frac{1}{m} \sum_{t=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(t)} \log\big((h_\Theta(x^{(t)}))_k\big) + (1 - y_k^{(t)}) \log\big(1 - (h_\Theta(x^{(t)}))_k\big) \right] + \text{regularization}$$
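
For concreteness, here is a minimal NumPy sketch of the unregularized double sum above. The names `H` and `Y` are hypothetical stand-ins for the $m \times K$ matrix of predictions $(h_\Theta(x^{(t)}))_k$ and the matching one-hot labels $y_k^{(t)}$; the regularization term is left out.

```python
import numpy as np

def cross_entropy_cost(H, Y):
    """Unregularized J(Theta): H and Y are (m, K) arrays of
    predictions (h_Theta(x^(t)))_k and one-hot labels y_k^(t)."""
    m = H.shape[0]
    return -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

# Tiny usage example with made-up numbers (m = 2 examples, K = 2 classes).
H = np.array([[0.9, 0.1],
              [0.2, 0.8]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(cross_entropy_cost(H, Y))  # ≈ 0.3285
```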

Simplified Cost (Binary Classification, No Regularization)

If we ignore multiclass outputs and regularization, the cost for training example $t$ (keeping the leading minus sign that $J(\Theta)$ carries, so that $\delta^{(4)} = a^{(4)} - y$ holds below) is:

$$\text{cost}(t) = -\Big[\, y^{(t)} \log\big(h_\Theta(x^{(t)})\big) + (1 - y^{(t)}) \log\big(1 - h_\Theta(x^{(t)})\big) \,\Big]$$

❗ What Is $\delta$?

Intuitively:

$$\delta_j^{(l)}$$

represents the error of unit $j$ in layer $l$.

More formally:

$$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \text{cost}(t)$$

So:

  • $\delta$ is the derivative of the cost with respect to $z$
  • It measures how much that unit contributed to the error
  • Larger magnitude → steeper slope → more incorrect
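
A quick numerical check of this definition: a minimal sketch, assuming a single sigmoid output unit with made-up values, using the per-example cost from above. The finite-difference slope of the cost with respect to $z$ should match $a - y$, the output-layer error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(z, y):
    # Per-example cost from above, written in terms of z (a = g(z)).
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6

numeric = (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)  # ∂cost/∂z
analytic = sigmoid(z) - y                                    # a - y
print(numeric, analytic)  # both ≈ -0.33181
```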

How Backpropagation Works

Backpropagation computes errors from right to left.

We start at the output layer:

$$\delta^{(L)} = a^{(L)} - y$$

Then propagate backward using:

$$\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) \;.\!*\; g'(z^{(l)})$$

For sigmoid activation:

$$g'(z^{(l)}) = a^{(l)} \;.\!*\; (1 - a^{(l)})$$

So equivalently:

$$\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) \;.\!*\; a^{(l)} \;.\!*\; (1 - a^{(l)})$$
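
In code, one backward step is a one-liner. A minimal sketch, assuming `Theta`, `delta_next`, and `a` are NumPy arrays holding layer $l$'s weights, the next layer's error, and the activations saved from the forward pass (bias handling is omitted to keep it short):

```python
import numpy as np

def backward_step(Theta, delta_next, a):
    """delta^(l) = (Theta^(l))^T @ delta^(l+1) .* a^(l) .* (1 - a^(l))."""
    return (Theta.T @ delta_next) * a * (1 - a)
```

Reusing the stored activations `a` means $g'(z^{(l)})$ never has to be recomputed from $z^{(l)}$, which is why the forward-pass values are cached.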

Geometric Interpretation

Think of the network as a graph:

  • Nodes = neurons
  • Edges = weights $\Theta_{ij}$
  • Errors flow backward through edges

To compute $\delta_j^{(l)}$:

  • Take all connections going forward from unit $j$
  • Multiply each weight by the corresponding $\delta$
  • Sum them up

This is simply the chain rule applied repeatedly.

Example:

To compute:

$$\delta_2^{(2)}$$

We sum over the next layer:

$$\delta_2^{(2)} = \Theta_{12}^{(2)} \delta_1^{(3)} + \Theta_{22}^{(2)} \delta_2^{(3)}$$

Another example:

To compute:

$$\delta_2^{(3)}$$

We sum contributions from the next layer:

$$\delta_2^{(3)} = \Theta_{12}^{(3)} \delta_1^{(4)}$$
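
Both sums are just single entries of the matrix product $(\Theta^{(l)})^T \delta^{(l+1)}$ (with the activation-derivative factor left out, as in the examples above). A tiny sketch with made-up numbers, checking the first example's sum against the vectorized form:

```python
import numpy as np

# Hypothetical 2x2 weight matrix: Theta2[i, j] connects unit j+1 of
# layer 2 to unit i+1 of layer 3 (0-based indexing, no bias column).
Theta2 = np.array([[0.1, 0.4],
                   [0.3, 0.2]])
delta3 = np.array([0.5, -0.2])

by_hand    = Theta2[0, 1] * delta3[0] + Theta2[1, 1] * delta3[1]
vectorized = (Theta2.T @ delta3)[1]   # entry for unit 2 of layer 2
print(by_hand, vectorized)            # both ≈ 0.16
```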

Core Insight

Backpropagation is:

  • Repeated application of the chain rule
  • Error flowing from output to input
  • Weighted by connection strengths
  • Modulated by the activation derivative

In short:

Forward pass computes predictions.
Backward pass computes gradients.
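
Putting the two passes together: a minimal end-to-end sketch for a hypothetical 2-3-1 sigmoid network (biases, regularization, and averaging over the $m$ training examples are all omitted to keep it short):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 2))  # layer 1 -> layer 2
Theta2 = rng.normal(size=(1, 3))  # layer 2 -> layer 3 (output)

x = np.array([0.5, -1.0])
y = np.array([1.0])

# Forward pass: computes predictions.
a1 = x
a2 = sigmoid(Theta1 @ a1)
a3 = sigmoid(Theta2 @ a2)

# Backward pass: computes gradients.
delta3 = a3 - y                               # output-layer error
delta2 = (Theta2.T @ delta3) * a2 * (1 - a2)  # propagate backward

# Gradients of the cost w.r.t. the weights: delta^(l+1) (a^(l))^T.
grad_Theta2 = np.outer(delta3, a2)
grad_Theta1 = np.outer(delta2, a1)
```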

AI-DeepLearning/8-Backword-Propagation-Intuition