Backpropagation Intuition
Backpropagation is the algorithm used to compute the gradients of the cost function with respect to the parameters in a neural network. This post provides an intuitive understanding of how backpropagation works and why it is essential for training deep learning models.
Important Corrections
- The output layer error term should be: δ^(4) = a^(4) − y
- The cost function term must include proper parentheses around the second logarithm: cost(i) = −[ y^(i) log(h_Θ(x^(i))) + (1 − y^(i)) log(1 − h_Θ(x^(i))) ]
Simplified Cost (Binary Classification, No Regularization)
If we ignore multiclass outputs and regularization, the cost for training example i is:
cost(i) = −[ y^(i) log(h_Θ(x^(i))) + (1 − y^(i)) log(1 − h_Θ(x^(i))) ]
Intuitively, cost(i) plays a role similar to the squared error (h_Θ(x^(i)) − y^(i))²: it measures how well the network is doing on example i.
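As a minimal numerical sketch of this per-example cost (the helper name `example_cost` is hypothetical):

```python
import numpy as np

def example_cost(h, y):
    """Binary cross-entropy cost of one training example.
    h is the network output h_theta(x) in (0, 1); y is the label in {0, 1}."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

# A confident correct prediction costs little; an uncertain one costs more:
print(example_cost(0.9, 1))   # ~0.105
print(example_cost(0.5, 1))   # ~0.693
```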
What Is δ?
Intuitively:
δ_j^(l) represents the "error" of unit j in layer l.
More formally:
δ_j^(l) = ∂cost(i) / ∂z_j^(l)
So:
- δ_j^(l) is the derivative of the cost with respect to z_j^(l), the weighted input of unit j in layer l
- It measures how much that unit contributed to the error
- Larger magnitude → steeper slope → more incorrect
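This definition can be checked numerically. A small sketch, assuming a single sigmoid unit feeding the binary cross-entropy cost: the analytic error term δ = a − y should match a finite-difference derivative of the cost with respect to the weighted input z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(z, y):
    # Binary cross-entropy of a single sigmoid unit with pre-activation z.
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.7, 1.0
delta = sigmoid(z) - y   # analytic error term: delta = a - y

# Central finite difference of the cost with respect to z
eps = 1e-6
numeric = (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)
# delta and numeric agree to high precision
```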
How Backpropagation Works
Backpropagation computes the error terms from right to left: output layer first, then backward through the hidden layers.
We start at the output layer:
δ^(4) = a^(4) − y
Then propagate backward using:
δ^(l) = (Θ^(l))ᵀ δ^(l+1) .* g′(z^(l))
For sigmoid activation:
g′(z^(l)) = a^(l) .* (1 − a^(l))
So equivalently:
δ^(l) = (Θ^(l))ᵀ δ^(l+1) .* a^(l) .* (1 − a^(l))
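The recursion can be sketched for a hypothetical 2→3→1 sigmoid network (biases omitted and all sizes and values made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical 3-layer network: 2 inputs -> 3 hidden units -> 1 output.
Theta1 = rng.normal(size=(3, 2))   # weights into the hidden layer
Theta2 = rng.normal(size=(1, 3))   # weights into the output layer

x = np.array([0.5, -1.0])
y = np.array([1.0])

# Forward pass
z2 = Theta1 @ x
a2 = sigmoid(z2)
z3 = Theta2 @ a2
a3 = sigmoid(z3)

# Backward pass: start at the output, then propagate through the weights.
delta3 = a3 - y                                 # output-layer error: a - y
delta2 = (Theta2.T @ delta3) * a2 * (1 - a2)    # hidden error, sigmoid derivative

# Gradients of the cost with respect to each weight matrix
grad2 = np.outer(delta3, a2)
grad1 = np.outer(delta2, x)
```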
Geometric Interpretation
Think of the network as a graph:
- Nodes = neurons
- Edges = weights
- Errors flow backward through edges
To compute δ_j^(l):
- Take all connections going forward from unit j in layer l
- Multiply each weight Θ_kj^(l) by the corresponding error δ_k^(l+1)
- Sum them up
This is simply the chain rule applied repeatedly.
Example: Computing a Hidden Layer Delta
To compute δ_2^(2), the error of unit 2 in layer 2, we sum over the next layer (here assumed to have two units):
δ_2^(2) = Θ_12^(2) δ_1^(3) + Θ_22^(2) δ_2^(3)
(For the intuition, the activation-derivative factor is left out.)
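A hypothetical numerical instance of this weighted sum of downstream errors (all values made up):

```python
# Weights on the two forward connections out of unit 2 in layer 2
# (hypothetical values):
theta_12, theta_22 = 0.4, -0.6
# Error terms of the two layer-3 units those connections reach:
delta3_1, delta3_2 = 0.3, 0.1

# Weighted sum of downstream errors (activation derivative ignored):
delta2_2 = theta_12 * delta3_1 + theta_22 * delta3_2
print(delta2_2)   # 0.4*0.3 + (-0.6)*0.1 = 0.06
```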
Another Example
To compute δ_2^(3), we sum contributions from the next layer; here layer 4 is the output layer with a single unit, so there is only one term:
δ_2^(3) = Θ_12^(3) δ_1^(4)
Core Insight
Backpropagation is:
- Repeated application of the chain rule
- Error flowing from output to input
- Weighted by connection strengths
- Modulated by the activation derivative
In short:
Forward pass computes predictions.
Backward pass computes gradients.
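A minimal end-to-end sketch tying the two passes together, with toy data, hypothetical layer sizes, and biases omitted for brevity: each gradient-descent step uses a forward pass for predictions and a backward pass for gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Toy data: 4 examples, 2 features, binary labels (hypothetical values).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [0.], [1.], [1.]])

W1 = rng.normal(size=(2, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 1))   # hidden -> output weights

lr = 0.5
costs = []
for _ in range(1000):
    # Forward pass: compute predictions.
    A1 = sigmoid(X @ W1)
    A2 = sigmoid(A1 @ W2)
    costs.append(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

    # Backward pass: compute gradients via the delta recursion.
    D2 = A2 - Y                        # output-layer error
    D1 = (D2 @ W2.T) * A1 * (1 - A1)   # hidden-layer error
    W2 -= lr * A1.T @ D2 / len(X)
    W1 -= lr * X.T @ D1 / len(X)

# The cost should fall as training proceeds.
```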
