Backpropagation Algorithm

Backpropagation is the algorithm used to compute the gradients of the neural network cost function with respect to the parameters, so that gradient descent can minimize the cost and update the model.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


⏪ Backpropagation Algorithm (BP)

Core Insight

Backpropagation is:

Repeated application of the chain rule
Error flowing from output to input
Weighted by connection strengths
Modulated by the activation derivative

In short:

Forward pass computes predictions.

Backward pass computes gradients.

Training flow:

flowchart TD

    A["Forward Pass"]
        --> B["Prediction"]

    B --> C["Loss Calculation"]

    C --> D["Backpropagation"]

    D --> E["Gradient Updates"]

Gradients tell each layer how much, and in which direction, to adjust its weights.

Why Do We Do Backward Propagation?

Backpropagation supplies the gradients that gradient descent needs to minimize the neural network cost function.

Just like gradient descent in linear and logistic regression, our goal is:

\min_\Theta J(\Theta)

That is, we want to find parameters $\Theta$ that minimize the cost function.

Where

J(\Theta) = - \frac{1}{m} \sum_{t=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(t)} \log\big((h_\Theta(x^{(t)}))_k\big) + (1 - y_k^{(t)}) \log\big(1 - (h_\Theta(x^{(t)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l} (\Theta^{(l)}_{i,j})^2

The first term is the cross-entropy loss averaged over the $m$ training examples and summed over the $K$ output units. The second term is the regularization, which skips the bias weights ( $j = 0$ ).
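As a rough illustration of the cost above, the sketch below evaluates $J(\Theta)$ in plain Python for predictions that have already been computed. The function name `cost`, the argument layout (weight matrices stored as lists of rows, with column 0 assumed to hold the bias weights), and the sample numbers are assumptions made for this example, not part of the original derivation.

```python
import math

def cost(predictions, labels, thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    predictions, labels: lists of length m, each entry a list of K values.
    thetas: list of weight matrices (lists of rows); column 0 is assumed
            to hold the bias weights and is left out of the penalty.
    lam: regularization strength lambda.
    """
    m = len(predictions)
    data_term = 0.0
    for h, y in zip(predictions, labels):
        for h_k, y_k in zip(h, y):
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    # Sum of squared non-bias weights over every layer.
    penalty = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * penalty

# Hypothetical data: m = 2 examples, K = 2 output units.
predictions = [[0.9, 0.1], [0.2, 0.8]]
labels = [[1, 0], [0, 1]]
thetas = [[[0.5, 1.0, -1.0], [0.0, 2.0, 0.0]]]  # one 2 x 3 matrix, bias in column 0
J = cost(predictions, labels, thetas, lam=1.0)
print(J)
```

Setting `lam=0.0` leaves only the cross-entropy term, which makes it easy to see how much of the cost comes from the penalty.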

Objective

We want to compute the partial derivatives:

\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)

These derivatives are used in gradient descent to update the parameters.
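For context, a single gradient descent update could look like the sketch below. The learning rate `alpha` and the helper name `gradient_step` are assumptions introduced for illustration; the gradient matrix is a made-up stand-in for the partial derivatives that backpropagation would supply.

```python
def gradient_step(theta, grad, alpha):
    """Return theta - alpha * grad, element by element."""
    return [[w - alpha * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(theta, grad)]

theta = [[0.5, -0.25], [1.0, 0.0]]   # current weights (hypothetical)
grad = [[0.5, 0.25], [-1.0, 0.0]]    # partial derivatives of J (hypothetical)
new_theta = gradient_step(theta, grad, alpha=0.5)
print(new_theta)
```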

How Backpropagation Works

Backpropagation computes errors from right to left.

We start at the output layer:

\delta^{(L)} = a^{(L)} - y

Then propagate backward using:

\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) \;.\!*\; g'(z^{(l)})

For sigmoid activation:

g'(z^{(l)}) = a^{(l)} \;.\!* \; (1 - a^{(l)})

So equivalently:

\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) \;.\!*\; a^{(l)} \;.\!* \; (1 - a^{(l)})
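The sigmoid identity $g'(z) = a \,.\!*\, (1 - a)$ can be sanity-checked numerically. The sketch below compares it with a central finite difference; the helper names and the step size `eps` are choices made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed form g'(z) = a * (1 - a), with a = g(z)."""
    a = sigmoid(z)
    return a * (1.0 - a)

def numeric_derivative(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Gap between the closed form and the numerical estimate at a few points.
gaps = [abs(sigmoid_prime(z) - numeric_derivative(sigmoid, z))
        for z in (-2.0, 0.0, 3.0)]
print(gaps)
```

The gaps should be tiny compared with the derivative values themselves, which peak at 0.25 for $z = 0$.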

❗ Loss Function and Error Term $\delta_j^{(l)}$

A loss function measures how wrong the model’s prediction is compared to the actual value.

It answers one simple question: How far off was the prediction?

Backpropagation tracks this through an error term for every unit. At the output layer it is:

$\delta_j^{(L)} = \text{Predicted} - \text{Actual Value}$

In general, $\delta_j^{(l)}$ represents the error of unit $j$ in layer $l$ .

The loss function converts the output error into a single number the model can optimize.

More formally:

\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \text{cost}(t)

where $\text{cost}(t)$ is the loss on training example $t$ .

So:

$\delta$ is the derivative of the cost with respect to $z$
It measures how much that unit contributed to the error
Larger magnitude → steeper slope → the cost is more sensitive to that unit
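The statement that $\delta$ is the derivative of the cost with respect to $z$ can be checked for a single sigmoid output unit. The sketch below assumes the per-example cost is the cross-entropy loss written with a leading minus sign, so that lower is better, and compares a finite-difference slope with $a - y$. The values of `z`, `y` and `eps` are arbitrary choices for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_cost(z, y):
    """Cross-entropy loss of one sigmoid output unit with pre-activation z."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z, y = 0.7, 1.0
eps = 1e-6
# Slope of the cost with respect to z, estimated by central differences.
numeric_delta = (example_cost(z + eps, y) - example_cost(z - eps, y)) / (2 * eps)
# Closed form used at the output layer: predicted minus actual.
analytic_delta = sigmoid(z) - y
print(numeric_delta, analytic_delta)
```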

Example: House Price

Suppose the actual house price is €500,000 and the model predicts €480,000.

The error at the output is the predicted value minus the actual value:

Error = €480,000 - €500,000 = -€20,000

$\delta^{(L)} = -20000$

The negative sign says the prediction is too low. The optimizer adjusts the parameters to reduce that error.

The training process tries to minimize the loss.
Repeat this thousands of times → model improves.

⚖️ Loss vs Cost Function

❗ Loss → error for one example
💰 Cost → average loss over the dataset

🎢 Backpropagation Gradient Computation

Forward propagation → computes activations.
Backpropagation → computes errors ( $\delta$ values).
Errors are propagated from right to left.
Gradients are accumulated in $\Delta$ .
Regularization is added for non-bias weights.
Finally, we divide by $m$ to obtain the average gradient.

Backpropagation Algorithm

Given a training set of $m$ examples:

\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}

Step 1: 🌱 Initialize Accumulators

Set:

\Delta_{i,j}^{(l)} := 0

for all $l, i, j$ .

This creates matrices of zeros to accumulate gradients.

Step 2: For each training example $t = 1$ to $m$

Backpropagation works per example, and gradients are summed (or averaged) over the dataset.

Example: For two training examples $(x^{(1)}, y^{(1)})$ and $(x^{(2)}, y^{(2)})$

Run forward propagation (FP), then backpropagation (BP), for $(x^{(1)}, y^{(1)})$
Run FP, then BP, for $(x^{(2)}, y^{(2)})$
Finally, average (or sum) the gradients

2.1 ⏩ Forward Propagation

Set:

a^{(1)} := x^{(t)}

Compute forward propagation for:

l = 2, 3, \dots, L

to obtain the activations $a^{(l)}$ of each network layer $l$ :

z^{(l)} = \Theta^{(l-1)} a^{(l-1)}

a^{(l)} = g(z^{(l)})

Or, equivalently, looking forward from layer $l$ :

z^{(l+1)} = \Theta^{(l)} a^{(l)}

a^{(l+1)} = g\left(z^{(l+1)}\right)

Where

$a^{(l)}$ = activations of layer $l$
$z^{(l)}$ = linear combination before activation
$\Theta^{(l)}$ = weight matrix between layer $l$ and $l+1$
$g(\cdot)$ = activation function
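The forward pass described above can be sketched in a few lines of plain Python. This version assumes sigmoid activations and a bias unit $a_0 = 1$ prepended to every layer except the output, with the matching bias weights stored in column 0 of each matrix. The helper names and the tiny 2-2-1 network are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mat_vec(theta, a):
    """Matrix-vector product Theta * a for a list-of-rows matrix."""
    return [sum(w * a_j for w, a_j in zip(row, a)) for row in theta]

def forward(thetas, x):
    """Return (zs, activations); activations[0] is a^(1) with its bias unit."""
    a = [1.0] + list(x)
    activations = [a]
    zs = []
    for idx, theta in enumerate(thetas):
        z = mat_vec(theta, a)              # z^(l+1) = Theta^(l) a^(l)
        a = [sigmoid(z_i) for z_i in z]    # a^(l+1) = g(z^(l+1))
        if idx < len(thetas) - 1:
            a = [1.0] + a                  # hidden layers get a bias unit
        zs.append(z)
        activations.append(a)
    return zs, activations

thetas = [
    [[0.0, 1.0, -1.0], [0.0, 0.5, 0.5]],   # Theta^(1): 2 hidden units, bias + 2 inputs
    [[0.0, 2.0, -2.0]],                    # Theta^(2): 1 output unit, bias + 2 hidden
]
zs, activations = forward(thetas, [1.0, 1.0])
print(activations[-1])
```

Keeping both `zs` and `activations` around is deliberate: the backward pass needs the stored activations to evaluate $g'(z^{(l)})$.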

2.2 ❗Compute Output Layer Error ( $\delta^{(L)}$ )

Using the true label $y^{(t)}$ :

\delta^{(L)} = a^{(L)} - y^{(t)}

This is the error of the output layer.

2.3 ⏪ Backpropagate the Errors

For layers:

l = L-1, L-2, \dots, 2

Compute: $\delta^{(L-1)}, \delta^{(L-2)}, \dots \delta^{(2)}$

\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) \;.\!* \; g'(z^{(l)})

For sigmoid activation:

g'(z^{(l)}) = a^{(l)} \;.\!* \; (1 - a^{(l)})

So equivalently:

\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) \;.\!* \; a^{(l)} \;.\!* \; (1 - a^{(l)})

The operator $.\!*$ denotes element-wise multiplication.
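One backward step for a sigmoid layer might be written as below. To keep the sketch short it assumes there are no bias units; with a bias column in $\Theta^{(l)}$ the first entry of the result would be discarded, since the bias unit has no incoming error. The function name and numbers are illustrative only.

```python
def backprop_delta(theta, delta_next, a):
    """delta^(l) = (Theta^(l)^T delta^(l+1)) .* a^(l) .* (1 - a^(l)).

    theta: rows index units in layer l+1, columns index units in layer l.
    delta_next: error vector of layer l+1.
    a: sigmoid activations of layer l.
    """
    n_cols = len(theta[0])
    # Transposed product: for each unit j in layer l, sum over units i in layer l+1.
    back = [sum(theta[i][j] * delta_next[i] for i in range(len(theta)))
            for j in range(n_cols)]
    # Element-wise multiplication by the sigmoid derivative a * (1 - a).
    return [b * a_j * (1.0 - a_j) for b, a_j in zip(back, a)]

theta = [[1.0, -2.0], [0.5, 4.0]]
delta_next = [0.5, -0.25]
a = [0.5, 0.5]
delta = backprop_delta(theta, delta_next, a)
print(delta)
```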

2.4 📥 Accumulate Gradients

Update:

\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)}+ a_j^{(l)} \delta_i^{(l+1)}

Vectorized form:

\Delta^{(l)} := \Delta^{(l)}+ \delta^{(l+1)} (a^{(l)})^T
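The vectorized update is an outer product added into the accumulator. A minimal sketch, with a hypothetical helper `accumulate` and made-up error and activation vectors for two training examples:

```python
def accumulate(Delta, delta_next, a):
    """In place: Delta += delta_next * a^T (an outer product)."""
    for i, d_i in enumerate(delta_next):
        for j, a_j in enumerate(a):
            Delta[i][j] += d_i * a_j
    return Delta

# Step 1: a 2 x 3 accumulator of zeros.
Delta = [[0.0, 0.0, 0.0] for _ in range(2)]
accumulate(Delta, [0.5, -1.0], [1.0, 2.0, 4.0])   # contribution of example 1
accumulate(Delta, [0.25, 0.5], [1.0, 0.0, 2.0])   # contribution of example 2
print(Delta)
```

Entry `Delta[i][j]` ends up holding the sum over examples of $a_j^{(l)} \delta_i^{(l+1)}$, matching the element-wise form of the update.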

Step 3: 🎢 Compute Gradients

After processing all training examples:

For $j \ne 0$ (non-bias terms):

D_{i,j}^{(l)} = \frac{1}{m} \left( \Delta_{i,j}^{(l)} + \lambda \Theta_{i,j}^{(l)} \right)

For bias terms ( $j = 0$ ):

D_{i,j}^{(l)} = \frac{1}{m} \Delta_{i,j}^{(l)}

Final Result

The gradient of the cost function is:

\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) = D_{i,j}^{(l)}

The matrix $D^{(l)}$ gives the partial derivatives used in gradient descent.
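Putting Steps 1 to 3 together, a compact plain-Python sketch could look like the following. It assumes sigmoid activations throughout, a bias unit $a_0 = 1$ on every layer except the output, and weight matrices stored as lists of rows with the bias weights in column 0. The function names, the 2-2-1 network and the data are invented for the example; a practical implementation would normally use an array library instead of nested lists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Activations per layer; every layer except the output gets a bias unit."""
    activations = [[1.0] + list(x)]
    for idx, theta in enumerate(thetas):
        z = [sum(w * a for w, a in zip(row, activations[-1])) for row in theta]
        a = [sigmoid(v) for v in z]
        activations.append(a if idx == len(thetas) - 1 else [1.0] + a)
    return activations

def backprop(thetas, X, Y, lam):
    m = len(X)
    # Step 1: zero accumulators shaped like the weight matrices.
    Deltas = [[[0.0] * len(theta[0]) for _ in theta] for theta in thetas]
    for x, y in zip(X, Y):
        acts = forward(thetas, x)                           # 2.1 forward pass
        delta = [a - y_k for a, y_k in zip(acts[-1], y)]    # 2.2 output error
        for l in range(len(thetas) - 1, -1, -1):
            a = acts[l]
            for i, d_i in enumerate(delta):                 # 2.4 accumulate
                for j, a_j in enumerate(a):
                    Deltas[l][i][j] += d_i * a_j
            if l > 0:                                       # 2.3 propagate back
                back = [sum(thetas[l][i][j] * delta[i] for i in range(len(delta)))
                        for j in range(len(a))]
                # Drop index 0: the bias unit has no incoming error.
                delta = [b * a_j * (1.0 - a_j) for b, a_j in zip(back[1:], a[1:])]
    # Step 3: average, adding regularization to non-bias columns only.
    grads = []
    for Delta, theta in zip(Deltas, thetas):
        grads.append([[(Delta[i][j] + (lam * theta[i][j] if j > 0 else 0.0)) / m
                       for j in range(len(theta[i]))]
                      for i in range(len(theta))])
    return grads

thetas = [
    [[0.1, 0.3, -0.2], [-0.1, 0.2, 0.4]],   # Theta^(1)
    [[0.2, -0.3, 0.5]],                     # Theta^(2)
]
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[1.0], [1.0], [0.0]]
lam = 0.5
grads = backprop(thetas, X, Y, lam)
print(grads)
```

Each matrix in `grads` has the same shape as the corresponding weight matrix, so it can be plugged straight into a gradient descent update or compared against a finite-difference estimate of the cost.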

Example:

Given one training example $(x, y)$

Layer 1 (Input)

⏩ Forward Propagation

a^{(1)} = x

⏪ Backward Propagation

No error term is associated with the input layer.

Layer 2

⏩ Forward Propagation

z^{(2)} = \Theta^{(1)} a^{(1)}

a^{(2)} = g(z^{(2)})

(Add bias unit if applicable.)

⏪ Backward Propagation

\delta^{(2)} = \left( (\Theta^{(2)})^T \delta^{(3)} \right) \;.\!* \; g'(z^{(2)})

Layer 3

⏩ Forward Propagation

z^{(3)} = \Theta^{(2)} a^{(2)}

a^{(3)} = g(z^{(3)})

⏪ Backward Propagation

\delta^{(3)} = \left( (\Theta^{(3)})^T \delta^{(4)} \right) \;.\!* \; g'(z^{(3)})

Layer 4 (Output)

⏩ Forward Propagation

z^{(4)} = \Theta^{(3)} a^{(3)}

a^{(4)} = h_\Theta(x) = g(z^{(4)})

⏪ Backward Propagation

Output Layer Error = Calculated Value - Actual Value

\delta^{(4)} = a^{(4)} - y

Geometric Interpretation

Think of the network as a graph:

Nodes = neurons
Edges = weights $\Theta_{ij}$
Errors flow backward through edges

To compute $\delta_j^{(l)}$ :

Take all connections going forward from unit $j$
Multiply each weight by the corresponding $\delta$ in the next layer
Sum them up
Multiply the sum by the activation derivative $g'(z_j^{(l)})$

This is simply the chain rule applied repeatedly.

Example:

To compute:

\delta_2^{(2)}

We sum over the next layer, assuming layer 3 has two units:

\delta_2^{(2)} = \left( \Theta_{12}^{(2)} \delta_1^{(3)} + \Theta_{22}^{(2)} \delta_2^{(3)} \right) g'(z_2^{(2)})

Example:

To compute:

\delta_2^{(3)}

We sum contributions from the next layer, which here has a single output unit:

\delta_2^{(3)} = \Theta_{12}^{(3)} \delta_1^{(4)} \, g'(z_2^{(3)})
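The per-unit view can be sketched as a small function that walks the outgoing edges of one unit. It includes the $g'(z_j^{(l)})$ factor from the general rule $\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) .\!* g'(z^{(l)})$ ; passing `1.0` for that factor returns the bare weighted sum. The name `unit_delta`, the storage layout (row `i - 1` for unit $i$ of the next layer, column 0 for the bias) and the numbers are assumptions for the example.

```python
def unit_delta(theta, delta_next, j, g_prime_z_j=1.0):
    """Error of unit j: weighted sum of next-layer errors times g'(z_j).

    theta[i][j] is the weight on the edge from unit j in this layer
    to unit i + 1 in the next layer; column 0 holds the bias weights.
    """
    weighted_sum = sum(theta[i][j] * delta_next[i] for i in range(len(delta_next)))
    return weighted_sum * g_prime_z_j

# Theta^(2) for a next layer with two units: Theta_12 = 2.0 and Theta_22 = 4.0.
theta2 = [[0.0, 1.0, 2.0],
          [0.0, -1.0, 4.0]]
delta3 = [0.5, 0.25]
# With a sigmoid activation of 0.5 at this unit, g'(z) = 0.5 * (1 - 0.5) = 0.25.
delta_2_2 = unit_delta(theta2, delta3, j=2, g_prime_z_j=0.25)
print(delta_2_2)
```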

