⏪ Backpropagation Algorithm (BP)
Core Insight
Backpropagation is:
- Repeated application of the chain rule
- Error flowing from output to input
- Weighted by connection strengths
- Modulated by the activation derivative
In short:
Forward pass computes predictions.
Backward pass computes gradients.
Training flow:
flowchart TD
A["Forward Pass"]
--> B["Prediction"]
B --> C["Loss Calculation"]
C --> D["Backpropagation"]
D --> E["Gradient Updates"]
Gradients tell each layer:
- how much to adjust weights
Why we do Backward Propagation?
Backpropagation is the algorithm used to minimize the neural network cost function.
Just like gradient descent in linear and logistic regression, our goal is:
ΘminJ(Θ)
That is, we want to find parameters Θ that minimize the cost function.
Where
J(Θ)=−m1i=1∑mk=1∑K[yk(i)log((hΘ(x(i)))k)+(1−yk(i))log(1−(hΘ(x(i)))k)]+2mλl=1∑L−1i=1∑slj=1∑sl+1(Θj,i(l))2
J(Θ)=−m1t=1∑mk=1∑K[yk(t)log((hΘ(x(t)))k)+(1−yk(t))log(1−(hΘ(x(t)))k)]+regularization
Objective
We want to compute the partial derivatives:
∂Θi,j(l)∂J(Θ)
These derivatives are used in gradient descent to update the parameters.
How Backpropagation Works
Backpropagation computes errors from right to left.
We start at the output layer:
δ(L)=a(L)−y
Then propagate backward using:
δ(l)=((Θ(l))Tδ(l+1)).∗g′(z(l))
For sigmoid activation:
g′(z(l))=a(l).∗(1−a(l))
So equivalently:
δ(l)=((Θ(l))Tδ(l+1)).∗a(l).∗(1−a(l))
❗ Loss Function δj(l)
A loss function measures how wrong your model’s prediction is compared to the actual value.
It answers one simple question: How far off was the prediction?
Error is represented as:
δj(l)=Predicted−ActualValue
where δj(l) represents the error of unit j in layer l.
Loss function converts error into a number the model can optimize.
More formally:
δj(l)=∂zj(l)∂cost(t)
So:
- δ is the derivative of the cost with respect to z
- It measures how much that unit contributed to the error
- Larger magnitude → steeper slope → more incorrect
Example: House Price
If actual house price = €500,000
The model makes a prediction: Model predicts = €480,000
The loss function calculates the error :
Error = €500,000 -€480,000 = €20,000
So
δj(l)=20000
The optimizer adjusts the parameters to reduce that error.
- The training process tries to minimize this loss.
- Repeat this thousands of times → model improves.
⚖️ Loss vs Cost Function
- ❗ Loss → error for one example
- 💰 Cost → average loss over the dataset
🎢 Backpropagation Gradient Computation
- Forward propagation → computes activations.
- Backpropagation → computes errors (δ values).
- Errors are propagated from right to left.
- Gradients are accumulated in Δ.
- Regularization is added for non-bias weights.
- Finally, we divide by m to obtain the average gradient.
Backpropagation Algorithm
Given m training set:
{(x(1),y(1)),…,(x(m),y(m))}
Step 1: 🌱 Initialize Accumulators
Set:
Δi,j(l):=0
for all l,i,j.
This creates matrices of zeros to accumulate gradients.
Step 2: For each training example t=1 to m
Backpropagation works per example, and gradients are summed (or averaged) over the dataset.
Example: For two training examples
(x(1),y(1)) and (x(2),y(2))
- compute FP for (x(1),y(1)), Compute BP for (x(1),y(1))
- compute FP for (x(2),y(2)), Compute BP for (x(2),y(2))
- Finally Average (or sum) the gradients
2.1 ⏩ Forward Propagation
Set:
a(1):=x(t)
Compute forward propagation for:
l=2,3,…,L
to obtain activations a(l) for any Network layer l:
z(l)=Θ(l−1)a(l−1)
a(l)=g(z(l))
Or when look Forward
z(l+1)=Θ(l)a(l)
a(l+1)=g(z(l+1))
Where
- a(l) = activations of layer l
- z(l) = linear combination before activation
- Θ(l) = weight matrix between layer l and l+1
- g(⋅) = activation function
2.2 ❗Compute Output Layer Error (δ(L))
Using the true label y(t):
δ(L)=a(L)−y(t)
This is the error of the output layer.
2.3 ⏪ Backpropagate the Errors
For layers:
l=L−1,L−2,…,2
Compute: δ(L−1),δ(L−2),…δ(2)
δ(l)=((Θ(l))Tδ(l+1)).∗g′(z(l))
For sigmoid activation:
g′(z(l))=a(l).∗(1−a(l))
So equivalently:
δ(l)=((Θ(l))Tδ(l+1)).∗a(l).∗(1−a(l))
The operator .∗ denotes element-wise multiplication.
2.4 📥 Accumulate Gradients
Update:
Δi,j(l):=Δi,j(l)+aj(l)δi(l+1)
Vectorized form:
Δ(l):=Δ(l)+δ(l+1)(a(l))T
Step 3: 🎢 Compute Gradients
After processing all training examples:
For j=0 (non-bias terms):
Di,j(l)=m1(Δi,j(l)+λΘi,j(l))
For bias terms (j=0):
Di,j(l)=m1Δi,j(l)
Final Result
The gradient of the cost function is:
∂Θi,j(l)∂J(Θ)=Di,j(l)
The matrix D(l) gives the partial derivatives used in gradient descent.
Example:
Given one training example (x,y)
Layer 1 (Input)
⏩ Forward Propagation
a(1)=x
⏪ Backward Propagation
No Error Term Associated with Input Term
Layer 2
⏩ Forward Propagation
z(2)=Θ(1)a(1)
a(2)=g(z(2))
(Add bias unit if applicable.)
⏪ Backward Propagation
δ(2)=(Θ(2)Tδ(3))⊙g′(z(2))
Layer 3
⏩ Forward Propagation
z(3)=Θ(2)a(2)
a(3)=g(z(3))
⏪ Backward Propagation
δ(3)=(Θ(3)Tδ(4))⊙g′(z(3))
Layer 4 (Output)
⏩ Forward Propagation
z(4)=Θ(3)a(3)
a(4)=hΘ(x)=g(z(4))
⏪ Backward Propagation
Output Layer Error = Calculated Value - Actual Value
δ(4)=a(4)−y
Geometric Interpretation
Think of the network as a graph:
- Nodes = neurons
- Edges = weights Θij
- Errors flow backward through edges
To compute δj(l):
- Take all connections going forward from unit j
- Multiply each weight by the corresponding δ
- Sum them up
This is simply the chain rule applied repeatedly.
Example:
To compute:
δ2(2)
We sum over the next layer:
δ2(2)=Θ12(2)δ1(3)+Θ22(2)δ2(3)
Example
To compute:
δ2(3)
We sum contributions from the next layer:
δ2(3)=Θ12(3)δ1(4)