⏪ Backpropagation Algorithm (BP)
Backpropagation is the algorithm used to minimize the neural network cost function.
Just like gradient descent in linear and logistic regression, our goal is:
$$\min_{\Theta} J(\Theta)$$
That is, we want to find parameters Θ that minimize the cost function.
Where
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)}\log\left((h_\Theta(x^{(i)}))_k\right) + \left(1 - y_k^{(i)}\right)\log\left(1 - (h_\Theta(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2$$
Objective
We want to compute the partial derivatives:
$$\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)$$
These derivatives are used in gradient descent to update the parameters.
❗ Loss Function ($\delta_j^{(l)}$)
A loss function measures how wrong your model’s prediction is compared to the actual value.
It answers one simple question:
How far off was the prediction?
If the actual house price = €500,000:
- The model makes a prediction: €480,000
- The loss function calculates the error: €20,000
- The optimizer adjusts the parameters to reduce that error.
The training process tries to minimize this loss. Repeat this thousands of times → the model improves.
Error is represented as:
$$\delta_j^{(l)} = \text{Predicted} - \text{Actual}$$
$$\delta_j^{(l)} = 20{,}000$$
where $\delta_j^{(l)}$ represents the error of unit $j$ in layer $l$.
More formally:
$$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \operatorname{cost}(t)$$
So:
- $\delta$ is the derivative of the cost with respect to $z$
- It measures how much that unit contributed to the error
- Larger magnitude → steeper slope → more incorrect
The loss function converts error into a number the model can optimize.
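To make the house-price example above concrete, here is a minimal sketch (the prices are the illustrative figures from the bullets):

```python
# Minimal sketch of the house-price example: the loss is simply
# how far the prediction is from the actual value.
actual = 500_000      # actual house price in EUR
predicted = 480_000   # the model's prediction in EUR

error = predicted - actual   # signed error: -20,000
loss = abs(error)            # magnitude the optimizer tries to shrink
print(loss)  # 20000
```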
⚖️ Loss vs Cost Function
- Loss → error for one example
- Cost → average loss over the dataset
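A quick sketch of the distinction, using hypothetical per-example losses:

```python
import numpy as np

losses = np.array([20_000.0, 5_000.0, 11_000.0])  # loss for each training example
cost = losses.mean()                              # cost = average loss over the dataset
print(cost)  # 12000.0
```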
Gradient Computation (Backpropagation)
- Forward propagation → computes activations.
- Backpropagation → computes errors (δ values).
- Errors are propagated from right to left.
- Gradients are accumulated in Δ.
- Regularization is added for non-bias weights.
- Finally, we divide by m to obtain the average gradient.
Backpropagation Algorithm
Given a training set of $m$ examples:
$$\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$$
Step 1: 🌱 Initialize Accumulators
Set:
$$\Delta_{i,j}^{(l)} := 0 \quad \text{for all } l, i, j.$$
This creates matrices of zeros to accumulate gradients.
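A minimal NumPy sketch of this step; the layer sizes (3 inputs, 5 hidden units, 1 output) are an assumption for illustration:

```python
import numpy as np

layer_sizes = [3, 5, 1]                  # assumed architecture: 3 -> 5 -> 1
rng = np.random.default_rng(0)

# One weight matrix Theta^(l) per layer transition; +1 column for the bias unit.
Thetas = [rng.standard_normal((layer_sizes[l + 1], layer_sizes[l] + 1))
          for l in range(len(layer_sizes) - 1)]

# Accumulators Delta^(l) start as matrices of zeros with matching shapes.
Deltas = [np.zeros_like(Theta) for Theta in Thetas]
print([D.shape for D in Deltas])  # [(5, 4), (1, 6)]
```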
Step 2: For each training example t=1 to m
Backpropagation works per example, and gradients are summed (or averaged) over the dataset.
Example: For two training examples
$(x^{(1)}, y^{(1)})$ and $(x^{(2)}, y^{(2)})$
- Compute FP and BP for $(x^{(1)}, y^{(1)})$
- Compute FP and BP for $(x^{(2)}, y^{(2)})$
- Finally, average (or sum) the gradients
2.1 ⏩ Forward Propagation
Set:
$$a^{(1)} := x^{(t)}$$
Compute forward propagation for:
$$l = 2, 3, \ldots, L$$
to obtain the activations $a^{(l)}$ for each network layer $l$:
$$z^{(l)} = \Theta^{(l-1)} a^{(l-1)}$$
$$a^{(l)} = g(z^{(l)})$$
Or, written looking forward:
$$z^{(l+1)} = \Theta^{(l)} a^{(l)}$$
$$a^{(l+1)} = g(z^{(l+1)})$$
Where
- $a^{(l)}$ = activations of layer $l$
- $z^{(l)}$ = linear combination of inputs before the activation
- $\Theta^{(l)}$ = weight matrix mapping layer $l$ to layer $l+1$
- $g(\cdot)$ = activation function
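One forward step of these equations as a NumPy sketch, assuming a sigmoid activation and an assumed layer width (3 inputs + bias feeding 5 hidden units):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((5, 4))   # Theta^(1): layer 1 (3 units + bias) -> layer 2
x = rng.standard_normal(3)             # one training example

a1 = np.concatenate(([1.0], x))        # a^(1) = x, with the bias unit prepended
z2 = Theta1 @ a1                       # z^(2) = Theta^(1) a^(1)
a2 = sigmoid(z2)                       # a^(2) = g(z^(2))
print(a2.shape)  # (5,)
```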
2.2 ❗ Compute Output Layer Error ($\delta^{(L)}$)
Using the true label $y^{(t)}$:
$$\delta^{(L)} = a^{(L)} - y^{(t)}$$
This is the error of the output layer.
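For example, with an assumed 3-unit output layer and a one-hot label:

```python
import numpy as np

a_L = np.array([0.9, 0.2, 0.1])   # assumed network outputs a^(L)
y = np.array([1.0, 0.0, 0.0])     # one-hot true label y^(t)

delta_L = a_L - y                 # output-layer error, element-wise
print(delta_L)                    # approximately [-0.1, 0.2, 0.1]
```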
2.3 ⏪ Backpropagate the Errors $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
For layers:
$$l = L-1, L-2, \ldots, 2$$
Compute:
$$\delta^{(l)} = \left((\Theta^{(l)})^T \delta^{(l+1)}\right) .*\; g'(z^{(l)})$$
For sigmoid activation:
$$g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})$$
So equivalently:
$$\delta^{(l)} = \left((\Theta^{(l)})^T \delta^{(l+1)}\right) .* a^{(l)} .* (1 - a^{(l)})$$
The operator $.*$ denotes element-wise multiplication.
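One backward step as a NumPy sketch. The shapes are assumptions (4 units + bias in layer $l$, 3 units in layer $l+1$); the bias component of the propagated error is dropped before stepping further back, a common convention:

```python
import numpy as np

rng = np.random.default_rng(1)
Theta_l = rng.standard_normal((3, 5))   # Theta^(l): 3 units in layer l+1, 4 units + bias in layer l
delta_next = rng.standard_normal(3)     # delta^(l+1)
a_l = rng.uniform(size=5)               # a^(l), bias activation included

# delta^(l) = ((Theta^(l))^T delta^(l+1)) .* a^(l) .* (1 - a^(l))
delta_l = (Theta_l.T @ delta_next) * a_l * (1.0 - a_l)
delta_l = delta_l[1:]                   # discard the bias component
print(delta_l.shape)  # (4,)
```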
2.4 📥 Accumulate Gradients
Update:
$$\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$$
Vectorized form:
$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$$
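In NumPy the vectorized update is an outer product; shapes here are assumptions (3 units in layer $l+1$, 4 units + bias in layer $l$):

```python
import numpy as np

delta_next = np.array([0.1, -0.2, 0.05])     # delta^(l+1), shape (3,)
a_l = np.array([1.0, 0.3, 0.7, 0.9, 0.2])    # a^(l) with bias unit, shape (5,)

Delta = np.zeros((3, 5))                     # accumulator Delta^(l)
Delta += np.outer(delta_next, a_l)           # Delta^(l) += delta^(l+1) (a^(l))^T
print(Delta.shape)  # (3, 5)
```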
Step 3: 🎢 Compute Gradients
After processing all training examples:
For $j \neq 0$ (non-bias terms):
$$D_{i,j}^{(l)} = \frac{1}{m}\left(\Delta_{i,j}^{(l)} + \lambda \Theta_{i,j}^{(l)}\right)$$
For bias terms ($j = 0$):
$$D_{i,j}^{(l)} = \frac{1}{m}\Delta_{i,j}^{(l)}$$
Final Result
The gradient of the cost function is:
$$\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) = D_{i,j}^{(l)}$$
The matrix $D^{(l)}$ gives the partial derivatives used in gradient descent.
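Putting Steps 1–3 together, here is a compact sketch for a network with one hidden layer. The architecture, data, and $\lambda$ value are assumptions for illustration; only the non-bias columns ($j \neq 0$) are regularized:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, Y, lam):
    """Gradients D^(1), D^(2) of the regularized cost for a 3-layer network."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)   # Step 1: accumulators start at zero
    Delta2 = np.zeros_like(Theta2)
    for t in range(m):               # Step 2: loop over training examples
        # Step 2.1: forward propagation (bias units prepended)
        a1 = np.concatenate(([1.0], X[t]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)
        # Step 2.2: output-layer error
        d3 = a3 - Y[t]
        # Step 2.3: backpropagate the error (drop the bias component)
        d2 = (Theta2.T @ d3)[1:] * sigmoid(z2) * (1.0 - sigmoid(z2))
        # Step 2.4: accumulate gradients
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # Step 3: average, regularizing only the non-bias columns (j != 0)
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2

rng = np.random.default_rng(42)
Theta1 = 0.1 * rng.standard_normal((5, 4))   # 3 inputs + bias -> 5 hidden units
Theta2 = 0.1 * rng.standard_normal((1, 6))   # 5 hidden + bias -> 1 output
X = rng.standard_normal((10, 3))
Y = rng.integers(0, 2, size=(10, 1)).astype(float)

D1, D2 = backprop_gradients(Theta1, Theta2, X, Y, lam=0.01)
print(D1.shape, D2.shape)  # (5, 4) (1, 6)
```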
Example:
Given one training example $(x, y)$:
Layer 1 (Input)
⏩ Forward Propagation
$$a^{(1)} = x$$
⏪ Backward Propagation
No error term is associated with the input layer.
Layer 2
⏩ Forward Propagation
$$z^{(2)} = \Theta^{(1)} a^{(1)}$$
$$a^{(2)} = g(z^{(2)})$$
(Add bias unit if applicable.)
⏪ Backward Propagation
$$\delta^{(2)} = \left((\Theta^{(2)})^T \delta^{(3)}\right) \odot g'(z^{(2)})$$
Layer 3
⏩ Forward Propagation
$$z^{(3)} = \Theta^{(2)} a^{(2)}$$
$$a^{(3)} = g(z^{(3)})$$
⏪ Backward Propagation
$$\delta^{(3)} = \left((\Theta^{(3)})^T \delta^{(4)}\right) \odot g'(z^{(3)})$$
Layer 4 (Output)
⏩ Forward Propagation
$$z^{(4)} = \Theta^{(3)} a^{(3)}$$
$$a^{(4)} = h_\Theta(x) = g(z^{(4)})$$
⏪ Backward Propagation
Output-layer error = predicted value − actual value:
$$\delta^{(4)} = a^{(4)} - y$$
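The four-layer walkthrough above can be traced in code for a single $(x, y)$; the layer sizes ($2 \to 3 \to 3 \to 1$) and the inputs are assumptions:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
Theta1 = 0.1 * rng.standard_normal((3, 3))   # layer 1 (2 units + bias) -> layer 2
Theta2 = 0.1 * rng.standard_normal((3, 4))   # layer 2 (3 units + bias) -> layer 3
Theta3 = 0.1 * rng.standard_normal((1, 4))   # layer 3 (3 units + bias) -> layer 4
x = np.array([0.5, -1.2])
y = np.array([1.0])

# Forward propagation, layer by layer
a1 = np.concatenate(([1.0], x))              # a^(1) = x, bias prepended
z2 = Theta1 @ a1; a2 = np.concatenate(([1.0], sigmoid(z2)))
z3 = Theta2 @ a2; a3 = np.concatenate(([1.0], sigmoid(z3)))
z4 = Theta3 @ a3; a4 = sigmoid(z4)           # a^(4) = h_Theta(x)

# Backward propagation (no delta for the input layer)
d4 = a4 - y                                                   # delta^(4)
d3 = (Theta3.T @ d4)[1:] * sigmoid(z3) * (1.0 - sigmoid(z3))  # delta^(3)
d2 = (Theta2.T @ d3)[1:] * sigmoid(z2) * (1.0 - sigmoid(z2))  # delta^(2)
print(d4.shape, d3.shape, d2.shape)  # (1,) (3,) (3,)
```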