Gradient Checking and Random Initialization
Gradient checking is a technique to verify the correctness of your backpropagation implementation. Random initialization is crucial for breaking symmetry and allowing the network to learn effectively.
🎢 Gradient Checking
Gradient checking is used to verify that backpropagation is implemented correctly.
It works by numerically approximating the derivative of the cost function.
Idea (Single Parameter Case)
Suppose $\theta$ is a real number and we want to compute the derivative $\frac{dJ(\theta)}{d\theta}$.
One-sided difference: $\frac{J(\theta + \epsilon) - J(\theta)}{\epsilon}$
Two-sided difference (preferred): $\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
The two-sided version is more accurate: its error shrinks as $O(\epsilon^2)$ rather than $O(\epsilon)$. So we approximate the derivative using a two-sided difference:
$\frac{dJ(\theta)}{d\theta} \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
where typically:
$\epsilon = 10^{-4}$
This works because as $\epsilon \to 0$, the approximation converges to the true derivative.
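To see the accuracy difference concretely, here is a small Python sketch (the function $J(\theta) = \theta^3$ and the evaluation point are illustrative choices) comparing both differences against the exact derivative:

```python
# Compare one-sided vs two-sided difference on J(theta) = theta**3,
# whose exact derivative at theta = 2 is 3 * 2**2 = 12.
def J(theta):
    return theta ** 3

theta, eps = 2.0, 1e-4
one_sided = (J(theta + eps) - J(theta)) / eps
two_sided = (J(theta + eps) - J(theta - eps)) / (2 * eps)

exact = 3 * theta ** 2
print(abs(one_sided - exact))   # error on the order of eps   (~6e-4)
print(abs(two_sided - exact))   # error on the order of eps^2 (~1e-8)
```

The two-sided error is several orders of magnitude smaller for the same $\epsilon$, which is exactly why it is preferred for gradient checking.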
Numerical Approximation of the Derivative
For a single parameter $\theta$, the derivative can be approximated as:
$\frac{dJ(\theta)}{d\theta} \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
This is called the central difference approximation.
Choosing Epsilon
A small value such as:
$\epsilon = 10^{-4}$
works well in practice.
Important notes:
- If $\epsilon$ is too large → poor approximation (large truncation error)
- If $\epsilon$ is too small → numerical precision (round-off) problems
- $\epsilon = 10^{-4}$ is typically a good balance
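This trade-off can be sketched in Python (the cubic cost function here is just an illustrative choice):

```python
# Error of the central difference for J(theta) = theta**3 at theta = 3,
# where the exact derivative is 27, across three choices of epsilon.
def J(theta):
    return theta ** 3

def approx_grad(theta, eps):
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

errors = {eps: abs(approx_grad(3.0, eps) - 27.0)
          for eps in (1e-1, 1e-4, 1e-13)}
print(errors)   # 1e-4 wins: 1e-1 suffers truncation error, 1e-13 round-off error
```

With $\epsilon = 0.1$ the truncation error dominates; with $\epsilon = 10^{-13}$ the subtraction of two nearly equal cost values loses precision; $\epsilon = 10^{-4}$ sits comfortably between the two failure modes.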
Extension to Multiple Parameters
When we have multiple parameters $\theta_1, \theta_2, \ldots, \theta_n$, we approximate the partial derivative with respect to $\theta_i$ as:
$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta_1, \ldots, \theta_i + \epsilon, \ldots, \theta_n) - J(\theta_1, \ldots, \theta_i - \epsilon, \ldots, \theta_n)}{2\epsilon}$
This means:
- Slightly increase one parameter
- Slightly decrease the same parameter
- Measure how the cost changes
- Use that to approximate the gradient
Algorithm (Octave / MATLAB Style)
epsilon = 1e-4;
gradApprox = zeros(n, 1);                  % preallocate the approximate gradient
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;   % perturb only component i upward
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon; % perturb only component i downward
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
This computes the approximate gradient vector.
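For readers who prefer Python, roughly the same loop can be sketched with NumPy (the quadratic cost used in the example is an illustrative choice):

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-4):
    """Central-difference approximation of the gradient of J at theta."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon          # perturb only component i upward
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon         # perturb only component i downward
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    return grad_approx

# Example: J(theta) = sum(theta^2) has gradient 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))
```

Note that each component requires two full cost evaluations, which is why this is only practical as a debugging check.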
How to Use Gradient Checking
Let:
- $D$ = gradient from backpropagation (often called deltaVector)
- gradApprox = gradient from numerical checking
Then we compare:
$\text{gradApprox} \approx D$
If they match up to a few decimal places, the implementation is likely correct.
A common comparison metric:
$\frac{\lVert \text{gradApprox} - D \rVert_2}{\lVert \text{gradApprox} \rVert_2 + \lVert D \rVert_2}$
This value should be very small (e.g., less than $10^{-7}$).
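A sketch of this comparison in Python (NumPy); the toy gradient vectors below are illustrative:

```python
import numpy as np

def relative_difference(grad_backprop, grad_approx):
    # ||a - b|| / (||a|| + ||b||): a scale-invariant distance between gradients
    num = np.linalg.norm(grad_backprop - grad_approx)
    den = np.linalg.norm(grad_backprop) + np.linalg.norm(grad_approx)
    return num / den

# Matching gradients give a tiny value; a deliberate sign bug stands out.
good = relative_difference(np.array([1.0, 2.0]), np.array([1.0, 2.0 + 1e-9]))
bad = relative_difference(np.array([1.0, 2.0]), np.array([1.0, -2.0]))
print(good, bad)
```

Normalizing by the sum of the norms keeps the metric meaningful whether the gradients themselves are large or small.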
Important: Disable After Checking
Use gradient checking only for debugging.
- Once backpropagation is verified, disable gradient checking.
Gradient checking is:
- Very slow
- Computationally expensive
- Not meant to be used during actual training
Backpropagation is much more efficient: it computes every partial derivative in a single backward pass, whereas the numerical approximation requires two full cost-function evaluations per parameter.
So the correct workflow is:
- Implement backprop → compute $D$ (deltaVector)
- Implement gradient checking → compute gradApprox
- Verify they match
- Disable gradient checking
- Train normally using backprop
Summary
Gradient checking:
- Provides a numerical way to verify gradients
- Uses central difference approximation
- Should only be used for debugging
- Confirms correctness of backpropagation implementation
🎲 Random Initialization
The Symmetry Problem
Initializing all weights to zero does not work for neural networks:
$\Theta^{(l)}_{ij} = 0$ for all $i, j, l$
If all weights are initialized to zero:
- All neurons in a layer compute the same value.
- During backpropagation, they receive identical gradients.
- They continue updating identically.
- The network fails to learn different features.
To break symmetry, we must initialize weights randomly.
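The symmetry problem can be demonstrated with a tiny Python/NumPy sketch; the 2-3-1 architecture, squared-error loss, and learning rate are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-3-1 network with every weight initialized to zero.
W1 = np.zeros((3, 2)); W2 = np.zeros((1, 3))
x = np.array([[1.0], [2.0]]); y = np.array([[1.0]])

for _ in range(50):                   # a few gradient-descent steps
    a1 = sigmoid(W1 @ x)              # all hidden units compute the same value
    a2 = sigmoid(W2 @ a1)
    d2 = (a2 - y) * a2 * (1 - a2)     # squared-error loss, for illustration
    d1 = (W2.T @ d2) * a1 * (1 - a1)  # identical gradient for every hidden unit
    W2 -= 0.5 * (d2 @ a1.T)
    W1 -= 0.5 * (d1 @ x.T)

print(W1)   # all three rows are still identical: symmetry was never broken
```

Even after many updates, every row of W1 remains equal to every other row, so the three hidden units stay redundant copies of one another.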
Random Initialization Strategy
Initialize each parameter $\Theta^{(l)}_{ij}$ to a random value in the range $[-\epsilon_{\text{init}}, \epsilon_{\text{init}}]$.
That means:
$-\epsilon_{\text{init}} \le \Theta^{(l)}_{ij} \le \epsilon_{\text{init}}$
To ensure weights fall within this range we calculate:
$\Theta^{(l)} = \text{rand}(\ldots) \cdot 2\epsilon_{\text{init}} - \epsilon_{\text{init}}$
Where:
- rand generates values uniformly in the range $[0, 1]$
- Multiplying by $2\epsilon_{\text{init}}$ scales to $[0, 2\epsilon_{\text{init}}]$
- Subtracting $\epsilon_{\text{init}}$ shifts to $[-\epsilon_{\text{init}}, \epsilon_{\text{init}}]$
Example Code (Octave / MATLAB)
If:
- Theta1 is a 10 × 11 matrix
- Theta2 is a 10 × 11 matrix
- Theta3 is a 1 × 11 matrix
Then:
INIT_EPSILON = 0.12; % example value
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
- rand(x,y) generates an x × y matrix of random numbers in [0, 1].
- The epsilon used here is NOT related to the gradient-checking epsilon.
- This random initialization is required for proper learning.
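An equivalent initialization in Python (NumPy), mirroring the Octave shapes above:

```python
import numpy as np

INIT_EPSILON = 0.12   # example value, as in the Octave snippet

def rand_init(rows, cols, eps=INIT_EPSILON):
    # np.random.rand is uniform on [0, 1); scale to [0, 2*eps), shift to [-eps, eps)
    return np.random.rand(rows, cols) * (2 * eps) - eps

Theta1 = rand_init(10, 11)
Theta2 = rand_init(10, 11)
Theta3 = rand_init(1, 11)

print(Theta1.min(), Theta1.max())   # both land inside [-0.12, 0.12)
```

Because every entry is drawn independently, no two rows of a weight matrix start out identical, which is precisely what breaks the symmetry.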
Why This Works
Random initialization:
- Breaks symmetry between neurons
- Allows different neurons to learn different features
- Enables backpropagation to function correctly
- Is essential for training deep neural networks
Without random initialization, the network cannot learn effectively.
