Gradient Checking and Random Initialization
Gradient checking is a technique to verify the correctness of your backpropagation implementation. Random initialization is crucial for breaking symmetry and allowing the network to learn effectively.
Gradient Checking
Gradient checking is used to verify that backpropagation is implemented correctly.
It works by numerically approximating the derivative of the cost function.
Numerical Approximation of the Derivative
For a single parameter \(\theta\), the derivative of the cost function can be approximated as:

\[ \frac{d}{d\theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon} \]

This is called the central difference approximation.
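As a quick sketch of the idea (using a toy cost \(J(\theta) = \theta^2\), whose exact derivative is \(2\theta\) — the function names here are illustrative, not from the source):

```python
# Central difference approximation for a single parameter.
# Toy cost J(theta) = theta^2, so the exact derivative is 2*theta.
def J(theta):
    return theta ** 2

def central_difference(J, theta, eps=1e-4):
    # (J(theta + eps) - J(theta - eps)) / (2 * eps)
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

approx = central_difference(J, 3.0)
print(approx)               # very close to the exact derivative 6.0
print(abs(approx - 6.0))    # tiny floating-point error
```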
Extension to Multiple Parameters
When we have multiple parameters \(\theta_1, \theta_2, \dots, \theta_n\), we approximate the partial derivative with respect to \(\theta_i\) as:

\[ \frac{\partial}{\partial \theta_i} J(\Theta) \approx \frac{J(\theta_1, \dots, \theta_i + \epsilon, \dots, \theta_n) - J(\theta_1, \dots, \theta_i - \epsilon, \dots, \theta_n)}{2\epsilon} \]
This means:
- Slightly increase one parameter
- Slightly decrease the same parameter
- Measure how the cost changes
- Use that to approximate the gradient
Choosing Epsilon
A small value such as \(\epsilon = 10^{-4}\) works well in practice.
Important notes:
- If \(\epsilon\) is too large → poor approximation
- If \(\epsilon\) is too small → numerical precision problems
- \(\epsilon = 10^{-4}\) is typically a good balance
Algorithm (Octave / MATLAB Style)
epsilon = 1e-4;
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;    % perturb parameter i upward
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;  % perturb parameter i downward
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
This computes the approximate gradient vector.
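The same loop can be sketched in Python (a minimal port of the Octave code above, assuming `theta` is a list of parameters and `J` a cost function of the full vector):

```python
def grad_approx(J, theta, epsilon=1e-4):
    """Numerically approximate the gradient of J at theta,
    perturbing one component at a time (mirrors the Octave loop)."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad

# Example: J(theta) = sum of squares, so the exact gradient is 2*theta.
J = lambda t: sum(x ** 2 for x in t)
theta = [1.0, -2.0, 0.5]
print(grad_approx(J, theta))  # close to [2.0, -4.0, 1.0]
```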
How to Use Gradient Checking
From backpropagation, we compute:
- The gradient vector (often called deltaVector)
Then we compare it with the numerical approximation:
gradApprox ≈ deltaVector
If the two agree to several decimal places, backpropagation is likely correct.
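One common way to make "match closely" concrete is a relative difference between the two vectors (a sketch; the 1e-7 rule of thumb is a typical convention, not from the source):

```python
import math

def relative_difference(grad_approx, delta_vector):
    """Norm of the difference divided by the norm of the sum,
    a standard way to compare two gradient vectors."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(grad_approx, delta_vector)))
    den = math.sqrt(sum((a + b) ** 2 for a, b in zip(grad_approx, delta_vector)))
    return num / den if den != 0 else 0.0

# Values on the order of 1e-7 or smaller suggest backprop is correct.
print(relative_difference([2.0001, -3.9999], [2.0, -4.0]))
```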
Important Practical Advice
- Use gradient checking only for debugging.
- Once backpropagation is verified, disable gradient checking.
- Gradient approximation is computationally expensive.
- It should NOT be used during actual training.
Summary
Gradient checking:
- Provides a numerical way to verify gradients
- Uses central difference approximation
- Should only be used for debugging
- Confirms correctness of backpropagation implementation
Random Initialization
Initializing all parameters to zero does not work for neural networks.
If all weights are initialized to zero:
- All neurons in a layer compute the same value.
- During backpropagation, they receive identical gradients.
- They continue updating identically.
- The network fails to learn different features.
This is called the symmetry problem.
To break symmetry, we must initialize weights randomly.
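The symmetry problem can be seen in a tiny sketch (a toy two-hidden-unit sigmoid network with made-up numbers, purely illustrative): when both hidden units start with identical weights, they produce identical activations and receive identical gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two hidden units initialized with the SAME weights (here: all zeros).
w1 = [0.0, 0.0]        # weights of hidden unit 1
w2 = [0.0, 0.0]        # weights of hidden unit 2
v = [0.5, 0.5]         # identical output-layer weights
x = [1.0, 2.0]         # one input example
y = 1.0                # target

a1 = sigmoid(sum(wi * xi for wi, xi in zip(w1, x)))
a2 = sigmoid(sum(wi * xi for wi, xi in zip(w2, x)))
print(a1 == a2)        # True: identical activations

out = sigmoid(v[0] * a1 + v[1] * a2)
delta_out = out - y
# Backprop gradients for each hidden unit's weights:
g1 = [delta_out * v[0] * a1 * (1 - a1) * xi for xi in x]
g2 = [delta_out * v[1] * a2 * (1 - a2) * xi for xi in x]
print(g1 == g2)        # True: identical updates, so symmetry is never broken
```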
Random Initialization Strategy
Each parameter \(\Theta^{(l)}_{ij}\) is initialized to a random value in the range \([-\epsilon, \epsilon]\).
How to Generate Random Values
To ensure weights fall within this range:

Theta = rand(rows, cols) * (2 * \(\epsilon\)) - \(\epsilon\)

Where:
- rand generates values uniformly in the range \([0, 1]\)
- Multiplying by \(2\epsilon\) scales to \([0, 2\epsilon]\)
- Subtracting \(\epsilon\) shifts to \([-\epsilon, \epsilon]\)
Example Code (Octave / MATLAB)
If:
- Theta1 is 10 × 11
- Theta2 is 10 × 11
- Theta3 is 1 × 11
Then:
INIT_EPSILON = 0.12; % example value
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
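The same initialization can be sketched in Python (the function name `rand_init` is illustrative; it mirrors the Octave expression above):

```python
import random

def rand_init(rows, cols, init_epsilon=0.12):
    """Matrix of values drawn uniformly from [-init_epsilon, init_epsilon]."""
    return [[random.random() * (2 * init_epsilon) - init_epsilon
             for _ in range(cols)] for _ in range(rows)]

Theta1 = rand_init(10, 11)
Theta2 = rand_init(10, 11)
Theta3 = rand_init(1, 11)

# Every entry falls in [-0.12, 0.12]:
print(all(-0.12 <= v <= 0.12 for row in Theta1 for v in row))  # True
```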
Important Notes
- rand(x,y) generates an x × y matrix of random numbers in \([0, 1]\).
- The epsilon used here is NOT related to the gradient checking epsilon.
- This random initialization is required for proper learning.
Why This Works
Random initialization:
- Breaks symmetry between neurons
- Allows different neurons to learn different features
- Enables backpropagation to function correctly
- Is essential for training deep neural networks
Without random initialization, the network cannot learn effectively.
