
Gradient Checking and Random Initialization

Gradient checking is a technique to verify the correctness of your backpropagation implementation. Random initialization is crucial for breaking symmetry and allowing the network to learn effectively.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Gradient Checking

Gradient checking is used to verify that backpropagation is implemented correctly.

It works by numerically approximating the derivative of the cost function.


Numerical Approximation of the Derivative

For a single parameter $\Theta$, the derivative can be approximated as:

$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

This is called the central difference approximation.
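As a quick illustration (a minimal Python sketch, not from the original post; the toy cost $J(\Theta) = \Theta^2$ is an assumption), the central difference should land very close to the analytic derivative $2\Theta$:

```python
# Central difference approximation of dJ/dTheta for a toy cost J(theta) = theta^2.
def J(theta):
    return theta ** 2

def central_difference(J, theta, epsilon=1e-4):
    # Nudge theta up and down by epsilon and measure the change in cost.
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)

theta = 3.0
approx = central_difference(J, theta)
print(approx)   # should be very close to the true derivative 2 * theta = 6.0
```

For a quadratic cost the central difference is exact up to floating-point noise, which is why the agreement is so tight here.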


Extension to Multiple Parameters

When we have multiple parameters, we approximate the derivative with respect to $\Theta_j$ as:

$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

This means:

  • Slightly increase one parameter
  • Slightly decrease the same parameter
  • Measure how the cost changes
  • Use that to approximate the gradient
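The per-parameter recipe above can be sketched in plain Python (an illustrative sketch, not from the original post; the sum-of-squares cost and the names are assumptions):

```python
# Approximate the full gradient by perturbing one parameter at a time.
def J(theta):
    # Toy cost: sum of squares, whose true gradient component j is 2 * theta[j].
    return sum(t ** 2 for t in theta)

def grad_approx(J, theta, epsilon=1e-4):
    grads = []
    for j in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[j] += epsilon        # slightly increase one parameter
        theta_minus = list(theta)
        theta_minus[j] -= epsilon       # slightly decrease the same parameter
        # Measure how the cost changes and use it to approximate the gradient.
        grads.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grads

print(grad_approx(J, [1.0, -2.0, 0.5]))   # close to [2.0, -4.0, 1.0]
```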

Choosing Epsilon

A small value such as:

$$\epsilon = 10^{-4}$$

works well in practice.

Important notes:

  • If $\epsilon$ is too large → poor approximation
  • If $\epsilon$ is too small → numerical precision problems
  • $\epsilon = 10^{-4}$ is typically a good balance
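This tradeoff is easy to see numerically (a Python sketch; the cubic toy cost is an assumption, chosen because its approximation error is easy to predict):

```python
# For J(theta) = theta^3, the central difference at theta = 1 equals
# 3 + epsilon^2, so the truncation error shrinks as epsilon shrinks --
# until floating-point round-off takes over for very tiny epsilon.
def J(theta):
    return theta ** 3          # true derivative at theta = 1 is 3

def central_difference(J, theta, epsilon):
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)

for epsilon in (1e-1, 1e-4, 1e-13):
    error = abs(central_difference(J, 1.0, epsilon) - 3.0)
    print(epsilon, error)
# Large epsilon -> truncation error dominates; tiny epsilon -> round-off
# dominates; 1e-4 sits near the sweet spot.
```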

Algorithm (Octave / MATLAB Style)

epsilon = 1e-4;
gradApprox = zeros(n, 1);   % preallocate the approximate gradient

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;    % nudge parameter i up

  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;  % nudge parameter i down

  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

This computes the approximate gradient vector.


How to Use Gradient Checking

From backpropagation, we compute:

  • The gradient vector (often called deltaVector)

Then we compare:

$$\text{gradApprox} \approx \text{deltaVector}$$

If they match closely, backpropagation is likely correct.
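One common way to quantify "match closely" is a normalized relative difference between the two gradient vectors (a Python sketch; the analytic gradient here stands in for backprop's deltaVector, and the 1e-7 threshold is a widely used rule of thumb, not something fixed by the post):

```python
# Compare a numerical gradient against an analytic one; a relative
# difference around 1e-7 or smaller usually indicates correct backprop.
import math

def J(theta):
    return sum(t ** 2 for t in theta)

def analytic_grad(theta):
    return [2 * t for t in theta]      # stands in for backprop's deltaVector

def numeric_grad(J, theta, epsilon=1e-4):
    grads = []
    for j in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[j] += epsilon
        minus[j] -= epsilon
        grads.append((J(plus) - J(minus)) / (2 * epsilon))
    return grads

def relative_difference(a, b):
    # ||a - b|| / (||a|| + ||b||): scale-invariant comparison of vectors.
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    den = math.sqrt(sum(x ** 2 for x in a)) + math.sqrt(sum(y ** 2 for y in b))
    return num / den

theta = [1.0, -2.0, 0.5]
diff = relative_difference(numeric_grad(J, theta), analytic_grad(theta))
print(diff)   # very small -> the two gradients agree
```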


Important Practical Advice

  • Use gradient checking only for debugging.
  • Once backpropagation is verified, disable gradient checking.
  • Gradient approximation is computationally expensive.
  • It should NOT be used during actual training.

Summary

Gradient checking:

  • Provides a numerical way to verify gradients
  • Uses central difference approximation
  • Should only be used for debugging
  • Confirms correctness of backpropagation implementation

Random Initialization

Initializing all parameters to zero does not work for neural networks.

If all weights are initialized to zero:

  • All neurons in a layer compute the same value.
  • During backpropagation, they receive identical gradients.
  • They continue updating identically.
  • The network fails to learn different features.

This is called the symmetry problem.

To break symmetry, we must initialize weights randomly.
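The symmetry problem is visible even in a two-unit toy example (a Python sketch, not from the original post; the input values are arbitrary assumptions): units with identical incoming weights compute identical activations for every input, so backpropagation hands them identical gradients and they can never diverge.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two hidden units whose incoming weights are identical (here: all zero).
w1 = [0.0, 0.0]   # weights of hidden unit 1
w2 = [0.0, 0.0]   # weights of hidden unit 2 -- same as unit 1

x = [0.7, -1.3]   # arbitrary input

a1 = sigmoid(sum(w * xi for w, xi in zip(w1, x)))
a2 = sigmoid(sum(w * xi for w, xi in zip(w2, x)))
print(a1 == a2)   # True: identical weights -> identical activations,
                  # so both units receive identical gradient updates
```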


Random Initialization Strategy

Each parameter:

$$\Theta_{i,j}^{(l)}$$

is initialized to a random value in the range:

$$[-\epsilon, \epsilon]$$

How to Generate Random Values

To ensure weights fall within this range:

$$\Theta = \text{rand} \times (2\epsilon) - \epsilon$$

Where:

  • rand generates values uniformly in the range $[0, 1]$
  • Multiplying by $2\epsilon$ scales to $[0, 2\epsilon]$
  • Subtracting $\epsilon$ shifts to $[-\epsilon, \epsilon]$
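The same scale-and-shift trick can be sketched in Python (illustrative; the function name and the 0.12 value are assumptions mirroring a common choice):

```python
import random

INIT_EPSILON = 0.12   # example value

def rand_uniform_init(rows, cols, eps=INIT_EPSILON):
    # random.random() is uniform in [0, 1); multiplying by 2*eps scales
    # to [0, 2*eps), and subtracting eps shifts to [-eps, eps).
    return [[random.random() * (2 * eps) - eps for _ in range(cols)]
            for _ in range(rows)]

Theta1 = rand_uniform_init(10, 11)
print(all(-INIT_EPSILON <= w <= INIT_EPSILON for row in Theta1 for w in row))
```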

Example Code (Octave / MATLAB)

If:

  • Theta1 is 10 × 11
  • Theta2 is 10 × 11
  • Theta3 is 1 × 11

Then:

INIT_EPSILON = 0.12;  % example value

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Important Notes

  • rand(x,y) generates an x × y matrix of random numbers in [0,1].
  • The epsilon used here is NOT related to gradient checking epsilon.
  • This random initialization is required for proper learning.

Why This Works

Random initialization:

  • Breaks symmetry between neurons
  • Allows different neurons to learn different features
  • Enables backpropagation to function correctly
  • Is essential for training deep neural networks

Without random initialization, the network cannot learn effectively.
