

Gradient Checking and Random Initialization

Gradient checking is a technique to verify the correctness of your backpropagation implementation. Random initialization is crucial for breaking symmetry and allowing the network to learn effectively.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


🎢 Gradient Checking

Gradient checking is used to verify that backpropagation is implemented correctly.

It works by numerically approximating the derivative of the cost function.

Idea (Single Parameter Case)

Suppose $\theta$ is a real number and we want to compute:

$$\frac{d}{d\theta} J(\theta)$$

One-sided difference:

$$\frac{J(\theta + \epsilon) - J(\theta)}{\epsilon}$$

Two-sided difference (preferred):

$$\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$

The two-sided version is more accurate, so we approximate the derivative using the two-sided difference:

$$\frac{d}{d\theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$

where typically:

$$\epsilon \approx 10^{-4}$$

This works because as $\epsilon \to 0$:

$$\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon} \to \frac{d}{d\theta} J(\theta)$$

Numerical Approximation of the Derivative

For a single parameter $\Theta$, the derivative can be approximated as:

$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

This is called the central difference approximation.

Choosing Epsilon $\epsilon$

A small value such as:

$$\epsilon = 10^{-4}$$

works well in practice.

Important notes:

  • If $\epsilon$ is too large → poor approximation
  • If $\epsilon$ is too small → numerical (floating-point round-off) precision problems
  • $10^{-4}$ is typically a good balance
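The accuracy gap between the two formulas is easy to see numerically. This pure-Python sketch (not from the original post) uses $J(\theta) = \theta^3$, whose true derivative at $\theta = 1$ is $3$; the one-sided error shrinks like $\epsilon$, the two-sided error like $\epsilon^2$:

```python
# Compare one-sided vs two-sided difference for J(theta) = theta^3.
def J(theta):
    return theta ** 3

theta, eps = 1.0, 1e-4
true_grad = 3.0  # d/dtheta theta^3 = 3*theta^2 = 3 at theta = 1

one_sided = (J(theta + eps) - J(theta)) / eps
two_sided = (J(theta + eps) - J(theta - eps)) / (2 * eps)

print(abs(one_sided - true_grad))  # error on the order of eps (~3e-4)
print(abs(two_sided - true_grad))  # error on the order of eps^2 (~1e-8)
```

With $\epsilon = 10^{-4}$ the two-sided error is roughly four orders of magnitude smaller, which is why the central difference is preferred.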

Extension to Multiple Parameters

When we have multiple parameters, we approximate the partial derivative with respect to $\Theta_j$ as:

$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

This means:

  • Slightly increase one parameter
  • Slightly decrease the same parameter
  • Measure how the cost changes
  • Use that to approximate the gradient

Algorithm (Octave / MATLAB Style)

epsilon = 1e-4;
n = numel(theta);           % number of parameters
gradApprox = zeros(n, 1);

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;    % nudge one parameter up

  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;  % nudge the same parameter down

  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

This computes the approximate gradient vector.
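For readers not using Octave, the same loop can be sketched in pure Python; the names `grad_approx` and the sum-of-squares cost are my own, chosen to mirror the snippet above:

```python
# Numerical gradient via central differences, one parameter at a time.
def grad_approx(J, theta, epsilon=1e-4):
    approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)      # copy, then perturb parameter i up
        theta_plus[i] += epsilon
        theta_minus = list(theta)     # copy, then perturb parameter i down
        theta_minus[i] -= epsilon
        approx.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return approx

# Example cost: J(theta) = sum of squares, so dJ/dtheta_i = 2 * theta_i.
J = lambda t: sum(x * x for x in t)
print(grad_approx(J, [1.0, -2.0, 0.5]))  # ≈ [2.0, -4.0, 1.0]
```

Note that each of the `n` entries costs two full evaluations of `J`, which is exactly why this is too slow for training.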


How to Use Gradient Checking

Let:

  • $\delta$ = gradient from backpropagation (often called deltaVector)
  • gradApprox = gradient from numerical checking

Then we compare:

$$\text{gradApprox} \approx \text{deltaVector}$$

If they match up to a few decimal places, the implementation is likely correct.

A common comparison metric:

$$\frac{\| \text{DVec} - \text{gradApprox} \|}{\| \text{DVec} + \text{gradApprox} \|}$$

This value should be very small (e.g., $< 10^{-7}$).
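This metric can be computed in a few lines of plain Python; the helper names (`norm`, `relative_difference`) and the sample vectors are mine, for illustration:

```python
from math import sqrt

def norm(v):
    # Euclidean (L2) norm of a vector given as a list of floats.
    return sqrt(sum(x * x for x in v))

def relative_difference(delta_vector, grad_approx):
    # ||DVec - gradApprox|| / ||DVec + gradApprox||
    diff = [a - b for a, b in zip(delta_vector, grad_approx)]
    summ = [a + b for a, b in zip(delta_vector, grad_approx)]
    return norm(diff) / norm(summ)

# A nearly identical pair passes the 1e-7 threshold:
print(relative_difference([2.0, -4.0, 1.0], [2.0000001, -4.0, 1.0]))
```

Dividing by the norm of the sum makes the metric scale-invariant, so the same $10^{-7}$ threshold works whether the gradients are tiny or huge.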

Important: Disable After Checking

Use gradient checking only for debugging.

  • Once backpropagation is verified, disable gradient checking.

Gradient checking is:

  • Very slow
  • Computationally expensive
  • Not suitable for use during actual training

Backpropagation is much more efficient, computing all the error terms in a single backward pass:

$$\delta^{(4)}, \delta^{(3)}, \delta^{(2)}, \dots$$

So the correct workflow is:

  1. Implement backprop → compute $\text{DVec}$
  2. Implement gradient checking → compute gradApprox
  3. Verify they match
  4. Disable gradient checking
  5. Train normally using backprop

Summary

Gradient checking:

  • Provides a numerical way to verify gradients
  • Uses central difference approximation
  • Should only be used for debugging
  • Confirms correctness of backpropagation implementation

🎲 Random Initialization

The symmetry problem

Initializing all the theta weights to zero does not work for neural networks:

$$\Theta_{i,j}^{(l)} = 0 \quad \text{for all } i, j, l$$

If all weights are initialized to zero:

  • All neurons in a layer compute the same value.
  • During backpropagation, they receive identical gradients.
  • They continue updating identically.
  • The network fails to learn different features.

To break symmetry, we must initialize weights randomly.
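A toy illustration of the symmetry problem (the 2-input, 2-hidden-unit sigmoid layer here is my own example, not from the post): with all-zero weights both hidden units compute identical activations for any input, so they will also receive identical gradients; random weights break the tie.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activations(W, x):
    # W is a 2x2 weight matrix (one row per hidden unit), x an input pair.
    return [sigmoid(W[i][0] * x[0] + W[i][1] * x[1]) for i in range(2)]

x = [0.3, -1.2]

zero_W = [[0.0, 0.0], [0.0, 0.0]]
a = hidden_activations(zero_W, x)
print(a[0] == a[1])   # True: the two units are interchangeable

random.seed(0)
eps = 0.12
rand_W = [[random.uniform(-eps, eps) for _ in range(2)] for _ in range(2)]
b = hidden_activations(rand_W, x)
print(b[0] != b[1])   # True: symmetry is broken
```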

Random Initialization Strategy

Initialize each parameter to a random value $R$ in the range $[-\epsilon, \epsilon]$:

$$\Theta_{i,j}^{(l)} = R \in [-\epsilon, \epsilon]$$

That means:

$$-\epsilon \le \Theta_{i,j}^{(l)} \le \epsilon$$

To generate weights in this range we compute:

$$\Theta = \text{rand} \times 2\epsilon - \epsilon$$

Where:

  • rand generates values uniformly in the range $[0, 1]$
  • Multiplying by $2\epsilon$ scales to $[0, 2\epsilon]$
  • Subtracting $\epsilon$ shifts to $[-\epsilon, \epsilon]$

Example Code (Octave / MATLAB)

If:

  • Theta1 is a 10 × 11 matrix
  • Theta2 is a 10 × 11 matrix
  • Theta3 is a 1 × 11 matrix

Then:

INIT_EPSILON = 0.12;  % example value

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

  • rand(x,y) generates an x × y matrix of random numbers in [0,1].
  • The epsilon used here is NOT related to the gradient checking epsilon.
  • This random initialization is required for proper learning.
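The same scale-and-shift trick translates directly to Python; `random.random()` plays the role of rand, returning uniform values in $[0, 1)$, and `rand_init` is a helper name of my own:

```python
import random

INIT_EPSILON = 0.12  # example value, as above

def rand_init(rows, cols, eps=INIT_EPSILON):
    # Scale [0, 1) to [0, 2*eps), then shift to [-eps, eps).
    return [[random.random() * 2 * eps - eps for _ in range(cols)]
            for _ in range(rows)]

Theta1 = rand_init(10, 11)
Theta2 = rand_init(10, 11)
Theta3 = rand_init(1, 11)
```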

Why This Works

Random initialization:

  • Breaks symmetry between neurons
  • Allows different neurons to learn different features
  • Enables backpropagation to function correctly
  • Is essential for training deep neural networks

Without random initialization, the network cannot learn effectively.


AI-DeepLearning/9-Gradient-Check