
Gradient Checking and Random Initialization

Gradient checking is a technique to verify the correctness of your backpropagation implementation. Random initialization is crucial for breaking symmetry and allowing the network to learn effectively.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Gradient Checking

Gradient checking is used to verify that backpropagation is implemented correctly.

It works by numerically approximating the derivative of the cost function.


Numerical Approximation of the Derivative

For a single parameter $\Theta$, the derivative can be approximated as:

$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

This is called the central difference approximation.
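As a quick illustration (a minimal Python sketch, not from the original post; the toy cost $J(\Theta) = \Theta^2$ is an assumption), the central difference should land very close to the analytic derivative $2\Theta$:

```python
# Central difference approximation of dJ/dTheta for a toy cost J(theta) = theta^2.
def J(theta):
    return theta ** 2

def central_difference(J, theta, epsilon=1e-4):
    # Nudge theta up and down by epsilon and measure the change in cost.
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)

theta = 3.0
approx = central_difference(J, theta)
print(approx)   # should be very close to the true derivative 2 * theta = 6.0
```

For a quadratic cost the central difference is exact up to floating-point noise, which is why the agreement is so tight here.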


Extension to Multiple Parameters

When we have multiple parameters, we approximate the derivative with respect to $\Theta_j$ as:

$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

This means:

  • Slightly increase one parameter
  • Slightly decrease the same parameter
  • Measure how the cost changes
  • Use that to approximate the gradient
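The per-parameter recipe above can be sketched in plain Python (an illustrative sketch, not from the original post; the sum-of-squares cost and the names are assumptions):

```python
# Approximate the full gradient by perturbing one parameter at a time.
def J(theta):
    # Toy cost: sum of squares, whose true gradient component j is 2 * theta[j].
    return sum(t ** 2 for t in theta)

def grad_approx(J, theta, epsilon=1e-4):
    grads = []
    for j in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[j] += epsilon        # slightly increase one parameter
        theta_minus = list(theta)
        theta_minus[j] -= epsilon       # slightly decrease the same parameter
        # Measure how the cost changes and use it to approximate the gradient.
        grads.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grads

print(grad_approx(J, [1.0, -2.0, 0.5]))   # close to [2.0, -4.0, 1.0]
```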

Choosing Epsilon

A small value such as:

$$\epsilon = 10^{-4}$$

works well in practice.

Important notes:

  • If $\epsilon$ is too large → poor approximation
  • If $\epsilon$ is too small → numerical precision problems
  • $\epsilon = 10^{-4}$ is typically a good balance
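This tradeoff is easy to see numerically (a Python sketch; the cubic toy cost is an assumption, chosen because its approximation error is easy to predict):

```python
# For J(theta) = theta^3, the central difference at theta = 1 equals
# 3 + epsilon^2, so the truncation error shrinks as epsilon shrinks --
# until floating-point round-off takes over for very tiny epsilon.
def J(theta):
    return theta ** 3          # true derivative at theta = 1 is 3

def central_difference(J, theta, epsilon):
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)

for epsilon in (1e-1, 1e-4, 1e-13):
    error = abs(central_difference(J, 1.0, epsilon) - 3.0)
    print(epsilon, error)
# Large epsilon -> truncation error dominates; tiny epsilon -> round-off
# dominates; 1e-4 sits near the sweet spot.
```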

Algorithm (Octave / MATLAB Style)

epsilon = 1e-4;
gradApprox = zeros(n, 1);   % preallocate the approximate gradient

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;    % nudge parameter i up

  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;  % nudge parameter i down

  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

This computes the approximate gradient vector.


How to Use Gradient Checking

From backpropagation, we compute:

  • The gradient vector (often called deltaVector)

Then we compare:

$$\text{gradApprox} \approx \text{deltaVector}$$

If they match closely, backpropagation is likely correct.
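One common way to quantify "match closely" is a normalized relative difference between the two gradient vectors (a Python sketch; the analytic gradient here stands in for backprop's deltaVector, and the 1e-7 threshold is a widely used rule of thumb, not something fixed by the post):

```python
# Compare a numerical gradient against an analytic one; a relative
# difference around 1e-7 or smaller usually indicates correct backprop.
import math

def J(theta):
    return sum(t ** 2 for t in theta)

def analytic_grad(theta):
    return [2 * t for t in theta]      # stands in for backprop's deltaVector

def numeric_grad(J, theta, epsilon=1e-4):
    grads = []
    for j in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[j] += epsilon
        minus[j] -= epsilon
        grads.append((J(plus) - J(minus)) / (2 * epsilon))
    return grads

def relative_difference(a, b):
    # ||a - b|| / (||a|| + ||b||): scale-invariant comparison of vectors.
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    den = math.sqrt(sum(x ** 2 for x in a)) + math.sqrt(sum(y ** 2 for y in b))
    return num / den

theta = [1.0, -2.0, 0.5]
diff = relative_difference(numeric_grad(J, theta), analytic_grad(theta))
print(diff)   # very small -> the two gradients agree
```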


Important Practical Advice

  • Use gradient checking only for debugging.
  • Once backpropagation is verified, disable gradient checking.
  • Gradient approximation is computationally expensive.
  • It should NOT be used during actual training.

Summary

Gradient checking:

  • Provides a numerical way to verify gradients
  • Uses central difference approximation
  • Should only be used for debugging
  • Confirms correctness of backpropagation implementation

Random Initialization

Initializing all parameters to zero does not work for neural networks.

If all weights are initialized to zero:

  • All neurons in a layer compute the same value.
  • During backpropagation, they receive identical gradients.
  • They continue updating identically.
  • The network fails to learn different features.

This is called the symmetry problem.

To break symmetry, we must initialize weights randomly.
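The symmetry problem is visible even in a two-unit toy example (a Python sketch, not from the original post; the input values are arbitrary assumptions): units with identical incoming weights compute identical activations for every input, so backpropagation hands them identical gradients and they can never diverge.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two hidden units whose incoming weights are identical (here: all zero).
w1 = [0.0, 0.0]   # weights of hidden unit 1
w2 = [0.0, 0.0]   # weights of hidden unit 2 -- same as unit 1

x = [0.7, -1.3]   # arbitrary input

a1 = sigmoid(sum(w * xi for w, xi in zip(w1, x)))
a2 = sigmoid(sum(w * xi for w, xi in zip(w2, x)))
print(a1 == a2)   # True: identical weights -> identical activations,
                  # so both units receive identical gradient updates
```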


Random Initialization Strategy

Each parameter:

$$\Theta_{i,j}^{(l)}$$

is initialized to a random value in the range:

$$[-\epsilon, \epsilon]$$

How to Generate Random Values

To ensure weights fall within this range:

$$\Theta = \text{rand} \times (2\epsilon) - \epsilon$$

Where:

  • rand generates values uniformly in the range $[0, 1]$
  • Multiplying by $2\epsilon$ scales to $[0, 2\epsilon]$
  • Subtracting $\epsilon$ shifts to $[-\epsilon, \epsilon]$
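The same scale-and-shift trick can be sketched in Python (illustrative; the function name and the 0.12 value are assumptions mirroring a common choice):

```python
import random

INIT_EPSILON = 0.12   # example value

def rand_uniform_init(rows, cols, eps=INIT_EPSILON):
    # random.random() is uniform in [0, 1); multiplying by 2*eps scales
    # to [0, 2*eps), and subtracting eps shifts to [-eps, eps).
    return [[random.random() * (2 * eps) - eps for _ in range(cols)]
            for _ in range(rows)]

Theta1 = rand_uniform_init(10, 11)
print(all(-INIT_EPSILON <= w <= INIT_EPSILON for row in Theta1 for w in row))
```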

Example Code (Octave / MATLAB)

If:

  • Theta1 is 10 × 11
  • Theta2 is 10 × 11
  • Theta3 is 1 × 11

Then:

INIT_EPSILON = 0.12;  % example value

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Important Notes

  • rand(x,y) generates an x × y matrix of random numbers in [0,1].
  • The epsilon used here is NOT related to gradient checking epsilon.
  • This random initialization is required for proper learning.

Why This Works

Random initialization:

  • Breaks symmetry between neurons
  • Allows different neurons to learn different features
  • Enables backpropagation to function correctly
  • Is essential for training deep neural networks

Without random initialization, the network cannot learn effectively.
