

Gradient Checking and Random Initialization

Gradient checking is a technique to verify the correctness of your backpropagation implementation. Random initialization is crucial for breaking symmetry and allowing the network to learn effectively.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


🎢 Gradient Checking

Gradient checking is used to verify that backpropagation is implemented correctly.

It works by numerically approximating the derivative of the cost function.

Idea (Single Parameter Case)

Suppose $\theta$ is a real number and we want to compute:

$$\frac{d}{d\theta} J(\theta)$$

One-sided difference:

$$\frac{J(\theta + \epsilon) - J(\theta)}{\epsilon}$$

Two-sided difference (preferred):

$$\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$

The two-sided version is more accurate, so we approximate the derivative using the two-sided difference:

$$\frac{d}{d\theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$

where typically:

$$\epsilon \approx 10^{-4}$$

This works because as $\epsilon \to 0$:

$$\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon} \to \frac{d}{d\theta} J(\theta)$$

Numerical Approximation of the Derivative

For a single parameter $\Theta$, the derivative can be approximated as:

$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

This is called the central difference approximation.

Choosing Epsilon $\epsilon$

A small value such as:

$$\epsilon = 10^{-4}$$

works well in practice.

Important notes:

  • If $\epsilon$ is too large → poor approximation
  • If $\epsilon$ is too small → numerical (floating-point round-off) precision problems
  • $10^{-4}$ is typically a good balance
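The accuracy gap between the two formulas is easy to see numerically. This pure-Python sketch (not from the original post) uses $J(\theta) = \theta^3$, whose true derivative at $\theta = 1$ is $3$; the one-sided error shrinks like $\epsilon$, the two-sided error like $\epsilon^2$:

```python
# Compare one-sided vs two-sided difference for J(theta) = theta^3.
def J(theta):
    return theta ** 3

theta, eps = 1.0, 1e-4
true_grad = 3.0  # d/dtheta theta^3 = 3*theta^2 = 3 at theta = 1

one_sided = (J(theta + eps) - J(theta)) / eps
two_sided = (J(theta + eps) - J(theta - eps)) / (2 * eps)

print(abs(one_sided - true_grad))  # error on the order of eps (~3e-4)
print(abs(two_sided - true_grad))  # error on the order of eps^2 (~1e-8)
```

With $\epsilon = 10^{-4}$ the two-sided error is roughly four orders of magnitude smaller, which is why the central difference is preferred.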

Extension to Multiple Parameters

When we have multiple parameters, we approximate the partial derivative with respect to $\Theta_j$ as:

$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

This means:

  • Slightly increase one parameter
  • Slightly decrease the same parameter
  • Measure how the cost changes
  • Use that to approximate the gradient

Algorithm (Octave / MATLAB Style)

epsilon = 1e-4;
n = numel(theta);           % number of parameters
gradApprox = zeros(n, 1);

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;    % nudge one parameter up

  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;  % nudge the same parameter down

  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

This computes the approximate gradient vector.
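For readers not using Octave, the same loop can be sketched in pure Python; the names `grad_approx` and the sum-of-squares cost are my own, chosen to mirror the snippet above:

```python
# Numerical gradient via central differences, one parameter at a time.
def grad_approx(J, theta, epsilon=1e-4):
    approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)      # copy, then perturb parameter i up
        theta_plus[i] += epsilon
        theta_minus = list(theta)     # copy, then perturb parameter i down
        theta_minus[i] -= epsilon
        approx.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return approx

# Example cost: J(theta) = sum of squares, so dJ/dtheta_i = 2 * theta_i.
J = lambda t: sum(x * x for x in t)
print(grad_approx(J, [1.0, -2.0, 0.5]))  # ≈ [2.0, -4.0, 1.0]
```

Note that each of the `n` entries costs two full evaluations of `J`, which is exactly why this is too slow for training.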


How to Use Gradient Checking

Let:

  • $\delta$ = gradient from backpropagation (often called deltaVector)
  • gradApprox = gradient from numerical checking

Then we compare:

$$\text{gradApprox} \approx \text{deltaVector}$$

If they match up to a few decimal places, the implementation is likely correct.

A common comparison metric:

$$\frac{\| \text{DVec} - \text{gradApprox} \|}{\| \text{DVec} + \text{gradApprox} \|}$$

This value should be very small (e.g., $< 10^{-7}$).
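This metric can be computed in a few lines of plain Python; the helper names (`norm`, `relative_difference`) and the sample vectors are mine, for illustration:

```python
from math import sqrt

def norm(v):
    # Euclidean (L2) norm of a vector given as a list of floats.
    return sqrt(sum(x * x for x in v))

def relative_difference(delta_vector, grad_approx):
    # ||DVec - gradApprox|| / ||DVec + gradApprox||
    diff = [a - b for a, b in zip(delta_vector, grad_approx)]
    summ = [a + b for a, b in zip(delta_vector, grad_approx)]
    return norm(diff) / norm(summ)

# A nearly identical pair passes the 1e-7 threshold:
print(relative_difference([2.0, -4.0, 1.0], [2.0000001, -4.0, 1.0]))
```

Dividing by the norm of the sum makes the metric scale-invariant, so the same $10^{-7}$ threshold works whether the gradients are tiny or huge.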

Important: Disable After Checking

Use gradient checking only for debugging.

  • Once backpropagation is verified, disable gradient checking.

Gradient checking is:

  • Very slow
  • Computationally expensive
  • Not suitable for use during actual training

Backpropagation is much more efficient, computing all the error terms in a single backward pass:

$$\delta^{(4)}, \delta^{(3)}, \delta^{(2)}, \dots$$

So the correct workflow is:

  1. Implement backprop → compute $\text{DVec}$
  2. Implement gradient checking → compute gradApprox
  3. Verify they match
  4. Disable gradient checking
  5. Train normally using backprop

Summary

Gradient checking:

  • Provides a numerical way to verify gradients
  • Uses central difference approximation
  • Should only be used for debugging
  • Confirms correctness of backpropagation implementation

🎲 Random Initialization

The symmetry problem

Initializing all the theta weights to zero does not work for neural networks:

$$\Theta_{i,j}^{(l)} = 0 \quad \text{for all } i, j, l$$

If all weights are initialized to zero:

  • All neurons in a layer compute the same value.
  • During backpropagation, they receive identical gradients.
  • They continue updating identically.
  • The network fails to learn different features.

To break symmetry, we must initialize weights randomly.
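A toy illustration of the symmetry problem (the 2-input, 2-hidden-unit sigmoid layer here is my own example, not from the post): with all-zero weights both hidden units compute identical activations for any input, so they will also receive identical gradients; random weights break the tie.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activations(W, x):
    # W is a 2x2 weight matrix (one row per hidden unit), x an input pair.
    return [sigmoid(W[i][0] * x[0] + W[i][1] * x[1]) for i in range(2)]

x = [0.3, -1.2]

zero_W = [[0.0, 0.0], [0.0, 0.0]]
a = hidden_activations(zero_W, x)
print(a[0] == a[1])   # True: the two units are interchangeable

random.seed(0)
eps = 0.12
rand_W = [[random.uniform(-eps, eps) for _ in range(2)] for _ in range(2)]
b = hidden_activations(rand_W, x)
print(b[0] != b[1])   # True: symmetry is broken
```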

Random Initialization Strategy

Initialize each parameter to a random value $R$ in the range $[-\epsilon, \epsilon]$:

$$\Theta_{i,j}^{(l)} = R \in [-\epsilon, \epsilon]$$

That means:

$$-\epsilon \le \Theta_{i,j}^{(l)} \le \epsilon$$

To generate weights in this range we compute:

$$\Theta = \text{rand} \times 2\epsilon - \epsilon$$

Where:

  • rand generates values uniformly in the range $[0, 1]$
  • Multiplying by $2\epsilon$ scales to $[0, 2\epsilon]$
  • Subtracting $\epsilon$ shifts to $[-\epsilon, \epsilon]$

Example Code (Octave / MATLAB)

If:

  • Theta1 is a 10 × 11 matrix
  • Theta2 is a 10 × 11 matrix
  • Theta3 is a 1 × 11 matrix

Then:

INIT_EPSILON = 0.12;  % example value

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

  • rand(x,y) generates an x × y matrix of random numbers in [0,1].
  • The epsilon used here is NOT related to the gradient checking epsilon.
  • This random initialization is required for proper learning.
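The same scale-and-shift trick translates directly to Python; `random.random()` plays the role of rand, returning uniform values in $[0, 1)$, and `rand_init` is a helper name of my own:

```python
import random

INIT_EPSILON = 0.12  # example value, as above

def rand_init(rows, cols, eps=INIT_EPSILON):
    # Scale [0, 1) to [0, 2*eps), then shift to [-eps, eps).
    return [[random.random() * 2 * eps - eps for _ in range(cols)]
            for _ in range(rows)]

Theta1 = rand_init(10, 11)
Theta2 = rand_init(10, 11)
Theta3 = rand_init(1, 11)
```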

Why This Works

Random initialization:

  • Breaks symmetry between neurons
  • Allows different neurons to learn different features
  • Enables backpropagation to function correctly
  • Is essential for training deep neural networks

Without random initialization, the network cannot learn effectively.


AI-DeepLearning/9-Gradient-Check