
Stochastic Gradient Descent (SGD): Efficient Optimization for Large Datasets

Understand how Stochastic Gradient Descent works and why it is widely used in large-scale machine learning. Learn how SGD updates model parameters using one training example at a time to improve computational efficiency and scalability.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an optimization algorithm used to train machine learning models efficiently on very large datasets.

Instead of computing gradients using the entire dataset, SGD updates the parameters using one training example at a time.

| Method | Gradient computation per update |
| --- | --- |
| Batch Gradient Descent | Uses all training examples |
| Stochastic Gradient Descent | Uses one example |
| Mini-batch Gradient Descent | Uses a small batch |

Think of learning to throw a ball into a basket. 🏀

1. Batch Gradient Descent

The big, slow way

  • Throw 100 balls.
  • After throwing all 100, you look at every throw.

Then you adjust your aim a little.

Then again:

  • Throw another 100 balls
  • Look at all of them
  • Adjust again

This works, but it is very slow because you wait until you see all the throws before learning.

Mathematically

  • Uses all training examples for each update
  • Each step requires summing over m examples

Batch Gradient Descent computes gradients over all training examples:

Update rule:

\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}

In pseudocode: theta := theta - alpha * gradient

  • Computes the exact gradient for each update
  • Very slow for large datasets

If m = 100,000,000, then every update requires processing 100 million examples, which is computationally expensive.

Cost

With batch gradient descent, the cost decreases smoothly.
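As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent for linear regression. The dataset, learning rate, and epoch count are illustrative assumptions, not from this post; the gradient is divided by m so that α does not depend on dataset size.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, epochs=200):
    """One parameter update per full pass over all m examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Exact gradient summed over ALL m examples (the expensive part),
        # averaged over m so alpha does not scale with dataset size.
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

# Illustrative data: y = 2x, with a bias column of ones
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2.0 * np.arange(5.0)
theta = batch_gradient_descent(X, y)  # approaches [0, 2]
```

Because every update sees the full dataset, the cost curve this produces is smooth, but each of the 200 updates pays for all m examples.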

---

2. Stochastic Gradient Descent

The faster way to learn

  • Before beginning the main loop of stochastic gradient descent, it is a good idea to "shuffle" your training data into a random order.

Idea

  • You throw one ball.
  • If it misses, you adjust your aim immediately.

Then you throw another ball.

  • Miss again?
  • Adjust again.

Mathematically

Instead of summing over all examples, update parameters after each training example.

For each example i:

\theta_j := \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}

Each step uses only one training example.

Stochastic Gradient Descent:

  • Uses noisy approximation
  • Much faster updates
  • Works well for large scale learning

Cost Function

With SGD, the cost may oscillate, but overall it trends downward.

The algorithm jumps around the minimum instead of moving smoothly.

  • Instead of going smoothly to the perfect aim, your learning path looks a little zig-zag.

SGD Algorithm

Training set:

(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})

Algorithm:

  1. Randomly shuffle the training set

  2. For each epoch (one pass through the dataset), for i = 1 to m, update every parameter: \theta_j := \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}

  3. Repeat for multiple passes through the data

Each pass through the dataset is called an epoch.
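The three steps above can be sketched in NumPy as follows. The dataset, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=100, seed=0):
    """Stochastic gradient descent: update theta after EACH training example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):              # step 3: repeat for multiple passes
        for i in rng.permutation(m):     # step 1: randomly shuffle the training set
            # step 2: one noisy update per example i
            error = X[i] @ theta - y[i]  # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]
    return theta

# Illustrative data on an exact line y = 3 + 4x
x = np.linspace(0.0, 1.0, 50)
X = np.c_[np.ones(50), x]
y = 3.0 + 4.0 * x
theta = sgd(X, y)  # wanders toward [3, 4]
```

Note that the inner loop makes m updates per epoch, one per example, which is why the parameters start improving long before the first pass finishes.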

Advantages of SGD

  • Much faster for large datasets
  • Works well with online learning
  • Requires less memory
  • Can start improving model immediately

Disadvantages

  • Updates are noisy
  • Convergence is less stable
  • Requires careful learning rate tuning

3. Mini-Batch Gradient Descent

A compromise between the two approaches.

| Method | Data used per update |
| --- | --- |
| Batch Gradient Descent | All training examples at a time |
| Stochastic Gradient Descent (SGD) | 1 example at a time |
| Mini-Batch Gradient Descent | A batch of b examples at a time |

Use small batches, for example b = 100. With 10,000 training examples, you then get 100 updates per epoch instead of 1.

Learning becomes much faster.

batch 1 (100 examples) → update
batch 2 (100 examples) → update
batch 3 (100 examples) → update
... 
batch n (100 examples) → update

Mini-batch GD is the most common method in deep learning today.

  • Faster than Batch Gradient Descent
  • More stable than SGD
  • Works well with GPUs
  • Efficient for large datasets

Mathematically

Update rule:

\theta := \theta - \alpha \frac{1}{b} \sum_{k=i}^{i+b-1} \nabla_\theta J\left(\theta; x^{(k)}, y^{(k)}\right)

Where:

  • b = mini-batch size
  • α = learning rate
  • h_θ(x) = hypothesis function
  • y = true value
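A NumPy sketch of the mini-batch loop with b = 100; the dataset and learning rate are illustrative assumptions:

```python
import numpy as np

def minibatch_gd(X, y, b=100, alpha=0.1, epochs=100, seed=0):
    """One update per batch of b examples, averaging the gradient over the batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)           # reshuffle each epoch
        for start in range(0, m, b):         # batch 1, batch 2, ... -> one update each
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)  # the 1/b average in the update rule
            theta -= alpha * grad
    return theta

# 1,000 examples with b = 100 -> 10 updates per epoch
x = np.linspace(0.0, 1.0, 1000)
X = np.c_[np.ones(1000), x]
y = 1.0 + 2.0 * x
theta = minibatch_gd(X, y)  # approaches [1, 2]
```

The batched matrix products (`Xb @ theta`, `Xb.T @ ...`) are exactly the kind of work GPUs parallelize well, which is one reason this variant dominates deep learning practice.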

SGD enables training models on massive datasets with millions or billions of examples.
