Stochastic Gradient Descent (SGD): Efficient Optimization for Large Datasets
Understand how Stochastic Gradient Descent works and why it is widely used in large-scale machine learning. Learn how SGD updates model parameters using one training example at a time to improve computational efficiency and scalability.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is an optimization algorithm used to train machine learning models efficiently on very large datasets.
Instead of computing gradients using the entire dataset, SGD updates the parameters using one training example at a time.
| Method | Gradient Computation |
|---|---|
| Batch Gradient Descent | Uses all training examples |
| Stochastic Gradient Descent | Uses one example |
| Mini-batch Gradient Descent | Uses small batch |
An analogy: learning to throw a ball into a basket. 🏀
1. Batch Gradient Descent
The Big, Slow Way

- Throw 100 balls.
- After throwing all 100, look at every throw.
- Then adjust your aim a little.

Then again:

- Throw another 100 balls.
- Look at all of them.
- Adjust again.

This works, but it is very slow, because you wait until you have seen all the throws before learning.
Mathematically

- Uses all training examples for each update
- Each step requires summing over m examples

Batch Gradient Descent computes the gradient of the cost over all training examples:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Update rule:

$$\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

- Very slow for large datasets
- Computes the exact gradient for each update

If $m = 100{,}000{,}000$, then every update requires processing 100 million examples, which is computationally expensive.
Cost
With batch gradient descent, the cost decreases smoothly.
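The batch update above can be sketched in a few lines of NumPy. This is a minimal illustration for linear regression (the dataset, learning rate, and epoch count are all made up for the example, not from the original text):

```python
import numpy as np

# Hypothetical setup: linear regression h_theta(x) = theta @ x with a
# squared-error cost, on synthetic data.
rng = np.random.default_rng(0)
m = 200
X = np.c_[np.ones(m), rng.normal(size=m)]       # bias column + one feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

def batch_gradient_descent(X, y, alpha=0.1, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # One update per epoch: the gradient sums over ALL m examples.
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

theta = batch_gradient_descent(X, y)
```

Note that each parameter update touches every example, which is exactly why the cost curve is smooth but each step is expensive.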
---
2. Stochastic Gradient Descent
A Faster Way to Learn
- Before beginning the main loop of stochastic gradient descent, it is a good idea to "shuffle" your training data into a random order.
Idea
- You throw one ball.
- If it misses, you adjust your aim immediately.
- Then you throw another ball.
- Miss again? Adjust again.
Mathematically

Instead of summing over all examples, update the parameters after each training example.

The cost of a single example is:

$$\text{cost}\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

For each example $i$:

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Each step uses only one training example.
Stochastic Gradient Descent:
- Uses noisy approximation
- Much faster updates
- Works well for large scale learning
Cost Function
With SGD, the cost may oscillate, but overall it trends downward.
The algorithm jumps around the minimum instead of moving smoothly.
- Instead of going smoothly to the perfect aim, your learning path looks a little zig-zag.
SGD Algorithm

Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$

Algorithm:

1. Randomly shuffle the training set.
2. For i = 1 to m, update each parameter: $\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
3. Repeat for multiple passes through the data.

Each pass through the dataset is called an epoch.
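The shuffle-then-update loop can be sketched directly in NumPy. This is an illustrative linear-regression example (data, learning rate, and epoch count are invented for the sketch):

```python
import numpy as np

# Synthetic linear-regression data (illustrative, not from the original).
rng = np.random.default_rng(0)
m = 500
X = np.c_[np.ones(m), rng.normal(size=m)]
true_theta = np.array([1.0, 2.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

def sgd(X, y, alpha=0.01, epochs=5):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):                  # each pass = one epoch
        order = rng.permutation(m)           # step 1: randomly shuffle
        for i in order:                      # step 2: one example per update
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
    return theta

theta = sgd(X, y)
```

With m = 500 and 5 epochs, this performs 2,500 parameter updates, versus 5 for batch gradient descent over the same number of passes.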
Advantages of SGD
- Much faster for large datasets
- Works well with online learning
- Requires less memory
- Can start improving model immediately
Disadvantages
- Updates are noisy
- Convergence is less stable
- Requires careful learning rate tuning
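The last point, learning-rate tuning, is often addressed by decaying $\alpha$ over time so that updates shrink as the parameters approach the minimum. One common schedule is $\alpha_t = \alpha_0 / (1 + t / \tau)$; the function below is a hypothetical sketch of that idea (the names and the constant $\tau$ are illustrative, not from the original text):

```python
def decayed_alpha(alpha0, t, decay_steps=1000.0):
    """Shrink the learning rate as the update count t grows.

    At t = 0 this returns alpha0; at t = decay_steps it returns alpha0 / 2.
    """
    return alpha0 / (1.0 + t / decay_steps)
```

A decaying rate lets SGD take large, fast steps early and small, stable steps near the minimum, damping the zig-zag behavior described above.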
3. Mini-Batch Gradient Descent
A compromise between the two approaches.
| Method | Data used per update |
|---|---|
| Batch Gradient Descent | All m training examples |
| Stochastic Gradient Descent (SGD) | 1 example |
| Mini-Batch Gradient Descent | A batch of b examples |
Use small batches. With batches of 100 examples, a dataset of 10,000 examples gives 100 updates per epoch instead of just 1.
Learning becomes much faster.
batch 1 (100 examples) → update
batch 2 (100 examples) → update
batch 3 (100 examples) → update
...
batch n (100 examples) → update ...
Mini-batch GD is the most common method in deep learning today.
- Faster than Batch Gradient Descent
- More stable than SGD
- Works well with GPUs
- Efficient for large datasets
Mathematically

Update rule:

$$\theta_j := \theta_j - \alpha \, \frac{1}{b} \sum_{k=i}^{i+b-1} \left( h_\theta(x^{(k)}) - y^{(k)} \right) x_j^{(k)}$$

Where

- $b$ = mini-batch size
- $\alpha$ = learning rate
- $h_\theta$ = hypothesis function
- $y^{(k)}$ = true value
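The mini-batch update rule can be sketched as follows, again for linear regression on synthetic data (the dataset, batch size, and learning rate are illustrative choices, not from the original):

```python
import numpy as np

# Synthetic linear-regression data (illustrative only).
rng = np.random.default_rng(0)
m = 1000
X = np.c_[np.ones(m), rng.normal(size=m)]
true_theta = np.array([0.5, -1.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

def minibatch_gd(X, y, alpha=0.1, b=100, epochs=20):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):         # m / b updates per epoch
            idx = order[start:start + b]
            # Average the gradient over the b examples in this batch.
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad
    return theta

theta = minibatch_gd(X, y)
```

Averaging over b examples makes each update far less noisy than pure SGD, while the vectorized batch computation is exactly what makes this method efficient on GPUs.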
SGD enables training models on massive datasets with millions or billions of examples.
