Stochastic Gradient Descent (SGD): Efficient Optimization for Large Datasets
Understand how Stochastic Gradient Descent works and why it is widely used in large-scale machine learning. Learn how SGD updates model parameters using one training example at a time to improve computational efficiency and scalability.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is an optimization algorithm used to train machine learning models efficiently on very large datasets.
Instead of computing gradients using the entire dataset, SGD updates the parameters using one training example at a time.
| Method | Gradient Computation |
|---|---|
| Batch Gradient Descent | Uses all training examples |
| Stochastic Gradient Descent | Uses one example |
| Mini-batch Gradient Descent | Uses small batch |
An analogy: learning to throw a ball into a basket. 🏀
1. Batch Gradient Descent
The Big, Slow Way

- Throw 100 balls.
- After throwing all 100, look at every throw.
- Then adjust your aim a little.

Then again:

- Throw another 100 balls.
- Look at all of them.
- Adjust again.

This works, but it is very slow, because you wait until you have seen all the throws before learning.
Mathematically

- Uses all training examples for each update
- Each step requires summing over m examples

Batch Gradient Descent computes the gradient of the cost over all training examples:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Update rule:

$$\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

- Very slow for large datasets
- Computes the exact gradient for each update

If $m = 100{,}000{,}000$, then every update requires processing 100 million examples, which is computationally expensive.
Cost
With batch gradient descent, the cost decreases smoothly.
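The batch update above can be sketched in a few lines of NumPy. This is a minimal illustration for linear regression (the dataset, learning rate, and epoch count are all made up for the example, not from the original text):

```python
import numpy as np

# Hypothetical setup: linear regression h_theta(x) = theta @ x with a
# squared-error cost, on synthetic data.
rng = np.random.default_rng(0)
m = 200
X = np.c_[np.ones(m), rng.normal(size=m)]       # bias column + one feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

def batch_gradient_descent(X, y, alpha=0.1, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # One update per epoch: the gradient sums over ALL m examples.
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

theta = batch_gradient_descent(X, y)
```

Note that each parameter update touches every example, which is exactly why the cost curve is smooth but each step is expensive.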
---
2. Stochastic Gradient Descent
A Faster Way to Learn
- Before beginning the main loop of stochastic gradient descent, it is a good idea to "shuffle" your training data into a random order.
Idea
- You throw one ball.
- If it misses, you adjust your aim immediately.
- Then you throw another ball.
- Miss again? Adjust again.
Mathematically

Instead of summing over all examples, update the parameters after each training example.

The cost of a single example is:

$$\text{cost}\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

For each example $i$:

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Each step uses only one training example.
Stochastic Gradient Descent:
- Uses noisy approximation
- Much faster updates
- Works well for large scale learning
Cost Function
With SGD, the cost may oscillate, but overall it trends downward.
The algorithm jumps around the minimum instead of moving smoothly.
- Instead of going smoothly to the perfect aim, your learning path looks a little zig-zag.
SGD Algorithm

Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$

Algorithm:

1. Randomly shuffle the training set.
2. For i = 1 to m, update each parameter: $\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
3. Repeat for multiple passes through the data.

Each pass through the dataset is called an epoch.
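The shuffle-then-update loop can be sketched directly in NumPy. This is an illustrative linear-regression example (data, learning rate, and epoch count are invented for the sketch):

```python
import numpy as np

# Synthetic linear-regression data (illustrative, not from the original).
rng = np.random.default_rng(0)
m = 500
X = np.c_[np.ones(m), rng.normal(size=m)]
true_theta = np.array([1.0, 2.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

def sgd(X, y, alpha=0.01, epochs=5):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):                  # each pass = one epoch
        order = rng.permutation(m)           # step 1: randomly shuffle
        for i in order:                      # step 2: one example per update
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
    return theta

theta = sgd(X, y)
```

With m = 500 and 5 epochs, this performs 2,500 parameter updates, versus 5 for batch gradient descent over the same number of passes.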
Advantages of SGD
- Much faster for large datasets
- Works well with online learning
- Requires less memory
- Can start improving model immediately
Disadvantages
- Updates are noisy
- Convergence is less stable
- Requires careful learning rate tuning
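The last point, learning-rate tuning, is often addressed by decaying $\alpha$ over time so that updates shrink as the parameters approach the minimum. One common schedule is $\alpha_t = \alpha_0 / (1 + t / \tau)$; the function below is a hypothetical sketch of that idea (the names and the constant $\tau$ are illustrative, not from the original text):

```python
def decayed_alpha(alpha0, t, decay_steps=1000.0):
    """Shrink the learning rate as the update count t grows.

    At t = 0 this returns alpha0; at t = decay_steps it returns alpha0 / 2.
    """
    return alpha0 / (1.0 + t / decay_steps)
```

A decaying rate lets SGD take large, fast steps early and small, stable steps near the minimum, damping the zig-zag behavior described above.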
3. Mini-Batch Gradient Descent
A compromise between the two approaches.
| Method | Data used per update |
|---|---|
| Batch Gradient Descent | All m training examples |
| Stochastic Gradient Descent (SGD) | 1 example |
| Mini-Batch Gradient Descent | A batch of b examples |
Use small batches. With batches of 100 examples, a dataset of 10,000 examples gives 100 updates per epoch instead of just 1.
Learning becomes much faster.
batch 1 (100 examples) → update
batch 2 (100 examples) → update
batch 3 (100 examples) → update
...
batch n (100 examples) → update ...
Mini-batch GD is the most common method in deep learning today.
- Faster than Batch Gradient Descent
- More stable than SGD
- Works well with GPUs
- Efficient for large datasets
Mathematically

Update rule:

$$\theta_j := \theta_j - \alpha \, \frac{1}{b} \sum_{k=i}^{i+b-1} \left( h_\theta(x^{(k)}) - y^{(k)} \right) x_j^{(k)}$$

Where

- $b$ = mini-batch size
- $\alpha$ = learning rate
- $h_\theta$ = hypothesis function
- $y^{(k)}$ = true value
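The mini-batch update rule can be sketched as follows, again for linear regression on synthetic data (the dataset, batch size, and learning rate are illustrative choices, not from the original):

```python
import numpy as np

# Synthetic linear-regression data (illustrative only).
rng = np.random.default_rng(0)
m = 1000
X = np.c_[np.ones(m), rng.normal(size=m)]
true_theta = np.array([0.5, -1.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

def minibatch_gd(X, y, alpha=0.1, b=100, epochs=20):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):         # m / b updates per epoch
            idx = order[start:start + b]
            # Average the gradient over the b examples in this batch.
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad
    return theta

theta = minibatch_gd(X, y)
```

Averaging over b examples makes each update far less noisy than pure SGD, while the vectorized batch computation is exactly what makes this method efficient on GPUs.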
SGD enables training models on massive datasets with millions or billions of examples.
