MapReduce for Large-Scale Machine Learning: Distributed Training at Scale
Learn how the MapReduce framework enables distributed computation for large-scale machine learning. Understand how it helps parallelize gradient computation and process massive datasets efficiently across multiple machines.
Fancy name for divide and conquer
MapReduce speeds up large-scale machine learning by splitting data across many machines, computing partial results in parallel, and then combining them.
Problem
Algorithms like gradient descent and stochastic gradient descent normally run on a single computer.
But sometimes the dataset is too large for one machine to process in a reasonable time (millions or billions of examples).
To solve this, we use MapReduce, a technique for parallelizing computation across multiple machines.
- Introduced by Jeff Dean and Sanjay Ghemawat at Google.
Key Idea
Imagine you have a huge pile of LEGO pieces on the floor 🧱. You want to count how many pieces there are.
But there are too many for you to count alone. It would take forever.
So you ask your 3 friends to help.
Many machine learning algorithms spend most of their time computing a sum over all m training examples.
Example: batch gradient descent updates each parameter θ_j using a sum over the whole dataset:

θ_j := θ_j − α · (1/m) · Σᵢ₌₁..ₘ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾

When m is very large, computing this sum on one machine becomes slow.
MapReduce speeds this up by splitting the work across multiple machines.
Step 1 — Split the dataset
You divide the LEGO pile into 4 smaller piles.
- You count pile 1
- Friend 1 counts pile 2
- Friend 2 counts pile 3
- Friend 3 counts pile 4
Step 2 — Map Step (Parallel Computation)
Now everyone counts at the same time. This part is called Map.
Example:
400 training examples
Divide across 4 machines
Machine 1 → examples 1–100
Machine 2 → examples 101–200
Machine 3 → examples 201–300
Machine 4 → examples 301–400
In Machine Learning
Each machine computes partial sums of the gradient.
Example for machine 1 (examples 1–100):

temp_j⁽¹⁾ = Σᵢ₌₁..₁₀₀ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾

Each machine computes its own temp value over its slice of the data.
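The map step can be sketched in Python for linear regression. This is a minimal illustration: the function name, the array shapes, and the four-way split are assumptions for the example, not part of any particular library.

```python
import numpy as np

def partial_gradient(X, y, theta):
    """Map step: one machine's contribution to the gradient sum.

    X: (n_examples, n_features) slice of the dataset
    y: (n_examples,) targets for that slice
    theta: current parameter vector
    Returns the *unnormalized* sum of (h_theta(x) - y) * x over the slice.
    """
    errors = X @ theta - y   # h_theta(x) - y for a linear hypothesis
    return X.T @ errors      # sum over the slice, one entry per theta_j

# 400 training examples split across 4 "machines" (here, just array slices)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)

temps = [partial_gradient(X[i:i + 100], y[i:i + 100], theta)
         for i in range(0, 400, 100)]
```

Because addition is associative, the four partial sums add up to exactly the gradient sum a single machine would have computed over all 400 examples.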
Step 3 — Reduce Step (Combine Results)
Idea
After everyone finishes counting, you bring the numbers together.
Example:
- You counted 100
- Friend 1 counted 120
- Friend 2 counted 90
- Friend 3 counted 110
Now you add them. This step is called Reduce.
100 + 120 + 90 + 110 = 420
Mathematically
A central server collects the partial results temp_j⁽¹⁾, temp_j⁽²⁾, temp_j⁽³⁾, temp_j⁽⁴⁾.
It combines them and performs the gradient update:

θ_j := θ_j − α · (1/m) · (temp_j⁽¹⁾ + temp_j⁽²⁾ + temp_j⁽³⁾ + temp_j⁽⁴⁾)

So splitting the computation does not change the math; it only speeds it up.
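The reduce step can be sketched as follows. The partial sums below are made-up toy numbers for a 2-parameter model, and the function name is an illustrative assumption.

```python
import numpy as np

def reduce_and_update(temps, theta, alpha, m):
    """Reduce step: combine partial gradient sums, then take one gradient step."""
    total = np.sum(temps, axis=0)      # temp(1) + temp(2) + temp(3) + temp(4)
    return theta - alpha * total / m   # theta_j := theta_j - alpha * (1/m) * sum

# Toy partial sums from 4 machines, m = 400 examples in total
temps = [np.array([10.0, 2.0]), np.array([12.0, 1.0]),
         np.array([9.0, 3.0]), np.array([11.0, 2.0])]
theta = np.zeros(2)
new_theta = reduce_and_update(temps, theta, alpha=0.1, m=400)
```

The central server only ever sees four small vectors per iteration, not the full dataset, which is what keeps communication cheap.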
Speed Improvement vs. Overhead
If we use 4 machines:
- Each machine does ¼ of the work
- Ideal speedup → 4×
In practice, it is slightly less due to:
- Network communication
- Coordination overhead
When Does MapReduce Work Well?
MapReduce works well when the bulk of the algorithm can be written as a sum of functions computed over the training set:

Σᵢ₌₁..ₘ f(x⁽ⁱ⁾, y⁽ⁱ⁾)
Many ML algorithms have this structure:
- Linear Regression
- Logistic Regression
- Gradient Descent
- Cost function computation
- Gradient computation
Example: Logistic Regression with MapReduce
For logistic regression we must compute:

Cost function:

J(θ) = −(1/m) · Σᵢ₌₁..ₘ [ y⁽ⁱ⁾ · log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) · log(1 − h_θ(x⁽ⁱ⁾)) ]

Gradient:

∂J/∂θ_j = (1/m) · Σᵢ₌₁..ₘ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾

Both are sums over the dataset, so both can be parallelized using MapReduce.
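A sketch of the logistic regression map and reduce steps in Python follows; the function names and the two-way split are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_cost_and_gradient(X, y, theta):
    """Map step: unnormalized cost and gradient sums over one data slice."""
    h = sigmoid(X @ theta)
    cost_sum = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad_sum = X.T @ (h - y)
    return cost_sum, grad_sum

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
theta = np.zeros(2)

# Two "machines", each handling half of the 200 examples
parts = [partial_cost_and_gradient(X[:100], y[:100], theta),
         partial_cost_and_gradient(X[100:], y[100:], theta)]

m = len(y)
J = sum(c for c, _ in parts) / m      # reduce: combine the cost sums
grad = sum(g for _, g in parts) / m   # reduce: combine the gradient sums
```

Both reductions are plain additions of the per-machine sums, which is exactly the structure MapReduce needs.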
MapReduce on a Single Computer
Running MapReduce across multiple cores of the same machine
You don't always need multiple machines.
Modern CPUs have multiple cores.
Example: 1 computer 4 CPU cores
Split the dataset:
Core 1 → 25%
Core 2 → 25%
Core 3 → 25%
Core 4 → 25%
Each core computes part of the sum, then the results are combined.
Advantage:
- No network latency
- Faster communication
MapReduce with Software Libraries
Libraries Can Sometimes Do This Automatically
Some numerical libraries already parallelize operations internally:
- BLAS
- LAPACK
- NumPy (linked with MKL)
If the code is vectorized, the library may automatically use multiple CPU cores.
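For example, a vectorized gradient computation hands the loop over all examples to the library, whose BLAS backend may spread the matrix product across cores. A minimal sketch (whether multiple cores are actually used depends on how NumPy was built):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 50))
y = rng.normal(size=10_000)
theta = np.zeros(50)

# Vectorized: one matrix-vector product instead of an explicit Python loop
# over 10,000 examples. The underlying BLAS routine may parallelize it.
grad = X.T @ (X @ theta - y) / len(y)
```

No explicit map or reduce code is written here; the same divide-and-combine work happens inside the library call.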
Tools for MapReduce
Popular implementations include:
- Hadoop
- Apache Spark
These allow large ML jobs to run on clusters with hundreds or thousands of machines.
