
MapReduce for Large-Scale Machine Learning: Distributed Training at Scale

Learn how the MapReduce framework enables distributed computation for large-scale machine learning. Understand how it helps parallelize gradient computation and process massive datasets efficiently across multiple machines.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


MapReduce for Large-Scale Machine Learning

Fancy name for divide and conquer

MapReduce speeds up large-scale machine learning by splitting data across many machines, computing partial results in parallel, and then combining them.

Problem

Algorithms like Gradient Descent and Stochastic Gradient Descent typically run on a single computer.

But sometimes datasets are too large for a single machine (millions or billions of examples).

To solve this, we use MapReduce, a technique for parallel computing across multiple machines.

  • Introduced by Jeff Dean and Sanjay Ghemawat at Google.

Key Idea

Imagine you have a huge pile of LEGO pieces on the floor 🧱. You want to count how many pieces there are.

But there are too many for you to count alone. It would take forever.

So you ask your 3 friends to help.

Many machine learning algorithms require computing a sum over all training examples.

Example: Gradient descent requires computing a sum across the dataset.

$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

When m is very large, computing this sum on one machine becomes slow.

MapReduce speeds this up by splitting the work across multiple machines.
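The single-machine version of this update can be sketched in NumPy (a minimal illustration with a toy dataset; `gradient_step` and the synthetic data are ours, not from the original post):

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One batch gradient-descent update for linear regression.

    Computes the full sum over all m training examples on one machine;
    this sum is exactly the part MapReduce later parallelizes.
    """
    m = len(y)
    errors = X @ theta - y        # h_theta(x^(i)) - y^(i) for every example
    grad = (X.T @ errors) / m     # (1/m) * sum_i error_i * x_j^(i) for every j
    return theta - alpha * grad

# Tiny synthetic dataset: y = 2*x, with a bias column of ones
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)
theta = gradient_step(theta, X, y, alpha=0.1)
```

When m grows into the millions, the `X @ theta - y` pass over the whole dataset is what becomes too slow for one machine.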

Step 1 — Split the dataset

You divide the LEGO pile into 4 smaller piles.

  • You count pile 1
  • Friend 1 counts pile 2
  • Friend 2 counts pile 3
  • Friend 3 counts pile 4

Step 2 — Map Step (Parallel Computation)

Now everyone counts at the same time. This part is called Map.

Example:

400 training examples

Divide across 4 machines

  • Machine 1 → examples 1–100
  • Machine 2 → examples 101–200
  • Machine 3 → examples 201–300
  • Machine 4 → examples 301–400

In Machine Learning

Each machine computes partial sums of the gradient.

Example for machine 1:

$$temp_{1j} = \sum_{i=1}^{100} \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

Each machine computes its own temp value.
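The map step can be sketched like this, with four NumPy array shards standing in for the four machines (a toy sketch with synthetic data; the name `partial_gradient` is ours):

```python
import numpy as np

def partial_gradient(theta, X_shard, y_shard):
    """Map step: one machine's unnormalized partial sum of the gradient,
    temp_kj = sum over this shard of (h_theta(x^(i)) - y^(i)) * x_j^(i)."""
    errors = X_shard @ theta - y_shard
    return X_shard.T @ errors   # one temp_kj value per parameter j

# 400 synthetic examples split across 4 "machines"
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = rng.normal(size=400)
theta = np.zeros(3)

shards = zip(np.array_split(X, 4), np.array_split(y, 4))
temps = [partial_gradient(theta, Xs, ys) for Xs, ys in shards]
```

Each entry of `temps` is one machine's temp vector; in a real cluster the four calls would run concurrently on different machines.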

Step 3 — Reduce Step (Combine Results)

Idea

After everyone finishes counting, you bring the numbers together.

Example:

  • You counted 100
  • Friend 1 counted 120
  • Friend 2 counted 90
  • Friend 3 counted 110

Now you add them. This step is called Reduce.

100 + 120 + 90 + 110 = 420

Mathematically

A central server collects the partial results.

It combines them:

$$temp_j = temp_{1j} + temp_{2j} + temp_{3j} + temp_{4j}$$

Then performs the gradient update.

$$\sum_{i=1}^{400} f(x_i) = \sum_{i=1}^{100} f(x_i) + \sum_{i=101}^{200} f(x_i) + \sum_{i=201}^{300} f(x_i) + \sum_{i=301}^{400} f(x_i)$$

So splitting the computation does not change the math; it only speeds it up.
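A quick numeric check of this identity, using plain NumPy:

```python
import numpy as np

# Splitting a sum into four partial sums leaves the result unchanged:
# sum_{1..400} f(x_i) = sum_{1..100} + sum_{101..200} + sum_{201..300} + sum_{301..400}
values = np.arange(1, 401, dtype=float)   # stand-in for the values f(x_i)

full_sum = values.sum()
partial_sums = [chunk.sum() for chunk in np.array_split(values, 4)]

assert full_sum == sum(partial_sums)   # 80200.0 either way
```

This works because addition is associative; any algorithm whose core computation is such a sum can be reduced in any grouping.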


Speed Improvement vs Latency

If we use 4 machines:

  • Each machine does ¼ of the work
  • Ideal speedup → 4×

In practice, it is slightly less due to:

  • Network communication
  • Coordination overhead

When Does MapReduce Work Well?

MapReduce works when the algorithm can be written as:

$$\sum_{i=1}^{m} f(x^{(i)})$$

Many ML algorithms have this structure:

  • Linear Regression
  • Logistic Regression
  • Gradient Descent
  • Cost function computation
  • Gradient computation

Example:

Logistic Regression with MapReduce

For logistic regression we must compute:

Cost Function

$$J(\theta) = \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$$

Gradient

$$\frac{\partial J}{\partial \theta_j} = \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

Both are sums over the dataset, so they can be parallelized using MapReduce.
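A sketch of both computations under MapReduce, with synthetic data and four array shards standing in for four machines (the helper names and the `eps` guard against `log(0)` are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shard_cost_and_gradient(theta, X_shard, y_shard):
    """Map step for logistic regression: this shard's contribution to
    the (unnormalized) cost J and the gradient dJ/dtheta_j."""
    h = sigmoid(X_shard @ theta)
    eps = 1e-12  # numerical guard against log(0)
    cost = -np.sum(y_shard * np.log(h + eps) + (1 - y_shard) * np.log(1 - h + eps))
    grad = X_shard.T @ (h - y_shard)
    return cost, grad

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (rng.random(400) > 0.5).astype(float)
theta = np.zeros(3)

# Map: each shard computes its partial cost and gradient (in parallel on a cluster)
results = [shard_cost_and_gradient(theta, Xs, ys)
           for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]

# Reduce: a central server adds the partial results
total_cost = sum(c for c, _ in results)
total_grad = sum(g for _, g in results)
```

Because both quantities are sums over examples, the reduce step is simple addition, and the combined result matches the single-machine computation exactly (up to floating-point rounding).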


MapReduce on a Single Computer

Running MapReduce on multiple cores of the same machine

You don't always need multiple machines.

Modern CPUs have multiple cores.

Example: 1 computer with 4 CPU cores

Split the dataset:

  • Core 1 → 25%
  • Core 2 → 25%
  • Core 3 → 25%
  • Core 4 → 25%

Each core computes part of the sum, then the results are combined.

Advantage:

  • No network latency
  • Faster communication
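A single-machine sketch of the same pattern, using a thread pool from Python's standard library to play the role of the four cores (threads are shown for simplicity; a process pool sidesteps Python's GIL for pure-Python work, while NumPy's compiled kernels release it anyway):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Work assigned to one core: sum its quarter of the data."""
    return chunk.sum()

def multicore_sum(values, n_workers=4):
    """Map-reduce on a single machine: split the data across workers,
    compute partial sums concurrently, then combine. No network involved,
    so there is no communication latency between 'machines'."""
    chunks = np.array_split(values, n_workers)           # each core gets 25%
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))   # map step
    return sum(partials)                                 # reduce step

data = np.arange(1, 401, dtype=float)
print(multicore_sum(data))   # 80200.0, same as data.sum()
```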

MapReduce with Software Libraries

Libraries Can Sometimes Do This Automatically

Some numerical libraries already parallelize operations internally:

  • BLAS
  • LAPACK
  • NumPy (linked with MKL)

If the code is vectorized, the library may automatically use multiple CPU cores.
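For instance, writing the gradient as one matrix expression hands the whole computation to the BLAS backend (whether it actually spreads the work across cores depends on how your NumPy build is linked; the toy data here is ours):

```python
import numpy as np

# The same gradient, written two ways.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = rng.normal(size=400)
theta = np.zeros(3)

# Explicit Python loop over training examples: slow, single-core
grad_loop = np.zeros(3)
for i in range(len(y)):
    grad_loop += (X[i] @ theta - y[i]) * X[i]

# Vectorized: one matrix expression, which the BLAS library can
# parallelize internally without any change to your code
grad_vec = X.T @ (X @ theta - y)

print(np.allclose(grad_loop, grad_vec))   # True
```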

Tools for MapReduce

Popular implementations include:

  • Hadoop
  • Apache Spark

These allow large ML jobs to run on clusters with hundreds or thousands of machines.
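The paradigm these frameworks implement at cluster scale can be mimicked with Python's built-in `map` and `functools.reduce` (a toy sketch of the pattern, not actual Hadoop or Spark code; the dataset and `theta` are made up):

```python
from functools import reduce

# map applies a function to each record independently (parallelizable);
# reduce combines the results pairwise with an associative operation.
records = [(x, 2 * x) for x in range(1, 101)]   # toy (x, y) pairs with y = 2x
theta = 1.5

# Map: per-record gradient contribution (h(x) - y) * x
contributions = map(lambda r: (theta * r[0] - r[1]) * r[0], records)

# Reduce: combine with addition
total = reduce(lambda a, b: a + b, contributions)
print(total)   # -169175.0
```

In Spark the same shape appears as transformations on a distributed dataset followed by an aggregation, with the framework handling sharding, scheduling, and fault tolerance.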
