Large Scale Machine Learning: Training Models on Massive Datasets
Explore techniques for scaling machine learning algorithms to large datasets, including stochastic gradient descent and mini-batch gradient descent. Learn how to efficiently train linear models, logistic regression, and neural networks on millions of examples.
Large Scale Machine Learning
Machine learning methods designed to train models on very large datasets.
Many modern ML systems perform well largely because massive datasets are now available for training.
A common saying in machine learning:
It’s often not who has the best algorithm, but who has the most data.
Why Large Datasets Help
High-performance ML systems often require:
- Low-bias algorithms
- Large amounts of training data
When a model has enough capacity and is trained on more data, it can often learn more accurate patterns.
Computational Challenge
Large datasets create computational problems.
Example: suppose the training set contains m = 100,000,000 examples.
Training models like linear regression or logistic regression requires computing gradients over all training examples.
Batch gradient descent update (for each parameter theta_j):

theta_j := theta_j - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x_j^(i)

Every single update requires a sum over all m training examples.
If m = 100 million, computing this sum for every gradient step becomes extremely expensive.
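The batch update above can be sketched for linear regression as follows (a minimal NumPy sketch; the function name and tiny dataset are illustrative, not from the original notes):

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One batch gradient descent update for linear regression.

    The gradient averages the error over ALL m training examples,
    so each step costs O(m * n) -- the bottleneck when m is huge.
    """
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m  # sum over every example
    return theta - alpha * gradient

# Tiny demo: fit y = 2x on four examples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = np.zeros(1)
for _ in range(200):
    theta = batch_gradient_step(theta, X, y, alpha=0.05)
# theta converges toward 2.0
```

With m in the hundreds of millions, the `X.T @ (X @ theta - y)` product inside each step is what makes batch gradient descent impractical.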
Key Question Before Scaling
Before building infrastructure for massive datasets, ask:
Do we actually need that much data?
Maybe training with 1,000 examples already gives similar performance.
We check this using learning curves.
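One way to compute the points of a learning curve is to train on progressively larger subsets of the training data and record both errors (a sketch using a closed-form least-squares fit; the synthetic data and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 3x plus a little noise.
X = rng.uniform(0, 1, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=200)

X_train, y_train = X[:150], y[:150]
X_cv, y_cv = X[150:], y[150:]

def half_mse(theta, X, y):
    """Squared-error cost J(theta) = (1/2m) * sum of squared errors."""
    return float(np.mean((X @ theta - y) ** 2) / 2)

train_err, cv_err = [], []
for m in range(10, 151, 10):
    # Fit on only the first m training examples.
    theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    train_err.append(half_mse(theta, X_train[:m], y_train[:m]))
    cv_err.append(half_mse(theta, X_cv, y_cv))
```

Plotting `train_err` and `cv_err` against m shows whether the gap between the two curves (variance) or their shared level (bias) dominates.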
Learning Curve Analysis
High Variance
- Training error: low
- Cross-validation error: high
Interpretation:
- Model is overfitting
- Adding more training data helps
More data → performance improves
Large datasets are useful here.
High Bias
- Training error: high
- Cross-validation error: high
Interpretation:
- Model is underfitting
- Adding more data will not help much
Instead try:
- adding features
- increasing model complexity
- adding hidden units in neural networks
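One common way to add features, as suggested above, is polynomial expansion (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def add_polynomial_features(X, degree):
    """Expand a single-feature design matrix with polynomial terms.

    Increases model capacity, which can reduce bias when learning
    curves show underfitting.
    """
    return np.hstack([X ** d for d in range(1, degree + 1)])

X = np.array([[1.0], [2.0], [3.0]])
X_poly = add_polynomial_features(X, degree=3)
# Columns of X_poly are x, x^2, x^3.
```

The richer feature set lowers bias, at the cost of potentially reintroducing variance, so the learning-curve check should be repeated afterward.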
When Large Data is Worth It
Large datasets are helpful when:
- model has low bias
- model suffers from high variance
- performance keeps improving with more data
Techniques for Large Scale ML
To handle massive datasets efficiently, two key methods are used:
1. Stochastic Gradient Descent (SGD)
Instead of computing gradients over the entire dataset, update parameters one example at a time.
This dramatically reduces computation.
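A single SGD epoch for linear regression can be sketched as follows (illustrative NumPy code; the function name, step size, and tiny dataset are assumptions for the demo):

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha, rng):
    """One pass of stochastic gradient descent for linear regression.

    Each update uses a SINGLE example, so the per-update cost is O(n)
    no matter how large the dataset is.
    """
    for i in rng.permutation(len(y)):  # shuffle, then sweep once
        error = X[i] @ theta - y[i]
        theta = theta - alpha * error * X[i]
    return theta

rng = np.random.default_rng(0)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = np.zeros(1)
for _ in range(50):
    theta = sgd_epoch(theta, X, y, alpha=0.02, rng=rng)
# theta converges toward 2.0
```

Shuffling before each pass matters: sweeping the data in a fixed order can bias the trajectory of the updates.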
2. MapReduce
A distributed computing framework that allows:
- parallel processing
- training across many machines
Used for extremely large datasets.
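The key observation behind applying MapReduce to gradient descent is that the batch gradient is a sum, and sums can be split across machines. A minimal single-process simulation (function names and the 4-chunk split are illustrative):

```python
import numpy as np

def map_partial_gradient(theta, X_chunk, y_chunk):
    """Map step: each machine computes the gradient SUM over its chunk."""
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def reduce_gradients(partials, m):
    """Reduce step: a central node adds the partial sums and divides by m."""
    return sum(partials) / m

X = np.arange(8, dtype=float).reshape(8, 1)
y = 2 * X[:, 0]
theta = np.array([0.5])

# Simulate 4 machines, each holding 2 of the 8 examples.
chunks = [(X[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)]
partials = [map_partial_gradient(theta, Xc, yc) for Xc, yc in chunks]
grad = reduce_gradients(partials, m=len(y))

# Matches the single-machine batch gradient exactly.
full_grad = X.T @ (X @ theta - y) / len(y)
```

Because the combined result equals the full-batch gradient, the distributed version converges exactly like single-machine batch gradient descent, only faster in wall-clock time.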
Summary
Large scale machine learning focuses on efficient training on huge datasets.
Key ideas:
- More data often improves performance
- But large datasets introduce computational challenges
- Use learning curves to verify if more data helps
- Use scalable algorithms such as:
- Stochastic Gradient Descent
- MapReduce
These techniques allow models like:
- Linear Regression
- Logistic Regression
- Neural Networks
to train on hundreds of millions of examples.
