Support Vector Machines (SVM): Maximizing Margins for Robust Machine Learning Models
Learn how Support Vector Machines (SVM) build powerful classification models by finding the optimal separating hyperplane that maximizes the margin between classes. Discover how the margin, regularization parameter C, and kernel functions help SVM handle both linear and non-linear data while improving generalization performance.
Support Vector Machine (SVM)
SVM = Find the safest separating line. Advantages:
- Works very well on small and medium datasets
- Global optimum (convex optimization)
- Strong theoretical guarantees
- Very powerful with kernels
A very powerful algorithm that is widely used in industry.
Compared to Logistic Regression and Neural Networks, SVMs sometimes provide a cleaner and more powerful way to learn complex non-linear decision boundaries.
Logistic Regression: draws a line to separate classes.

```
Teddy  Teddy  |line|  Car  Car
```

SVM: draws the widest possible gap between classes.

```
Teddy  Teddy  |    line    |  Car  Car
Teddy  (closest teddy)  |  line  |  (closest car)  Car
```
- Margin: the extra space around the decision boundary; SVM tries to maximize it.
- Support vectors: the data points closest to the boundary.
| Logistic Regression | SVM |
|---|---|
| Predicts probability | Predicts class |
| Log loss | Hinge loss |
| Smooth curve penalty | Linear margin penalty |
| Uses λ regularization | Uses C parameter |
| Focus on likelihood | Focus on margin |
When to use which algorithm
- $n$: number of features
- $m$: number of training examples

| Situation | Recommended |
|---|---|
| $n$ very large, $m$ small | Logistic regression or linear SVM |
| $n$ small, $m$ medium | SVM with Gaussian kernel |
| $n$ small, $m$ huge | Logistic regression |
Example
Spam detection
- features = 10000
- examples = 500
→ Linear SVM
Image dataset
- features = 50
- examples = 10000
→ Gaussian SVM
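The rules of thumb above can be written down as a tiny helper. The function name and the thresholds are illustrative assumptions, not part of any library:

```python
def choose_kernel(n_features: int, m_examples: int) -> str:
    """Heuristic from the table above; the cutoffs are illustrative guesses."""
    if n_features >= m_examples:
        # Many features, few examples: a linear model is enough
        return "logistic regression or linear SVM"
    if m_examples <= 50_000:
        # Few features, medium-sized dataset: a kernel pays off
        return "SVM with Gaussian kernel"
    # Few features, huge dataset: kernel SVMs become too slow
    return "logistic regression"

print(choose_kernel(10_000, 500))   # spam detection example
print(choose_kernel(50, 10_000))    # image dataset example
```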
SVM Cost Function
Starting From Logistic Regression
Logistic regression uses the hypothesis:

$$h_\theta(x) = \frac{1}{1 + e^{-z}}$$

where $z = \theta^T x$.

Interpretation:
- If $z \gg 0$ → $h_\theta(x) \approx 1$
- If $z \ll 0$ → $h_\theta(x) \approx 0$

So logistic regression tries to make:
- $z \gg 0$ when $y = 1$
- $z \ll 0$ when $y = 0$
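A quick numeric check of this behavior (plain Python, illustrative):

```python
import math

def sigmoid(z: float) -> float:
    """h(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(10))    # close to 1: confident positive prediction
print(sigmoid(-10))   # close to 0: confident negative prediction
print(sigmoid(0))     # exactly 0.5: right on the decision boundary
```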
Logistic Regression Cost Function
For a single training example $(x, y)$:

$$\text{Cost} = -y \log h_\theta(x) - (1 - y) \log\big(1 - h_\theta(x)\big)$$

Two cases:

When $y = 1$:

$$\text{Cost} = -\log h_\theta(x)$$

Substitute the sigmoid:

$$\text{Cost} = -\log \frac{1}{1 + e^{-z}}$$

When $z$ becomes large:
- Cost becomes very small

This encourages the algorithm to push $z$ large for positive examples.

When $y = 0$:

$$\text{Cost} = -\log\big(1 - h_\theta(x)\big)$$

Which becomes:

$$\text{Cost} = -\log \frac{1}{1 + e^{z}}$$

When $z \ll 0$:
- Cost becomes very small
Hinge loss

Instead of the smooth logistic loss, SVM uses a piecewise linear function called the hinge loss.

Cost when $y = 1$:

$$\text{cost}_1(z) = \max(0,\, 1 - z)$$

Cost when $y = 0$:

$$\text{cost}_0(z) = \max(0,\, 1 + z)$$

Properties:
- If the classification is correct and confident ($z \ge 1$ for $y = 1$, $z \le -1$ for $y = 0$), cost = 0
- If the classification is wrong or too close to the boundary, cost increases linearly
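The two hinge branches are one line each in plain Python (an illustrative sketch):

```python
def cost1(z: float) -> float:
    """Cost for y = 1: zero once z >= 1, grows linearly below that."""
    return max(0.0, 1.0 - z)

def cost0(z: float) -> float:
    """Cost for y = 0: zero once z <= -1, grows linearly above that."""
    return max(0.0, 1.0 + z)

print(cost1(2.0))   # 0.0 -> correct and confident
print(cost1(0.5))   # 0.5 -> correct side, but inside the margin
print(cost1(-1.0))  # 2.0 -> wrong side, penalized linearly
```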
SVM Optimization Objective
The optimization objective becomes:

$$\min_\theta \; \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \,\text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \,\text{cost}_0(\theta^T x^{(i)}) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Dropping the constant $\frac{1}{m}$ and re-parameterizing with $C$, this can be simplified to:

$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \,\text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \,\text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
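The simplified objective translates directly into a few lines of plain Python. This sketch assumes each $x$ already includes the bias feature $x_0 = 1$, so $\theta_0$ is excluded from the regularization term:

```python
def svm_objective(theta, X, y, C):
    """C * (sum of hinge costs) + (1/2) * sum of theta_j^2 for j >= 1."""
    hinge = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * f for t, f in zip(theta, x_i))  # theta^T x
        hinge += max(0.0, 1.0 - z) if y_i == 1 else max(0.0, 1.0 + z)
    reg = 0.5 * sum(t * t for t in theta[1:])  # skip the bias term theta_0
    return C * hinge + reg

# Both examples sit outside the margin, so only the regularization term remains
print(svm_objective([0.0, 1.0], [[1, 2], [1, -2]], [1, 0], C=1.0))  # 0.5
```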
Parameterization Using $C$

Instead of the logistic regression form:

$$A + \lambda B$$

SVM uses:

$$C A + B$$

Where:
- $A$ = training (classification) error
- $B$ = regularization term
- $C$ plays the role of $\frac{1}{\lambda}$
🎛️ Classification error

Intuitively:

Large $C$: strict about fitting the training data
- high variance, lower bias
- Effect: can overfit

Small $C$: smoother boundary
- low variance, higher bias
- Effect: may underfit
Interpretation:
| Parameter | Effect |
|---|---|
| Large (C) | Focus on minimizing training error |
| Small (C) | Strong regularization |
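The effect of $C$ is easy to see in scikit-learn: with a tiny $C$ (strong regularization) the margin is wide and almost every point falls inside it, becoming a support vector, while a large $C$ fits the training data tightly. A minimal sketch on hand-made 2D data:

```python
from sklearn import svm

# Two well-separated clusters (hand-made toy data)
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],   # class 0
     [3.0, 3.0], [3.2, 2.9], [2.8, 3.1], [3.1, 3.3]]   # class 1
y = [0, 0, 0, 0, 1, 1, 1, 1]

small_c = svm.SVC(kernel="linear", C=0.001).fit(X, y)  # strong regularization
large_c = svm.SVC(kernel="linear", C=1000).fit(X, y)   # strict fitting

# A wider margin swallows more points, so small C -> more support vectors
print(len(small_c.support_vectors_), len(large_c.support_vectors_))
```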
🎛️ Regularization Term

Second term: $\frac{1}{2} \sum_{j=1}^{n} \theta_j^2$ (keeps the parameters small)
You normally never implement this yourself. Libraries solve it using optimized algorithms.
SVM Hypothesis Function

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

The prediction simply answers: which side of the line is the data point on?

Unlike logistic regression, SVM does not output probabilities.
```
            Class 1 (Teddy Bears)
                    ●
                 ●     ●
               ●         ●
                    ↑
                    │  Margin
--------------------│-------------------- ← Decision Boundary (θᵀx = 0)
                    │
                    ↓
                 ○     ○
               ○         ○
                    ○
            Class 0 (Cars)
```
- $\theta^T x = 0$ is the line separating the classes
- $\theta^T x = +1$ ← upper margin
- $\theta^T x = -1$ ← lower margin

Prediction rule:
- Predict $y = 1$ when $\theta^T x \ge 1$
- Predict $y = 0$ when $\theta^T x \le -1$
Only the closest points influence the model.
All other points do not affect the boundary once they are outside the margin.
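This can be checked directly in scikit-learn: after fitting, `support_vectors_` holds only the boundary-defining points, and removing any other training point leaves the learned boundary essentially unchanged. A small sketch on toy data with a linear kernel:

```python
import numpy as np
from sklearn import svm

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, -1.0],   # class 0 (last point far away)
              [3.0, 3.0], [3.2, 2.9], [5.0, 5.0]])   # class 1 (last point far away)
y = np.array([0, 0, 0, 1, 1, 1])

model = svm.SVC(kernel="linear", C=1).fit(X, y)
print(model.support_vectors_)  # only the points nearest the boundary

# Refit without the two far-away points: the boundary barely moves
model2 = svm.SVC(kernel="linear", C=1).fit(X[[0, 1, 3, 4]], y[[0, 1, 3, 4]])
print(np.allclose(model.coef_, model2.coef_, atol=1e-3))
```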
```
              ● ← closest positive point
              ○ ← closest negative point

+1 Margin  --------------------------- ●
                                     ●
-----------  Decision Boundary  -----------
                                     ○
-1 Margin  --------------------------- ○
```
Non-Linear SVM

When the data is not linearly separable, a straight line cannot separate the classes.
```
   ○ ○ ○ ○
 ○         ○
 ○   ● ●   ○
 ○         ○
   ○ ○ ○ ○
```
One option is to create many polynomial features that can separate the points, but that would be computationally expensive. So we create our own custom features instead.
⛳ Landmarks ($l^{(1)}, l^{(2)}, \dots$)

Instead of polynomial features, we choose special points in space called landmarks.

Feature = how close is a point to each landmark?

Meaning:
- If $x$ is very close to landmark $l^{(i)}$ → value ≈ 1
- If $x$ is far from $l^{(i)}$ → value ≈ 0
```
        l2
   ○         ○
        ●  ●
   ○         ○
        l1
```
Kernel (Similarity Function)

Kernels help SVM draw curved decision boundaries instead of straight lines.

We measure similarity using a Gaussian kernel:

$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$$

Important:
- When using a Gaussian kernel, features must be scaled.

It defines: "How close am I to this landmark?"

So each landmark produces one feature $f_i$.
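The similarity computation is a one-liner in plain Python (an illustrative sketch of the formula above):

```python
import math

def gaussian_similarity(x, landmark, sigma=1.0):
    """f = exp(-||x - l||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(gaussian_similarity([1, 1], [1, 1]))  # 1.0 -> exactly on the landmark
print(gaussian_similarity([5, 5], [1, 1]))  # close to 0 -> far from the landmark
```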
⛰️ Sigma ($\sigma$)

This controls the width of the Gaussian kernel: how wide the influence of a landmark is.

🏔️ If $\sigma$ is small: very local influence = narrow peak
- similarity drops quickly
- very flexible boundary
- high variance, lower bias

🗻 If $\sigma$ is large: broader influence = wide peak
- similarity falls slowly
- smoother decision boundary
- high bias, lower variance
```
   peak (value = 1)
         ▲
        / \
       /   \
      /     \
     /       \
```

- Highest point = the landmark
- Values decrease as we move away
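Varying $\sigma$ for a fixed distance makes the width effect concrete. (In scikit-learn's RBF kernel this maps to `gamma` $= \frac{1}{2\sigma^2}$.) A small numeric sketch:

```python
import math

def similarity(sq_dist, sigma):
    """Gaussian kernel value for a given squared distance to a landmark."""
    return math.exp(-sq_dist / (2 * sigma ** 2))

d2 = 4.0  # squared distance between a point and a landmark
print(similarity(d2, 0.5))  # small sigma: narrow peak, similarity near 0
print(similarity(d2, 5.0))  # large sigma: wide peak, similarity near 1
```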
Prediction: if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \dots \ge 0$ → class 1

So:
- Close to important landmarks → positive
- Far from them → negative
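The whole landmark-based prediction fits in a few lines. The landmark positions and $\theta$ values below are hand-picked for illustration, not learned:

```python
import math

def similarity(x, l, sigma=1.0):
    """Gaussian kernel feature for point x and landmark l."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, l)) / (2 * sigma ** 2))

def predict(x, landmarks, theta, sigma=1.0):
    """Return 1 if theta_0 + sum(theta_i * f_i) >= 0, else 0."""
    feats = [similarity(x, l, sigma) for l in landmarks]
    score = theta[0] + sum(t * f for t, f in zip(theta[1:], feats))
    return 1 if score >= 0 else 0

landmarks = [(0, 0), (4, 4)]   # hypothetical landmark positions
theta = [-0.5, 1.0, 1.0]       # hand-picked, not learned

print(predict((0.1, 0.1), landmarks, theta))  # near a landmark -> 1
print(predict((10, -10), landmarks, theta))   # far from both   -> 0
```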
Don’t implement SVM yourself
SVM training requires solving a complex optimization problem.
In practice you never write this yourself.
Just use a library like:
- LIBSVM
- LIBLINEAR
- Scikit-learn (Python)
- TensorFlow / PyTorch wrappers
Example in Python:

```python
from sklearn import svm

model = svm.SVC(kernel="rbf", C=1)  # "rbf" is the Gaussian kernel
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

So the hard math is already implemented.
