

Support Vector Machines (SVM): Maximizing Margins for Robust Machine Learning Models

Learn how Support Vector Machines (SVM) build powerful classification models by finding the optimal separating hyperplane that maximizes the margin between classes. Discover how the margin, regularization parameter C, and kernel functions help SVM handle both linear and non-linear data while improving generalization performance.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Support Vector Machine (SVM)

SVM = Find the safest separating line. Advantages:

  • Works very well on small and medium datasets
  • Global optimum (convex optimization)
  • Strong theoretical guarantees
  • Very powerful with kernels

It is a very powerful algorithm, widely used in both industry and research.

Compared to Logistic Regression and Neural Networks, SVMs sometimes provide a cleaner and more powerful way to learn complex non-linear decision boundaries.

Logistic Regression:

Draws a line to separate classes.

Teddy   Teddy   |line|   Car   Car

SVM:

Draws the widest possible gap between classes.

Teddy   Teddy      | line |      Car   Car

Teddy   (closest teddy) |line| (closest car)   Car
  • Margin: the extra space between the boundary and the nearest data points; SVM tries to maximize it.
  • Support vectors: the data points closest to the boundary; they alone determine where it lies.
| Logistic Regression | SVM |
| --- | --- |
| Predicts probability | Predicts class |
| Log loss | Hinge loss |
| Smooth curve penalty | Linear margin penalty |
| Uses λ regularization | Uses C parameter |
| Focus on likelihood | Focus on margin |

When to use which algorithm

  • n: number of features
  • m: number of training examples

| Situation | Recommended |
| --- | --- |
| n very large, m small | Logistic regression or linear SVM |
| n small, m medium | SVM with Gaussian kernel |
| n small, m huge | Logistic regression |

Example

Spam detection

  • features = 10000
  • examples = 500

→ Linear SVM

Image dataset

  • features = 50
  • examples = 10000

→ Gaussian SVM
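The table and the two examples above can be turned into a tiny rule-of-thumb helper; the function name and thresholds below are illustrative assumptions, not hard rules:

```python
def choose_model(n_features, m_examples):
    # Hypothetical rule-of-thumb helper following the table above;
    # the thresholds are illustrative assumptions, not hard rules.
    if n_features >= 10 * m_examples:
        return "logistic regression or linear SVM"
    if m_examples > 50_000:
        return "logistic regression (or add features, then linear SVM)"
    return "SVM with Gaussian kernel"

print(choose_model(10_000, 500))  # spam detection example -> linear side
print(choose_model(50, 10_000))   # image dataset example -> Gaussian kernel
```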


SVM Cost Function

Starting From Logistic Regression

Logistic regression uses the hypothesis:

h_\theta(x) = \frac{1}{1 + e^{-z}}

where

z = \theta^T x

Interpretation:

  • If z ≫ 0 → h_θ(x) ≈ 1
  • If z ≪ 0 → h_θ(x) ≈ 0

So logistic regression tries to make:

  • θᵀx ≫ 0 when y = 1
  • θᵀx ≪ 0 when y = 0
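The interpretation above can be checked numerically; a minimal sketch of the sigmoid in plain Python:

```python
import math

def sigmoid(z):
    # h_theta(x) = 1 / (1 + e^{-z}), with z = theta^T x
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(10))   # ~1 when z >> 0
print(sigmoid(-10))  # ~0 when z << 0
print(sigmoid(0))    # 0.5 exactly on the boundary
```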

Logistic Regression Cost Function

For a single training example (x, y):

Cost = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))

Two cases:

When y = 1

Cost = -\log(h_\theta(x))

Substitute the sigmoid:

Cost = -\log\left(\frac{1}{1 + e^{-z}}\right)

When z becomes large:

  • h_θ(x) → 1
  • Cost becomes very small

This encourages the algorithm to push z large for positive examples.

When y = 0

Cost = -\log(1 - h_\theta(x))

Which becomes:

Cost = -\log\left(\frac{e^{-z}}{1 + e^{-z}}\right)

When z ≪ 0:

  • h_θ(x) → 0
  • Cost becomes very small
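Both cases can be verified with a few lines of Python; `logistic_cost` below is a hypothetical helper written in terms of z:

```python
import math

def logistic_cost(z, y):
    # Per-example logistic cost, expressed via z = theta^T x
    h = 1.0 / (1.0 + math.exp(-z))
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Confident, correct predictions make the cost shrink toward 0:
print(logistic_cost(5, 1))   # small: z >> 0 with y = 1
print(logistic_cost(-5, 0))  # small: z << 0 with y = 0
print(logistic_cost(-5, 1))  # large: confidently wrong
```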

Hinge Loss

Instead of the smooth logistic loss, the SVM uses a piecewise-linear function called the hinge loss.

Cost when y = 1

cost_1(z) = \max(0, 1 - z)

Cost when y = 0

cost_0(z) = \max(0, 1 + z)

Properties:

  • If classification is correct and confident, cost = 0
  • If classification is wrong or too close to boundary, cost increases linearly
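These two properties are easy to see in code; a minimal sketch of the two hinge costs:

```python
def cost1(z):
    # Hinge cost for y = 1: zero once z >= 1 (beyond the margin)
    return max(0.0, 1.0 - z)

def cost0(z):
    # Hinge cost for y = 0: zero once z <= -1
    return max(0.0, 1.0 + z)

print(cost1(2.0))   # 0.0 -> correct and confident
print(cost1(0.5))   # 0.5 -> correct but inside the margin
print(cost0(-2.0))  # 0.0
print(cost0(1.0))   # 2.0 -> wrong side, cost grows linearly
```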

SVM Optimization Objective

The optimization objective becomes:

\min_\theta \sum_{i=1}^{m} \left[ y^{(i)}\,cost_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,cost_0(\theta^T x^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2

This can be rescaled and simplified to:

\min_\theta \left[ C \sum_{i=1}^{m} Loss(\theta^T x^{(i)}, y^{(i)}) + \sum_{j=1}^{n} \theta_j^2 \right]
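A direct (unoptimized) reading of this objective can be sketched in plain Python, assuming each input vector already includes the bias feature x₀ = 1:

```python
def svm_objective(theta, X, y, C):
    # C * (sum of hinge losses) + sum of theta_j^2, with theta_0 unregularized
    def cost1(z): return max(0.0, 1.0 - z)
    def cost0(z): return max(0.0, 1.0 + z)
    loss = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * f for t, f in zip(theta, x_i))  # theta^T x
        loss += cost1(z) if y_i == 1 else cost0(z)
    reg = sum(t * t for t in theta[1:])  # skip the bias term theta_0
    return C * loss + reg

# Both toy examples sit beyond the margin, so only the regularizer remains:
print(svm_objective([0.0, 1.0], [[1.0, 2.0], [1.0, -2.0]], [1, 0], 1.0))  # 1.0
```

This is only a sketch of what the objective measures; real solvers never evaluate it this naively.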

Parameterization Using C

Instead of the logistic regression form:

A + \lambda B

SVM uses:

C \cdot A + B

Where:

  • A = training error
  • B = regularization term
  • C = weight on the training error (plays the role of 1/λ)

🎛️ Penalty Parameter C

Intuitively:

C \approx \frac{1}{\lambda}

Large CCC

Strict about fitting training data

  • high variance, lower bias
  • Effect: can overfit

Small C

Smoother boundary

  • low variance, higher bias
  • Effect: may underfit

Interpretation:

| Parameter | Effect |
| --- | --- |
| Large C | Focus on minimizing training error |
| Small C | Strong regularization |

🎛️ Regularization Term (B)

The second term keeps the parameters small:

\sum_{j=1}^{n} \theta_j^2

You normally never implement this yourself. Libraries solve it using optimized algorithms.

SVM Hypothesis Function h_θ(x)

The hypothesis simply reports which side of the boundary a data point falls on.

Unlike logistic regression, SVM does not output probabilities.

              Class 1 (Teddy Bears)
                    ●
               ●         ●

          ●                 ●
                    ↑
                    │  Margin
--------------------│--------------------   ← Decision Boundary (θᵀx = 0)
                    │
                    ↓

          ○                 ○
               ○       ○

                    ○
               Class 0 (Cars)

θᵀx = 0 → the decision boundary separating the classes

θᵀx = +1 ← upper margin

θᵀx = −1 ← lower margin

Prediction rule:

y = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}
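A minimal sketch of this prediction rule, assuming x already carries the bias feature x₀ = 1:

```python
def predict(theta, x):
    # SVM prediction: class 1 if theta^T x >= 0, else class 0
    z = sum(t * f for t, f in zip(theta, x))
    return 1 if z >= 0 else 0

print(predict([0.0, 1.0], [1.0, 2.0]))   # 1
print(predict([0.0, 1.0], [1.0, -2.0]))  # 0
```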

Only the closest points influence the model.

All other points do not affect the boundary once they are outside the margin.


●  ← closest positive point
○  ← closest negative point

         +1 Margin
---------------------------   ●
           ●

----------- Decision Boundary -----------

           ○
---------------------------   ○
      -1 Margin

Non-Linear SVM

For non-linear data, a straight line cannot separate the classes.


      ○ ○ ○ ○
    ○         ○
   ○    ● ●    ○
    ○         ○
      ○ ○ ○ ○

One option is to create many polynomial features until the classes become separable, but that would be computationally expensive.

So instead we create our own custom features.

⛳ Landmarks (lᵢ)

Instead of polynomial features, we choose special points in space called landmarks.

Feature = how close is a point x to each landmark?

Meaning:

  • If x is very close to landmark l → value ≈ 1
  • If x is far from l → value ≈ 0
        l2

   ○        ○
      ● ●
   ○        ○

        l1

Kernel (Similarity Function)

Kernels help the SVM draw curved decision boundaries instead of straight lines.

We measure similarity using a Gaussian kernel:

f(x, l) = e^{-\|x - l\|^2 / (2\sigma^2)}

Important

  • When using the Gaussian kernel, features must be scaled first.

Defines:

“How close am I to this landmark?”

So each landmark produces a feature:

  • f1 = similarity(x, l1)
  • f2 = similarity(x, l2)
  • f3 = similarity(x, l3)
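Putting the kernel and the landmarks together, a minimal sketch in plain Python (the landmark coordinates are illustrative assumptions):

```python
import math

def gaussian_similarity(x, l, sigma=1.0):
    # f(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Illustrative landmarks l1, l2, l3 (assumed coordinates)
landmarks = [(0.0, 0.0), (3.0, 3.0), (0.0, 5.0)]
x = (0.1, 0.1)
features = [gaussian_similarity(x, l) for l in landmarks]
# f1 is near 1 (x is close to l1); f2 and f3 are near 0
print(features)
```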

⛰︎ Sigma (σ)

This controls the width of the Gaussian kernel.

Sigma controls how wide the influence of a landmark is.

🏔️ If σ² is small

very local influence = narrow peak

  • similarity drops quickly
  • very flexible boundary
  • high variance, lower Bias

🗻 If σ² is large

broader influence = wide peak

  • similarity falls slowly
  • smoother decision boundary
  • high bias, lower variance
          peak (value = 1)
              ▲
             / \
            /   \
           /     \
          /       \

  • Highest point = the landmark
  • Values decrease as we move away
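The effect of the peak width is easy to verify numerically; evaluating the kernel at the same distance with two different σ values:

```python
import math

def gaussian(dist, sigma):
    # Kernel value at a given distance from the landmark
    return math.exp(-dist ** 2 / (2.0 * sigma ** 2))

# Same distance from the landmark, different widths:
print(gaussian(1.0, 0.5))  # narrow peak: similarity dies off fast
print(gaussian(1.0, 3.0))  # wide peak: still close to 1
```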

θ0 + θ1·f1 + θ2·f2 + θ3·f3 ≥ 0 → class 1

θ0 + θ1·f1 + θ2·f2 + θ3·f3 < 0 → class 0

So:

  • Close to important landmarks → positive
  • Far from them → negative
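The full landmark-based prediction can be sketched end to end; the θ values and the single landmark below are hand-picked illustrations, not trained parameters:

```python
import math

def similarity(x, l, sigma=1.0):
    # Gaussian kernel feature for one landmark
    sq = sum((a - b) ** 2 for a, b in zip(x, l))
    return math.exp(-sq / (2.0 * sigma ** 2))

def predict_with_landmarks(theta, x, landmarks):
    # theta = [theta_0, theta_1, ..., theta_k] weights the kernel features
    fs = [similarity(x, l) for l in landmarks]
    z = theta[0] + sum(t * f for t, f in zip(theta[1:], fs))
    return 1 if z >= 0 else 0

# Toy setup: one "positive" landmark near the origin, weights chosen by hand
landmarks = [(0.0, 0.0)]
theta = [-0.5, 1.0]  # assumption: illustrative values, not trained
print(predict_with_landmarks(theta, (0.1, 0.0), landmarks))  # near landmark -> 1
print(predict_with_landmarks(theta, (5.0, 5.0), landmarks))  # far away -> 0
```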

Don’t implement SVM yourself

SVM training requires solving a complex optimization problem.

In practice you never write this yourself.

Just use a library like:

  • LIBSVM
  • LIBLINEAR
  • Scikit-learn (Python)
  • TensorFlow / PyTorch wrappers

Example in Python:

from sklearn import svm

# RBF (Gaussian) kernel with penalty parameter C = 1
model = svm.SVC(kernel="rbf", C=1)
model.fit(X_train, y_train)

So the hard math is already implemented.
