

Support Vector Machines (SVM): Maximizing Margins for Robust Machine Learning Models

Learn how Support Vector Machines (SVM) build powerful classification models by finding the optimal separating hyperplane that maximizes the margin between classes. Discover how the margin, regularization parameter C, and kernel functions help SVM handle both linear and non-linear data while improving generalization performance.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Support Vector Machine (SVM)

SVM = Find the safest separating line. Advantages:

  • Works very well on small and medium datasets
  • Global optimum (convex optimization)
  • Strong theoretical guarantees
  • Very powerful with kernels

It is a very powerful algorithm, widely used in both industry and research.

Compared to Logistic Regression and Neural Networks, SVMs sometimes provide a cleaner and more powerful way to learn complex non-linear decision boundaries.

Logistic Regression:

Draws a line to separate classes.

Teddy   Teddy   |line|   Car   Car

SVM:

Draws the widest possible gap between classes.

Teddy   Teddy      | line |      Car   Car

Teddy   (closest teddy) |line| (closest car)   Car
  • Margin: the extra space between the boundary and the nearest data points; SVM tries to maximize it.
  • Support vectors: the data points closest to the boundary; they alone determine where it lies.
| Logistic Regression | SVM |
| --- | --- |
| Predicts probability | Predicts class |
| Log loss | Hinge loss |
| Smooth curve penalty | Linear margin penalty |
| Uses λ regularization | Uses C parameter |
| Focus on likelihood | Focus on margin |

When to use which algorithm

  • n: number of features
  • m: number of training examples

| Situation | Recommended |
| --- | --- |
| n very large, m small | Logistic regression or linear SVM |
| n small, m medium | SVM with Gaussian kernel |
| n small, m huge | Logistic regression |

Example

Spam detection

  • features = 10000
  • examples = 500

→ Linear SVM

Image dataset

  • features = 50
  • examples = 10000

→ Gaussian SVM
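The table and the two examples above can be turned into a tiny rule-of-thumb helper; the function name and thresholds below are illustrative assumptions, not hard rules:

```python
def choose_model(n_features, m_examples):
    # Hypothetical rule-of-thumb helper following the table above;
    # the thresholds are illustrative assumptions, not hard rules.
    if n_features >= 10 * m_examples:
        return "logistic regression or linear SVM"
    if m_examples > 50_000:
        return "logistic regression (or add features, then linear SVM)"
    return "SVM with Gaussian kernel"

print(choose_model(10_000, 500))  # spam detection example -> linear side
print(choose_model(50, 10_000))   # image dataset example -> Gaussian kernel
```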


SVM Cost Function

Starting From Logistic Regression

Logistic regression uses the hypothesis:

h_\theta(x) = \frac{1}{1 + e^{-z}}

where

z = \theta^T x

Interpretation:

  • If z ≫ 0 → h_θ(x) ≈ 1
  • If z ≪ 0 → h_θ(x) ≈ 0

So logistic regression tries to make:

  • θᵀx ≫ 0 when y = 1
  • θᵀx ≪ 0 when y = 0
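The interpretation above can be checked numerically; a minimal sketch of the sigmoid in plain Python:

```python
import math

def sigmoid(z):
    # h_theta(x) = 1 / (1 + e^{-z}), with z = theta^T x
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(10))   # ~1 when z >> 0
print(sigmoid(-10))  # ~0 when z << 0
print(sigmoid(0))    # 0.5 exactly on the boundary
```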

Logistic Regression Cost Function

For a single training example (x, y):

Cost = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))

Two cases:

When y = 1

Cost = -\log(h_\theta(x))

Substitute the sigmoid:

Cost = -\log\left(\frac{1}{1 + e^{-z}}\right)

When z becomes large:

  • h_θ(x) → 1
  • Cost becomes very small

This encourages the algorithm to push z large for positive examples.

When y = 0

Cost = -\log(1 - h_\theta(x))

Which becomes:

Cost = -\log\left(\frac{e^{-z}}{1 + e^{-z}}\right)

When z ≪ 0:

  • h_θ(x) → 0
  • Cost becomes very small
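Both cases can be verified with a few lines of Python; `logistic_cost` below is a hypothetical helper written in terms of z:

```python
import math

def logistic_cost(z, y):
    # Per-example logistic cost, expressed via z = theta^T x
    h = 1.0 / (1.0 + math.exp(-z))
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Confident, correct predictions make the cost shrink toward 0:
print(logistic_cost(5, 1))   # small: z >> 0 with y = 1
print(logistic_cost(-5, 0))  # small: z << 0 with y = 0
print(logistic_cost(-5, 1))  # large: confidently wrong
```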

Hinge Loss

Instead of the smooth logistic loss, the SVM uses a piecewise-linear function called the hinge loss.

Cost when y = 1

cost_1(z) = \max(0, 1 - z)

Cost when y = 0

cost_0(z) = \max(0, 1 + z)

Properties:

  • If classification is correct and confident, cost = 0
  • If classification is wrong or too close to boundary, cost increases linearly
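These two properties are easy to see in code; a minimal sketch of the two hinge costs:

```python
def cost1(z):
    # Hinge cost for y = 1: zero once z >= 1 (beyond the margin)
    return max(0.0, 1.0 - z)

def cost0(z):
    # Hinge cost for y = 0: zero once z <= -1
    return max(0.0, 1.0 + z)

print(cost1(2.0))   # 0.0 -> correct and confident
print(cost1(0.5))   # 0.5 -> correct but inside the margin
print(cost0(-2.0))  # 0.0
print(cost0(1.0))   # 2.0 -> wrong side, cost grows linearly
```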

SVM Optimization Objective

The optimization objective becomes:

\min_\theta \sum_{i=1}^{m} \left[ y^{(i)}\,cost_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,cost_0(\theta^T x^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2

This can be rescaled and simplified to:

\min_\theta \left[ C \sum_{i=1}^{m} Loss(\theta^T x^{(i)}, y^{(i)}) + \sum_{j=1}^{n} \theta_j^2 \right]
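A direct (unoptimized) reading of this objective can be sketched in plain Python, assuming each input vector already includes the bias feature x₀ = 1:

```python
def svm_objective(theta, X, y, C):
    # C * (sum of hinge losses) + sum of theta_j^2, with theta_0 unregularized
    def cost1(z): return max(0.0, 1.0 - z)
    def cost0(z): return max(0.0, 1.0 + z)
    loss = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * f for t, f in zip(theta, x_i))  # theta^T x
        loss += cost1(z) if y_i == 1 else cost0(z)
    reg = sum(t * t for t in theta[1:])  # skip the bias term theta_0
    return C * loss + reg

# Both toy examples sit beyond the margin, so only the regularizer remains:
print(svm_objective([0.0, 1.0], [[1.0, 2.0], [1.0, -2.0]], [1, 0], 1.0))  # 1.0
```

This is only a sketch of what the objective measures; real solvers never evaluate it this naively.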

Parameterization Using C

Instead of the logistic regression form:

A + \lambda B

SVM uses:

C \cdot A + B

Where:

  • A = training error
  • B = regularization term
  • C = weight on the training error (plays the role of 1/λ)

🎛️ Penalty Parameter C

Intuitively:

C \approx \frac{1}{\lambda}

Large CCC

Strict about fitting training data

  • high variance, lower bias
  • Effect: can overfit

Small C

Smoother boundary

  • low variance, higher bias
  • Effect: may underfit

Interpretation:

| Parameter | Effect |
| --- | --- |
| Large C | Focus on minimizing training error |
| Small C | Strong regularization |

🎛️ Regularization Term (B)

The second term keeps the parameters small:

\sum_{j=1}^{n} \theta_j^2

You normally never implement this yourself. Libraries solve it using optimized algorithms.

SVM Hypothesis Function h_θ(x)

The hypothesis simply reports which side of the boundary a data point falls on.

Unlike logistic regression, SVM does not output probabilities.

              Class 1 (Teddy Bears)
                    ●
               ●         ●

          ●                 ●
                    ↑
                    │  Margin
--------------------│--------------------   ← Decision Boundary (θᵀx = 0)
                    │
                    ↓

          ○                 ○
               ○       ○

                    ○
               Class 0 (Cars)

θᵀx = 0 → the decision boundary separating the classes

θᵀx = +1 ← upper margin

θᵀx = −1 ← lower margin

Prediction rule:

y = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}
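A minimal sketch of this prediction rule, assuming x already carries the bias feature x₀ = 1:

```python
def predict(theta, x):
    # SVM prediction: class 1 if theta^T x >= 0, else class 0
    z = sum(t * f for t, f in zip(theta, x))
    return 1 if z >= 0 else 0

print(predict([0.0, 1.0], [1.0, 2.0]))   # 1
print(predict([0.0, 1.0], [1.0, -2.0]))  # 0
```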

Only the closest points influence the model.

All other points do not affect the boundary once they are outside the margin.


●  ← closest positive point
○  ← closest negative point

         +1 Margin
---------------------------   ●
           ●

----------- Decision Boundary -----------

           ○
---------------------------   ○
      -1 Margin

Non-Linear SVM

For non-linear data, a straight line cannot separate the classes.


      ○ ○ ○ ○
    ○         ○
   ○    ● ●    ○
    ○         ○
      ○ ○ ○ ○

One option is to create many polynomial features until the classes become separable, but that would be computationally expensive.

So instead we create our own custom features.

⛳ Landmarks (lᵢ)

Instead of polynomial features, we choose special points in space called landmarks.

Feature = how close is a point x to each landmark?

Meaning:

  • If x is very close to landmark l → value ≈ 1
  • If x is far from l → value ≈ 0
        l2

   ○        ○
      ● ●
   ○        ○

        l1

Kernel (Similarity Function)

Kernels help the SVM draw curved decision boundaries instead of straight lines.

We measure similarity using a Gaussian kernel:

f(x, l) = e^{-\|x - l\|^2 / (2\sigma^2)}

Important

  • When using the Gaussian kernel, features must be scaled first.

Defines:

“How close am I to this landmark?”

So each landmark produces a feature:

  • f1 = similarity(x, l1)
  • f2 = similarity(x, l2)
  • f3 = similarity(x, l3)
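Putting the kernel and the landmarks together, a minimal sketch in plain Python (the landmark coordinates are illustrative assumptions):

```python
import math

def gaussian_similarity(x, l, sigma=1.0):
    # f(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Illustrative landmarks l1, l2, l3 (assumed coordinates)
landmarks = [(0.0, 0.0), (3.0, 3.0), (0.0, 5.0)]
x = (0.1, 0.1)
features = [gaussian_similarity(x, l) for l in landmarks]
# f1 is near 1 (x is close to l1); f2 and f3 are near 0
print(features)
```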

⛰︎ Sigma (σ)

This controls the width of the Gaussian kernel.

Sigma controls how wide the influence of a landmark is.

🏔️ If σ² is small

very local influence = narrow peak

  • similarity drops quickly
  • very flexible boundary
  • high variance, lower Bias

🗻 If σ² is large

broader influence = wide peak

  • similarity falls slowly
  • smoother decision boundary
  • high bias, lower variance
          peak (value = 1)
              ▲
             / \
            /   \
           /     \
          /       \

  • Highest point = the landmark
  • Values decrease as we move away
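The effect of the peak width is easy to verify numerically; evaluating the kernel at the same distance with two different σ values:

```python
import math

def gaussian(dist, sigma):
    # Kernel value at a given distance from the landmark
    return math.exp(-dist ** 2 / (2.0 * sigma ** 2))

# Same distance from the landmark, different widths:
print(gaussian(1.0, 0.5))  # narrow peak: similarity dies off fast
print(gaussian(1.0, 3.0))  # wide peak: still close to 1
```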

θ0 + θ1·f1 + θ2·f2 + θ3·f3 ≥ 0 → class 1

θ0 + θ1·f1 + θ2·f2 + θ3·f3 < 0 → class 0

So:

  • Close to important landmarks → positive
  • Far from them → negative
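The full landmark-based prediction can be sketched end to end; the θ values and the single landmark below are hand-picked illustrations, not trained parameters:

```python
import math

def similarity(x, l, sigma=1.0):
    # Gaussian kernel feature for one landmark
    sq = sum((a - b) ** 2 for a, b in zip(x, l))
    return math.exp(-sq / (2.0 * sigma ** 2))

def predict_with_landmarks(theta, x, landmarks):
    # theta = [theta_0, theta_1, ..., theta_k] weights the kernel features
    fs = [similarity(x, l) for l in landmarks]
    z = theta[0] + sum(t * f for t, f in zip(theta[1:], fs))
    return 1 if z >= 0 else 0

# Toy setup: one "positive" landmark near the origin, weights chosen by hand
landmarks = [(0.0, 0.0)]
theta = [-0.5, 1.0]  # assumption: illustrative values, not trained
print(predict_with_landmarks(theta, (0.1, 0.0), landmarks))  # near landmark -> 1
print(predict_with_landmarks(theta, (5.0, 5.0), landmarks))  # far away -> 0
```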

Don’t implement SVM yourself

SVM training requires solving a complex optimization problem.

In practice you never write this yourself.

Just use a library like:

  • LIBSVM
  • LIBLINEAR
  • Scikit-learn (Python)
  • TensorFlow / PyTorch wrappers

Example in Python:

from sklearn import svm

# RBF (Gaussian) kernel with penalty parameter C = 1
model = svm.SVC(kernel="rbf", C=1)
model.fit(X_train, y_train)

So the hard math is already implemented.
