Hitesh Sahu
1.1 Model Evaluation

Written by Hitesh Sahu, a passionate developer and blogger.


Model Selection

Model selection is about balancing:

Accuracy + latency + cost + real-world performance

rather than optimizing a single metric.

Model Size

SLM vs LLM

| Type | Description |
| --- | --- |
| SLM (Small Language Model) | Smaller models optimized for specific tasks, lower latency, and reduced compute requirements |
| LLM (Large Language Model) | Large general-purpose models capable of handling multiple tasks and broad reasoning |

Typical Tradeoff

| Feature | SLM | LLM |
| --- | --- | --- |
| Compute Cost | Lower | Higher |
| Latency | Faster | Slower |
| Generalization | Limited | Strong |
| Domain Specialization | Strong | Moderate |
| Memory Usage | Lower | Higher |

1. Model Accuracy

Accuracy is the fraction of predictions a model gets right, measured on unseen data.

Example:

95 correct predictions out of 100
→ Accuracy = 95%
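The 95-out-of-100 example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `accuracy` and the label lists are made up for illustration and only mirror the numbers above.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 100 examples, 95 of them predicted correctly.
y_true = [1] * 100
y_pred = [1] * 95 + [0] * 5

print(accuracy(y_true, y_pred))  # 0.95
```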

2. BLEU: Bilingual Evaluation Understudy Score

Measures the n-gram precision overlap between generated text and one or more reference texts.

Simplified BLEU Formula

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where:

  • $BP$ = brevity penalty
  • $p_n$ = n-gram precision
  • $w_n$ = weights

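So:

Higher n-gram overlap → higher BLEU score.

  • The brevity penalty only penalizes candidates that are shorter than the reference. Longer candidates are not penalized by $BP$, because their extra words already lower the n-gram precision.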

Used mainly for:

  • machine translation
  • text generation evaluation

Example:

Reference: "The cat sits on the mat"
Generated: "The cat is on the mat"
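To make the formula concrete, here is a small sketch that scores the example pair by hand. It assumes whitespace tokenization, a single reference, uniform weights $w_n = 1/N$, and no smoothing; the function names are illustrative and not taken from any library. The candidate shares no 4-gram with the reference, so unsmoothed BLEU with N = 4 is 0, and the sketch also reports N = 2.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, reference):
    """1.0 unless the candidate is shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU collapses to 0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_avg)


reference = "the cat sits on the mat".split()
candidate = "the cat is on the mat".split()

print(round(bleu(candidate, reference, max_n=2), 4))  # 0.7071
print(bleu(candidate, reference, max_n=4))            # 0.0
```

Here the unigram precision is 5/6 and the bigram precision is 3/5. Both sentences have six tokens, so the brevity penalty is 1 and the N = 2 score is the geometric mean of the two precisions.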

3. ROUGE Score

Measures how much of the important reference content the generated text captures.

ROUGE stands for:

Recall-Oriented Understudy for Gisting Evaluation

Simplified ROUGE Formula

$$\text{ROUGE} = \frac{\text{Overlapping Words}}{\text{Total Reference Words}}$$

Higher scores indicate higher similarity between the automatically produced summary and the reference.
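The simplified formula corresponds to unigram recall (often called ROUGE-1 recall). A minimal sketch, assuming lowercase whitespace tokenization and a single reference; the function name and the sample sentences are illustrative only:

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Overlapping words divided by total reference words.
    Each reference word is matched at most as often as it
    appears in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total_ref = sum(ref_counts.values())
    if total_ref == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / total_ref


reference = "The cat sits on the mat"

# A short summary covers only half of the reference words.
print(rouge1_recall("The cat sits", reference))  # 0.5

# A longer candidate covers five of the six reference words.
print(round(rouge1_recall("The cat is on the mat", reference), 4))  # 0.8333
```

Because the denominator is the reference length, a candidate that leaves out reference content scores lower, however precise its own words are.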

Focus:

  • recall
  • content coverage

Used mainly for:

  • text summarization

BLEU vs ROUGE

| Metric | Focus | Common Use |
| --- | --- | --- |
| BLEU | Precision | Translation |
| ROUGE | Recall | Summarization |

4. Cosine Similarity

Measures semantic similarity between vector embeddings.

It compares the angle between vectors.

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$$

Range:

| Value | Meaning |
| --- | --- |
| 1 | Very similar |
| 0 | Unrelated |
| -1 | Opposite direction |
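The three reference values can be checked with a small pure-Python sketch. The 2-dimensional vectors are toy inputs chosen for illustration, not real embeddings, and the function name is an assumption:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

The first pair differs in length but not in direction, which shows that cosine similarity ignores vector magnitude.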

Embedding Similarity Example

flowchart TD

    A["'Fast GPU computing'"]
        --> C["Embedding Space"]

    B["'Parallel GPU processing'"]
        --> C

    C --> D["High Cosine Similarity"]

5. Cross-Validation

Cross-validation evaluates models using multiple data splits.

Purpose:

  • estimate generalization performance
  • reduce the risk of overfitting to a single train/validation split

6. K-Fold Cross Validation

The dataset is split into k folds, and each fold becomes the validation set exactly once.

$$\text{Dataset} = \{\text{Fold}_1, \text{Fold}_2, \text{Fold}_3, \ldots, \text{Fold}_k\}$$

Training strategy:

$$\text{Train} = k - 1\ \text{folds}$$

$$\text{Validation} = 1\ \text{fold}$$

flowchart LR

    A["Fold 1"]
    B["Fold 2"]
    C["Fold 3"]
    D["Fold 4"]
    E["Fold 5"]

    F["Train on 4 folds<br/>Validate on 1 fold"]

    A --> F
    B --> F
    C --> F
    D --> F
    E --> F

Benefits:

  • better performance estimation
  • improved robustness
  • less bias from any single train/validation split

Useful when:

  • datasets are small
  • evaluation data is limited
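The fold rotation described above can be sketched as an index generator. This is an illustrative helper, not a library API: the name `k_fold_indices` and the optional `seed` parameter are assumptions. In practice the data is usually shuffled first, and the per-fold validation scores are averaged.

```python
import random


def k_fold_indices(n_samples, k, seed=None):
    """Yield (train_indices, validation_indices) for each of the k folds.
    Every sample lands in the validation set exactly once."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    if seed is not None:
        random.Random(seed).shuffle(indices)  # optional reproducible shuffle
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(fold, val_idx)
# 1 [0, 1]
# 2 [2, 3]
# 3 [4, 5]
# 4 [6, 7]
# 5 [8, 9]
```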

7. A/B Testing

A/B testing compares two model versions using real users.

Purpose:

  • measure production performance
  • validate improvements safely

A/B Testing Workflow

flowchart TD

    A["Users"]
        --> B["Traffic Split"]

    B --> C["Model A"]

    B --> D["Model B"]

    C --> E["Metrics Collection"]
    D --> E

Common A/B Testing Metrics

| Metric | Example Application |
| --- | --- |
| Click-through rate | Recommendation systems |
| Latency | AI inference |
| User satisfaction | Chatbots |
| Conversion rate | AI assistants |
| Engagement | Content generation |
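The traffic split and metrics collection steps can be sketched as follows. This is a simplified illustration under stated assumptions: users are identified by a string ID, assignment is a deterministic hash bucket so each user always sees the same model, and the only metric is conversion rate. The function names and the sample events are hypothetical. A real experiment would also need a statistical significance test before declaring a winner.

```python
import hashlib


def assign_variant(user_id, treatment_share=0.5):
    """Deterministically map a user to model "A" or "B".
    The same user_id always gets the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "B" if bucket < treatment_share else "A"


def conversion_rates(events):
    """events: iterable of (variant, converted) pairs.
    Returns the conversion rate per variant."""
    totals = {"A": 0, "B": 0}
    conversions = {"A": 0, "B": 0}
    for variant, converted in events:
        totals[variant] += 1
        conversions[variant] += int(converted)
    return {v: (conversions[v] / totals[v] if totals[v] else 0.0) for v in totals}


# Hypothetical logged outcomes, for illustration only.
events = [("A", True), ("A", False), ("B", True), ("B", True)]
print(conversion_rates(events))  # {'A': 0.5, 'B': 1.0}
```

Hashing the user ID, rather than drawing a random number per request, keeps each user's experience consistent across sessions.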

Offline vs Online Evaluation

| Type | Description |
| --- | --- |
| Offline Evaluation | Uses datasets and metrics |
| Online Evaluation | Uses real user traffic |

Simplified Mental Model

| Concept | Purpose |
| --- | --- |
| Accuracy | Correct predictions |
| BLEU | Translation quality |
| ROUGE | Summarization quality |
| Cosine Similarity | Semantic similarity |
| Cross-validation | Reliable evaluation |
| A/B Testing | Real-world comparison |
