Written by Hitesh Sahu, a passionate developer and blogger.

Model Selection

Model selection is about balancing:

Accuracy + latency + cost + real-world performance

rather than optimizing a single metric.

Model Size

SLM vs LLM

Type	Description
SLM (Small Language Model)	Smaller models optimized for specific tasks, lower latency, and reduced compute requirements
LLM (Large Language Model)	Large general-purpose models capable of handling multiple tasks and broad reasoning

Typical Tradeoff

Feature	SLM	LLM
Compute Cost	Lower	Higher
Latency	Faster	Slower
Generalization	Limited	Strong
Domain Specialization	Strong	Moderate
Memory Usage	Lower	Higher

1. Model Accuracy

Accuracy is the fraction of predictions a model gets right, measured on data it has not seen during training.

Example:

95 correct predictions out of 100
→ Accuracy = 95%
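The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `accuracy` and the toy label lists are assumptions made for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy data: 100 samples, 95 predicted correctly (hypothetical labels).
y_true = [1] * 100
y_pred = [1] * 95 + [0] * 5
print(accuracy(y_true, y_pred))  # 0.95
```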

2. `BLEU`: Bilingual Evaluation Understudy Score

Measures the n-gram precision overlap between generated text and one or more reference texts.

Simplified BLEU Formula

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where:

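$BP$ = brevity penalty: $1$ if the candidate is at least as long as the reference, otherwise $e^{1 - r/c}$ for reference length $r$ and candidate length $c$
$p_n$ = modified (clipped) n-gram precision
$w_n$ = weights, usually uniform ($w_n = 1/N$)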

Higher overlap → higher BLEU score.

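The brevity penalty only penalizes candidates that are shorter than the reference. Long candidates get no extra penalty from $BP$, because their extra words already lower the n-gram precision.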

Used mainly for:

machine translation
text generation evaluation

Example:

Reference	"The cat sits on the mat"
Generated	"The cat is on the mat"
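To make the formula concrete, here is a minimal single-reference BLEU sketch applied to the sentence pair above. It assumes lowercased whitespace tokenization, uniform weights, and no smoothing. The helper names (`ngrams`, `modified_precision`, `bleu`) are illustrative, not taken from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """BLEU with uniform weights w_n = 1 / max_n and no smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)  # brevity penalty
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return bp * math.exp(log_avg)


reference = "The cat sits on the mat".lower().split()
candidate = "The cat is on the mat".lower().split()

print(round(bleu(candidate, reference, max_n=2), 4))  # 0.7071
print(bleu(candidate, reference, max_n=4))            # 0.0
```

With unigrams and bigrams only, the precisions are 5/6 and 3/5, the lengths match so $BP = 1$, and the score is $\sqrt{0.5} \approx 0.71$. With the usual $N = 4$ the unsmoothed score drops to 0 because no 4-gram matches, which is why sentence-level BLEU is normally computed with smoothing or at corpus level.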

3. `ROUGE` Score

Measures how much of the reference content is captured by the generated text.

ROUGE stands for:

Recall-Oriented Understudy for Gisting Evaluation

Simplified ROUGE Formula

$$ROUGE = \frac{\text{Overlapping Words}}{\text{Total Reference Words}}$$

Higher scores indicate higher similarity between the automatically produced summary and the reference.

Focus:

recall
content coverage

Used mainly for:

text summarization
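The simplified formula corresponds to unigram recall (often called ROUGE-1 recall). A minimal sketch, assuming lowercased whitespace tokenization and a single reference. The function name `rouge1_recall` is made up for this example.

```python
from collections import Counter


def rouge1_recall(generated, reference):
    """Overlapping words divided by total reference words, with clipped counts."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # A reference word is covered at most as often as the generated text contains it.
    overlap = sum(min(count, gen_counts[word]) for word, count in ref_counts.items())
    return overlap / total


score = rouge1_recall("The cat is on the mat", "The cat sits on the mat")
print(round(score, 4))  # 0.8333
```

Five of the six reference words ("the", "cat", "on", "the", "mat") appear in the generated sentence, so the recall is 5/6.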

BLEU vs ROUGE

Metric	Focus	Common Use
BLEU	Precision	Translation
ROUGE	Recall	Summarization

4. Cosine Similarity

Measures semantic similarity between vector embeddings.

It compares the angle between vectors.

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|}$$

Range:

Value	Meaning
1	Very similar
0	Unrelated
-1	Opposite direction
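The formula and the three boundary values can be checked with a short pure-Python sketch. The 2-dimensional vectors are toy stand-ins chosen for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||)"""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [2, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

The first case shows that magnitude is ignored: `[1, 0]` and `[2, 0]` differ in length but point the same way, so the similarity is 1.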

Embedding Similarity Example

flowchart TD

    A["'Fast GPU computing'"]
        --> C["Embedding Space"]

    B["'Parallel GPU processing'"]
        --> C

    C --> D["High Cosine Similarity"]

5. Cross-Validation

Cross-validation evaluates models using multiple data splits.

Purpose:

estimate generalization performance
detect overfitting that a single train/validation split can hide

6. K-Fold Cross Validation

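The dataset is split into $k$ folds, and each fold becomes the validation set exactly once.

$$Dataset = \{Fold_1, Fold_2, Fold_3, \dots, Fold_k\}$$

Training strategy (repeated $k$ times, with the $k$ validation scores averaged):

$$Train = k - 1\ folds$$

$$Validation = 1\ fold$$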

flowchart LR

    A["Fold 1"]
    B["Fold 2"]
    C["Fold 3"]
    D["Fold 4"]
    E["Fold 5"]

    F["Train on 4 folds<br/>Validate on 1 fold"]

    A --> F
    B --> F
    C --> F
    D --> F
    E --> F
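The fold rotation can be sketched as an index generator. This is a minimal illustration under stated assumptions: samples are not shuffled (in practice they usually are shuffled first), and the name `k_fold_indices` is hypothetical rather than a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 10 samples, 5 folds: train on 4 folds, validate on the remaining one.
for fold, (train, validation) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold}: train={train} validation={validation}")
```

In a real evaluation loop, a model would be fitted on `train` and scored on `validation` in each iteration, and the k scores averaged.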

Benefits:

better performance estimation
improved robustness
less dependence on how one particular split was drawn

Useful when:

datasets are small
evaluation data is limited

7. A/B Testing

A/B testing compares two model versions by splitting live user traffic between them and comparing the resulting metrics.

Purpose:

measure production performance
validate improvements safely

A/B Testing Workflow

flowchart TD

    A["Users"]
        --> B["Traffic Split"]

    B --> C["Model A"]

    B --> D["Model B"]

    C --> E["Metrics Collection"]
    D --> E
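One common way to implement the traffic split is deterministic hashing of a user ID, so that the same user always sees the same model. The sketch below is an assumption-laden illustration: the function name `assign_variant`, the experiment label `"model-ab-test"`, and the user ID format are all made up for the example.

```python
import hashlib


def assign_variant(user_id, experiment="model-ab-test", share_b=0.5):
    """Map a user to "A" or "B" deterministically.

    share_b is the fraction of traffic routed to Model B.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < share_b else "A"


# Hypothetical user IDs; each user lands in the same variant on every call.
users = [f"user-{i}" for i in range(10_000)]
assignments = [assign_variant(u) for u in users]
print("share routed to B:", assignments.count("B") / len(users))
```

Including the experiment name in the hash key keeps assignments independent across experiments, so a user in group B of one test is not automatically in group B of the next.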

Common A/B Testing Metrics

Metric	Example
Click-through rate	Recommendation systems
Latency	AI inference
User satisfaction	Chatbots
Conversion rate	AI assistants
Engagement	Content generation

Offline vs Online Evaluation

Type	Description
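Offline Evaluation	Scores the model on fixed datasets with metrics such as accuracy, BLEU, or ROUGE
Online Evaluation	Measures the model on real user traffic, for example through A/B testing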

Simplified Mental Model

Concept	Purpose
`Accuracy`	Correct predictions
`BLEU`	Translation quality
`ROUGE`	Summarization quality
`Cosine Similarity`	Semantic similarity
`Cross-validation`	Reliable evaluation
`A/B Testing`	Real-world comparison
