How to Choose the Right AI Model for Your Use Case

A practical guide to selecting the right AI and LLM models based on use case, latency, cost, accuracy, infrastructure, and deployment requirements.

Model Selection

Model selection is about balancing:

Accuracy + latency + cost + real-world performance

A poor model choice can lead to:

High infrastructure costs
Slow inference performance
Increased GPU usage
Poor response quality
Difficult deployment and scaling
Security and compliance concerns

How to select the right model?

1. Define the need

Identify the use case: classification, generation, summarization, and so on.

Different AI tasks require different architectures.

Use Case	Recommended Model Type
Chatbots	Large Language Models (LLMs)
Image Generation	Diffusion Models
Speech Recognition	ASR Models
Recommendations	Ranking Models
Fraud Detection	Classification Models
Code Generation	Code LLMs
Search & Q/A	Retrieval-Augmented Generation (RAG)

2. Shortlist candidates

Research existing models that fit the requirements.

Consider open-source vs. closed-source models.
Compare model sizes and capabilities.
Evaluate the model's performance on relevant benchmarks and tasks.
arena

3. Evaluate the model

Use metrics such as accuracy, precision, recall, and F1 score.

4. Test Selected Model

Test the model on a small sample of your data to see how it performs in practice.

Model Size

Model Scaling Law

According to AI scaling laws, increasing parameter count and training data size predictably improves performance. A larger parameter count also increases inference latency.

Different tasks require different model sizes.

Model Size	Capabilities	Example Tasks
`1B` parameters	Basic tasks	Sentiment classification, simple Q&A
`10B` parameters	Moderate reasoning	Chatbots, content generation
`100B+` parameters	Complex reasoning	Brainstorming assistants, code generation

Model size directly impacts:

GPU memory requirements
Latency
Training cost
Inference throughput
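To make the GPU memory point concrete, here is a rough sketch that estimates the memory needed just to hold the model weights. The function name `weight_memory_gb` and the precision table are illustrative assumptions. The estimate ignores the KV cache, activations, and framework overhead, so real usage will be higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype="fp16"):
    """Estimate memory for model weights only, in decimal gigabytes."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Compare a 7B-parameter model at different precisions.
for dtype in ("fp32", "fp16", "int8", "int4"):
    print(dtype, weight_memory_gb(7e9, dtype), "GB")
```

Halving the precision halves the weight footprint, which is one reason quantized small models are attractive for edge deployment.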

Model Size	Typical Usage
SLM (Small Language Model) (1B–7B)	Edge AI, fast inference
Medium (8B–30B)	Enterprise assistants
LLM (Large Language Model) (40B+)	Research and advanced reasoning

Type	Description
SLM (Small Language Model)	Smaller models optimized for specific tasks, lower latency, and reduced compute requirements
LLM (Large Language Model)	Large general-purpose models capable of handling multiple tasks and broad reasoning

Typical Tradeoff

Feature	SLM	LLM
Compute Cost	Lower	Higher
Latency	Faster	Slower
Generalization	Limited	Strong
Domain Specialization	Strong	Moderate
Memory Usage	Lower	Higher

Model Latency

Latency is the time a model takes to generate a response after receiving input.

Latency matters heavily in production systems.

Application	Preferred Latency
Chatbots	< 2 seconds
Real-time AI Agents	< 1 second
Batch Processing	Minutes acceptable
Document Analysis	Moderate latency
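One way to check a candidate model against targets like these is to time repeated calls and look at the median and tail latency. The sketch below is a minimal harness. `fake_generate` is a hypothetical stand-in for a real model call, and the nearest-rank p95 is one of several reasonable percentile definitions.

```python
import math
import statistics
import time


def measure_latency(generate, prompt, runs=20):
    """Call generate(prompt) repeatedly and return wall-clock latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return the median (p50) and nearest-rank 95th percentile (p95)."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"p50": statistics.median(ordered), "p95": ordered[p95_index]}


def fake_generate(prompt):
    """Hypothetical model call; replace with a real inference request."""
    time.sleep(0.001)
    return prompt.upper()


print(summarize(measure_latency(fake_generate, "hello", runs=5)))
```

Tail latency (p95) usually matters more than the average for interactive applications, because slow outliers are what users notice.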

Model Evaluation Metrics

Common metrics for evaluating AI models:

Concept	Purpose
`Accuracy`	Correct predictions
`BLEU`	Translation quality
`ROUGE`	Summarization quality
`Cosine Similarity`	Semantic similarity
`Cross-validation`	Reliable evaluation
`A/B Testing`	Real-world comparison

1. Model Accuracy

How often a model predicts correctly on unseen data.

Example:

95 correct predictions out of 100
→ Accuracy = 95%
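The same calculation as a small sketch in code. The `accuracy` helper and the toy labels are illustrative assumptions, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# 100 examples, 95 predicted correctly.
y_true = [1] * 100
y_pred = [1] * 95 + [0] * 5
print(accuracy(y_true, y_pred))
```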

2. 🔣 `BLEU`: Bilingual Evaluation Understudy Score

Measures the n-gram precision overlap between generated text and one or more reference texts.

Original paper (Papineni et al., 2002): https://www.aclweb.org/anthology/P02-1040.pdf

Simplified BLEU Formula

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where:

$BP$ = brevity penalty
$p_n$ = n-gram precision
$w_n$ = weights

Higher overlap → higher BLEU score.

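The brevity penalty only penalizes candidates that are shorter than the reference. Long candidates are not penalized by $BP$, since extra words already lower the n-gram precision.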

Used mainly for:

machine translation
text generation evaluation

Example:

Reference	"The cat sits on the mat"
Generated	"The cat is on the mat"
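To see the formula at work on this sentence pair, here is a minimal sentence-level sketch with uniform weights and a single reference. It is not a replacement for a standard BLEU implementation: it assumes whitespace tokenization and lowercasing and applies no smoothing, and the helper names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(candidate_text, reference_text, max_n=2):
    """BLEU with uniform weights w_n = 1/max_n and one reference."""
    candidate = candidate_text.lower().split()
    reference = reference_text.lower().split()
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_avg)


reference = "The cat sits on the mat"
generated = "The cat is on the mat"
print(simple_bleu(generated, reference, max_n=2))
print(simple_bleu(generated, reference, max_n=4))
```

With unigrams and bigrams, the clipped precisions are 5/6 and 3/5 and the lengths match, so the score is $\sqrt{0.5} \approx 0.707$. With `max_n=4` the score drops to 0 because no 4-gram matches, which is why practical implementations apply smoothing to short sentences.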

3. 📋 `ROUGE` Score

Measures how much of the important reference content the generated text captured.

ROUGE stands for:

Recall-Oriented Understudy for Gisting Evaluation

Simplified ROUGE Formula

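$$ROUGE = \frac{Overlapping\ Words}{Total\ Reference\ Words}$$

This word-level recall corresponds to ROUGE-1. Other variants use longer n-grams (ROUGE-N) or the longest common subsequence (ROUGE-L).

Higher scores indicate higher similarity between the automatically produced summary and the reference.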

Focus:

recall
content coverage

Used mainly for:

text summarization
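The simplified formula can be sketched as word-level recall (ROUGE-1 recall). This version assumes whitespace tokenization and lowercasing and clips repeated words. The function name is illustrative, and the sentences are toy inputs.

```python
from collections import Counter


def rouge1_recall(generated_text, reference_text):
    """Share of reference words that also appear in the generated text."""
    generated = Counter(generated_text.lower().split())
    reference = Counter(reference_text.lower().split())
    total = sum(reference.values())
    if total == 0:
        return 0.0
    # Each reference word is matched at most as often as it occurs in the output.
    overlap = sum(min(count, generated[word]) for word, count in reference.items())
    return overlap / total


reference = "The cat sits on the mat"
print(rouge1_recall("The cat is on the mat", reference))  # 5 of 6 reference words covered
print(rouge1_recall("The cat", reference))                # 2 of 6 reference words covered
```

A very short output scores low here even if every word in it is correct, which is the recall-oriented behavior that distinguishes ROUGE from BLEU.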

BLEU vs ROUGE

Metric	Focus	Common Use
BLEU	Precision	Translation
ROUGE	Recall	Summarization

4. ↗️ Cosine Similarity

Measures semantic similarity between vector embeddings.

It compares the angle between two vectors $A$ and $B$:

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|}$$

Range:

Value	Meaning
1	Very similar
0	Unrelated
-1	Opposite direction
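A minimal sketch of the formula in plain Python. The vectors are toy values chosen to hit each case in the table. Real embeddings would come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))           # opposite direction
```

Because the result depends only on direction, scaling a vector does not change its similarity to others.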

Embedding Similarity Example

  flowchart TD
    A["Fast GPU computing"]--> C["Embedding Space"]
    B["Parallel GPU processing"]--> C
    C --> D["High Cosine Similarity"]

5. Cross-Validation

Cross-validation evaluates models using multiple data splits.

Purpose:

estimate generalization performance
detect overfitting that a single train/validation split could hide

6. K-Fold Cross Validation

Each fold becomes the validation set once.

$$Dataset = \{Fold_1, Fold_2, Fold_3, \dots, Fold_k\}$$

Training strategy:

$$Train = k - 1\ folds$$

$$Validation = 1\ fold$$

flowchart LR

    A["Fold 1"]
    B["Fold 2"]
    C["Fold 3"]
    D["Fold 4"]
    E["Fold 5"]

    F["Train on 4 folds<br/>Validate on 1 fold"]

    A --> F
    B --> F
    C --> F
    D --> F
    E --> F

Benefits:

better performance estimation
improved robustness
reduced dataset bias

Useful when:

datasets are small
evaluation data is limited
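The fold rotation can be sketched with index lists alone. This is an illustrative helper, not a library API. It assumes the data is already shuffled and splits it into contiguous folds.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs.

    Each sample lands in the validation set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, validation))
        start += size
    return splits


for fold, (train, validation) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold}: train on {len(train)} samples, validate on {validation}")
```

In practice you would train one model per split and average the validation scores to get the final estimate.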

7. 🧪 A/B Testing

A/B testing compares two model versions using real users.

Purpose:

measure production performance
validate improvements safely

A/B Testing Workflow

flowchart TD

    A["Users"]
        --> B["Traffic Split"]

    B --> C["Model A"]

    B --> D["Model B"]

    C --> E["Metrics Collection"]
    D --> E

Common A/B Testing Metrics

Metric	Example
Click-through rate	Recommendation systems
Latency	AI inference
User satisfaction	Chatbots
Conversion rate	AI assistants
Engagement	Content generation
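The traffic split in the workflow is often done by hashing a stable user identifier, so that the same user always sees the same model. The sketch below shows one such approach. The salt string, the function name, and the 50/50 default are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id, treatment_share=0.5, salt="model-ab-test"):
    """Deterministically assign a user to model "A" or model "B"."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_b = assignments.count("B") / len(assignments)
print(f"share routed to model B: {share_b:.3f}")
```

Changing the salt reshuffles the assignment, which is useful when starting a new experiment on the same user base.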

F1 Score

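The F1 score is a machine learning evaluation metric that balances precision and recall by taking their harmonic mean:

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$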

Precision

How many predicted positives were actually correct?

$$Precision = \frac{TP}{TP + FP}$$

Where:

TP = True Positives
FP = False Positives

Recall

How many real positives did the model successfully find?

$$Recall = \frac{TP}{TP + FN}$$

Where:

FN = False Negatives

Example:

Precision = 0.80
Recall = 0.50

Then:

$$F1 = 2 \cdot \frac{0.8 \cdot 0.5}{0.8 + 0.5} = \frac{0.8}{1.3}$$

$$F1 \approx 0.615$$

So:

F1 ≈ 61.5%
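The same numbers can be reproduced from raw counts. The counts below (40 true positives, 10 false positives, 40 false negatives) are made up so that precision and recall come out to 0.80 and 0.50. The function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Note that the F1 score (about 0.615) sits below the arithmetic mean of 0.65, because the harmonic mean is pulled toward the weaker of the two values.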

Interpretation of F1 Score

F1 Score	Meaning
1.0	Perfect model
0.9+	Excellent
0.8	Strong
0.7	Decent
<0.5	Weak

`GLUE`: General Language Understanding Evaluation Benchmark

GLUE is a collection of NLP tasks used to measure how well a language model understands language across different problems.

GLUE combines multiple NLP tasks such as:

Task	Purpose
Sentiment Analysis	Detect positive/negative meaning
Text Similarity	Compare sentence meanings
Natural Language Inference	Determine logical relationships
Question Answering	Understand context
Linguistic Acceptability	Judge grammar correctness

Each task has its own metric:

Accuracy
F1 Score
Pearson and Spearman correlation
Matthews correlation

Final Score is an aggregate of all task scores.

Score Interpretation

The human baseline is around 87.1. GLUE was saturated within about a year of release: models such as RoBERTa and T5 exceeded the human baseline by 2019 without being "superhuman" at general language understanding, which is why the harder SuperGLUE benchmark was introduced. Beating GLUE's human baseline indicates strong performance on that specific task suite, not general superhuman language ability.

Score	Meaning
60–70	Basic NLP capability
70–80	Strong traditional NLP
80–90	State-of-the-art transformer range
90+	Extremely advanced performance

The final GLUE score is usually the average performance across all tasks.

Perplexity

Perplexity measures how confused a language model is while reading text.

Lower confusion → better prediction quality.

Intuitively:

Lower perplexity means the model is less surprised by the next word.

A language model predicts the probability of the next token.

A good model assigns high probability to the token that actually comes next, while a poor model spreads probability across many unlikely tokens.
Perplexity measures this uncertainty.

Perplexity helps evaluate:

language fluency
prediction quality
training progress
model comparison

Used heavily in:

NLP research
LM training
transformer evaluation

Low Perplexity is Good

High probability → low perplexity.

Model predicts confidently: I drink coffee every morning.

High Perplexity is Bad

Low probabilities → high perplexity.

Unexpected or random text: Banana quantum bicycle democracy lava.

Model becomes uncertain.

Mathematical Definition

$$PP(W)=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1,\dots,w_{i-1})}}$$

Equivalent log form, writing $w_{<i}$ for $w_1,\dots,w_{i-1}$:

$$PP(W)=\exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i}) \right)$$

Where:

$N$ = number of tokens
$P(w_i|w_{<i})$ = predicted probability of next token

Interpretation

Perplexity	Meaning
1	Perfect prediction
Low (e.g. 10–20)	Strong predictive ability
High (e.g. 100+)	Poor predictions / uncertainty

Example

Suppose a model predicts:

Word	Probability
cat	0.5
dog	0.3
banana	0.01

Higher probability assigned to correct words lowers perplexity.
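As a rough sketch, the log form of the definition can be applied directly to a list of probabilities. Here the three values from the table are treated as the probabilities the model assigned to each actual next token in a three-token sequence, which is an assumption made for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each actual next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0 or p > 1 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


print(perplexity([0.5, 0.3, 0.01]))  # one very unlikely token raises perplexity
print(perplexity([0.5, 0.3, 0.4]))   # consistently likely tokens keep it low
print(perplexity([0.25] * 8))        # uniform guess over 4 options gives 4
```

The last line shows a useful reading of the number: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.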

Relationship to Entropy

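Perplexity is the exponential of the cross-entropy between the model's predictions and the text.

$PP = 2^H$

Where:

$H$ = cross-entropy in bits per token (log base 2). With the natural log, the same relationship is $PP = e^H$.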

So perplexity is essentially:

exponentiated uncertainty
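A quick numerical check of this relationship, assuming cross-entropy is measured in bits per token. The probabilities are toy values picked so the logs come out as whole numbers, and the function names are illustrative.

```python
import math


def cross_entropy_bits(token_probs):
    """Average negative log2 probability per token (bits per token)."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def perplexity_from_natural_log(token_probs):
    """Perplexity computed with the natural-log form of the definition."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


probs = [0.5, 0.25, 0.125, 0.125]
h = cross_entropy_bits(probs)  # (1 + 2 + 3 + 3) / 4 = 2.25 bits
print(h, 2 ** h, perplexity_from_natural_log(probs))
```

Both routes give the same perplexity, so the choice of log base only changes the unit of the entropy, not the perplexity itself.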

Limitations

Low perplexity does NOT always mean:

factual correctness
reasoning ability
truthfulness
safety
usefulness

A model can:

memorize text
predict fluent nonsense
hallucinate confidently

This is why modern LLM evaluation also uses:

MMLU
HELM
TruthfulQA
reasoning benchmarks

Modern Context

Perplexity was extremely important for:

RNNs
LSTMs
early transformers

Today, frontier LLM evaluation focuses more on:

reasoning
instruction following
factuality
coding ability
agent behavior

because perplexity alone is insufficient for measuring intelligence.

Offline vs Online Evaluation

Type	Description
Offline Evaluation	Uses datasets and metrics
Online Evaluation	Uses real user traffic

Closed vs Open Source Models

There are two major deployment strategies.

Closed Source Models

Examples:

OpenAI
Anthropic
Google

Advantages:

Strong performance: often better than open source
Easy API integration
No infrastructure to manage

Disadvantages:

Vendor lock-in
Data privacy concerns
Often more expensive per token than self-hosted or open-weight models at scale

Open Source Models

Examples:

LLaMA
Mistral
Falcon

Advantages:

Full control
On-prem deployment
Better privacy

Disadvantages:

Infrastructure complexity
Weaker models (sometimes)

Written by Hitesh Sahu, a passionate developer and blogger.

Tue Feb 24 2026
