NVIDIA AI-LLM Developers Certification Path
Model Selection
Model selection is about balancing:
Accuracy + latency + cost + real-world performance
A poor model choice can lead to:
- High infrastructure costs
- Slow inference performance
- Increased GPU usage
- Poor response quality
- Difficult deployment and scaling
- Security and compliance concerns
How to select the right model?
1. Define the need
- What is the use case: classification, generation, summarization, etc.
Different AI tasks require different architectures.
| Use Case | Recommended Model Type |
|---|---|
| Chatbots | Large Language Models (LLMs) |
| Image Generation | Diffusion Models |
| Speech Recognition | ASR Models |
| Recommendations | Ranking Models |
| Fraud Detection | Classification Models |
| Code Generation | Code LLMs |
| Search & Q/A | Retrieval-Augmented Generation (RAG) |
2. Shortlist candidates
Research existing models that fit the requirements.
- Consider open-source vs. closed-source models.
- Compare model sizes and capabilities.
- Evaluate the model's performance on relevant benchmarks and tasks.
- Check community leaderboards such as Chatbot Arena (LMArena).
3. Evaluate the model
- Use metrics such as accuracy, precision, recall, and F1 score.
4. Test Selected Model
- Test the model on a small sample of your data to see how it performs in practice.
Model Size
Model Scaling Law
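According to AI scaling laws, model loss falls predictably (roughly as a power law) as parameter count, training data, and compute increase. In practice, scaling up a model:
- improves performance, but also
- increases inference latency, because a larger model needs more compute per generated token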
Different tasks require different model sizes.
| Model Size | Capabilities | Example Tasks |
|---|---|---|
| 1B parameters | Basic tasks | Sentiment classification, simple Q&A |
| 10B parameters | Moderate reasoning | Chatbots, content generation |
| 100B+ parameters | Complex reasoning | Brainstorming assistants, code generation |
Model size directly impacts:
- GPU memory requirements
- Latency
- Training cost
- Inference throughput
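To make the GPU memory point concrete, the sketch below estimates the memory needed just to hold the weights as parameter count times bytes per parameter. It is a rough rule of thumb under stated assumptions: activations, KV cache, and runtime overhead are ignored, and the function name and precision table are illustrative rather than part of any library.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: weights only; activations, KV cache and runtime overhead are ignored.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for size in (1e9, 10e9, 100e9):
    print(f"{size / 1e9:.0f}B params @ fp16 ~ {weight_memory_gb(size):.0f} GB")
```

Under these assumptions a 7B-parameter model needs about 14 GB for FP16 weights, which is why quantizing to a lower precision is a common way to fit a model on a smaller GPU.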
| Model Size | Typical Usage |
|---|---|
| SLM (Small Language Model) (1B–7B) | Edge AI, fast inference |
| Medium (8B–30B) | Enterprise assistants |
| LLM (Large Language Model) (40B+) | Research and advanced reasoning |
| Type | Description |
|---|---|
| SLM (Small Language Model) | Smaller models optimized for specific tasks, lower latency, and reduced compute requirements |
| LLM (Large Language Model) | Large general-purpose models capable of handling multiple tasks and broad reasoning |
Typical Tradeoff
| Feature | SLM | LLM |
|---|---|---|
| Compute Cost | Lower | Higher |
| Latency | Faster | Slower |
| Generalization | Limited | Strong |
| Domain Specialization | Strong | Moderate |
| Memory Usage | Lower | Higher |
Model Latency
Time taken for a model to generate a response after receiving input.
Latency matters heavily in production systems.
| Application | Preferred Latency |
|---|---|
| Chatbots | < 2 seconds |
| Real-time AI Agents | < 1 second |
| Batch Processing | Minutes acceptable |
| Document Analysis | Moderate latency |
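Latency targets like the ones above are usually checked by timing repeated requests and looking at both the mean and a tail percentile. The following is a minimal sketch; `fake_generate` is a hypothetical stand-in for a real model call, and the nearest-rank p95 is one of several reasonable percentile definitions.

```python
import math
import statistics
import time


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; replace with a real client."""
    time.sleep(0.01)  # simulate roughly 10 ms of inference work
    return prompt.upper()


def measure_latency(fn, prompt: str, runs: int = 20) -> dict:
    """Call fn repeatedly and summarize wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(prompt)
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank 95th percentile of the sorted samples.
    p95_index = math.ceil(0.95 * len(samples)) - 1
    return {"mean": statistics.mean(samples), "p95": samples[p95_index]}


stats = measure_latency(fake_generate, "hello")
print(f"mean={stats['mean']:.3f}s p95={stats['p95']:.3f}s")
```

Tail latency (p95 or p99) is often more informative than the mean for interactive applications, because a few slow responses dominate the user experience.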
Model Evaluation Metrics
Common metrics for evaluating AI models:
| Concept | Purpose |
|---|---|
| Accuracy | Correct predictions |
| BLEU | Translation quality |
| ROUGE | Summarization quality |
| Cosine Similarity | Semantic similarity |
| Cross-validation | Reliable evaluation |
| A/B Testing | Real-world comparison |
1. Model Accuracy
How often a model predicts correctly on unseen data.
Example:
95 correct predictions out of 100
→ Accuracy = 95%
2. 🔣 BLEU: Bilingual Evaluation Understudy Score
Measures the n-gram precision overlap between generated text and reference text.
White Paper https://www.aclweb.org/anthology/P02-1040.pdf
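Simplified BLEU Formula
\[ \text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \]
Where:
- \(BP\) = brevity penalty
- \(p_n\) = n-gram precision
- \(w_n\) = weights
So: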
Higher overlap → higher BLEU score.
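- The brevity penalty punishes only candidates that are shorter than the reference. Long candidates are not penalized by it, because their extra words already lower the n-gram precision.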
Used mainly for:
- machine translation
- text generation evaluation
Example:
| Role | Text |
|---|---|
| Reference | "The cat sits on the mat" |
| Generated | "The cat is on the mat" |
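The score for this sentence pair can be worked out by hand or with a few lines of code. The sketch below is a simplified, single-reference BLEU with uniform weights, no smoothing, and n-grams up to order 2; the function names are illustrative, and production evaluations normally rely on an established implementation rather than hand-written code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, candidate: str, max_n: int = 2) -> float:
    """Simplified single-reference BLEU with uniform weights and no smoothing."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts, cand_counts = ngrams(ref, n), ngrams(cand, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: only candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


score = bleu("The cat sits on the mat", "The cat is on the mat")
print(f"BLEU-2 = {score:.3f}")
```

Here the unigram precision is 5/6 and the bigram precision is 3/5, and both sentences have six tokens, so no brevity penalty applies. The result is the geometric mean of the two precisions, about 0.707. With n-grams up to order 4 and no smoothing the score would drop to 0, because no 4-gram matches.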
3. 📋 ROUGE Score
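Measures how much of the important reference content the generated text captures.
ROUGE stands for:
Recall-Oriented Understudy for Gisting Evaluation
Simplified ROUGE Formula
\[ \text{ROUGE-N} = \frac{\text{number of overlapping n-grams}}{\text{total n-grams in the reference}} \]
Higher scores indicate higher similarity between the automatically produced summary and the reference.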
Focus:
- recall
- content coverage
Used mainly for:
- Text summarization
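As a small illustration of the recall focus, the sketch below computes ROUGE-N recall as the share of reference n-grams that also appear in the generated text. It reuses the sentence pair from the BLEU example; the function name is illustrative, and full ROUGE implementations also report precision, F-measure, and variants such as ROUGE-L.

```python
from collections import Counter


def rouge_n_recall(reference: str, generated: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also occur in the generated text."""
    ref = reference.lower().split()
    gen = generated.lower().split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    gen_counts = Counter(tuple(gen[i:i + n]) for i in range(len(gen) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # reference is shorter than n tokens
    # Each reference n-gram is credited at most as often as it appears in the generated text.
    overlap = sum(min(count, gen_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


reference = "The cat sits on the mat"
generated = "The cat is on the mat"
print(f"ROUGE-1 recall = {rouge_n_recall(reference, generated, 1):.3f}")
print(f"ROUGE-2 recall = {rouge_n_recall(reference, generated, 2):.3f}")
```

Five of the six reference unigrams and three of the five reference bigrams are recovered, giving recalls of about 0.833 and 0.600. Note that the denominator is the reference length, whereas BLEU divides by the candidate length.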
BLEU vs ROUGE
| Metric | Focus | Common Use |
|---|---|---|
| BLEU | Precision | Translation |
| ROUGE | Recall | Summarization |
4. ↗️ Cosine Similarity
Measures semantic similarity between vector embeddings.
It compares the angle between vectors.
Range:
| Value | Meaning |
|---|---|
| 1 | Very similar |
| 0 | Unrelated |
| -1 | Opposite direction |
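The three values in the table can be reproduced with toy vectors. The sketch below computes cosine similarity as the dot product divided by the product of the vector lengths; the 2-dimensional vectors are made up for illustration, whereas real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [2, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

Because the result depends only on the angle, scaling a vector (as in the first call) does not change the similarity.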
Embedding Similarity Example
flowchart TD
A["Fast GPU computing"]--> C["Embedding Space"]
B["Parallel GPU processing"]--> C
C --> D["High Cosine Similarity"]
5. Cross-Validation
Cross-validation evaluates models using multiple data splits.
Purpose:
- estimate generalization performance
- detect overfitting that a single train/validation split could hide
6. K-Fold Cross Validation
The dataset is split into K folds of roughly equal size. The model is trained K times, and each fold becomes the validation set once.
Training strategy:
flowchart LR
A["Fold 1"]
B["Fold 2"]
C["Fold 3"]
D["Fold 4"]
E["Fold 5"]
F["Train on 4 folds<br/>Validate on 1 fold"]
A --> F
B --> F
C --> F
D --> F
E --> F
Benefits:
- better performance estimation
- improved robustness
- less dependence on one particular train/validation split
Useful when:
- datasets are small
- evaluation data is limited
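The fold rotation described above can be sketched in plain Python. This is a minimal illustration that splits indices in order without shuffling; the function name is hypothetical, and in practice data is usually shuffled first or a library splitter such as scikit-learn's `KFold` is used.

```python
def k_fold_indices(n_samples: int, k: int = 5):
    """Yield (train_indices, val_indices) pairs; each fold is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


for fold, (train, val) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold}: train={train} val={val}")
```

The final performance estimate is typically the average of the K validation scores, which is less sensitive to one lucky or unlucky split.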
7. 🧪 A/B Testing
A/B testing compares two model versions using real users.
Purpose:
- measure production performance
- validate improvements safely
A/B Testing Workflow
flowchart TD
A["Users"]
--> B["Traffic Split"]
B --> C["Model A"]
B --> D["Model B"]
C --> E["Metrics Collection"]
D --> E
Common A/B Testing Metrics
| Metric | Example |
|---|---|
| Click-through rate | Recommendation systems |
| Latency | AI inference |
| User satisfaction | Chatbots |
| Conversion rate | AI assistants |
| Engagement | Content generation |
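The traffic split in the workflow is often implemented by hashing a stable user identifier, so that each user consistently sees the same model. The sketch below assumes string user IDs and a two-variant test; the function name and the `share_b` parameter are illustrative.

```python
import hashlib


def assign_variant(user_id: str, share_b: float = 0.5) -> str:
    """Deterministically map a user to 'A' or 'B' so repeat visits hit the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a roughly uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < share_b else "A"


counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
print(counts)
```

Setting `share_b` to a small value such as 0.05 sends only a fraction of traffic to the new model, which is one way to validate improvements safely before a full rollout.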
F1 Score
The F1 score is a machine learning evaluation metric that balances:
- Precision
- Recall
Precision
How many predicted positives were actually correct?
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Where:
- TP = True Positives
- FP = False Positives
Recall
How many real positives did the model successfully find?
\[ \text{Recall} = \frac{TP}{TP + FN} \]
Where:
- FN = False Negatives
Example:
- Precision = 0.80
- Recall = 0.50
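Then:
\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.80 \times 0.50}{0.80 + 0.50} = \frac{0.80}{1.30} \approx 0.615 \]
So:
- F1 ≈ 61.5%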
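The same numbers can be reproduced from raw counts. In the sketch below, the counts (40 true positives, 10 false positives, 40 false negatives) are hypothetical values chosen only because they give Precision = 0.80 and Recall = 0.50; the function name is illustrative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts chosen to reproduce Precision = 0.80 and Recall = 0.50.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values (0.615 here, not the arithmetic mean of 0.65), so a model cannot score well by excelling at only one of them.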
Interpretation of F1 Score
| F1 Score | Meaning |
|---|---|
| 1.0 | Perfect model |
| 0.9+ | Excellent |
| 0.8 | Strong |
| 0.7 | Decent |
| <0.5 | Weak |
GLUE: General Language Understanding Evaluation Benchmark
GLUE is a collection of NLP tasks used to measure how well a language model understands language across different problems.
GLUE combines multiple NLP tasks such as:
| Task | Purpose |
|---|---|
| Sentiment Analysis | Detect positive/negative meaning |
| Text Similarity | Compare sentence meanings |
| Natural Language Inference | Determine logical relationships |
| Question Answering | Understand context |
| Linguistic Acceptability | Judge grammar correctness |
Each task has its own metric:
- Accuracy
- F1 Score
- Correlation
- Matthews correlation
The final score is an aggregate of all task scores.
Score Interpretation
The human baseline is about 87.1, so a model scoring above that is often described as superhuman on GLUE.
| Score | Meaning |
|---|---|
| 60–70 | Basic NLP capability |
| 70–80 | Strong traditional NLP |
| 80–90 | State-of-the-art transformer range |
| 90+ | Extremely advanced performance |
The final GLUE score is usually the average performance across all tasks.
Perplexity
Perplexity measures how confused a language model is while reading text.
Lower confusion → better prediction quality.
Intuitively:
Lower perplexity means the model is less surprised by the next word.
A language model predicts the probability of the next token.
A good model assigns high probability to the correct next token; a poor model spreads probability across many unlikely tokens.
Perplexity measures this uncertainty.
Perplexity helps evaluate:
- language fluency
- prediction quality
- training progress
- model comparison
Used heavily in:
- NLP research
- LM training
- transformer evaluation
Low Perplexity is Good
High probability → low perplexity.
Model predicts confidently: I drink coffee every morning.
High Perplexity is Bad
Low probabilities → high perplexity.
Unexpected or random text: Banana quantum bicycle democracy lava.
Model becomes uncertain.
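Mathematical Definition
\[ \text{PPL} = \left( \prod_{i=1}^{N} \frac{1}{p(w_i \mid w_{<i})} \right)^{1/N} \]
Equivalent log form:
\[ \text{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i}) \right) \]
Where:
- \(N\) = number of tokens
- \(p(w_i \mid w_{<i})\) = predicted probability of the next token \(w_i\) given the preceding tokens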
Interpretation
| Perplexity | Meaning |
|---|---|
| 1 | Perfect prediction |
| Low (e.g. 10–20) | Strong predictive ability |
| High (e.g. 100+) | Poor predictions / uncertainty |
Example
Suppose a model predicts:
| Word | Probability |
|---|---|
| cat | 0.5 |
| dog | 0.3 |
| banana | 0.01 |
Higher probability assigned to correct words lowers perplexity.
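To see this effect numerically, the sketch below computes perplexity as the exponential of the average negative log-probability that the model assigned to each correct token. The probability lists are made-up values for illustration, reusing 0.5 and 0.01 from the table; the function name is hypothetical.

```python
import math


def perplexity(token_probs) -> float:
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


confident = perplexity([0.5, 0.5, 0.5])     # correct token always gets p = 0.5
uncertain = perplexity([0.01, 0.01, 0.01])  # correct token always gets p = 0.01
print(f"confident model: {confident:.1f}")
print(f"uncertain model: {uncertain:.1f}")
```

A model that always gives the correct token probability 0.5 has perplexity 2, as if it were choosing between two equally likely options at each step, while one that gives it 0.01 has perplexity 100.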
Relationship to Entropy
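Perplexity is closely related to cross-entropy.
\[ \text{PPL} = e^{H} \]
Where:
- \(H\) = cross-entropy, the average negative log-probability per token, measured in nats (use \(2^{H}\) if \(H\) is measured in bits)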
So perplexity is essentially:
exponentiated uncertainty
Limitations
Low perplexity does NOT always mean:
- factual correctness
- reasoning ability
- truthfulness
- safety
- usefulness
A model can:
- memorize text
- predict fluent nonsense
- hallucinate confidently
This is why modern LLM evaluation also uses:
- MMLU
- HELM
- TruthfulQA
- reasoning benchmarks
Modern Context
Perplexity was extremely important for:
- RNNs
- LSTMs
- early transformers
Today, frontier LLM evaluation focuses more on:
- reasoning
- instruction following
- factuality
- coding ability
- agent behavior
because perplexity alone is insufficient for measuring intelligence.
Offline vs Online Evaluation
| Type | Description |
|---|---|
| Offline Evaluation | Uses datasets and metrics |
| Online Evaluation | Uses real user traffic |
Closed vs Open Source Models
There are two major deployment strategies.
Closed Source Models
Examples:
- OpenAI (GPT models)
- Anthropic (Claude models)
Advantages:
- Strong performance: often better than open source
- Easy API integration
- Lower upfront cost: no infrastructure to provision or operate (usage fees can still grow with volume)
Disadvantages:
- Vendor lock-in
- Data privacy concerns
Open Source Models
Examples:
- LLaMA
- Mistral
- Falcon
Advantages:
- Full control
- On-prem deployment
- Better privacy
Disadvantages:
- Infrastructure complexity
- Weaker models (sometimes)
