LLMs & Foundation Models Explained
A practical guide to Large Language Models (LLMs) and foundation models, covering architectures, training concepts, fine-tuning, inference, embeddings, RAG, and real-world AI application development.
NVIDIA NGC Catalog: GPU Optimized Containers, AI Models and Enterprise AI Infrastructure
🎿 Beginner’s Guide to Skiing
What is Large Language Model (LLM)
A Large Language Model is a sophisticated mathematical function that predicts what word comes next for any piece of text"
LLM is a type of foundation model specifically designed to understand and generate human language.
White Paper: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
🧱 Foundation Models (FMs)
Large-scale models trained on broad data that can be adapted to a wide range of downstream tasks.
- Examples:
GPT-3,BERT,DALL-E,Stable Diffusion
Characteristics:
- Trained on massive datasets (text, images, code)
- Capable of zero-shot and few-shot learning
- Serve as a base for fine-tuning on specific tasks
Autoregressive language model
A type of language model that generates text by predicting the next word in a sequence based on the previous words.
- Example: GPT-3, LLaMA, Mistral, Falcon
🧠 Large Language Models (LLMs)
A subset of foundation models that are specifically designed to understand and generate human language.
| Provider | Type | Description |
|---|---|---|
| AWS Bedrock | aws_bedrock | AWS Bedrock API |
| Azure OpenAI | azure_openai | Azure OpenAI API |
| Hugging Face | huggingface | Hugging Face API |
| Hugging Face Inference | huggingface_inference | Hugging Face Inference API, Endpoints, and TGI |
| LiteLLM | litellm | LiteLLM API |
| NVIDIA NIM | nim | NVIDIA Inference Microservice (NIM) |
| OCI Generative AI | oci | OCI Generative AI |
| OpenAI | openai | OpenAI API |
Examples: GPT-3, BERT, T5
- Characteristics:
- Trained on vast amounts of text data
- Able to recognize and interpret human language
- Flexible: can perform tasks like text generation, translation, summarization, and question-answering
Parameter Tuning in LLM
Use low temperature and low top-p for agents, planners, tool calls, and structured outputs.
Use higher temperature and top-p for creative content generation and brainstorming.
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
model="gpt-5.5",
input="Generate deployment steps for a Grafana dashboard import.",
temperature=0.1,
top_p=0.2
)
print(response.output_text)
Choosing Top-P & Temperature
| Use Case | 🌡️ Temperature | Top-p 🎲 |
|---|---|---|
| JSON generation | 0.0 | 0.1 |
| Tool calling | 0.0 | 0.1 |
| Code generation | 0.1 | 0.2 |
| RAG QA | 0.2 | 0.8 |
| Summarization | 0.3 | 0.9 |
| Creative writing | 0.8 | 0.95 |
| Brainstorming | 1.0 | 1.0 |
1. Temperature 🌡️
Changes the shape of the probability distribution.
Temperature = changes the probabilities of tokens.
2. Top-p / Nucleus Sampling 🎲
Top K dynamically adjusts the number of tokens considered based on their cumulative probability.
Top=P sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold P (a value between 0 and 1).
Top-p = changes which tokens are allowed to participate in sampling.
Example
Tokens:
| Token | Probability |
|---|---|
| "the" | 40% |
| "a" | 25% |
| "this" | 15% |
| "that" | 10% |
| Others | 10% |
Top-p = 1.0
All tokens remain eligible. Maximum diversity.
Eg: Where every bean begins a new adventure.
Top-p = 0.9
The model will consider the smallest number of tokens whose combined probability is 90%.
the 40%
a 25% -> 65%
this 15% -> 80%
that 10% -> 90%
The remaining low-probability tokens are discarded.
Eg. Wake up to a cup of inspiration.
Top-p = 0.3
the 40%
Only the highest-probability token is eligible.
Output becomes highly deterministic.
Eg. Fresh coffee, every day.
Typical Values
| Top-p | Behavior | Meaning |
|---|---|---|
| 0.1 - 0.3 | Very deterministic | Less Choices |
| 0.5 - 0.7 | Controlled | Fresh coffee, every day. |
| 0.8 - 0.95 | Balanced | Medium |
| 1.0 | Maximum diversity | Creative |
3. Max Tokens 🗨
Limits the number of output tokens the model can generate.
Example: Max Tokens = 50
The model stops after approximately 50 output tokens.
Small Value
Max Tokens = 20
Response may be cut off.
The migration plan consists of three phases:
1. Assessment
2. Pilot
3...
Large Value:
Max Tokens = 2000
The model can provide a detailed answer.
| Parameter | Controls | Effect |
|---|---|---|
| Top-p | Randomness / token selection | How creative or deterministic the output is |
| Max Tokens | Response length | How long the model is allowed to generate |
Prompting Techniques in Generative AI 💬
| Technique | Mental Model | Examples Provided? |
|---|---|---|
| Zero-Shot | "Just do it" | No |
| One-Shot | "Here is one example" | One |
| Few-Shot | "Learn from these examples" | Multiple |
| CoT | "Think step by step" | Optional |
| System Prompt | "Behave like this" | Persistent instruction |
1. Zero-Shot Prompting 👨🦯
Blindly asking LLM to generate text without giving a direction or example
- useful for simple and well-defined tasks
- Most common
- fast inference
Example: "Suggest newborn baby name"
Expected: Random baby names
Aarav — peaceful, calm
Vihaan — dawn, new beginning
Ivaan — God’s gracious gift
Reyansh — ray of light
...
2. One-Shot Prompting ☝
When a single example clarifies task format or style; helps guide the model with minimal context
The model receives:
- one example
- then the real task
Best For
- formatting guidance
- classification tasks
- lightweight context steering
Example: "Suggest newborn baby name starting with A eg: Aaryan"
Expected: Indian Baby names starting with A
Aarav — peaceful, calm
Aadvik — unique
Ayaan — gift of God
Atharv — wisdom, knowledge
...
3. Few-Shot Prompting 📝
When multiple examples are needed to teach the model patterns or nuanced behavior
The model learns patterns from multiple examples.
Best For
- nuanced tasks
- structured outputs
- custom formatting
- behavior steering
Example: "Suggest newborn baby name starting with A eg: Aaryan"
Expected:
Aarav — peaceful
Aaryan — noble
Ayaan — gift of God
...
4. Chain-of-Thought (CoT) 🔗
When reasoning or multi-step logic is required; improves reasoning accuracy by generating intermediate steps Chain-of-Thought prompting encourages:
- intermediate reasoning
- multi-step thinking
CoT improves:
- reasoning accuracy
- logical consistency
- math performance
- planning tasks
Especially useful for:
- LLM agents
- coding tasks
- complex workflows
Example
Question:
If a train travels 60 km/h for 2 hours,
how far does it travel?
Let's think step by step.
Expected reasoning:
Distance = Speed × Time
60 × 2 = 120 km
5. System Prompting 📜
When you want to control model behavior, tone, safety, or output formatting consistently
System prompts define:
- model behavior
- personality
- rules
- tone
- response style
Best For
- chatbots
- enterprise AI
- compliance
- formatting rules
- safety policies
Example
You are a professional support assistant.
Always respond politely and concisely.
🔁 Transfer Learning
Using a model pretrained on a large dataset and adapting it for a related task with limited new data.
A model trained on millions of images already understands edges, textures, faces, animals, etc.
You fine-tune it to detect:
- cancer cells
- defective products
- cats vs dogs
- traffic signs
Advantages
- Faster training
- Less data required (1000 vs 1 million)
- Better accuracy
- Lower compute cost
- Works well for small datasets
Disadvantage
- Source task and target task should be somewhat related
- Biases from pretrained data can transfer
- Large models may still be expensive
Popular models
Computer Vision
- ResNet
- VGGNet
- EfficientNet
- YOLO
NLP
- BERT
- GPT
- T5
🎛 Fine-Tuning
Fine-tuning adapts a model to a specific task.
- Tune model to understand domain-specific language eg medical, legal, finance
- Adapt model using smaller domain dataset
Transfer learning vs Fine-tuning
- Transfer learning = broader concept
- Fine-tuning = one implementation approach
Use cases:
- domain-specific language
- structured outputs
- company-specific style
🧗 Pretraining
Train on massive internet text
- Only makes sense for large organizations with unique data and resources
When Should You Pretrain a Model?
Pretraining an LLM is extremely expensive.
Typical requirements:
- hundreds of billions of tokens
- months of training
- tens of millions of dollars
For most application teams, pretraining should be an option of last resort. It only makes sense when the domain is highly specialized and existing models cannot be adapted effectively.
Typical scale:
| Stage | Data Size |
|---|---|
| Pretraining | billions of tokens |
| Fine-tuning | thousands of examples |
⚗️ Knowledge Distillation
Large, powerful model (
Teacher) transfers learned behavior to a smaller model (Student), enabling similar performance with lower compute and memory usage.
- Knowledge Distillation → senior employee mentoring a junior employee
Example
- Using a large GPT model to train a lightweight chatbot model for mobile devices.
Use case:
- Mainly used to deploy efficient models on edge/mobile devices?
| Concept | Main Goal |
|---|---|
| Transfer Learning | Reuse learned knowledge |
| Fine-Tuning | Adapt pretrained model to specific task |
| Knowledge Distillation | Compress knowledge into smaller model |
Decision Ladder
flowchart TD
A[Start with prompting] --> B{Good enough?}
B -- Yes --> Z[Deploy]
B -- No --> C[Try RAG]
C --> D{Good enough?}
D -- Yes --> Z
D -- No --> E[Try fine-tuning]
E --> F{Good enough?}
F -- Yes --> Z
F -- No --> G[Consider pretraining as last resort]
Therefore, it should be considered a last resort.
Most applications use:
- Prompting
- RAG
- Fine-tuning
RLHF (Reinforcement Learning From Human Feedback
RLHF trains a reward model that scores answers.
- Higher scores go to responses that are more helpful, honest, and harmless.
We can describe the reward idea as:
Then the model is optimized to produce responses with higher expected reward:
where is the model’s response policy.
RLHF Flow Diagram
flowchart TD
P[Prompt] --> G[Model generates candidate responses]
G --> H[Humans score responses]
H --> RM[Train reward model]
RM --> FT[Further train model to prefer high-reward responses]
This is one reason chat systems feel more aligned, polite, and useful than raw base models.
🕵🏻 Agents
Agents use LLMs to perform multi-step reasoning and actions.
Example task:
Research BetterBurgers competitors
Agent plan:
- Search competitors
- Visit websites
- Summarize each company
Agent Workflow Diagram
flowchart TD
U[User goal] --> P[LLM plans steps]
P --> S[Search]
S --> V[Visit websites]
V --> R[Read content]
R --> M[Summarize findings]
M --> O[Return final answer]
Agents are still an active research area, but the core idea is already useful: combine reasoning, planning, and tools to solve multi-step tasks.
The LLM acts as a controller that decides which tools to use.
