Model
A model is a program that has been trained on a set of data to recognize certain patterns or make certain decisions without further human intervention.
Model = Trained Algorithm + Data
Inference
The process of running unseen data through a trained AI model to make a prediction or solve a task. Inference is an ML model in action.
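A minimal sketch of the idea: the model is a set of frozen, already-learned parameters, and inference is just applying them to new input. The weights below are made-up stand-ins for values learned during training.

```python
import numpy as np

# Hypothetical "trained" model: these weights stand in for parameters
# learned earlier from data (illustration only).
weights = np.array([0.8, -0.3, 0.5])
bias = 0.1

def predict(features: np.ndarray) -> float:
    """Inference: run unseen data through the frozen model."""
    return float(features @ weights + bias)

unseen = np.array([1.0, 2.0, 3.0])  # data the model never saw in training
print(predict(unseen))  # 1.8
```

No further human intervention is needed at this stage: training produced the weights once, and inference reuses them for every new input.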
Foundation Models (FMs)
Large-scale models trained on broad data that can be adapted to a wide range of downstream tasks.
- Examples: GPT-3, BERT, DALL-E, Stable Diffusion
- Characteristics:
- Trained on massive datasets (text, images, code)
- Capable of zero-shot and few-shot learning
- Serve as a base for fine-tuning on specific tasks
Large Language Model (LLM)
A type of foundation model specifically designed to understand and generate human language.
- Examples: GPT-3, BERT, T5
- Characteristics:
- Trained on vast amounts of text data
- Able to recognize and interpret human language
- Flexible: can perform tasks like text generation, translation, summarization, and question-answering
Word Embedding
Technique used to represent words as dense vectors in a continuous vector space, capturing semantic relationships between words.
- Examples: Word2Vec, GloVe, FastText
- Characteristics:
- Each word is represented as a dense, real-valued vector
- Similar words have similar vector representations
- Enables models to understand context and relationships between words
- Foundation for many NLP tasks and models, including LLMs
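A toy illustration of "similar words have similar vectors": cosine similarity between hand-made 4-dimensional vectors (real embeddings such as Word2Vec use hundreds of dimensions learned from data; these values are invented for the example).

```python
import numpy as np

# Made-up toy embeddings; only the relative directions matter here.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.8, 0.9, 0.1, 0.3]),
    "apple": np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means near-identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words end up with similar vectors:
print(cosine(emb["king"], emb["queen"]))  # high (~0.99)
print(cosine(emb["king"], emb["apple"]))  # low  (~0.33)
```

This geometric notion of similarity is what lets downstream models reason about word relationships numerically.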
Transformer Models
Deep learning models that understand relationships within sequences.
- Paper: "Attention Is All You Need" (Vaswani et al., 2017)
- Examples: BERT, GPT-3, T5
Key Components:
- Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence when making predictions.
- Multi-Head Attention: Enables the model to focus on different parts of the input simultaneously.
- Feed-Forward Networks: Process the output of the attention mechanism to produce the layer's final output.
- Characteristics:
- Highly parallelizable, making it efficient to train on large datasets
- Capable of capturing long-range dependencies in text
- Has become the standard architecture for LLMs and many other NLP tasks.
Transformer Key Components
1. Tokenization
Break text into tokens.
Example: unbelievable → un + believ + able
Why? Reduces vocabulary size.
Common tokenizers: BPE, WordPiece, SentencePiece
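A sketch of the subword idea using a greedy longest-match tokenizer. Real BPE/WordPiece learn their vocabularies from data; the vocabulary below is hand-picked so the example above works.

```python
# Toy vocabulary chosen for illustration (learned from data in practice).
VOCAB = {"un", "believ", "able", "bel", "a"}

def tokenize(word: str) -> list[str]:
    """Greedy longest-match subword tokenization."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

Because rare words decompose into known subwords, the model never needs a vocabulary entry for every possible word.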
2. Embeddings
Convert token IDs → vectors (numbers with meaning).
Without embeddings: Tokens are just numbers.
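An embedding layer is essentially a lookup table: row i of a matrix is the vector for token ID i. The sizes below are made up; real models use vocabularies of 30k+ tokens and 768+ dimensions.

```python
import numpy as np

# Embedding table: one learned row per token ID (random here for illustration).
vocab_size, dim = 10, 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, dim))

token_ids = [3, 1, 7]                  # output of the tokenizer
vectors = embedding_table[token_ids]   # look up one row per token
print(vectors.shape)  # (3, 4)
```

After this lookup, each token is a vector whose values carry meaning, rather than an arbitrary integer ID.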
3. Positional Encoding
Adds order information.
Transformers do NOT understand sequence order naturally. Position is injected manually.
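One standard way to inject position, from "Attention Is All You Need", is the sinusoidal encoding sketched below; it is added element-wise to the token embeddings.

```python
import numpy as np

def positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    """
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1)
    i = np.arange(0, dim, 2)[None, :]      # (1, dim/2)
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)           # even dimensions
    pe[:, 1::2] = np.cos(angles)           # odd dimensions
    return pe

pe = positional_encoding(seq_len=5, dim=8)
# Each row is added to the token embedding at that position.
print(pe.shape)  # (5, 8)
```

Each position gets a distinct pattern across dimensions, so attention layers can tell "first word" from "fifth word" even though they process all tokens in parallel.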
4. Self-Attention (Extremely Important)
Allows model to:
- Look at all words at once
- Determine which words are important
Example: "The movie had a slow start but was amazing."
Model focuses more on "amazing" because of the contrast word "but".
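A minimal NumPy sketch of scaled dot-product self-attention. For brevity the query/key/value projections are identities (real models learn separate projection matrices); the point is that every output row is a weighted mix of ALL token vectors.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)   # how strongly each token attends to each other token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ x              # each output = weighted mix of all tokens

# Three toy token vectors standing in for embedded words.
x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
out = self_attention(x)
print(out.shape)  # (3, 2)
```

The attention weights are what let the model up-weight "amazing" over "slow start" when judging the sentence's sentiment.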
5. Multi-Head Attention
Multiple attention mechanisms running in parallel.
Each head learns different relationships: syntax, emotion, topic, long-distance dependencies
Exam question may ask:
Q: Why multi-head instead of single head?
A: To learn multiple representation subspaces simultaneously.
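A sketch of the "multiple subspaces" idea: the embedding is split into per-head slices, each head runs attention in its own slice, and the results are concatenated. Per-head learned projections are omitted here for brevity; real models apply them before and after.

```python
import numpy as np

def softmax(s: np.ndarray) -> np.ndarray:
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x: np.ndarray, n_heads: int) -> np.ndarray:
    """Each head attends within its own subspace of the embedding,
    so heads can specialize in different relationships."""
    seq_len, dim = x.shape
    head_dim = dim // n_heads
    heads = []
    for h in range(n_heads):
        xh = x[:, h * head_dim:(h + 1) * head_dim]    # this head's subspace
        attn = softmax(xh @ xh.T / np.sqrt(head_dim))
        heads.append(attn @ xh)
    return np.concatenate(heads, axis=-1)             # stitch subspaces back together

x = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8 dims
out = multi_head_attention(x, n_heads=2)
print(out.shape)  # (4, 8)
```

Because each head only sees its own slice, the heads are free to develop different attention patterns in parallel, which is exactly the exam answer above.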
Encoder vs Decoder
Encoder-only (BERT)
If task = understand input → Encoder-only
- Understand text
- Classification
- Sentiment
- Search ranking
Decoder-only (GPT)
If task = generate output → Decoder-only
- Generate text
- Chat
- Story writing
- Code
Encoder-Decoder (T5, BART)
If task = both understand + generate → Encoder-Decoder
- Translation
- Summarization
- Text transformation
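The decision rules above can be condensed into a lookup; the table below is a hypothetical study aid, not a real library API.

```python
# Hypothetical task -> architecture table encoding the rules above.
TASK_TO_ARCH = {
    "classification": "encoder-only",    # understand input (BERT-style)
    "sentiment":      "encoder-only",
    "search ranking": "encoder-only",
    "chat":           "decoder-only",    # generate output (GPT-style)
    "story writing":  "decoder-only",
    "code":           "decoder-only",
    "translation":    "encoder-decoder", # understand + generate (T5/BART-style)
    "summarization":  "encoder-decoder",
}

print(TASK_TO_ARCH["translation"])  # encoder-decoder
```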
