What Are AI Models and How to Pick the Right One?
Transformer Models 📟
Deep learning models that understand relationships within sequences.
- Paper: "Attention Is All You Need" (Vaswani et al., 2017)
- Examples: BERT, GPT-3, T5
- Transformers are the backbone of LLMs and generative AI
Key Components:
- Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence when making predictions.
- Multi-Head Attention: Enables the model to focus on different parts of the input simultaneously.
- Feed-Forward Networks: Process the output of the attention mechanism to produce the final output.
- Characteristics:
- Highly parallelizable, making it efficient to train on large datasets
- Capable of capturing long-range dependencies in text
- Now the standard architecture for LLMs and many other NLP tasks
Transformer Key Components
Tokenization 🔠
Break text into tokens.
Example: unbelievable → un + believe + able
Why? Reduces vocabulary size.
Common tokenizers:
- BPE
- WordPiece
- SentencePiece
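The splitting idea can be sketched with a greedy longest-match tokenizer over a tiny hand-picked vocabulary. Real tokenizers like BPE or WordPiece learn their vocabularies from data, so the exact pieces they produce for a word (e.g. "believ" vs. "believe") depend on the learned merges; this toy vocab just shows the mechanics.

```python
# Toy subword tokenizer: greedy longest-match over a fixed vocabulary.
# The vocabulary here is hand-picked for illustration; real tokenizers
# (BPE, WordPiece, SentencePiece) learn theirs from a training corpus.
VOCAB = {"un", "believ", "able", "do", "ing", "talk"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest matching vocab entry starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocab entry matches: emit the single character as a token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

Because every word decomposes into known subwords (or single characters), the model needs far fewer vocabulary entries than one per whole word.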
Word Embedding 🔢
Technique used to represent words as dense vectors in a continuous vector space, capturing semantic relationships between words.
- Each word is represented as a high-dimensional vector
- Converts token IDs → vectors (numbers with meaning)
- Similar words have similar vector representations
- Enables models to understand context and relationships between words
Examples: Word2Vec, GloVe, FastText
```mermaid
graph TD
A[Token IDs] --> B[Embedding Layer]
B --> C[Word Vectors]
```
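Mechanically, an embedding layer is just a lookup into a learned matrix: row *i* is the vector for token ID *i*. A minimal NumPy sketch (with a made-up three-word vocabulary and random vectors, since untrained weights carry no meaning yet):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a trained model learns these vectors so
# that related words end up close together. Random init just shows the
# lookup mechanics of token IDs -> vectors.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))

token_ids = np.array([vocab["the"], vocab["cat"], vocab["sat"]])
vectors = embedding_table[token_ids]  # row lookup, one vector per token
print(vectors.shape)  # (3, 8)
```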
Positional Encoding ↗️
Adds order information.
Transformers do NOT understand sequence order naturally. Position is injected manually.
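One common way to inject position, from the original paper, is the fixed sinusoidal encoding: each position gets a vector of sines and cosines at different frequencies, which is added to the word embeddings. A NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Sinusoidal encoding from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Added to embeddings before the first layer: x = embeddings + pe
print(pe.shape)  # (4, 8)
```

Many modern models instead use learned or rotary position embeddings, but the principle is the same: order is injected, not inferred.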
Self-Attention (Extremely Important)
Allows model to:
- Look at all words at once
- Determine which words are important
Example: "The movie had a slow start but was amazing."
Model focuses more on "amazing" due to the contrast word "but".
Multi-Head Attention
Multiple attention mechanisms running in parallel.
Each head learns different relationships:
- Syntax
- Emotion
- Topic
- Long-distance dependencies
Q: Why multi-head instead of single head? Answer: To learn multiple representation subspaces simultaneously.
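The mechanics of "multiple subspaces" can be sketched by reshaping the model dimension into heads and attending in each independently. To keep the sketch short, the learned projections (Wq, Wk, Wv, and the output projection Wo of a real layer) are omitted here; only the split-attend-concatenate structure is shown:

```python
import numpy as np

def multi_head_attention(x, num_heads):
    # Split d_model into num_heads subspaces, attend in each one
    # independently, then concatenate the results back together.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    heads = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ heads        # each head mixes values in its own subspace
    # Concatenate heads: (num_heads, seq_len, d_head) -> (seq_len, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
print(multi_head_attention(x, num_heads=2).shape)  # (5, 8)
```

Each head sees only its own d_head-dimensional slice, so different heads are free to specialize in different relationships.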
Multilayer Perceptron (MLP)
A feedforward neural network consisting of fully connected neurons with nonlinear activation functions
- MLP is used in transformers to process the output of the attention mechanism and produce the final output.
- Attention captures relationships, MLP transforms those relationships into meaningful representations for the next layer or final output.
MLP typically consists of two linear transformations with a nonlinear activation function (like ReLU) in between.
```mermaid
graph LR
A[Attention Output] --> B[Linear Layer 1]
B --> C[ReLU Activation]
C --> D[Linear Layer 2]
```
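The Linear → ReLU → Linear block applies to each token position independently. A minimal NumPy sketch, using the original paper's convention that the hidden width is 4 × d_model:

```python
import numpy as np

def mlp_block(x, W1, b1, W2, b2):
    # Position-wise feed-forward block: Linear -> ReLU -> Linear,
    # applied to every token's vector independently.
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU: zero out negatives
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                    # d_ff = 4 * d_model, as in the paper
x = rng.normal(size=(5, d_model))        # attention output for 5 tokens
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(mlp_block(x, W1, b1, W2, b2).shape)  # (5, 8)
```

The first linear layer expands to the wider hidden space, the nonlinearity lets the block model non-linear feature combinations, and the second layer projects back to d_model so the next layer can stack on top.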
Transformers stack multiple layers of attention and MLP to build deep representations of the input data, enabling them to perform complex tasks like language understanding and generation.
Encoder vs Decoder
Encoder-only (BERT)
If task = understand input → Encoder-only
- Understand text
- Classification
- Sentiment
- Search ranking
Decoder-only (GPT)
If task = generate output → Decoder-only
- Generate text
- Chat
- Story writing
- Code
Encoder-Decoder (T5, BART)
If task = both understand + generate → Encoder-Decoder
- Translation
- Summarization
- Text transformation
