Transformer Models 📟
Deep learning models that understand relationships within sequences.
- White Paper: "Attention is All You Need" (Vaswani et al., 2017)
Transformers are the Backbone of LLMs and Generative AI
Characteristics:
- Highly parallelizable, making it efficient to train on large datasets
- Capable of capturing long-range dependencies in text
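They are the foundation for LLMs and many other NLP models:
- GPT (decoder-only)
- BERT (encoder-only)
- T5 and BART (encoder-decoder)
CNNs and standalone MLPs are separate, older architectures and are not built on transformers, although an MLP block appears inside every transformer layer.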
🤖 Transformer Encoder-Decoder Architecture
flowchart TD
%% =========================
%% Encoder
%% =========================
subgraph Encoder["🧠 Encoder"]
subgraph EncoderEmbedder["Encoder Embedder"]
A[📝 Input Tokens]
B[🔤 Token Embedding]
C[📍 Positional Encoding]
A --> B --> C
end
D>🎯 Multi-Head Self Attention]
E[➕ Add & Layer Norm]
F[[⚙️ Feed Forward Network]]
G[➕ Add & Layer Norm]
EncoderEmbedder --> D
D --> E --> F --> G
end
%% Encoder Output
H{{📦 Encoder Output}}
G --> H
%% =========================
%% Decoder
%% =========================
subgraph Decoder["🧩 Decoder"]
subgraph DecoderEmbedder[" 📤 Decoder Embedder"]
O[📤 Previous Output Tokens]
P[🔤 Token Embedding]
Q[📍 Positional Encoding]
O --> P --> Q
end
I>🚫 Masked Multi-Head Self Attention]
J[➕ Add & Layer Norm]
K>🔄 Cross Attention]
L[➕ Add & Layer Norm]
M[[⚙️ Feed Forward Network]]
N[➕ Add & Layer Norm]
DecoderEmbedder --> I
I --> J --> K --> L --> M --> N
end
%% Cross Attention
H --> K
%% Final Prediction
N --> R[📈 Linear Layer]
R --> S[🎲 Softmax]
S --> T[✅ Predicted Token]
Transformer Key Components
1. Tokenization 🔠
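Break text into tokens (words, subwords, or characters).
Example: unbelievable → un + believe + able (illustrative; the exact split depends on the tokenizer's learned vocabulary)
Why? Subword units keep the vocabulary small while still covering rare and unseen words.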
Common tokenizers:
- BPE
- WordPiece
- SentencePiece
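The subword split described above can be sketched with a greedy longest-match tokenizer in the style of WordPiece. This is a minimal illustration, not the algorithm of any specific library: the function name `wordpiece_tokenize`, the `toy_vocab` set, and the `##` continuation marker convention are assumptions made for this example, and a real tokenizer learns its vocabulary from a corpus.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only for this illustration.
toy_vocab = {"un", "##believ", "##able", "play", "##ing"}

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because every word is rebuilt from a small set of pieces, the vocabulary stays bounded even when new words appear.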
Text Preprocessing
Stemming and lemmatization are two preprocessing techniques that reduce words to a simpler base form.
| Technique | Output Quality | Speed | Uses Dictionary | Context Aware |
|---|---|---|---|---|
| Stemming | Rough root | Faster | No | No |
| Lemmatization | Real word | Slower | Yes | Yes |
1. Stemming
Stemming removes prefixes or suffixes using heuristic rules.
- Reduces words to a root form by chopping off word endings (e.g., "running" → "run"), often resulting in non-words (e.g., "studies" → "studi").
- The resulting word may NOT be a valid dictionary word.
Stemming Example
| Word | Stemmed Output |
|---|---|
| playing | play |
| played | play |
| studies | studi |
| running | run |
Common Algorithms
| Algorithm | Description |
|---|---|
| Porter Stemmer | Most popular |
| Snowball Stemmer | Improved Porter |
| Lancaster Stemmer | Aggressive stemming |
Example
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("studies"))
Output:
studi
Advantages of Stemming
- Fast: rule-based, not grammar-based
- Lightweight: low computational cost
- Good for search systems
Limitations of Stemming
- Produces invalid words: for example, studi
- Less accurate: ignores context and grammar
- Aggressive reduction: may lose meaning
2. Lemmatization
Lemmatization reduces words to their base or dictionary form (lemma) using vocabulary and morphological analysis.
- It considers the context and part of speech to produce valid words (e.g., "running" → "run", "better" → "good").
- The resulting word is always a valid dictionary word.
| Word | Lemmatized Output |
|---|---|
| playing | play |
| studies | study |
| better | good |
| running | run |
Example:
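from nltk.stem import WordNetLemmatizer
# Requires the WordNet corpus: run nltk.download("wordnet") once beforehand.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))  # treated as a noun by default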
Output:
study
When to Use Lemmatization
Use lemmatization when:
- accuracy matters
- semantic understanding is important
- NLP tasks require context awareness
2. Word Embedding 🔢
Technique used to represent words as dense vectors in a continuous vector space, capturing semantic relationships between words.
- Converts token IDs → vectors (numbers with meaning)
- Each word is represented as a high-dimensional vector
- Similar words have similar vector representations
- Enables models to understand context and relationships between words
Examples:
- Word2Vec
- GloVe
- FastText
graph TD
A[Token IDs] --> B[Embedding Layer]
B --> C[Word Vectors]
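The diagram above amounts to a table lookup followed by vector comparisons. The sketch below illustrates that idea in plain Python. The four-word `vocab`, the hand-picked numbers in `embedding_table`, and the helper names are all invented for this example; a trained model would learn the vectors from data.

```python
import math

# Hypothetical 4-word vocabulary with hand-picked 3-dimensional vectors.
# A trained model would learn these values; they are invented here.
vocab = {"king": 0, "queen": 1, "apple": 2, "banana": 3}
embedding_table = [
    [0.9, 0.8, 0.1],  # king
    [0.8, 0.9, 0.1],  # queen
    [0.1, 0.2, 0.9],  # apple
    [0.2, 0.1, 0.8],  # banana
]


def embed(token_ids):
    """An embedding layer is a row lookup: token ID -> vector."""
    return [embedding_table[i] for i in token_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king, queen, apple = embed([vocab["king"], vocab["queen"], vocab["apple"]])
print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, apple))  # noticeably lower
```

Cosine similarity is used here because it compares direction rather than length, which is the usual way to measure closeness between word vectors.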
Common techniques
1. Word2Vec: Static Embeddings
Word2Vec uses a shallow (two-layer) neural network, not a deep one, to learn semantic relationships and generate real-valued vectors where words with similar meanings are positioned close together in vector space.
- Static Representation: Word2Vec generates static embeddings, meaning every word in its dictionary is assigned exactly one fixed vector regardless of how it is used in a sentence.
- Lack of Context: Because the embeddings are static, the model cannot differentiate between different meanings of the same word.
- For example, Word2Vec would produce the exact same vector for the word "bank" in both "river bank" and "investment bank".
from gensim.models import Word2Vec
sentences = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "barked"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word_vector = model.wv['cat']
print(word_vector)
WordNet vs Word2Vec
WordNet is a hand-crafted lexical database: a curated resource that is queried, not a model that is trained.
- WordNet mimics human logic, focusing on word senses and connections between real-world entities.
| Aspect | WordNet | Word2Vec |
|---|---|---|
| Type | Hand-crafted lexical-semantic database | Neural embedding model |
| Built By | Linguists and manual curation | Machine learning from text corpora |
| Representation | Symbolic semantic network | Dense numerical vectors |
| Focus | Word senses and semantic relations | Distributional similarity from usage |
| Meaning Handling | Separates multiple senses explicitly | Classic model merges senses into one vector |
| Interpretability | Human-readable and inspectable | Latent vectors, not directly interpretable |
| Semantic Relations | Explicit: hypernyms, hyponyms, meronyms, synonyms | Implicit statistical relationships |
| Algebraic Operations | Not supported naturally | Supports vector arithmetic |
| Example Capability | dog → animal hierarchy | king - man + woman ≈ queen |
| Training | Static curated resource | Trainable on any corpus |
| Domain Adaptation | Limited manual expansion | Easily adapted to domain-specific corpora |
| Cross-Lingual Support | Linked multilingual lexical networks | Possible through aligned multilingual embeddings |
| Scalability | Limited by manual maintenance | Scales with data and compute |
| Knowledge Source | Human knowledge engineering | Statistical language patterns |
| AI Paradigm | Symbolic AI | Statistical AI |
| Strength | Precise semantic structure | Captures distributional usage patterns from data |
| Weakness | Limited coverage and flexibility | Weak interpretability and sense ambiguity |
| Modern Role | Useful for semantic reasoning and NLP resources | Historical foundation for modern embeddings |
| Successors / Evolution | Semantic graphs and ontologies | Contextual embeddings like BERT and GPT |
| Analogy | Curated semantic encyclopedia | Geometric map of language usage |
2. BERT: Contextualized Embeddings
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that generates contextualized word embeddings, meaning the vector representation of a word can change based on the context in which it appears.
White Paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018)
- Contextualized Representation: BERT is designed to generate contextualized word representations, allowing it to understand the nuanced meaning of words based on their specific surroundings.
- Nuanced Understanding: Using the same example, BERT would produce very different vectors for the word "bank" in "river bank" versus "investment bank" because it captures the surrounding context to understand the intended meaning.
- Model Category: BERT is categorized as an autoencoding model, which makes it highly effective for language understanding tasks like text classification and question answering.
from transformers import BertTokenizer, BertModel
sentence = "The bank is on the river."
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(sentence, return_tensors='pt')
model = BertModel.from_pretrained('bert-base-uncased')
outputs = model(**inputs)
word_embeddings = outputs.last_hidden_state
print(word_embeddings)
3. Doc2Vec
Doc2Vec is an extension of Word2Vec that generates vector representations for entire documents or paragraphs, rather than just individual words.
- Document-Level Representation: Doc2Vec captures the overall meaning of a document, making it useful for tasks like document classification, sentiment analysis, and information retrieval
- It learns to represent documents in a continuous vector space, allowing for comparisons between documents based on their semantic content
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
# Sample documents
documents = ["I love machine learning.", "Transformers are powerful models.", "Natural language processing is fascinating."]
# Tagging documents
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]
# Training Doc2Vec model
model = Doc2Vec(tagged_docs, vector_size=50, window=2, min_count=1, workers=4)
# Getting document vectors
doc_vector = model.infer_vector(["I", "love", "machine", "learning."])
print(doc_vector)
3. Positional Encoding ↗️
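Adds order information.
Self-attention treats its input as an unordered set, so Transformers do NOT understand sequence order on their own. A position vector is added to each token embedding (fixed sinusoidal values in the original paper, learned values in models such as BERT).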
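One common way to inject position is the sinusoidal scheme from "Attention is All You Need". The sketch below is a plain-Python illustration of that formula; the function name, the tiny sizes, and the constant `token_embedding` vector are assumptions chosen for readability, and many models use learned position embeddings instead.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)

# Hypothetical token embedding; position is injected by element-wise addition.
token_embedding = [0.5] * 6
with_position = [e + p for e, p in zip(token_embedding, pe[2])]
print(with_position)
```

Because each position gets a distinct vector, the same token produces a different input at different places in the sequence.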
4. 👁️ Self-Attention (Extremely Important)
Allows the model to weigh the importance of different words in a sentence when making predictions.
It lets the model:
- Look at all words at once
- Determine which words are important
- Assign a weight to each word based on its relevance to the word currently being processed
Example: "The movie had a slow start but was amazing."
Model focuses more on: “amazing” due to contrast word “but”.
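The weighting described above is usually computed as scaled dot-product attention: softmax(QKᵀ / √d_k) · V. The following is a minimal plain-Python sketch of that computation. The helper names, the 3-token input `x`, and the use of identity matrices as projection weights are assumptions for illustration; real models learn the projections and use tensor libraries.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)               # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Hypothetical input: 3 tokens with 2 features each, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
print(weights[0])  # token 0 attends more to tokens 0 and 2 than to token 1
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, including itself.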
5. 👀 Multi-Head Attention
Enables the model to focus on different parts of the input simultaneously.
Multiple attention mechanisms running in parallel. Each head learns different relationships:
- Syntax
- Emotion
- Topic
- Long-distance dependency
Q: Why multi-head instead of single head? Answer: To learn multiple representation subspaces simultaneously.
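The "multiple subspaces" idea can be sketched by splitting each token vector into slices, running attention on each slice independently, and concatenating the results. This is a deliberately simplified illustration: real multi-head attention applies learned query, key, value, and output projections per head, which are replaced here by plain slicing. The function names and the example input are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head (lists of row vectors)."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) for d in range(len(v[0]))])
    return out


def multi_head_attention(x, num_heads):
    """Split features into heads, attend per head, concatenate the outputs."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Simplification: slice the features instead of applying learned projections.
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attention(part, part, part))
    # Concatenate the per-head outputs back into d_model features per token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]


# Hypothetical input: 3 tokens, 4 features, split across 2 heads.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, num_heads=2)
print(out)
```

Each head sees a different slice of the features, so its attention weights can differ from those of the other heads.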
6. 🧠 Multi-Layer Perceptron (MLP)
A feedforward neural network consisting of fully connected layers with nonlinear activation functions.
- In transformers, the MLP processes the output of the attention mechanism within each layer.
- Attention captures relationships; the MLP transforms those relationships into meaningful representations for the next layer or final output.
Feed Forward Network
The feed-forward network is the MLP inside each transformer layer. It is applied to every token position independently.
It typically consists of two linear transformations with a nonlinear activation function (like ReLU) in between.
graph LR
A[Attention Output] --> B[Linear Layer 1]
B --> C[ReLU Activation]
C --> D[Linear Layer 2]
Transformers stack multiple layers of attention and MLP to build deep representations of the input data, enabling them to perform complex tasks like language understanding and generation.
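The Linear → ReLU → Linear block in the diagram can be sketched for a single token vector as follows. The weights and sizes (2 → 4 → 2) are hypothetical values picked so the arithmetic is easy to follow; trained models learn them and typically use a hidden size several times larger than the model dimension.

```python
def linear(x, weights, bias):
    """y = W x + b for one input vector; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Position-wise MLP: expand, apply ReLU, project back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hypothetical weights: expand 2 features to 4, then project back to 2.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
b2 = [0.1, -0.1]

hidden = relu(linear([2.0, -3.0], w1, b1))
ffn_out = feed_forward([2.0, -3.0], w1, b1, w2, b2)
print(hidden)   # [2.0, 0.0, 0.0, 3.0]
print(ffn_out)
```

The same function is applied to every token position separately, which is why this block is often called position-wise.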
7. 🔣 Encoder (BERT)
- Processes the input sequence
- Learns contextual representations
- Uses self-attention to understand relationships between words
Use cases
- Understand text
- Classification
- Sentiment
- Search ranking
8. 💬 Decoder
- Generates output tokens one by one
- Uses masked attention to hide future tokens
- Uses cross-attention to focus on encoder outputs (encoder-decoder models only; decoder-only models such as GPT have no encoder and omit it)
Use Case
- Generate text
- Chat
- Story writing
- Code
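Masked attention, which the decoder uses to hide future tokens, can be sketched by setting the scores of future positions to negative infinity before the softmax. The function names and the uniform example scores below are assumptions for illustration only.

```python
import math

NEG_INF = float("-inf")


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, allowed):
    """Softmax over scores, giving zero weight to disallowed positions."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a position can always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores: every pair of the 4 positions scores 0.0.
seq_len = 4
scores = [[0.0] * seq_len for _ in range(seq_len)]
mask = causal_mask(seq_len)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for row in weights:
    print(row)
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# ... each position spreads its weight over itself and earlier positions only
```

With the mask in place, the prediction for a position can only depend on tokens that were already generated.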
Encoder-Decoder (T5, BART)
Use an encoder-decoder when the task requires both understanding an input and generating a new output.
Use Case
- Translation
- Summarization
- Text transformation
9. Add & LayerNorm 📐
It consists of two parts:
9.1. Residual connection (skip connection)
Allows the input of a layer to bypass some operations and be added directly to the output.
Instead of learning the full mapping directly:
$$H(x)$$
the network learns the residual:
$$F(x) = H(x) - x$$
So the final output becomes:
$$H(x) = F(x) + x$$
or more commonly written as:
$$y = x + \text{Sublayer}(x)$$
The original input is preserved and added back after transformation.
This helps:
- prevent vanishing gradients
- stabilize deep networks
- improve training speed
- retain important information
9.2. Layer Normalization
Layer Normalization normalizes the activations of a layer separately for each individual training example, across that example's features rather than across the batch.
It transforms activations so they have:
- mean approximately $0$
- variance approximately $1$
Most values then fall within a few standard deviations of zero, so the normalized activations stay in a small range around zero.
It helps:
- stabilize training
- speed up convergence
- reduce sensitivity to initialization
- prevent activations from becoming too large or too small
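Both parts of Add & LayerNorm can be sketched together in plain Python. This is a minimal illustration of the post-norm arrangement LayerNorm(x + Sublayer(x)). The function names, the `eps` value, the default scale of 1 and shift of 0, and the example vectors are assumptions; real implementations learn the scale and shift parameters.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one example's features to mean ~0 and variance ~1."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gamma = gamma or [1.0] * n  # learned scale in a real model
    beta = beta or [0.0] * n    # learned shift in a real model
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]


def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])


# Hypothetical layer input and sublayer (attention or MLP) output.
x = [1.0, 2.0, 3.0, 4.0]
sublayer_output = [0.5, -0.5, 0.5, -0.5]
y = add_and_norm(x, sublayer_output)

mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(y)
print(mean_y, var_y)  # approximately 0 and 1
```

The small `eps` term only guards against division by zero when all features are equal.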
10. Final Prediction 🎲
The decoder output passes through a linear layer and then a softmax:
10.1 Linear Layer 📈
A Linear Layer is a neural network layer that performs a weighted transformation of the input.
It is also called:
- Fully Connected Layer
- Dense Layer
The layer learns:
- weights
- biases
to transform input features into new representations.
Linear Layer Formula
$$y = Wx + b$$
Where:
- $x$ = input vector
- $W$ = weight matrix
- $b$ = bias vector
- $y$ = output vector
How it Helps?
The linear layer:
- mixes information from the input features
- learns important patterns
- changes feature dimensions
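In the final prediction step, the linear layer changes the feature dimension from the model's hidden size to the vocabulary size, producing one logit per token. The sketch below assumes a hidden size of 3 and a hypothetical 4-token vocabulary; the weights, bias, and names are invented for illustration.

```python
def linear_layer(x, W, b):
    """Compute y = W x + b, where W is a list of rows (one row per output)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


# Hypothetical 4-token vocabulary and hand-picked parameters.
toy_vocab = ["cat", "dog", "mat", "sat"]
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
b = [0.0, 0.0, 0.0, -1.0]

hidden_state = [0.5, 2.0, 1.0]  # decoder output for one position
logits = linear_layer(hidden_state, W, b)
print(dict(zip(toy_vocab, logits)))  # {'cat': 0.5, 'dog': 2.0, 'mat': 1.0, 'sat': 2.5}
```

The output has one entry per vocabulary token, which is the shape the softmax step expects.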
10.2 Softmax 🎲
Softmax is an activation function that converts raw model outputs (called logits) into probabilities.
The probabilities:
- are between $0$ and $1$
- add up to $1$
This makes Softmax useful for:
- classification
- next-token prediction
- choosing the most likely output
Softmax Formula
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
Where:
- $z_i$ = input logit
- $e^{z_i}$ = exponential transformation
- denominator = sum of the exponentials of all logits
How it helps
Softmax amplifies larger values and suppresses smaller ones.
Example logits (illustrative values):
$$[2.0,\ 1.0,\ 0.1]$$
After Softmax (rounded):
$$[0.659,\ 0.242,\ 0.099]$$
The largest logit gets the highest probability.
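Softmax is short enough to write out directly. The sketch below uses illustrative logits chosen for this example and subtracts the maximum logit before exponentiating, a common numerical-stability trick that does not change the result.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits, not taken from any real model.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]

# Greedy decoding picks the index with the highest probability.
predicted_index = probs.index(max(probs))
print(predicted_index)  # 0
```

Picking the highest-probability index is only one decoding strategy; sampling from `probs` is a common alternative.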
