Hitesh Sahu

What are Transformer Models?

Comprehensive overview of transformer models, including their architecture, key components, and their role in powering large language models and generative AI applications.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Tue Feb 24 2026


Transformer Models 📟

Deep learning models that understand relationships within sequences.

  • Paper: "Attention Is All You Need" (Vaswani et al., 2017)

Transformers are the Backbone of LLMs and Generative AI

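Characteristics:

  • Highly parallelizable, which makes them efficient to train on large datasets
  • Capable of capturing long-range dependencies in text

They are the foundation for LLMs and many other NLP models:

  • GPT (decoder-only)
  • BERT (encoder-only)
  • T5 and BART (encoder-decoder)

CNNs and MLPs are separate architectures, not built on transformers, although an MLP does appear inside every transformer block as the feed-forward network.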

🤖 Transformer Encoder-Decoder Architecture

flowchart TD

    %% =========================
    %% Encoder
    %% =========================

    subgraph Encoder["🧠 Encoder"]


subgraph EncoderEmbedder["Encoder Embedder"]
          A[📝 Input Tokens]
          B[🔤 Token Embedding]
          C[📍 Positional Encoding]
          A --> B --> C
        end
        
        D>🎯 Multi-Head Self Attention]
        E[➕ Add & Layer Norm]
        F[[⚙️ Feed Forward Network]]
        G[➕ Add & Layer Norm]

        EncoderEmbedder --> D
        D --> E --> F --> G
    end

    %% Encoder Output
    H{{📦 Encoder Output}}
    G --> H


    %% =========================
    %% Decoder
    %% =========================
    

    subgraph Decoder["🧩 Decoder"]

        subgraph DecoderEmbedder[" 📤 Decoder Embedder"]
          O[📤 Previous Output Tokens]
          P[🔤 Token Embedding]
          Q[📍 Positional Encoding]
          
          O --> P --> Q
        end

        I>🚫 Masked Multi-Head Self Attention]
        J[➕ Add & Layer Norm]
        K>🔄 Cross Attention]
        L[➕ Add & Layer Norm]
        M[[⚙️ Feed Forward Network]]
        N[➕ Add & Layer Norm]

        DecoderEmbedder -->  I
        I --> J --> K --> L --> M --> N

    end

    %% Cross Attention
    H --> K

    %% Final Prediction
    N --> R[📈 Linear Layer]
    R --> S[🎲 Softmax]
    S --> T[✅ Predicted Token]

Transformer Key Components

1. Tokenization 🔠

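Break text into tokens, usually subword units.

Example: unbelievable → un + believ + able (illustrative; the actual split depends on the tokenizer)

Why? Subwords keep the vocabulary small while still covering rare and unseen words.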

Common tokenizers:

  • BPE
  • WordPiece
  • SentencePiece
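
To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The vocabulary `toy_vocab` and the function name are assumptions made up for illustration; real tokenizers learn their vocabulary from a corpus and differ in detail.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching subword pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical hand-written vocabulary, not taken from any real tokenizer.
toy_vocab = {"un", "##believ", "##able", "play", "##ing"}

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
```

Five vocabulary entries cover both words here, which is the vocabulary-size benefit in miniature.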

Text Preprocessing

Stemming and lemmatization both aim to reduce words to a simpler base form.

| Technique | Output Quality | Speed | Uses Dictionary | Context Aware |
| --- | --- | --- | --- | --- |
| Stemming | Rough root | Faster | No | No |
| Lemmatization | Real word | Slower | Yes | Yes |

1. Stemming

Stemming removes prefixes or suffixes using heuristic rules.

  • Reduces words to a root form by chopping off word endings (e.g., "running" → "run").
  • The resulting word may NOT be a valid dictionary word.
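
The heuristic-rule idea can be sketched with a toy suffix stripper. The rules below are invented for illustration and are far simpler than the Porter algorithm; `toy_stem` is a hypothetical name, not a library function.

```python
def toy_stem(word):
    """Strip one common suffix using fixed rules; no dictionary, no grammar."""
    # (suffix, replacement) pairs, checked in order. Illustrative rules only.
    rules = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
    for suffix, replacement in rules:
        # Require at least 3 characters to remain so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)] + replacement
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


for w in ["playing", "played", "studies", "running"]:
    print(w, "->", toy_stem(w))
```

Because the rules never consult a dictionary, "studies" becomes "studi", which is not a real word.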

Stemming Example

| Word | Stemmed Output |
| --- | --- |
| playing | play |
| played | play |
| studies | studi |
| running | run |

Common Algorithms

| Algorithm | Description |
| --- | --- |
| Porter Stemmer | Most popular |
| Snowball Stemmer | Improved Porter |
| Lancaster Stemmer | Aggressive stemming |

Example

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Porter stemming strips the suffix by rule, so the result need not be a real word
print(stemmer.stem("studies"))

Output:

studi

Advantages of Stemming

  • Fast: rule-based, not grammar-based
  • Lightweight: low computational cost
  • Good for search systems

Limitations of Stemming

  • Produces invalid words, for example studi
  • Less accurate: ignores context and grammar
  • Aggressive reduction: may lose meaning

2. Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma) using vocabulary and morphological analysis.

  • It considers the context and part of speech to produce valid words (e.g., "running" → "run", "better" → "good").
  • The resulting word is always a valid dictionary word.

| Word | Lemmatized Output |
| --- | --- |
| playing | play |
| studies | study |
| better | good |
| running | run |

Example:

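import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet corpus; download it once
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

# The default part of speech is noun; pass pos="a" for adjectives, e.g. "better" → "good"
print(lemmatizer.lemmatize("studies"))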

Output:

study

When to Use Lemmatization

Use lemmatization when:

  • accuracy matters
  • semantic understanding is important
  • NLP tasks require context awareness

2. Word Embedding 🔢

A technique that represents words as dense vectors in a continuous vector space, capturing semantic relationships between words.

  • Each word is represented as a high-dimensional vector

    Convert token IDs → vectors (numbers with meaning)

  • Similar words have similar vector representations
  • Enables models to understand context and relationships between words

Examples:

  • Word2Vec
  • GloVe
  • FastText

graph TD
A[Token IDs] --> B[Embedding Layer]
B --> C[Word Vectors]   
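
The lookup in the diagram can be sketched as a table indexed by token ID. The vocabulary, the dimension, and the random values are all assumptions for illustration; in a trained model the table rows are learned, which is what makes similar words end up with similar vectors.

```python
import math
import random

random.seed(0)  # reproducible toy values

# Hypothetical vocabulary: token -> token ID
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
embedding_dim = 4

# Embedding table: one row per token ID. Random here, learned in a real model.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]


def cosine_similarity(a, b):
    """Similarity of two vectors: 1 means same direction, -1 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


token_ids = [vocab[t] for t in ["the", "cat", "sat"]]
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))  # 3 4
print(cosine_similarity(vectors[1], vectors[2]))
```

Cosine similarity is the usual way to compare embeddings; with random rows the score carries no meaning, but the mechanics are the same as with learned vectors.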

Common techniques

1. Word2Vec: Static Embeddings

Word2Vec uses a shallow neural network to learn semantic relationships from text, producing real-valued vectors in which words with similar meanings lie close together.

Static Representation: Word2Vec generates static embeddings, meaning every word in its vocabulary is assigned exactly one fixed vector regardless of how it is used in a sentence.

  • Lack of Context: Because the embeddings are static, the model cannot differentiate between different meanings of the same word
  • For example, Word2Vec would produce the exact same vector for the word "bank" in both "river bank" and "investment bank"

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "barked"]]

# min_count=1 keeps every word, which is needed for a corpus this small
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Look up the single fixed 100-dimensional vector for "cat"
word_vector = model.wv['cat']

print(word_vector)

WordNet vs Word2Vec

WordNet is a hand-crafted lexical database rather than a trained model.

  • WordNet mimics human logic, focusing on word senses and connections between real-world entities.

| Aspect | WordNet | Word2Vec |
| --- | --- | --- |
| Type | Hand-crafted lexical-semantic database | Neural embedding model |
| Built By | Linguists and manual curation | Machine learning from text corpora |
| Representation | Symbolic semantic network | Dense numerical vectors |
| Focus | Word senses and semantic relations | Distributional similarity from usage |
| Meaning Handling | Separates multiple senses explicitly | Classic model merges senses into one vector |
| Interpretability | Human-readable and inspectable | Latent vectors, not directly interpretable |
| Semantic Relations | Explicit: hypernyms, hyponyms, meronyms, synonyms | Implicit statistical relationships |
| Algebraic Operations | Not supported naturally | Supports vector arithmetic |
| Example Capability | dog → animal hierarchy | king - man + woman ≈ queen |
| Training | Static curated resource | Trainable on any corpus |
| Domain Adaptation | Limited manual expansion | Easily adapted to domain-specific corpora |
| Cross-Lingual Support | Linked multilingual lexical networks | Possible through aligned multilingual embeddings |
| Scalability | Limited by manual maintenance | Scales with data and compute |
| Knowledge Source | Human knowledge engineering | Statistical language patterns |
| AI Paradigm | Symbolic AI | Statistical AI |
| Strength | Precise semantic structure | Captures usage patterns from large corpora |
| Weakness | Limited coverage and flexibility | Weak interpretability and sense ambiguity |
| Modern Role | Useful for semantic reasoning and NLP resources | Historical foundation for modern embeddings |
| Successors / Evolution | Semantic graphs and ontologies | Contextual embeddings like BERT and GPT |
| Analogy | Curated semantic encyclopedia | Geometric map of language usage |
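
The vector arithmetic behind "king - man + woman ≈ queen" can be sketched with hand-picked toy vectors. The two dimensions (roughly "royalty" and "gender") and all values are assumptions chosen so the arithmetic works out; real Word2Vec vectors are learned and have no labelled axes.

```python
import math

# Hand-picked 2-D toy vectors: (royalty, gender). Not learned embeddings.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```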

2. BERT: Contextualized Embeddings

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that generates contextualized word embeddings, meaning the vector representation of a word can change based on the context in which it appears.

Paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018)

Contextualized Representation: BERT reads the words on both sides of a token, so the representation it produces reflects the token's specific surroundings.

Nuanced Understanding: Using the same example, BERT would produce different vectors for the word "bank" in "river bank" versus "investment bank", because each vector is computed from the surrounding context.

Model Category: BERT is categorized as an autoencoding model, which makes it highly effective for language understanding tasks like text classification and question answering

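import torch
from transformers import BertTokenizer, BertModel

sentence = "The bank is on the river."

# Load the pretrained tokenizer and model (downloaded on first use)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

inputs = tokenizer(sentence, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per token, including [CLS] and [SEP]
word_embeddings = outputs.last_hidden_state
print(word_embeddings)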

3. Doc2Vec

Doc2Vec is an extension of Word2Vec that generates vector representations for entire documents or paragraphs, rather than just individual words.

  • Document-Level Representation: Doc2Vec captures the overall meaning of a document, making it useful for tasks like document classification, sentiment analysis, and information retrieval
  • It learns to represent documents in a continuous vector space, allowing for comparisons between documents based on their semantic content

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Sample documents
documents = [
    "I love machine learning.",
    "Transformers are powerful models.",
    "Natural language processing is fascinating.",
]

# Tag each document with its index so the model can learn one vector per document
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]

# Train the Doc2Vec model
model = Doc2Vec(tagged_docs, vector_size=50, window=2, min_count=1, workers=4)

# Infer a vector for a tokenized document (same whitespace tokenization as above)
doc_vector = model.infer_vector(["I", "love", "machine", "learning."])
print(doc_vector)

3. Positional Encoding ↗️

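Adds order information.

Self-attention on its own treats the input as an unordered set of tokens, so transformers do NOT understand sequence order naturally. Position information is injected explicitly by adding a position-dependent vector to each token embedding.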
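
One common way to inject position is the fixed sinusoidal scheme from "Attention Is All You Need". The text above does not say which scheme is meant, so treat this as one possible choice; the function name is an assumption, and many models use learned position embeddings instead.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table: sine on even dimensions, cosine on odd."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sin/cos pair (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)

# The encoding is added element-wise to the token embedding at each position.
token_embedding = [0.1] * 8  # placeholder embedding for the token at position 2
with_position = [e + p for e, p in zip(token_embedding, pe[2])]
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in alternation
```

Every position gets a distinct pattern, so two identical tokens at different positions no longer look identical to the attention layers.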

4. 👁️ Self-Attention (Extremely Important)

Self-attention lets the model weigh the importance of every other word in a sentence when processing each word.

It allows the model to:

  • Look at all words at once
  • Determine which words are important
  • Assign a weight to each word based on its relevance to the word currently being processed

Example: "The movie had a slow start but was amazing."

For sentiment, the model can learn to weight “amazing” more heavily, because the contrast word “but” signals that the second clause carries the verdict.
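
The weighting step can be sketched as scaled dot-product attention. This is a simplified sketch: the token vectors themselves serve as queries, keys, and values, whereas a real transformer first applies learned projection matrices. The input values are made up for illustration.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d_k = len(X[0])
    outputs, weights = [], []
    for query in X:
        # Relevance of every token to the current one.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in X
        ]
        w = softmax(scores)
        # Output is the weighted average of all value vectors.
        out = [sum(w_j * value[d] for w_j, value in zip(w, X)) for d in range(d_k)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Three toy token vectors (assumed values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of `weights` sums to 1 and says how much one token draws from every token in the sequence, itself included.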

5. 👀 Multi-Head Attention

Enables the model to focus on different parts of the input simultaneously.

Multiple attention mechanisms run in parallel, and each head can learn a different kind of relationship, for example:

  • Syntax
  • Emotion
  • Topic
  • Long-distance dependencies

Q: Why multi-head instead of single head?

A: To learn multiple representation subspaces simultaneously.
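
The "multiple subspaces" idea can be sketched by slicing each token vector into per-head chunks, running attention on each chunk independently, and concatenating the results. This sketch deliberately omits the learned projection matrices (per-head query, key, and value projections plus the output projection) that a real multi-head layer has; the function names and inputs are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        outputs.append(
            [sum(w_j * v[d] for w_j, v in zip(w, V)) for d in range(len(V[0]))]
        )
    return outputs


def multi_head_self_attention(X, num_heads):
    """Split features into heads, attend per head, concatenate (no projections)."""
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        X_h = [row[lo:hi] for row in X]  # this head's slice of every token
        head_outputs.append(attention(X_h, X_h, X_h))
    # Concatenate the heads back into d_model features per token.
    return [
        [value for head in head_outputs for value in head[t]]
        for t in range(len(X))
    ]


X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 1.0]]
output = multi_head_self_attention(X, num_heads=2)
print(len(output), len(output[0]))  # 3 4
```

Each head computes its own attention weights from its own slice, so different heads can attend to different tokens for the same position.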

6. 🧠 Multilayer Perceptron (MLP)

A feedforward neural network made of fully connected layers with nonlinear activation functions.

  • In a transformer, the MLP processes the output of the attention mechanism within each block.
  • Attention captures relationships; the MLP transforms those relationships into meaningful representations for the next layer or the final output.

Feed Forward Network

The feed-forward network is the MLP inside each transformer block, applied to every position independently.

It typically consists of two linear transformations with a nonlinear activation function (like ReLU) in between.

graph LR
A[Attention Output] --> B[Linear Layer 1]
B --> C[ReLU Activation]
C --> D[Linear Layer 2]
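
The Linear → ReLU → Linear pipeline in the diagram can be sketched for a single token vector. The weights below are small hand-picked numbers so the result can be checked by hand; in a real model they are learned, and the hidden layer is usually several times wider than the input.

```python
def linear(x, W, b):
    """Compute y = Wx + b, with W stored as one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: Linear -> ReLU -> Linear."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


# Assumed toy weights: expand 2 features to 3, then project back to 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b2 = [0.5, 0.0]

x = [1.0, 2.0]
y = feed_forward(x, W1, b1, W2, b2)
print(y)  # [3.5, -1.0]
```

The hidden activations are [1, 2, -3] before ReLU and [1, 2, 0] after it, which is where the nonlinearity enters.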

Transformers stack multiple layers of attention and MLP to build deep representations of the input data, enabling them to perform complex tasks like language understanding and generation.

7. 🔣 Encoder (BERT)

  • Processes the input sequence
  • Learns contextual representations
  • Uses self-attention to understand relationships between words

Use cases

  • Understand text
  • Classification
  • Sentiment
  • Search ranking

8. 💬 Decoder

  • Generates output tokens one by one
  • Uses masked attention to hide future tokens
  • Uses cross-attention to focus on encoder outputs (encoder-decoder models only; decoder-only models such as GPT have no encoder to attend to)
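
Masked attention can be sketched as setting the score of every future position to negative infinity before the softmax, so those positions receive zero weight. The score matrix below is made up for illustration; in a real decoder it comes from query-key dot products.

```python
import math

NEG_INF = float("-inf")


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Apply a causal mask: position i may only attend to positions j <= i."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else NEG_INF for j in range(n)] for i in range(n)
    ]
    return [softmax(row) for row in masked]


# Assumed raw attention scores for a 3-token sequence.
scores = [
    [0.5, 1.0, 2.0],
    [1.0, 0.2, 0.3],
    [0.1, 0.4, 0.9],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, so its row is [1.0, 0.0, 0.0], and everything above the diagonal is zero.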

Use Case

  • Generate text
  • Chat
  • Story writing
  • Code

Encoder-Decoder (T5, BART)

If the task requires both understanding the input and generating new text, use an encoder-decoder model.

Use Case

  • Translation
  • Summarization
  • Text transformation

9. Add & LayerNorm 📐

It consists of 2 parts

9.1. Residual connection (skip connection)

Allows the input of a layer to bypass some operations and be added directly to the output.

Instead of learning the full mapping

$$H(x)$$

the network learns the residual:

$$F(x) = H(x) - x$$

The original input is preserved and added back after the transformation, so the final output becomes:

$$H(x) = F(x) + x$$

or, as more commonly written:

$$\text{Output} = x + \text{Sublayer}(x)$$

This helps:

  • prevent vanishing gradients
  • stabilize deep networks
  • improve training speed
  • retain important information

9.2. Layer Normalization

Layer Normalization normalizes the activations of a layer across the feature dimension, separately for each individual training example (in a transformer, for each token).

It transforms the activations so they have:

  • mean approximately $0$
  • variance approximately $1$

Most values then fall within a few standard deviations of zero, which is why ranges like $-3$ to $+3$ are commonly observed.

It helps:

  • stabilize training
  • speed up convergence
  • reduce sensitivity to initialization
  • prevent activations from becoming too large or too small
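
The two parts can be combined in a short sketch: add the sublayer output to its input, then normalize the sum. This follows the post-norm ordering of the original transformer paper and leaves out the learned scale and shift parameters that real layer-norm layers include; the stand-in sublayer is an assumption.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)


def halve(v):
    # Stand-in for attention or the feed-forward network.
    return [0.5 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, halve)

mean = sum(out) / len(out)
var = sum((v - mean) ** 2 for v in out) / len(out)
print(round(mean, 6), round(var, 3))
```

The small `eps` keeps the division safe when a vector has near-zero variance, which is why the resulting variance is very slightly below 1.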

10. Final Prediction 🎲

The decoder output passes through a linear layer and then softmax:

10.1 Linear Layer 📈

A Linear Layer is a neural network layer that performs a weighted transformation of the input.

It is also called:

  • Fully Connected Layer
  • Dense Layer

The layer learns:

  • weights
  • biases

to transform input features into new representations.

Linear Layer Formula

$$y = Wx + b$$

Where:

  • $x$ = input vector
  • $W$ = weight matrix
  • $b$ = bias vector
  • $y$ = output vector

How it Helps?

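The linear layer:

  • mixes information from the input features
  • learns important patterns
  • changes feature dimensions; in the final prediction step it maps the decoder's hidden vector to one logit per vocabulary token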
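
As a sketch of the dimension change, the snippet below applies y = Wx + b to turn a 2-dimensional hidden vector into 3 logits, one per entry of a made-up vocabulary. The weights, bias, and vocabulary are assumptions for illustration; real output layers have one row per token in a vocabulary of tens of thousands.

```python
def linear(x, W, b):
    """Compute y = Wx + b, with W stored as one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


# Hypothetical 3-token vocabulary and hand-picked weights.
vocabulary = ["cat", "dog", "mat"]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape: vocabulary size x hidden size
b = [0.0, 0.5, -1.0]

hidden = [1.0, 2.0]  # decoder output for the current position
logits = linear(hidden, W, b)
print(dict(zip(vocabulary, logits)))  # {'cat': 1.0, 'dog': 2.5, 'mat': 2.0}
```

The logits are raw scores, not probabilities, and they do not sum to 1.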

10.2 Softmax 🎲

Softmax is an activation function that converts raw model outputs (called logits) into probabilities.

The probabilities:

  • are between $0$ and $1$
  • add up to $1$

This makes Softmax useful for:

  • classification
  • next-token prediction
  • choosing the most likely output

Softmax Formula

$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

Where:

  • $x_i$ = input logit
  • $e^{x_i}$ = exponential transformation
  • denominator = sum of the exponentials of all $n$ logits

How it helps

Softmax amplifies larger values and suppresses smaller ones.

Example logits:

$$[2.0,\ 1.0,\ 0.1]$$

After Softmax:

$$[0.66,\ 0.24,\ 0.10]$$

The largest logit gets the highest probability.
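
The worked example can be reproduced in a few lines. Subtracting the maximum logit before exponentiating is a standard numerical-stability step that does not change the result; it is an addition here, not part of the formula above.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # stability: avoids overflow for large logits
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]

# Greedy decoding picks the index of the highest probability.
predicted_index = probs.index(max(probs))
print(predicted_index)  # 0
```

Picking the highest-probability token is only one decoding strategy; sampling from the distribution is another.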
