
What are Transformer Models?

Comprehensive overview of transformer models, including their architecture, key components, and their role in powering large language models and generative AI applications.

Written by Hitesh Sahu, a passionate developer and blogger.

Tue Feb 24 2026


Transformer Models 📟

Deep learning models that understand relationships within sequences.

  • Paper: "Attention Is All You Need" (Vaswani et al., 2017)
  • Examples: BERT, GPT-3, T5
  • Transformers are the backbone of LLMs and generative AI

Key Components:

  • Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence when making predictions.
  • Multi-Head Attention: Enables the model to focus on different parts of the input simultaneously.
  • Feed-Forward Networks: Process the output of the attention mechanism to produce the final output.

Characteristics:

  • Highly parallelizable, making transformers efficient to train on large datasets
  • Capable of capturing long-range dependencies in text
  • Now the standard architecture for LLMs and many other NLP tasks

Transformer Key Components

Tokenization 🔠

Break text into tokens.

Example: unbelievable → un + believe + able

Why? Subword tokens keep the vocabulary small while still covering rare and unseen words.

Common tokenizers:

  • BPE
  • WordPiece
  • SentencePiece
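
To make the idea concrete, here is a toy sketch of a greedy subword split against a small hypothetical vocabulary (real BPE/WordPiece tokenizers learn their vocabulary and merge rules from a corpus, so this only illustrates the principle):

```python
# Toy subword tokenizer: greedy longest-match against a hypothetical vocabulary.
# Real BPE/WordPiece tokenizers learn their vocabulary from data.
VOCAB = {"un", "believ", "able", "a", "b", "e", "i", "l", "n", "u", "v"}

def tokenize(word: str) -> list[str]:
    tokens, start = [], 0
    while start < len(word):
        # Try the longest possible piece first, then fall back to shorter ones.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in VOCAB:
                tokens.append(piece)
                start = end
                break
        else:
            tokens.append(word[start])  # unknown character: emit as-is
            start += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```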

Word Embedding 🔢

Technique used to represent words as dense vectors in a continuous vector space, capturing semantic relationships between words.

  • Each token ID is converted to a dense, high-dimensional vector (numbers with meaning)
  • Similar words have similar vector representations
  • Enables models to understand context and relationships between words

Examples:

  • Word2Vec
  • GloVe
  • FastText

graph TD
A[Token IDs] --> B[Embedding Layer]
B --> C[Word Vectors]   
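
As a rough sketch of the diagram above: an embedding layer is just a lookup table from token IDs to vectors. The table here is random for illustration; in a real model it is learned during training.

```python
import numpy as np

vocab_size, d_model = 1000, 8            # toy sizes; real models use tens of thousands x hundreds
embedding_table = np.random.randn(vocab_size, d_model) * 0.02  # learned during training

token_ids = np.array([5, 42, 7])         # output of the tokenizer (hypothetical IDs)
word_vectors = embedding_table[token_ids]  # simple row lookup

print(word_vectors.shape)  # (3, 8) -> one d_model-dimensional vector per token
```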

Positional Encoding ↗️

Adds order information.

Transformers do NOT understand sequence order naturally: self-attention treats the input as an unordered set of tokens, so position information has to be injected explicitly.
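
One common choice, used in the original "Attention Is All You Need" paper, is a fixed sinusoidal encoding that is simply added to the token embeddings. A minimal numpy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
# x = word_vectors + pe[:len(word_vectors)]  -> embeddings now carry position information
```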

Self-Attention (Extremely Important)

Allows model to:

  • Look at all words at once
  • Determine which words are important

Example: "The movie had a slow start but was amazing."

The model attends more strongly to "amazing", because the contrast word "but" signals that the second clause carries the overall sentiment.
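
A minimal numpy sketch of single-head scaled dot-product self-attention (random weights stand in for learned projections): every token is projected to query, key and value vectors, and a softmax over query-key similarities decides how strongly each token attends to every other token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)        # each row: how much a token attends to the others
    return weights @ V                        # weighted mix of value vectors

seq_len, d_model, d_k = 9, 16, 16             # e.g. "The movie had a slow start but was amazing."
x = np.random.randn(seq_len, d_model)         # token embeddings + positional encoding
Wq, Wk, Wv = (np.random.randn(d_model, d_k) * 0.1 for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)           # (9, 16)
```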

Multi-Head Attention

Multiple attention mechanisms running in parallel.

Each head learns different relationships:

  • Syntax
  • Emotion
  • Topic

  • Long-distance dependencies

Q: Why multi-head attention instead of a single head? A: To learn multiple representation subspaces simultaneously.
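
A rough sketch of the same idea with multiple heads: each head attends over its own lower-dimensional projection of the input, and the head outputs are concatenated and projected back to the model dimension (again with random weights standing in for learned ones).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads=4):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Each head gets its own projections (random here, learned in a real model).
        Wq, Wk, Wv = (np.random.randn(d_model, d_head) * 0.1 for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        heads.append(weights @ V)                      # (seq_len, d_head)
    Wo = np.random.randn(d_model, d_model) * 0.1       # output projection
    return np.concatenate(heads, axis=-1) @ Wo         # (seq_len, d_model)

x = np.random.randn(9, 16)
print(multi_head_attention(x).shape)   # (9, 16)
```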

Multilayer Perceptron (MLP)

A feedforward neural network built from fully connected layers of neurons with nonlinear activation functions.

  • MLP is used in transformers to process the output of the attention mechanism and produce the final output.
  • Attention captures relationships, MLP transforms those relationships into meaningful representations for the next layer or final output.

MLP typically consists of two linear transformations with a nonlinear activation function (like ReLU) in between.

graph LR
A[Attention Output] --> B[Linear Layer 1]
B --> C[ReLU Activation]
C --> D[Linear Layer 2]
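
In code, the feed-forward block from the diagram above is only a few lines. A numpy sketch with random weights (real transformers usually make the hidden layer roughly four times wider than the model dimension):

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to each position independently."""
    hidden = np.maximum(0, x @ W1 + b1)   # Linear Layer 1 + ReLU
    return hidden @ W2 + b2               # Linear Layer 2

d_model, d_hidden = 16, 64                # hidden layer typically ~4x wider than d_model
W1, b1 = np.random.randn(d_model, d_hidden) * 0.1, np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model) * 0.1, np.zeros(d_model)

attention_output = np.random.randn(9, d_model)   # stand-in for the attention output
out = mlp(attention_output, W1, b1, W2, b2)      # (9, 16)
```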

Transformers stack multiple layers of attention and MLP to build deep representations of the input data, enabling them to perform complex tasks like language understanding and generation.
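
A heavily simplified sketch of such a stack, including the residual connections that real implementations add around each sub-layer (layer normalization and learned weights are omitted for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(x, d_model):
    """One simplified layer: self-attention followed by an MLP, each with a residual add.
    Real layers also apply layer normalization and use learned (not random) weights."""
    Wq, Wk, Wv = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
    attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d_model)) @ (x @ Wv)
    x = x + attn                                            # residual connection
    W1 = np.random.randn(d_model, 4 * d_model) * 0.1
    W2 = np.random.randn(4 * d_model, d_model) * 0.1
    x = x + np.maximum(0, x @ W1) @ W2                      # MLP + residual connection
    return x

x = np.random.randn(9, 16)          # embeddings + positional encoding
for _ in range(6):                  # stack N layers to build deeper representations
    x = transformer_layer(x, d_model=16)
```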


Encoder vs Decoder

Encoder-only (BERT)

If task = understand input → Encoder-only

  • Understand text
  • Classification
  • Sentiment
  • Search ranking

Decoder-only (GPT)

If task = generate output → Decoder-only

  • Generate text
  • Chat
  • Story writing
  • Code

Encoder-Decoder (T5, BART)

If task = both understand + generate → Encoder-Decoder

  • Translation
  • Summarization
  • Text transformation
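
As a practical illustration of this split, the Hugging Face transformers library exposes all three families through its pipeline API. The checkpoints below are just common example models (weights are downloaded on first use), not a recommendation:

```python
from transformers import pipeline

# Encoder-only (BERT-style): understand input -> classification / sentiment
classifier = pipeline("sentiment-analysis")   # defaults to a DistilBERT checkpoint
print(classifier("The movie had a slow start but was amazing."))

# Decoder-only (GPT-style): generate output -> chat, stories, code
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_length=20)[0]["generated_text"])

# Encoder-decoder (T5-style): understand + generate -> translation, summarization
summarizer = pipeline("summarization", model="t5-small")
text = "Transformers stack attention and MLP layers to build deep representations of sequences."
print(summarizer(text, max_length=30)[0]["summary_text"])
```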