
Retrieval-Augmented Generation (RAG) for AI Applications

Comprehensive guide to Retrieval-Augmented Generation, covering architecture, embeddings, vector databases, document indexing, retrieval strategies, and best practices for building production-ready RAG systems.

Written by Hitesh Sahu, a passionate developer and blogger.

Tue Feb 24 2026


🧼 RAG (Retrieval-Augmented Generation)

What is RAG?

RAG helps an LLM generate answers that are grounded in knowledge retrieved from a vector DB of existing documents.

RAG combines two components:

  1. Retriever
  2. Generator (LLM)

Core idea:

$$\text{LLM} + \text{Retrieval} = \text{Useful AI System}$$

Without RAG, we ask the LLM directly:

User → LLM → Answer

With RAG, we add a context retrieval step to improve the answer:

$$\text{Query} \rightarrow \text{Embedding} \rightarrow \text{Vector Search} \rightarrow \text{Context Retrieval} \rightarrow \text{LLM Generation}$$

RAG is becoming the default architecture for AI products. But building it reliably in production requires careful engineering.

Advantages of RAG

  • Access private knowledge: This allows models to answer questions about private or up-to-date data.
  • Reduce hallucinations: By grounding the model in retrieved documents, it reduces the chance of generating false information.
  • Stay up-to-date: Give the model access to external knowledge.

How RAG Works

RAG works in three steps:

  1. Search relevant documents for an answer
  2. Insert retrieved text into the prompt
  3. Generate the answer from the updated prompt.

Given:

$$q = \text{user query}$$

and $D$ represents the document set:

$$D = \{d_1, d_2, \dots, d_n\}$$

The RAG system retrieves the most relevant document:

$$d^* = \arg\max_{d_i \in D} \; \text{similarity}(q, d_i)$$

Then the LLM generates a response conditioned on $q$ and $d^*$.

flowchart TD
    Q[User question ❓] --> R1[Retrieve relevant documents]
    R1 --> R2[Insert retrieved context into prompt ℹ️]
    R2 --> LLM[LLM generates answer 📄]
    LLM --> A[Grounded response 💬]

Conceptually, the prompt becomes:

$$\text{Prompt} = \text{Instruction} + \text{Retrieved Context} + \text{Question}$$

For example, if a user asks about office parking rules, the retrieved context is the parking policy document:

$$\text{Answer} = \text{LLM}(\text{Instruction} + \text{Parking Policy} + \text{Question})$$

This is powerful because the LLM is being used more as a reasoning engine than as a pure source of facts.

It reads the relevant text and uses it to formulate an answer.
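
To make those three steps concrete, here is a minimal Python sketch of the whole loop. The `retrieve` and `llm` helpers are hypothetical placeholders for the retriever and generator; the step-by-step guide below shows real implementations.

def answer_question(query: str) -> str:
    # 1. Search relevant documents for an answer (hypothetical retriever helper)
    context_chunks = retrieve(query, k=3)
    context = "\n\n".join(context_chunks)

    # 2. Insert retrieved text into the prompt
    prompt = (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}"
    )

    # 3. Generate the answer from the updated prompt (hypothetical LLM wrapper)
    return llm(prompt)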

Building a Production RAG System Step-by-Step

Large Language Models are powerful, but they have one major limitation: they don't know your private data.

If you ask a model about your company docs, support tickets, or internal knowledge base, it will hallucinate or say it doesn't know.

Retrieval Augmented Generation (RAG) solves this.

Instead of relying only on the model's training data, we retrieve relevant documents at query time and inject them into the prompt.

In this post we'll walk through how to build a production RAG system step-by-step, including architecture, scaling concerns, and engineering tradeoffs.

Step 1 – Data Collection 📂

Your RAG system is only as good as the documents you feed it.

We need to collect and index all relevant documents that the model can retrieve from.

At query time, RAG searches this document vector DB to find relevant context for the user query.

Typical sources:

  • Confluence Pages
  • Slack threads
  • GitHub repos
  • Product docs & Wiki
  • Policy Docs eg. PDFs

Example pipeline:

flowchart TD
    A[Data Sources 📚] --> B[Document Loader 📕]
    B --> C[Text Cleaning 📖]
    C --> D[Chunking 📑]

Python example:

from langchain.document_loaders import PyPDFLoader

# Load the PDF; the loader returns one Document per page
loader = PyPDFLoader("docs/architecture.pdf")
documents = loader.load()
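
To index more than a single file, a directory loader can walk a whole folder; the path and glob pattern below are placeholders for your own document tree:

from langchain.document_loaders import DirectoryLoader, PyPDFLoader

# Load every PDF under docs/ (path and glob are placeholders)
loader = DirectoryLoader("docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()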

Step 2 – Chunking Documents 📑

Instead of embedding an entire document, we split it into chunks.

LLMs have context limits (e.g. 4k tokens), so we need to break documents into smaller pieces.

Chunk Size             Tradeoff
Small (200 tokens)     Better retrieval precision
Large (1000 tokens)    More context per chunk

A common heuristic:

$$\text{chunk\_size} = 300 \text{ tokens}, \qquad \text{overlap} = 50 \text{ tokens}$$

where the overlap helps maintain context across chunk boundaries.

Example:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50
)

chunks = splitter.split_documents(documents)

Step 3 – Embedding the Data ↗️

Embeddings convert text into vectors.

  • Also called vectorization or encoding.
  • Converts text into a high-dimensional vector that captures semantic meaning.

Example embedding:

"What is Kubernetes?"
→ [0.12, -0.44, 0.88, ...]

Similar meaning → similar vectors.

Example code:

from openai import OpenAI

client = OpenAI()

embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is Kubernetes?"
)

# The vector itself is in the response payload
vector = embedding.data[0].embedding

Step 4 – Store in a Vector Database 🔒

Embeddings must be stored in a vector index.

Popular Vector DB options:

Database   Use Case
Pinecone   Fully managed vector database
Weaviate   Open-source vector DB with hybrid (keyword + vector) search
FAISS      Open-source similarity-search library from Meta (runs in-process)
Qdrant     Open-source vector database with payload filtering

Example architecture:

flowchart TD
    C[Chunks 📑] --> E[Embedding Model ↗️]
    E --> V[Vector DB 🔒]

Python:

# Generic vector-store interface; the exact method names vary by database
vector_db.add(
    ids=[chunk_id],
    embeddings=[embedding],
    metadata={"source": "docs"}
)
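
For a concrete open-source option, here is a minimal FAISS sketch; it assumes the chunk embeddings have already been computed and collected in `chunk_embeddings` (a hypothetical list of vectors from Step 3):

import faiss
import numpy as np

# Stack the chunk embeddings into an (n_chunks x dim) float32 matrix
vectors = np.array(chunk_embeddings, dtype="float32")

# Normalize so that inner product equals cosine similarity
faiss.normalize_L2(vectors)

# Build a flat inner-product index and add all vectors
index = faiss.IndexFlatIP(int(vectors.shape[1]))
index.add(vectors)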

Step 5 – Query Time Retrieval 🔎

flowchart TD
    Q[User Query ❓] --> E[Embedding ↗️]
    E --> S[Vector Similarity Search 🔎]
    S --> D[Top-K Documents]
    

Mathematically we search using cosine similarity:
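
$$\text{similarity}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$

where $q$ is the query embedding and $d$ a stored chunk embedding.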

Python:

results = vector_db.search(
    query_embedding,
    k=5
)
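
The retrieved hits then need to be turned into a single context string for the prompt. The exact result shape depends on the vector store; this sketch assumes each hit exposes the original chunk text in its metadata:

# Assumed result shape: each hit carries the chunk text in its metadata
docs = "\n\n".join(hit.metadata["text"] for hit in results)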

Step 6 – Prompt Construction 💬

Now we inject retrieved documents into the prompt.

Example prompt template:


You are a helpful assistant.

Use the context below to answer the question.

Context:
{retrieved_docs}

Question:
{user_query}

Example

prompt = f"""
Answer the question using the context below.

Context:
{docs}

Question:
{query}
"""

Step 7 – Generate Answer with LLM 📃

Now the LLM generates the answer grounded in retrieved knowledge.

flowchart TD
    P[Prompt + Retrieved Context ℹ️] --> LLM[LLM Generation 📃]
    LLM --> A[Answer 💬]

Python:

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}]
)

answer = response.choices[0].message.content

Production Architecture

A scalable RAG architecture looks like this:


                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚   User App  β”‚
                β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
                       β–Ό
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚  API Server β”‚
                β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
             β–Ό                   β–Ό
     Vector Database         LLM API
         (Retrieval)        (Generation)
             β”‚                   β”‚
             β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β–Ό
                 Response

Example Tech Stack

Layer           Tools
Ingestion       Airflow
Embeddings      OpenAI
Vector DB       Pinecone
Orchestration   LangChain
API             FastAPI
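
To tie the stack together, the API layer can be a thin FastAPI endpoint around the retrieval and generation steps shown earlier. The `embed`, `vector_db`, and `generate_answer` names below are hypothetical wrappers, not a fixed API:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    query: str

@app.post("/ask")
def ask(question: Question):
    # Embed the query and retrieve the top-k chunks (hypothetical helpers)
    query_embedding = embed(question.query)
    results = vector_db.search(query_embedding, k=5)

    # Build the prompt and generate the grounded answer
    answer = generate_answer(question.query, results)
    return {"answer": answer}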