

Retrieval-Augmented Generation (RAG) for AI Applications

Comprehensive guide to Retrieval-Augmented Generation, covering architecture, embeddings, vector databases, document indexing, retrieval strategies, and best practices for building production-ready RAG systems.

Written by Hitesh Sahu, a passionate developer and blogger.

Tue Feb 24 2026


RAG (Retrieval-Augmented Generation)

RAG is becoming the default architecture for AI products.

It allows LLMs to:

  • access private knowledge
  • reduce hallucinations
  • stay up-to-date

Core idea:

LLM + Retrieval = Useful AI System

But building it reliably in production requires careful engineering.


What is RAG?

RAG combines two components:

  1. Retriever
  2. Generator (LLM)

Pipeline:

Query → Embedding → Vector Search → Context Retrieval → LLM Generation

Instead of asking the LLM directly:

User → LLM → Answer


we do:


User Query
↓
Embedding Model
↓
Vector Database
↓
Top-K Documents
↓
Prompt + Context
↓
LLM
↓
Answer


The LLM now answers grounded in retrieved knowledge.

Building a Production RAG System Step-by-Step

Large Language Models are powerful, but they have one major limitation: they don't know your private data.

If you ask a model about your company docs, support tickets, or internal knowledge base, it will hallucinate or say it doesn't know.

Retrieval-Augmented Generation (RAG) solves this.

Instead of relying only on the model's training data, we retrieve relevant documents at query time and inject them into the prompt.

In this post we’ll walk through how to build a production RAG system step-by-step, including architecture, scaling concerns, and engineering tradeoffs.


Step 1 — Data Collection

Your RAG system is only as good as the documents you feed it.

Typical sources:

  • PDFs
  • Notion pages
  • Confluence
  • Slack threads
  • support tickets
  • GitHub repos
  • product docs

Example pipeline:


Data Sources
↓
Document Loader
↓
Text Cleaning
↓
Chunking

Python example:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("docs/architecture.pdf")
documents = loader.load()

Step 2 — Chunking Documents

LLMs have context limits.

Instead of embedding an entire document, we split it into chunks.

Example:

Chunk Size           Tradeoff
Small (200 tokens)   Better retrieval
Large (1000 tokens)  More context

A common heuristic:

chunk_size = 300 tokens

with

overlap = 50 tokens

Example:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50
)

chunks = splitter.split_documents(documents)
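To see what the chunk-size/overlap heuristic does, here is a minimal pure-Python chunker over a token list. It is a toy stand-in for the splitter above, not a LangChain API; `chunk_tokens` is a made-up name for illustration.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into overlapping chunks (toy version of the heuristic above)."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = list(range(1000))   # pretend these are 1000 tokens
chunks = chunk_tokens(tokens)
print(len(chunks))           # 4 chunks
print(chunks[1][:2])         # second chunk starts 250 tokens in: [250, 251]
```

Each chunk shares its last 50 tokens with the start of the next one, so sentences that straddle a boundary still appear intact in at least one chunk.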

Step 3 — Embedding the Data

Embeddings convert text into vectors.

Example embedding:

"What is Kubernetes?"
→ [0.12, -0.44, 0.88, ...]

Similar meaning → similar vectors.

Example code:

from openai import OpenAI

client = OpenAI()

embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is Kubernetes?"
)

Step 4 — Store in a Vector Database

Embeddings must be stored in a vector index.

Popular options:

Database  Use Case
Pinecone  managed
Weaviate  hybrid search
FAISS     local
Qdrant    open source

Example architecture:

Chunks
  ↓
Embedding Model
  ↓
Vector DB

Python:

vector_db.add(
    ids=[chunk_id],
    embeddings=[embedding],
    metadata={"source": "docs"}
)
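The `vector_db` object above is pseudocode. A minimal in-memory stand-in with the same add/search shape, ranking by cosine similarity with NumPy, might look like this (`InMemoryVectorDB` is a hypothetical class, not a real library):

```python
import numpy as np

class InMemoryVectorDB:
    """Toy stand-in for a vector database: stores embeddings plus metadata in memory."""

    def __init__(self):
        self.ids, self.vectors, self.metadata = [], [], []

    def add(self, ids, embeddings, metadata):
        for i, v in zip(ids, embeddings):
            self.ids.append(i)
            self.vectors.append(np.asarray(v, dtype=float))
            self.metadata.append(metadata)

    def search(self, query_embedding, k=5):
        # Cosine similarity between the query and every stored vector
        q = np.asarray(query_embedding, dtype=float)
        mat = np.stack(self.vectors)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i]), self.metadata[i]) for i in top]

db = InMemoryVectorDB()
db.add(ids=["c1"], embeddings=[[0.1, 0.9]], metadata={"source": "docs"})
db.add(ids=["c2"], embeddings=[[0.9, 0.1]], metadata={"source": "faq"})
results = db.search([0.0, 1.0], k=1)
print(results[0][0])  # "c1" points in nearly the same direction as the query
```

A real vector database adds approximate-nearest-neighbor indexing so search stays fast at millions of vectors; the interface, however, is essentially this.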

Step 5 — Query Time Retrieval

User Query
   ↓
Embedding
   ↓
Vector Similarity Search
   ↓
Top-K Documents

Mathematically we search using cosine similarity:

sim(q, d) = (q · d) / (‖q‖ ‖d‖)

Python:

results = vector_db.search(
    query_embedding,
    k=5
)
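As a sanity check on the metric itself, here is cosine similarity computed directly with NumPy; the three vectors are made up for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [0.12, -0.44, 0.88]
doc_related = [0.10, -0.40, 0.90]     # points in a similar direction
doc_unrelated = [-0.80, 0.50, -0.10]  # points elsewhere
print(cosine_sim(query, doc_related) > cosine_sim(query, doc_unrelated))  # True
```

Because cosine similarity depends only on direction, not magnitude, it is robust to documents of different lengths producing embeddings of different norms.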

Step 6 — Prompt Construction

Now we inject retrieved documents into the prompt.

Example prompt template:


You are a helpful assistant.

Use the context below to answer the question.

Context:
{retrieved_docs}

Question:
{user_query}

Example:

prompt = f"""
Answer the question using the context below.

Context:
{docs}

Question:
{query}
"""

Step 7 — Generate Answer with LLM

Now the LLM generates the answer grounded in retrieved knowledge.

Prompt + Context
↓
LLM
↓
Answer

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}]
)
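Putting steps 5–7 together, the query path can be sketched as one function. The retriever and the LLM are injected as callables so the flow is testable without live APIs; `build_prompt`, `fake_retrieve`, and `fake_generate` are hypothetical names for this sketch, not part of any library:

```python
def build_prompt(docs, query):
    """Assemble the final prompt from retrieved chunks and the user question."""
    context = "\n\n".join(docs)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}"
    )

def answer(query, retrieve, generate):
    """RAG query path: retrieve -> build prompt -> generate."""
    docs = retrieve(query)
    return generate(build_prompt(docs, query))

# Toy stand-ins for the vector DB search and the LLM call shown earlier:
fake_retrieve = lambda q: ["Kubernetes is a container orchestrator."]
fake_generate = lambda prompt: prompt.splitlines()[-1]  # just echoes the question line
print(answer("What is Kubernetes?", fake_retrieve, fake_generate))
```

In production, `retrieve` wraps the vector database search from step 5 and `generate` wraps the chat-completions call above; keeping them injectable makes the pipeline easy to unit-test and to swap components.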

Production Architecture

A scalable RAG architecture looks like this:


                ┌─────────────┐
                │   User App  │
                └──────┬──────┘
                       │
                       ▼
                ┌─────────────┐
                │  API Server │
                └──────┬──────┘
                       │
             ┌─────────┴─────────┐
             ▼                   ▼
     Vector Database         LLM API
         (Retrieval)        (Generation)
             │                   │
             └───────┬───────────┘
                     ▼
                 Response

Layer          Tools
Ingestion      Airflow
Embeddings     OpenAI
Vector DB      Pinecone
Orchestration  LangChain
API            FastAPI