Retrieval-Augmented Generation (RAG) for AI Applications

Comprehensive guide to Retrieval-Augmented Generation, covering architecture, embeddings, vector databases, document indexing, retrieval strategies, and best practices for building production-ready RAG systems.

Retrieval-Augmented Generation (`RAG`) 🧼

Large Language Models are powerful, but they have one major limitation: they don't know your private data.

If you ask a model about your company docs, support tickets, or internal knowledge base, it will hallucinate or say it doesn't know.

Retrieval-Augmented Generation (RAG) addresses this gap.

Instead of relying only on the model's training data, we retrieve relevant documents at query time and inject them into the prompt.

In this post we’ll walk through how to build a production RAG system step-by-step, including architecture, scaling concerns, and engineering tradeoffs.

What is RAG?

RAG helps an LLM generate answers that are grounded in knowledge retrieved from a vector database of existing documents.

RAG combines two components:

Retriever
Generator (LLM)

Core idea:

LLM + Retrieval = Useful AI System

Without RAG, we ask the LLM directly:

User → LLM → Answer

With RAG we add a context retrieval step to improve the answer:

Query → Embedding → Vector Search → Context Retrieval → LLM Generation

RAG is becoming the default architecture for AI products. But building it reliably in production requires careful engineering.

Advantages of RAG

Access private knowledge: The model can answer questions about private data it was never trained on.
Reduce hallucinations: Grounding the model in retrieved documents reduces the chance of generating false information.
Stay up-to-date: New or changed documents become available to the model as soon as they are indexed, without retraining.

How RAG Works

RAG works in three steps:

Search for documents relevant to the question.
Insert the retrieved text into the prompt.
Generate the answer from the updated prompt.

Given:

$q = \text{user query}$

$D$ represents the document set.

$D = \{d_1, d_2, \ldots, d_n\}$

The RAG system retrieves the most relevant document:

$d^* = \arg\max_{d_i \in D} \; \text{similarity}(q, d_i)$

Then the LLM generates a response conditioned on $q$ and $d^*$.
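The arg max above can be sketched in a few lines of Python. This is only an illustration: the `similarity` function here is a toy word-overlap (Jaccard) score standing in for embedding similarity, and `retrieve_best` and the sample documents are hypothetical.

```python
def similarity(query: str, document: str) -> float:
    """Toy similarity: Jaccard overlap between the two word sets."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    if not q_words or not d_words:
        return 0.0
    return len(q_words & d_words) / len(q_words | d_words)


def retrieve_best(query: str, documents: list[str]) -> str:
    """Return d* = argmax over d_i in D of similarity(q, d_i)."""
    return max(documents, key=lambda d: similarity(query, d))


# Hypothetical document set D and query q
D = [
    "Parking is free for employees on weekends.",
    "The cafeteria opens at 8 am.",
    "Expense reports are due on the first Monday of each month.",
]
q = "Is parking free on weekends?"
d_star = retrieve_best(q, D)
print(d_star)
```

A real system replaces the word-overlap score with a similarity between embedding vectors, and usually returns several documents rather than one.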

flowchart TD
    Q[User question ❓] --> R1[Retrieve relevant documents 📁]
    R1 --> R2[Insert retrieved context into prompt ℹ️]
    R2 --> LLM[LLM generates answer 📄]
    LLM --> A[Grounded response 💬]

Conceptually, the prompt becomes:

$\text{Prompt} = \text{Instruction} + \text{Retrieved Context} + \text{Question}$

For example:

$\text{Answer} = \text{LLM}(\text{Instruction} + \text{Parking Policy} + \text{Question})$

This is powerful because the LLM is being used more as a reasoning engine than as a pure source of facts.

It reads the relevant text and uses that text to formulate an answer.

Building a Production RAG System

Step 1 — Data Collection 📂

Your RAG system is only as good as the documents you feed it.

We need to collect and index all relevant documents that the model can retrieve from.

RAG will search the vector database built from these documents to find relevant context for the user query.

Typical sources:

Confluence Pages
Slack threads
GitHub repos
Product docs & Wiki
Policy docs, e.g. PDFs

Example pipeline:

flowchart TD
    A["Data Sources 📚"] --> B["Document Loader 📕"]
    B --> C[Text Cleaning 📖]
    C --> D[Chunking 📑]

Python example:

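# Requires: pip install langchain-community pypdf
# (older releases exposed this loader as langchain.document_loaders)
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; each page becomes one Document with text and metadata
loader = PyPDFLoader("docs/architecture.pdf")
documents = loader.load()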
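The text cleaning step in the pipeline is not shown in the loader example. A minimal sketch of what it might do, assuming the main problems are PDF extraction artifacts such as ligatures, words hyphenated across line breaks, and stray whitespace (the `clean_text` helper is hypothetical):

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize extracted text before chunking."""
    # Unify ligatures and width variants (e.g. the single glyph "fi")
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words that were hyphenated across a line break
    text = re.sub(r"-\n(?=\w)", "", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces around newlines
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


cleaned = clean_text("Archi-\ntecture   overview\n\n\n\nDetails\t here ")
print(cleaned)
```

The hyphen rule is a heuristic: it also joins genuine compounds that happen to break at a line end, so it may need tuning per source.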

Step 2 — `Chunking` Documents 📑

Instead of embedding an entire document, we split it into chunks.

Even with today's large context windows (1M+ tokens on some models), chunking still matters. Smaller chunks improve retrieval precision, because only the relevant passage is pulled in rather than a whole document, and they reduce cost and latency per query.

Chunk Size	Tradeoff
Small (~200 tokens)	More precise retrieval, less context per chunk
Large (~1000 tokens)	More context per chunk, less precise retrieval

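A common starting heuristic:

$\text{Chunk Size} = 300 \text{ tokens}$

$\text{Overlap} = 50 \text{ tokens}$

The overlap repeats the end of each chunk at the start of the next one, which helps maintain context across chunk boundaries.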

Example:

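# Requires: pip install langchain-text-splitters tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# The plain constructor counts characters, so use the tiktoken-based
# constructor to measure chunk_size and chunk_overlap in tokens
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=300,
    chunk_overlap=50,
)

chunks = splitter.split_documents(documents)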
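To see what the overlap does mechanically, here is a dependency-free sketch of a sliding-window chunker. It is not how the LangChain splitter works internally (that one also respects separators such as paragraphs); `chunk_tokens` is a hypothetical helper that operates on an already tokenized list.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 300, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Small numbers so the overlap is easy to see
tokens = [str(i) for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.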

Step 3 — `Embedding` the Data ↗️

Embeddings convert text into a high-dimensional vector that captures semantic meaning.

Also called vectorization or encoding.
Similar meaning → similar vectors.

Example embedding:

"What is Kubernetes?"
→ [0.12, -0.44, 0.88, ...]

Example code:

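from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is Kubernetes?",
)

# The vector itself is a list of floats on the first result
embedding = response.data[0].embedding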

Step 4 — Store in a `Vector Database` 🔢

Embeddings must be stored in a vector index.

Popular Vector DB options:

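Database	Use Case
Pinecone	Fully managed vector database
Weaviate	Open-source vector database with built-in hybrid search
FAISS	Open-source similarity search library from Meta (an in-process index, not a database server)
Qdrant	Open-source vector database with metadata (payload) filtering, self-hosted or managed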

Example architecture:

flowchart TD
    C["Chunks 📑"] --> E["Embedding Model ↗️"]
    E --> V["Vector DB 🔢"]

Python:

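# vector_db stands for your vector database client; method and
# parameter names differ between databases
vector_db.add(
    ids=[chunk_id],
    embeddings=[embedding],
    metadata={"source": "docs"},
)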
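To make the role of ids, embeddings, and metadata concrete, here is a minimal in-memory stand-in for a vector database. `InMemoryVectorStore` and its methods are hypothetical and only illustrate the bookkeeping; real databases add approximate nearest-neighbor indexes, persistence, and similarity search on top.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryVectorStore:
    """Hypothetical toy store mapping chunk id -> (embedding, metadata)."""

    dim: int  # every stored embedding must have this length
    rows: dict = field(default_factory=dict)

    def add(self, ids, embeddings, metadatas):
        if not (len(ids) == len(embeddings) == len(metadatas)):
            raise ValueError("ids, embeddings and metadatas must have equal length")
        for chunk_id, vector, metadata in zip(ids, embeddings, metadatas):
            if len(vector) != self.dim:
                raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
            # Re-adding an existing id overwrites it (upsert semantics)
            self.rows[chunk_id] = (list(vector), dict(metadata))

    def filter(self, **conditions):
        """Return ids whose metadata matches every key=value condition."""
        return [
            chunk_id
            for chunk_id, (_vector, metadata) in self.rows.items()
            if all(metadata.get(key) == value for key, value in conditions.items())
        ]


store = InMemoryVectorStore(dim=3)
store.add(
    ids=["chunk-1", "chunk-2"],
    embeddings=[[0.12, -0.44, 0.88], [0.05, 0.31, -0.27]],
    metadatas=[{"source": "docs"}, {"source": "wiki"}],
)
docs_only = store.filter(source="docs")
print(docs_only)
```

Storing metadata next to each vector is what later allows a search to be restricted to a source, a team, or a date range.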

Step 5 — Query Time Retrieval 🔎

5.1 `Top-K Documents`

Top-K Documents refers to selecting the K most relevant documents from a larger collection based on a similarity or ranking score.

K Value	Effect
Small K	Faster, more precise
Large K	More context, but more noise

flowchart TD
    Q["User Query ❓"] --> E["Embedding ↗️"]
    E --> S["Vector Similarity Search 🔎"]
    S --> D["Top-K Documents 📁"]

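Mathematically, we rank chunks by the cosine similarity between the query embedding and each chunk embedding (here $q$ and $d_i$ denote the embedding vectors):

$\text{similarity}(q, d_i) = \cos\theta = \frac{q \cdot d_i}{\lVert q \rVert \, \lVert d_i \rVert}$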

Python:

# query_embedding is the embedded user query; k is the number of chunks to return
results = vector_db.search(
    query_embedding,
    k=5,
)
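Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by brute force and keeps the K best chunks; `cosine_similarity`, `top_k`, and the two-dimensional sample vectors are illustrative only, and a real vector database would use an approximate index instead of scanning every chunk.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = ((cosine_similarity(query, vec), chunk_id) for chunk_id, vec in chunks.items())
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]


# Hypothetical 2-dimensional embeddings
chunks = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
results = top_k([2.0, 0.0], chunks, k=2)
print(results)
```

The query `[2.0, 0.0]` points in the same direction as chunk `a`, so `a` scores 1.0 even though the vectors differ in length: cosine similarity compares direction, not magnitude.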

Shortcomings

Pure vector search can miss exact keyword matches, such as identifiers, error codes, or product names, and may retrieve the wrong chunk.

5.2 Hybrid Search

Hybrid search addresses this by adding keyword search:

Vector Search + Keyword Search (BM25)

Vector search excels at: Semantic similarity

Example: "car" ≈ "automobile"

BM25 excels at: Exact keywords

Example: "workspace-id" --> "aws-managed-grafana-workspace-id"
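To illustrate the keyword side, here is a compact BM25 scorer. It uses one common variant of the formula (with the usual `k1` and `b` parameters); the `tokenize` and `bm25_scores` helpers and the sample documents are hypothetical, and production systems would rely on a search engine or library rather than this loop.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics: 'workspace-id' -> ['workspace', 'id']."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Set aws-managed-grafana-workspace-id in the Terraform module.",
    "Grafana dashboards show cluster health.",
    "Use the car or automobile policy for travel.",
]
scores = bm25_scores("workspace-id", docs)
print(scores)
```

Only the first document contains the tokens `workspace` and `id`, so it is the only one with a non-zero score. An embedding model might not separate these documents as cleanly.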

flowchart TD
    Q["User Query ❓"] --> E["Embedding ↗️"]
    Q --> K["Keyword Search (BM25) 🔎"]
    E --> S["Vector Similarity Search 🔎"]
    S --> F["Merge Results"]
    K --> F
    F --> D["Top-K Documents 📁"]

Advantages

Better recall
Handles exact identifiers
Better for code and technical docs
Widely used in production search systems
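Hybrid search needs a way to merge the vector ranking and the keyword ranking into one list. The article does not prescribe a method; one widely used option is Reciprocal Rank Fusion (RRF), sketched here under that assumption. The function name and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["a", "b", "c"]   # hypothetical result of vector search
keyword_ranking = ["c", "a", "d"]  # hypothetical result of BM25 search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

RRF works on ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents found by both retrievers (`a` and `c` here) rise to the top.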

5.3 Hypothetical Document Embeddings (`HyDE`)

Question: How do circuit breakers prevent cascading failures?

Standard RAG

Embedding is generated from the question.

HyDE

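Instead of embedding the user's question directly, HyDE first asks an LLM to write a hypothetical answer:

Circuit breakers prevent cascading failures
by temporarily stopping requests to unhealthy
services and allowing recovery probes.

It then embeds that answer and uses the resulting vector for the search.

A hypothetical answer is phrased like the documents being searched, so its embedding usually lands closer to the target documents than the embedding of a short question does, even when the generated answer is not fully accurate.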
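The HyDE flow can be sketched as a small function. To keep it independent of any provider, the LLM and the embedding model are passed in as callables; `generate`, `embed`, `hyde_query_vector`, and the prompt wording are all hypothetical placeholders.

```python
from typing import Callable, Sequence


def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed an LLM-written hypothetical answer instead of the question."""
    prompt = (
        "Write a short passage that answers the question.\n\n"
        f"Question: {question}\nPassage:"
    )
    hypothetical_answer = generate(prompt)  # the extra LLM call HyDE adds
    return embed(hypothetical_answer)       # search with this vector


# Stub implementations so the sketch runs without any API
def fake_generate(prompt: str) -> str:
    return "Circuit breakers stop requests to unhealthy services."


embedded_texts = []


def fake_embed(text: str) -> list[float]:
    embedded_texts.append(text)
    return [float(len(text))]


vector = hyde_query_vector(
    "How do circuit breakers prevent cascading failures?",
    fake_generate,
    fake_embed,
)
```

The stubs make the key point visible: the text that reaches the embedding model is the generated passage, not the original question. The extra `generate` call is also where the added cost and latency come from.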

Feature	Basic RAG	Hybrid Search	HyDE
Semantic retrieval	✅	✅	✅
Keyword matching	❌	✅	❌
Recall quality	Medium	High	High
Cost	Low	Medium	Higher
Latency	Low	Medium	Higher
Additional LLM call	❌	❌	✅
Works with technical identifiers	Weak	Excellent	Moderate
Production adoption	Very High	Extremely High	Growing

Step 6 — Prompt Construction 💬

Now we inject retrieved documents into the prompt.

Example prompt template:


You are a helpful assistant.

Use the context below to answer the question.

Context:
{retrieved_docs}

Question:
{user_query}
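Filling in the template usually means joining several retrieved chunks and keeping the result within a size budget. A minimal sketch, assuming chunks arrive as plain strings sorted by relevance; `build_prompt` and the character budget `max_context_chars` are hypothetical, and a real system would typically budget in tokens.

```python
def build_prompt(query: str, chunks: list[str], max_context_chars: int = 4000) -> str:
    """Number the retrieved chunks and place them above the question."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n"
    )


prompt = build_prompt(
    "Where can I park?",
    ["Lot B is for staff.", "Lot C is for visitors."],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which passage it used, and cutting off at the budget drops the least relevant chunks first because the list is sorted by relevance.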

Example

prompt = f"""
Answer the question using the context below.

Context:
{docs}

Question:
{query}
"""

Step 7 — Generate Answer with LLM 📃

Now the LLM generates the answer grounded in retrieved knowledge.

flowchart TD
    P[Prompt + Retrieved Context ℹ️] --> LLM["LLM Generation 📃"]
    LLM --> A["Answer 💬"]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
)
# The generated answer is on the first choice
answer = response.choices[0].message.content

Production Architecture

A scalable RAG architecture looks like this:

flowchart TD

    A["User App"] --> B["API Server"]

    B -->|"1. Query"| C["Vector Database<br/>(Retrieval)"]

    C -->|"Top-K chunks"| B

    B -->|"2. Prompt + context"| D["LLM API<br/>(Generation)"]

    D --> E["Response"]
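A minimal sketch of what the API server does for a single request, assuming the embedding model, the vector database, and the LLM are passed in as plain callables. The names `answer_question`, `embed`, `search`, and `generate` are hypothetical and not part of any library.

```python
from typing import Callable, Sequence


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    generate: Callable[[str], str],
    k: int = 5,
) -> str:
    """Retrieve context for the question, then ask the LLM to answer."""
    query_vector = embed(question)        # API server -> embedding model
    chunks = search(query_vector, k)      # API server -> vector database
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )
    return generate(prompt)               # API server -> LLM API


# Stubs standing in for the real services
def fake_embed(text: str) -> list[float]:
    return [1.0, 0.0]


def fake_search(vector, k: int) -> list[str]:
    return ["Lot B is for staff."][:k]


def fake_generate(prompt: str) -> str:
    return "Staff park in Lot B." if "Lot B" in prompt else "I don't know."


reply = answer_question("Where do staff park?", fake_embed, fake_search, fake_generate)
print(reply)
```

Injecting the three dependencies keeps the request handler testable without network calls, and each one can be swapped independently, for example to change the vector database or the LLM provider.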

Example Tech Stack

Layer	Tools
Ingestion	Airflow
Embeddings	OpenAI
Vector DB	Pinecone
Orchestration	LangChain
API	FastAPI

Written by Hitesh Sahu, a passionate developer and blogger.

Tue Feb 24 2026
