Retrieval-Augmented Generation (RAG) for AI Applications
Comprehensive guide to Retrieval-Augmented Generation, covering architecture, embeddings, vector databases, document indexing, retrieval strategies, and best practices for building production-ready RAG systems.
Large Language Models are powerful, but they have one major limitation: they don't know your private data.
If you ask a model about your company docs, support tickets, or internal knowledge base, it will hallucinate or say it doesn't know.
Retrieval-Augmented Generation (RAG) solves this.
Instead of relying only on the model's training data, we retrieve relevant documents at query time and inject them into the prompt.
In this post we’ll walk through how to build a production RAG system step-by-step, including architecture, scaling concerns, and engineering tradeoffs.
What is RAG?
RAG helps an LLM generate answers grounded in knowledge retrieved from a vector database of existing documents.
RAG combines two components:
- Retriever
- Generator (LLM)
Core idea:
Without RAG, we ask the LLM directly:
User → LLM → Answer
With RAG, we add a context retrieval step to improve the answer:
User → Retriever → LLM → Answer
RAG is becoming the default architecture for AI products. But building it reliably in production requires careful engineering.
Advantages of RAG
- Access private knowledge: The model can answer questions about private data it was never trained on.
- Reduce hallucinations: Grounding the model in retrieved documents lowers the chance of generating false information.
- Stay up-to-date: The model draws on external knowledge that can be refreshed without retraining.
How RAG Works
RAG works in three steps:
- Search for documents relevant to the question
- Insert the retrieved text into the prompt
- Generate the answer from the updated prompt
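Given a user query \(q\), let
\[ D = \{d_1, d_2, \dots, d_n\} \]
represent the document set.
The RAG system retrieves the most relevant document
\[ d^* = \arg\max_{d \in D} \operatorname{sim}(q, d) \]
where \(\operatorname{sim}\) is a similarity score such as cosine similarity.
Then the LLM generates a response conditioned on \(q\) and \(d^*\).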
flowchart TD
Q[User question ❓] --> R1[Retrieve relevant documents 📁]
R1 --> R2[Insert retrieved context into prompt ℹ️]
R2 --> LLM[LLM generates answer 📄]
LLM --> A[Grounded response 💬]
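Conceptually, the prompt becomes:
Prompt = Retrieved context + User question
For example, a question about refund timing is sent to the LLM together with the policy paragraph that covers refunds.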
This is powerful because the LLM is being used more as a reasoning engine than as a pure source of facts.
It reads the relevant text and uses it to formulate an answer.
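The three steps can be sketched in a few lines of plain Python. This is a toy illustration rather than a production retriever: the word-overlap score stands in for embedding similarity, and `DOCS`, `tokenize`, `similarity`, `retrieve`, and `build_prompt` are hypothetical names made up for this example.

```python
import re

# Toy document set; in a real system these would be chunks stored in a vector DB.
DOCS = [
    "Refunds are issued within 14 days of a return request.",
    "The on-call engineer rotates every Monday at 09:00 UTC.",
    "API keys can be rotated from the account settings page.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(query: str, doc: str) -> float:
    """Jaccard word overlap, a stand-in (assumption) for embedding similarity."""
    q, d = tokenize(query), tokenize(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve(query: str, docs: list[str]) -> str:
    """Step 1: return the single most similar document."""
    return max(docs, key=lambda doc: similarity(query, doc))


def build_prompt(query: str, context: str) -> str:
    """Step 2: insert the retrieved text into the prompt."""
    return f"Context:\n{context}\n\nQuestion:\n{query}"


question = "How long do refunds take?"
best_doc = retrieve(question, DOCS)
prompt = build_prompt(question, best_doc)
# Step 3 would send `prompt` to an LLM; that call is omitted here.
print(prompt)
```

Only the refund sentence shares a word with the question, so it is the document that ends up in the prompt.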
Building a Production RAG System
Step 1 — Data Collection 📂
Your RAG system is only as good as the documents you feed it.
We need to collect and index all relevant documents that the model can retrieve from.
RAG searches these documents in the vector DB to find relevant context for the user query.
Typical sources:
- Confluence Pages
- Slack threads
- GitHub repos
- Product docs & Wiki
- Policy docs, e.g. PDFs
Example pipeline:
flowchart TD
A["Data Sources 📚"] --> B["Document Loader 📕"]
B --> C[Text Cleaning 📖]
C --> D[Chunking 📑]
Python example:
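# PyPDFLoader lives in langchain_community in current LangChain releases (requires pypdf)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/architecture.pdf")
documents = loader.load()  # one Document per PDF page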
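The Text Cleaning stage of the pipeline is not shown in the loader example, so here is a minimal sketch of what it might do. The rules are assumptions about typical PDF extraction noise, and `clean_text` is a hypothetical helper, not part of LangChain.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize extracted text before chunking."""
    text = raw.replace("\u00a0", " ")          # non-breaking spaces -> plain spaces
    text = re.sub(r"-\n(?=[a-z])", "", text)   # heuristic: re-join words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)     # keep at most one blank line
    return text.strip()


sample = "Service  archi-\ntecture\u00a0overview\n\n\n\nThe API   gateway routes traffic. "
print(clean_text(sample))
```

The hyphen rule is a heuristic and can wrongly join a genuine hyphenated compound that happens to break at a line end, so it is worth checking against your own documents.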
Step 2 — Chunking Documents 📑
Instead of embedding an entire document, we split it into chunks.
Embedding models and LLMs have context limits (e.g. 4k tokens), so we need to break documents into smaller pieces.
| Chunk Size | Tradeoff |
|---|---|
| Small (200 tokens) | More precise retrieval, but less context per chunk |
| Large (1000 tokens) | More context per chunk, but less precise retrieval |
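A common heuristic is to overlap consecutive chunks by a small fraction of the chunk size, for example 50 characters of overlap for a 300-character chunk.
The overlap helps maintain context across chunks.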
Example:
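# The splitter ships in the langchain-text-splitters package in current LangChain releases
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,    # measured in characters by default, not tokens
    chunk_overlap=50   # characters shared between neighboring chunks
)
chunks = splitter.split_documents(documents)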
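To see how chunk size and overlap interact, here is a simplified fixed-size sliding window. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers natural separators such as paragraphs and sentences); `chunk_words` is a hypothetical helper for illustration only.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_words(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

With ten tokens, a window of four, and an overlap of one, each chunk starts with the last token of the previous chunk, which is how a sentence cut at a boundary stays visible in both neighbors.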
Step 3 — Embedding the Data ↗️
Embeddings convert text into a high-dimensional vector that captures semantic meaning.
- Also called vectorization or encoding.
- Similar meaning → similar vectors.
Example embedding:
"What is Kubernetes?"
→ [0.12, -0.44, 0.88, ...]
Example code:
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is Kubernetes?"
)
embedding = response.data[0].embedding  # list of floats
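In an indexing job you usually embed many chunks, not one string, so requests are typically sent in batches. The sketch below shows one way to do that with an injectable embedding function. `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the fake embedder exists only so the example runs offline.

```python
from typing import Callable, Iterator

Vector = list[float]


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 64,
) -> list[Vector]:
    """Embed every text, one batch per call, preserving input order."""
    vectors: list[Vector] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch: list[str]) -> list[Vector]:
    """Stand-in embedder (assumption): maps each text to [length, word count]."""
    return [[float(len(t)), float(len(t.split()))] for t in batch]


texts = ["What is Kubernetes?", "Pods run containers"]
vectors = embed_all(texts, fake_embed_batch, batch_size=1)
print(len(vectors), "vectors")
```

With the OpenAI client, `embed_batch` could wrap `client.embeddings.create(model=..., input=batch)` and collect `item.embedding` from `response.data`. A suitable batch size depends on your rate limits.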
Step 4 — Store in a Vector Database 🔢
Embeddings must be stored in a vector index.
Popular Vector DB options:
| Database | Use Case |
|---|---|
| Pinecone | Fully managed vector database |
| Weaviate | Open-source vector database with built-in hybrid search |
| FAISS | Open-source similarity search library by Meta (an index library, not a full database) |
| Qdrant | Open-source vector search engine with metadata filtering, self-hosted or managed |
Example architecture:
flowchart TD
C["Chunks 📑"] --> E["Embedding Model ↗️"]
E --> V["Vector DB 🔢"]
Python:
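# `vector_db` is a placeholder for your vector database client;
# method and argument names differ between libraries.
vector_db.add(
    ids=[chunk_id],
    embeddings=[embedding],
    metadata={"source": "docs"}
)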
Step 5 — Query Time Retrieval 🔎
5.1 Top-K Documents
Top-K Documents refers to selecting the K most relevant documents from a larger collection based on a similarity or ranking score.
| K Value | Effect |
|---|---|
| Small K | Faster, more precise |
| Large K | More context, but more noise |
flowchart TD
Q["User Query ❓"] --> E["Embedding ↗️"]
E --> S["Vector Similarity Search 🔎"]
S --> D["Top-K Documents 📁"]
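Mathematically, we rank documents by the cosine similarity between the query embedding \(q\) and each document embedding \(d\):
\[ \operatorname{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} \]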
Python:
results = vector_db.search(
query_embedding,
k=5
)
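To make the `add` and `search` calls concrete, here is a minimal brute-force index in plain Python. It is a sketch for small demos, not a replacement for a real vector database: `InMemoryIndex` is a hypothetical class, its method signatures are assumptions chosen to resemble the snippets in this post, and the vectors are made-up toy values.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Brute-force vector index; fine for small demos, not for large corpora."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float], dict]] = []

    def add(self, ids, embeddings, metadatas) -> None:
        """Store each id with its embedding and metadata."""
        self._rows.extend(zip(ids, embeddings, metadatas))

    def search(self, query_embedding: list[float], k: int = 5) -> list[tuple[str, float]]:
        """Return the k ids with the highest cosine similarity to the query."""
        scored = [
            (row_id, cosine_similarity(query_embedding, vec))
            for row_id, vec, _ in self._rows
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


index = InMemoryIndex()
index.add(
    ids=["k8s", "docker", "billing"],
    embeddings=[[0.9, 0.1, 0.0], [0.7, 0.3, 0.1], [0.0, 0.1, 0.9]],  # made-up toy vectors
    metadatas=[{"source": "docs"}] * 3,
)
results = index.search([1.0, 0.0, 0.0], k=2)
print([doc_id for doc_id, _ in results])
```

Scanning every vector costs time linear in the corpus size, which is why production systems rely on approximate nearest-neighbor indexes instead.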
Shortcomings
Plain vector search can miss exact keyword matches, such as identifiers or error codes, and may retrieve the wrong chunk.
5.2 Hybrid Search
Hybrid search addresses this by adding keyword search:
Vector Search + Keyword Search (BM25)
Vector search excels at: Semantic similarity
Example: "car" ≈ "automobile"
BM25 excels at: Exact keywords
Example: "workspace-id" --> "aws-managed-grafana-workspace-id"
flowchart TD
Q["User Query ❓"] --> E["Embedding ↗️"]
Q --> K["Keyword Search (BM25) 🔎"]
E --> S["Vector Similarity Search 🔎"]
S --> F["Merge Rankings"]
K --> F
F --> D["Top-K Documents 📁"]
Advantages
- Better recall
- Handles exact identifiers
- Better for code and technical docs
- Widely adopted in production RAG systems
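Hybrid search needs a way to merge the vector ranking and the keyword ranking into one list. The post does not name a fusion method, so the sketch below assumes Reciprocal Rank Fusion (RRF), one common choice. The document ids and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search ranking
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF uses only rank positions, so it avoids having to normalize cosine scores and BM25 scores onto a shared scale. Documents found by both retrievers rise to the top.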
5.3 Hypothetical Document Embeddings (HyDE)
Question: How do circuit breakers prevent cascading failures?
Standard RAG
The embedding is generated from the question.
HyDE
Instead of embedding the user's question directly, first ask an LLM to generate a hypothetical answer:
Circuit breakers prevent cascading failures
by temporarily stopping requests to unhealthy
services and allowing recovery probes.
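Then embed that answer and use it as the search vector.
A hypothetical answer tends to resemble the target documents in wording and length, so its embedding often lands closer to them than the question's embedding does.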
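The HyDE flow can be sketched as a small function that takes the LLM call and the embedding call as parameters. `hyde_query_vector`, `fake_generate`, and `fake_embed` are hypothetical names; the fakes are stand-ins so the example runs without any API access.

```python
from typing import Callable

Vector = list[float]


def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
) -> Vector:
    """Embed an LLM-written hypothetical answer instead of the raw question."""
    hyde_prompt = f"Write a short passage that answers the question.\nQuestion: {question}"
    hypothetical_answer = generate(hyde_prompt)  # the extra LLM call HyDE requires
    return embed(hypothetical_answer)


# Stand-ins so the sketch runs offline; swap in real LLM and embedding calls.
def fake_generate(prompt: str) -> str:
    return "Circuit breakers stop requests to unhealthy services."


def fake_embed(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


query_vector = hyde_query_vector(
    "How do circuit breakers prevent cascading failures?", fake_generate, fake_embed
)
print(query_vector)
```

The returned vector is then passed to the same similarity search as before. The hypothetical answer may be factually wrong, which is acceptable here because it is used only for retrieval and never shown to the user.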
| Feature | Basic RAG | Hybrid Search | HyDE |
|---|---|---|---|
| Semantic retrieval | ✅ | ✅ | ✅ |
| Keyword matching | ❌ | ✅ | ❌ |
| Recall quality | Medium | High | High |
| Cost | Low | Medium | Higher |
| Latency | Low | Medium | Higher |
| Additional LLM call | ❌ | ❌ | ✅ |
| Works with technical identifiers | Weak | Excellent | Moderate |
| Production adoption | Very High | Extremely High | Growing |
Step 6 — Prompt Construction 💬
Now we inject retrieved documents into the prompt.
Example prompt template:
You are a helpful assistant.
Use the context below to answer the question.
Context:
{retrieved_docs}
Question:
{user_query}
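A template like this can be filled by a small helper. The sketch below numbers each retrieved chunk and enforces a rough size budget. `build_prompt` and `max_context_chars` are hypothetical names, and the character budget is a simplification of real token counting.

```python
def build_prompt(query: str, docs: list[str], max_context_chars: int = 2000) -> str:
    """Number each retrieved chunk and stop adding chunks once the budget is used."""
    parts: list[str] = []
    used = 0
    for number, doc in enumerate(docs, start=1):
        entry = f"[{number}] {doc.strip()}"
        if used + len(entry) > max_context_chars:
            break  # docs are assumed sorted by relevance, so drop the tail
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n"
    )


prompt = build_prompt(
    "How often does the on-call engineer rotate?",
    ["The on-call engineer rotates every Monday.", "Refunds take 14 days."],
    max_context_chars=60,
)
print(prompt)
```

Numbering the chunks lets the model cite its sources as `[1]`, `[2]`, and so on. The explicit "say you don't know" instruction is a common guard against answers that go beyond the retrieved context.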
Example
prompt = f"""
Answer the question using the context below.
Context:
{docs}
Question:
{query}
"""
Step 7 — Generate Answer with LLM 📃
Now the LLM generates the answer grounded in retrieved knowledge.
flowchart TD
P[Prompt + Retrieved Context ℹ️] --> LLM["LLM Generation 📃"]
LLM --> A["Answer 💬"]
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content  # text of the generated answer
Production Architecture
A scalable RAG architecture looks like this:
flowchart TD
A["User App"] --> B["API Server"]
B --> C["Vector Database<br/>(Retrieval)"]
B --> D["LLM API<br/>(Generation)"]
C --> E["Response"]
D --> E
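Inside the API server, the request handler typically calls the vector database first and the LLM second. The sketch below shows that orchestration with injectable dependencies. `RagService`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the fakes stand in for the real vector DB and LLM clients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RagService:
    """Request handler that mirrors the diagram: retrieve, then generate."""

    retrieve: Callable[[str, int], list[str]]  # wraps the vector database
    generate: Callable[[str], str]             # wraps the LLM API
    top_k: int = 5

    def answer(self, query: str) -> dict:
        docs = self.retrieve(query, self.top_k)
        context = "\n\n".join(docs)
        prompt = f"Context:\n{context}\n\nQuestion:\n{query}"
        # Return the sources with the answer so callers can show citations.
        return {"answer": self.generate(prompt), "sources": docs}


# Stand-in dependencies (assumptions) so the sketch runs without external services.
def fake_retrieve(query: str, k: int) -> list[str]:
    corpus = ["Deploys run through the CI pipeline.", "Rollbacks use the previous image tag."]
    return corpus[:k]


def fake_generate(prompt: str) -> str:
    return f"(model output for a {len(prompt)}-character prompt)"


service = RagService(retrieve=fake_retrieve, generate=fake_generate, top_k=1)
result = service.answer("How do we deploy?")
print(result["sources"])
```

Keeping retrieval and generation behind plain callables makes it straightforward to swap vendors or to unit-test the handler without network access.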
Example Tech Stack
| Layer | Tools |
|---|---|
| Ingestion | Airflow |
| Embeddings | OpenAI |
| Vector DB | Pinecone |
| Orchestration | LangChain |
| API | FastAPI |
