RAG: Retrieval-Augmented Generation

The single most important pattern for grounding LLMs in your data.

LLMs don't know your company's docs, your codebase, your private wiki, or anything that happened after their training cutoff. RAG fixes that — without fine-tuning.

The pattern is simple. When a question comes in:

  1. Retrieve the most relevant chunks from a knowledge base
  2. Augment the prompt by stuffing those chunks into context
  3. Generate the answer, now grounded in real information

Retrieval → Augmentation → Generation

[Diagram: User Query → Embed → Search Vectors → Top-K Chunks → LLM + Context → Answer]

Three steps, and each component inside them can be optimised independently: the embedding model, the vector DB, the retrieval strategy, the prompt template, and the generation model.

The retrieval half

Step 1 of every RAG system: turn your documents into something searchable.

Chunking — split docs into chunks of 200–800 tokens. Too small and you lose context; too large and retrieval becomes imprecise. Overlapping chunks help preserve continuity across boundaries.
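
As a rough illustration of overlapping chunks, here is a minimal sketch of a sliding-window splitter. The function name `chunk_tokens` and the default sizes are assumptions chosen for the example, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace-separated words approximate tokens for the demo.
tokens = ("word " * 1000).split()
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)
print([len(c) for c in chunks])  # [400, 400, 300]
```

Each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next. In practice you would likely split on sentence or heading boundaries as well, which this sketch ignores.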

Embedding — convert each chunk to a vector using an embedding model (e.g., text-embedding-3-large, bge-large-en). Store the vector + chunk in a vector database (Pinecone, Weaviate, Qdrant, Chroma, or Postgres with pgvector).

Querying — when the user asks something, embed the question and find the k nearest chunks in vector space (k is usually 4–10). Cosine similarity is the default similarity measure.
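
The querying step can be sketched without any vector database, using a brute-force scan over a small in-memory index. The toy 3-dimensional vectors and the names `cosine_similarity` and `top_k` are assumptions for illustration; a real system would get vectors from an embedding model and delegate the search to a vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=4):
    """index is a list of (chunk_text, vector); returns the k best (score, chunk_text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical pre-computed embeddings (real ones have hundreds of dimensions).
index = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Office opening hours", [0.0, 1.0, 0.0]),
    ("How to request a refund", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded user question
results = top_k(query_vec, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a few thousand chunks. Vector databases exist mainly to make the same lookup fast at much larger scale through approximate nearest-neighbour indexes.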

The augmentation half

You now have a few relevant chunks. Stuff them into the prompt with a strict guardrail: "Use only the context below. If the context doesn't contain the answer, say 'I don't know.'" That single instruction sharply reduces hallucinations, though it does not eliminate them.
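
One way the augmentation step might look in code is a small prompt builder. The template layout, the `build_prompt` name, and the numbered `[1]`, `[2]` labels are assumptions for this sketch; only the guardrail wording comes from the text above.

```python
GUARDRAIL = (
    "Use only the context below. "
    "If the context doesn't contain the answer, say 'I don't know.'"
)


def build_prompt(question, chunks):
    """Assemble guardrail, numbered context chunks, and the question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"{GUARDRAIL}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "How long do I have to request a refund?",
    ["Refund policy: 30 days", "How to request a refund"],
)
print(prompt)
```

Numbering the chunks is optional, but it gives the model a handle it can refer back to when an answer needs to point at its source.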

Why pure RAG often disappoints

Real-world RAG is harder than the diagram. Common failure modes: bad retrieval (the right chunk is not in the top-k), "lost in the middle" (models tend to overlook information placed in the middle of a long context), hallucinated answers despite guardrails, and stale embeddings (the index lags behind documents that have since changed).

The fixes (modern RAG)

  • Hybrid search — combine vector similarity with keyword (BM25). Catches matches one alone misses.
  • Reranking — retrieve 30 candidates with cheap vector search, then use a cross-encoder reranker to pick the best 3–5.
  • Query expansion — rewrite the user's question into multiple search queries.
  • HyDE (Hypothetical Document Embeddings) — have the LLM draft a hypothetical ideal answer, embed that draft, and search with it.
  • Citations — return chunk IDs alongside the answer so users can verify.
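
Two of the fixes above, hybrid search and reranking, can be sketched in a few lines. The text does not say how to combine vector and keyword results; reciprocal rank fusion (RRF) is used here as one common choice, not as the only option. The chunk IDs are made up, and `word_overlap` is a deliberately crude stand-in for a cross-encoder reranker.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs; k dampens the weight of top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def rerank(query, candidates, score_fn, top_n=3):
    """Re-score candidates with a costlier scorer and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]


def word_overlap(query, text):
    """Toy scorer (assumption): counts shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Hypothetical results from a vector search and a BM25 keyword search.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']

best = rerank(
    "refund policy",
    ["office hours", "refund policy details", "refund form"],
    word_overlap,
    top_n=2,
)
print(best)  # ['refund policy details', 'refund form']
```

Chunks that appear in both lists rise to the top of the fused ranking, which is the point of hybrid search. In a real pipeline the `score_fn` slot would be filled by a cross-encoder model that reads the query and the chunk together.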

When NOT to use RAG

RAG is a good fit for company docs, code search, knowledge bases, and news. It is the wrong tool for tasks that need deep reasoning over the entire dataset, tasks where all the data already fits in the model's context window (some models accept around 1M tokens), and tasks where the model already knows the answer.

Knowledge Check

Q1. What is the core flow of RAG?

Q2. What is reranking?

Q3. What is the 'lost in the middle' phenomenon?

Q4. When is RAG generally the wrong tool?
