Architectures
Intermediate
12 min read

Retrieval-Augmented Generation (RAG)

Give an LLM external knowledge without retraining it.

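LLMs forget: their knowledge stops at the training cutoff and never included your private documents. They also hallucinate. RAG is the architecture that mitigates both by grounding answers in retrieved text, and it's one of the most widely deployed AI patterns in production today.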

The pipeline (4 steps)

  1. Chunk — split your documents into small pieces (~500 tokens each).
  2. Embed — turn each chunk into a vector using an embedding model.
  3. Retrieve — at query time, embed the user's question and find the top-k nearest chunks via cosine similarity.
  4. Generate — pass those chunks as context to the LLM along with the question.
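The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names (chunk, embed, build_index, retrieve, build_prompt) are hypothetical, words stand in for tokens, and a bag-of-words count vector stands in for a real embedding model. The Generate step stops at building the prompt, since the LLM call depends on whichever client you use.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 500) -> list[str]:
    """Step 1: split text into pieces of at most `size` words (a rough proxy for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Step 2: toy embedding. A sparse word-count vector stands in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Embed every chunk once, ahead of query time."""
    return [(c, embed(c)) for c in chunks]


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    """Step 3: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Step 4: place the retrieved chunks in front of the question for the LLM."""
    joined = "\n\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}"
    )
```

A typical call chain would be build_index(chunk(document)), then retrieve(question, index), then build_prompt(question, hits). Swapping embed for a real embedding model changes the vector type but not the shape of the pipeline.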

Why it works

The LLM doesn't need to know the answer; it just needs the relevant text in its context window. This shifts the hard problem from training to retrieval, and retrieval is much cheaper: updating the knowledge base means re-embedding the changed documents, not retraining the model.

Common pitfalls

  • Bad chunking — splitting in the middle of a sentence destroys meaning. Use semantic chunking or overlap.
  • Wrong embedding model — use one trained on similar data (e.g., nomic-embed-text for English, multilingual ones for other languages).
  • No re-ranking — the top-k from cosine similarity isn't always optimal. A cheap re-ranker (e.g., a cross-encoder) on the top-50 → top-5 dramatically improves quality.
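Two of the pitfalls above lend themselves to a short sketch: chunking on sentence boundaries with overlap, and re-ranking a wide candidate set down to a small one. Everything here is illustrative. The names chunk_sentences, rerank, and toy_score are hypothetical, the sentence splitter is a naive regex, and toy_score is only a stand-in for a real re-ranker such as a cross-encoder.

```python
import re
from typing import Callable


def chunk_sentences(text: str, max_words: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly `max_words` words.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one. A single sentence longer than `max_words` is kept whole.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap > 0 else []
            count = sum(len(s.split()) for s in current)
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def toy_score(question: str, passage: str) -> float:
    """Stand-in scorer: fraction of question words that appear in the passage."""
    q = set(re.findall(r"[a-z0-9]+", question.lower()))
    p = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Rescore first-stage candidates with `score` and keep the best `keep`."""
    return sorted(candidates, key=lambda c: score(question, c), reverse=True)[:keep]
```

In a two-stage setup, the first stage would return something like the top 50 chunks by cosine similarity, and rerank(question, top_50, score=..., keep=5) would pick what actually goes into the prompt. The scorer is passed in as a function so the toy version can be replaced without touching the rest.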

Try it

Open the Playground, paste a long document into the system prompt, and ask questions. You've just simulated the Generate step of RAG by hand, with yourself playing the retriever.

LLMAtlas — The Open Ecosystem Workspace for LLMs