Retrieval-Augmented Generation (RAG)

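LLMs forget: they know nothing beyond their training data and whatever is in the current context window. They also hallucinate. RAG mitigates both by fetching relevant text at query time and handing it to the model, and it is one of the most widely deployed LLM patterns in production today.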

The pipeline (4 steps)

Chunk — split your documents into small pieces (~500 tokens each).
Embed — turn each chunk into a vector using an embedding model.
Retrieve — at query time, embed the user's question and find the top-k nearest chunks via cosine similarity.
Generate — pass those chunks as context to the LLM along with the question.
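
The first three steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a production retriever: `embed` here is a bag-of-words counter standing in for a real embedding model, `chunk` splits on words rather than tokens, and the sample documents and all function names are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 80) -> list[str]:
    """Split text into pieces of at most max_words words (a stand-in for ~500 tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: a sparse word-count vector. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Embed the question and return the k chunks with the highest cosine similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical documents, indexed once ahead of query time.
docs = [
    "The warranty covers battery defects for two years from the purchase date.",
    "Shipping takes three to five business days within the country.",
    "Returns are accepted within thirty days if the item is unused.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

top = retrieve("How long is the battery warranty?", index, k=1)
print(top[0])
```

The chunks in `top` are what the Generate step would pass to the LLM together with the question.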

Why it works

The LLM doesn't need to know the answer; it just needs the relevant text in its context window. This shifts the hard problem from training to retrieval, and a retrieval index is much cheaper to build and update than retraining or fine-tuning a model.
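
To make "the relevant text in its context window" concrete, here is a minimal sketch of how retrieved chunks might be assembled into a prompt. The function name `build_prompt`, the instruction wording, and the character budget `max_chars` are all assumptions for illustration; a real system would budget in tokens and use whatever prompt format its model expects.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Assemble a grounded prompt from retrieved chunks, stopping at a character budget."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_chars:
            break  # stay inside the (assumed) context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long is the battery warranty?",
    ["The warranty covers battery defects for two years from the purchase date."],
)
print(prompt)
```

The model never has to have memorized the warranty terms; everything it needs to answer is in the string it receives.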

Common pitfalls

Bad chunking — splitting in the middle of a sentence destroys meaning. Use semantic chunking or overlap.
Wrong embedding model — use one trained on similar data (e.g., nomic-embed-text for English, multilingual ones for other languages).
No re-ranking — the top-k from cosine similarity isn't always the best set. Retrieving a wider candidate set (e.g., the top 50) and letting a cheap re-ranker such as a cross-encoder cut it down to the top 5 often improves quality substantially.
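
Two of the fixes above, overlapping chunks and a second-stage re-ranker, can be sketched as follows. This is illustrative only: `chunk_with_overlap` counts words rather than tokens, and `overlap_score` is a toy stand-in for a cross-encoder, which in a real system would be a model that scores each (query, passage) pair. All names here are hypothetical.

```python
from typing import Callable


def chunk_with_overlap(words: list[str], size: int = 8, overlap: int = 2) -> list[str]:
    """Sliding-window chunks: each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Second stage: rescore first-stage candidates with a slower, more precise scorer."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:keep]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / len(q) if q else 0.0


words = "one two three four five six seven eight nine ten".split()
chunks = chunk_with_overlap(words, size=4, overlap=1)

candidates = [
    "shipping takes five days",
    "battery warranty lasts two years",
    "the warranty excludes water damage",
]
best = rerank("battery warranty", candidates, overlap_score, keep=2)
```

Passing the scorer in as a function keeps the two-stage shape visible: the first stage stays cheap and wide, and only the short candidate list pays for the expensive scorer.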

Try it

Open the Playground, paste a long document into the system prompt, and ask questions. You've just simulated RAG by hand: you played the retriever by choosing the document, and the model did the generation.
