Retrieval-Augmented Generation (RAG)
LLMs forget. They also hallucinate. RAG is the architecture that solves both, and it's the single most-deployed AI pattern in production today.
The pipeline (4 steps)
- Chunk — split your documents into small pieces (~500 tokens each).
- Embed — turn each chunk into a vector using an embedding model.
- Retrieve — at query time, embed the user's question and find the top-k nearest chunks via cosine similarity.
- Generate — pass those chunks as context to the LLM along with the question.
Why it works
The LLM doesn't need to know the answer — it just needs the relevant text in its context window. This shifts the hard problem from training to retrieval, and retrieval is much cheaper.
Common pitfalls
- Bad chunking — splitting in the middle of a sentence destroys meaning. Use semantic chunking or overlap.
- Wrong embedding model — use one trained on similar data (e.g.,
nomic-embed-textfor English, multilingual ones for other languages). - No re-ranking — the top-k from cosine similarity isn't always optimal. A cheap re-ranker (e.g., a cross-encoder) on the top-50 → top-5 dramatically improves quality.
Try it
Open the Playground, paste a long document into the system prompt, and ask questions. You've just simulated RAG by hand.