Embedding Models & Vector Databases

The infrastructure that makes RAG actually work at scale.

RAG runs on two pieces of infrastructure: an embedding model, which turns text into vectors, and a vector database, which stores and searches those vectors. Both involve meaningful trade-offs.

Embedding models in 2026

The current leaders:

  • OpenAI text-embedding-3-large: 3072 dims, $0.13 per 1M tokens.
  • OpenAI text-embedding-3-small: 1536 dims, $0.02 per 1M tokens (about 6.5× cheaper than 3-large).
  • Cohere embed-v4: multilingual and multimodal.
  • BGE bge-large-en-v1.5: free, open weights, among the top open-weight English models.
  • Nomic nomic-embed-text-v1.5: free, open weights, Matryoshka support.
  • Voyage voyage-3-large: strong on technical text and code.
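
To make the price gap concrete, here is a minimal cost estimate. The two prices are the ones quoted above and will drift over time; the corpus size and the function name are hypothetical, chosen only for illustration.

```python
# USD per 1M input tokens, as quoted in this lesson. Treat as illustrative:
# provider pricing changes.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
}


def embedding_cost_usd(num_chunks: int, avg_tokens_per_chunk: int, model: str) -> float:
    """Estimate the one-off cost of embedding a corpus with a hosted model."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 2 million chunks of roughly 500 tokens each.
for model in PRICE_PER_MILLION_TOKENS:
    cost = embedding_cost_usd(2_000_000, 500, model)
    print(f"{model}: ${cost:,.2f}")
```

For this hypothetical corpus the estimate comes to roughly $130 for the large model versus $20 for the small one. Re-embedding after a model change costs the same again, which is worth factoring in when choosing.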

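Dimension trade-off: Bigger vectors capture more nuance but cost more to store and search. Some models are trained for Matryoshka truncation, which packs the most important information into the leading dimensions. With those models you can keep only the first 512 dims of a 1024-dim vector, re-normalize, and lose only a modest amount of quality. Truncating a model that was not trained this way usually hurts much more.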
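
A minimal sketch of Matryoshka truncation, assuming the embedding came from a model trained for it. The function names and the toy 4-dim vector are made up for illustration; the second helper shows the storage side of the trade-off for raw float32 vectors, ignoring index overhead.

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


def storage_gib(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone (float32 by default), in GiB."""
    return num_vectors * dims * bytes_per_value / 2**30


full = [0.5, 0.5, 0.5, 0.5]  # toy unit vector standing in for a real embedding
short = truncate_embedding(full, 2)
print(short)

# Halving the dimension halves raw storage for 10M vectors.
print(round(storage_gib(10_000_000, 1024), 1), "GiB at 1024 dims")
print(round(storage_gib(10_000_000, 512), 1), "GiB at 512 dims")
```

The re-normalization step matters when the index uses cosine or inner-product distance, since a truncated prefix of a unit vector is no longer unit length.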

Vector databases

A vector DB does one job: given a query vector, find the k nearest neighbours from millions of stored vectors, fast.
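
Stripped of indexing, that job is a similarity ranking. The sketch below does it by brute force in plain Python, using cosine similarity over a tiny in-memory dict; the data and function names are illustrative, and it assumes no stored vector is all zeros.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Return the ids of the k stored vectors most similar to the query.

    `vectors` maps id -> vector. Every stored vector is compared to the
    query, so the cost grows linearly with the size of the collection.
    """
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    return [vec_id for _, vec_id in heapq.nlargest(k, scored)]


store = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(exact_top_k([1.0, 0.05], store, k=2))  # ['a', 'b']
```

A real vector DB returns the same kind of answer but avoids scanning every vector by using an index.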

  • pgvector (Postgres extension): Start here. Add a vector column and you get embeddings, filters, and transactions in one DB. Comfortable up to roughly 10M vectors.
  • Qdrant / Weaviate / Milvus: Dedicated open-source vector DBs. More index tuning options, richer filtering, and horizontal scaling to billions of vectors.
  • Pinecone / Turbopuffer: Hosted SaaS. Zero ops, usage-based pricing.
  • Chroma / LanceDB: Lightweight and embeddable. Good for local apps.
  • Elasticsearch / OpenSearch: Add vector search to existing keyword pipelines. Best for hybrid (keyword + vector) search.

Approximate vs exact nearest-neighbour

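Exact NN search compares the query against every stored vector, so its cost grows linearly with the collection and it scales poorly. Production systems use ANN (approximate nearest neighbour) with HNSW or IVF indices, which typically reach 95%+ recall while running around 100× faster than brute force on large collections. The trade-off knobs: a bigger, slower-to-build index buys recall, and a wider search at query time trades speed for recall.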
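
The recall-versus-speed knob is easiest to see in a toy IVF index, shown here because it is shorter to write than HNSW (which is graph-based and works differently). This is a sketch on random data, not how any library implements it: centroids are sampled rather than trained with k-means, and all names are illustrative. The knob is `nprobe`, the number of clusters scanned per query.

```python
import math
import random


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(query, vectors, k):
    """Brute force: rank every vector by distance to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return ranked[:k]


def build_ivf(vectors, centroids):
    """Assign each vector index to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate search found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
centroids = rng.sample(vectors, 16)  # real IVF trains centroids with k-means
lists = build_ivf(vectors, centroids)
query = [rng.random() for _ in range(8)]
truth = exact_search(query, vectors, k=10)

for nprobe in (1, 4, 16):
    found = ivf_search(query, vectors, centroids, lists, k=10, nprobe=nprobe)
    scanned = sum(len(lists[c]) for c in sorted(
        range(16), key=lambda c: l2(query, centroids[c]))[:nprobe])
    print(f"nprobe={nprobe:2d} scanned={scanned:4d} recall={recall_at_k(found, truth):.2f}")
```

Scanning all 16 lists is equivalent to exact search, so recall reaches 1.0 there; smaller `nprobe` values scan fewer vectors and may miss some true neighbours.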

Filtering — the often-overlooked half

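In production, you rarely want pure semantic search. You want filtered semantic search: "Find the 5 most relevant chunks from this user's account, in the last 90 days, excluding archived." Vector DBs that pre-filter (apply the metadata filter before or during the vector search) beat those that post-filter. Post-filtering fetches the top k first and then discards rows that fail the filter, so if none of the top k belong to this user you are left with 0 results.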
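
The failure mode is easy to reproduce on toy data. In this sketch the chunk records, the tenant names, and both function names are hypothetical; similarity is a plain dot product over an in-memory list, standing in for whatever the database does internally.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def post_filter_search(query, chunks, keep, k):
    """Take the global top-k by similarity, then drop rows that fail the filter."""
    top = sorted(chunks, key=lambda c: dot(query, c["embedding"]), reverse=True)[:k]
    return [c["id"] for c in top if keep(c)]


def pre_filter_search(query, chunks, keep, k):
    """Apply the filter first, then take the top-k among the survivors."""
    allowed = [c for c in chunks if keep(c)]
    top = sorted(allowed, key=lambda c: dot(query, c["embedding"]), reverse=True)[:k]
    return [c["id"] for c in top]


def is_acme(chunk):
    return chunk["tenant_id"] == "acme"


# Tenant "acme" owns only the two chunks that are least similar to the query.
chunks = [
    {"id": 1, "tenant_id": "other", "embedding": [1.0, 0.0]},
    {"id": 2, "tenant_id": "other", "embedding": [0.9, 0.1]},
    {"id": 3, "tenant_id": "other", "embedding": [0.8, 0.2]},
    {"id": 4, "tenant_id": "acme", "embedding": [0.3, 0.7]},
    {"id": 5, "tenant_id": "acme", "embedding": [0.1, 0.9]},
]
query = [1.0, 0.0]

print(post_filter_search(query, chunks, is_acme, k=3))  # []
print(pre_filter_search(query, chunks, is_acme, k=3))   # [4, 5]
```

Over-fetching (asking for a much larger k before filtering) reduces the problem but does not remove it, since a selective enough filter can still empty the result.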

Practical recipe

A solid starting stack:

  • Embedding: OpenAI text-embedding-3-small OR bge-large-en-v1.5
  • Vector DB: Postgres + pgvector
  • Index: HNSW with cosine distance
  • Filtering: pre-filter by tenant_id, document_type, updated_at
  • Retrieval: top-30 → rerank → top-5

This stack is enough for most production RAG workloads.
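
One way the recipe could look in practice, as a sketch rather than a reference setup. The SQL uses pgvector's `vector` type, `hnsw` index method, `vector_cosine_ops` operator class, and `<=>` cosine-distance operator. The table name `chunks`, the `content` column, the 1536-dim size (matching text-embedding-3-small), and the psycopg-style `%(name)s` placeholders are assumptions. The reranker is a toy word-overlap scorer standing in for a real cross-encoder or rerank API.

```python
# Schema and index (run once). Names other than tenant_id, document_type,
# and updated_at are hypothetical.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id            bigserial PRIMARY KEY,
    tenant_id     text        NOT NULL,
    document_type text        NOT NULL,
    updated_at    timestamptz NOT NULL,
    content       text        NOT NULL,
    embedding     vector(1536)
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# Filtered retrieval of the 30 nearest chunks by cosine distance.
RETRIEVE_SQL = """
SELECT id, content
FROM chunks
WHERE tenant_id = %(tenant_id)s
  AND document_type = %(document_type)s
  AND updated_at >= %(since)s
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT 30;
"""


def toy_overlap_score(query: str, text: str) -> int:
    """Placeholder relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def rerank(query, candidates, score_fn, top_n=5):
    """Re-score retrieved candidates and keep the best top_n."""
    return sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)[:top_n]


retrieved = [
    "hnsw index build time",
    "pgvector adds vector search to postgres",
    "cooking pasta at home",
]
print(rerank("postgres vector search", retrieved, toy_overlap_score, top_n=1))
```

How the WHERE clause interacts with the HNSW index depends on the pgvector version and the query planner, so it is worth checking that filtered queries actually return the number of rows you asked for.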

Knowledge Check

Q1. What is Matryoshka embedding truncation?

Q2. Why is pre-filtering better than post-filtering?

Q3. What does HNSW provide?
