Embedding Models & Vector Databases
The infrastructure that makes RAG actually work at scale.
RAG runs on two pieces of infra: an embedding model and a vector database. Both have meaningful trade-offs.
Embedding models in 2026
The current leaders: OpenAI text-embedding-3-large (3072 dim, $0.13/1M tokens), OpenAI text-embedding-3-small (1536 dim, $0.02/1M tokens, about 6.5× cheaper), Cohere embed-v4 (multilingual + multimodal), BGE bge-large-en-v1.5 (free, open-weight, strong on English), Nomic nomic-embed-text-v1.5 (free, open-weight, Matryoshka), Voyage voyage-3-large (strong on technical/code).
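Dimension trade-off: Bigger vectors capture more nuance but cost more to store and search, because storage and distance computation both grow linearly with dimension. Many models are trained for Matryoshka truncation: keep the first 512 dims of a 1024-dim vector, re-normalize to unit length, and accept only a modest quality loss. Truncating a model that was not trained this way usually hurts quality much more.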
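To make the truncation step concrete, here is a minimal sketch in Python with NumPy. The helper names (truncate_embedding, storage_bytes) are hypothetical, and the random vector is only a stand-in for a real embedding from a Matryoshka-trained model.

```python
import numpy as np


def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and rescale to unit length."""
    head = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(head)
    if norm == 0.0:
        return head  # nothing to rescale
    return head / norm


def storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage for the vectors alone, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value


# Stand-in for a 1024-dim embedding (assumption: a real one comes from your model).
rng = np.random.default_rng(0)
full = rng.normal(size=1024).astype(np.float32)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 512)
print(short.shape)                      # (512,)
print(storage_bytes(1_000_000, 1024))   # 4096000000
print(storage_bytes(1_000_000, 512))    # 2048000000
```

Re-normalizing matters because cosine and dot-product scores assume unit-length vectors; halving the dimension halves raw storage, here from about 4.1 GB to about 2.0 GB per million float32 vectors.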
Vector databases
A vector DB does one job: given a query vector, find the k nearest neighbours from millions of stored vectors, fast.
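Stripped of indexing, that one job is a few lines of linear algebra. The sketch below is a brute-force version in Python with NumPy, using cosine similarity as the distance; exact_top_k is a hypothetical name, and a real vector DB replaces the full scan with an index.

```python
import numpy as np


def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Return row indices of the k most cosine-similar vectors, best first."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                          # one similarity score per stored vector
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]        # order just those k


stored = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
print(exact_top_k(query, stored, k=2))  # [0 2]
```

Every query touches every stored vector, which is fine for thousands of rows and increasingly painful for millions.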
- pgvector (Postgres extension) — Start here. Add a column, get embeddings + filters + transactions in one DB. Scales to ~10M vectors.
- Qdrant / Weaviate / Milvus — Dedicated open-source vector DBs. Better recall, richer filtering, scales to billions.
- Pinecone / Turbopuffer — Hosted SaaS. Zero ops, pay per query.
- Chroma / LanceDB — Lightweight, embeddable. Good for local apps.
- Elasticsearch / OpenSearch — Add vector search to keyword pipelines. Best for hybrid.
Approximate vs exact nearest-neighbour
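Exact NN search compares the query against every stored vector, so latency grows linearly with collection size. Production uses ANN (approximate nearest neighbour) with HNSW or IVF indices, which typically reach 95%+ recall while running orders of magnitude faster on large collections. The trade-off: you pay index build time and memory up front to gain query speed, then tune search parameters (such as HNSW's search width or the number of IVF lists probed) to trade speed against recall.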
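A toy IVF-style index shows where the speed and the lost recall both come from. This is a simplified sketch in Python with NumPy, not how any particular library implements it: centroids are sampled data points instead of k-means output, and all function names are hypothetical.

```python
import numpy as np


def build_ivf(vectors: np.ndarray, nlist: int, seed: int = 0):
    """Assign every vector to its nearest centroid (centroids are sampled rows)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]
    # Squared L2 distance from every vector to every centroid.
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)
    lists = [np.flatnonzero(assignment == c) for c in range(nlist)]
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    cd2 = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(cd2)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    d2 = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d2)[:k]]


def exact_search(query, vectors, k):
    """Ground truth: scan everything."""
    d2 = ((vectors - query) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]


def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the true neighbours that the approximate search found."""
    return len(set(approx_ids.tolist()) & set(exact_ids.tolist())) / len(exact_ids)


rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 32)).astype(np.float32)
query = rng.normal(size=32).astype(np.float32)

centroids, lists = build_ivf(data, nlist=20)
truth = exact_search(query, data, k=10)
for nprobe in (1, 5, 20):
    found = ivf_search(query, data, centroids, lists, k=10, nprobe=nprobe)
    print(nprobe, recall_at_k(found, truth))
```

Probing all 20 lists scans every vector and recovers the exact result; probing fewer lists scans less data and may miss neighbours that landed in an unprobed list. Random Gaussian data is a hard case for clustering, so the recall numbers printed here say little about real embeddings.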
Filtering — the often-overlooked half
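In production, you rarely want pure semantic search. You want filtered semantic search: "Find the 5 most relevant chunks from this user's account, in the last 90 days, excluding archived." Vector DBs that pre-filter (apply the metadata filter before or during the vector search) win over those that post-filter. Post-filtering takes the global top k first and then discards rows that fail the filter, so if none of those k belong to this user you are left with 0 results.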
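The failure mode is easy to reproduce with brute-force search. In this sketch (Python with NumPy, hypothetical function names, hand-made 2-D vectors), the three rows closest to the query belong to another tenant, so filtering after retrieval returns nothing. Real databases pre-filter inside the index instead of scanning a subset, but the outcome is the same.

```python
import numpy as np


def top_k(query, vectors, ids, k):
    """Exact top-k by dot product over the given subset of row ids."""
    scores = vectors[ids] @ query
    return ids[np.argsort(-scores)[:k]]


def post_filter(query, vectors, allowed, k):
    """Retrieve top-k over everything, then drop rows that fail the filter."""
    all_ids = np.arange(len(vectors))
    hits = top_k(query, vectors, all_ids, k)
    return hits[allowed[hits]]


def pre_filter(query, vectors, allowed, k):
    """Restrict to rows that pass the filter, then retrieve top-k."""
    return top_k(query, vectors, np.flatnonzero(allowed), k)


vectors = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],   # other tenant, very close to the query
    [0.5, 0.5], [0.3, 0.7], [0.1, 0.9],   # this tenant, less similar
])
allowed = np.array([False, False, False, True, True, True])
query = np.array([1.0, 0.0])

print(post_filter(query, vectors, allowed, k=3))  # []
print(pre_filter(query, vectors, allowed, k=3))   # [3 4 5]
```

Over-fetching (retrieve 10× k, then filter) softens the problem but cannot guarantee k results when the filter is highly selective.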
Practical recipe
A solid starting stack:
- Embedding: OpenAI text-embedding-3-small OR bge-large-en
- Vector DB: Postgres + pgvector
- Index: HNSW with cosine distance
- Filtering: pre-filter by tenant_id, document_type, updated_at
- Retrieval: top-30 → rerank → top-5
This stack is enough for most production RAG workloads.
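As a rough sketch of how the recipe could look in practice, the Python below holds pgvector DDL and a filtered search query as strings, plus small helpers to build parameters and rerank. The table name chunks, its columns, and all function names are assumptions for illustration; nothing here connects to a database, and the SQL would be passed to a Postgres driver such as psycopg. The reranker is left as a caller-supplied scoring function.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema: one row per chunk, 1536 dims to match text-embedding-3-small.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id            bigserial PRIMARY KEY,
    tenant_id     bigint      NOT NULL,
    document_type text        NOT NULL,
    updated_at    timestamptz NOT NULL,
    content       text        NOT NULL,
    embedding     vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# The <=> operator is cosine distance in pgvector; smaller means more similar.
SEARCH_SQL = """
SELECT id, content
FROM chunks
WHERE tenant_id = %s
  AND document_type = %s
  AND updated_at >= %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values) -> str:
    """Format floats as the '[x,y,z]' text form that pgvector accepts."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_search(query_embedding, tenant_id, document_type,
                 max_age_days=90, candidates=30, now=None):
    """Return (sql, params) for a filtered top-`candidates` search."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    params = (tenant_id, document_type, cutoff,
              to_vector_literal(query_embedding), candidates)
    return SEARCH_SQL, params


def rerank(candidates, score_fn, top_n=5):
    """Keep the top_n candidates by a caller-supplied relevance score."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]


sql, params = build_search([0.1, 0.2, 0.3], tenant_id=42, document_type="faq",
                           now=datetime(2026, 1, 1, tzinfo=timezone.utc))
print(params[3])  # [0.1,0.2,0.3]
print(params[4])  # 30
```

How Postgres combines the WHERE clause with the HNSW index depends on the pgvector version and planner settings, so it is worth checking the query plan with EXPLAIN and confirming that filtered queries still return enough rows.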