Tokens, Embeddings & the Vector World
Models don't see words. They see numbers. Here's how text becomes math.
LLMs don't read English. They read token IDs: integers that index into a large lookup table. Before any prediction happens, your text is chopped into tokens, each token is mapped to an integer ID, and each ID is mapped to a vector (a list of floating-point numbers, typically hundreds to several thousand long; 4,096 in some models).
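The three steps can be sketched in a few lines. This is a toy illustration, not how any real tokenizer is built: the four-entry vocabulary, the greedy longest-match splitting, and the random 8-dimensional embedding table are all assumptions made for the example.

```python
import random

# Hypothetical toy vocabulary: token string -> token ID.
VOCAB = {"Hello": 0, " world": 1, "!": 2, "<unk>": 3}
EMBED_DIM = 8  # real models use hundreds to thousands of dimensions

# Embedding table: one row of EMBED_DIM floats per token ID.
# Real tables are learned during training; random values stand in here.
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

# Try longer vocabulary entries first so " world" wins over shorter pieces.
PIECES = sorted((t for t in VOCAB if t != "<unk>"), key=len, reverse=True)


def tokenize(text):
    """Step 1: split text into tokens by greedy longest match (a simplification)."""
    tokens, i = [], 0
    while i < len(text):
        for piece in PIECES:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append("<unk>")  # character not covered by the vocabulary
            i += 1
    return tokens


def encode(text):
    """Step 2: map each token to its integer ID."""
    return [VOCAB[t] for t in tokenize(text)]


def embed(ids):
    """Step 3: look up one vector per token ID."""
    return [EMBEDDINGS[i] for i in ids]


ids = encode("Hello world!")
vectors = embed(ids)
print(tokenize("Hello world!"))       # ['Hello', ' world', '!']
print(ids)                            # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 8
```

Note that the space belongs to the token " world", which is why token boundaries rarely line up with word boundaries.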
What's a token?
A token is usually a piece of a word, not a whole word. Common tokenizers (like GPT's BPE or SentencePiece) split text into sub-word chunks chosen by frequency analysis on the training corpus.
Text → Tokens → Token IDs
Each token (including its leading space) becomes a single integer the model can look up.
Rule of thumb: 1 token ≈ 4 characters of English ≈ 0.75 words. So 1,000 tokens is about 750 words. Non-English languages tokenize less efficiently (sometimes 2–3× more tokens for the same meaning).
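The rule of thumb translates directly into arithmetic. The helper names below are made up for this sketch, and the ratios are rough averages for English only; for an exact count you would run the model's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # rough average for English text
WORDS_PER_TOKEN = 0.75   # rough average for English text


def estimate_tokens_from_chars(text):
    """Estimate the token count from the character length of the text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count):
    """Estimate the token count from a word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_words_from_tokens(token_count):
    """Estimate how many English words fit in a token budget."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_words_from_tokens(1000))      # 750
print(estimate_tokens_from_words(750))       # 1000
print(estimate_tokens_from_chars("a" * 10))  # 3
```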
This matters because:
- You pay per token. APIs bill on input + output tokens.
- Context windows are token budgets. A 128K context window means 128,000 tokens, not 128,000 words.
- Tokenization is sensitive to surface form. "GPT-4" might be one token; "Gpt-4" might be three. Capitalisation, whitespace, and Unicode all change the count.
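The billing and context-window points can be sketched as two small checks. The prices here are hypothetical placeholders, not any provider's actual rates, and the assumption that generated tokens share the same window as the prompt holds for many models but should be checked against the documentation of the one you use.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one API call when input and output tokens are billed separately."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


def fits_in_context(prompt_tokens, max_output_tokens, context_window=128_000):
    """Assumes the prompt and the generated output share one token budget."""
    return prompt_tokens + max_output_tokens <= context_window


# Hypothetical prices: 2.50 per million input tokens, 10.00 per million output tokens.
cost = request_cost(10_000, 1_000, 2.50, 10.00)
print(f"{cost:.3f}")                    # 0.035
print(fits_in_context(120_000, 4_000))  # True
print(fits_in_context(127_000, 2_000))  # False
```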
Embeddings: meaning becomes geometry
Once you have a token ID, the model looks up its embedding vector — a high-dimensional point in "meaning space." The remarkable thing is that this space has structure:
- Similar words sit near each other (dog and puppy are close)
- Relationships become vector arithmetic: king - man + woman ≈ queen
- Concepts cluster geometrically (animals over here, colours over there)
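The three properties above can be illustrated with hand-made vectors. This is only a sketch: the three dimensions (royalty, gender, animal) and every number are invented so the arithmetic is easy to follow. Learned embeddings have far more dimensions, none of them labelled, and the analogy only holds approximately.

```python
import math

# Invented 3-D "embeddings": (royalty, gender, animal). Not from a real model.
EMB = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "dog":   [0.0,  0.1, 0.9],
    "puppy": [0.0,  0.0, 0.8],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = (w for w in EMB if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(EMB[w], target))


print(analogy("king", "man", "woman"))                                    # queen
print(cosine(EMB["dog"], EMB["puppy"]) > cosine(EMB["dog"], EMB["king"]))  # True
```

Excluding the three input words from the candidates is the usual convention for this kind of analogy test; without it, the nearest vector is often one of the inputs.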
2D projection of embedding space: similar words cluster, and words near each other in this space share semantic content.
The model doesn't reason about words. It does linear algebra (plus a few simple nonlinear functions) on these vectors. That's how a single architecture can handle translation, summarisation, code, and math: they all become geometry problems in the same vector space.
Why this matters in practice
Embeddings are the foundation of RAG (retrieval-augmented generation). When you build a "chat with your docs" feature, you embed chunks of your documents into vectors (one vector per chunk, usually from a dedicated embedding model), embed the user's question into another vector, and find the chunks whose vectors are nearest to the question vector. Those chunks are then handed to the LLM as context for its answer. Meaning is similarity in this space.
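The retrieval step can be sketched as a nearest-neighbour search. One loud assumption: `embed` below is a stand-in that just counts words, so it only matches literal word overlap. A real system would replace it with a call to an embedding model, which also matches paraphrases. The sample chunks and function names are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: a sparse word-count vector (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks whose vectors are closest to the question vector."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]


chunks = [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
    "To reset your password, open the account settings page.",
]
print(retrieve("How do I reset my password?", chunks, k=1))
# ['To reset your password, open the account settings page.']
```

In practice the chunk vectors are computed once and stored in a vector index, so only the question needs embedding at query time.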
You'll meet embeddings again in Level 3. For now: tokens are integers, embeddings are vectors, and meaning is geometry.