Level 1 · Foundations

Tokens, Embeddings & the Vector World

Models don't see words. They see numbers. Here's how text becomes math.

LLMs don't read English. They read token IDs: integers that index into a giant lookup table. Before any prediction happens, your text is chopped into tokens, each token is mapped to an integer ID, and each ID is mapped to a vector (a list of floating-point numbers, typically hundreds to thousands long; 4,096 is a common size).

What's a token?

A token is usually a piece of a word, not a whole word. Common tokenizers (like GPT's BPE or SentencePiece) split text into sub-word chunks chosen by frequency analysis on the training corpus.
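To make "chosen by frequency analysis" concrete, here is a minimal sketch of byte-pair-encoding training on a toy corpus. It is an illustration under stated assumptions, not the tokenizer of any particular model: the corpus, the number of merges, and the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are all made up for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Start from single characters and repeatedly merge the most frequent pair."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Toy corpus (an assumption for illustration only).
merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "low" has become a single symbol, while the rarer "lowest" is still split into "low" plus individual letters. That is the sub-word behaviour described above: common strings earn their own token, rare ones are spelled out in pieces.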

Text → Tokens → Token IDs

"The cat sat on the mat."
Tokens:    "The" | " cat" | " sat" | " on" | " the" | " mat" | "."
Token IDs: 464 | 3797 | 7731 | 319 | 262 | 2603 | 13

Each token (including its leading space) becomes a single integer the model can look up.
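The whole text → tokens → IDs → vectors pipeline fits in a few lines. The sketch below uses a hypothetical seven-entry vocabulary with made-up IDs, a random 4-dimensional embedding table, and a greedy longest-match splitter. Real tokenizers apply learned merge rules rather than greedy matching, and real embedding tables are learned during training, so treat this as a picture of the data flow only.

```python
import random

# Hypothetical toy vocabulary with made-up IDs; real vocabularies hold
# tens of thousands of entries.
VOCAB = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4, " mat": 5, ".": 6}
EMBED_DIM = 4  # tiny on purpose; real models use hundreds to thousands

rng = random.Random(0)
# One row of EMBED_DIM floats per token ID. Random here, learned in a real model.
EMBEDDING_TABLE = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

def tokenize(text):
    """Greedy longest-match split of `text` into vocabulary entries."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in VOCAB if text.startswith(t, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers position {i}: {text[i:]!r}")
        tokens.append(match)
        i += len(match)
    return tokens

tokens = tokenize("The cat sat on the mat.")
token_ids = [VOCAB[t] for t in tokens]               # token -> integer
vectors = [EMBEDDING_TABLE[i] for i in token_ids]    # integer -> vector

print(tokens)     # ['The', ' cat', ' sat', ' on', ' the', ' mat', '.']
print(token_ids)  # [0, 1, 2, 3, 4, 5, 6]
print(len(vectors), "vectors of length", len(vectors[0]))
```

Note that the embedding lookup is plain list indexing: the token ID is literally the row number in the table.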

Rule of thumb: 1 token ≈ 4 characters of English ≈ 0.75 words. So 1,000 tokens is about 750 words. Non-English languages tokenize less efficiently (sometimes 2–3× more tokens for the same meaning).

This matters because:

  • You pay per token. APIs bill on input + output tokens.
  • Context windows are token budgets. A 128K context window means 128,000 tokens, not 128,000 words.
  • Tokenization is sensitive to surface form. "GPT-4" might be one token; "Gpt-4" might be three. Capitalisation, whitespace, and Unicode all change the count.
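The rule of thumb and the budgeting points above turn into simple arithmetic. This is a rough sketch: the helper names are invented for the example, and the prices passed to `estimate_cost` are placeholders, not any provider's real rates. For exact counts you would run the model's actual tokenizer.

```python
def tokens_from_chars(text, chars_per_token=4.0):
    """Rough English estimate: about 4 characters per token."""
    return round(len(text) / chars_per_token)

def tokens_from_words(word_count, words_per_token=0.75):
    """Rough English estimate: about 0.75 words per token."""
    return round(word_count / words_per_token)

def fits_context(input_tokens, max_output_tokens, context_window=128_000):
    """Input and output share the same token budget."""
    return input_tokens + max_output_tokens <= context_window

def estimate_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost in currency units, given hypothetical per-million-token prices."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

print(tokens_from_words(750))          # 1000
print(fits_context(120_000, 10_000))   # False: 130,000 tokens exceed a 128K window
print(fits_context(100_000, 10_000))   # True
print(estimate_cost(10_000, 2_000, 1.0, 2.0))  # placeholder prices
```

For non-English text, the 4-characters-per-token figure is too optimistic, so pass a smaller `chars_per_token` or count with the real tokenizer.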

Embeddings: meaning becomes geometry

Once you have a token ID, the model looks up its embedding vector — a high-dimensional point in "meaning space." The remarkable thing is that this space has structure:

  • Similar words sit near each other (dog and puppy are close)
  • Relationships become vector arithmetic: king - man + woman ≈ queen
  • Concepts cluster geometrically (animals over here, colours over there)
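Here is a small sketch of both ideas, nearness and vector arithmetic, using hand-made 3-dimensional vectors. The numbers and the axis meanings (roughly "royalty", "masculinity", "animal") are invented so the effect is easy to see. Real embeddings are learned, have far more dimensions, and their axes are not individually interpretable, and the analogy only holds approximately for some relationships.

```python
import math

# Hand-made toy embeddings (assumption: values chosen for illustration only).
EMBEDDINGS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "dog":   [0.0, 0.5, 0.9],
    "puppy": [0.0, 0.5, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, embeddings, exclude=()):
    """Word whose vector points most nearly in the same direction as `target`."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Similar words sit near each other.
print(cosine(EMBEDDINGS["dog"], EMBEDDINGS["puppy"]) > cosine(EMBEDDINGS["dog"], EMBEDDINGS["king"]))  # True

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in
          zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])]
print(nearest(target, EMBEDDINGS, exclude=("king", "man", "woman")))  # queen
```

The input words are excluded from the search because, in practice, the nearest vector to the result is often one of the inputs themselves.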

[Figure: 2D projection of embedding space, where similar words cluster. Clusters shown: animals (dog, cat, puppy, lion), colours (red, blue, green), verbs (run, walk, jump).]

Meaning becomes geometry. Words near each other in this space share semantic content.

The model doesn't reason about words. It does linear algebra on these vectors. That's how a single architecture can handle translation, summarisation, code, and math — they all become geometry problems in the same vector space.

Why this matters in practice

Embeddings are the foundation of RAG (retrieval-augmented generation). When you build a "chat with your docs" feature, you're embedding chunks of your documents into vectors, embedding the user's question into another vector, and finding the chunks whose vectors are nearest to the question vector, typically by cosine similarity. Similarity of meaning becomes proximity in this space.
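The retrieval step can be sketched in a few lines. To keep the example self-contained, `embed` below is a stand-in that just counts words; it is an assumption for illustration, and a real system would call an embedding model there instead. The document chunks and the question are also made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks whose vectors are most similar to the question's."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is open Monday to Friday.",
    "Shipping takes three to five business days.",
]
top = retrieve("How long do refunds take?", chunks)
print(top[0])  # Refunds are issued within 14 days of a return request.
```

Word counts only match on shared words, so this stand-in would miss a question phrased as "When do I get my money back?". Learned embeddings are what let retrieval match on meaning rather than exact wording; the ranking logic around them stays the same.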

You'll meet embeddings again in Level 3. For now: tokens are integers, embeddings are vectors, and meaning is geometry.

Knowledge Check


Q1. Approximately how many words is 1,000 tokens of English text?

Q2. What is an embedding?

Q3. Why might non-English text cost more to process?

Q4. Why is the famous example king - man + woman ≈ queen significant?

