
The Transformer, in Plain English

Attention is all you need — but what does that actually mean?

The transformer is the architecture behind nearly every modern LLM, including GPT, Claude, Llama, Gemini, and DeepSeek. Its job is to take a sequence of token embeddings and output a prediction for the next token. The mechanism that makes this work is attention.

The problem attention solves

Imagine reading the sentence: "The trophy didn't fit in the suitcase because it was too big."

What does "it" refer to? The trophy. Now: "The trophy didn't fit in the suitcase because it was too small." Now "it" refers to the suitcase. The same word, in the same position, takes on a different meaning depending on context.

A model that processes words one at a time (like older RNNs) struggles with this, because information about earlier words has to be carried forward step by step. The transformer doesn't process words in sequence. Within each layer, it lets every word look at every other word simultaneously and decide what's relevant.

How attention works

For each token, the model computes three vectors:

  • Query (Q) — what am I looking for?
  • Key (K) — what do I contain?
  • Value (V) — what information do I carry?
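As a rough illustration of where these three vectors come from: each one is typically produced by multiplying the token's embedding by its own learned weight matrix. The sketch below uses plain Python and made-up numbers; the names `W_q`, `W_k`, `W_v`, and `matvec` are assumptions for this example, and a trained model would learn the weights from data.

```python
# Minimal sketch: turning one token embedding into Query, Key, and Value vectors.
# All sizes and weight values are invented for illustration.

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical 4-dimensional embedding for a single token.
embedding = [1.0, 0.0, 2.0, -1.0]

# Hypothetical projection matrices mapping 4 dimensions down to 2.
W_q = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0]]
W_k = [[0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
W_v = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]

query = matvec(W_q, embedding)  # "what am I looking for?"
key = matvec(W_k, embedding)    # "what do I contain?"
value = matvec(W_v, embedding)  # "what information do I carry?"

print(query, key, value)
```

The same three matrices are applied to every token in the sequence, so each token ends up with its own Q, K, and V.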

Figure: Self-attention from "sat" to all other tokens. The: 0.12, cat: 0.62, sat: 1.00, on: 0.18, the: 0.08, mat: 0.45

For each token, the model computes how much it should "look at" every other token. Tokens with strong weights are mixed in; tokens with weak weights are largely ignored.

Then, for every pair of tokens, it computes the dot product Q · K: how relevant is token B's content to token A's question? A softmax turns these scores into weights. Tokens with high relevance get their values (V) mixed strongly into the output; irrelevant tokens contribute almost nothing.
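The score-then-mix step can be sketched in a few lines of plain Python. This is a simplified single-query version with invented 2-dimensional vectors; the function names `softmax` and `attend` are assumptions for this example. Dividing by the square root of the vector size is the scaling used in standard scaled dot-product attention, which the text above does not spell out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query token."""
    scale = math.sqrt(len(query))
    # One relevance score per key: Q . K, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors for three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # points the same way as the first key

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

Here the second key is orthogonal to the query, so it receives the smallest weight and its value contributes least to the output.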

In our trophy/suitcase example, an attention head can learn to attend from "it" to whichever noun makes the sentence coherent given the adjective.

Stacking the layers

A single attention operation isn't enough. Real transformers stack:

  • Multi-head attention: many parallel attention "heads," each free to learn a different pattern (for example, one might track pronouns, another syntax, another semantics)
  • Feed-forward networks: fully-connected layers that follow the attention step in each block, where the model is thought to store much of its "knowledge"
  • Residual connections + layer norm: engineering tricks that make deep stacks trainable
  • Dozens of layers: GPT-3 has 96, Llama 3 70B has 80, and some frontier models have 100+
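To make the wiring of these pieces concrete, here is a structural sketch of one block and a small stack. It is deliberately simplified and rests on several assumptions: it operates on a single vector rather than a sequence, `attention_stub` is a placeholder that returns its input, `feed_forward` is a bare ReLU instead of learned layers, and layer norm is applied before each sublayer (the "pre-norm" arrangement; some models apply it afterwards instead).

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attention_stub(x):
    """Placeholder for multi-head attention; here it simply copies its input."""
    return list(x)

def feed_forward(x):
    """Stand-in for the feed-forward network: an elementwise ReLU.
    A real one uses learned linear layers around the nonlinearity."""
    return [max(0.0, v) for v in x]

def transformer_block(x):
    # Residual connection: add each sublayer's output back onto its input.
    a = attention_stub(layer_norm(x))
    x = [xi + ai for xi, ai in zip(x, a)]
    f = feed_forward(layer_norm(x))
    x = [xi + fi for xi, fi in zip(x, f)]
    return x

def transformer_stack(x, num_layers):
    """Apply the same kind of block repeatedly, one layer after another."""
    for _ in range(num_layers):
        x = transformer_block(x)
    return x

hidden = [0.5, -1.0, 2.0, 0.0]  # hypothetical hidden state for one token
out = transformer_stack(hidden, num_layers=3)
print(len(out))
```

The point of the residual additions is that each layer only has to learn an adjustment to the running representation, which is part of what keeps very deep stacks trainable.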

Each layer refines the representation. As a rough tendency, early layers pick up syntax (parts of speech, parsing), middle layers track entities and relationships, and late layers shape the representation from which the final next-token distribution is computed.
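The last step, turning the final layer's scores into a next-token distribution, can be sketched as follows. The four-word vocabulary and the score values (usually called logits) are invented for illustration, and picking the single most likely token is only one of several possible decoding choices.

```python
import math

# Hypothetical vocabulary and made-up scores (logits) from the final layer.
vocab = ["mat", "dog", "moon", "sofa"]
logits = [3.0, 0.5, -1.0, 2.0]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding: choose the token with the highest probability.
next_token = vocab[probs.index(max(probs))]
print(next_token)  # prints "mat", the token with the highest logit
```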

Why this won

Transformers parallelise beautifully on GPUs (no sequential RNN bottleneck), scale predictably (more layers + more data = better), and can relate distant tokens directly, which helps with long contexts. Most competing architectures were swept aside in the 2017–2023 cycle. Until something fundamentally better appears, nearly every LLM you'll use is a transformer.

Knowledge Check

Q1. What is the core mechanism that distinguishes transformers from older architectures?

Q2. What do Query, Key, and Value represent in attention?

Q3. Why are multiple attention heads used?

Q4. Why did transformers replace RNNs for language modelling?

LLMAtlas — The Open Ecosystem Workspace for LLMs