The Transformer, in Plain English
Attention is all you need — but what does that actually mean?
The transformer is the architecture behind every modern LLM — GPT, Claude, Llama, Gemini, DeepSeek, all of them. Its job is to take a sequence of token embeddings and output a prediction for the next token. The mechanism that makes this work is attention.
The problem attention solves
Imagine reading the sentence: "The trophy didn't fit in the suitcase because it was too big."
What does "it" refer to? The trophy. Now change one word: "The trophy didn't fit in the suitcase because it was too small." Now "it" refers to the suitcase. The same word, in the same position, means something entirely different depending on context.
A model that processes words one at a time (like older RNNs) struggles with this, because information about distant words has to survive many sequential steps. The transformer doesn't process words in sequence. Instead, it lets every word look at every other word simultaneously and decide what's relevant.
How attention works
For each token, the model computes three vectors:
- Query (Q) — what am I looking for?
- Key (K) — what do I contain?
- Value (V) — what information do I carry?
Figure: self-attention from the token "sat" to all other tokens. Each token gets a weight for how much "sat" should "look at" it; strongly weighted tokens are mixed in, weakly weighted ones are ignored.
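Then, for every pair of tokens, it computes the dot product Q · K of token A's query with token B's key: how relevant is token B's content to token A's question? The scores are scaled and passed through a softmax, so each token's weights over the sequence are positive and sum to 1. Tokens with high relevance get their values mixed strongly into the output; irrelevant tokens contribute almost nothing.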
In our trophy/suitcase example, an attention head can learn that "it" should attend strongly to whichever noun makes the sentence coherent given the adjective.
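The query/key/value steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not how any particular model implements it: the projection matrices W_q, W_k, W_v are random stand-ins for learned weights, the sizes are toy values, and the optional causal flag (an addition not discussed above) shows the masking that GPT-style models use so that a token cannot look at later tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head self-attention over X with shape (seq_len, d_model)."""
    Q = X @ W_q  # what is each token looking for?
    K = X @ W_k  # what does each token contain?
    V = X @ W_v  # what information does each token carry?
    d_head = Q.shape[-1]
    # scores[a, b] = relevance of token b's key to token a's query.
    scores = Q @ K.T / np.sqrt(d_head)
    if causal:
        # Hide positions to the right of each token (strict upper triangle).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted mix of the values

# Toy sizes and random matrices standing in for learned weights (assumptions).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # attention weights per token sum to 1
```

Row a of weights is the attention pattern for token a: the same kind of picture as the "sat" figure, one row per token.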
Stacking the layers
A single attention operation isn't enough. Real transformers stack:
- Multi-head attention: many parallel attention "heads," each free to learn a different pattern (for example, one may track pronouns, another syntax, another semantics)
- Feed-forward networks: fully connected layers that follow the attention step inside every block, and where the model is thought to store much of its "knowledge"
- Residual connections + layer norm: engineering tricks that make deep stacks trainable
- Dozens of layers: GPT-3 has 96, Llama 3 70B has 80, and the largest models exceed 100 (Llama 3.1 405B has 126)
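To make the list concrete, here is a rough NumPy sketch of how these pieces could fit together in one block, and how blocks stack. Everything specific is an assumption chosen for illustration: the pre-norm ordering, the ReLU activation, the missing biases and learned layer-norm parameters, the absence of a causal mask, the toy sizes, and the helper names (init_block, transformer_block). Real models differ in these details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token vector; learned scale and shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["W_q"]), split(x @ p["W_k"]), split(x @ p["W_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # one score matrix per head
    heads = softmax(scores, axis=-1) @ v                 # (n_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["W_o"]                             # mix the heads back together

def feed_forward(x, p):
    # Two fully connected layers applied to every token independently.
    return np.maximum(0.0, x @ p["W_1"]) @ p["W_2"]

def transformer_block(x, p, n_heads):
    # Residual connections: each sub-layer adds to its input instead of replacing it.
    x = x + multi_head_attention(layer_norm(x), p, n_heads)
    x = x + feed_forward(layer_norm(x), p)
    return x

def init_block(rng, d_model, d_ff, scale=0.02):
    # Random stand-ins for learned weights (hypothetical helper).
    shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
              "W_v": (d_model, d_model), "W_o": (d_model, d_model),
              "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}
    return {name: rng.normal(scale=scale, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_heads, n_layers = 6, 16, 64, 4, 3
blocks = [init_block(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
h = x
for p in blocks:                         # "stacking" is just repeated application
    h = transformer_block(h, p, n_heads)
print(h.shape)                           # shape is unchanged from layer to layer
```

Because every block maps a (seq_len, d_model) array to another of the same shape, going from 3 layers to 96 is a matter of lengthening the list, which is part of why the design scales so easily.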
Each layer refines the representation. As a rough tendency, early layers pick up syntax (parts of speech, parsing), middle layers track entities and relationships, and late layers shape the representation that is turned into the final next-token distribution.
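The last step, turning the final layer's output into a next-token distribution, might look like the sketch below. The five-word vocabulary, the random final_hidden array, and the W_unembed projection matrix are all hypothetical stand-ins; a real model uses learned weights and a vocabulary of tens of thousands of tokens or more.

```python
import numpy as np

# Hypothetical toy vocabulary and random stand-ins for learned values.
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
d_model = 8
final_hidden = rng.normal(size=(4, d_model))       # last layer's output for 4 tokens
W_unembed = rng.normal(size=(d_model, len(vocab)))  # maps a vector to one score per word

# Only the last position is needed to predict the token that comes next.
logits = final_hidden[-1] @ W_unembed

# Softmax turns the raw scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_token = vocab[int(np.argmax(probs))]  # greedy choice: the most likely token
for word, prob in zip(vocab, probs):
    print(f"{word}: {prob:.3f}")
print("prediction:", next_token)
```

Picking the argmax is only the simplest option; in practice models usually sample from probs rather than always taking the top token.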
Why this won
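Transformers parallelise beautifully on GPUs (no sequential RNN bottleneck during training), scale predictably (more parameters, more data, and more compute give steadily better results), and handle long contexts well, since any token can attend directly to any other, although the cost of attention grows quadratically with sequence length. Competing architectures were largely swept aside in the 2017–2023 cycle. Until something fundamentally better appears, every LLM you'll use is a transformer.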