Transformer
- Definition
- The neural network architecture behind virtually all modern language and multi-modal models. Introduced in Google's 2017 'Attention Is All You Need' paper, transformers use self-attention to process sequences in parallel.
- Why it matters
- The transformer is arguably the most consequential neural network architecture yet designed. It enabled the scaling that produced GPT, Claude, Gemini, and every other modern LLM. Before transformers, recurrent networks (RNNs and LSTMs) processed text one token at a time, limiting parallelism and making it impractical to train efficiently on massive datasets. The transformer's self-attention mechanism processes all positions simultaneously, unlocking massive parallelization on GPUs. Understanding transformers helps you understand why AI works the way it does: why context windows matter (attention cost scales quadratically with sequence length), why GPUs are needed (the architecture is dominated by matrix multiplications, which parallelize well), and why model size correlates with capability (more parameters encode more knowledge).
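The self-attention idea above can be sketched in a few lines of NumPy. This is a minimal single-head, unmasked version (real models add multiple heads, masking, and learned layers); the weight matrices here are random placeholders, not a trained model. Note that the score matrix has shape (seq_len, seq_len), which is exactly where the quadratic cost in sequence length comes from.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: one head, no mask (illustrative sketch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len): every token vs. every token
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # each output mixes information from all positions

# Toy input: 6 tokens, 8-dimensional embeddings, random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8)
```

Because every row of the score matrix is computed independently, all positions can be processed at once, which is what lets GPUs chew through long sequences in parallel.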
- In practice
- The 'Attention Is All You Need' paper by Vaswani et al. (2017) introduced the transformer at Google. Within five years, transformers had displaced earlier architectures for language tasks and expanded to vision (ViT), audio (Whisper), and multi-modal models (Gemini). Every major LLM, including GPT-4, Claude, Gemini, Llama, and Mistral, is a transformer. The architecture has proven remarkably resilient: despite extensive research into alternatives (state-space models like Mamba, hybrid architectures), no replacement has demonstrated clear superiority at scale. Modifications like grouped-query attention, rotary position embeddings, and the SwiGLU activation have improved efficiency, but the core architecture remains largely unchanged from 2017.
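To make one of those modifications concrete, here is a rough NumPy sketch of a SwiGLU feed-forward block of the kind used in Llama-style models: the input is projected twice, one projection is passed through the swish (SiLU) gate and multiplied elementwise with the other, then projected back down. The dimensions and random weights are illustrative assumptions, not values from any real model.

```python
import numpy as np

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: swish(x @ W_gate) * (x @ W_up), then project down."""
    swish = lambda z: z / (1.0 + np.exp(-z))  # SiLU activation
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

# Toy dimensions: 3 tokens, model width 8, hidden width 16 (all illustrative).
rng = np.random.default_rng(1)
d_model, d_ff = 8, 16
x = rng.standard_normal((3, d_model))
W_gate = rng.standard_normal((d_model, d_ff))
W_up   = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y.shape)  # (3, 8)
```

The gating lets the network learn which features to pass through at each position, a small change that improved quality over the original ReLU feed-forward layers without altering the overall transformer structure.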
Related terms
Attention mechanism
The core innovation inside transformers that lets a model weigh the relevance of every token against every other token in a sequence. Attention is what makes modern LLMs understand context and long-range dependencies.
Neural network
A computing architecture inspired by the brain, made of layers of interconnected nodes (neurons) that learn patterns from data. Neural networks are the fundamental building block of all modern AI.
LLM (Large Language Model)
A neural network trained on massive text corpora to predict and generate language. LLMs like GPT-4, Claude, and Gemini are the foundation of the current AI wave, powering chatbots, coding tools, and enterprise automation.
Deep learning
A subset of machine learning that uses neural networks with many layers to learn complex patterns from data. Deep learning powers virtually all modern AI breakthroughs, from image recognition to language generation.