
Transformer

Definition
The neural network architecture behind virtually all modern language and multi-modal models. Introduced in Google's 2017 paper 'Attention Is All You Need', transformers use self-attention to process entire sequences in parallel rather than one token at a time.
Why it matters
The transformer is the most consequential neural network architecture ever designed. It enabled the scaling that produced GPT, Claude, Gemini, and every other modern LLM. Before transformers, recurrent networks (RNNs, LSTMs) processed text one token at a time, limiting parallelism and making efficient training on massive datasets impractical. The transformer's self-attention mechanism processes all positions simultaneously, unlocking massive parallelization on GPUs. Understanding transformers helps you understand why AI works the way it does: why context windows matter (attention cost scales quadratically with sequence length), why GPUs are needed (the computation is dominated by highly parallel matrix multiplication), and why model size correlates with capability (more parameters encode more knowledge).
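The core computation is small enough to sketch. Below is a minimal NumPy version of scaled dot-product self-attention for a single head, with no masking or multi-head structure (the projection-matrix names like `w_q` are illustrative, not from any library). Note the `q @ k.T` line: every position scores against every other position, which is where the quadratic cost in sequence length comes from.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q = x @ w_q  # queries, one per position
    k = x @ w_k  # keys
    v = x @ w_v  # values
    # Every position attends to every other position: an
    # (seq_len x seq_len) score matrix -- quadratic in sequence length.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of values; all rows are
    # computed at once, which is what GPUs parallelize so well.
    return weights @ v

# Toy usage with random weights (a real model learns these).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.standard_normal((seq_len, d_model))
out = self_attention(x,
                     rng.standard_normal((d_model, d_k)),
                     rng.standard_normal((d_model, d_k)),
                     rng.standard_normal((d_model, d_k)))
```

Note that the entire sequence is processed with a handful of matrix multiplications, with no loop over positions; that is the parallelism the paragraph above describes.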
In practice
The 'Attention Is All You Need' paper by Vaswani et al. (2017) introduced the transformer at Google Brain. Within five years, transformers had replaced all previous architectures for language tasks and expanded to vision (ViT), audio (Whisper), and multi-modal models (Gemini). Every major LLM, including GPT-4, Claude, Gemini, Llama, and Mistral, is a transformer. The architecture has proven remarkably resilient: despite extensive research into alternatives (state-space models like Mamba, hybrid architectures), no replacement has demonstrated clear superiority at scale. Modifications like grouped-query attention, rotary position embeddings, and SwiGLU activations have improved transformer efficiency, but the core architecture remains essentially unchanged from 2017.
