Attention mechanism
- Definition
- The core innovation inside transformers that lets a model weigh the relevance of every token against every other token in a sequence. Attention is what makes modern LLMs understand context and long-range dependencies.
- Why it matters
- Attention is why modern AI works. Before attention, neural networks processed sequences sequentially, token by token, and struggled to carry information across long documents. Attention lets a model focus on the relevant parts of its input regardless of distance, which is why you can paste a 100-page contract into Claude and ask about a clause on page 87. Understanding attention matters for practitioners because it explains key trade-offs: self-attention scales quadratically with sequence length, which is why context window size, inference cost, and latency are all interconnected. Nearly every major optimization in modern AI, from flash attention to KV caching, is ultimately about making attention faster and cheaper.
- In practice
- The original 2017 Transformer paper introduced multi-head self-attention, where the model runs multiple attention computations in parallel to capture different relationship types. Since then, variants like grouped-query attention (used in Llama 2+), multi-query attention (used in PaLM), and sliding-window attention (used in Mistral) have emerged to reduce compute costs. Flash Attention, developed by Tri Dao at Stanford, made attention 2-4x faster by optimizing memory access patterns, and is now standard in virtually every training and inference stack.
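The scaled dot-product computation at the heart of all these variants can be sketched in a few lines of NumPy. This is a toy, single-head illustration (not a production implementation): note that the scores matrix is seq_len x seq_len, which is exactly the quadratic cost discussed above.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V: (seq_len, d) arrays. The scores matrix is
    (seq_len, seq_len) -- quadratic in sequence length."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n, n) pairwise relevance
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n, d) weighted mix of values

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out = attention(Q, K, V)
print(out.shape)  # (6, 8)
```

Multi-head attention simply runs several independent copies of this computation on different learned projections of the input and concatenates the results.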
Related terms
Transformer
The neural network architecture behind virtually all modern language and multi-modal models. Introduced in Google's 2017 'Attention Is All You Need' paper, transformers use self-attention to process sequences in parallel.
Flash attention
An optimized implementation of the attention mechanism that reduces memory usage and increases speed by restructuring how attention computations access GPU memory, avoiding the need to materialize the full attention matrix.
KV cache
A memory structure that stores the key and value matrices from previous attention computations during autoregressive generation, avoiding redundant recalculation as each new token is produced. KV caching is essential for efficient inference.
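The caching idea can be sketched as follows, using a hypothetical `KVCache` class for illustration (real caches store per-layer, per-head tensors on the GPU, but the principle is the same: append once per token, never recompute):

```python
import numpy as np

class KVCache:
    """Toy KV cache: keeps the key/value rows of already-generated
    tokens so each decoding step only computes K and V for the
    newest token. Illustrative sketch, not a real library's API."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k, v):
        # store the new token's key/value row (computed once, reused forever)
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

    def attend(self, q):
        # the newest query attends over ALL cached tokens
        scores = self.keys @ q / np.sqrt(self.keys.shape[1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.values

d = 4
rng = np.random.default_rng(1)
cache = KVCache(d)
for step in range(3):  # simulate three autoregressive decoding steps
    k = rng.normal(size=(1, d))
    v = rng.normal(size=(1, d))
    cache.append(k, v)
    out = cache.attend(rng.normal(size=d))
print(cache.keys.shape)  # (3, 4) -- one cached row per generated token
```

Without the cache, step N would recompute keys and values for all N previous tokens, making generation quadratic in output length rather than linear.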
Context window
The maximum number of tokens a model can process in a single request, including both the prompt and the response. Larger context windows (100K-2M tokens) let models ingest entire codebases or documents at once.