Flash Attention
- Definition
- An optimized implementation of the attention mechanism that reduces memory usage and increases speed by restructuring how attention computations access GPU memory, avoiding the need to materialize the full attention matrix.
- Why it matters
- Flash Attention is one of those infrastructure innovations that most people never see but everyone benefits from. By making attention 2-4x faster and dramatically reducing memory usage, it lowered the cost of both training and inference for every transformer-based model. Without Flash Attention, the current generation of long-context models (100K+ tokens) would be economically impractical. For infrastructure teams, Flash Attention is not optional; it is the baseline. Any inference stack not using it is leaving 50-75% of potential throughput on the table. The broader lesson: hardware-aware algorithm design can deliver more practical value than architectural innovation.
- In practice
- Tri Dao and collaborators at Stanford published Flash Attention in 2022, followed by Flash Attention 2 in 2023, achieving 2-4x speedups on attention computation. By 2024, Flash Attention was integrated into every major training and inference framework, including PyTorch, JAX, vLLM, TGI, and llama.cpp. Flash Attention 3, optimized for NVIDIA H100 GPUs, pushed utilization substantially higher still. Hardware-aware attention kernels of this kind are what make long-context offerings such as Google's Gemini 1.5 Pro 1M-token window economically practical. Complementary approaches like Ring Attention and PagedAttention address related bottlenecks. For model serving, Flash Attention is now a table-stakes optimization.
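The core trick described above can be sketched in a few lines: process keys and values in blocks while carrying a running maximum and softmax normalizer per query row, so the full n×n score matrix is never materialized at once. A minimal NumPy sketch for illustration only; the real kernels fuse these steps inside a single GPU kernel and tile over queries as well:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n x n) attention matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # Flash-Attention-style streaming over K/V blocks with an
    # "online softmax": only (n x block) score tiles ever exist.
    n, d = Q.shape
    O = np.zeros((n, d))       # unnormalized output accumulator
    m = np.full(n, -np.inf)    # running row-wise max of scores
    l = np.zeros(n)            # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)              # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

Both functions compute mathematically identical outputs; the tiled version trades the O(n²) score matrix for O(n) running statistics, which is exactly what lets attention fit in fast on-chip memory.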
Related terms
Attention mechanism
The core innovation inside transformers that lets a model weigh the relevance of every token against every other token in a sequence. Attention is what makes modern LLMs understand context and long-range dependencies.
KV cache
A memory structure that stores the key and value matrices from previous attention computations during autoregressive generation, avoiding redundant recalculation as each new token is produced. KV caching is essential for efficient inference.
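The caching idea can be illustrated with a toy generation loop (the function names and weight matrices here are illustrative stand-ins, not a real model's API): each step projects and appends only the new token's key and value, then attends over everything cached so far, instead of recomputing keys and values for the entire prefix.

```python
import numpy as np

def attend(q, K, V):
    # One query vector attending over all cached keys/values.
    s = (K @ q) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def generate_step(x_t, Wq, Wk, Wv, K_cache, V_cache):
    # Append this token's key/value once; reuse all earlier entries.
    K_cache.append(x_t @ Wk)
    V_cache.append(x_t @ Wv)
    q = x_t @ Wq
    return attend(q, np.stack(K_cache), np.stack(V_cache))
```

Without the cache, step t would redo t key/value projections; with it, each step does exactly one, which is why KV caching is central to inference efficiency.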
Transformer
The neural network architecture behind virtually all modern language and multi-modal models. Introduced in Google's 2017 'Attention Is All You Need' paper, transformers use self-attention to process sequences in parallel.
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.