Models & Architecture Deep Dive

Attention mechanism

Definition
The core innovation inside transformers that lets a model weigh the relevance of every token against every other token in a sequence. Attention is what makes modern LLMs understand context and long-range dependencies.
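The "weigh every token against every other token" idea can be sketched in a few lines. This is a minimal, illustrative NumPy version of scaled dot-product attention (the formulation from the 2017 Transformer paper); the shapes and random inputs are assumptions for the sake of the demo, not any particular model's configuration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query is scored against every key; the softmax weights
    decide how much of each value vector flows into the output."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) relevance matrix
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of value vectors per token

# Toy inputs: 4 tokens, 8-dimensional embeddings (arbitrary sizes)
rng = np.random.default_rng(0)
seq_len, d = 4, 8
Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one context-mixed vector per token
```

Note that the score matrix relates every token to every other token regardless of position, which is what gives attention its distance-independent reach.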
Why it matters
Attention is why modern AI works. Before attention, neural networks processed sequences linearly and struggled with long documents. Attention lets a model focus on the relevant parts of its input regardless of distance, which is why you can paste a 100-page contract into Claude and ask about a clause on page 87. Understanding attention matters for practitioners because it explains key trade-offs: self-attention scales quadratically with sequence length, which is why context window size, inference cost, and latency are all interconnected. Every optimization in modern AI, from flash attention to KV caching, is ultimately about making attention faster and cheaper.
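The quadratic scaling mentioned above can be made concrete with a back-of-envelope calculation. The sequence lengths and fp16 assumption here are illustrative, not tied to any specific model:

```python
# The attention score matrix has one entry per (query, key) pair,
# so memory and compute grow with seq_len squared.
for n in [1_000, 10_000, 100_000]:
    scores = n * n           # entries in the n-by-n score matrix
    mb = scores * 2 / 1e6    # rough size in MB at 2 bytes each (fp16)
    print(f"seq_len={n:>7}: {scores:.1e} scores, roughly {mb:,.0f} MB per head")
```

Growing the context 10x grows the score matrix 100x, which is why long context windows are expensive and why so much engineering effort targets this bottleneck.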
In practice
The original 2017 Transformer paper introduced multi-head self-attention, where the model runs multiple attention computations in parallel to capture different relationship types. Since then, variants like grouped-query attention (used in Llama 2+), multi-query attention (used in PaLM), and sliding-window attention (used in Mistral) have emerged to reduce compute costs. Flash Attention, developed by Tri Dao at Stanford, made attention 2-4x faster by optimizing memory access patterns, and is now standard in virtually every training and inference stack.
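One way to see why grouped-query and multi-query attention cut inference cost is to compare KV-cache sizes, since the cache scales with the number of key/value heads. The model configuration below (32 layers, 32 query heads, head dimension 128, 8k context, fp16) is a hypothetical example, not the spec of any named model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    """KV cache size: two tensors (K and V) per layer, per KV head,
    with one head_dim-sized vector cached per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

# Hypothetical 32-layer model with 32 query heads and an 8k context.
# Only the number of KV heads changes between the three variants.
for name, n_kv in [("multi-head (MHA)", 32),
                   ("grouped-query (GQA)", 8),
                   ("multi-query (MQA)", 1)]:
    gb = kv_cache_bytes(32, n_kv, 128, 8192) / 1e9
    print(f"{name:>20}: {n_kv:>2} KV heads -> {gb:.2f} GB cache")
```

Under these assumptions the cache shrinks from about 4.3 GB (MHA) to about 1.1 GB (GQA with 8 KV heads) to about 0.13 GB (MQA), which is the memory headroom that lets servers batch more concurrent requests.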
