KV cache
- Definition
- A memory structure that stores the key and value tensors computed for earlier tokens at each attention layer during autoregressive generation, so they are not recomputed as each new token is produced. KV caching is essential for efficient inference.
- Why it matters
- KV caching is one of those infrastructure details that determine whether your AI application is economically viable. Without KV caching, each new token of a 1,000-token response would require recomputing the keys, values, and attention for the entire sequence so far: roughly O(n^2) attention work per generated token, repeated at every step. With KV caching, each new token computes only its own query, key, and value and attends against the cached keys and values, which is O(n) work per token and makes generation practical. But KV caches consume significant GPU memory, especially for long-context models: a 70B model with a 128K context window needs tens of gigabytes of KV cache per active request at full context. This memory constraint limits how many concurrent users a single GPU can serve, directly impacting inference cost and throughput.
- In practice
- vLLM introduced PagedAttention, which manages KV cache memory in fixed-size blocks much as an operating system manages virtual memory pages, achieving near-zero memory waste; its authors report 2-4x higher serving throughput as a result. Groq's LPU architecture keeps the KV cache in on-chip SRAM, contributing to its extreme inference speed. Techniques like GQA (grouped-query attention) and MQA (multi-query attention) reduce KV cache size by sharing key-value heads across query heads: GQA shares one key-value head per group of query heads, while MQA shares a single key-value head across all of them. Models in the Llama 2 and later, Mistral, and Gemini families use these techniques. For long-context models, KV cache compression and quantization are active research areas, since a 1M-token KV cache would otherwise require hundreds of gigabytes of memory.
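The memory figures above can be sanity-checked with a little arithmetic. The sketch below is an illustration, not any library's API: the `AttentionShape` class and `kv_cache_bytes` function are hypothetical names, and the model shape (80 layers, 64 query heads, 8 key-value heads, head dimension 128, 16-bit values) is an assumed configuration typical of a 70B-class model with GQA. The cache stores one key and one value vector per token, per layer, per key-value head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical container for the dimensions that set KV cache size."""
    num_layers: int
    num_kv_heads: int         # equals the query head count for MHA, fewer for GQA, 1 for MQA
    head_dim: int
    bytes_per_value: int = 2  # fp16/bf16; 1 would model an 8-bit quantized cache


def kv_cache_bytes(shape: AttentionShape, num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens of one request."""
    # The leading factor of 2 accounts for storing both a key and a value.
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_value)
    return per_token * num_tokens


GIB = 1024 ** 3

# Assumed 70B-class shape: same layers and head size, differing only in KV head count.
gqa = AttentionShape(num_layers=80, num_kv_heads=8, head_dim=128)
mha = AttentionShape(num_layers=80, num_kv_heads=64, head_dim=128)

for label, tokens in [("128K", 128 * 1024), ("1M", 1024 * 1024)]:
    gqa_gib = kv_cache_bytes(gqa, tokens) / GIB
    mha_gib = kv_cache_bytes(mha, tokens) / GIB
    print(f"{label} tokens: GQA {gqa_gib:.0f} GiB, full multi-head {mha_gib:.0f} GiB")
```

Under these assumptions a full 128K-token context costs 40 GiB per request with GQA and 320 GiB with one key-value head per query head, and a 1M-token context costs 320 GiB even with GQA. Setting `bytes_per_value=1` models the halving that an 8-bit quantized cache would give.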
Related terms
Attention mechanism
The core innovation inside transformers that lets a model weigh the relevance of every token against every other token in a sequence. Attention is what makes modern LLMs understand context and long-range dependencies.
Inference
The process of running a trained model to generate predictions or outputs from new inputs. Inference cost per token is the key economic metric for AI deployment and is falling rapidly.
Context window
The maximum number of tokens a model can process in a single request, including both the prompt and the response. Larger context windows (100K-2M tokens) let models ingest entire codebases or documents at once.
Flash attention
An optimized implementation of the attention mechanism that reduces memory usage and increases speed by restructuring how attention computations access GPU memory, avoiding the need to materialize the full attention matrix.