Infrastructure & ComputeDeep Dive

KV cache

Definition
A memory structure that stores the key and value matrices from previous attention computations during autoregressive generation, avoiding redundant recalculation as each new token is produced. KV caching is essential for efficient inference.
Why it matters
KV caching is one of those infrastructure details that determines whether your AI application is economically viable. Without KV caching, generating a 1,000-token response would require recomputing attention over the entire context for each new token, an O(n^2) cost explosion. With KV caching, each new token only computes attention against the cached keys and values, making generation practical. But KV caches consume significant GPU memory, especially for long-context models: a 70B model with a 128K context window needs tens of gigabytes of KV cache per active request. This memory constraint limits how many concurrent users a single GPU can serve, directly impacting inference cost and throughput.
In practice
vLLM introduced PagedAttention, which manages KV cache memory like an operating system manages virtual memory, achieving near-zero memory waste and enabling 2-4x more concurrent requests per GPU. Groq's LPU architecture includes dedicated SRAM for KV caching, contributing to its extreme inference speed. Techniques like GQA (grouped-query attention) and MQA (multi-query attention) reduce KV cache size by sharing key-value heads across attention groups, as used in Llama 2+, Mistral, and Gemini. For long-context models, KV cache compression and quantization are active research areas, since a 1M-token KV cache would otherwise require hundreds of gigabytes of memory.

We cover infrastructure & compute every week.

Get the 5 AI stories that matter — free, every Friday.

Know the terms. Know the moves.

Get the 5 AI stories that matter every Friday — free.

Free forever. No spam.