Question 1

What is KV cache?

Accepted Answer

A memory structure that stores the key and value matrices from previous attention computations during autoregressive generation, avoiding redundant recalculation as each new token is produced. KV caching is essential for efficient inference.

Question 2

Why does KV cache matter for business?

Accepted Answer

KV caching is one of those infrastructure details that determines whether your AI application is economically viable. Without KV caching, generating a 1,000-token response would require recomputing attention over the entire context for each new token, an O(n^2) cost explosion. With KV caching, each new token only computes attention against the cached keys and values, making generation practical. But KV caches consume significant GPU memory, especially for long-context models: a 70B model with a 128K context window needs tens of gigabytes of KV cache per active request. This memory constraint limits how many concurrent users a single GPU can serve, directly impacting inference cost and throughput.

KV cache

Related terms

Know the terms. Know the moves.