Prompt Caching
- Definition
- A technique that stores and reuses the processed representation of frequently repeated prompt prefixes — system prompts, few-shot examples, document context — so the model does not recompute them on every request. Prompt caching can reduce latency by up to 85% and cost by up to 90% for repetitive workloads.
- Why it matters
- Every production AI app sends the same system prompt thousands of times a day. Without caching, you are paying to process identical tokens over and over, which is often the single biggest hidden cost in deployed AI. Prompt caching is the first optimization any team running AI at scale should implement: before model routing, before quantization, before anything else. The math is simple: if your system prompt is 4,000 tokens and you make 100,000 calls per day, you are processing 400 million redundant input tokens daily. At $3/M input tokens, that is $1,200/day in waste. Caching cuts that bill by up to 90%, since cached tokens are billed at a steep discount rather than eliminated outright. If your AI vendor does not offer prompt caching, their pricing is not competitive.
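The arithmetic above can be sketched directly. The prefix size, call volume, and $3/M price are the figures quoted here; the $0.30/M cached-read rate is an assumption for illustration, matching the roughly 90% discount discussed below.

```python
# Back-of-envelope cost of reprocessing an identical prompt prefix,
# using the figures from this entry (illustrative, not any vendor's exact bill).

def daily_prefix_cost(prefix_tokens: int, calls_per_day: int, price_per_m: float) -> float:
    """Dollars per day spent processing the same prefix on every call."""
    tokens_per_day = prefix_tokens * calls_per_day
    return tokens_per_day / 1_000_000 * price_per_m

uncached = daily_prefix_cost(4_000, 100_000, 3.00)  # $3/M uncached input tokens
cached = daily_prefix_cost(4_000, 100_000, 0.30)    # assumed cached-read rate

print(uncached)           # 1200.0 dollars/day without caching
print(uncached - cached)  # 1080.0 dollars/day saved if every call hits the cache
```

In practice the first call after a cache expires pays full (or slightly higher) price to write the cache, so real savings land just under this ceiling.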
- In practice
- Anthropic launched prompt caching for Claude in August 2024, delivering up to 90% cost reduction and 85% latency reduction on cached prefixes — the cached portion of a prompt costs $0.30/M tokens versus $3/M for uncached input on Claude 3.5 Sonnet. OpenAI followed with automatic prompt caching in October 2024, offering 50% discounts on cached input tokens with zero code changes required. Google's Gemini API provides context caching with a minimum 32K token threshold and per-hour storage fees. In production, engineering teams report that prompt caching reduces their monthly AI spend by 40-70% for applications with stable system prompts, making previously cost-prohibitive use cases viable.
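Anthropic's caching is opt-in: you mark the end of the stable prefix with a `cache_control` breakpoint. A minimal sketch of the Messages API request body, shown as a plain dict rather than a live call; the model id is an example and the prompt text is hypothetical:

```python
# Sketch of an Anthropic Messages API request body with prompt caching enabled.
# The cache_control marker tells the API to cache everything up to that block;
# later requests with an identical prefix are billed at the cached rate.
request_body = {
    "model": "claude-3-5-sonnet-20241022",  # example model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant for ExampleCo...",  # hypothetical system prompt
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    "messages": [
        {"role": "user", "content": "How do I reset my password?"}
    ],
}
```

OpenAI's approach needs no marker at all: prompts over 1,024 tokens are cached automatically on matching prefixes, which is why it ships with zero code changes.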
Related terms
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
KV cache
A memory structure that stores the key and value matrices from previous attention computations during autoregressive generation, avoiding redundant recalculation as each new token is produced. KV caching is essential for efficient inference.
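The mechanism can be illustrated with a toy single-head attention decode loop. This is a minimal NumPy sketch with random weights, not a real model: each step appends one key/value pair to the cache instead of recomputing all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # head dimension (toy size)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []               # the KV cache: one entry per past token

def decode_step(x):
    """Attend the new token's query over all cached keys and values."""
    q = x @ Wq
    k_cache.append(x @ Wk)              # computed once per token, then reused
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # attention output for the new token

for t in range(5):                      # autoregressive loop: cache grows by one per step
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))  # 5 cached entries, none recomputed
```

Prompt caching extends the same idea across requests: the KV entries for a shared prefix are stored server-side and reloaded instead of rebuilt.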
Token pricing
The cost model used by AI API providers, charging per million input and output tokens. Prices have fallen dramatically, from $60/M tokens (GPT-4, 2023) to under $1/M tokens for many models in 2026.
System prompt
A set of instructions prepended to every conversation that defines the AI model's persona, constraints, and behavior. System prompts are how companies customize foundation models for specific products and brands.