Speculative decoding
- Definition
- An inference optimization where a small, fast 'draft' model generates candidate tokens that a larger 'verifier' model checks in parallel, speeding up generation without changing output quality.
- Why it matters
- Speculative decoding is close to a free lunch for inference speed. By having a small model draft tokens and a large model verify them in a single parallel pass, you can achieve 2-3x speedups with provably identical output quality. This matters because autoregressive generation is inherently sequential: each token depends on all previous ones, creating a latency bottleneck that leaves accelerators underutilized. Speculative decoding partially circumvents this limitation. For infrastructure teams running large models, it can substantially cut per-request latency on existing hardware, though the draft model adds some compute overhead, so the gain is largest when serving is latency-bound rather than compute-bound. The technique is becoming standard in production inference stacks and is one of several innovations making real-time applications (voice agents, coding assistants) practical with large models.
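The 2-3x figures above follow from a simple piece of arithmetic. A sketch (the independence assumption and the function name are simplifications, standard in the speculative-decoding literature, not something from this article):

```python
# Expected tokens emitted per large-model forward pass, assuming each drafted
# token is accepted independently with probability alpha (a simplifying
# assumption). The count includes the token the verifier itself supplies on a
# rejection (resample) or after a full acceptance (bonus token).
def expected_tokens_per_pass(alpha, k):
    # Closed form of 1 + alpha + alpha^2 + ... + alpha^k (geometric series).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 drafted tokens per step:
expected_tokens_per_pass(0.8, 4)  # ≈ 3.36 tokens per verifier pass
```

At 70-90% acceptance this yields roughly 2.5-4 tokens per expensive forward pass, which, after subtracting the draft model's overhead, lands in the reported 2-3x range.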
- In practice
- Google researchers introduced speculative decoding in a paper published at ICML 2023, and Google uses it in production for Gemini. The technique works best when the draft model (typically 1-7B parameters) has a high acceptance rate with the verifier (the full-size model). In practice, 70-90% of drafted tokens are accepted, meaning the large model rarely needs to regenerate. Medusa and EAGLE are open-source variants that attach extra prediction heads to the large model instead of using a separate draft model. vLLM and TGI (Hugging Face's Text Generation Inference) both support speculative decoding. The speedup is most dramatic for long outputs and large models: generating 1,000 tokens from a 70B model can drop from roughly 30 seconds to 10 seconds with no quality degradation.
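The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration, not a real inference stack: the two "models" are hypothetical fixed distributions over a four-token vocabulary, standing in for actual forward passes.

```python
import random

VOCAB = [0, 1, 2, 3]

def draft_model(context):
    # Small, fast model (hypothetical fixed distribution for illustration).
    return [0.4, 0.3, 0.2, 0.1]

def verifier_model(context):
    # Large model: the distribution the output must exactly follow.
    return [0.5, 0.25, 0.15, 0.1]

def sample(dist, rng):
    return rng.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(context, k, rng):
    """Draft k tokens, then accept/reject so outputs match the verifier.

    Returns the tokens produced by this step (between 1 and k+1 of them).
    """
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_model(ctx)
        t = sample(q, rng)
        drafted.append(t)
        q_dists.append(q)
        ctx.append(t)

    # In a real system the verifier scores all k positions in ONE parallel
    # forward pass; here we simply call the toy function per position.
    accepted = []
    for t, q in zip(drafted, q_dists):
        p = verifier_model(context + accepted)
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)  # draft agreed closely enough with the verifier
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized.
            # This correction is what makes the output distribution identical
            # to sampling from the verifier alone.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(residual)
            accepted.append(sample([r / z for r in residual], rng))
            return accepted  # stop at the first rejection
    # All k accepted: the same verifier pass yields one bonus token for free.
    accepted.append(sample(verifier_model(context + accepted), rng))
    return accepted
```

Each call emits between 1 and k+1 tokens per expensive verifier pass, which is where the speedup comes from; the accept/resample rule is why quality is provably unchanged.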
Related terms
Autoregressive model
A model that generates output one token at a time, with each new token conditioned on all previous tokens. GPT, Claude, and Gemini are all autoregressive, which is why they stream responses token by token.
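The sequential dependence is easiest to see as a loop. A minimal sketch, with a hypothetical `next_token` function standing in for a real model's forward pass:

```python
# Toy stand-in: a real model would run a full forward pass over the prefix.
def next_token(tokens):
    return len(tokens) % 5

def generate(prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        # Strictly sequential: step i cannot begin until step i-1 finishes,
        # because the new token is conditioned on every previous token.
        tokens.append(next_token(tokens))
    return tokens
```

This per-step dependency is the latency bottleneck that speculative decoding works around.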
Inference
The process of running a trained model to generate predictions or outputs from new inputs. Inference cost per token is the key economic metric for AI deployment and is falling rapidly.
Latency
The time between sending a request to an AI model and receiving the first token of the response. Low latency is critical for real-time applications like coding assistants, voice agents, and live customer support.
Throughput
The number of tokens or requests an AI system can process per second. High throughput is essential for batch processing, high-traffic applications, and cost-efficient inference at scale.
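A back-of-the-envelope sketch of how throughput differs from latency (illustrative numbers, not benchmarks):

```python
# Throughput counts total tokens/second across all concurrent requests,
# while latency is about a single request's wait time.
def throughput_tokens_per_sec(batch_size, tokens_per_request, seconds):
    return batch_size * tokens_per_request / seconds

# A batch of 8 requests, 1,000 tokens each, finishing in 20 seconds:
throughput_tokens_per_sec(8, 1000, 20)  # 400.0 tokens/s
```

Batching raises throughput even though each individual request may wait longer, which is why the two metrics are tuned separately.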