Infrastructure & Compute Deep Dive

Speculative decoding

Definition
An inference optimization where a small, fast 'draft' model generates candidate tokens that a larger 'verifier' model checks in parallel, speeding up generation without changing output quality.
Why it matters
Speculative decoding is close to a free lunch for inference speed. A small model drafts tokens and the large model verifies them in a single parallel pass, yielding 2-3x speedups with mathematically identical output quality, at the modest cost of running the cheap draft model. This matters because autoregressive generation is inherently sequential: each token depends on the previous one, creating a latency bottleneck. Speculative decoding partially circumvents this limitation. For infrastructure teams running large models, speculative decoding can double throughput without additional hardware investment. The technique is becoming standard in production inference stacks and is one of several innovations making real-time applications (voice agents, coding assistants) practical with large models.
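To see where the 2-3x figure comes from: under a common simplifying assumption (each drafted token is accepted independently with probability α, and k tokens are drafted per verifier pass), the expected number of tokens produced per verifier forward pass is (1 − α^(k+1)) / (1 − α). A quick sketch of that arithmetic (the function name and the independence assumption are ours, for illustration):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per verifier forward pass, assuming each
    of the k drafted tokens is accepted independently with probability
    alpha (a simplifying assumption; real acceptance is correlated)."""
    if alpha == 1.0:
        return k + 1.0
    # Geometric-series sum: 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# At an 80% acceptance rate with 4 drafted tokens per pass, each
# verifier call yields about 3.4 tokens instead of 1. Since a
# verification pass costs somewhat more than a plain decoding step,
# the realized end-to-end speedup lands in the 2-3x range.
print(round(expected_tokens_per_pass(0.8, 4), 2))
```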
In practice
Google researchers introduced speculative decoding in 2023, and Google uses it in production for Gemini. The technique works best when the draft model (typically 1-7B parameters) has a high acceptance rate from the verifier (the full-size model). In practice, 70-90% of drafted tokens are accepted, meaning the large model rarely needs to regenerate. Medusa and EAGLE are open-source variants that use additional prediction heads on the target model instead of a separate draft model. vLLM and TGI both support speculative decoding. The speedup is most dramatic for long outputs and large models: generating 1,000 tokens from a 70B model can go from 30 seconds to 10 seconds without any quality degradation.
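The draft-then-verify loop described above can be sketched in a few lines. This is a toy simulation under greedy decoding, not a real implementation: the production algorithm verifies all drafted positions in one batched forward pass and uses rejection sampling over token distributions to preserve the sampling distribution exactly, while here the "models" are deterministic stand-in functions and all names are illustrative.

```python
def target_model(prefix):
    # Stand-in for the large verifier model: next token = sum(prefix) % 10.
    return sum(prefix) % 10

def draft_model(prefix):
    # Stand-in for the small draft model: agrees with the target most of
    # the time, but diverges whenever the prefix sum is divisible by 7.
    guess = sum(prefix) % 10
    return (guess + 1) % 10 if sum(prefix) % 7 == 0 else guess

def speculative_generate(prompt, n_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, then verify them."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft: the small model proposes k candidate tokens serially.
        drafted, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2. Verify: check each drafted position against the target model.
        #    (A real system scores all positions in one parallel pass.)
        accepted, ctx = [], list(out)
        for t in drafted:
            target_t = target_model(ctx)
            if t == target_t:
                accepted.append(t)          # draft matches: keep it
                ctx.append(t)
            else:
                accepted.append(target_t)   # mismatch: take the target's
                break                       # token, discard the rest
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens]

def greedy_generate(prompt, n_tokens):
    """Reference: generate with the target model alone, one token at a time."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_model(out))
    return out[len(prompt):]
```

The key property to notice: because every mismatch is replaced by the target model's own token, the output is identical to running the large model alone; the speedup comes entirely from verifying several drafted tokens per large-model pass instead of generating one.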

We cover infrastructure & compute every week.

Get the 5 AI stories that matter — free, every Friday.
