Distillation
- Definition
- The process of training a smaller, cheaper "student" model to mimic the behavior of a larger, more capable "teacher" model. Distillation is one of the main ways companies ship AI to edge devices and reduce inference costs without sacrificing too much quality.
- Why it matters
- Distillation is the bridge between frontier capability and practical deployment economics. A 405B-parameter model might be state-of-the-art, but running it can cost on the order of 50x more than a distilled 7B model that retains, say, 90% of its performance on your specific use case. This is why distillation has become a core part of many AI deployment strategies. The economics are stark: for high-volume production workloads, inference cost tends to dominate total cost of ownership, and distillation is one of the most effective ways to reduce it. Companies that master distillation can deliver AI products at margins that competitors running full-size models struggle to match.
- In practice
- After Meta released Llama 3 70B, startups reportedly distilled it into roughly 7B-parameter variants within weeks, cutting inference costs by around 10x. OpenAI's GPT-4o Mini is widely understood to be a distillation of GPT-4o, and is often described as offering 80-90% of the quality at roughly 1/30th the price. DeepSeek's R1 distilled models achieved strong reasoning performance at small sizes by fine-tuning smaller Qwen and Llama models on outputs generated by the full R1 model. Google's Gemini Nano, designed for on-device inference, is described by Google as distilled from larger Gemini models. The pattern is consistent: frontier models set the capability ceiling, and distillation makes that capability economically deployable at scale.
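To make "training a smaller model to mimic a larger one" concrete, here is a minimal sketch of one common formulation: a soft-target loss that compares the teacher's and student's output distributions after softening both with a temperature. The function names, the temperature value, and the example logits are illustrative assumptions, not the recipe used by any of the products mentioned above, some of which train on teacher-generated text instead.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far distribution q is from the reference distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss: the student is penalized for diverging from the teacher.

    The temperature**2 factor is a common convention that keeps the loss scale
    comparable across temperatures. The default of 2.0 is an assumption.
    """
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, q_student)

if __name__ == "__main__":
    # Hypothetical logits over three candidate next tokens.
    teacher = [4.0, 1.0, 0.5]
    untrained_student = [0.5, 1.0, 4.0]
    trained_student = [3.5, 1.2, 0.6]
    print("untrained:", distillation_loss(teacher, untrained_student))
    print("trained:  ", distillation_loss(teacher, trained_student))
```

In a real training loop this loss would be minimized by gradient descent over many examples, often mixed with an ordinary loss on ground-truth labels. The sketch only shows the quantity being minimized.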
Related terms
Fine-tuning
The process of continuing to train a pre-trained model on a smaller, task-specific dataset. Fine-tuning customizes model behavior for specific domains or formats and is a key part of most enterprise AI deployments.
Quantization
Reducing the numerical precision of a model's weights (e.g., from 32-bit to 4-bit) to shrink its memory footprint and speed up inference. Quantization makes it possible to run large models on consumer hardware.
Efficient model
A model designed to deliver strong performance at a fraction of the compute cost of frontier models, through architectural innovations, aggressive distillation, or better training data curation. Efficient models prioritize the performance-per-dollar ratio.
Inference cost
The expense of running an AI model in production, typically quoted as a price per million tokens. Inference costs have dropped by roughly 10-100x in the past two years, enabling new business models and use cases.
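Because inference cost is quoted per million tokens, comparing a full-size model with a distilled one comes down to simple arithmetic. The sketch below assumes a hypothetical monthly volume and hypothetical prices, chosen only so the ratio matches the 50x figure used as an illustration in this entry. None of the numbers are real vendor prices.

```python
def monthly_inference_cost(tokens_per_month, price_per_million_tokens):
    """Cost of serving a given token volume at a per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# All values below are hypothetical, for illustration only.
TOKENS_PER_MONTH = 2_000_000_000   # assumed production volume
FULL_MODEL_PRICE = 12.50           # assumed dollars per million tokens
DISTILLED_MODEL_PRICE = 0.25       # assumed dollars per million tokens

full_cost = monthly_inference_cost(TOKENS_PER_MONTH, FULL_MODEL_PRICE)
distilled_cost = monthly_inference_cost(TOKENS_PER_MONTH, DISTILLED_MODEL_PRICE)
cost_ratio = full_cost / distilled_cost

if __name__ == "__main__":
    print(f"full-size model: ${full_cost:,.2f} per month")
    print(f"distilled model: ${distilled_cost:,.2f} per month")
    print(f"cost ratio:      {cost_ratio:.0f}x")
```

The ratio is independent of volume, but the absolute savings scale with it, which is why the gap matters most for high-volume workloads.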