Sparse model
- Definition
- A neural network where only a fraction of parameters are activated for any given input, reducing compute requirements compared to dense models of the same total size. Mixture of Experts is the most common sparse architecture.
- Why it matters
- Sparsity is one of the most promising paths to efficient AI. Dense models waste compute by running every parameter on every input, even when most parameters are irrelevant to the current task. Sparse models route each input to the most relevant subset of parameters, achieving large-model quality at small-model compute cost. This matters because it partially breaks the trade-off between capability and cost. For the AI industry, sparsity suggests that future scaling may hinge less on activating ever more parameters per input and more on selecting the right parameters to use. The architectural shift from dense to sparse could reduce inference costs by an order of magnitude.
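As a rough illustration of the dense-vs-sparse cost gap, the sketch below compares per-token compute for a dense model against an MoE of the same total size. The parameter counts and expert ratio are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical numbers, for illustration only.
total_params = 140e9        # total parameters in both models
active_fraction = 2 / 8     # MoE activates 2 of 8 experts per token

# A forward pass costs roughly 2 FLOPs per active parameter per token.
dense_flops_per_token = 2 * total_params
sparse_flops_per_token = 2 * total_params * active_fraction

ratio = dense_flops_per_token / sparse_flops_per_token
print(f"dense does {ratio:.0f}x the per-token compute of the sparse model")
# Real MoE models also contain dense layers (e.g. attention) that always
# run, so the actual end-to-end savings are smaller than this ideal ratio.
```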
- In practice
- Mixtral is the most prominent openly documented sparse (MoE) model, and GPT-4 is widely reported to use an MoE architecture as well. Google's Switch Transformer demonstrated that sparse models could scale past a trillion parameters while activating only billions per token. DeepSeek's V2 and V3 models use a fine-grained MoE design that achieves frontier performance at a fraction of typical training cost. In benchmarks, sparse models consistently match dense models with the same number of active parameters while requiring less compute per token. The challenge is memory: even though only a subset of parameters activates per token, all parameters must be loaded into memory. This makes sparse models memory-bound rather than compute-bound, which favors hardware with high memory bandwidth.
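The routing described above can be sketched in a few lines. This is a minimal top-k MoE forward pass for a single token, not any production implementation; the sizes, the random weights, and names like `n_experts` and `top_k` are assumptions for illustration (real routers are learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2  # activate 2 of 8 experts per token

# Each "expert" is a small feed-forward weight matrix; the router scores
# experts per token. Both would be learned in a real model.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1


def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w                  # one router score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only top_k expert matrices are multiplied: compute scales with k,
    # not n_experts — but note all n_experts matrices still sit in memory,
    # which is why MoE inference is memory-bound.
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out, top


token = rng.standard_normal(d_model)
out, used = moe_forward(token)
print(f"activated experts {sorted(used.tolist())} of {n_experts}")
```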
Related terms
Mixture of Experts (MoE)
A model architecture that routes each input to a subset of specialized sub-networks (experts) rather than using the full model. MoE dramatically reduces inference cost while maintaining quality; it is used in Mixtral and reportedly in GPT-4.
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
Parameter
A learnable value inside a neural network that gets adjusted during training. Model size is measured in parameters (e.g., 70B, 405B), which roughly correlates with capability and cost.
Efficient model
A model designed to deliver strong performance at a fraction of the compute cost of frontier models, through architectural innovations, aggressive distillation, or better training data curation. Efficient models prioritize the performance-per-dollar ratio.