Mixture of Experts (MoE)
- Definition
- A model architecture that routes each input to a subset of specialized sub-networks (experts) rather than through the full model. MoE dramatically reduces inference cost while maintaining quality; it is used in Mixtral and widely reported (though unconfirmed) in GPT-4.
- Why it matters
- MoE is the architectural trick that lets models be large without being expensive. A 200B-parameter MoE model might activate only 20B parameters per token, achieving large-model quality at small-model inference cost. This matters because it decouples total parameter count (a rough proxy for capability) from per-token compute (the main driver of cost). For buyers, MoE models offer better performance-per-dollar than dense models of the same total parameter count. For labs, MoE enables training larger models without proportionally increasing inference costs. The trade-off: MoE models require more total memory (all parameters must be loaded, even the inactive ones) and can be less consistent across domains if experts specialize too narrowly.
- In practice
- GPT-4 is widely reported to use a mixture-of-experts architecture, though OpenAI has not confirmed details. Mistral's Mixtral 8x7B, launched in December 2023, was the first major open-weight MoE model, achieving near-GPT-3.5 quality at much lower inference cost. Mixtral 8x22B followed with near-GPT-4 quality. Google's Switch Transformer and GLaM pioneered the approach in research. DeepSeek's models use MoE architectures extensively. The trend is clear: MoE is becoming the default architecture for production models where inference cost matters more than memory footprint, which is to say, virtually all production models.
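The routing idea above can be sketched in a few lines. This is a minimal, hypothetical illustration of top-k gating, not any lab's actual implementation: a router scores every expert for a token, only the top-k experts run, and their outputs are mixed by a softmax over the selected scores. All dimensions and names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" would be a full feed-forward block in a real model;
# here it is a single weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1  # gating weights

def moe_forward(x):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_w                       # one score per expert
    top = np.argsort(logits)[-top_k:]           # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax over selected experts only
    # Only top_k of n_experts are evaluated: this is the sparse-activation saving.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # -> (16,)
```

Note how compute scales with `top_k`, not `n_experts`, while memory scales with `n_experts`: exactly the trade-off described above.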
Related terms
Transformer
The neural network architecture behind virtually all modern language and multi-modal models. Introduced in Google's 2017 'Attention Is All You Need' paper, transformers use self-attention to process sequences in parallel.
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
Sparse model
A neural network where only a fraction of parameters are activated for any given input, reducing compute requirements compared to dense models of the same total size. Mixture of Experts is the most common sparse architecture.
Parameter
A learnable value inside a neural network that gets adjusted during training. Model size is measured in parameters (e.g., 70B, 405B), which roughly correlates with capability and cost.