Models & Architecture Deep Dive

Mixture of Experts (MoE)

Definition
A model architecture that routes each token to a small subset of specialized sub-networks (experts) rather than through the full network. MoE dramatically reduces inference cost while maintaining quality; it is used in Mixtral and widely reported to be used in GPT-4.
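The routing idea can be sketched in a few lines of NumPy: a small gating network scores the experts for each token, only the top-k experts run, and their outputs are combined with softmax weights. This is a toy illustration with made-up random weights and a single feed-forward matrix per expert, not any production model's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2

# Hypothetical toy weights, purely illustrative.
router_w = rng.normal(size=(d_model, n_experts))           # gating network
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one FFN matrix per expert

def moe_layer(x):
    """Route one token vector x to its top-k experts."""
    logits = x @ router_w                 # (n_experts,) gating scores
    top = np.argsort(logits)[-top_k:]     # indices of the k highest-scoring experts
    # Softmax over only the selected experts' scores.
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    # Weighted sum of the chosen experts' outputs; the other experts never run.
    return sum(wi * (x @ expert_w[i]) for wi, i in zip(w, top))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape)  # (8,)
```

With top_k = 2 of 4 experts, each token touches only half of the expert parameters, which is the source of the inference savings described above.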
Why it matters
MoE is the architectural trick that lets models be large without being expensive. A 200B-parameter MoE model might only activate 20B parameters per token, achieving large-model quality at small-model inference cost. This is significant because it partially breaks the scaling laws' trade-off between capability and cost. For buyers, MoE models offer better performance-per-dollar than dense models of the same total parameter count. For labs, MoE enables training larger models without proportionally increasing inference costs. The trade-off: MoE models require more total memory (all parameters must be loaded) and can be less consistent across domains if experts specialize too narrowly.
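The compute/memory trade-off in the 200B/20B example above can be made concrete with back-of-the-envelope arithmetic, using the common rule of thumb of roughly 2 FLOPs per active parameter per token and fp16 (2-byte) weights; the exact constants are assumptions for illustration.

```python
TOTAL = 200e9   # total parameters (from the example above)
ACTIVE = 20e9   # parameters activated per token

# Compute scales with ACTIVE: ~2 FLOPs per active parameter per token.
flops_per_token = 2 * ACTIVE

# Memory scales with TOTAL: every expert must be resident, fp16 = 2 bytes/param.
bytes_loaded = TOTAL * 2

print(f"Compute per token: {flops_per_token / 1e9:.0f} GFLOPs")  # 40 GFLOPs
print(f"Weight memory:     {bytes_loaded / 1e9:.0f} GB")         # 400 GB
```

The asymmetry is the whole point: per-token compute looks like a 20B model, while memory footprint looks like a 200B model.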
In practice
GPT-4 is widely reported to use a mixture-of-experts architecture, though OpenAI has not confirmed details. Mistral's Mixtral 8x7B, launched in December 2023, was the first major open-weight MoE model, achieving near-GPT-3.5 quality at much lower inference cost. Mixtral 8x22B followed with near-GPT-4 quality. Google's Switch Transformer and GLaM pioneered the approach in research. DeepSeek's models use MoE architectures extensively. The trend is clear: MoE is becoming the default architecture for production models where inference cost matters more than parameter efficiency, which is to say, virtually all production models.
