Mixture of Experts (MoE)
- Definition
- A model architecture that routes each input to a subset of specialized sub-networks (experts) rather than through the full model. MoE dramatically reduces inference cost while maintaining quality; it is used in Mixtral and widely reported (though unconfirmed) in GPT-4.
- Why it matters
- MoE is the architectural trick that lets models be large without being expensive. A 200B-parameter MoE model might activate only 20B parameters per token, achieving large-model quality at small-model inference cost. This matters because it decouples total parameter count (a rough proxy for capability) from per-token compute (the main driver of cost). For buyers, MoE models offer better performance-per-dollar than dense models of the same total parameter count. For labs, MoE enables training larger models without proportionally increasing inference costs. The trade-off: MoE models require more total memory (all parameters must be loaded, even the inactive ones) and can be less consistent across domains if experts specialize too narrowly.
- In practice
- GPT-4 is widely reported to use a mixture-of-experts architecture, though OpenAI has not confirmed details. Mistral's Mixtral 8x7B, launched in December 2023, was the first major open-weight MoE model, achieving near-GPT-3.5 quality at much lower inference cost. Mixtral 8x22B followed with near-GPT-4 quality. Google's Switch Transformer and GLaM pioneered the approach in research. DeepSeek's models use MoE architectures extensively. The trend is clear: MoE is becoming the default architecture for production models where inference cost matters more than memory footprint, which is to say, virtually all production models.
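The routing idea above can be sketched in a few lines. This is a minimal, hypothetical illustration of top-k gating, not any lab's actual implementation: a router scores every expert for a token, only the top-k experts run, and their outputs are mixed by a softmax over the selected scores. All dimensions and names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" would be a full feed-forward block in a real model;
# here it is a single weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1  # gating weights

def moe_forward(x):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_w                       # one score per expert
    top = np.argsort(logits)[-top_k:]           # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax over selected experts only
    # Only top_k of n_experts are evaluated: this is the sparse-activation saving.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # -> (16,)
```

Note how compute scales with `top_k`, not `n_experts`, while memory scales with `n_experts`: exactly the trade-off described above.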
Related terms
Transformer
The neural network architecture behind virtually all modern language and multi-modal models. Introduced in Google's 2017 'Attention Is All You Need' paper, transformers use self-attention to process sequences in parallel.
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
Sparse model
A neural network where only a fraction of parameters are activated for any given input, reducing compute requirements compared to dense models of the same total size. Mixture of Experts is the most common sparse architecture.
Parameter
A learnable value inside a neural network that gets adjusted during training. Model size is measured in parameters (e.g., 70B, 405B), which roughly correlates with capability and cost.