Definition
A model design in which only a fraction of the parameters is used for any one input, so the model can be very large overall while each input costs far less compute than it would if every parameter were used.
A neural network architecture in which a router activates only a sparse subset of expert sub-networks for each token, enabling large total parameter counts at lower per-token compute than a dense model of the same size.
Also called: MoE, mixture of experts, sparse mixture-of-experts
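
Illustration
The definition can be made concrete with a minimal sketch of top-k routing. Everything named here is hypothetical and chosen for illustration: the class ToyMoELayer, the helpers softmax and matvec, and the sizes. The sketch assumes each expert is a single linear map and that the gate weights are a softmax over the selected experts' scores. That is one common choice, not the only one, and real systems typically use MLP experts and batched tensor code.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyMoELayer:
    """Hypothetical toy layer: a router picks top_k experts per token."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one row of weights per expert, producing one score each.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Assumption: each expert is a single dim x dim linear map.
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, token):
        # Score every expert; this part is cheap relative to the experts.
        scores = matvec(self.router, token)
        # Keep only the indices of the top_k highest-scoring experts.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i],
                        reverse=True)
        chosen = ranked[:self.top_k]
        # Normalize the selected scores into mixing weights.
        weights = softmax([scores[i] for i in chosen])
        # Run only the chosen experts and blend their outputs.
        out = [0.0] * len(token)
        for weight, idx in zip(weights, chosen):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * y for o, y in zip(out, expert_out)]
        return out, chosen


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, chosen = layer.forward([0.5, -1.0, 0.25, 2.0])
print(len(chosen), "of", len(layer.experts), "experts ran for this token")
```

In this sketch all eight experts count toward the total parameter count, but only two are evaluated for the token, which is the sense in which per-token compute stays low.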