mixture-of-experts · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A model design in which only a fraction of the parameters fire on any one input, so a model can be very large overall yet relatively cheap to run.

As stated in the literature

A neural network architecture in which only a sparse subset of expert sub-networks is activated per token, enabling large total parameter counts at lower per-token compute.
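
To make the definition concrete, here is a minimal sketch of sparse routing for a single token, written in plain Python. Everything in it is an illustrative assumption rather than any particular model's implementation: the class name ToyMoELayer, the random weights, the use of a single linear map as each "expert" (real experts are usually small feed-forward networks), and the choice to apply softmax over only the selected experts' scores, which is one common option among several.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyMoELayer:
    """Hypothetical toy layer: a linear router plus num_experts linear experts."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one scoring vector per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is a dim x dim matrix (a stand-in for a small MLP).
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, token):
        # 1. Score every expert for this token (cheap: one dot product each).
        scores = [dot(w, token) for w in self.router]
        # 2. Keep only the top_k highest-scoring experts.
        chosen = sorted(range(len(scores)), key=lambda i: scores[i],
                        reverse=True)[:self.top_k]
        # 3. Turn the chosen experts' scores into mixing weights.
        gates = softmax([scores[i] for i in chosen])
        # 4. Run only the chosen experts and blend their outputs.
        out = [0.0] * len(token)
        for gate, i in zip(gates, chosen):
            expert_out = [dot(row, token) for row in self.experts[i]]
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return out, chosen


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, used = layer.forward([0.5, -1.0, 0.25, 2.0])
print(f"experts used: {sorted(used)} of {len(layer.experts)}")
```

Only the experts in `used` do any work for this token; the other six hold parameters that count toward the model's total size but cost nothing here.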

Also called: MoE, mixture of experts, sparse mixture-of-experts

Why it matters: It lets models grow in total capacity without proportionally growing the compute needed per token at inference.

For example, a 200-billion-parameter MoE model might only activate ~10 billion parameters when answering a single question.
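
The arithmetic behind an example like this can be sketched as follows. The specific split is a hypothetical one chosen so the totals land on 200 billion and roughly 10 billion: 4 billion shared parameters (attention, embeddings), 64 experts, and 2 experts active per token. It also simplifies by treating each expert as one block of parameters summed across all layers, and it does not describe any real model.

```python
def total_parameters(shared, per_expert, num_experts):
    """All parameters stored in the model."""
    return shared + num_experts * per_expert


def active_parameters(shared, per_expert, top_k):
    """Parameters actually used for one token: shared layers plus top_k experts."""
    return shared + top_k * per_expert


# Hypothetical configuration (assumed for illustration only).
SHARED = 4_000_000_000        # used by every token
PER_EXPERT = 3_062_500_000    # parameters in one expert, summed over layers
NUM_EXPERTS = 64
TOP_K = 2

total = total_parameters(SHARED, PER_EXPERT, NUM_EXPERTS)
active = active_parameters(SHARED, PER_EXPERT, TOP_K)

print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.3f}B ({active / total:.1%} of total)")
```

Under these assumptions, adding more experts raises the total while leaving the active count unchanged, which is the property the definition describes.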

Heard on the show

“There's a kernel inside a real training system called VeOmni — a weight-gradient kernel for a mixture-of-experts model — that had been hand-tuned in Triton by expert engineers.”

Episode 177 — Why Raw Profiler Data Made an AI Worse at Writing GPU Code

Mentioned in 7 episodes

Related terms

neural network, parameter, token