Glossary · Term

multi-head attention


Definition

The standard way modern language models pay attention to earlier text, splitting the work across many parallel attention units.

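The canonical transformer attention block, in which multiple attention heads operate in parallel over the same residual stream. Each head has its own learned query, key, and value projections; the head outputs are concatenated and passed through a learned output projection back into the residual stream.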
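
To make the definition concrete, here is a minimal NumPy sketch of one multi-head self-attention pass. It is an illustration under stated assumptions, not any particular model's implementation: the function and parameter names (multi_head_attention, w_q, w_k, w_v, w_o, n_heads) are hypothetical, biases and dropout are omitted, the model width is assumed to divide evenly by the number of heads, and a causal mask is assumed so each position only sees itself and earlier positions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, causal=True):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Every head reads the same input x but gets its own slice of the projections.
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product scores per head: (n_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    if causal:
        # Mask out positions that come after the query position.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (n_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection.
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights
```

The returned weights array holds one attention pattern per head, which is what "operate in parallel" amounts to here: the heads share the input but produce separate patterns that are only combined by the final projection.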

Mentioned in 1 episode

  1. 053
    An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script