multi-head attention · Glossary · AI Papers: A Deep Dive

Definition

Plain language

The standard way modern language models pay attention to earlier text, splitting the work across many parallel attention units.

As stated in the literature

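The canonical transformer attention block in which multiple attention heads operate in parallel over the same residual stream, each with its own learned query, key, and value projections. The head outputs are concatenated and passed through a shared output projection before being added back to the residual stream.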

Why it matters: Splitting attention across heads lets one layer model several relationships at once, which is core to how modern transformers work.

For example, in one layer different heads might simultaneously track grammatical agreement, recent named entities, and quoted speakers.
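In code

The definition above can be made concrete with a small, dependency-free sketch. It is illustrative only, not the implementation from any particular paper or from the training script mentioned on the show. The function and variable names (`multi_head_attention`, `heads`, `w_out`), the random weights, and the toy sizes are all assumptions chosen for readability. The sketch also assumes a causal mask, so each position attends only to itself and earlier positions, as in GPT-style language models.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, heads, w_out):
    """x: seq_len x d_model. heads: list of (w_q, w_k, w_v), each d_model x d_head.
    w_out: (n_heads * d_head) x d_model."""
    seq_len = len(x)
    head_outputs = []
    for w_q, w_k, w_v in heads:
        # Each head has its own projections of the same input.
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        d_head = len(q[0])
        weights = []
        for i in range(seq_len):
            # Causal mask (assumption): position i scores only positions 0..i.
            scores = [
                sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_head)
                for j in range(i + 1)
            ]
            # Later positions get exactly zero attention weight.
            weights.append(softmax(scores) + [0.0] * (seq_len - i - 1))
        head_outputs.append(matmul(weights, v))
    # Concatenate the per-head outputs along the feature axis, then mix them.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(seq_len)]
    return matmul(concat, w_out)


rng = random.Random(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
w_out = random_matrix(n_heads * d_head, d_model, rng)
out = multi_head_attention(x, heads, w_out)
print(len(out), len(out[0]))  # 4 8
```

Because every head has its own projection matrices, the heads can learn different attention patterns over the same input, which is what lets one layer track several relationships at once. Production implementations batch all heads into a few large tensor operations rather than looping over them.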

Heard on the show

“… You take a small GPT training script — multi-head attention, rotary embeddings, flash attention, the Muon-AdamW optimizer combo — and you have five minutes …”

Episode 053 — An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script

Mentioned in 1 episode

053: An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script

Related terms

attention, attention head, residual stream, transformer