Definition
The standard way modern language models pay attention to earlier text, splitting the work across many parallel attention units.
The canonical transformer attention block in which multiple attention heads operate in parallel over the same residual stream, each with its own learned projections.