Concept · 5 episode(s)

Transformer Attention

← all concepts

Definition

Transformer attention is the core operation of the architecture: every token computes a weighted average over the others, where the weights come from learned similarity between queries and keys. Everything else in a transformer block is wrapped around this one mechanism.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.