Definition
Transformer attention is the core operation of the architecture: every token computes a weighted average over the others, where the weights come from learned similarity between queries and keys. Everything else in a transformer block is wrapped around this one mechanism.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.