attention · Glossary · AI Papers: A Deep Dive

Definition

Plain language

The mechanism a language model uses to decide which earlier words to pay close attention to when figuring out the next word.

As stated in the literature

A weighted-sum operation in a transformer that mixes information across tokens; weights come from learned dot products between query and key vectors. The defining building block of every modern LLM.

Also called: Attention

Why it matters: It's the core mechanism that made modern language models possible and is the reason transformers handle long-range dependencies so much better than older recurrent networks.

For example, when predicting the next word in 'The cat that the dog chased was ___', attention lets the model focus on 'cat' rather than the nearer 'dog'.

Heard on the show

“… four-player wiring is almost simple after that: the four views are tiled into one grid, so the model's attention spans all four perspectives at once — that's what keeps one event consistent across cameras — and …”

Episode 206 — How Four-Second Clips Become Hours of Playable AI Soccer

Mentioned in 54 episodes

Related concepts

Attention Analysis Attention Heads Hybrid SSM/Attention KV Cache Token-Level Analysis Transformer Attention Unicode Steganography

Related terms

dot product token transformer weights