transformer · Glossary · AI Papers: A Deep Dive

Definition

Plain language

The dominant neural network design behind modern language models, built around attention.

As stated in the literature

The architecture introduced in 2017 (Vaswani et al., "Attention Is All You Need") that combines self-attention with feedforward blocks and residual connections; it underlies nearly all current frontier LLMs.

Also called: transformers, Transformer

Why it matters: Nearly every advance in modern language and multimodal AI has been built on this one architecture, so understanding it is a prerequisite for understanding the field.

For example, GPT, Claude, and Gemini are all transformer-based: each token is processed by attending to the other tokens in the context (in autoregressive models like these, the tokens that precede it).
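
To make the definition concrete, here is a minimal sketch of one transformer block in plain Python: single-head self-attention, then a feedforward block, each wrapped in a residual connection. It is a toy under stated assumptions, not how any production model is implemented. Layer normalization, multiple heads, and positional encodings are all left out, and the function names, weight matrices, and the three example tokens are hypothetical, chosen only for illustration.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1 (shifted by max for stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(X, Wq, Wk, Wv, causal=True):
    """Single-head scaled dot-product self-attention over token vectors X."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # With causal masking, token i only sees tokens 0..i.
        visible = i + 1 if causal else len(X)
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
            for j in range(visible)
        ]
        w = softmax(scores)
        # Pad masked positions with zero weight so every row has len(X) entries.
        weights.append(w + [0.0] * (len(X) - visible))
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(visible)) for c in range(len(V[0]))]
        )
    return outputs, weights


def feedforward(x, W1, W2):
    """Position-wise feedforward block: linear, ReLU, linear."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def transformer_block(X, Wq, Wk, Wv, W1, W2, causal=True):
    """Attention then feedforward, each added back onto its input (residual)."""
    attended, weights = self_attention(X, Wq, Wk, Wv, causal)
    X = [[a + b for a, b in zip(x, h)] for x, h in zip(X, attended)]
    X = [[a + b for a, b in zip(x, feedforward(x, W1, W2))] for x in X]
    return X, weights


# Hypothetical example: three 2-dimensional tokens, identity weights throughout.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = transformer_block(
    tokens, IDENTITY, IDENTITY, IDENTITY, IDENTITY, IDENTITY
)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how one token distributes its attention across the sequence. With causal=True the first token can only attend to itself, and later tokens spread their weight over everything up to and including their own position. Passing causal=False gives the unmasked case in which every token attends to every other token.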

Heard on the show

“And this doesn't contradict the theorems saying transformers can't exactly count: a coarse, revisable estimate of remaining length is a much more forgiving object than exact counting.”

Episode 204 — The Length Estimate Hiding Inside a Word-by-Word Model

Mentioned in 27 episodes

Related concepts

Attention Analysis, Attention Heads, Hybrid SSM/Attention, In-Context Learning, KV Cache, Residual Stream, Transformer Attention

Related terms

attention MLP layer residual