logit lens

Definition

Plain language

A way to peek at what a model would say if it had to commit to an answer at each layer, instead of just at the end.

As stated in the literature

An interpretability technique that applies the model's final unembedding to intermediate residual stream states, decoding each layer's state into next-token logits to reveal how predictions evolve through depth.

Why it matters: It gives a cheap, intuitive picture of how a prediction forms across depth, which is useful for both research and debugging.

For example, applying the logit lens to a model's tenth layer might show that it's already nearly committed to 'Paris' as the next word, with later layers just sharpening that prediction.
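The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the three-word vocabulary, the identity-like unembedding matrix `W_U`, and the per-layer residual vectors in `RESIDUALS` are all made-up toy values, and the function names are hypothetical. Real implementations usually also apply the model's final layer norm before unembedding, which is omitted here for brevity.

```python
import math

def unembed(hidden, W_U):
    """Project one residual-stream vector (length d_model) onto vocabulary logits."""
    vocab_size = len(W_U[0])
    return [sum(h * row[j] for h, row in zip(hidden, W_U)) for j in range(vocab_size)]

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(residuals, W_U, vocab):
    """For each layer's residual state, return (top token, its probability)."""
    readout = []
    for hidden in residuals:
        probs = softmax(unembed(hidden, W_U))
        best = max(range(len(vocab)), key=probs.__getitem__)
        readout.append((vocab[best], probs[best]))
    return readout

# Toy data, invented for illustration only (not taken from any real model).
VOCAB = ["Paris", "London", "Rome"]
W_U = [  # shape: d_model x vocab_size
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
RESIDUALS = [  # one residual-stream vector per layer, at the final token position
    [0.1, 0.3, 0.2],  # early layer: weakly favours "London"
    [1.0, 0.5, 0.2],  # middle layer: leaning towards "Paris"
    [4.0, 0.5, 0.2],  # late layer: strongly favours "Paris"
]

for layer, (token, prob) in enumerate(logit_lens(RESIDUALS, W_U, VOCAB)):
    print(f"layer {layer}: {token} (p={prob:.2f})")
```

In this toy run the top token switches to 'Paris' by the middle layer and its probability rises in the last one, which is the "later layers just sharpening" pattern described above.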

Heard on the show

“The logit lens has been around for years.”

Episode 203 — The Thought a Model Doesn't Say — and the Lens That Reads It

Mentioned in 5 episodes

Episode 203: The Thought a Model Doesn't Say — and the Lens That Reads It
Episode 141: How Two Tokens Reopened a Reasoning Method the Field Had Given Up On
Episode 094: Chain-of-Thought Monitoring Fails Across Languages, and Worst Where It's Needed Most
Episode 074: How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning
Episode 018: Language Models Compute the Rational Move, Then Override It

Related terms

residual stream
unembedding