Glossary · Term

logit lens

Definition

A way to peek at what a model would say if it had to commit to an answer at each layer, instead of just at the end.

An interpretability technique that applies the model's final layer norm and unembedding to intermediate residual stream states, producing a next-token prediction at every layer and revealing how predictions evolve through depth.
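The idea can be sketched on toy data. The snippet below is a minimal illustration, not any particular library's API: the vocabulary, the identity unembedding matrix, and the three residual states are all made up for the example, and the layer norm omits the learned scale and bias a real model would have.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(h, eps=1e-5):
    # Simplified final layer norm (assumption: no learned scale or bias).
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def logit_lens(residual_states, unembed, vocab):
    """Decode each layer's residual state as if it were the final one.

    residual_states: list of d_model-sized vectors, one per layer
    unembed: d_model x vocab_size matrix (the final unembedding)
    Returns one (top_token, probability) pair per layer.
    """
    results = []
    for h in residual_states:
        normed = layer_norm(h)
        logits = [
            sum(normed[i] * unembed[i][j] for i in range(len(normed)))
            for j in range(len(vocab))
        ]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=probs.__getitem__)
        results.append((vocab[best], probs[best]))
    return results

# Hypothetical toy model: d_model = 3, vocabulary of 3 tokens.
vocab = ["cat", "dog", "fish"]
unembed = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
residual_states = [
    [0.5, 0.1, 0.0],  # after an early layer
    [0.1, 0.9, 0.2],  # after a middle layer
    [0.0, 2.0, 0.1],  # after the final layer
]

results = logit_lens(residual_states, unembed, vocab)
for layer, (token, prob) in enumerate(results):
    print(f"layer {layer}: top token = {token!r}, p = {prob:.2f}")
```

In this toy case the early layer favors a different token than the later layers, which is the kind of shift across depth that the logit lens is meant to expose. With a real model, the residual states would come from the model's cached hidden states and the unembedding from its output head.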

Mentioned in 2 episodes

  1. 074
    How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning
  2. 018
    Language Models Compute the Rational Move, Then Override It

Related concepts