Concept · 2 episodes

Linear Probing

Definition

Linear probing trains a linear classifier on a model’s frozen internal activations to test whether a target concept is linearly readable from them. It’s one of the cheapest interpretability tools that still tells you something, and a useful sanity check before stronger claims: high probe accuracy shows the information is present in the activations, not that the model actually uses it.
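
As a rough illustration of the definition above, here is a minimal sketch of a linear probe using only the Python standard library. Everything in it is an assumption made for the example: the "activations" are synthetic Gaussian vectors with a concept direction added when the label is 1, and the names make_activations, train_probe, and accuracy are hypothetical helpers, not anything from the episodes listed here. With a real model you would swap in activations cached from a frozen layer, and most likely use a library logistic regression rather than this hand-rolled one. The shuffled-label control at the end is one common way to check that the probe reads the concept instead of memorizing noise.

```python
import math
import random


def make_activations(n, dim, signal, rng):
    """Synthetic stand-in for frozen activations (an assumption of this sketch):
    unit Gaussian noise, plus `signal` times a fixed direction when the label is 1."""
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in direction))
    direction = [v / norm for v in direction]
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        xs.append([rng.gauss(0, 1) + signal * y * d for d in direction])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs, ys, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain SGD. Only the probe's
    weights change; the activations themselves are never modified."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(w, b, xs, ys):
    hits = 0
    for x, y in zip(xs, ys):
        predicted = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        hits += int(predicted == (y == 1))
    return hits / len(ys)


rng = random.Random(0)
xs, ys = make_activations(n=600, dim=8, signal=6.0, rng=rng)
train_x, test_x = xs[:450], xs[450:]
train_y, test_y = ys[:450], ys[450:]

# Probe for the real concept labels, scored on held-out examples.
w, b = train_probe(train_x, train_y)
probe_acc = accuracy(w, b, test_x, test_y)

# Control: labels drawn independently of the activations.
# A probe should do no better than chance here.
ctrl_train_y = [rng.randint(0, 1) for _ in train_y]
ctrl_test_y = [rng.randint(0, 1) for _ in test_y]
cw, cb = train_probe(train_x, ctrl_train_y)
control_acc = accuracy(cw, cb, test_x, ctrl_test_y)

print(f"probe accuracy:   {probe_acc:.2f}")
print(f"control accuracy: {control_acc:.2f}")
```

The number to look at is the gap between the two accuracies. A probe that scores well on the real labels and near chance on the control suggests the concept is linearly readable; a probe that also scores well on the control is probably just overfitting.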

Episodes covering this

204
The Length Estimate Hiding Inside a Word-by-Word Model
How Much is Left? LLMs Linearly Encode Their Remaining Output Length
14 min · Jul 07, 2026
018
Language Models Compute the Rational Move, Then Override It
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Lekeas, Stamatopoulos · DreamWorks Animation · 29 min · May 03, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Discovering Latent Knowledge in Language Models Without Supervision