linear probe · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A small simple classifier trained on a model's internal states to test what information they contain.

As stated in the literature

A linear classifier trained on frozen intermediate activations to detect whether a particular concept is linearly decodable from the representation.

Also called: linear probing, linear classifier, linear probes, probe, probes

Why it matters: Linear probes are a cheap, standard way to ask 'is this concept actually represented here?' without invasive interventions.

For example, training a one-layer classifier on a transformer's middle layer can reveal whether the model already 'knows' the part of speech of each word at that depth.

Heard on the show

“Second, the linear probe, which reads them in the weakest possible way: multiply by a fixed set of weights, add them up, and output a guess.”

Episode 204 — The Length Estimate Hiding Inside a Word-by-Word Model

Mentioned in 44 episodes

Related concepts

Creation-Audit Loop Emotion Vectors Linear Probing Perplexity Probe Probing Test-Time Auditing

Related terms

freeze