Concept · 3 episode(s)

Probing

← all concepts

Definition

Probing is the family of techniques that trains a small classifier on a model’s internal states to test whether some property is encoded there. Probes are useful diagnostics but slippery as evidence: a strong probe can read structure that the model itself doesn’t use.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.