Definition
Probing is the family of techniques that trains a small classifier on a model’s internal states to test whether some property is encoded there. Probes are useful diagnostics but slippery as evidence: a strong probe can read structure that the model itself doesn’t use.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.