Concept · 1 episode(s)

Introspective Probing

Definition

Introspective probing uses a model’s own self-reports, or activations recorded while it processes self-referential prompts, to estimate what it “knows” about its own beliefs and states. The hard part is distinguishing genuine introspection, where the report actually tracks the internal state, from plausible-sounding generation that merely reads like a self-report.
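
One simple way to make that distinction concrete is to check whether self-reports carry any information about an independently measured outcome. The sketch below is illustrative only: the function name `confidence_gap`, the record format, and the sample data are all assumptions made up for this example, not a method described in the source. It compares the mean self-reported confidence on answers that turned out correct against the mean on answers that turned out wrong.

```python
from statistics import mean

def confidence_gap(records):
    """Mean self-reported confidence on correct answers minus the mean on
    incorrect ones.

    `records` is a sequence of (reported_confidence, was_correct) pairs.
    This format is a hypothetical convention chosen for the sketch.
    """
    records = list(records)  # allow generators; we iterate twice
    right = [conf for conf, ok in records if ok]
    wrong = [conf for conf, ok in records if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    return mean(right) - mean(wrong)

# Hypothetical data: self-reports that track correctness...
informative = [(0.9, True), (0.8, True), (0.3, False), (0.2, False)]
# ...and self-reports that sound confident regardless of correctness.
uninformative = [(0.9, True), (0.9, False), (0.9, True), (0.9, False)]

gap_informative = confidence_gap(informative)
gap_uninformative = confidence_gap(uninformative)
```

A gap near zero suggests the reports are not tracking whatever determines correctness, which is what plausible-sounding generation would look like. A large gap is only weak evidence of introspection, since the model could also be reading cues from the question itself rather than its own state.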

Episodes covering this