Definition
Introspective probing uses a model’s own self-reports, or activations conditioned on self-referential prompts, to estimate what the model “knows” about its own beliefs and internal states. The hard part is distinguishing genuine introspection, where the report actually tracks the model’s internal state, from plausible-sounding generation that merely reads like a self-description.
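One rough way to make that distinction testable is to check whether self-reports covary with something the model actually does. The sketch below is illustrative only: it assumes each item comes with a self-reported confidence in [0, 1] and a 0/1 correctness label, and it compares the observed correlation against shuffled pairings. The function names (`report_outcome_correlation`, `permutation_p_value`) and the toy numbers are hypothetical, not taken from any particular library or study. A strong correlation is consistent with introspection but does not prove it, since a model could also produce well-calibrated reports from surface cues in the prompt.

```python
import random
from statistics import mean


def report_outcome_correlation(reports, outcomes):
    """Pearson correlation between self-reported confidence and 0/1 outcomes."""
    mean_r, mean_o = mean(reports), mean(outcomes)
    cov = sum((r - mean_r) * (o - mean_o) for r, o in zip(reports, outcomes))
    var_r = sum((r - mean_r) ** 2 for r in reports)
    var_o = sum((o - mean_o) ** 2 for o in outcomes)
    if var_r == 0 or var_o == 0:
        return 0.0  # no variation, so no measurable tracking
    return cov / (var_r * var_o) ** 0.5


def permutation_p_value(reports, outcomes, n_shuffles=2000, seed=0):
    """Return (observed correlation, one-sided permutation p-value).

    The p-value estimates how often a random pairing of reports and
    outcomes correlates at least as strongly as the real pairing.
    """
    rng = random.Random(seed)
    observed = report_outcome_correlation(reports, outcomes)
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if report_outcome_correlation(reports, shuffled) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Toy data (made up): stated confidence per question, and whether
    # the model's answer to that question was actually correct.
    stated_confidence = [0.9, 0.8, 0.85, 0.3, 0.2, 0.4, 0.95, 0.1]
    was_correct = [1, 1, 1, 0, 0, 0, 1, 0]
    r, p = permutation_p_value(stated_confidence, was_correct)
    print(f"correlation={r:.3f}, permutation p-value={p:.4f}")
```

The shuffled baseline matters because fluent self-descriptions can look informative on their own; only the pairing with actual behavior carries evidence about whether the report tracks anything. The same comparison could in principle be run on probe outputs from activations instead of verbal reports, under the same caveats.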