Concept · 1 episode(s)

Capability Elicitation


Definition

Capability elicitation is the practice of drawing out a model's true maximum capabilities, meaning what it can do under favorable conditions, rather than what it produces by default. Researchers use fine-tuning, reinforcement learning, prompting, or scaffolding to surface latent skills. This matters most in dangerous-capability evaluations, where a safety-trained model might otherwise refuse, sandbag, or simply fail to try its hardest. The core question: does the model lack the capability, or is it just not showing it?
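One way to make the core question concrete is to score the same task under several elicitation strategies and compare the best result with the default. The sketch below is a minimal illustration under assumed conventions, not an established evaluation API: the function name `elicitation_gap`, the strategy labels, and the idea of a single score per strategy are all hypothetical.

```python
from typing import Mapping, Tuple


def elicitation_gap(
    scores: Mapping[str, float], baseline: str = "default"
) -> Tuple[str, float, float]:
    """Compare the best elicited score against the default-behavior score.

    `scores` maps a strategy label (hypothetical names such as "default",
    "few_shot", "fine_tuned") to a task score where higher is better.
    Returns (best_strategy, best_score, gap_over_baseline).
    """
    if baseline not in scores:
        raise KeyError(f"missing baseline strategy: {baseline!r}")
    # The highest score across all strategies is the elicited estimate.
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best], scores[best] - scores[baseline]
```

A large gap suggests the default behavior understates what the model can do. A gap of zero does not show the capability is absent, only that none of the strategies tried surfaced it, so the best score is better read as a lower bound.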

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.