Definition
Capability elicitation is the practice of drawing out a model’s true maximum capabilities — what it can do under favorable conditions — rather than what it produces by default. Researchers fine-tune, reinforcement-learn, prompt, or scaffold models to surface latent skills, especially in dangerous-capability evaluations where a safety-trained model might otherwise refuse, sandbag, or simply fail to try its hardest. The core question: does the model lack the capability, or is it just not showing it?
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.