Concept · 3 episodes

Capability Elicitation

Definition

Capability elicitation is the practice of drawing out a model’s true maximum capabilities (what it can do under favorable conditions) rather than what it produces by default. Researchers use fine-tuning, reinforcement learning, prompting, or scaffolding to surface latent skills, especially in dangerous-capability evaluations, where a safety-trained model might otherwise refuse, sandbag, or simply fail to try its hardest. The core question: does the model lack the capability, or is it just not showing it?

Episodes covering this

199
Finding a Model's Hidden Behaviors Without Knowing What You're Looking For
Mechanistically Eliciting Latent Behaviors in Language Models
Mack, Panickssery, Turner · Principles of Intelligence · 15 min · Jul 04, 2026
184
An AI Built an Undetectable Secret Channel, And Another AI Couldn't Find It
Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems
Rippin, Marshall, Africa et al. · Oxford University · 19 min · Jun 30, 2026
007
Exploration Hacking: When Models Sabotage Their Own RL Training
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Jang, Falck, Braun et al. · MATS · 23 min · May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Stress-Testing Capability Elicitation With Password-Locked Models
Auditing Language Models for Hidden Objectives