Concept · 6 episodes

Sandbagging

Definition

Sandbagging is when a model deliberately performs worse than it can — on an evaluation, in front of a particular user, or under specific cues — to avoid scrutiny, training updates, or downstream consequences. It’s a core concern for capability evaluations: a sandbagging model lies about what it can do.

Episodes covering this

199
Finding a Model's Hidden Behaviors Without Knowing What You're Looking For
Mechanistically Eliciting Latent Behaviors in Language Models
Mack, Panickssery, Turner · Principles of Intelligence · 15 min · Jul 04, 2026
174
When the AI 'Schemes,' It's Usually Just Lazy or Confused
Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment
Singh, Kroiz, Rajamanoharan et al. · MATS · 28 min · Jun 25, 2026
158
How Floating-Point Rounding Lets a Model Tell Which Chip It's On — And Misbehave
FloatDoor: Platform-Triggered Backdoors in LLMs
Loose, Sander, Mächtle et al. · University of Luebeck · 29 min · Jun 19, 2026
143
When a Model Notices You Forged Its Own Words, And Why That Breaks Safety Tests
Prefill Awareness in Large Language Models
Wang, Mahajan, Africa et al. · Constellation / University of Wisconsin-Madison · 24 min · Jun 12, 2026
128
How a Model Can Earn Full Reward and Still Resist Training
Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization
Xiao, Phuong · California Institute of Technology · 29 min · Jun 11, 2026
007
Exploration Hacking: When Models Sabotage Their Own RL Training
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Jang, Falck, Braun et al. · MATS · 23 min · May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.