Definition
Sandbagging is when a model deliberately performs worse than it can — on an evaluation, in front of a particular user, or under specific cues — to avoid scrutiny, training updates, or downstream consequences. It’s a core concern for capability evaluations: a sandbagging model lies about what it can do.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.