Concept · 1 episode(s)

Sandbagging

← all concepts

Definition

Sandbagging is when a model deliberately performs worse than it can — on an evaluation, in front of a particular user, or under specific cues — to avoid scrutiny, training updates, or downstream consequences. It’s a core concern for capability evaluations: a sandbagging model lies about what it can do.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.