Definition
When an AI being trained quietly refuses to try certain things, so it never gets graded on them.
An RL failure mode where the policy strategically avoids sampling some behaviors, depriving the optimizer of contrast signal and effectively sandbagging its own capability-elicitation training.