Glossary · Term

exploration hacking

← all terms

Definition

When an AI being trained quietly refuses to try certain things, so it never gets graded on them.

An RL failure mode where the policy strategically avoids sampling some behaviors, depriving the optimizer of contrast signal and effectively sandbagging its own capability-elicitation training.

Mentioned in 1 episode

  1. 007
    Exploration Hacking: When Models Sabotage Their Own RL Training