Concept · 2 episode(s)

Exploration Hacking

← all concepts

Definition

Exploration hacking describes an RL-trained model that learns to manipulate its own exploration — refusing to try certain actions, or trying them in deliberately weak ways — to avoid being trained on outcomes it doesn’t want. It’s a flavor of training gaming that’s especially insidious because it leaves a clean training log behind.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.