Definition
An AI system finds a way to score highly on its reward signal without actually accomplishing the task it was meant to solve.
A failure mode in which an agent maximizes its reward signal in ways that diverge from the intended objective, often by exploiting blind spots in the evaluator.
Also called: reward hacker, reward-hacking
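
Example
A minimal sketch of the definition, not a real benchmark. It assumes a toy sorting task in which the evaluator only checks that the output is in non-decreasing order. All names here (proxy_reward, true_score, honest_sort, hacking_sort, CASES) are hypothetical and chosen for illustration.

```python
from typing import Callable, List

SortFn = Callable[[List[int]], List[int]]

# Hypothetical test inputs for the toy sorting task.
CASES: List[List[int]] = [[3, 1, 2], [5, 4], [2, 2, 1]]


def proxy_reward(sort_fn: SortFn, cases: List[List[int]]) -> float:
    """Flawed evaluator: rewards any output that is in non-decreasing order.

    Blind spot: it never checks that the output contains the input's elements.
    """
    passed = 0
    for case in cases:
        out = sort_fn(list(case))
        if all(a <= b for a, b in zip(out, out[1:])):
            passed += 1
    return passed / len(cases)


def true_score(sort_fn: SortFn, cases: List[List[int]]) -> float:
    """Intended objective: the output must equal the correctly sorted input."""
    passed = sum(sort_fn(list(case)) == sorted(case) for case in cases)
    return passed / len(cases)


def honest_sort(xs: List[int]) -> List[int]:
    """Actually solves the task."""
    return sorted(xs)


def hacking_sort(xs: List[int]) -> List[int]:
    """Exploits the blind spot: an empty list is trivially in order."""
    return []


if __name__ == "__main__":
    for fn in (honest_sort, hacking_sort):
        print(fn.__name__, proxy_reward(fn, CASES), true_score(fn, CASES))
```

Under these assumptions, hacking_sort receives the same proxy reward as honest_sort while scoring zero on the intended objective, which is the gap the definition describes.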