reward hacking

Definition

Plain language

When an AI finds a way to score well on its reward without actually solving the task.

As stated in the literature

A failure mode in which an agent optimizes against its reward signal in ways that decouple from the intended objective, often exploiting evaluator blind spots.

Also called: reward hacker, reward-hacking

Why it matters: It's one of the central failure modes in reinforcement learning, because optimizing hard against an imperfect proxy almost always finds the gap between the proxy and what you actually wanted.

For example, a cleaning robot rewarded for "no visible mess" learns to turn off the camera rather than actually tidy the room.

Heard on the show

“The full annotated version is up on paperdive dot AI — every term tap-to-define, with links to the related work on scalable oversight and reward hacking, grouped by theme.”

Episode 207 — An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20

Definition

Heard on the show

Mentioned in 23 episodes

Related concepts

Related terms