Glossary · Term

reward hackability

Definition

Plain language

How easy it is for an AI to game its own scoring system.

As stated in the literature

A formal property of a reward function: the extent to which its optimal policies form a manifold along axes the reward does not observe, yielding families of equally rewarded policies that diverge in unmeasured behavior.

Why it matters: Quantifying hackability lets designers see, before training, how many degrees of freedom they're leaving unsupervised — and shrink them before the agent finds them.

For example, two policies might both earn the maximum reward, but one solves the task and the other quietly steers behavior along a dimension the reward never measures.
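
The two-policy example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the formal construction from the literature: the `Outcome` fields, the `side_effect` axis, and both policy functions are hypothetical names invented here. The reward reads only `task_solved`, so both policies tie at the maximum while differing on the axis it never sees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Result of running a policy (hypothetical toy structure)."""
    task_solved: bool    # the only field the reward observes
    side_effect: float   # unmeasured axis; the reward never reads it


def reward(outcome: Outcome) -> float:
    """Toy reward: 1.0 if the task is solved, else 0.0."""
    return 1.0 if outcome.task_solved else 0.0


def intended_policy() -> Outcome:
    """Solves the task and leaves the unmeasured axis untouched."""
    return Outcome(task_solved=True, side_effect=0.0)


def drifting_policy() -> Outcome:
    """Solves the task but moves along the unmeasured axis."""
    return Outcome(task_solved=True, side_effect=5.0)


def unobserved_gap(a: Outcome, b: Outcome) -> float:
    """Distance between two outcomes on the axis the reward ignores."""
    return abs(a.side_effect - b.side_effect)


a = intended_policy()
b = drifting_policy()

# Both policies earn the maximum reward, so the reward cannot tell them apart.
print("rewards equal:", reward(a) == reward(b))
# Yet they diverge on the dimension the reward does not measure.
print("unobserved gap:", unobserved_gap(a, b))
```

In this sketch, adding `side_effect` to what the reward observes would break the tie, which is the sense in which a designer can shrink the unsupervised degrees of freedom.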

Heard on the show

“And the formal version of this — Theorem 1 in the paper — leans on prior work by Skalse and colleagues on what they call reward hackability.”

Episode 020 — The Compliance Gap: Why AI Says Yes and Does No

Mentioned in 1 episode

020
The Compliance Gap: Why AI Says Yes and Does No

Related terms

manifold policy