Glossary · Term

reward hackability

← all terms

Definition

How easy it is for an AI to game its own scoring system.

A formal property of reward functions describing the extent to which optimal policies form a manifold along axes the reward does not observe, leading to families of equally-rewarded policies that diverge in unmeasured behavior.

Mentioned in 1 episode

  1. 020
    The Compliance Gap: Why AI Says Yes and Does No