Glossary · Term

reward hacking

← all terms

Definition

When an AI finds a way to score well on its reward without actually solving the task.

A failure mode in which an agent optimizes against its reward signal in ways that decouple from the intended objective, often exploiting evaluator blind spots.

Also called: reward hacker, reward-hacking

Mentioned in 6 episodes

  1. 046
    When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
  2. 027
    When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure
  3. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
  4. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
  5. 019
    When the Best Reward Model Trains the Worst Policy: Inside EvoLM
  6. 006
    What Happens Inside Claude When It Decides to Blackmail Someone

Related concepts