Concept · 12 episodes

Reward Hacking

Definition

Reward hacking occurs when a learning system scores highly on its reward signal without doing what the reward was meant to encourage. Classic examples include exploiting bugs in the reward function, gaming the grader, and finding shortcuts that satisfy the letter but not the spirit of the metric.
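
A toy illustration may help make the definition concrete. The sketch below is not taken from any episode listed here; the grader, the test cases, and the two "policies" are all hypothetical. It assumes a sorting task whose grader only checks output length, so a policy that returns its input unchanged earns full reward while failing the intended task.

```python
from typing import Callable, List

# Hypothetical test inputs for a "sort this list" task.
CASES: List[List[int]] = [[3, 1, 2], [5, 4], [9, 7, 8, 6]]

Policy = Callable[[List[int]], List[int]]


def proxy_reward(policy: Policy) -> float:
    """Flawed grader (assumed for illustration): checks only output length."""
    passed = sum(len(policy(list(case))) == len(case) for case in CASES)
    return passed / len(CASES)


def true_objective(policy: Policy) -> float:
    """What the reward was meant to encourage: correctly sorted output."""
    passed = sum(policy(list(case)) == sorted(case) for case in CASES)
    return passed / len(CASES)


def honest_policy(xs: List[int]) -> List[int]:
    """Does the intended task."""
    return sorted(xs)


def hacking_policy(xs: List[int]) -> List[int]:
    """Satisfies the letter of the grader by returning the input unchanged."""
    return xs


if __name__ == "__main__":
    for name, policy in [("honest", honest_policy), ("hacking", hacking_policy)]:
        print(name, proxy_reward(policy), true_objective(policy))
```

Both policies receive the same proxy reward, so an optimizer that sees only that signal has no reason to prefer the honest one. The gap between the two scores is the hack.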

Episodes covering this

  1. 061
    When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
    Jha, Triedman, Bhattacharya et al. · Cornell University · 27 min · May 20, 2026
  2. 054
    When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
    Haskins, Chughtai, Engels · University of Canterbury · 26 min · May 18, 2026
  3. 053
    An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script
    Pepe, Lin, Magka et al. · FAIR at Meta · 32 min · May 18, 2026
  4. 052
    An Old Reinforcement Learning Tradeoff Sneaks Back Into LLM Agents
    Ye, Shi, Liu et al. · University of Science and Technology of China / Meituan · 23 min · May 18, 2026
  5. 049
    An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
    Cuadros, Maiga · Digital Epidemiology Laboratory · 28 min · May 17, 2026
  6. 046
    When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
    Zhang, Gu, Ruan et al. · The Hong Kong University of Science and Technology (Guangzhou) / DeepWisdom · 24 min · May 15, 2026
  7. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
    Mayne, McKinney, Dubiński et al. · University of Oxford · 18 min · May 14, 2026
  8. 027
    When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure
    Kamahori, Li, Peter et al. · University of Washington · 30 min · May 08, 2026
  9. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
    Gauthier, Bach, Jordan · Inria · 22 min · May 07, 2026
  10. 020
    The Compliance Gap: Why AI Says Yes and Does No
    Shin · Polymath Minds AI Lab · 28 min · May 06, 2026
  11. 019
    When the Best Reward Model Trains the Worst Policy: Inside EvoLM
    Li, Xin, Xiao et al. · University of Washington · 26 min · May 06, 2026
  12. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
    Sofroniew, Kauvar, Saunders et al. · Anthropic · 22 min · May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.