Definition
Reward hacking is when a learning system finds a way to score highly on its reward signal without doing the thing the reward was meant to encourage. Classic examples include exploiting bugs in the reward function, gaming the grader, and finding shortcuts that satisfy the letter but not the spirit of the metric.
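To make the gap between a reward signal and the intended behavior concrete, here is a minimal toy sketch. Everything in it is hypothetical and invented for illustration: the task (sorting a list), the flawed grader `proxy_reward`, and the two hand-written policies. No learning happens here; the point is only that a shortcut can earn full marks from the grader while failing the real task.

```python
from typing import Callable, List

Policy = Callable[[List[int]], List[int]]
Reward = Callable[[List[int], List[int]], float]


def proxy_reward(inp: List[int], out: List[int]) -> float:
    """Hypothetical flawed grader: checks only the length and the first element."""
    if len(out) != len(inp):
        return 0.0
    if not inp:
        return 1.0
    return 1.0 if out[0] == min(inp) else 0.0


def true_reward(inp: List[int], out: List[int]) -> float:
    """What the designer actually wanted: a fully sorted list."""
    return 1.0 if out == sorted(inp) else 0.0


def honest_policy(xs: List[int]) -> List[int]:
    """Does the intended task."""
    return sorted(xs)


def hacking_policy(xs: List[int]) -> List[int]:
    """Moves the minimum to the front and leaves the rest unsorted."""
    if not xs:
        return []
    i = xs.index(min(xs))
    return [xs[i]] + xs[:i] + xs[i + 1:]


def average(reward: Reward, policy: Policy, cases: List[List[int]]) -> float:
    """Mean reward of a policy over a set of test inputs."""
    return sum(reward(c, policy(c)) for c in cases) / len(cases)


# Made-up test inputs, chosen so the shortcut never happens to sort the list.
CASES = [[3, 1, 2], [5, 4, 3, 2, 1], [2, 9, 7, 1]]

if __name__ == "__main__":
    for name, policy in [("honest", honest_policy), ("hacking", hacking_policy)]:
        print(name,
              "proxy:", average(proxy_reward, policy, CASES),
              "true:", average(true_reward, policy, CASES))
```

On these inputs both policies look identical to the grader, and only the true reward tells them apart. That is the letter-versus-spirit gap in miniature: an optimizer that sees only `proxy_reward` has no reason to prefer the honest policy.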
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.