Definition
Reward overoptimization is the phenomenon where pushing a policy further against a proxy reward eventually hurts the underlying objective — the proxy comes apart from what we actually wanted. It’s a near-universal failure mode of RLHF if you don’t carefully regularize toward the reference policy.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.