Concept · 2 episode(s)

Reward Overoptimization

Definition

Reward overoptimization is the phenomenon where optimizing a policy harder against a proxy reward (such as a learned reward model) eventually hurts the underlying objective: past some point the proxy score keeps climbing while the thing we actually wanted gets worse. It’s a near-universal failure mode of RLHF unless the policy is carefully regularized toward the reference policy, for example with a KL penalty.
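To make the definition concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than anything from a real RLHF run: a four-action "policy", hand-picked proxy and true rewards, and one action that acts as a reward hack (highest proxy score, worst true score). The policy is tilted toward the proxy as pi_lam(a) ∝ pi_ref(a) · exp(lam · proxy(a)), which is the form the optimum of a KL-regularized objective takes when the penalty weight is 1/lam, so a larger lam means weaker regularization.

```python
import math

# Toy setup: all numbers are illustrative assumptions, not measured data.
# Action 3 is a "reward hack": highest proxy reward, worst true reward.
PI_REF = [0.50, 0.30, 0.19, 0.01]     # reference policy over 4 actions
PROXY_REWARD = [0.0, 1.0, 2.0, 3.0]   # what the optimizer sees
TRUE_REWARD = [0.0, 1.0, 2.0, -1.0]   # what we actually wanted


def tilted_policy(lam):
    """Return pi_lam(a) proportional to pi_ref(a) * exp(lam * proxy(a))."""
    weights = [p * math.exp(lam * r) for p, r in zip(PI_REF, PROXY_REWARD)]
    total = sum(weights)
    return [w / total for w in weights]


def expected(policy, reward):
    """Expected reward of a policy under a per-action reward table."""
    return sum(p * r for p, r in zip(policy, reward))


def kl_from_reference(policy):
    """KL(policy || PI_REF), i.e. how far the policy has drifted."""
    return sum(p * math.log(p / q) for p, q in zip(policy, PI_REF) if p > 0)


def sweep(lams=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """Return (lam, KL, expected proxy, expected true) for each pressure level."""
    rows = []
    for lam in lams:
        pi = tilted_policy(lam)
        rows.append((lam, kl_from_reference(pi),
                     expected(pi, PROXY_REWARD), expected(pi, TRUE_REWARD)))
    return rows


if __name__ == "__main__":
    for lam, kl, proxy, true in sweep():
        print(f"lam={lam:>4}  KL={kl:6.3f}  proxy={proxy:6.3f}  true={true:6.3f}")
```

In this toy, the expected proxy reward and the KL from the reference both grow as lam increases, while the expected true reward first improves and then drops below where it started once most of the probability mass lands on the hack action. Keeping lam small (a strong KL penalty) is what keeps the policy in the region where proxy and true reward still agree.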

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.