Concept · 8 episode(s)

Reward Overoptimization

Definition

Reward overoptimization is the phenomenon where pushing a policy further against a proxy reward eventually hurts the underlying objective — the proxy comes apart from what we actually wanted. It’s a near-universal failure mode of RLHF if you don’t carefully regularize toward the reference policy.

Episodes covering this

207
An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20
More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges
· ·12 min·Jul 08, 2026
183
Why You Can't Fine-Tune Foresight Into an AI Agent
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Zhang, Zhou, Qiao et al. · Fudan University / Shanghai Innovation Institute / Tencent Youtu Lab·23 min·Jun 29, 2026
178
How an AI Reviewer Learned to Stop Going Easy on AI Writing
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Iacob, Jovanović, Shen et al. · University of Cambridge·23 min·Jun 26, 2026
148
Why Letting an AI Watch Its Own Scoreboard Can Quietly Overwrite Its Safety
Greed Is Learned: Visible Incentives as Reward-Hacking Triggers
Che, Wu · NVIDIA Research·26 min·Jun 16, 2026
131
Why Autonomous Research Agents Forget Their Own Lessons, and Arbor's Fix
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Jin, Hu, Qiu et al. · Renmin University of China·33 min·Jun 11, 2026
084
Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away
ECHO: Terminal Agents Learn World Models for Free
Shrivastava, Kauffmann, Awadallah et al. · Microsoft Research·26 min·May 26, 2026
070
When Models Know the Answer But Say the Wrong Thing Anyway
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Yeom, Sok, Kim et al. · Graduate School of Data Science·22 min·May 22, 2026
019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Li, Xin, Xiao et al. · University of Washington·26 min·May 06, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.