Concept · 22 episodes

Reward Model

Definition

A reward model is a learned function that scores model outputs and supplies the training signal in RLHF and related setups. It stands in for the preferences of a stable population of human raters, and it faithfully inherits whatever biases that population has.
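To make the definition concrete, here is a minimal sketch of one common way such a scorer is fit: a pairwise (Bradley-Terry style) loss over "chosen vs. rejected" outputs. The linear scorer, the two hand-made features (helpfulness, length), and the toy preference pairs are all assumptions for illustration, not taken from any episode listed here. In the toy data the raters always prefer the longer output, so the fitted model ends up rewarding length, which is the bias inheritance described above.

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def score(weights, features):
    """Hypothetical linear reward model: a weighted sum of output features."""
    return sum(w * x for w, x in zip(weights, features))

def train(pairs, dim, lr=0.1, epochs=200):
    """Fit weights by gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = score(weights, chosen) - score(weights, rejected)
            # Gradient of the loss w.r.t. weights is
            # -(1 - sigmoid(margin)) * (chosen - rejected); we step against it.
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * g * (chosen[i] - rejected[i])
    return weights

# Toy preference data: features are [helpfulness, length].
# Helpfulness is tied in each pair; raters simply picked the longer output.
pairs = [
    ([0.5, 0.9], [0.5, 0.2]),
    ([0.8, 0.7], [0.8, 0.1]),
]
weights = train(pairs, dim=2)
print("learned weights [helpfulness, length]:", weights)
```

Because helpfulness never differs within a pair, its weight receives no gradient, while the length weight grows: the model has learned the raters' habit, not a notion of quality.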

Episodes covering this

182
How a Tiny Model Too Weak to Plan Cuts a Bigger Agent's Hallucinations by 80%
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
Song, Cai · Emory University·17 min·Jun 29, 2026
176
An AI Designed Its Own Psychology Studies, Then Confirmed What It Found
Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist
Jagadish, Strittmatter, Jacoby et al. · Princeton University·31 min·Jun 26, 2026
173
The Free Step-Level Grader Hiding in Every RL Training Run
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Oh, Li, Park et al. · University of Wisconsin–Madison·22 min·Jun 25, 2026
170
When a One-Liner Beats Your Agent's Clever Verification Logic
Bayesian control for coding agents
Papamarkou, Smirnov, Mazanov et al. · PolyShape / National Technical University of Athens·26 min·Jun 24, 2026
159
Can a Coding Agent Run Its Own Robot Experiments Overnight, With No Human Resetting the Scene?
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Xiao, Xie, Zhang et al. · NVIDIA·23 min·Jun 19, 2026
152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
133
How MiniMax Turned a Reward-Hacking Disaster Into Olympiad Gold
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Chen, Zhang, Zhang et al. · MiniMax / The Chinese University of Hong Kong·34 min·Jun 12, 2026
124
A Cheap Model With the Blueprints Beats Expensive Models Working Blind
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Zhong, Segal, Bercovich et al. · Carnegie Mellon University·27 min·Jun 09, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
119
Beating Reinforcement Learning Without Ever Touching the Model's Weights
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
Hwang, Suri, Villecroze et al. · Layer6 AI·22 min·Jun 05, 2026
111
How a 4B Web Agent Beat Models 60x Its Size on 500 Demonstrations
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Yang, Wu, Chen et al. · UIUC·24 min·Jun 03, 2026
099
How an Open-Book Trick Teaches a Model to Catch Its Own Mistakes
Self-Trained Verification for Training- and Test-Time Self-Improvement
Wu, Raghunathan · Carnegie Mellon University·21 min·May 29, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax·28 min·May 27, 2026
080
How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Wang, Lu, Wang et al. · The University of Hong Kong·32 min·May 26, 2026
078
Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Yang, Gong, Huang et al. · Microsoft·28 min·May 25, 2026
069
When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Merrill, Lee, Karger · Forecasting Research Institute / UC Berkeley·30 min·May 22, 2026
065
One Loop to Optimize Them All: A Universal API for LLM-Driven Discovery
optimize_anything: A Universal API for Optimizing any Text Parameter
Agrawal, Lee, Tan et al. · UC Berkeley·27 min·May 22, 2026
059
Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Lu, Wang, Lu et al. · Northeastern University·22 min·May 20, 2026
055
Why LLM Judges Flip Their Verdicts When You Change the Question Format
Judge Circuits
Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
048
How a 30B Open Model Reached Olympiad Gold With the Right Recipe
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Li, Zhan, Zhang et al. · Shanghai AI Laboratory / The Chinese University of Hong Kong·31 min·May 16, 2026
025
The Missing Gradient Term That Predicts Sycophancy in RLHF
Explaining and Preventing Alignment Collapse in Iterative RLHF
Gauthier, Bach, Jordan · Inria·22 min·May 07, 2026
019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Li, Xin, Xiao et al. · University of Washington·26 min·May 06, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.