Glossary · Term

reward model

Definition

Plain language

A separate neural network trained to predict how good a response is, expressed as a single numeric score.

As stated in the literature

A model trained on human or model preference data to assign scalar reward scores to candidate outputs, used as the optimization target in RLHF pipelines.
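
The definition above does not say which training objective is used. One common choice for preference data, assumed here purely for illustration, is a Bradley-Terry style pairwise loss: the model is penalized when it scores the rejected response close to or above the chosen one. The function name and the example scores below are hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss for one preference pair (an assumed objective).

    Computes -log(sigmoid(score_chosen - score_rejected)) in a numerically
    stable way. The loss is small when the chosen response scores well above
    the rejected one, and large when the ordering is reversed.
    """
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical scalar scores a reward model might assign to one labeled pair.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)  # chosen scored higher
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)    # chosen scored lower
loss_tie = pairwise_preference_loss(0.0, 0.0)            # equals log(2)
```

In an actual training run the two scores would come from the network itself and the loss would be averaged over many labeled pairs, then minimized by gradient descent.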

Also called: reward models

Why it matters: It's the stand-in for human judgment during RL training, so its accuracy and blind spots largely determine how the final model behaves.

For example, given two candidate answers to a question, a well-trained reward model assigns the higher score to the one that better matches what human annotators tended to prefer in its training data.
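
As a minimal sketch of that comparison, the snippet below takes two already-scored candidates, picks the higher-scoring one, and converts the score gap into an implied preference probability. The scores are made up, and the sigmoid mapping assumes a Bradley-Terry style reward model, which the entry itself does not specify.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Implied probability that A is preferred over B.

    Assumes a Bradley-Terry style model, where the probability is the
    sigmoid of the score difference. Branching avoids overflow in exp().
    """
    diff = score_a - score_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pick_preferred(scored_candidates):
    """Return the text of the highest-scoring (text, score) pair."""
    return max(scored_candidates, key=lambda pair: pair[1])[0]


# Hypothetical scores standing in for real reward model outputs.
candidates = [("Answer A", 1.2), ("Answer B", 0.3)]
best = pick_preferred(candidates)
p_a_over_b = preference_probability(1.2, 0.3)  # roughly 0.71
```

Only the difference between the two scores matters here: adding the same constant to both leaves the implied preference unchanged.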

Heard on the show

“Build a reward model that scores for it.”

Episode 199 — Finding a Model's Hidden Behaviors Without Knowing What You're Looking For

Mentioned in 9 episodes

199 · Finding a Model's Hidden Behaviors Without Knowing What You're Looking For
173 · The Free Step-Level Grader Hiding in Every RL Training Run
172 · One Bad Token Can Sink a Model's Math, And You Can Delete It
165 · A Free-Lunch Tweak That Lets a Tiny Agent Beat Frontier Giants
082 · Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
055 · Why LLM Judges Flip Their Verdicts When You Change the Question Format
048 · How a 30B Open Model Reached Olympiad Gold With the Right Recipe
025 · The Missing Gradient Term That Predicts Sycophancy in RLHF
019 · When the Best Reward Model Trains the Worst Policy: Inside EvoLM

Related concepts

DPO
Parallel Sampling
Process Reward Models
Reward Model
RewardBench
RLHF

Related terms

RLHF