
reward model

Definition

A neural network, separate from the model being optimized, trained to predict how good a response is.

A model trained on human or model-generated preference data to assign a scalar reward score to each candidate output. That score serves as the optimization target in RLHF pipelines.
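As an illustration of how preference data can become a training signal, here is a minimal sketch of a pairwise (Bradley-Terry style) loss, one common choice for reward models. The definition above does not name a specific objective, so the loss itself, the function names, and the example scores are assumptions made for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the preferred response scores higher than the
    rejected one, and large when the ordering is reversed.
    (Assumed objective; hypothetical helper, not taken from the glossary.)
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)


# Hypothetical scalar scores a reward model might assign to three preference pairs.
example_pairs = [(2.0, 0.5), (1.0, 1.0), (-0.5, 1.5)]
batch_loss = mean_preference_loss(example_pairs)
```

In a real pipeline the scores would come from a neural network and the loss would be minimized by gradient descent. The scalar score the trained model assigns to a new response is what the RLHF step then optimizes against.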

Also called: reward models

Mentioned in 5 episodes

  1. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format
  2. 048
    How a 30B Open Model Reached Olympiad Gold With the Right Recipe
  3. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
  4. 019
    When the Best Reward Model Trains the Worst Policy: Inside EvoLM
  5. 008
    Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps

Related concepts