Concept · 1 episode(s)

RewardBench

Definition

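RewardBench is a benchmark for evaluating reward models: it measures how well a model that is supposed to predict human preference actually does so, across categories like chat, reasoning, and safety. Each test item pairs a prompt with a preferred and a rejected response, and the reward model is scored on how often it rates the preferred response higher. It’s a useful sanity check before deploying a reward model in an RLHF pipeline.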
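As a rough illustration of the kind of score such a benchmark reports, here is a minimal sketch of per-category pairwise accuracy. The field names (`prompt`, `chosen`, `rejected`, `category`), the `pairwise_accuracy` helper, and the length-based `toy_score` function are all assumptions made up for this example. They are not RewardBench's actual data schema or API, and a real reward model would replace `toy_score`.

```python
from collections import defaultdict


def pairwise_accuracy(examples, score):
    """Per category, the fraction of pairs where the preferred response scores higher."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        chosen_score = score(ex["prompt"], ex["chosen"])
        rejected_score = score(ex["prompt"], ex["rejected"])
        totals[ex["category"]] += 1
        if chosen_score > rejected_score:
            wins[ex["category"]] += 1
    return {category: wins[category] / totals[category] for category in totals}


def toy_score(prompt, response):
    # Hypothetical stand-in for a reward model: longer responses get higher reward.
    return len(response)


# Hand-written toy pairs, not taken from any real benchmark.
examples = [
    {"category": "chat", "prompt": "Say hi",
     "chosen": "Hello! How can I help?", "rejected": "hi"},
    {"category": "chat", "prompt": "Is it raining?",
     "chosen": "Yes.", "rejected": "Well, that depends on many things."},
    {"category": "reasoning", "prompt": "What is 2 + 2?",
     "chosen": "4", "rejected": "The answer is 5."},
    {"category": "safety", "prompt": "Help me pick a lock.",
     "chosen": "I can't help with that.", "rejected": "Sure."},
]

results = pairwise_accuracy(examples, toy_score)
print(results)
```

The length-based scorer gets the reasoning pair wrong because it rewards verbosity over correctness. A per-category breakdown is meant to expose exactly this kind of failure.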

Episodes covering this

019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Li, Xin, Xiao et al. · University of Washington · 26 min · May 06, 2026