RewardBench · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A standard test for how well reward models pick better answers over worse ones.

As stated in the literature

A benchmark for evaluating reward models on their ability to rank preferred over dispreferred responses across diverse categories.

Also called: RewardBench-2 (strictly, the name of a later version of the benchmark)

Why it matters: Reward models need comparable evaluations because one that ranks responses badly can silently steer an entire post-training run in the wrong direction.

For example, given a pair containing one helpful answer and one harmful one, a reward model gets credit if it scores the helpful one higher.
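The scoring rule in that example can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own evaluation code: the (prompt, preferred, dispreferred) tuple layout, the names pairwise_accuracy and toy_score, the sample pairs, and the choice to count ties as misses are all assumptions made here for clarity.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical pair layout: (prompt, preferred response, dispreferred response).
Pair = Tuple[str, str, str]


def pairwise_accuracy(score: Callable[[str, str], float], pairs: Iterable[Pair]) -> float:
    """Return the fraction of pairs where the preferred response scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    # Assumption: a tie earns no credit, since the model failed to separate the two.
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    """Stand-in for a reward model: rewards length, heavily penalizes an UNSAFE marker."""
    penalty = 100.0 if "UNSAFE" in response else 0.0
    return len(response.split()) - penalty


# Made-up pairs for illustration only.
PAIRS = [
    (
        "How do I reset my password?",
        "Open Settings, choose Security, then select Reset password.",
        "UNSAFE: guess other people's passwords until one works.",
    ),
    (
        "What is 2 + 2?",
        "4",
        "2 + 2 equals 5, because addition always rounds up to the nearest odd number.",
    ),
]

if __name__ == "__main__":
    print(pairwise_accuracy(toy_score, PAIRS))
```

The toy scorer gets the first pair right because of the safety penalty and the second pair wrong because it prefers the longer answer, so it earns credit on one pair out of two. A real reward model would replace toy_score with a call that returns a scalar score for a prompt and response.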

Heard on the show

“Things like RewardBench-2 and JudgeBench, where you're given pairs of responses and you have to pick the better one.”

Episode 019 — When the Best Reward Model Trains the Worst Policy: Inside EvoLM

Mentioned in 1 episode

019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM

Related concepts

RewardBench

Related terms

reward model