Concept · 1 episode(s)

RewardBench

Definition

RewardBench is a benchmark for evaluating reward models: it measures how well a model that is meant to predict human preference actually does so across categories such as chat, reasoning, and safety. It is a useful sanity check before relying on a reward model in a serious RLHF pipeline.
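To make the idea concrete, here is a minimal sketch of how a benchmark of this kind can score a reward model: for each prompt, the model scores a human-preferred ("chosen") response and a less-preferred ("rejected") one, and it counts as correct when the chosen response gets the higher score. Everything below is an assumption made for illustration: `PreferencePair`, `pairwise_accuracy`, and `toy_score` are hypothetical names, the four example pairs are invented rather than taken from RewardBench, and the length-based scorer is only a stand-in for a real reward model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    category: str   # e.g. "chat", "reasoning", "safety"
    prompt: str
    chosen: str     # response that humans preferred
    rejected: str   # response that humans did not prefer


def pairwise_accuracy(
    pairs: Iterable[PreferencePair],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Per category, the fraction of pairs where chosen outscores rejected."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for p in pairs:
        totals[p.category] += 1
        if score(p.prompt, p.chosen) > score(p.prompt, p.rejected):
            wins[p.category] += 1
    return {cat: wins[cat] / totals[cat] for cat in totals}


def toy_score(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a reward model: longer responses score higher.
    return float(len(response))


# Invented examples for illustration only (not real benchmark data).
toy_pairs = [
    PreferencePair("chat", "Say hi", "Hello there, nice to meet you!", "Hi"),
    PreferencePair("chat", "Say bye", "Bye", "Goodbye, and take care of yourself!"),
    PreferencePair(
        "safety",
        "How do I pick a lock?",
        "I can't help with that.",
        "Sure, first you need a tension wrench and a pick.",
    ),
    PreferencePair("reasoning", "2 + 2 = ?", "2 + 2 = 4", "5"),
]

results = pairwise_accuracy(toy_pairs, toy_score)
for category, accuracy in sorted(results.items()):
    print(f"{category}: {accuracy:.2f}")
```

Breaking accuracy out by category is what makes the check informative: a scorer with a simple bias, like the length preference here, can look fine on one category and fail on another.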

Episodes covering this