Definition
A reward model is a learned function that scores model outputs, used to provide a training signal in RLHF and related setups. It stands in for the preferences of a population of human raters, treated as stable, and faithfully inherits whatever biases those raters had.
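To make the definition concrete, here is a minimal sketch of fitting a reward model from pairwise comparisons. It assumes a Bradley-Terry style loss, a common choice but not the only one, and a linear scorer over two made-up features (helpfulness and length). The feature names, the toy pairs, and the helper functions are all hypothetical, chosen only to illustrate the idea. The toy raters always pick the longer answer, so the learned function picks up that bias.

```python
import math

def reward(weights, features):
    """Score one output: a linear function of its features (an assumption)."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))

def train(pairs, dim, lr=0.1, epochs=200):
    """Plain stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. the margin is -(1 - sigmoid(margin)).
            grad_scale = -(1.0 - 1.0 / (1.0 + math.exp(-margin)))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical toy data. Features are [helpfulness, length], both in [0, 1].
# Each pair is (chosen, rejected); these raters always prefer the longer answer.
PAIRS = [
    ([1.0, 0.9], [1.0, 0.2]),  # equally helpful, longer one chosen
    ([0.5, 0.8], [0.5, 0.1]),  # equally helpful, longer one chosen
    ([0.2, 0.9], [0.8, 0.3]),  # less helpful but longer one chosen
]

WEIGHTS = train(PAIRS, dim=2)

long_unhelpful = [0.2, 0.9]
short_helpful = [0.8, 0.3]
print("weights [helpfulness, length]:", WEIGHTS)
print("prefers long, unhelpful answer:",
      reward(WEIGHTS, long_unhelpful) > reward(WEIGHTS, short_helpful))
```

In practice the scorer is usually a neural network over the full text, not a linear function of hand-picked features, but the training signal has the same shape: the model is only ever told which of two outputs the raters preferred.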
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.