Concept · 8 episodes

Reward Model

Definition

A reward model is a learned function that scores model outputs and supplies the training signal in RLHF and related setups. It stands in for the preferences of a stable population of human raters, and it faithfully inherits whatever biases those raters had.
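
To make the definition concrete, here is a minimal sketch of how such a scoring function might be fit from pairwise human preferences. It assumes a Bradley-Terry style pairwise loss, which is a common choice but not one this page specifies. The two features (`helpfulness`, `is_long`), the toy preference data, and the function names are all hypothetical, chosen only to show a rater bias (always preferring the longer output) ending up in the learned scores.

```python
import math

def reward(weights, features):
    """Score one output as a linear function of its (hypothetical) features."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit weights so that preferred outputs score above rejected ones.

    pairs: list of (chosen_features, rejected_features) from human labels.
    Per-pair loss (assumed here): -log sigmoid(r(chosen) - r(rejected)).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. the margin is -(1 - sigmoid(margin)).
            step = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * step * (chosen[i] - rejected[i])
    return weights

# Toy labels. Features are [helpfulness, is_long].
# The simulated raters always pick the long output, whatever its helpfulness.
pairs = [
    ([0.2, 1.0], [0.8, 0.0]),
    ([0.8, 1.0], [0.2, 0.0]),
    ([0.5, 1.0], [0.5, 0.0]),
    ([0.1, 1.0], [0.9, 0.0]),
    ([0.9, 1.0], [0.1, 0.0]),
]

weights = train_reward_model(pairs, n_features=2)

long_unhelpful = reward(weights, [0.1, 1.0])
short_helpful = reward(weights, [0.9, 0.0])
print("weights:", weights)
print("long but unhelpful scores higher:", long_unhelpful > short_helpful)
```

In this toy setup the learned weight on `is_long` dominates, so a long, unhelpful output outscores a short, helpful one. The model has reproduced the raters' length bias rather than any notion of quality, and a policy trained against it would be rewarded for the same bias.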

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.