Concept · 8 episode(s)

RLHF

← all concepts

Definition

RLHF (Reinforcement Learning from Human Feedback) trains a model against a reward model fitted on human preference comparisons, producing the helpful-assistant behavior characteristic of modern chat models. It’s also the source of many of their characteristic failure modes — sycophancy, hedging, refusing on suspicion.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.