Definition
RLHF (Reinforcement Learning from Human Feedback) trains a model against a reward model fitted on human preference comparisons, producing the helpful-assistant behavior characteristic of modern chat models. It’s also the source of many of their characteristic failure modes — sycophancy, hedging, refusing on suspicion.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.