RLHF · Glossary · AI Papers: A Deep Dive

Definition

Plain language

Training a model with human feedback so it learns to answer the way humans prefer.

As stated in the literature

Reinforcement Learning from Human Feedback: the post-training pipeline that fits a reward model to human preference data and then optimizes a policy against that reward model.
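
One common way to write the two stages down, offered as a sketch rather than the only formulation: the reward model is fit with a Bradley-Terry pairwise loss, and the policy is then trained to maximize that reward under a KL penalty toward a reference policy. The symbols here ($\phi$, $\theta$, $\beta$, $\mathcal{D}$, $y_w$, $y_l$, $\pi_{\mathrm{ref}}$) are notation assumed for illustration and do not come from this entry.

```latex
% Stage 1: fit reward model r_\phi on preference pairs (y_w preferred over y_l)
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]

% Stage 2: optimize policy \pi_\theta against the fitted reward,
% with a KL penalty (weight \beta) toward a reference policy
\max_{\theta}\;
  \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
    \big[r_\phi(x, y)\big]
  \;-\; \beta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
```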

Why it matters: It's the post-training step that turned raw next-token predictors into usable chat assistants, and it remains the dominant recipe for aligning model outputs with human taste.

For example, human annotators rank pairs of chatbot responses, a reward model learns from those rankings, and the chatbot is then nudged to produce answers the reward model scores highly.
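
The loop in that example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production pipeline: the three candidate responses, their two hand-made features (helpfulness, rudeness), the linear reward model, and the helper names are all hypothetical. The "policy" is just a softmax over a fixed candidate set, and the sketch omits the KL penalty that real pipelines typically add.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, feats):
    # Hypothetical linear reward model: dot product of weights and features.
    return sum(wi * fi for wi, fi in zip(w, feats))

def fit_reward_model(pairs, dim, lr=0.5, steps=200):
    """Fit weights on (preferred, rejected) feature pairs with a pairwise loss."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for win, lose in pairs:
            # Bradley-Terry: P(win is preferred) = sigmoid(r(win) - r(lose))
            p = sigmoid(reward(w, win) - reward(w, lose))
            for i in range(dim):
                # Gradient of -log(p) with respect to w[i]
                grad[i] += -(1.0 - p) * (win[i] - lose[i])
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def optimize_policy(logits, rewards, lr=0.5, steps=100):
    """Gradient ascent on expected reward over a fixed candidate set."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # d E[r] / d logit_i = p_i * (r_i - E[r])
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

# Hypothetical candidate responses with features [helpfulness, rudeness].
candidates = {
    "helpful": [1.0, 0.0],
    "curt": [0.3, 0.2],
    "rude": [0.2, 1.0],
}

# Step 1: annotator rankings, stored as (preferred, rejected) pairs.
pairs = [
    (candidates["helpful"], candidates["curt"]),
    (candidates["helpful"], candidates["rude"]),
    (candidates["curt"], candidates["rude"]),
]

# Step 2: the reward model learns from those rankings.
w = fit_reward_model(pairs, dim=2)
names = list(candidates)
rewards = {n: reward(w, candidates[n]) for n in names}

# Step 3: the policy is nudged toward responses the reward model scores highly.
final = optimize_policy([0.0] * len(names), [rewards[n] for n in names])
probs = dict(zip(names, final))

print("reward weights:", w)
print("policy probabilities:", probs)
```

Starting from a uniform policy, the probability mass shifts toward the response the learned reward model ranks highest, which mirrors the rank, fit, nudge sequence described above.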

Heard on the show

“Frontier models are getting updated continuously — RLHF passes, post-training, knowledge injection.”

Episode 092 — When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks

Mentioned in 13 episodes

Related concepts

Agentic RL · DPO · Game Theory · KL Divergence · Post-Training · Reviewer-Pleasing Bias · Reward Model · Reward Overoptimization · RewardBench · RLHF · Sycophancy

Related terms

pipelining · policy · post-training · reward model