Glossary · Term

RLHF

← all terms

Definition

Training a model with human feedback so it learns to answer the way humans prefer.

Reinforcement Learning from Human Feedback, the post-training pipeline that fits a reward model to human preferences and then optimizes a policy against it.

Mentioned in 11 episodes

  1. 073
    When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
  2. 070
    When Models Know the Answer But Say the Wrong Thing Anyway
  3. 069
    When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
  4. 060
    When Splitting One Model Across Three Agents Doubles Its Accuracy
  5. 058
    Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
  6. 044
    How One Sentence and a Forged History Flip the Most Aligned Models
  7. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
  8. 020
    The Compliance Gap: Why AI Says Yes and Does No
  9. 018
    Language Models Compute the Rational Move, Then Override It
  10. 008
    Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
  11. 004
    The Sycophancy Circuit That Survives Alignment Training

Related concepts