Concept · 9 episodes

RLHF

Definition

RLHF (Reinforcement Learning from Human Feedback) fine-tunes a language model with reinforcement learning against a reward model fitted on human preference comparisons, producing the helpful-assistant behavior typical of modern chat models. It is also the source of many of their characteristic failure modes: sycophancy, excessive hedging, and refusing requests on suspicion alone.
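To make "a reward model fitted on human preference comparisons" concrete, here is a minimal sketch of the reward-modeling step only. It assumes a Bradley-Terry style pairwise loss, which is a common choice but not something this page specifies, and it uses a toy linear reward over hypothetical feature vectors in place of a neural network. The names `preference_loss` and `fit_linear_reward` are illustrative, not taken from any library. The later reinforcement-learning step that optimizes the chat model against this reward is not shown.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins,
    under the (assumed) Bradley-Terry model: P(chosen) = sigmoid(margin)."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as a stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def fit_linear_reward(comparisons, dim, lr=0.1, steps=200):
    """Fit weights w so that reward(x) = w . x ranks chosen above rejected.

    comparisons: list of (chosen_features, rejected_features) pairs,
    each a list of `dim` floats (hypothetical hand-made features).
    """
    w = [0.0] * dim
    for _ in range(steps):
        for chosen, rejected in comparisons:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so a descent step moves w along +diff.
            scale = 1.0 - sigmoid(margin)
            w = [wi + lr * scale * di for wi, di in zip(w, diff)]
    return w


def reward(w, features):
    """Score a response's features with the fitted linear reward."""
    return sum(wi * fi for wi, fi in zip(w, features))


if __name__ == "__main__":
    # Toy data: annotators always prefer the response with feature 0 set.
    data = [([1.0, 0.0], [0.0, 1.0])]
    w = fit_linear_reward(data, dim=2)
    print(reward(w, [1.0, 0.0]) > reward(w, [0.0, 1.0]))
```

The fitted reward only encodes what annotators preferred in the comparisons it saw. If raters systematically favor agreeable or cautious answers, the reward inherits that bias, which is one commonly cited route to the failure modes listed above.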

Episodes covering this

152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
070
When Models Know the Answer But Say the Wrong Thing Anyway
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Yeom, Sok, Kim et al. · Graduate School of Data Science·22 min·May 22, 2026
069
When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Merrill, Lee, Karger · Forecasting Research Institute / UC Berkeley·30 min·May 22, 2026
058
Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
Liu, Holz, Ye et al. · University of Chinese Academy of Sciences·32 min·May 19, 2026
045
When a Frontier Model Talks Its Own Twin Into Climate Denial
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
Nogueira, Almeida, Bonás et al. · Maritaca AI·31 min·May 15, 2026
044
How One Sentence and a Forged History Flip the Most Aligned Models
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Salgado · Independent Researcher·23 min·May 15, 2026
025
The Missing Gradient Term That Predicts Sycophancy in RLHF
Explaining and Preventing Alignment Collapse in Iterative RLHF
Gauthier, Bach, Jordan · Inria·22 min·May 07, 2026
020
The Compliance Gap: Why AI Says Yes and Does No
The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't
Shin · Polymath Minds AI Lab·28 min·May 06, 2026
018
Language Models Compute the Rational Move, Then Override It
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Lekeas, Stamatopoulos · DreamWorks Animation·29 min·May 03, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.