Concept · 3 episodes

Deliberative Alignment

Definition

Deliberative alignment trains models to reason explicitly through safety-relevant policies in their chain of thought before producing an answer, rather than leaving them to absorb those policies implicitly during fine-tuning. The bet is that visible deliberation is easier to audit, debug, and improve than baked-in reflexes.

Episodes covering this

171
The Safety Decision a Model Makes Before It Thinks a Word
Do Thinking Tokens Help with Safety?
Ri, Panigrahi, Arora · Princeton Language and Intelligence · 25 min · Jun 25, 2026
118
Why the Best-Aligned AI Models Are the Easiest to Trick Into Producing Harm
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Hoang, Le, Xu et al. · Singapore University of Technology and Design · 23 min · Jun 05, 2026
022
Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
Model Spec Midtraining: Improving How Alignment Training Generalizes
Li, Price, Marks et al. · Anthropic Fellows Program · 32 min · May 06, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.