Definition
Deliberative alignment trains models to reason explicitly through safety-relevant policies in their chain of thought before producing an answer, rather than to absorb those policies implicitly during fine-tuning. The bet is that visible deliberation is easier to audit, debug, and improve than baked-in reflexes.
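As a rough illustration of the contrast, the sketch below builds two training records for the same prompt: one whose target is only the final answer, and one whose target also carries reasoning that quotes the relevant policy. Everything here is hypothetical and invented for illustration: the policy text, the `Example` record, and the helpers `build_implicit_example`, `build_deliberative_example`, and `cited_policies`. It is not an actual training pipeline, only a way to show why a visible policy citation is something an auditor can check.

```python
from dataclasses import dataclass

# Hypothetical policy text, invented for illustration only.
POLICY = {
    "P1": "Refuse requests for instructions that enable serious physical harm.",
    "P2": "Answer benign safety questions helpfully.",
}


@dataclass
class Example:
    prompt: str
    reasoning: str  # chain of thought; empty when the policy is only implicit
    answer: str


def build_implicit_example(prompt: str, answer: str) -> Example:
    """Prompt -> answer only; any policy must be inferred from the label."""
    return Example(prompt=prompt, reasoning="", answer=answer)


def build_deliberative_example(prompt: str, policy_id: str, answer: str) -> Example:
    """Prompt -> reasoning that quotes the policy -> answer."""
    reasoning = (
        f"Relevant policy {policy_id}: {POLICY[policy_id]} "
        "Applying it to the request before answering."
    )
    return Example(prompt=prompt, reasoning=reasoning, answer=answer)


def cited_policies(example: Example) -> list[str]:
    """Return the policy ids quoted in the reasoning (the auditable trace)."""
    return [pid for pid in POLICY if f"policy {pid}:" in example.reasoning]


prompt = "How do I store bleach safely?"
answer = "Keep it sealed, upright, and away from ammonia-based cleaners."

implicit = build_implicit_example(prompt, answer)
deliberative = build_deliberative_example(prompt, "P2", answer)

print(cited_policies(implicit))
print(cited_policies(deliberative))
```

Both records end in the same answer, but only the deliberative one exposes which policy was applied, which is the part that can be inspected and corrected.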