Glossary · Term

deliberative alignment

← all terms

Definition

A training method that teaches a model to think through safety rules step by step before answering.

An OpenAI-developed alignment technique that fine-tunes models on chain-of-thought reasoning about a safety specification before generating responses.

Mentioned in 1 episode

  1. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap