deliberative alignment · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A training method that teaches a model to think through safety rules step by step before answering.

As stated in the literature

An OpenAI-developed alignment technique that fine-tunes models on chain-of-thought reasoning about a safety specification before generating responses.

Why it matters: It moves safety from pattern-matching on prompts toward reasoning about rules, which scales better as harms become more nuanced.

For example, asked a borderline request, the model first writes out which clauses of the safety policy apply and how they interact before deciding to comply or refuse.

Heard on the show

“… It's that an entire generation of safety methods — Constitutional AI, OpenAI's deliberative alignment — is built on the opposite assumption: that if you train a model to reason over safety principles …”

Episode 171 — The Safety Decision a Model Makes Before It Thinks a Word

Mentioned in 3 episodes

Related terms

alignment training chain of thought