alignment training · Glossary · AI Papers: A Deep Dive

Definition

Plain language

The post-training stage that shapes a model to be helpful, harmless, and honest.

As stated in the literature

Post-training procedures, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), that shape model behavior toward desired norms after pretraining.

Also called: alignment

Why it matters: It's the stage that turns a raw next-token predictor into something people can actually deploy as an assistant.

For example, a base model that will happily generate dangerous instructions can be alignment-trained to refuse such requests and explain why.

“… raw performer actually hacks a bit more, which is why the paper insists the supervisor select for alignment rather than raw score. …”

Related terms: pretraining, RLHF, SFT