Glossary · Term

alignment tax

← all terms

Definition

The cost in raw ability or accuracy a model pays from being trained to be helpful and safe.

The observed drop in capability, calibration, or output diversity that often accompanies RLHF and instruction tuning, originally named by Ouyang et al. in the InstructGPT paper.

Mentioned in 1 episode

  1. 070
    When Models Know the Answer But Say the Wrong Thing Anyway