alignment tax · Glossary · AI Papers: A Deep Dive

Definition

Plain language

The cost in raw ability or accuracy that a model pays for being trained to be helpful and safe.

As stated in the literature

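The observed drop in capability, calibration, or output diversity that often accompanies RLHF and instruction tuning. The term was popularized by Ouyang et al. in the InstructGPT paper, which used it to describe performance regressions on some public NLP benchmarks after RLHF fine-tuning, though the phrase itself predates that paper in alignment research.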

Why it matters: Product teams have to weigh this tradeoff. More alignment training usually buys safer, more helpful behavior, but often at the cost of some raw capability or output diversity.

For example, a base model might solve a tricky reasoning puzzle that its RLHF-tuned descendant refuses to engage with.
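One rough way to put a number on the tax is to score the base model and its tuned descendant on the same benchmarks and take the difference. The sketch below assumes that setup; the function name alignment_tax, the benchmark names, and every score are hypothetical and invented for illustration, not taken from any paper or episode.

```python
def alignment_tax(base_scores, tuned_scores):
    """Per-benchmark drop in percentage points (base minus tuned).

    A positive value means the tuned model scored lower than the base model.
    Only benchmarks present in both dictionaries are compared.
    """
    return {
        name: base_scores[name] - tuned_scores[name]
        for name in base_scores
        if name in tuned_scores
    }


# Hypothetical accuracies in whole percentage points (made-up numbers).
base = {"reading_comprehension": 80, "translation": 60, "helpfulness": 40}
tuned = {"reading_comprehension": 75, "translation": 60, "helpfulness": 90}

tax = alignment_tax(base, tuned)

# Keep only the benchmarks where tuning made the score worse.
regressions = {name: drop for name, drop in tax.items() if drop > 0}
print(regressions)
```

In this made-up case the tuned model gains heavily on the behavior it was trained for and gives up a few points on one capability benchmark, which is the pattern the term describes.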

Heard on the show

“… The authors hypothesize this is the same phenomenon as the classical alignment tax from RLHF — you optimize the model for readable, well-structured output, and you narrow its distribution …”

Episode 082 — Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick

Mentioned in 2 episodes

Related terms

calibration · capability · instruction tuning · RLHF