Definition
The cost in raw ability or accuracy a model pays from being trained to be helpful and safe.
The observed drop in capability, calibration, or output diversity that often accompanies RLHF and instruction tuning, originally named by Ouyang et al. in the InstructGPT paper.