Definition
Training a model by comparing its newer outputs to its older ones, treating the newer as preferred.
A self-supervised training signal that constructs preference pairs across training checkpoints, labeling later outputs as preferred over earlier ones to bootstrap rubric learning without external labels.