Definition
A separate neural network trained to predict how good a response is.
A model trained on human or model preference data to assign scalar reward scores to candidate outputs, used as the optimization target in RLHF pipelines.
Also called: reward models