Definition
A training technique that rewards an AI for building up confidence gradually rather than locking in an answer too soon.
A confidence-trajectory-shaped reward term layered onto GRPO that penalizes front-loaded confidence (favoring monotonically increasing mid-chain commitment) to discourage premature confidence and improve faithfulness.