Definition
A reinforcement-learning recipe that compares several attempts at the same task to figure out which ones to reinforce.
Group Relative Policy Optimization, an RL method that computes advantages by comparing a group of rollouts on the same prompt without a separate value model.
Also called: G-R-P-O