Definition
GRPO (Group Relative Policy Optimization) is a policy-gradient method that estimates advantage by comparing a sampled response to the average of a group of other samples from the same prompt, removing the need for a separate value model. It became a popular RL-from-rewards variant in reasoning-model training pipelines.