GRPO

Definition

Plain language

A reinforcement-learning recipe that compares several attempts at the same task to figure out which ones to reinforce.

As stated in the literature

Group Relative Policy Optimization, an RL method that computes each rollout's advantage by comparing it against a group of rollouts sampled for the same prompt, without a separate value model.

Also called: G-R-P-O

Why it matters: Comparing rollouts within a group removes the need for a separate value network, and GRPO is now the default RL recipe behind much of recent reasoning-model training.

For example, GRPO has a model produce eight attempts at the same math problem, then reinforces the ones that scored above the group average and discourages those below it.
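The eight-attempt example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name group_relative_advantages, the eps constant, and the 0/1 reward values are all assumptions made for the example, and dividing by the group's standard deviation is one common normalization choice rather than something the definition above requires.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt relative to its own group (hypothetical helper).

    Attempts above the group mean get a positive advantage (reinforced),
    attempts below it get a negative one (discouraged). No value model is
    involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every attempt scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for eight attempts at one math problem:
# 1.0 = correct answer, 0.0 = incorrect answer (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    action = "reinforce" if advantage > 0 else "discourage"
    print(f"reward={reward:.1f} advantage={advantage:+.3f} -> {action}")
```

If every attempt in a group receives the same reward, all advantages come out as zero, so that group contributes no learning signal in this sketch.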

Heard on the show

“… GRPO is reinforcement-learning fine-tuning where the model's answers get scored by a hand-built reward …”

Episode 199 — Finding a Model's Hidden Behaviors Without Knowing What You're Looking For

Mentioned in 26 episodes
