
GSPO

Definition

Plain language

A reinforcement-learning method for language models that decides how strongly to apply each update by looking at a whole generated response at once, rather than at each token separately.

As stated in the literature

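Group Sequence Policy Optimization (GSPO) is a sequence-level adaptation of GRPO (Group Relative Policy Optimization). It computes the importance ratio from the likelihood of the whole sequence, normalized by length, and applies clipping per sequence rather than per token, which avoids the noise of per-token ratios. It is especially useful for mixture-of-experts models, where token-level routing makes per-token signals unstable.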

Why it matters: For mixture-of-experts models where routing makes token-level gradients noisy, sequence-level optimization is often the difference between stable and unstable training.

For example, GSPO treats each generated solution as one unit: it computes a single importance ratio and makes a single clipping decision for the whole response, and every token in that response shares the result, rather than each token carrying its own ratio.
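The sequence-level idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the function names (`sequence_ratio`, `group_advantages`, `gspo_objective`) are hypothetical, the inputs are assumed to be per-token log-probabilities that were already computed for each sampled response, and the clipping range `eps=0.2` is an arbitrary placeholder rather than a recommended setting.

```python
import math

def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence likelihood ratio.

    Equals exp(mean(new - old)) over the tokens of one response, i.e. the
    ratio of sequence likelihoods raised to the power 1 / length.
    """
    if len(new_logprobs) != len(old_logprobs) or not new_logprobs:
        raise ValueError("need two non-empty, equal-length log-prob lists")
    diff = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(diff / len(new_logprobs))

def group_advantages(rewards):
    """Group-relative advantage: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def gspo_objective(new_lp, old_lp, rewards, eps=0.2):
    """Clipped surrogate averaged over a group of sampled responses.

    Each response contributes min(s * A, clip(s, 1 - eps, 1 + eps) * A),
    where s is ONE ratio for the whole response (not one per token).
    eps is an illustrative placeholder value.
    """
    advs = group_advantages(rewards)
    total = 0.0
    for n, o, a in zip(new_lp, old_lp, advs):
        s = sequence_ratio(n, o)
        clipped = max(1.0 - eps, min(1.0 + eps, s))
        total += min(s * a, clipped * a)
    return total / len(rewards)

# Two sampled responses to the same prompt; the first earned the reward.
old = [[-1.0, -1.0], [-1.0, -1.0]]
new = [[-0.5, -0.5], [-1.0, -1.0]]
objective = gspo_objective(new, old, rewards=[1.0, 0.0])
```

In this toy group, the first response's ratio exceeds the upper clipping bound, so its contribution is capped for the response as a whole; in a token-level scheme, each token would be clipped on its own ratio instead.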

Heard on the show

“One is the RL objective itself — GSPO.”

Episode 189 — Why Phone Agents Ace the Test and Crash on Your Actual Phone

Mentioned in 2 episodes

Episode 189: Why Phone Agents Ace the Test and Crash on Your Actual Phone
Episode 048: How a 30B Open Model Reached Olympiad Gold With the Right Recipe

Related terms

gradient, GRPO, mixture-of-experts, token