GSPO

Definition

A reinforcement-learning variant that weights and clips policy updates per whole generated sequence rather than per individual token.

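Group Sequence Policy Optimization: a sequence-level adaptation of GRPO (Group Relative Policy Optimization) that computes the importance ratio and applies clipping once per generated sequence instead of once per token. This avoids the noisy, high-variance per-token weights that can destabilize training. It is especially useful for mixture-of-experts models, where token-level expert routing can shift between policy updates and make per-token signals unstable.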
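A minimal sketch of the sequence-level objective, assuming the formulation from the original GSPO paper: each response in a group gets a group-normalized advantage, a length-normalized sequence importance ratio exp(mean per-token log-ratio), and PPO-style clipping applied once per sequence. The function names and the clip value 0.2 are hypothetical, and the code only evaluates the objective from precomputed per-token log-probabilities; an actual trainer would differentiate it with an autodiff framework.

```python
import math
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: normalize each reward by the group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sequence_ratio(logp_new: Sequence[float], logp_old: Sequence[float]) -> float:
    """Length-normalized sequence importance ratio:
    (pi_new(y|x) / pi_old(y|x)) ** (1/|y|) = exp(mean per-token log-ratio)."""
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("log-prob lists must be non-empty and the same length")
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_objective(
    logp_new: Sequence[Sequence[float]],
    logp_old: Sequence[Sequence[float]],
    rewards: Sequence[float],
    clip_eps: float = 0.2,  # hypothetical clip range; tune per setup
) -> float:
    """Clipped surrogate objective averaged over a group of sampled sequences.
    One ratio and one clip decision per sequence, not per token."""
    advs = group_advantages(rewards)
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advs):
        s = sequence_ratio(new, old)
        clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (min) choice between unclipped and clipped terms, as in PPO.
        total += min(s * adv, clipped * adv)
    return total / len(advs)


if __name__ == "__main__":
    # Two sampled responses to one prompt (hypothetical numbers):
    # per-token log-probs under the old and the updated policy.
    old = [[-1.0, -2.0, -0.5], [-0.7, -1.2]]
    new = [[-0.9, -1.9, -0.4], [-0.8, -1.3]]
    rewards = [1.0, 0.0]
    print(gspo_objective(new, old, rewards))
```

The contrast with token-level GRPO is in `sequence_ratio`: instead of one ratio (and one clip) for every token, each response contributes a single ratio, and the length normalization keeps long and short responses on a comparable scale.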

Mentioned in 1 episode

  1. 048 · How a 30B Open Model Reached Olympiad Gold With the Right Recipe