Definition
A reinforcement-learning variant that grades a whole generated sequence rather than each token.
Group Sequence Policy Optimization, a sequence-level adaptation of GRPO that avoids per-token gradient noise; useful for mixture-of-experts models where token-level routing makes per-token signals unstable.