Concept · 2 episode(s)

Reward Variance

Definition

Reward variance is the noisiness of the reward signal across runs or samples — high variance makes credit assignment harder and policy gradients sloppier. It’s why a lot of effective RL work is really variance-reduction work in disguise.

Episodes covering this

179
How DeepSeek Made One User Faster Without Slowing Down the Crowd
DSpark: Confidence-Scheduled Speculative Decoding with
XinCheng, XingkaiYu, ChenzeShao et al. · Peking University / DeepSeek-AI·23 min·Jun 27, 2026
010
When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL
RAGEN-2: Reasoning Collapse in Agentic RL
Wang, Gui, Jin et al. · Northwestern University·22 min·May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

GRPO: Group Relative Policy Optimization for Mathematical Reasoning