Concept · 7 episodes

Policy Gradient

Definition

Policy gradient methods train a policy directly by following the gradient of expected reward with respect to the policy's parameters, rather than learning a value function and acting greedily with respect to it. REINFORCE, PPO, and GRPO are all variants that make different bias/variance trade-offs in how they estimate that gradient.
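
To make the definition concrete, here is a minimal sketch of the simplest variant, REINFORCE, on a toy multi-armed bandit. The bandit setup, the function names (`softmax`, `reinforce_bandit`), the running-average baseline, and all hyperparameters are assumptions chosen for illustration; none of them come from the episodes listed below. The update nudges each logit by (reward minus baseline) times the gradient of the log-probability of the sampled action, which for a softmax policy is one-hot(action) minus the action probabilities.

```python
import math
import random


def softmax(logits):
    """Turn logits into action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """REINFORCE with a running-average baseline on a toy bandit (illustrative only).

    arm_means: hypothetical mean reward of each arm.
    Returns the learned action probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)  # policy parameters
    baseline = 0.0                   # running mean of observed rewards

    for t in range(1, steps + 1):
        probs = softmax(logits)
        # Sample an action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Noisy reward around the chosen arm's mean.
        reward = arm_means[action] + rng.gauss(0.0, 0.1)
        # Subtracting a baseline reduces variance without biasing the gradient.
        advantage = reward - baseline

        # For a softmax policy: d log pi(action) / d logit_i = 1[i == action] - probs[i].
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob  # gradient ascent on expected reward

        # Update the baseline after using it, so it does not depend on this step's reward.
        baseline += (reward - baseline) / t

    return softmax(logits)


if __name__ == "__main__":
    # The policy should end up favoring the second, higher-reward arm.
    print(reinforce_bandit([0.2, 0.8]))
```

No value function is ever consulted to pick actions here: the policy parameters themselves move toward higher-reward actions. PPO and GRPO keep this core idea but change how the advantage is estimated and how far each update is allowed to move the policy.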

Episodes covering this

165
A Free-Lunch Tweak That Lets a Tiny Agent Beat Frontier Giants
Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning
Wang, Song, Zhang et al. · Peking University · 22 min · Jun 23, 2026
163
Why Training Only on Perfect Solutions Cripples a Model's Reasoning
Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
Wei, Kim · Princeton University · 22 min · Jun 23, 2026
114
Agents That Rewrite Their Own Weights Instead of Just Taking Notes
Scaling Self-Evolving Agents via Parametric Memory
Ren, Luo, Yang et al. · Peking University / Alibaba Group · 26 min · Jun 04, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax · 28 min · May 27, 2026
060
When Splitting One Model Across Three Agents Doubles Its Accuracy
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
Lu, Fang, Zhong et al. · University of Georgia · 26 min · May 20, 2026
028
Teaching a Model to Hire Copies of Itself: Recursive Agent Optimization
Recursive Agent Optimization
Gandhi, Chakraborty, Wang et al. · Carnegie Mellon University · 23 min · May 08, 2026
025
The Missing Gradient Term That Predicts Sycophancy in RLHF
Explaining and Preventing Alignment Collapse in Iterative RLHF
Gauthier, Bach, Jordan · Inria · 22 min · May 07, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters