Concept · 3 episode(s)

Policy Gradient

← all concepts

Definition

Policy gradient methods train a policy directly by following the gradient of expected reward with respect to its parameters, rather than learning a value function and acting greedy. REINFORCE, PPO, and GRPO are all variants tuned for different variance/bias trade-offs.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.